Agent Banana:

High-Fidelity Image Editing with Agentic Thinking and Tooling

Ruijie Ye1,2†, Jiayi Zhang1,3†, Zhuoxin Liu4†, Zihao Zhu1, Siyuan Yang1,
Li Li5, Tianfu Fu6, Franck Dernoncourt7, Yue Zhao5, Jiacheng Zhu8§,
Ryan Rossi7, Wenhao Chai9, Zhengzhong Tu1*
1TAMU   2Brown University   3UW-Madison   4UCSD   5USC   6xAI
7Adobe Research   8Meta AI   9Princeton University
*Corresponding Author

†Equal contributions

§Work done outside of Meta

We present Agent Banana, an agentic editing system that enables high-fidelity, native-resolution image editing through reasoning-based natural-language interaction, where each edit is context-aware, logically dependent, and locally precise.

In this example, the user provides a vague yet complex editing prompt, and Agent Banana iteratively refines a scene in native high resolution (546×3640) over three turns: a stylistic replacement (Turn 1), attribute decoupling that preserves non-target dynamics (changing the bottle color without affecting the pouring liquid; Turn 2), and finally retrieving a prior state and adding fine details (Turn 3).

The result is a professional-style workflow that resists over-editing and background drift, while faithfully preserving what should remain unchanged.

Pipeline

(Left) The system operates in a multi-turn loop comprising two core agents: a Planner that decomposes user queries into executable editing plans, and an Executor that selects tools via the MCP Server. Crucially, the Executor incorporates a self-correction mechanism (Quality Test): if the quality check fails, it repeats the editing process before the result is presented to the user. (Right) Our Evaluator assesses performance by analyzing the transition between Turn n-1 and Turn n, using instruction adherence checks and state tracking (JSON) to derive the final score.
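The loop in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_turn`, `EditStep`, `TurnResult`, the `max_attempts` cap, and the toy planner, tool, and quality test are all hypothetical names introduced here, and images are stood in for by strings. The real system routes tool calls through an MCP Server, which is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EditStep:
    tool: str       # name of a registered tool (hypothetical)
    argument: str   # free-form argument handed to that tool


@dataclass
class TurnResult:
    image: str      # stand-in for image data
    attempts: int   # executor passes that were run
    passed: bool    # whether the last pass cleared the quality test


def run_turn(
    image: str,
    query: str,
    planner: Callable[[str], List[EditStep]],
    tools: Dict[str, Callable[[str, str], str]],
    quality_test: Callable[[str, str, str], bool],
    max_attempts: int = 3,  # assumed retry cap; the source gives no number
) -> TurnResult:
    """Plan once, then execute and re-execute until the quality test passes."""
    plan = planner(query)
    candidate = image
    for attempt in range(1, max_attempts + 1):
        # Restart from the turn's input so failed edits do not accumulate.
        candidate = image
        for step in plan:
            candidate = tools[step.tool](candidate, step.argument)
        if quality_test(image, candidate, query):
            return TurnResult(candidate, attempt, True)
    return TurnResult(candidate, max_attempts, False)


# Toy components: the tool over-edits on its first call and succeeds afterwards.
calls = {"n": 0}


def toy_planner(query: str) -> List[EditStep]:
    return [EditStep("recolor", "red")]


def recolor(img: str, color: str) -> str:
    calls["n"] += 1
    return img + "+bad" if calls["n"] == 1 else img + "+" + color


def toy_quality(before: str, after: str, query: str) -> bool:
    return "bad" not in after


result = run_turn("img", "make the bottle red", toy_planner,
                  {"recolor": recolor}, toy_quality)
```

In this toy run the first pass fails the quality test and the second passes, so the user would only ever see the second candidate.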


Scalable Data Pipeline for Multi-turn Editing. This diagram illustrates the process of generating aligned (State, Instruction) pairs from HD images.
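One way to picture aligned (State, Instruction) pairs is as a running scene description that is updated by each edit, with the instruction text derived from the same edit record so the two never drift apart. The sketch below is an assumption-laden illustration, not the pipeline in the diagram: the state schema, the `apply_edit` and `build_pairs` helpers, and the instruction templates are all hypothetical. The four operation names mirror the Add, Adjust, Replace, and Remove categories used in the comparison table.

```python
import copy
from typing import Any, Dict, List, Tuple

# Hypothetical schema: object name -> attribute dictionary.
State = Dict[str, Dict[str, Any]]


def apply_edit(state: State, edit: Dict[str, Any]) -> Tuple[State, str]:
    """Return the post-edit state and an instruction describing the edit."""
    new_state = copy.deepcopy(state)  # never mutate the previous turn's state
    op, name = edit["op"], edit["object"]
    if op == "add":
        new_state[name] = dict(edit["attributes"])
        instruction = f"Add a {name}."
    elif op == "adjust":
        new_state[name].update(edit["attributes"])
        changes = ", ".join(f"{k} to {v}" for k, v in edit["attributes"].items())
        instruction = f"Change the {name}'s {changes}."
    elif op == "replace":
        new_state[edit["new_object"]] = new_state.pop(name)
        instruction = f"Replace the {name} with a {edit['new_object']}."
    elif op == "remove":
        del new_state[name]
        instruction = f"Remove the {name}."
    else:
        raise ValueError(f"unknown operation: {op}")
    return new_state, instruction


def build_pairs(initial: State, edits: List[Dict[str, Any]]) -> List[Tuple[State, str]]:
    """Chain edits so each pair holds the state after that turn's instruction."""
    pairs = []
    state = initial
    for edit in edits:
        state, instruction = apply_edit(state, edit)
        pairs.append((state, instruction))
    return pairs


initial_state: State = {"bottle": {"color": "green"}}
pairs = build_pairs(initial_state, [
    {"op": "adjust", "object": "bottle", "attributes": {"color": "red"}},
    {"op": "add", "object": "label", "attributes": {"text": "2026"}},
    {"op": "remove", "object": "label"},
])
```

Because every state is a deep copy, earlier turns stay intact, which is what makes later "retrieve a prior state" instructions checkable in this toy setting.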

Quantitative Comparison of Image Editing Performance

We evaluate models on HDD-Bench, focusing on image fidelity (PSNR_OM, SSIM_OM, LPIPS_OM), instruction adherence (Instruct-Following, IF; Image Consistency, IC), and support for high-resolution (4K) editing. Agent Banana achieves state-of-the-art performance, balancing precise instruction execution with high visual fidelity, and is natively capable of processing at 4K resolution.

                      HDD-Bench                                           ImgEdit
Model                 PSNR_OM ↑   SSIM_OM ↑   LPIPS_OM ↓  IF ↑    IC ↑    Add ↑   Adj. ↑  Repl. ↑ Rem. ↑  4K
ICEdit                29.21       0.80        0.14        0.595   0.687   3.58    3.39    3.15    2.93    --
Qwen-Image            23.62       0.80        0.14        0.845   0.807   4.38    4.16    4.66    4.14    --
OmniGen2              23.59       0.72        0.23        0.545   0.655   3.57    3.06    3.74    3.20    --
BAGEL                 26.93       0.79        0.17        0.676   0.723   3.56    3.31    3.3     2.62    --
Step1X-Edit           25.82       0.77        0.19        0.808   0.797   3.88    3.14    3.40    2.41
FLUX.1 Kontext [Pro]  25.98       0.74        0.17        0.845   0.702   4.25    4.15    4.56    3.57    --
GPT Image 1 [High]    19.20       0.54        0.33        0.882   0.727   4.61    4.33    4.35    3.66    --
Nano Banana Pro       26.62       0.72        0.14        0.911   0.861   4.58    4.56    4.55    4.39
Agent Banana (ours)   28.40       0.84        0.12        0.849   0.871   4.58    4.59    4.62    4.60
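The page does not define how the OM-subscripted fidelity metrics are computed. As a purely illustrative sketch, the code below assumes they are measured only over pixels that the edit was supposed to leave unchanged, which is one common way to quantify background drift. That reading, the `masked_psnr` name, and the `keep_mask` argument are assumptions introduced here, not details confirmed by the source. Images are plain nested lists of grayscale values to keep the example dependency-free.

```python
import math
from typing import Sequence


def masked_psnr(
    reference: Sequence[Sequence[float]],
    edited: Sequence[Sequence[float]],
    keep_mask: Sequence[Sequence[int]],  # 1 = pixel should be unchanged (assumed convention)
    max_value: float = 255.0,
) -> float:
    """PSNR computed only over pixels where keep_mask is truthy."""
    squared_error = 0.0
    count = 0
    for ref_row, edit_row, mask_row in zip(reference, edited, keep_mask):
        for ref_px, edit_px, keep in zip(ref_row, edit_row, mask_row):
            if keep:
                squared_error += (ref_px - edit_px) ** 2
                count += 1
    if count == 0:
        raise ValueError("keep_mask selects no pixels")
    mse = squared_error / count
    if mse == 0:
        return math.inf  # kept region is pixel-identical
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [[10, 20], [30, 40]]
edited = [[10, 20], [30, 200]]      # only the bottom-right pixel was edited
outside_edit = [[1, 1], [1, 0]]     # exclude the edited pixel
everywhere = [[1, 1], [1, 1]]

psnr_kept = masked_psnr(reference, edited, outside_edit)
psnr_full = masked_psnr(reference, edited, everywhere)
```

Here `psnr_kept` is infinite because nothing outside the edited pixel changed, while `psnr_full` is finite because the intended edit itself counts as error. Restricting the metric to the kept region avoids penalizing a model for making the change it was asked to make.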

BibTeX

@article{ye2026agentbanana,
        title={Agent banana: High-Fidelity Image Editing with Agentic Thinking and Tooling}, 
        author={Ruijie Ye and Jiayi Zhang and Zhuoxin Liu and Zihao Zhu and Siyuan Yang and Li Li and Tianfu Fu and Franck Dernoncourt and Yue Zhao and Jiacheng Zhu and Ryan Rossi and Wenhao Chai and Zhengzhong Tu},
        year={2026},
        eprint={2602.09084},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2602.09084}, 
}