Agent Banana:

High-Fidelity Image Editing with Agentic Thinking and Tooling

Ruijie Ye1,2†, Jiayi Zhang3†, Zhuoxin Liu4†, Zihao Zhu1, Siyuan Yang1,
Li Li5, Tianfu Fu6, Franck Dernoncourt7‡, Yue Zhao5, Jiacheng Zhu8§,
Ryan Rossi7‡, Wenhao Chai9, Zhengzhong Tu1*
1TAMU   2Brown University   3UW-Madison   4UCSD   5USC   6xAI
7Adobe Research   8Meta AI   9Princeton University
*Corresponding Author

†Equal contributions

‡Work not done at Adobe.

§Work done outside of Meta.

We present Agent Banana, an agentic editing system that enables high-fidelity, native-resolution image editing through reasoning-based natural-language interaction, where each edit is context-aware, logically dependent, and locally precise.

In this example, the user provides a vague yet complex editing prompt, and Agent Banana iteratively refines a scene at native high resolution (546×3640): from a stylistic replacement (Turn 1), to attribute decoupling that preserves non-target dynamics (changing the bottle color without affecting the pouring liquid; Turn 2), and finally to retrieving a prior state and adding fine details (Turn 3).

The result is a professional-style workflow that resists over-editing and background drift, while faithfully preserving what should remain unchanged.

Pipeline

(Left) The system operates in a multi-turn loop built around two core agents: a Planner that decomposes user queries into executable editing plans, and an Executor that selects tools via the MCP Server. Crucially, the Executor incorporates a self-correction mechanism (Quality Test): if the quality check fails, it repeats the editing process before presenting the result to the user. (Right) Our Evaluator assesses performance by analyzing the transition between Turn n-1 and Turn n, using instruction-adherence checks and state tracking (JSON) to derive the final score.
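The plan, execute, and quality-check loop described in this caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `EditStep`, `run_turn`, the `plan`/`execute`/`quality_ok` callables, and the `max_retries` limit are all hypothetical names, and keeping the previous image when every retry fails is an assumption this page does not confirm.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class EditStep:
    tool: str         # hypothetical tool name, not taken from the paper
    instruction: str  # the single change this step should make


def run_turn(
    image: Any,
    query: str,
    plan: Callable[[str], List[EditStep]],
    execute: Callable[[Any, EditStep], Any],
    quality_ok: Callable[[Any, Any, EditStep], bool],
    max_retries: int = 2,
) -> Tuple[Any, List[str]]:
    """Run one editing turn and return the final image plus a short log."""
    log: List[str] = []
    # Planner role: decompose the user query into ordered edit steps.
    for step in plan(query):
        # Executor role: try the step, re-running it while the check fails.
        for attempt in range(1, max_retries + 2):
            candidate = execute(image, step)
            if quality_ok(image, candidate, step):
                image = candidate
                log.append(f"{step.tool}: accepted on attempt {attempt}")
                break
        else:
            # Assumption: after exhausting retries, keep the previous image.
            log.append(f"{step.tool}: rejected after {max_retries + 1} attempts")
    return image, log
```

Passing the three roles in as plain callables keeps the sketch independent of any particular model or tool server; in a real system they would presumably wrap model calls and tool invocations.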


Scalable Data Pipeline for Multi-turn Editing. This diagram illustrates the process of generating aligned (State, Instruction) pairs from HD images.
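One way to picture aligned (State, Instruction) pairs is a scene state stored as JSON that is updated in lockstep with each instruction. The sketch below is purely illustrative: the state schema (object name mapped to attributes), the three operation kinds, and the functions `apply_op`, `describe`, and `build_pairs` are assumptions made for this example and are not taken from the diagram.

```python
import copy
import json
from typing import Dict, List, Tuple

# Assumed schema: object name -> attribute dictionary.
State = Dict[str, Dict[str, str]]


def apply_op(state: State, op: Dict[str, str]) -> State:
    """Return a new state with one structured edit applied."""
    new_state = copy.deepcopy(state)
    kind, target = op["kind"], op["target"]
    if kind == "add":
        new_state[target] = {}
    elif kind == "remove":
        new_state.pop(target, None)
    elif kind == "set":
        new_state.setdefault(target, {})[op["attr"]] = op["value"]
    else:
        raise ValueError(f"unknown op kind: {kind}")
    return new_state


def describe(op: Dict[str, str]) -> str:
    """Render a structured edit as a plain-language instruction."""
    if op["kind"] == "add":
        return f"Add a {op['target']}."
    if op["kind"] == "remove":
        return f"Remove the {op['target']}."
    return f"Change the {op['attr']} of the {op['target']} to {op['value']}."


def build_pairs(initial: State, ops: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Pair each instruction with the JSON state expected after it."""
    pairs = []
    state = initial
    for op in ops:
        state = apply_op(state, op)
        pairs.append((json.dumps(state, sort_keys=True), describe(op)))
    return pairs
```

Because each instruction is generated from the same structured operation that updates the state, the two cannot drift apart, which is the sense of "aligned" assumed here.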

Quantitative Comparison of Image Editing Performance

We evaluate models on HDD-Bench focusing on image fidelity (PSNR_OM, SSIM_OM, LPIPS_OM), instruction adherence (Instruct-Following, Image Consistency), and support for high-resolution (4K) editing. Agent Banana achieves state-of-the-art performance, balancing precise instruction execution with high visual fidelity, and is natively capable of processing at 4K resolution.

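Columns PSNR_OM through IC are HDD-Bench metrics; Add through Rem. are ImgEdit scores.

| Model | PSNR_OM | SSIM_OM | LPIPS_OM | IF ↑ | IC ↑ | Add ↑ | Adj. ↑ | Repl. ↑ | Rem. ↑ | 4K |
|---|---|---|---|---|---|---|---|---|---|---|
| ICEdit | 29.21 | 0.80 | 0.14 | 0.595 | 0.687 | 3.58 | 3.39 | 3.15 | 2.93 | -- |
| Qwen-Image | 23.62 | 0.80 | 0.14 | 0.845 | 0.807 | 4.38 | 4.16 | 4.66 | 4.14 | -- |
| OmniGen2 | 23.59 | 0.72 | 0.23 | 0.545 | 0.655 | 3.57 | 3.06 | 3.74 | 3.20 | -- |
| BAGEL | 26.93 | 0.79 | 0.17 | 0.676 | 0.723 | 3.56 | 3.31 | 3.30 | 2.62 | -- |
| Step1X-Edit | 25.82 | 0.77 | 0.19 | 0.808 | 0.797 | 3.88 | 3.14 | 3.40 | 2.41 | |
| FLUX.1 Kontext [Pro] | 25.98 | 0.74 | 0.17 | 0.845 | 0.702 | 4.25 | 4.15 | 4.56 | 3.57 | -- |
| GPT Image 1 [High] | 19.20 | 0.54 | 0.33 | 0.882 | 0.727 | 4.61 | 4.33 | 4.35 | 3.66 | -- |
| Nano Banana Pro | 26.62 | 0.72 | 0.14 | 0.911 | 0.861 | 4.58 | 4.56 | 4.55 | 4.39 | |
| Agent Banana (ours) | 28.40 | 0.84 | 0.12 | 0.849 | 0.871 | 4.58 | 4.59 | 4.62 | 4.60 | |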
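As a rough illustration of a region-restricted fidelity metric, the sketch below computes PSNR only over pixels outside an edit mask. It assumes the OM subscript means the metric is measured on the region that should remain unchanged; that reading is an interpretation and is not confirmed by this page. The function name `psnr_outside_mask` and its parameters are hypothetical.

```python
import numpy as np


def psnr_outside_mask(original, edited, edit_mask, max_value=255.0):
    """PSNR over pixels where edit_mask is False (the assumed unedited region)."""
    original = np.asarray(original, dtype=np.float64)
    edited = np.asarray(edited, dtype=np.float64)
    keep = ~np.asarray(edit_mask, dtype=bool)  # True where pixels must be preserved

    if original.shape != edited.shape:
        raise ValueError("images must have the same shape")
    if keep.shape != original.shape[:2]:
        raise ValueError("mask must match the image height and width")
    if not keep.any():
        raise ValueError("mask covers the whole image; nothing to compare")

    # Boolean indexing selects the preserved pixels (all channels, if present).
    diff = original[keep] - edited[keep]
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical outside the edit region
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Under this reading, a higher value means the edit left the surrounding image closer to the original, which would be consistent with the emphasis on avoiding background drift.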
Gallery

Could you kind of tidy up that little vignette—maybe drop a small rustic lantern in the grass just a bit to the right of the white table, take away (or quietly hide) those two floral metal buckets in front of it, and plop a tiny mason jar with a few pale pink wildflowers on the table beside the birdcage? Make it feel more romantic and balanced, subtle but a little eye-catching—maybe even give the lantern a soft warm glow so it looks cozy, but don’t overdo it.

Could you take away that little lantern by the table and maybe lean a small chalkboard sign up against the right side of the white table — a bit off-kilter so it looks casual — and drop a low wooden crate on the grass in front of the left stool? Make the scene feel more lived-in and balanced but not too staged; don’t make it glaringly new, and maybe leave a tiny trace of the lantern so it doesn’t look like it was never there. Also, if you can, make the chalkboard look hand-lettered and natural without being too bold.

Can you just lose that chalkboard you put there before — like, take it away but don’t make it look like it was never touched — and then drop a little stack of old books on the crate in front of the stool, you know, a neat-ish vintage pile (not too perfect, but not sloppy), and pop a small bouquet into the mason jar on the table by the birdcage so it looks charming and natural? Make the books feel a bit worn and the flowers soft and fresh, but subtle — nothing screaming “new prop.”

 

BibTeX

@article{ye2026agentbanana,
        title={Agent Banana: High-Fidelity Image Editing with Agentic Thinking and Tooling},
        author={Ruijie Ye and Jiayi Zhang and Zhuoxin Liu and Zihao Zhu and Siyuan Yang and Li Li and Tianfu Fu and Franck Dernoncourt and Yue Zhao and Jiacheng Zhu and Ryan Rossi and Wenhao Chai and Zhengzhong Tu},
        year={2026},
        eprint={2602.09084},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2602.09084}, 
}