Agent Banana

Agent Banana: High-Fidelity Image Editing with Agentic Thinking and Tooling

Ruijie Ye¹,²†, Jiayi Zhang¹,³†, Zhuoxin Liu⁴†, Zihao Zhu¹, Siyuan Yang¹,
Li Li⁵, Tianfu Fu⁶, Franck Dernoncourt⁷, Yue Zhao⁵, Jiacheng Zhu⁸§,
Ryan Rossi⁷, Wenhao Chai⁹, Zhengzhong Tu¹*

¹TAMU ²Brown University ³UW-Madison ⁴UCSD ⁵USC ⁶xAI
⁷Adobe Research ⁸Meta AI ⁹Princeton University
*Corresponding Author
†Equal contributions
§Work done outside of Meta

We present Agent Banana, an agentic editing system that enables high-fidelity, native-resolution image editing through reasoning-based natural-language interaction, where each edit is context-aware, logically dependent, and locally precise.

In this example, the user provides a vague yet complex editing prompt, and Agent Banana iteratively refines a scene at native high resolution (546×3640): from a stylistic replacement (Turn 1), to attribute decoupling that preserves non-target dynamics (changing the bottle color without affecting the pouring liquid; Turn 2), and finally to retrieving a prior state and adding fine details (Turn 3).

The result is a professional-style workflow that resists over-editing and background drift, while faithfully preserving what should remain unchanged.

Pipeline

The system operates in a multi-turn loop (left) built around two core agents: a Planner that decomposes user queries into executable editing plans, and an Executor that selects tools via the MCP Server. Crucially, the Executor includes a self-correction mechanism (Quality Test): if the quality check fails, it repeats the edit before presenting a result to the user. Our Evaluator (right) assesses performance by analyzing the transition between Turn n-1 and Turn n, using instruction-adherence checks and state tracking (JSON) to derive the final score.
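The Planner/Executor loop with a quality gate can be illustrated with a toy sketch. Everything in it is an assumption made for illustration, not the system's actual interface: the names `EditStep`, `plan`, `execute_with_retry`, and `run_turn`, the `;`-splitting planner, the retry limit, and the choice to keep the previous image when no attempt passes are all hypothetical. Images are stood in for by plain lists of applied edits.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class EditStep:
    """One executable step of an editing plan (hypothetical structure)."""
    tool: str         # name of the tool the Executor should call
    instruction: str  # what that tool is asked to do

def plan(query: str) -> List[EditStep]:
    """Toy Planner: split the query on ';' into steps for one placeholder tool."""
    return [EditStep(tool="edit_region", instruction=part.strip())
            for part in query.split(";") if part.strip()]

def execute_with_retry(image, step: EditStep, tools: Dict[str, Callable],
                       quality_test: Callable, max_attempts: int = 3):
    """Toy Executor: call the selected tool and retry while the quality test fails."""
    for attempt in range(1, max_attempts + 1):
        candidate = tools[step.tool](image, step.instruction, attempt)
        if quality_test(image, candidate, step):
            return candidate, attempt
    # Assumption: if nothing passes, keep the previous image instead of a failed edit.
    return image, max_attempts

def run_turn(image, query: str, tools: Dict[str, Callable],
             quality_test: Callable) -> Tuple[list, List[Tuple[str, int]]]:
    """Run one user turn: plan, then execute each step behind the quality gate."""
    history = []
    for step in plan(query):
        image, attempts = execute_with_retry(image, step, tools, quality_test)
        history.append((step.instruction, attempts))
    return image, history

def flaky_edit(image, instruction, attempt):
    """Stand-in tool whose first attempt is deliberately bad."""
    label = instruction if attempt > 1 else "BAD:" + instruction
    return image + [label]

def passes_check(before, after, step):
    """Stand-in quality test: accept only an edit that matches the instruction."""
    return after[-1] == step.instruction

final_image, history = run_turn(
    [], "replace table; add plant", {"edit_region": flaky_edit}, passes_check)
```

In this toy run each step fails its first quality check and succeeds on the retry, so the user would only ever see the edits that passed.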

Scalable Data Pipeline for Multi-turn Editing. This diagram illustrates the process of generating aligned (State, Instruction) pairs from HD images.

Quantitative Comparison of Image Editing Performance

We evaluate models on HDD-Bench and ImgEdit, focusing on image fidelity (PSNR_OM, SSIM_OM, LPIPS_OM), instruction adherence (Instruct-Following, IF; Image Consistency, IC), and support for high-resolution (4K) editing. Agent Banana achieves state-of-the-art performance, balancing precise instruction execution with high visual fidelity, and natively processes images at 4K resolution.
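For readers unfamiliar with the fidelity metrics, the sketch below computes PSNR, defined as 10·log10(MAX² / MSE), over a selected subset of pixels. This page does not define the `_OM` suffix, so the restriction to a mask is purely an assumption for illustration: the hypothetical `keep_mask` marks pixels that should stay unchanged, and the function name `masked_psnr` is likewise invented here. It is not the benchmark's official scoring code.

```python
import math
from typing import Sequence

def masked_psnr(original: Sequence[float], edited: Sequence[float],
                keep_mask: Sequence[bool], max_value: float = 255.0) -> float:
    """PSNR over the pixels where keep_mask is True (flat sequences of equal length)."""
    sq_errors = [(a - b) ** 2
                 for a, b, keep in zip(original, edited, keep_mask) if keep]
    if not sq_errors:
        raise ValueError("mask selects no pixels")
    mse = sum(sq_errors) / len(sq_errors)
    if mse == 0:
        return math.inf  # identical pixels: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)

# The third pixel is the (assumed) edited region, so it is excluded from the score.
score = masked_psnr([10, 20, 30, 40], [10, 20, 200, 45], [True, True, False, True])
```

Under this assumption, a large intended change inside the edited region does not lower the score; only drift in the pixels meant to be preserved does.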

Model	HDD-Bench					ImgEdit				4K
Model	PSNR_OM ↑	SSIM_OM ↑	LPIPS_OM ↓	IF ↑	IC ↑	Add ↑	Adj. ↑	Repl. ↑	Rem. ↑	4K
ICEdit	29.21	0.80	0.14	0.595	0.687	3.58	3.39	3.15	2.93	--
Qwen-Image	23.62	0.80	0.14	0.845	0.807	4.38	4.16	4.66	4.14	--
OmniGen2	23.59	0.72	0.23	0.545	0.655	3.57	3.06	3.74	3.20	--
BAGEL	26.93	0.79	0.17	0.676	0.723	3.56	3.31	3.30	2.62	--
Step1X-Edit	25.82	0.77	0.19	0.808	0.797	3.88	3.14	3.40	2.41
FLUX.1 Kontext [Pro]	25.98	0.74	0.17	0.845	0.702	4.25	4.15	4.56	3.57	--
GPT Image 1 [High]	19.20	0.54	0.33	0.882	0.727	4.61	4.33	4.35	3.66	--
Nano Banana Pro	26.62	0.72	0.14	0.911	0.861	4.58	4.56	4.55	4.39
Agent Banana (ours)	28.40	0.84	0.12	0.849	0.871	4.58	4.59	4.62	4.60

Gallery

Could you swap out that rectangular wood table for a round, more boho rattan one — something that feels lighter and hand‑made but not super wicker-y — and pop a small potted plant somewhere by the seating (either on the deck next to the sofa or perched near the chairs, whatever looks best)? Also, get rid of the two little gray bowl candles on the table and replace them with a single nicer decorative piece, like a tray or something that ties it together. Make it cozy and balanced, but keep the same overall vibe

Can you kind of swap that candle thing on the far right for a frosted‑glass hurricane with a beeswax pillar in it — matte but still showing a warm little flame — then tuck a small, low side table at the right end of the sofa (something simple and useful, not flashy) and drop a low outdoor pouf near the coffee table so it looks casually usable and not perfectly staged?

Can you swap that frosted candle thing on the right for an old metal lantern — like vintage-looking but not all rusty — and change the navy patterned throw pillows to simple linen ones (neutral, not boring)? Also turn the little tray on the table into a matte black metal tray with a thin gold rim so it feels a bit dressed up but still natural and cozy — make it look real and a touch aged, but classy, not shiny.

Could you kind of tidy up that little vignette—maybe drop a small rustic lantern in the grass just a bit to the right of the white table, take away (or quietly hide) those two floral metal buckets in front of it, and plop a tiny mason jar with a few pale pink wildflowers on the table beside the birdcage? Make it feel more romantic and balanced, subtle but a little eye-catching—maybe even give the lantern a soft warm glow so it looks cozy, but don’t overdo it.

Could you take away that little lantern by the table and maybe lean a small chalkboard sign up against the right side of the white table — a bit off-kilter so it looks casual — and drop a low wooden crate on the grass in front of the left stool? Make the scene feel more lived-in and balanced but not too staged; don’t make it glaringly new, and maybe leave a tiny trace of the lantern so it doesn’t look like it was never there. Also, if you can, make the chalkboard look hand-lettered and natural without being too bold.

Can you just lose that chalkboard you put there before — like, take it away but don’t make it look like it was never touched — and then drop a little stack of old books on the crate in front of the stool, you know, a neat-ish vintage pile (not too perfect, but not sloppy), and pop a small bouquet into the mason jar on the table by the birdcage so it looks charming and natural? Make the books feel a bit worn and the flowers soft and fresh, but subtle — nothing screaming “new prop.”

BibTeX

@article{ye2026agentbanana,
        title={Agent Banana: High-Fidelity Image Editing with Agentic Thinking and Tooling},
        author={Ruijie Ye and Jiayi Zhang and Zhuoxin Liu and Zihao Zhu and Siyuan Yang and Li Li and Tianfu Fu and Franck Dernoncourt and Yue Zhao and Jiacheng Zhu and Ryan Rossi and Wenhao Chai and Zhengzhong Tu},
        year={2026},
        eprint={2602.09084},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2602.09084}, 
}