GenEvolve

Self-Evolving Image Generation Agents via
Tool-Orchestrated Visual Experience Distillation

1 The Hong Kong University of Science and Technology (Guangzhou), 2 Meituan, 3 The Hong Kong University of Science and Technology, 4 National University of Singapore
Project Leader: Junfeng Luo (Meituan)
GenEvolve Generated Results
Representative images generated by GenEvolve agents with Nano Banana Pro and Qwen-Image-Edit, spanning architecture, nature, creative transfer, street scenes, scientific illustration, and more.

Open-ended image generation is no longer a simple prompt-to-image problem. High-quality generation often requires an agent to combine a model's internal generative ability with external resources. As requests become more diverse and demanding, we aim to develop a general image-generation agent that can self-evolve through trajectories and use tools more effectively across varied generation challenges.

To this end, we propose GenEvolve, a self-evolving framework based on Tool-Orchestrated Visual Experience Distillation. In GenEvolve, each generation attempt is modeled as a tool-orchestrated trajectory, where the agent gathers evidence, selects references, invokes generation skills, and composes them into a prompt-reference program. Unlike existing methods that rely on image-level scalar rewards, GenEvolve compares multiple trajectories and abstracts best-worst differences into structured visual experience, provided only to a privileged teacher branch. Inspired by on-policy self-distillation, Visual Experience Distillation provides dense token-level supervision for better tool orchestration.

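We further construct GenEvolve-Data and GenEvolve-Bench. Experiments show consistent gains over strong baselines: on GenEvolve-Bench, GenEvolve with Nano Banana Pro reaches a KScore of 0.5739 versus 0.5481 for Gen-Searcher 8B with the same generator, and on WISE, GenEvolve with Qwen-Image-Edit reaches the highest overall score among the compared methods (0.82).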

Key Highlights

Tool-Orchestrated Trajectories

Models generation as multi-turn visual trajectories with external search, visual references, and callable generation knowledge.

Visual Experience Distillation

Converts best-worst trajectory differences into structured experience, enabling dense token-level supervision via privileged teacher distillation.

Self-Evolving Loop

GRPO + VED form a closed loop: stronger policies → better trajectories → richer experience → future improvements.

GenEvolve-Data & Bench

Comprehensive trajectory dataset and diagnostic benchmark spanning Knowledge-Anchored and Quality-Anchored challenges.

Framework Overview

GenEvolve Framework

Overview of GenEvolve. The student agent orchestrates search, references, and generation knowledge to produce a prompt-reference program. Multiple trajectories are judged; best-worst differences form visual experience injected into the teacher. GRPO + Visual Experience Self-Distillation closes the self-evolving loop.
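
The GRPO half of the loop compares several trajectories sampled for the same request. A minimal sketch of the group-relative advantage used in standard GRPO is shown below; whether GenEvolve applies exactly this normalization is not stated on this page, and the function name and `eps` value are illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize judge rewards within one group of trajectories.

    Each trajectory is scored relative to its siblings for the same request:
    (reward - group mean) / (group std + eps). `eps` is an assumed constant
    that avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical judge scores for three trajectories of one request.
advantages = group_relative_advantages([0.2, 0.9, 0.5])
```

Trajectories scoring above the group mean receive positive advantages and those below receive negative ones, so no separate value model is needed to rank them.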

1. Tool-Orchestrated Trajectories

The agent samples trajectories with three tool families: search(q) for textual evidence, image_search(q) for visual references, and query_knowledge(skill) for callable generation skills (text rendering, layout, anatomy, aesthetics, material, counting, etc.).
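
As a rough illustration of the three tool families, the sketch below records tool calls on a trajectory object. Only the tool names `search`, `image_search`, and `query_knowledge` come from the text above; the stub bodies, the `ToolCall` and `Trajectory` classes, and the sample request are hypothetical placeholders rather than the project's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Hypothetical stubs: real tools would query a search engine, an image
# search service, and a library of generation skills.
def search(q: str) -> str:
    return f"[text evidence for: {q}]"


def image_search(q: str) -> str:
    return f"[image reference for: {q}]"


def query_knowledge(skill: str) -> str:
    return f"[generation guidance for skill: {skill}]"


TOOLS: Dict[str, Callable[[str], str]] = {
    "search": search,
    "image_search": image_search,
    "query_knowledge": query_knowledge,
}


@dataclass
class ToolCall:
    name: str
    argument: str
    observation: str


@dataclass
class Trajectory:
    request: str
    calls: List[ToolCall] = field(default_factory=list)

    def call(self, name: str, argument: str) -> str:
        """Dispatch one tool call and append it to the trajectory log."""
        if name not in TOOLS:
            raise ValueError(f"unknown tool: {name}")
        observation = TOOLS[name](argument)
        self.calls.append(ToolCall(name, argument, observation))
        return observation


# Hypothetical request, used only to exercise the three tools.
traj = Trajectory("A poster of a lighthouse at dusk with the title 'Harbor Lights'")
traj.call("search", "lighthouse architecture")
traj.call("image_search", "lighthouse at dusk photo")
traj.call("query_knowledge", "text rendering")
```

Keeping every call and observation on the trajectory is what later allows whole trajectories, rather than only final images, to be compared.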

2. Prompt-Reference Program Synthesis

The trajectory outputs z=(g,R): a targeted instruction with ordinal references, binding constraints from the user request, retrieved facts, selected references, and activated generation knowledge.
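
One way to picture the program z=(g,R) is as an instruction paired with an ordered tuple of references, so that the instruction can point to "Reference 1", "Reference 2", and so on. The sketch below assumes g is the instruction text and R the reference list; the class name, field names, `render` format, and file name are illustrative and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PromptReferenceProgram:
    instruction: str             # g: the targeted instruction (assumed reading)
    references: Tuple[str, ...]  # R: ordered references; position gives the ordinal

    def render(self) -> str:
        """Serialize the program with 1-based ordinals for the references."""
        lines = [self.instruction]
        for i, ref in enumerate(self.references, start=1):
            lines.append(f"Reference {i}: {ref}")
        return "\n".join(lines)


# Hypothetical program; the file name is a placeholder.
z = PromptReferenceProgram(
    instruction="Render the lighthouse from Reference 1 at dusk; "
                "place the title 'Harbor Lights' in the top third.",
    references=("lighthouse_dusk.jpg",),
)
print(z.render())
```

Making the program immutable and ordered keeps ordinal mentions in the instruction aligned with the references that are actually passed to the generator.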

3. Visual Experience Extraction

Best-worst trajectory pairs are summarized into five structured experience slots: search strategy, knowledge activation, reference selection, prompt construction, and failure avoidance.
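
The five slots can be held in a small record, with the best and worst trajectories picked by judge score before they are summarized. This is a sketch under assumptions: the slot names follow the text above, but the `VisualExperience` class, the `best_worst` helper, and the sample scores are hypothetical, and the summarization step itself (presumably model-driven) is not shown.

```python
from dataclasses import asdict, dataclass
from typing import List, Tuple


@dataclass
class VisualExperience:
    # One free-text note per slot named in the text above.
    search_strategy: str = ""
    knowledge_activation: str = ""
    reference_selection: str = ""
    prompt_construction: str = ""
    failure_avoidance: str = ""


def best_worst(scored: List[Tuple[str, float]]) -> Tuple[str, str]:
    """Return (best_id, worst_id) from (trajectory_id, judge_score) pairs."""
    if len(scored) < 2:
        raise ValueError("need at least two trajectories to form a pair")
    best = max(scored, key=lambda item: item[1])[0]
    worst = min(scored, key=lambda item: item[1])[0]
    return best, worst


# Hypothetical judge scores for three trajectories of the same request.
pair = best_worst([("traj_a", 0.2), ("traj_b", 0.9), ("traj_c", 0.5)])

# A summarizer would fill the slots from the differences within the pair.
experience = VisualExperience(
    reference_selection="Prefer references that match the requested viewpoint.",
)
```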

4. Visual Experience Self-Distillation

The teacher receives the experience-patched context, while the student sees only the normal context. Importance-weighted reverse-KL distillation provides dense token-level guidance without changing inference.
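
To make the token-level signal concrete, the sketch below computes a reverse KL, KL(student || teacher), at each token position and averages it with per-token weights. The exact importance-weighting scheme is not described on this page, so `weights` is a placeholder for whatever ratios the method uses; the function names and toy logits are illustrative assumptions.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits: Sequence[float], teacher_logits: Sequence[float]) -> float:
    """KL(student || teacher) for a single token position."""
    log_p = log_softmax(student_logits)  # student
    log_q = log_softmax(teacher_logits)  # teacher (sees the experience-patched context)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def distill_loss(
    student: Sequence[Sequence[float]],
    teacher: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> float:
    """Weighted mean of per-token reverse KL; `weights` stands in for importance weights."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    total = sum(w * reverse_kl(s, t) for s, t, w in zip(student, teacher, weights))
    return total / total_weight


# Toy example: two token positions over a three-word vocabulary.
student_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
teacher_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss = distill_loss(student_logits, teacher_logits, weights=[1.0, 1.0])
```

Because the loss is defined at every token position, it gives denser feedback than a single scalar reward per image, and only the student's normal-context forward pass is needed at inference time.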

GenEvolve-Data & GenEvolve-Bench

A complete trajectory-level dataset for training and evaluating image-generation agents, from diverse prompts through tool-orchestrated teacher trajectories to filtered GT images.

Data Pipeline

Data construction pipeline. Diverse prompts → tool-orchestrated teacher trajectories → VLM audit → GT image rendering & filtering → splits for SFT cold-start, self-evolution, and held-out evaluation.

Prompt Pool

20K structured recipes spanning Knowledge-Anchored (entities, events, places) and Quality-Anchored (text, layout, counting, anatomy, material, aesthetics) generation challenges.

Teacher Trajectories

Each prompt is solved through a real multi-turn tool loop by strong teacher models (Seed2.0, Gemini 3 Pro), recording search queries, selected references, activated skills, and final programs.

Filtering & Audit

Programmatic checks and a VLM judge remove trajectories with incomplete tool loops, invalid reference selections, or underspecified programs. Only high-quality trajectories survive.
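
The programmatic part of this audit might look like the sketch below, which flags the three failure types named above. The record layout (`calls`, `references`, `instruction`), the `MIN_INSTRUCTION_WORDS` threshold, and the function name are assumptions made for illustration; the VLM judge stage is not modeled.

```python
from typing import Any, Dict, List

MIN_INSTRUCTION_WORDS = 8  # assumed threshold, not taken from the source


def audit_trajectory(traj: Dict[str, Any]) -> List[str]:
    """Return the failed checks; an empty list means the trajectory passes."""
    problems: List[str] = []
    calls = traj.get("calls", [])

    # Incomplete tool loop: no calls, or a call that never received an observation.
    if not calls or any(c.get("observation") is None for c in calls):
        problems.append("incomplete tool loop")

    # Invalid reference selection: a chosen reference that no image_search returned.
    retrieved = set()
    for c in calls:
        if c.get("name") == "image_search" and c.get("observation"):
            retrieved.update(c["observation"])
    if any(ref not in retrieved for ref in traj.get("references", [])):
        problems.append("invalid reference selection")

    # Underspecified program: the final instruction is too short to be useful.
    if len(traj.get("instruction", "").split()) < MIN_INSTRUCTION_WORDS:
        problems.append("underspecified program")
    return problems


# Hypothetical records in the assumed layout.
good = {
    "calls": [
        {"name": "search", "observation": "[stub evidence]"},
        {"name": "image_search", "observation": ["img_01", "img_02"]},
    ],
    "references": ["img_02"],
    "instruction": "Render the tower from Reference 1 at night with a lit skyline behind it.",
}
bad = {
    "calls": [{"name": "image_search", "observation": None}],
    "references": ["img_09"],
    "instruction": "A tower.",
}
kept = [t for t in (good, bad) if not audit_trajectory(t)]
```

Returning the list of failed checks, rather than a single boolean, makes it easy to report why a trajectory was dropped.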

GT Images & Splits

Teacher programs are rendered into GT images by Nano Banana Pro, filtered for quality, then split into SFT (trajectories), self-evolution (image feedback), and GenEvolve-Bench (held-out eval).

Data Examples

Each entry contains a user request, full tool trajectory, selected references, and GT image.

Trajectory Visualization

See how GenEvolve orchestrates tools step-by-step: from user request through search, reference selection, skill activation, to the final generated image.

Results

Main Comparison on GenEvolve-Bench
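
Faith., Vis., Text, and Aesth. are the judge dimensions; KScore, Know., and Qual. are the benchmark overall scores.

| Method | Generator | Faith. | Vis. | Text | Aesth. | KScore | Know. | Qual. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Direct Generator Baselines | | | | | | | | |
| Lumina-Image 2.0 | Lumina-Image 2.0 | 0.1044 | 0.0000 | 0.3308 | 0.2694 | 0.1697 | 0.1528 | 0.1915 |
| BAGEL | BAGEL | 0.1212 | 0.0059 | 0.3721 | 0.4082 | 0.2041 | 0.1684 | 0.2504 |
| SD-3.5-Large | SD-3.5 | 0.1456 | 0.0135 | 0.3872 | 0.4865 | 0.2235 | 0.1943 | 0.2612 |
| FLUX.1-dev | FLUX.1 | 0.1574 | 0.0059 | 0.4150 | 0.5556 | 0.2396 | 0.2097 | 0.2784 |
| FLUX.2 Klein 4B | FLUX.2 | 0.2525 | 0.0059 | 0.3847 | 0.5648 | 0.2380 | 0.2004 | 0.2865 |
| Z-Image-Turbo | Z-Image | 0.2837 | 0.0396 | 0.4369 | 0.6187 | 0.2808 | 0.2340 | 0.3413 |
| FLUX.2 Klein 9B | FLUX.2 | 0.3662 | 0.0210 | 0.4192 | 0.6599 | 0.2787 | 0.2327 | 0.3382 |
| Z-Image | Z-Image | 0.3333 | 0.0278 | 0.4352 | 0.5429 | 0.2728 | 0.2203 | 0.3407 |
| Qwen-Image | Qwen-Image | 0.3729 | 0.0623 | 0.4226 | 0.6751 | 0.2987 | 0.2384 | 0.3768 |
| Nano Banana Pro | Nano Banana Pro | 0.7761 | 0.2837 | 0.6178 | 0.9158 | 0.5298 | 0.5160 | 0.5477 |
| Agentic Image-Generation Workflows | | | | | | | | |
| Gen-Searcher 8B | Qwen-Image-Edit-2511 | 0.5284 | 0.1050 | 0.4768 | 0.6377 | 0.3493 | 0.3293 | 0.3745 |
| Gen-Searcher 8B | Nano Banana Pro | 0.7465 | 0.3378 | 0.6198 | 0.9036 | 0.5481 | 0.5472 | 0.5492 |
| GenEvolve (Ours) | Qwen-Image-Edit-2511 | 0.5303 | 0.1338 | 0.4907 | 0.6347 | 0.3663 | 0.3410 | 0.3990 |
| GenEvolve (Ours) | Nano Banana Pro | 0.7970 | 0.3832 | 0.6218 | 0.9222 | 0.5739 | 0.5669 | 0.5830 |
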
External Evaluation on WISE Benchmark
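
| Model | Cultural | Time | Space | Biology | Physics | Chemistry | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Direct Generator Baselines | | | | | | | |
| Emu3 | 0.34 | 0.45 | 0.48 | 0.41 | 0.45 | 0.27 | 0.39 |
| FLUX.1-schnell | 0.39 | 0.44 | 0.50 | 0.31 | 0.44 | 0.26 | 0.40 |
| SD-3-Medium | 0.42 | 0.44 | 0.48 | 0.39 | 0.47 | 0.29 | 0.42 |
| SD-3.5-Medium | 0.43 | 0.50 | 0.52 | 0.41 | 0.53 | 0.33 | 0.45 |
| SD-3.5-Large | 0.44 | 0.50 | 0.58 | 0.44 | 0.52 | 0.31 | 0.46 |
| FLUX.1-dev | 0.48 | 0.58 | 0.62 | 0.42 | 0.51 | 0.35 | 0.50 |
| Hunyuan-Image 3.0 | 0.58 | 0.57 | 0.70 | 0.56 | 0.63 | 0.31 | 0.57 |
| UniWorld-V2 | 0.60 | 0.61 | 0.70 | 0.53 | 0.64 | 0.32 | 0.58 |
| Qwen-Image | 0.62 | 0.63 | 0.77 | 0.57 | 0.75 | 0.40 | 0.62 |
| NextFlow-RL | 0.63 | 0.63 | 0.77 | 0.58 | 0.67 | 0.39 | 0.62 |
| LongCat-Image | 0.66 | 0.61 | 0.72 | 0.66 | 0.72 | 0.49 | 0.65 |
| DeepGen1.0 | 0.72 | 0.81 | 0.70 | 0.67 | 0.82 | 0.66 | 0.73 |
| GPT-4o | 0.81 | 0.71 | 0.89 | 0.83 | 0.79 | 0.74 | 0.80 |
| Agentic Image-Generation Workflows | | | | | | | |
| GenAgent | 0.78 | 0.67 | 0.78 | 0.72 | 0.77 | 0.55 | 0.72 |
| Gen-Searcher-8B + Qwen-Image | 0.80 | 0.71 | 0.82 | 0.76 | 0.74 | 0.75 | 0.77 |
| Mind-Brush | 0.83 | 0.69 | 0.84 | 0.71 | 0.85 | 0.68 | 0.78 |
| GenEvolve + Qwen-Image-Edit (Ours) | 0.84 | 0.74 | 0.87 | 0.83 | 0.81 | 0.83 | 0.82 |
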
Visual Comparison

Visual comparison on GenEvolve-Bench. Orange = external knowledge requirements. Blue = internal generation-knowledge requirements.

BibTeX

@article{chen2026genevolve,
  title   = {GenEvolve: Self-Evolving Image Generation Agents via 
             Tool-Orchestrated Visual Experience Distillation},
  author  = {Chen, Sixiang and Xing, Zhaohu and Ye, Tian and Geng, Xinyu 
             and Lin, Yunlong and Lai, Jianyu and He, Xuanhua and Zhai, Fuxiang 
             and Gao, Jialin and Zhu, Lei},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2026}
}