Meta-CoT: Enhancing Granularity and Generalization in Image Editing

Accepted by CVPR 2026

Your star means a lot to us in developing this project! ⭐⭐⭐

Shiyi Zhang¹,²*, Yiji Cheng²*, Tiankai Hang²*, Zijin Yin², Runze He², Yu Xu², Wenxun Dai¹,², Yunlong Lin², Chunyu Wang², Qinglin Lu², Yansong Tang¹†

¹Tsinghua University ²Hunyuan, Tencent

*Equal contribution †Corresponding author

TL;DR

We propose Meta-CoT, a two-level Chain-of-Thought (CoT) paradigm for image editing. It first decomposes each editing intention into a (task, target, understanding ability) triplet, then breaks the task down into five fundamental meta-tasks, enabling strong generalization across 21+ editing operations. Combined with a CoT-Editing Consistency (CEC) Reward optimized via Flow-GRPO, Meta-CoT achieves a +15.8% overall improvement across 21 editing tasks.

🔥 Update Log

[2026/4/27] 📢 📢 Meta-CoT, a two-level CoT decomposition framework that enhances both granularity and generalization in image editing, is released.
[2026/4/27] 📢 📢 Our 21-Tasks-Bench is released.

📋 TODO

Release Meta-CoT checkpoints.
Release 21-Tasks-Bench.
Release example training data.
Release SFT training and inference code.
Release RL training code.

🌟 Overview

Key Features

Triplet Decomposition: Any editing intention is decomposed into (task, target, understanding ability), which sharpens understanding granularity and guides the model to learn each element during training.
Meta-task Generalization: All editing tasks are broken down into 5 fundamental meta-tasks (Addition, Deletion, Replacement, Camera Motion, Position Change). Training on these 5 meta-tasks generalizes to 21+ diverse editing tasks.
CoT-Editing Consistency (CEC) Reward: A VLM-based reward that measures the consistency between the CoT reasoning and the actual editing output, integrated into the Flow-GRPO framework.
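As a rough illustration of the triplet and meta-task decomposition described above, the sketch below models one decomposed editing intention as a small Python data structure. The class names `MetaTask` and `EditTriplet`, the field layout, and the example values are assumptions made for illustration; they are not taken from the Meta-CoT code or training data.

```python
from dataclasses import dataclass
from enum import Enum


class MetaTask(Enum):
    """The five fundamental meta-tasks named in this README."""
    ADDITION = "Addition"
    DELETION = "Deletion"
    REPLACEMENT = "Replacement"
    CAMERA_MOTION = "Camera Motion"
    POSITION_CHANGE = "Position Change"


@dataclass(frozen=True)
class EditTriplet:
    """Hypothetical container for one decomposed editing intention."""
    task: str                   # what kind of edit is requested
    target: str                 # what the edit applies to
    understanding_ability: str  # what the model must understand to do it
    meta_tasks: tuple           # MetaTask members the edit is broken into

    def __post_init__(self):
        # Reject empty decompositions and values that are not meta-tasks.
        if not self.meta_tasks:
            raise ValueError("an edit needs at least one meta-task")
        for m in self.meta_tasks:
            if not isinstance(m, MetaTask):
                raise TypeError(f"not a MetaTask: {m!r}")


# Illustrative values only; not taken from the paper or the training data.
example = EditTriplet(
    task="remove object",
    target="the red car",
    understanding_ability="object localization",
    meta_tasks=(MetaTask.DELETION,),
)
print(example.task, [m.value for m in example.meta_tasks])
```

Keeping the meta-tasks in a closed enum reflects the claim that a fixed set of five operations covers the wider task space; how a given task maps onto them is left open here.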

Key Results

+15.8% overall improvement across 21 editing tasks on 21-Tasks-Bench
+13.0% improvement on the ImgEdit benchmark vs. BAGEL (w/ think)
5 meta-tasks are sufficient to generalize to 21+ unseen editing tasks

Repository Structure

.
├── modeling/                    # Core model architecture
│   ├── bagel/                   # BAGEL MoT model, SFT variant, Qwen2-NaViT, SigLIP-NaViT
│   ├── diffusion/               # SDE-based denoising sampler
│   ├── autoencoder.py           # FLUX VAE loader
│   ├── qwen2/                   # Qwen2 backbone
│   └── siglip/                  # SigLIP vision encoder
├── data/                        # Dataset classes and configs
│   ├── configs/                 # YAML dataset mixing configs
│   ├── interleave_datasets/     # Edit/T2I interleaved dataset classes
│   ├── dataset_base.py          # Base iterable dataset (parquet shards)
│   ├── dataset_info.py          # Dataset registry and path config
│   └── transforms.py            # NaViT-aware image transforms
├── train/                       # Training entry points
│   └── sft/                     # SFT training (train_sft.py)
├── inference/                   # Inference scripts and benchmarks
│   ├── inferencer.py            # Core interleaved inference class
│   ├── edit_single.py           # Single-image editing inference
│   ├── eval_benchmark.py        # Checkpoint evaluation on editing benchmarks
│   └── benchmark/               # Editing benchmark datasets (CSV)
├── scripts/                     # Launch scripts
│   ├── train_sft_edit.sh        # SFT for editing (multi-node)
│   ├── train_sft_edit_single_node.sh
│   ├── download_ckpt.sh         # Download BAGEL base + Meta-CoT checkpoint
│   ├── download_bench.sh        # Download benchmark data
│   └── download_example_data.sh # Download example SFT training data
├── assets/                      # Figures for README
├── env.sh                       # One-command environment setup
└── requirements.txt

🚀 Getting Started

Environment Requirement 🔧

git clone https://github.com/shiyi-zh0408/Meta-CoT.git
cd Meta-CoT
source env.sh

env.sh handles conda environment creation (metacot, Python 3.10), requirements.txt installation, flash_attn==2.5.8, and any system dependencies.

Data Preparation ⏬

Benchmark data and example SFT training data can be downloaded directly via:

bash scripts/download_bench.sh
bash scripts/download_example_data.sh

Training data should be placed under data/ in Parquet shard format (for T2I/editing) or JSONL + image folders (for VLM tasks). Register your dataset paths in data/dataset_info.py and configure the dataset mix in the appropriate data/configs/*.yaml file.
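Before registering a dataset, it can help to confirm that the Parquet shards are actually where you expect them. The helper below is a minimal sketch using only the standard library; the function name `list_parquet_shards` and the path `data/my_edit_dataset` are placeholders, not names from the repository, and the sketch does not read or validate shard contents.

```python
from pathlib import Path


def list_parquet_shards(root):
    """Return all *.parquet files under root, sorted for a stable order."""
    root = Path(root)
    if not root.is_dir():
        raise FileNotFoundError(f"dataset directory not found: {root}")
    return sorted(root.rglob("*.parquet"))


if __name__ == "__main__":
    # "data/my_edit_dataset" is a placeholder, not a directory from the repo.
    try:
        shards = list_parquet_shards("data/my_edit_dataset")
        print(f"found {len(shards)} shard(s)")
    except FileNotFoundError as err:
        print(err)
```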

Example configs:

data/configs/reasonedit_example.yaml — Meta-CoT SFT data mix

Checkpoints 📊

bash scripts/download_ckpt.sh

This downloads the BAGEL-7B-MoT base weights and the Meta-CoT fine-tuned checkpoint into ckpts/bagel/ and ckpts/metacot/ respectively.

Note: You can skip this step if you plan to run training — the training script auto-downloads the checkpoint to ckpts/ if not already present.
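If you do download the checkpoints manually, a quick check like the following can confirm that both directories were populated before you run inference. This is a sketch under the assumption that the download script writes to `ckpts/bagel/` and `ckpts/metacot/` relative to the repository root, as stated above; the function name `missing_checkpoint_dirs` is hypothetical, and the check does not verify file integrity.

```python
from pathlib import Path

# Directory names as stated above for scripts/download_ckpt.sh.
EXPECTED_DIRS = ("ckpts/bagel", "ckpts/metacot")


def missing_checkpoint_dirs(repo_root="."):
    """Return the expected checkpoint directories that are absent or empty."""
    missing = []
    for rel in EXPECTED_DIRS:
        path = Path(repo_root) / rel
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    gaps = missing_checkpoint_dirs()
    print("all checkpoints present" if not gaps else f"missing: {gaps}")
```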

🏃🏼 Running Scripts

Training 🤯

SFT Training:

bash scripts/train_sft_edit.sh

Inference 📜

Single-image editing:

python inference/edit_single.py --image <your-image-path> --instruction <editing-instruction>
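To apply the single-image command above to several images, one option is a thin wrapper that builds the same command line for each job. The sketch below uses only the two flags shown above (`--image` and `--instruction`); the helper names `build_edit_command` and `run_edits` are hypothetical, and it assumes the script is invoked from the repository root.

```python
import shlex
import subprocess
import sys


def build_edit_command(image_path, instruction):
    """Build the argument list for one inference/edit_single.py call."""
    return [
        sys.executable, "inference/edit_single.py",
        "--image", str(image_path),
        "--instruction", instruction,
    ]


def run_edits(jobs, dry_run=True):
    """Run (or, with dry_run, only print) one call per (image, instruction)."""
    for image_path, instruction in jobs:
        cmd = build_edit_command(image_path, instruction)
        if dry_run:
            print(shlex.join(cmd))
        else:
            # check=True stops the batch at the first failing edit.
            subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when an instruction contains spaces or quotes.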

Evaluation on 21-Tasks-Bench:

python inference/eval_benchmark.py

Inference Parameters

| Parameter | Description | Typical Value / Range |
| --- | --- | --- |
| `cfg_text_scale` | Text prompt guidance strength | 4.0 - 8.0 |
| `cfg_image_scale` | Input image preservation strength | 1.0 - 2.0 |
| `cfg_interval` | Fraction of steps with CFG | [0.4, 1.0] |
| `timestep_shift` | Denoising step distribution shift | 1.0 - 5.0 |
| `num_timesteps` | Total denoising steps | 50 |
| `cfg_renorm_min` | CFG-Renorm minimum (1.0 disables) | 0.0 |
| `cfg_renorm_type` | `global` / `channel` / `text_channel` | `global` |

Tip: If edited images appear blurry, try `global` CFG-Renorm (`cfg_renorm_type`), decrease `cfg_renorm_min`, or decrease the CFG scale (`cfg_text_scale`).
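When sweeping these settings, it can be convenient to keep them in one object and flag values that fall outside the typical ranges listed above. The following is a minimal sketch: the `InferenceParams` class is hypothetical, its defaults are illustrative picks from the table rather than the repository's actual defaults, and how the values are passed to the inference code is not shown.

```python
from dataclasses import dataclass


@dataclass
class InferenceParams:
    # Defaults are illustrative picks from the table, not the repo's defaults.
    cfg_text_scale: float = 4.0
    cfg_image_scale: float = 2.0
    cfg_interval: tuple = (0.4, 1.0)
    timestep_shift: float = 3.0
    num_timesteps: int = 50
    cfg_renorm_min: float = 0.0
    cfg_renorm_type: str = "global"

    def warnings(self):
        """Return one message per value outside the table's typical range."""
        out = []
        if not 4.0 <= self.cfg_text_scale <= 8.0:
            out.append("cfg_text_scale outside 4.0 - 8.0")
        if not 1.0 <= self.cfg_image_scale <= 2.0:
            out.append("cfg_image_scale outside 1.0 - 2.0")
        start, end = self.cfg_interval
        if not 0.0 <= start <= end <= 1.0:
            out.append("cfg_interval must satisfy 0 <= start <= end <= 1")
        if not 1.0 <= self.timestep_shift <= 5.0:
            out.append("timestep_shift outside 1.0 - 5.0")
        if self.num_timesteps <= 0:
            out.append("num_timesteps must be positive")
        if self.cfg_renorm_type not in ("global", "channel", "text_channel"):
            out.append("unknown cfg_renorm_type")
        return out


params = InferenceParams(cfg_text_scale=12.0)
for message in params.warnings():
    print("warning:", message)
```

The checks only warn rather than raise, since the ranges in the table are described as typical, not as hard limits.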

🤝🏼 Cite Us

@article{zhang2026meta,
  title={Meta-CoT: Enhancing Granularity and Generalization in Image Editing},
  author={Zhang, Shiyi and Cheng, Yiji and Hang, Tiankai and Yin, Zijin and He, Runze and Xu, Yu and Dai, Wenxun and Lin, Yunlong and Wang, Chunyu and Lu, Qinglin and others},
  journal={arXiv preprint arXiv:2604.24625},
  year={2026}
}

🙏 Acknowledgments

Our code is built on BAGEL. Thanks to all of its contributors!
