Accepted by CVPR 2026
Your star means a lot to us in developing this project! ⭐⭐⭐
Shiyi Zhang1,2*, Yiji Cheng2*, Tiankai Hang2*, Zijin Yin2, Runze He2, Yu Xu2, Wenxun Dai1,2, Yunlong Lin2, Chunyu Wang2, Qinglin Lu2, Yansong Tang1†
1Tsinghua University 2Hunyuan, Tencent
*Equal contribution †Corresponding author
We propose Meta-CoT, a two-level Chain-of-Thought decomposition paradigm for image editing that decomposes editing intentions into a (task, target, understanding ability) triplet and further into five fundamental meta-tasks, enabling strong generalization across 21+ editing operations. Combined with a CoT-Editing Consistency (CEC) Reward via Flow-GRPO, Meta-CoT achieves +15.8% overall improvement across 21 editing tasks.
- [2026/4/27] 📢 📢 Meta-CoT is released, a two-level CoT decomposition framework that enhances both granularity and generalization in image editing.
- [2026/4/27] 📢 📢 Our 21-Tasks-Bench is released.
- Release Meta-CoT checkpoints.
- Release 21-Tasks-Bench.
- Release example training data.
- Release SFT training and inference code.
- Release RL training code.
- Triplet Decomposition: Any editing intention is decomposed into (task, target, understanding ability), which makes the model's understanding more fine-grained and guides it to learn each element during training.
- Meta-task Generalization: All editing tasks are broken down into 5 fundamental meta-tasks (Addition, Deletion, Replacement, Camera Motion, Position Change). Training on these 5 meta-tasks generalizes to 21+ diverse editing tasks.
- CoT-Editing Consistency (CEC) Reward: A VLM-based reward that measures consistency between the CoT reasoning and the actual editing output, integrated into the GRPO framework.
- +15.8% overall improvement across 21 editing tasks on 21-Tasks-Bench
- +13.0% improvement on ImgEdit benchmark vs. BAGEL (w/ think)
- 5 meta-tasks are sufficient to generalize to 21+ unseen editing tasks
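The triplet decomposition and the five meta-tasks listed above can be pictured as a small data structure. The following is a minimal sketch for illustration only: the `MetaTask`, `EditTriplet`, and `format_cot_header` names, the text layout, and the example values are all assumptions and are not taken from this repository's code.

```python
from dataclasses import dataclass
from enum import Enum


class MetaTask(Enum):
    """The five meta-tasks named in this README."""
    ADDITION = "Addition"
    DELETION = "Deletion"
    REPLACEMENT = "Replacement"
    CAMERA_MOTION = "Camera Motion"
    POSITION_CHANGE = "Position Change"


@dataclass(frozen=True)
class EditTriplet:
    """Hypothetical container for a (task, target, understanding ability) triplet."""
    task: str
    target: str
    understanding_ability: str


def format_cot_header(triplet: EditTriplet, meta_tasks: list[MetaTask]) -> str:
    """Render a triplet and a meta-task sequence as plain text (illustrative layout)."""
    steps = " -> ".join(m.value for m in meta_tasks)
    return (
        f"Task: {triplet.task}\n"
        f"Target: {triplet.target}\n"
        f"Understanding ability: {triplet.understanding_ability}\n"
        f"Meta-tasks: {steps}"
    )


# Example values are made up; they do not come from the benchmark or the paper.
example = EditTriplet(
    task="object swap",
    target="the red car",
    understanding_ability="object recognition",
)
print(format_cot_header(example, [MetaTask.DELETION, MetaTask.ADDITION]))
```

How a given editing task maps onto meta-tasks is defined by the method itself; the Deletion followed by Addition sequence here is only a placeholder.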
.
├── modeling/ # Core model architecture
│ ├── bagel/ # BAGEL MoT model, SFT variant, Qwen2-NaViT, SigLIP-NaViT
│ ├── diffusion/ # SDE-based denoising sampler
│ ├── autoencoder.py # FLUX VAE loader
│ ├── qwen2/ # Qwen2 backbone
│ └── siglip/ # SigLIP vision encoder
├── data/ # Dataset classes and configs
│ ├── configs/ # YAML dataset mixing configs
│ ├── interleave_datasets/ # Edit/T2I interleaved dataset classes
│ ├── dataset_base.py # Base iterable dataset (parquet shards)
│ ├── dataset_info.py # Dataset registry and path config
│ └── transforms.py # NaViT-aware image transforms
├── train/ # Training entry points
│ └── sft/ # SFT training (train_sft.py)
├── inference/ # Inference scripts and benchmarks
│ ├── inferencer.py # Core interleaved inference class
│ ├── edit_single.py # Single-image editing inference
│ ├── eval_benchmark.py # Checkpoint evaluation on editing benchmarks
│ └── benchmark/ # Editing benchmark datasets (CSV)
├── scripts/ # Launch scripts
│ ├── train_sft_edit.sh # SFT for editing (multi-node)
│ ├── train_sft_edit_single_node.sh
│ ├── download_ckpt.sh # Download BAGEL base + Meta-CoT checkpoint
│ ├── download_bench.sh # Download benchmark data
│ └── download_example_data.sh # Download example SFT training data
├── assets/ # Figures for README
├── env.sh # One-command environment setup
└── requirements.txt
Environment Requirement 🔧
git clone https://github.com/shiyi-zh0408/Meta-CoT.git
cd Meta-CoT
source env.sh
env.sh handles conda environment creation (metacot, Python 3.10), requirements.txt installation, flash_attn==2.5.8, and any system dependencies.
Data Preparation ⏬
Benchmark data and example SFT training data can be downloaded directly via:
bash scripts/download_bench.sh
bash scripts/download_example_data.sh
Training data should be placed under data/ in Parquet shard format (for T2I/editing) or JSONL + image folders (for VLM tasks). Register your dataset paths in data/dataset_info.py and configure the dataset mix in the appropriate data/configs/*.yaml file.
Example configs:
- data/configs/reasonedit_example.yaml: Meta-CoT SFT data mix
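Before registering a dataset, it can help to confirm that the Parquet shards are where you expect them. The helper below is a minimal sketch using only the standard library; the function names and the example directory are hypothetical, and it makes no assumptions about the registry format in data/dataset_info.py.

```python
from pathlib import Path


def list_parquet_shards(root: str) -> list[Path]:
    """Return all *.parquet files under root (recursively), sorted for stable ordering."""
    return sorted(Path(root).rglob("*.parquet"))


def check_dataset_dir(root: str) -> int:
    """Raise if no shards are found under root; otherwise return the shard count."""
    shards = list_parquet_shards(root)
    if not shards:
        raise FileNotFoundError(f"no .parquet shards found under {root!r}")
    return len(shards)
```

For example, `check_dataset_dir("data/my_edit_dataset")` (a hypothetical path) returns the number of shards found, or raises if the directory contains none.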
Checkpoints 📊
bash scripts/download_ckpt.sh
This downloads the BAGEL-7B-MoT base weights and the Meta-CoT fine-tuned checkpoint into ckpts/bagel/ and ckpts/metacot/ respectively.
Note: You can skip this step if you plan to run training, because the training script auto-downloads the checkpoint to ckpts/ if not already present.
Training 🤯
SFT Training:
bash scripts/train_sft_edit.sh
Inference 📜
Single-image editing:
python inference/edit_single.py --image <your-image-path> --instruction <editing-instruction>
Evaluation on 21-Tasks-Bench
python inference/eval_benchmark.py
Inference Parameters
| Parameter | Description | Typical Range |
|---|---|---|
| cfg_text_scale | Text prompt guidance strength | 4.0 - 8.0 |
| cfg_image_scale | Input image preservation strength | 1.0 - 2.0 |
| cfg_interval | Fraction of steps with CFG | [0.4, 1.0] |
| timestep_shift | Denoising step distribution shift | 1.0 - 5.0 |
| num_timesteps | Total denoising steps | 50 |
| cfg_renorm_min | CFG-Renorm minimum (1.0 disables) | 0.0 |
| cfg_renorm_type | global / channel / text_channel | global |
Tip: If edited images appear blurry, try global CFG-Renorm, decrease cfg_renorm_min, or decrease cfg_scale.
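To keep experiments within the ranges listed above, the parameters can be grouped into a small settings object with a sanity check. This is a minimal sketch, not the repository's configuration code: the `InferenceParams` class and its `warnings` method are hypothetical, the defaults are illustrative picks from the typical ranges rather than the project's actual defaults, and the check on `cfg_interval` assumes it is a (start, end) pair of fractions in [0, 1].

```python
from dataclasses import dataclass


@dataclass
class InferenceParams:
    # Field names follow the parameter table; default values are illustrative
    # picks from the listed typical ranges, not the repository's real defaults.
    cfg_text_scale: float = 4.0
    cfg_image_scale: float = 2.0
    cfg_interval: tuple[float, float] = (0.4, 1.0)
    timestep_shift: float = 3.0
    num_timesteps: int = 50
    cfg_renorm_min: float = 0.0
    cfg_renorm_type: str = "global"

    def warnings(self) -> list[str]:
        """Return notes for values that fall outside the typical ranges."""
        notes = []
        if not 4.0 <= self.cfg_text_scale <= 8.0:
            notes.append("cfg_text_scale outside typical range 4.0 - 8.0")
        if not 1.0 <= self.cfg_image_scale <= 2.0:
            notes.append("cfg_image_scale outside typical range 1.0 - 2.0")
        if not 1.0 <= self.timestep_shift <= 5.0:
            notes.append("timestep_shift outside typical range 1.0 - 5.0")
        start, end = self.cfg_interval
        # Assumption: cfg_interval is a (start, end) pair of step fractions.
        if not 0.0 <= start <= end <= 1.0:
            notes.append("cfg_interval should satisfy 0 <= start <= end <= 1")
        if self.cfg_renorm_type not in {"global", "channel", "text_channel"}:
            notes.append("cfg_renorm_type must be global, channel, or text_channel")
        return notes
```

Calling `InferenceParams(cfg_text_scale=12.0).warnings()` returns a single note about the text guidance scale, which makes out-of-range sweeps easy to spot before launching a long evaluation run.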
@article{zhang2026meta,
title={Meta-CoT: Enhancing Granularity and Generalization in Image Editing},
author={Zhang, Shiyi and Cheng, Yiji and Hang, Tiankai and Yin, Zijin and He, Runze and Xu, Yu and Dai, Wenxun and Lin, Yunlong and Wang, Chunyu and Lu, Qinglin and others},
journal={arXiv preprint arXiv:2604.24625},
year={2026}
}
Our code is adapted from BAGEL; thanks to all the contributors!
