
shiyi-zh0408/Meta-CoT

Meta-CoT: Enhancing Granularity and Generalization in Image Editing

Project Page Paper Code Model Benchmark

Accepted by CVPR 2026

Your star means a lot to us in developing this project! ⭐⭐⭐

Shiyi Zhang1,2*, Yiji Cheng2*, Tiankai Hang2*, Zijin Yin2, Runze He2, Yu Xu2, Wenxun Dai1,2, Yunlong Lin2, Chunyu Wang2, Qinglin Lu2, Yansong Tang1†

1Tsinghua University    2Hunyuan, Tencent

*Equal contribution    †Corresponding author

TL;DR

We propose Meta-CoT, a two-level Chain-of-Thought decomposition paradigm for image editing that decomposes editing intentions into a (task, target, understanding ability) triplet and further into five fundamental meta-tasks, enabling strong generalization across 21+ editing operations. Combined with a CoT-Editing Consistency (CEC) Reward via Flow-GRPO, Meta-CoT achieves +15.8% overall improvement across 21 editing tasks.
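The two-level decomposition described above can be pictured as a small data structure. The following is a minimal sketch, not the repository's actual representation: the class names `EditTriplet` and `MetaCoTPlan`, their fields, and the example values are all hypothetical and chosen only for illustration. The five meta-task names are the ones listed under Key Features.

```python
from dataclasses import dataclass
from enum import Enum


class MetaTask(Enum):
    """The five meta-tasks named in this README."""
    ADDITION = "Addition"
    DELETION = "Deletion"
    REPLACEMENT = "Replacement"
    CAMERA_MOTION = "Camera Motion"
    POSITION_CHANGE = "Position Change"


@dataclass(frozen=True)
class EditTriplet:
    """Hypothetical container for the (task, target, understanding ability) triplet."""
    task: str
    target: str
    understanding_ability: str


@dataclass(frozen=True)
class MetaCoTPlan:
    """Hypothetical two-level plan: a triplet plus an ordered tuple of meta-tasks."""
    triplet: EditTriplet
    meta_tasks: tuple


# Illustrative values only; the field contents are made up for this sketch.
plan = MetaCoTPlan(
    triplet=EditTriplet(
        task="replace object",
        target="the red car",
        understanding_ability="object recognition",
    ),
    meta_tasks=(MetaTask.REPLACEMENT,),
)
```

Under this reading, a more complex edit would carry several meta-tasks in `meta_tasks`, while the triplet stays a single high-level description of the intent.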

Meta-CoT Teaser

🔥 Update Log

  • [2026/4/27] 📢 📢 Meta-CoT, a two-level CoT decomposition framework that enhances both granularity and generalization in image editing, is released.
  • [2026/4/27] 📢 📢 Our 21-Tasks-Bench is released.

🌟 Overview

Key Features

  • Triplet Decomposition — Any editing intention is decomposed into (task, target, understanding ability), enhancing understanding granularity and guiding the model to learn each element during training.
  • Meta-task Generalization — All editing tasks are broken down into 5 fundamental meta-tasks (Addition, Deletion, Replacement, Camera Motion, Position Change). Training on these 5 meta-tasks generalizes to 21+ diverse editing tasks.
  • CoT-Editing Consistency (CEC) Reward — A VLM-based reward that measures the consistency between the CoT reasoning and the actual editing output, integrated into the Flow-GRPO framework.
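To make the role of the CEC reward in GRPO-style training more concrete, here is a minimal sketch of the group-relative advantage step commonly used in GRPO: rewards for several samples of the same prompt are normalized by the group mean and standard deviation. This README does not show how the reward is wired in, so this is an assumption about the general technique, not the repository's code. The function name `group_relative_advantages` and the scores are hypothetical; the VLM scoring itself is not shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical CEC scores for four edits sampled from the same instruction.
cec_scores = [0.9, 0.4, 0.6, 0.5]
advantages = group_relative_advantages(cec_scores)
```

Samples whose edit is more consistent with their own CoT than the group average receive a positive advantage, and the rest receive a negative one.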

Key Results

  • +15.8% overall improvement across 21 editing tasks on 21-Tasks-Bench
  • +13.0% improvement on ImgEdit benchmark vs. BAGEL (w/ think)
  • 5 meta-tasks are sufficient to generalize to 21+ unseen editing tasks

Repository Structure

.
├── modeling/                    # Core model architecture
│   ├── bagel/                   # BAGEL MoT model, SFT variant, Qwen2-NaViT, SigLIP-NaViT
│   ├── diffusion/               # SDE-based denoising sampler
│   ├── autoencoder.py           # FLUX VAE loader
│   ├── qwen2/                   # Qwen2 backbone
│   └── siglip/                  # SigLIP vision encoder
├── data/                        # Dataset classes and configs
│   ├── configs/                 # YAML dataset mixing configs
│   ├── interleave_datasets/     # Edit/T2I interleaved dataset classes
│   ├── dataset_base.py          # Base iterable dataset (parquet shards)
│   ├── dataset_info.py          # Dataset registry and path config
│   └── transforms.py            # NaViT-aware image transforms
├── train/                       # Training entry points
│   └── sft/                     # SFT training (train_sft.py)
├── inference/                   # Inference scripts and benchmarks
│   ├── inferencer.py            # Core interleaved inference class
│   ├── edit_single.py           # Single-image editing inference
│   ├── eval_benchmark.py        # Checkpoint evaluation on editing benchmarks
│   └── benchmark/               # Editing benchmark datasets (CSV)
├── scripts/                     # Launch scripts
│   ├── train_sft_edit.sh        # SFT for editing (multi-node)
│   ├── train_sft_edit_single_node.sh
│   ├── download_ckpt.sh         # Download BAGEL base + Meta-CoT checkpoint
│   ├── download_bench.sh        # Download benchmark data
│   └── download_example_data.sh # Download example SFT training data
├── assets/                      # Figures for README
├── env.sh                       # One-command environment setup
└── requirements.txt

🚀 Getting Started

Environment Requirement 🔧

git clone https://github.com/shiyi-zh0408/Meta-CoT.git
cd Meta-CoT
source env.sh

env.sh handles conda environment creation (metacot, Python 3.10), requirements.txt installation, flash_attn==2.5.8, and any system dependencies.

Data Preparation ⏬

Benchmark data and example SFT training data can be downloaded directly via:

bash scripts/download_bench.sh
bash scripts/download_example_data.sh

Training data should be placed under data/ in Parquet shard format (for T2I/editing) or JSONL + image folders (for VLM tasks). Register your dataset paths in data/dataset_info.py and configure the dataset mix in the appropriate data/configs/*.yaml file.

Example configs:

  • data/configs/reasonedit_example.yaml: Meta-CoT SFT data mix

Checkpoints 📊

bash scripts/download_ckpt.sh

This downloads the BAGEL-7B-MoT base weights and the Meta-CoT fine-tuned checkpoint into ckpts/bagel/ and ckpts/metacot/ respectively.
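A quick way to confirm the download finished is to check that both target directories exist and are non-empty. This is a minimal sketch that assumes only the two directory names mentioned above; the helper `missing_checkpoint_dirs` is hypothetical and is not part of the repository, and it does not verify the contents of the checkpoints.

```python
from pathlib import Path


def missing_checkpoint_dirs(root="."):
    """Return the expected checkpoint directories that are absent or empty."""
    expected = [Path(root) / "ckpts" / "bagel", Path(root) / "ckpts" / "metacot"]
    return [str(p) for p in expected if not p.is_dir() or not any(p.iterdir())]


if __name__ == "__main__":
    # Run from the repository root after scripts/download_ckpt.sh.
    missing = missing_checkpoint_dirs()
    if missing:
        print("Missing or empty:", ", ".join(missing))
    else:
        print("Both checkpoint directories are present.")
```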

Note: You can skip this step if you plan to run training — the training script auto-downloads the checkpoint to ckpts/ if not already present.

🏃🏼 Running Scripts

Training 🤯

SFT Training:

bash scripts/train_sft_edit.sh

Inference 📜

Single-image editing:

python inference/edit_single.py --image <your-image-path> --instruction <editing-instruction>
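Editing instructions usually contain spaces, so they need to be quoted on the command line. The sketch below builds the same command from Python using only the two flags shown above, and passes the instruction as a single argument so no manual quoting is needed. The helper name `build_edit_command` and the image path `examples/cat.png` are hypothetical.

```python
import shlex
import sys


def build_edit_command(image_path, instruction):
    """Build the argv list for inference/edit_single.py with the two flags shown above."""
    return [
        sys.executable, "inference/edit_single.py",
        "--image", image_path,
        "--instruction", instruction,
    ]


cmd = build_edit_command("examples/cat.png", "replace the cat with a dog")

# Print a shell-safe version of the arguments for copy and paste.
print(shlex.join(cmd[1:]))

# To actually run it from the repository root:
# import subprocess; subprocess.run(cmd, check=True)
```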

Evaluation on 21-Tasks-Bench:

python inference/eval_benchmark.py

Inference Parameters

| Parameter | Description | Typical Range or Value |
| --- | --- | --- |
| cfg_text_scale | Text prompt guidance strength | 4.0 - 8.0 |
| cfg_image_scale | Input image preservation strength | 1.0 - 2.0 |
| cfg_interval | Fraction of steps with CFG | [0.4, 1.0] |
| timestep_shift | Denoising step distribution shift | 1.0 - 5.0 |
| num_timesteps | Total denoising steps | 50 |
| cfg_renorm_min | CFG-Renorm minimum (1.0 disables) | 0.0 |
| cfg_renorm_type | global / channel / text_channel | global |

Tip: If edited images appear blurry, try setting cfg_renorm_type to global, decreasing cfg_renorm_min, or decreasing cfg_scale.
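The parameters above can be grouped into one object and checked against the listed ranges before a run. The sketch below is hedged: `InferenceParams`, `shifted_timesteps`, and `cfg_active` are hypothetical names, the defaults are just values picked from the listed ranges, and the two formulas are assumptions. The shift formula t' = s·t / (1 + (s − 1)·t) is a common choice for flow-based samplers, and the interval rule assumes CFG applies only when the timestep lies inside cfg_interval. Neither is confirmed by this README.

```python
from dataclasses import dataclass


@dataclass
class InferenceParams:
    """Illustrative defaults picked from the ranges listed in the README."""
    cfg_text_scale: float = 4.0
    cfg_image_scale: float = 2.0
    cfg_interval: tuple = (0.4, 1.0)
    timestep_shift: float = 3.0
    num_timesteps: int = 50
    cfg_renorm_min: float = 0.0
    cfg_renorm_type: str = "global"

    def out_of_range(self):
        """Return names of parameters outside the typical ranges listed in the README."""
        checks = {
            "cfg_text_scale": 4.0 <= self.cfg_text_scale <= 8.0,
            "cfg_image_scale": 1.0 <= self.cfg_image_scale <= 2.0,
            "timestep_shift": 1.0 <= self.timestep_shift <= 5.0,
            "cfg_renorm_type": self.cfg_renorm_type in {"global", "channel", "text_channel"},
        }
        return [name for name, ok in checks.items() if not ok]


def shifted_timesteps(num_timesteps, shift):
    """Assumed schedule: t' = shift * t / (1 + (shift - 1) * t) on a linear grid from 1 to 0."""
    grid = [1.0 - i / (num_timesteps - 1) for i in range(num_timesteps)]
    return [shift * t / (1.0 + (shift - 1.0) * t) for t in grid]


def cfg_active(t, interval):
    """Assumed rule: apply CFG only when t falls inside the closed interval."""
    lo, hi = interval
    return lo <= t <= hi


params = InferenceParams()
timesteps = shifted_timesteps(params.num_timesteps, params.timestep_shift)
```

With a shift above 1.0, this schedule spends more of its steps at high noise levels, which is one way to read "denoising step distribution shift"; values outside the typical ranges are reported by `out_of_range()` rather than rejected.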

🤝🏼 Cite Us

@article{zhang2026meta,
  title={Meta-CoT: Enhancing Granularity and Generalization in Image Editing},
  author={Zhang, Shiyi and Cheng, Yiji and Hang, Tiankai and Yin, Zijin and He, Runze and Xu, Yu and Dai, Wenxun and Lin, Yunlong and Wang, Chunyu and Lu, Qinglin and others},
  journal={arXiv preprint arXiv:2604.24625},
  year={2026}
}

🙏 Acknowledgments

Our code is built on BAGEL. Thanks to all of its contributors!
