Hongyang-Du/VideoGPA

VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation

Hongyang Du*1,2 · Junjie Ye*1 · Xiaoyan Cong*2 · Runhao Li1 · Jingcheng Ni2
Aman Agarwal2 · Zeqi Zhou2 · Zekun Li2 · Randall Balestriero2 · Yue Wang1

1Physical SuperIntelligence Lab, University of Southern California
2Department of Computer Science, Brown University
* Equal Contribution

License: MIT

[Figure: Pipeline]

🔥 News

  • 🏆 VideoGPA (Wan2.2-TI2V-5B base model) won 🥉 third place in the eBay-sponsored Image-to-Video Consistent Generation Challenge at the CVPR 2026 VGBE Workshop.
  • We release the VideoGPA-Wan2.2-TI2V DPO LoRA checkpoint! Download via python download_ckpt.py ti2v and generate with generate/Wan2.2-TI2V-5B.py.
  • We release VideoGPA-I2V-1K: we find that only 1,000 steps already achieve surprisingly strong visual quality and benchmark scores. We're releasing it so everyone can play around with it! Download via python download_ckpt.py i2v-1k.
  • We release our DL3DV video captions generated with CogVLM. Check them out in dl3dv_video_captions.
  • We release the training code for Wan2.2-TI2V-5B! Check it out in train/Wan2.2-TI2V-5B.

Quick Start

📋 Requirements

Python 3.10–3.12.

pip install -r requirements.txt

🔘 Checkpoint Download

# Download all VideoGPA LoRA checkpoints
python download_ckpt.py all

# Or download specific ones
python download_ckpt.py i2v      # CogVideoX-I2V-5B
python download_ckpt.py i2v-1k   # CogVideoX-I2V-5B (1K steps, lightweight)
python download_ckpt.py t2v      # CogVideoX-5B
python download_ckpt.py t2v15    # CogVideoX1.5-5B
python download_ckpt.py ti2v     # Wan2.2-TI2V-5B

checkpoints/
├── VideoGPA-I2V-lora/
│   └── adapter_model.safetensors
├── VideoGPA-I2V-1K-lora/
│   └── adapter_model.safetensors
├── VideoGPA-T2V-lora/
│   └── adapter_model.safetensors
├── VideoGPA-T2V1.5-lora/
│   └── adapter_model.safetensors
└── VideoGPA-Wan2.2TI2V-lora/
    └── adapter_model.safetensors
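
Before generating, it can help to confirm that each download landed in the expected place. The sketch below is not part of the repository: missing_checkpoints is a hypothetical helper, and it only checks the directory layout shown above, not the contents of the weight files.

```python
from pathlib import Path

# Directory names copied from the checkpoint layout above.
EXPECTED_LORA_DIRS = [
    "VideoGPA-I2V-lora",
    "VideoGPA-I2V-1K-lora",
    "VideoGPA-T2V-lora",
    "VideoGPA-T2V1.5-lora",
    "VideoGPA-Wan2.2TI2V-lora",
]


def missing_checkpoints(root="checkpoints", names=EXPECTED_LORA_DIRS):
    """Return the LoRA directories under `root` that lack adapter_model.safetensors."""
    root = Path(root)
    return [
        name
        for name in names
        if not (root / name / "adapter_model.safetensors").is_file()
    ]


if __name__ == "__main__":
    missing = missing_checkpoints()
    print("all checkpoints present" if not missing else f"missing: {missing}")
```

If only some checkpoints were downloaded (for example with python download_ckpt.py i2v), pass just those directory names through the names argument.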

🎬 Video Generation

All scripts share the same interface: --prompt_json (required), --output_dir (required), --lora_path (optional; omit it to run the baseline), --gpu_id, and --seed.

CogVideoX-5B Text-to-Video

# Baseline (no LoRA)
python generate/CogVideoX-5B.py \
    --prompt_json prompts.json \
    --output_dir outputs/t2v_baseline

# With VideoGPA DPO LoRA
python generate/CogVideoX-5B.py \
    --prompt_json prompts.json \
    --output_dir outputs/t2v_dpo \
    --lora_path checkpoints/VideoGPA-T2V-lora

CogVideoX-5B Image-to-Video

# Baseline
python generate/CogVideoX-5B-I2V.py \
    --prompt_json prompts.json \
    --output_dir outputs/i2v_baseline

# With VideoGPA DPO LoRA
python generate/CogVideoX-5B-I2V.py \
    --prompt_json prompts.json \
    --output_dir outputs/i2v_dpo \
    --lora_path checkpoints/VideoGPA-I2V-lora

# With VideoGPA-I2V-1K LoRA (lightweight, 1K steps)
python generate/CogVideoX-5B-I2V.py \
    --prompt_json prompts.json \
    --output_dir outputs/i2v_1k \
    --lora_path checkpoints/VideoGPA-I2V-1K-lora

CogVideoX1.5-5B Text-to-Video

# Baseline
python generate/CogVideoX1.5-5B.py \
    --prompt_json prompts.json \
    --output_dir outputs/t2v15_baseline

# With VideoGPA DPO LoRA
python generate/CogVideoX1.5-5B.py \
    --prompt_json prompts.json \
    --output_dir outputs/t2v15_dpo \
    --lora_path checkpoints/VideoGPA-T2V1.5-lora

Wan2.2-TI2V-5B Text-Image-to-Video

Unlike the CogVideoX scripts, generate/Wan2.2-TI2V-5B.py requires --model_path pointing to the base Wan2.2-TI2V-5B weights.

# Baseline
python generate/Wan2.2-TI2V-5B.py \
    --model_path /path/to/Wan2.2-TI2V-5B \
    --prompt_json prompts.json \
    --output_dir outputs/ti2v_baseline

# With VideoGPA DPO LoRA (LoRA strength defaults to --lora_weight 0.2)
python generate/Wan2.2-TI2V-5B.py \
    --model_path /path/to/Wan2.2-TI2V-5B \
    --prompt_json prompts.json \
    --output_dir outputs/ti2v_dpo \
    --lora_path checkpoints/VideoGPA-Wan2.2TI2V-lora

LoRA strength: both Wan2.2-TI2V-5B.py and CogVideoX1.5-5B.py apply the LoRA at --lora_weight 0.2 by default. Pass a different value to tune it.

Common Arguments

| Argument | Description | Default |
|---|---|---|
| --prompt_json | JSON file with prompts (required) | - |
| --output_dir | Output directory (required) | - |
| --lora_path | Path to LoRA adapter | None |
| --gpu_id | GPU device ID | 0 |
| --seed | Random seed | 42 |
| --num_prompts | Limit number of prompts | all |
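
For readers writing their own wrapper, the common arguments can be mirrored with argparse. This is a sketch based only on the arguments listed above, not the parser used by the generation scripts; build_parser is a hypothetical name, the argument types are assumptions, and None is assumed to stand in for the "all" default of --num_prompts.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the common arguments listed above."""
    parser = argparse.ArgumentParser(
        description="Sketch of the shared generation interface"
    )
    parser.add_argument("--prompt_json", required=True, help="JSON file with prompts")
    parser.add_argument("--output_dir", required=True, help="Output directory")
    parser.add_argument("--lora_path", default=None,
                        help="Path to LoRA adapter; omit for the baseline")
    parser.add_argument("--gpu_id", type=int, default=0, help="GPU device ID")
    parser.add_argument("--seed", type=int, default=42, help="Random seed")
    # Assumption: None means "no limit", standing in for the documented default "all".
    parser.add_argument("--num_prompts", type=int, default=None,
                        help="Limit number of prompts")
    return parser


args = build_parser().parse_args(
    ["--prompt_json", "prompts.json", "--output_dir", "outputs/demo"]
)
print(args.seed, args.gpu_id, args.lora_path)  # prints: 42 0 None
```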

Prompt JSON Format

{
  "scene_001": {"text_prompt": "Camera pans left", "image_prompt": "/path/to/frame.png"},
  "scene_002": {"text_prompt": "Zoom into the building", "image_prompt": "/path/to/frame2.png"}
}

For T2V, image_prompt can be omitted. See data_prep/generate_i2v_prompts.py to auto-generate prompts from a folder of first frames.
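
A prompt file in this format can be checked before a long generation run. The following is a minimal sketch, assuming only the two keys shown above (text_prompt and image_prompt); load_prompts and its parameters are hypothetical and are not part of the repository.

```python
import json


def load_prompts(path, require_image=False, limit=None):
    """Load a prompt JSON file as a list of (scene_id, text_prompt, image_prompt).

    Set require_image=True for image-conditioned generation, where every
    entry needs an image_prompt. For T2V the image_prompt may be absent.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)

    items = []
    for scene_id, entry in data.items():
        if "text_prompt" not in entry:
            raise ValueError(f"{scene_id}: missing text_prompt")
        image = entry.get("image_prompt")
        if require_image and image is None:
            raise ValueError(f"{scene_id}: missing image_prompt")
        items.append((scene_id, entry["text_prompt"], image))

    # Optionally keep only the first `limit` prompts.
    return items if limit is None else items[:limit]
```

Calling load_prompts("prompts.json", require_image=True) raises a ValueError naming the first scene that lacks an image, which is cheaper to discover than a failure partway through generation.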

📁 Code Structure

VideoGPA/
├── generate/                    # Video generation scripts
│   ├── CogVideoX-5B.py              # T2V
│   ├── CogVideoX-5B-I2V.py          # I2V
│   ├── CogVideoX1.5-5B.py           # T2V 1.5
│   └── Wan2.2-TI2V-5B.py            # Wan TI2V
├── train/                       # DPO training pipeline
│   ├── 01_preference_pair.py        # Video scoring
│   ├── dataset.py                   # DPO dataset (CogVideo + Wan)
│   ├── loss.py                      # DPO loss
│   ├── CogVideoX-5B/                # encode & train
│   ├── CogVideoX-I2V-5B/            # encode & train
│   ├── CogVideoX1.5-5B/             # encode & train
│   └── Wan2.2-TI2V-5B/              # encode & train
├── dl3dv_video_captions/        # Benchmark captions (1K / 8K / 9K / 10K / 11K)
├── data_prep/                   # Scripts to prepare prompt JSONs
├── checkpoints/                 # VideoGPA LoRA weights
├── metrics/                     # Evaluation metrics (MSE, SSIM, LPIPS, epipolar, …)
├── pipelines/                   # Shared video processing pipeline
├── utils/                       # Utility functions
├── replicate.py                 # Multi-GPU I2V generation for benchmarking
├── replicate_scorer.py          # Multi-GPU DA3 scoring
└── replicate.sh                 # End-to-end generation + scoring script

🔧 DPO Training

VideoGPA uses DPO (Direct Preference Optimization) to improve 3D consistency in video generation. The training pipeline has 3 steps:

Step 1: Score Generated Videos

python train/01_preference_pair.py
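
The general idea of turning per-video scores into preference pairs can be sketched as follows. This is an illustration of one common pairing strategy (best versus worst sample per prompt), not a description of what train/01_preference_pair.py does: the field names prompt_id, video, and score, the score direction, and the min_gap filter are all assumptions.

```python
from collections import defaultdict


def build_preference_pairs(scored, higher_is_better=True, min_gap=0.0):
    """Pair the best and worst scored video for each prompt.

    `scored` is a list of dicts with hypothetical keys
    "prompt_id", "video", and "score".
    """
    by_prompt = defaultdict(list)
    for item in scored:
        by_prompt[item["prompt_id"]].append(item)

    pairs = []
    for prompt_id, items in by_prompt.items():
        if len(items) < 2:
            continue  # a pair needs at least two samples of the same prompt
        ranked = sorted(items, key=lambda x: x["score"], reverse=higher_is_better)
        best, worst = ranked[0], ranked[-1]
        if abs(best["score"] - worst["score"]) <= min_gap:
            continue  # skip ties and near-ties, which carry little preference signal
        pairs.append(
            {"prompt_id": prompt_id, "chosen": best["video"], "rejected": worst["video"]}
        )
    return pairs
```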

Step 2: Encode Videos to Latent Space

# CogVideoX models
python train/CogVideoX-I2V-5B/02_encode.py
python train/CogVideoX-5B/02_encode.py
python train/CogVideoX1.5-5B/02_encode.py

# Wan2.2 (requires --base_path and --model_path)
python train/Wan2.2-TI2V-5B/02_encode.py \
    --base_path /path/to/dataset \
    --model_path /path/to/Wan2.2-TI2V-5B \
    --input_json /path/to/scored.json \
    --output_json /path/to/encoded.json

Step 3: Run DPO Training

# CogVideoX models
python train/CogVideoX-I2V-5B/03_train.py --base_path /path/to/dataset
python train/CogVideoX-5B/03_train.py --base_path /path/to/dataset
python train/CogVideoX1.5-5B/03_train.py --base_path /path/to/dataset

# Wan2.2
python train/Wan2.2-TI2V-5B/03_train.py \
    --base_path /path/to/dataset \
    --model_path /path/to/Wan2.2-TI2V-5B

Shared components (train/dataset.py, train/loss.py) work across all models: CogVideoX uses v-prediction and Wan uses flow matching, but the DPO loss operates on model-agnostic (prediction, target) pairs.
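
The model-agnostic form of the loss can be illustrated with a small pure-Python sketch of the Diffusion-DPO objective. This is not the repository's train/loss.py: the function names, the flat-list inputs, and the beta value are assumptions chosen for illustration, and a real implementation would operate on batched tensors.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_win, policy_lose, ref_win, ref_lose,
             target_win, target_lose, beta=1.0):
    """Diffusion-DPO style loss on (prediction, target) pairs.

    The target can be a v-prediction target or a flow-matching velocity;
    the loss only compares prediction errors of the trained (policy) model
    against those of the frozen reference model.
    """
    # How much worse (positive) or better (negative) the policy is than the reference.
    win_gap = mse(policy_win, target_win) - mse(ref_win, target_win)
    lose_gap = mse(policy_lose, target_lose) - mse(ref_lose, target_lose)
    # Reward a lower error on the preferred sample relative to the rejected one.
    return -log_sigmoid(-beta * (win_gap - lose_gap))
```

When the policy and reference predictions coincide, the loss equals log 2. It drops below that value as the policy fits the preferred sample better than the reference does, or fits the rejected sample worse.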

Data Format: Training requires JSON metadata with preference pairs. See dataset.py for the expected format.

📊 Benchmark Replication

replicate.sh runs generation and scoring end-to-end. It requires the DL3DV-10K first frames; text captions are provided in dl3dv_video_captions/captions_1K.json.

bash replicate.sh \
  --dl3dv_dir /path/to/DL3DV-10K \
  --lora_path checkpoints/VideoGPA-I2V-lora \
  --output_dir output/i2v_dpo \
  --devices 0,1,2,3,4,5,6,7

Scores are saved to <output_dir>/scores.csv. Run bash replicate.sh --help for all options.

Note: Scores may differ slightly from the paper due to non-deterministic CUDA operators in inference and hardware variation across machines.
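
To compare a baseline run against a LoRA run, the per-video scores can be averaged per column. The column layout of scores.csv is not documented here, so the sketch below makes no assumption about column names: mean_numeric_columns is a hypothetical helper that averages every column whose values all parse as numbers and skips the rest.

```python
import csv


def mean_numeric_columns(csv_path):
    """Return {column: mean} for every fully numeric column in a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    means = {}
    if not rows:
        return means
    for col in rows[0]:
        try:
            values = [float(row[col]) for row in rows]
        except (TypeError, ValueError):
            continue  # skip non-numeric columns such as IDs or file paths
        means[col] = sum(values) / len(values)
    return means


if __name__ == "__main__":
    import sys

    # Usage: python summarize_scores.py <output_dir>/scores.csv
    if len(sys.argv) > 1:
        for name, value in mean_numeric_columns(sys.argv[1]).items():
            print(f"{name}: {value:.4f}")
```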

🙏 Acknowledgements

We would like to express our gratitude to the following projects and researchers:

  • CogVideoX - Text/Image-to-video generation model.
  • Wan2.2 - State-of-the-art video generation model.
  • PEFT - Parameter-efficient fine-tuning with LoRA.
  • Diffusion DPO - Direct Preference Optimization in the diffusion latent space.

Thanks to Dawei Liu for the amazing website design!

🌟 Citation

@misc{du2026videogpadistillinggeometrypriors,
      title={VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation}, 
      author={Hongyang Du and Junjie Ye and Xiaoyan Cong and Runhao Li and Jingcheng Ni and Aman Agarwal and Zeqi Zhou and Zekun Li and Randall Balestriero and Yue Wang},
      year={2026},
      eprint={2601.23286},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.23286}, 
}

About

[ICML'26] VideoGPA is a self-supervised framework that enhances 3D consistency in Video Diffusion Models.
