VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation

Hongyang Du^*1,2 · Junjie Ye^*1 · Xiaoyan Cong^*2 · Runhao Li¹ · Jingcheng Ni² ·
Aman Agarwal² · Zeqi Zhou² · Zekun Li² · Randall Balestriero² · Yue Wang¹

¹Physical SuperIntelligence Lab, University of Southern California
²Department of Computer Science, Brown University
^* Equal Contribution

🔥 News

🏆 VideoGPA (Wan2.2-TI2V-5B base model) won third place 🥉 in the eBay-sponsored Image-to-Video Consistent Generation Challenge at the CVPR 2026 VGBE Workshop.

CVPR 2026 Image-to-Video Consistent Generation Challenge Third Place Award

We release the VideoGPA-Wan2.2-TI2V DPO LoRA checkpoint! Download via python download_ckpt.py ti2v and generate with generate/Wan2.2-TI2V-5B.py.
We release VideoGPA-I2V-1K: we find that just 1,000 steps already achieve surprisingly strong visual quality and benchmark scores. We're releasing it so everyone can play around with it! Download via python download_ckpt.py i2v-1k.
We release our DL3DV video captions generated with CogVLM. Check them out in dl3dv_video_captions.
We release the training code for Wan2.2-TI2V-5B! Check it out in train/Wan2.2-TI2V-5B.

Quick Start

📋 Requirements

Python 3.10 – 3.12.

pip install -r requirements.txt

🔘 Checkpoint Download

# Download all VideoGPA LoRA checkpoints
python download_ckpt.py all

# Or download specific ones
python download_ckpt.py i2v      # CogVideoX-I2V-5B
python download_ckpt.py i2v-1k   # CogVideoX-I2V-5B (1K steps, lightweight)
python download_ckpt.py t2v      # CogVideoX-5B
python download_ckpt.py t2v15    # CogVideoX1.5-5B
python download_ckpt.py ti2v     # Wan2.2-TI2V-5B

checkpoints/
├── VideoGPA-I2V-lora/
│   └── adapter_model.safetensors
├── VideoGPA-I2V-1K-lora/
│   └── adapter_model.safetensors
├── VideoGPA-T2V-lora/
│   └── adapter_model.safetensors
├── VideoGPA-T2V1.5-lora/
│   └── adapter_model.safetensors
└── VideoGPA-Wan2.2TI2V-lora/
    └── adapter_model.safetensors
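
After downloading, it can help to confirm that every adapter landed where the generation scripts expect it. The following is a minimal sketch, not part of the repository: the helper name `missing_adapters` is hypothetical, and the directory names are copied from the layout above.

```python
from pathlib import Path

# Directory names taken from the checkpoint layout above.
EXPECTED_ADAPTERS = [
    "VideoGPA-I2V-lora",
    "VideoGPA-I2V-1K-lora",
    "VideoGPA-T2V-lora",
    "VideoGPA-T2V1.5-lora",
    "VideoGPA-Wan2.2TI2V-lora",
]


def missing_adapters(root="checkpoints", names=EXPECTED_ADAPTERS):
    """Return the adapter directories under `root` that lack adapter_model.safetensors."""
    root = Path(root)
    return [n for n in names if not (root / n / "adapter_model.safetensors").is_file()]


if __name__ == "__main__":
    missing = missing_adapters()
    print("all adapters present" if not missing else f"missing: {missing}")
```

If you only downloaded a subset (for example `i2v-1k`), pass the matching subset as `names`.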

🎬 Video Generation

All scripts share the same interface: --prompt_json (required), --output_dir (required), --lora_path (optional; omit it to run the baseline model, pass it to apply the DPO LoRA), --gpu_id, and --seed.

CogVideoX-5B Text-to-Video

# Baseline (no LoRA)
python generate/CogVideoX-5B.py \
    --prompt_json prompts.json \
    --output_dir outputs/t2v_baseline

# With VideoGPA DPO LoRA
python generate/CogVideoX-5B.py \
    --prompt_json prompts.json \
    --output_dir outputs/t2v_dpo \
    --lora_path checkpoints/VideoGPA-T2V-lora

CogVideoX-5B Image-to-Video

# Baseline
python generate/CogVideoX-5B-I2V.py \
    --prompt_json prompts.json \
    --output_dir outputs/i2v_baseline

# With VideoGPA DPO LoRA
python generate/CogVideoX-5B-I2V.py \
    --prompt_json prompts.json \
    --output_dir outputs/i2v_dpo \
    --lora_path checkpoints/VideoGPA-I2V-lora

# With VideoGPA-I2V-1K LoRA (lightweight, 1K steps)
python generate/CogVideoX-5B-I2V.py \
    --prompt_json prompts.json \
    --output_dir outputs/i2v_1k \
    --lora_path checkpoints/VideoGPA-I2V-1K-lora

CogVideoX1.5-5B Text-to-Video

# Baseline
python generate/CogVideoX1.5-5B.py \
    --prompt_json prompts.json \
    --output_dir outputs/t2v15_baseline

# With VideoGPA DPO LoRA
python generate/CogVideoX1.5-5B.py \
    --prompt_json prompts.json \
    --output_dir outputs/t2v15_dpo \
    --lora_path checkpoints/VideoGPA-T2V1.5-lora

Wan2.2-TI2V-5B Text-Image-to-Video

Unlike the CogVideoX scripts, generate/Wan2.2-TI2V-5B.py requires --model_path pointing to the base Wan2.2-TI2V-5B weights.

# Baseline
python generate/Wan2.2-TI2V-5B.py \
    --model_path /path/to/Wan2.2-TI2V-5B \
    --prompt_json prompts.json \
    --output_dir outputs/ti2v_baseline

# With VideoGPA DPO LoRA (LoRA strength defaults to --lora_weight 0.2)
python generate/Wan2.2-TI2V-5B.py \
    --model_path /path/to/Wan2.2-TI2V-5B \
    --prompt_json prompts.json \
    --output_dir outputs/ti2v_dpo \
    --lora_path checkpoints/VideoGPA-Wan2.2TI2V-lora

LoRA strength: both Wan2.2-TI2V-5B.py and CogVideoX1.5-5B.py apply the LoRA at --lora_weight 0.2 by default. Pass a different value to tune it.
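
For intuition about what a strength of 0.2 means, here is a toy sketch under the common LoRA formulation, where a low-rank update B·A is added to a frozen weight W and scaled by the strength. How the generation scripts actually wire --lora_weight into the model is not shown in this README, so treat `apply_lora` and its scaling convention as assumptions made for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(weight, lora_b, lora_a, strength=0.2):
    """Return weight + strength * (lora_b @ lora_a).

    Assumption: the adapter's own alpha/rank scaling is already folded
    into lora_b and lora_a, so `strength` is the only extra factor.
    """
    delta = matmul(lora_b, lora_a)
    return [
        [w + strength * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 weight with a rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # shape (2, 1)
A = [[0.5, -0.5]]    # shape (1, 2)

print(apply_lora(W, B, A, strength=0.0))  # strength 0 leaves W unchanged
print(apply_lora(W, B, A, strength=0.2))  # a fifth of the full update
```

Under this reading, a lower strength keeps the output closer to the base model and a higher one applies more of the DPO update.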

Common Arguments

| Argument | Description | Default |
|---|---|---|
| `--prompt_json` | JSON file with prompts (required) | — |
| `--output_dir` | Output directory (required) | — |
| `--lora_path` | Path to LoRA adapter | `None` |
| `--gpu_id` | GPU device ID | `0` |
| `--seed` | Random seed | `42` |
| `--num_prompts` | Limit number of prompts | all |
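
If you are writing a wrapper or a new generation script and want it to accept the same flags, the table above can be mirrored with argparse. This is a sketch only: flag names and defaults come from the table, while the types, help strings, and the `build_parser` name are assumptions, and the repository's scripts may define their parsers differently.

```python
import argparse


def build_parser():
    """Parser mirroring the common arguments table (names and defaults from the table)."""
    p = argparse.ArgumentParser(description="Hypothetical stand-in for a generate/ script")
    p.add_argument("--prompt_json", required=True, help="JSON file with prompts")
    p.add_argument("--output_dir", required=True, help="Output directory")
    p.add_argument("--lora_path", default=None, help="Path to LoRA adapter; omit for baseline")
    p.add_argument("--gpu_id", type=int, default=0, help="GPU device ID")
    p.add_argument("--seed", type=int, default=42, help="Random seed")
    p.add_argument("--num_prompts", type=int, default=None,
                   help="Limit number of prompts (default: all)")
    return p


# Parse an explicit argument list so the example runs without a command line.
args = build_parser().parse_args(
    ["--prompt_json", "prompts.json", "--output_dir", "outputs/t2v_baseline"]
)
print(args.lora_path, args.gpu_id, args.seed, args.num_prompts)
```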

Prompt JSON Format

{
  "scene_001": {"text_prompt": "Camera pans left", "image_prompt": "/path/to/frame.png"},
  "scene_002": {"text_prompt": "Zoom into the building", "image_prompt": "/path/to/frame2.png"}
}

For T2V, image_prompt can be omitted. See data_prep/generate_i2v_prompts.py to auto-generate prompts from a folder of first frames.
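
If you prefer to assemble the file by hand, the format above is simple enough to build and sanity-check in a few lines. The sketch below is not the repository's data_prep/generate_i2v_prompts.py, whose interface is not described here; `build_prompts` and `check_prompts` are hypothetical helpers, and using the file stem as the scene key is an assumption.

```python
import json
from pathlib import Path


def build_prompts(frame_dir, text_prompt, pattern="*.png"):
    """Map each first-frame image in `frame_dir` to a prompt entry.

    Keys are the file stems; every entry reuses the same `text_prompt`.
    """
    prompts = {}
    for frame in sorted(Path(frame_dir).glob(pattern)):
        prompts[frame.stem] = {
            "text_prompt": text_prompt,
            "image_prompt": str(frame.resolve()),
        }
    return prompts


def check_prompts(prompts, need_image=True):
    """Raise ValueError if an entry lacks a field the documented format expects."""
    for key, entry in prompts.items():
        if "text_prompt" not in entry:
            raise ValueError(f"{key}: missing text_prompt")
        if need_image and "image_prompt" not in entry:
            raise ValueError(f"{key}: missing image_prompt")


# T2V entries may omit image_prompt, so relax the check for them.
example = {"scene_001": {"text_prompt": "Camera pans left"}}
check_prompts(example, need_image=False)
print(json.dumps(example, indent=2))
```

To write a file for the I2V scripts, call `build_prompts("/path/to/frames", "Camera pans left")`, run `check_prompts` on the result, and dump it with `json.dumps(..., indent=2)`.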

📁 Code Structure

VideoGPA/
├── generate/                    # Video generation scripts
│   ├── CogVideoX-5B.py              # T2V
│   ├── CogVideoX-5B-I2V.py          # I2V
│   ├── CogVideoX1.5-5B.py           # T2V 1.5
│   └── Wan2.2-TI2V-5B.py            # Wan TI2V
├── train/                       # DPO training pipeline
│   ├── 01_preference_pair.py        # Video scoring
│   ├── dataset.py                   # DPO dataset (CogVideo + Wan)
│   ├── loss.py                      # DPO loss
│   ├── CogVideoX-5B/                # encode & train
│   ├── CogVideoX-I2V-5B/            # encode & train
│   ├── CogVideoX1.5-5B/             # encode & train
│   └── Wan2.2-TI2V-5B/              # encode & train
├── dl3dv_video_captions/        # Benchmark captions (1K / 8K / 9K / 10K / 11K)
├── data_prep/                   # Scripts to prepare prompt JSONs
├── checkpoints/                 # VideoGPA LoRA weights
├── metrics/                     # Evaluation metrics (MSE, SSIM, LPIPS, epipolar, …)
├── pipelines/                   # Shared video processing pipeline
├── utils/                       # Utility functions
├── replicate.py                 # Multi-GPU I2V generation for benchmarking
├── replicate_scorer.py          # Multi-GPU DA3 scoring
└── replicate.sh                 # End-to-end generation + scoring script

🔧 DPO Training

VideoGPA uses DPO (Direct Preference Optimization) to improve 3D consistency in video generation. The training pipeline has 3 steps:

Step 1: Score Generated Videos

python train/01_preference_pair.py
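
To illustrate how per-video scores can become DPO preference pairs, here is a minimal sketch. It is not the logic of train/01_preference_pair.py: the input layout, the output field names, the best-versus-worst pairing rule, and the direction of the score (`higher_is_better`) are all assumptions chosen for illustration.

```python
def make_preference_pairs(scores, higher_is_better=True, min_gap=0.0):
    """Build one (chosen, rejected) pair per prompt from per-video scores.

    `scores` maps prompt_id -> {video_path: score}. Prompts with fewer than
    two videos, or whose best/worst gap is not larger than `min_gap`, are skipped.
    """
    pairs = []
    for prompt_id, videos in scores.items():
        if len(videos) < 2:
            continue
        ranked = sorted(videos.items(), key=lambda kv: kv[1], reverse=higher_is_better)
        (best, best_score), (worst, worst_score) = ranked[0], ranked[-1]
        if abs(best_score - worst_score) <= min_gap:
            continue  # a tie or a tiny gap carries no usable preference
        pairs.append({"prompt_id": prompt_id, "chosen": best, "rejected": worst})
    return pairs


scores = {
    "scene_001": {"a.mp4": 0.91, "b.mp4": 0.42, "c.mp4": 0.77},
    "scene_002": {"d.mp4": 0.50},  # only one sample, so no pair
}
print(make_preference_pairs(scores))
```

Raising `min_gap` drops pairs whose two videos score almost the same, which trades dataset size for a cleaner preference signal.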

Step 2: Encode Videos to Latent Space

# CogVideoX models
python train/CogVideoX-I2V-5B/02_encode.py
python train/CogVideoX-5B/02_encode.py
python train/CogVideoX1.5-5B/02_encode.py

# Wan2.2 (requires --base_path and --model_path)
python train/Wan2.2-TI2V-5B/02_encode.py \
    --base_path /path/to/dataset \
    --model_path /path/to/Wan2.2-TI2V-5B \
    --input_json /path/to/scored.json \
    --output_json /path/to/encoded.json

Step 3: Run DPO Training

# CogVideoX models
python train/CogVideoX-I2V-5B/03_train.py --base_path /path/to/dataset
python train/CogVideoX-5B/03_train.py --base_path /path/to/dataset
python train/CogVideoX1.5-5B/03_train.py --base_path /path/to/dataset

# Wan2.2
python train/Wan2.2-TI2V-5B/03_train.py \
    --base_path /path/to/dataset \
    --model_path /path/to/Wan2.2-TI2V-5B

Shared components (train/dataset.py, train/loss.py) work across all models. CogVideoX uses v-prediction and Wan uses flow matching, but the DPO loss only sees (prediction, target) pairs, so it does not depend on which parameterization produced them.
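
The idea of a loss that only needs (prediction, target) pairs can be sketched as follows. This is a plain-Python illustration in the style of Diffusion-DPO, not the contents of train/loss.py: the function names, the use of mean squared error, and the placeholder `beta=1.0` are assumptions. Building the target (a v-prediction target or a flow-matching velocity) is model-specific and left out.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def diffusion_dpo_loss(policy_w, policy_l, ref_w, ref_l, target_w, target_l, beta=1.0):
    """Diffusion-DPO style loss for one (chosen=w, rejected=l) pair.

    Each *_w / *_l argument is a model prediction for the noised chosen or
    rejected sample; target_* is whatever that model regresses onto.
    """
    # Error of the policy relative to the frozen reference
    # (negative means the policy fits that sample better).
    diff_w = mse(policy_w, target_w) - mse(ref_w, target_w)
    diff_l = mse(policy_l, target_l) - mse(ref_l, target_l)
    logit = -beta * (diff_w - diff_l)
    # -log(sigmoid(logit)), written as softplus(-logit) in a stable form.
    return max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))


target = [0.0, 1.0, -1.0]
ref = [0.1, 0.9, -0.8]

# Policy identical to the reference: the loss equals log(2).
print(diffusion_dpo_loss(ref, ref, ref, ref, target, target))

# Policy fits the chosen sample better and the rejected one worse: the loss drops.
better_w = [0.0, 1.0, -0.9]
worse_l = [0.3, 0.7, -0.5]
print(diffusion_dpo_loss(better_w, worse_l, ref, ref, target, target))
```

Because the loss compares errors rather than raw outputs, the same code serves both parameterizations as long as each model's prediction is paired with its own kind of target.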

Data Format: Training requires JSON metadata with preference pairs. See train/dataset.py for the expected format.

📊 Benchmark Replication

replicate.sh runs generation and scoring end-to-end. It requires the DL3DV-10K first frames; the text captions are provided in dl3dv_video_captions/captions_1K.json.

bash replicate.sh \
  --dl3dv_dir /path/to/DL3DV-10K \
  --lora_path checkpoints/VideoGPA-I2V-lora \
  --output_dir output/i2v_dpo \
  --devices 0,1,2,3,4,5,6,7

Scores are saved to <output_dir>/scores.csv. Run bash replicate.sh --help for all options.
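
To get a quick per-metric summary from scores.csv, something like the sketch below may be enough. The column layout of the file is not documented in this README, so the helper (a hypothetical `column_means`) makes no assumption about column names and simply averages every column that is fully numeric; the demo data is invented.

```python
import csv
import io


def column_means(csv_file):
    """Average every column whose values all parse as floats; skip the rest."""
    rows = list(csv.DictReader(csv_file))
    means = {}
    for name in (rows[0].keys() if rows else []):
        try:
            values = [float(r[name]) for r in rows]
        except (TypeError, ValueError):
            continue  # non-numeric column such as a video name
        means[name] = sum(values) / len(values)
    return means


# Hypothetical file contents; the real column names may differ.
demo = io.StringIO("video,metric_a,metric_b\nscene_001,0.5,2.0\nscene_002,1.5,4.0\n")
print(column_means(demo))
```

For a real run, open the file with `open("output/i2v_dpo/scores.csv", newline="")` and pass the handle to `column_means`.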

Note: Scores may differ slightly from the paper due to non-deterministic CUDA operators in inference and hardware variation across machines.
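
If you want to narrow run-to-run variation in your own experiments, a generic seeding helper along these lines is a common starting point. It is a sketch, not code from this repository (the scripts already accept --seed), and it cannot remove hardware-dependent differences. NumPy and PyTorch are treated as optional; `seed_everything` is a hypothetical name.

```python
import random


def seed_everything(seed=42, deterministic=False):
    """Seed the Python RNG and, when installed, NumPy and PyTorch."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
    except ImportError:
        return
    torch.manual_seed(seed)  # seeds the CPU and all CUDA devices
    if deterministic:
        # Prefer deterministic kernels; warn instead of raising when none exists.
        torch.backends.cudnn.benchmark = False
        torch.use_deterministic_algorithms(True, warn_only=True)


seed_everything(42)
first = random.random()
seed_everything(42)
print(first == random.random())  # same seed, same draw
```

Deterministic mode can slow inference, and some CUDA operators have no deterministic implementation at all, which is why small score differences can remain.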

🙏 Acknowledgements

We would like to express our gratitude to the following projects and researchers:

CogVideoX - Text/Image-to-video generation model.
Wan2.2 - State-of-the-art video generation model.
PEFT - Parameter-efficient fine-tuning with LoRA.
Diffusion DPO - Direct Preference Optimization in the diffusion latent space.

Thanks to Dawei Liu for the amazing website design!

🌟 Citation

@misc{du2026videogpadistillinggeometrypriors,
      title={VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation}, 
      author={Hongyang Du and Junjie Ye and Xiaoyan Cong and Runhao Li and Jingcheng Ni and Aman Agarwal and Zeqi Zhou and Zekun Li and Randall Balestriero and Yue Wang},
      year={2026},
      eprint={2601.23286},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.23286}, 
}
