Hongyang Du*1,2 · Junjie Ye*1 · Xiaoyan Cong*2 · Runhao Li1 · Jingcheng Ni2
Aman Agarwal2 · Zeqi Zhou2 · Zekun Li2 · Randall Balestriero2 · Yue Wang1
1Physical SuperIntelligence Lab, University of Southern California
2Department of Computer Science, Brown University
* Equal Contribution
- VideoGPA (Wan2.2-TI2V-5B base model) won a podium place in the eBay-sponsored Image-to-Video Consistent Generation Challenge at the CVPR 2026 VGBE Workshop.
- We release the VideoGPA-Wan2.2-TI2V DPO LoRA checkpoint! Download via `python download_ckpt.py ti2v` and generate with `generate/Wan2.2-TI2V-5B.py`.
- We release VideoGPA-I2V-1K: we find that only 1,000 steps already achieve surprisingly strong visual quality and benchmark scores. We're releasing it so everyone can play around with it! Download via `python download_ckpt.py i2v-1k`.
- We release our DL3DV video captions generated with CogVLM. Check them out in `dl3dv_video_captions`.
- We release the training code for Wan2.2-TI2V-5B! Check it out in `train/Wan2.2-TI2V-5B`.
Python 3.10–3.12.
pip install -r requirements.txt
# Download all VideoGPA LoRA checkpoints
python download_ckpt.py all
# Or download specific ones
python download_ckpt.py i2v # CogVideoX-I2V-5B
python download_ckpt.py i2v-1k # CogVideoX-I2V-5B (1K steps, lightweight)
python download_ckpt.py t2v # CogVideoX-5B
python download_ckpt.py t2v15 # CogVideoX1.5-5B
python download_ckpt.py ti2v # Wan2.2-TI2V-5B
checkpoints/
├── VideoGPA-I2V-lora/
│   └── adapter_model.safetensors
├── VideoGPA-I2V-1K-lora/
│   └── adapter_model.safetensors
├── VideoGPA-T2V-lora/
│   └── adapter_model.safetensors
├── VideoGPA-T2V1.5-lora/
│   └── adapter_model.safetensors
└── VideoGPA-Wan2.2TI2V-lora/
    └── adapter_model.safetensors
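
As a quick sanity check after downloading, the layout above can be verified with a few lines of Python. This is a minimal sketch, not part of the repository: the directory names are copied from the tree above, while `EXPECTED_ADAPTERS` and `missing_adapters` are hypothetical names introduced here for illustration.

```python
from pathlib import Path

# Directory names are copied from the checkpoint layout above.
# The constant and function names are hypothetical, not from the repository.
EXPECTED_ADAPTERS = [
    "VideoGPA-I2V-lora",
    "VideoGPA-I2V-1K-lora",
    "VideoGPA-T2V-lora",
    "VideoGPA-T2V1.5-lora",
    "VideoGPA-Wan2.2TI2V-lora",
]


def missing_adapters(root="checkpoints"):
    """Return the adapter directories that lack adapter_model.safetensors."""
    root = Path(root)
    return [
        name
        for name in EXPECTED_ADAPTERS
        if not (root / name / "adapter_model.safetensors").is_file()
    ]
```

Calling `missing_adapters()` from the repository root returns an empty list when every adapter is in place. If only some checkpoints were downloaded (for example with `python download_ckpt.py i2v`), the others are expected to show up in the returned list.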
All scripts share the same interface: --prompt_json (required), --output_dir (required), --lora_path (optional for DPO), --gpu_id, --seed.
# Baseline (no LoRA)
python generate/CogVideoX-5B.py \
--prompt_json prompts.json \
--output_dir outputs/t2v_baseline
# With VideoGPA DPO LoRA
python generate/CogVideoX-5B.py \
--prompt_json prompts.json \
--output_dir outputs/t2v_dpo \
--lora_path checkpoints/VideoGPA-T2V-lora
# Baseline
python generate/CogVideoX-5B-I2V.py \
--prompt_json prompts.json \
--output_dir outputs/i2v_baseline
# With VideoGPA DPO LoRA
python generate/CogVideoX-5B-I2V.py \
--prompt_json prompts.json \
--output_dir outputs/i2v_dpo \
--lora_path checkpoints/VideoGPA-I2V-lora
# With VideoGPA-I2V-1K LoRA (lightweight, 1K steps)
python generate/CogVideoX-5B-I2V.py \
--prompt_json prompts.json \
--output_dir outputs/i2v_1k \
--lora_path checkpoints/VideoGPA-I2V-1K-lora
# Baseline
python generate/CogVideoX1.5-5B.py \
--prompt_json prompts.json \
--output_dir outputs/t2v15_baseline
# With VideoGPA DPO LoRA
python generate/CogVideoX1.5-5B.py \
--prompt_json prompts.json \
--output_dir outputs/t2v15_dpo \
--lora_path checkpoints/VideoGPA-T2V1.5-lora

Unlike the CogVideoX scripts, `generate/Wan2.2-TI2V-5B.py` requires `--model_path` pointing to the base Wan2.2-TI2V-5B weights.
# Baseline
python generate/Wan2.2-TI2V-5B.py \
--model_path /path/to/Wan2.2-TI2V-5B \
--prompt_json prompts.json \
--output_dir outputs/ti2v_baseline
# With VideoGPA DPO LoRA (LoRA strength defaults to --lora_weight 0.2)
python generate/Wan2.2-TI2V-5B.py \
--model_path /path/to/Wan2.2-TI2V-5B \
--prompt_json prompts.json \
--output_dir outputs/ti2v_dpo \
--lora_path checkpoints/VideoGPA-Wan2.2TI2V-lora

LoRA strength: both `Wan2.2-TI2V-5B.py` and `CogVideoX1.5-5B.py` apply the LoRA at `--lora_weight 0.2` by default. Pass a different value to tune it.
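
To give some intuition for what a LoRA strength does, the sketch below scales a low-rank update before adding it to a base weight matrix. It assumes the common LoRA formulation `W' = W + s * (B @ A)`; whether the scripts apply `--lora_weight` in exactly this way is an assumption, and the function names and toy matrices are illustrative only.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(base, lora_b, lora_a, lora_weight=0.2):
    """Return base + lora_weight * (lora_b @ lora_a).

    Assumption: the strength multiplies the low-rank update directly.
    """
    delta = matmul(lora_b, lora_a)
    return [
        [w + lora_weight * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base, delta)
    ]


base = [[1.0, 0.0], [0.0, 1.0]]  # toy 2x2 base weight
lora_b = [[1.0], [2.0]]          # toy 2x1 "up" factor (rank 1)
lora_a = [[0.5, -0.5]]           # toy 1x2 "down" factor (rank 1)

unchanged = apply_lora(base, lora_b, lora_a, lora_weight=0.0)  # equals base
adapted = apply_lora(base, lora_b, lora_a)                     # strength 0.2
```

Under this formulation, a strength of 0 reproduces the baseline model, and larger values move the weights further along the direction learned during DPO training.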
| Argument | Description | Default |
|---|---|---|
| `--prompt_json` | JSON file with prompts (required) | - |
| `--output_dir` | Output directory (required) | - |
| `--lora_path` | Path to LoRA adapter | None |
| `--gpu_id` | GPU device ID | 0 |
| `--seed` | Random seed | 42 |
| `--num_prompts` | Limit number of prompts | all |
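
For readers wrapping these scripts in their own tooling, the shared interface in the table can be mirrored with `argparse`. This is a hypothetical reconstruction from the table alone, not the parser used by the generation scripts; in particular, representing the "all" default of `--num_prompts` as `None` is an assumption.

```python
import argparse


def build_parser():
    """Hypothetical mirror of the shared generation CLI described in the table."""
    parser = argparse.ArgumentParser(description="Sketch of the shared generation arguments")
    parser.add_argument("--prompt_json", required=True, help="JSON file with prompts")
    parser.add_argument("--output_dir", required=True, help="Output directory")
    parser.add_argument("--lora_path", default=None, help="Path to LoRA adapter")
    parser.add_argument("--gpu_id", type=int, default=0, help="GPU device ID")
    parser.add_argument("--seed", type=int, default=42, help="Random seed")
    # Assumption: None stands in for "use all prompts".
    parser.add_argument("--num_prompts", type=int, default=None, help="Limit number of prompts")
    return parser


args = build_parser().parse_args(
    ["--prompt_json", "prompts.json", "--output_dir", "outputs/t2v_baseline"]
)
print(args.lora_path, args.gpu_id, args.seed, args.num_prompts)  # prints: None 0 42 None
```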
{
"scene_001": {"text_prompt": "Camera pans left", "image_prompt": "/path/to/frame.png"},
"scene_002": {"text_prompt": "Zoom into the building", "image_prompt": "/path/to/frame2.png"}
}

For T2V, `image_prompt` can be omitted. See `data_prep/generate_i2v_prompts.py` to auto-generate prompts from a folder of first frames.
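
The prompt file is plain JSON, so it can also be assembled by hand. The sketch below builds one entry per image in a folder, using the `text_prompt` and `image_prompt` keys shown above. It is not the repository's `data_prep/generate_i2v_prompts.py`; the function names, the `scene_NNN` numbering, and the single shared text prompt are assumptions made for illustration.

```python
import json
from pathlib import Path


def build_prompts(frame_dir, text_prompt, pattern="*.png"):
    """Map each first frame in frame_dir to a prompt entry (hypothetical helper)."""
    prompts = {}
    for i, frame in enumerate(sorted(Path(frame_dir).glob(pattern)), start=1):
        prompts[f"scene_{i:03d}"] = {
            "text_prompt": text_prompt,
            "image_prompt": str(frame),
        }
    return prompts


def write_prompts(prompts, out_path="prompts.json"):
    """Write the prompt dictionary as indented JSON."""
    Path(out_path).write_text(json.dumps(prompts, indent=2))
```

For example, `write_prompts(build_prompts("first_frames", "Camera pans left"))` would write a `prompts.json` that can be passed to `--prompt_json`, where `first_frames` is a placeholder folder name.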
VideoGPA/
├── generate/                  # Video generation scripts
│   ├── CogVideoX-5B.py        # T2V
│   ├── CogVideoX-5B-I2V.py    # I2V
│   ├── CogVideoX1.5-5B.py     # T2V 1.5
│   └── Wan2.2-TI2V-5B.py      # Wan TI2V
├── train/                     # DPO training pipeline
│   ├── 01_preference_pair.py  # Video scoring
│   ├── dataset.py             # DPO dataset (CogVideo + Wan)
│   ├── loss.py                # DPO loss
│   ├── CogVideoX-5B/          # encode & train
│   ├── CogVideoX-I2V-5B/      # encode & train
│   ├── CogVideoX1.5-5B/       # encode & train
│   └── Wan2.2-TI2V-5B/        # encode & train
├── dl3dv_video_captions/      # Benchmark captions (1K / 8K / 9K / 10K / 11K)
├── data_prep/                 # Scripts to prepare prompt JSONs
├── checkpoints/               # VideoGPA LoRA weights
├── metrics/                   # Evaluation metrics (MSE, SSIM, LPIPS, epipolar, …)
├── pipelines/                 # Shared video processing pipeline
├── utils/                     # Utility functions
├── replicate.py               # Multi-GPU I2V generation for benchmarking
├── replicate_scorer.py        # Multi-GPU DA3 scoring
└── replicate.sh               # End-to-end generation + scoring script
VideoGPA uses DPO (Direct Preference Optimization) to improve 3D consistency in video generation. The training pipeline has 3 steps:
# Step 1: score videos to build preference pairs
python train/01_preference_pair.py
# Step 2: encode
# CogVideoX models
python train/CogVideoX-I2V-5B/02_encode.py
python train/CogVideoX-5B/02_encode.py
python train/CogVideoX1.5-5B/02_encode.py
# Wan2.2 (requires --base_path and --model_path)
python train/Wan2.2-TI2V-5B/02_encode.py \
--base_path /path/to/dataset \
--model_path /path/to/Wan2.2-TI2V-5B \
--input_json /path/to/scored.json \
--output_json /path/to/encoded.json
# Step 3: train
# CogVideoX models
python train/CogVideoX-I2V-5B/03_train.py --base_path /path/to/dataset
python train/CogVideoX-5B/03_train.py --base_path /path/to/dataset
python train/CogVideoX1.5-5B/03_train.py --base_path /path/to/dataset
# Wan2.2
python train/Wan2.2-TI2V-5B/03_train.py \
--base_path /path/to/dataset \
--model_path /path/to/Wan2.2-TI2V-5B

Shared components (`train/dataset.py`, `train/loss.py`) work across all models: CogVideoX uses v-prediction and Wan uses flow matching, but the DPO loss operates on model-agnostic (prediction, target) pairs.
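
To illustrate why the loss can stay model-agnostic, here is a minimal pure-Python sketch in the style of the Diffusion DPO objective. It only needs a prediction and a target for the preferred and rejected samples, from both the trainable policy and a frozen reference model. It is not the code in `train/loss.py`: the function names, the flat-list representation, and the default `beta` are assumptions. The two target helpers use the standard textbook forms of v-prediction and rectified-flow matching, which may differ in convention from the repository.

```python
import math


def mse(pred, target):
    """Mean squared error between two equally sized flat sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_win, ref_win, target_win,
             policy_lose, ref_lose, target_lose, beta=1.0):
    """Diffusion-DPO-style loss on (prediction, target) pairs.

    beta is a placeholder value, not the repository's setting.
    """
    # How much worse (positive) or better (negative) the policy is than the reference.
    win_gap = mse(policy_win, target_win) - mse(ref_win, target_win)
    lose_gap = mse(policy_lose, target_lose) - mse(ref_lose, target_lose)
    # The loss falls when the policy improves on the winner relative to the loser.
    return -log_sigmoid(-beta * (win_gap - lose_gap))


def v_prediction_target(x0, noise, alpha_t, sigma_t):
    """Standard v-prediction target: alpha_t * noise - sigma_t * x0."""
    return [alpha_t * n - sigma_t * x for x, n in zip(x0, noise)]


def flow_matching_target(x0, noise):
    """Standard rectified-flow velocity target: noise - x0."""
    return [n - x for x, n in zip(x0, noise)]
```

Only the target construction changes between model families; `dpo_loss` itself never needs to know which parameterization produced the pairs. When the policy equals the reference, the loss is `log(2)`, and it decreases as the policy fits the preferred sample better than the rejected one.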
Data Format: Training requires JSON metadata with preference pairs. See dataset.py for the expected format.
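
As a rough illustration of what a preference pair is, the sketch below picks the best- and worst-scored video for each prompt. It does not reproduce the format expected by `dataset.py` or the logic of `01_preference_pair.py`: the input layout, the `prompt_id`/`win`/`lose` field names, the `min_gap` parameter, and the assumption that a higher score is better are all hypothetical.

```python
def build_preference_pairs(scores, min_gap=0.0):
    """Build one (win, lose) pair per prompt from per-video scores.

    scores: {prompt_id: {video_path: score}}; higher is assumed to be better.
    Prompts with fewer than two videos, or with a score gap of at most
    min_gap, are skipped. All field names are hypothetical.
    """
    pairs = []
    for prompt_id, video_scores in scores.items():
        if len(video_scores) < 2:
            continue
        ranked = sorted(video_scores.items(), key=lambda item: item[1])
        (lose_path, lose_score), (win_path, win_score) = ranked[0], ranked[-1]
        if win_score - lose_score <= min_gap:
            continue
        pairs.append({"prompt_id": prompt_id, "win": win_path, "lose": lose_path})
    return pairs
```

Skipping near-ties is a design choice in this sketch: pairs whose scores barely differ carry little preference signal, so a positive `min_gap` filters them out.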
replicate.sh runs generation and scoring end-to-end. Requires DL3DV-10K first frames; text captions are provided in dl3dv_video_captions/captions_1K.json.
bash replicate.sh \
--dl3dv_dir /path/to/DL3DV-10K \
--lora_path checkpoints/VideoGPA-I2V-lora \
--output_dir output/i2v_dpo \
--devices 0,1,2,3,4,5,6,7

Scores are saved to `<output_dir>/scores.csv`. Run `bash replicate.sh --help` for all options.
Note: Scores may differ slightly from the paper due to non-deterministic CUDA operators in inference and hardware variation across machines.
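
When comparing runs, it can help to reduce each `scores.csv` to per-column averages. The sketch below makes no assumption about column names: it averages every column whose values all parse as numbers and skips the rest. The function name is hypothetical, and the actual columns written by `replicate.sh` are not specified here.

```python
import csv
from statistics import mean


def column_means(csv_path):
    """Average every column of a CSV file whose values all parse as floats."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    means = {}
    if not rows:
        return means
    for column in rows[0]:
        try:
            means[column] = mean(float(row[column]) for row in rows)
        except (TypeError, ValueError):
            continue  # skip non-numeric columns such as IDs or file paths
    return means
```

Running `column_means("output/i2v_dpo/scores.csv")` on a baseline and a LoRA run gives two dictionaries that can be compared key by key.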
We would like to express our gratitude to the following projects and researchers:
- CogVideoX - Text/Image-to-video generation model.
- Wan2.2 - State-of-the-art video generation model.
- PEFT - Parameter-efficient fine-tuning with LoRA.
- Diffusion DPO - Direct Preference Optimization in the diffusion latent space.
Thanks to Dawei Liu for the amazing website design!
@misc{du2026videogpadistillinggeometrypriors,
title={VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation},
author={Hongyang Du and Junjie Ye and Xiaoyan Cong and Runhao Li and Jingcheng Ni and Aman Agarwal and Zeqi Zhou and Zekun Li and Randall Balestriero and Yue Wang},
year={2026},
eprint={2601.23286},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.23286},
}
