A Physical AI system that automatically analyzes dashcam videos — detecting incidents, classifying severity, and explaining causes — through a multi-stage hierarchical reasoning pipeline. All inference is performed by a single model (Cosmos-Reason2-8B) with no fine-tuning.
Reference (original approach): https://arxiv.org/abs/2510.12190
- What this repo does: MP4 → frames → captions → incident frame detection → 3-stage reasoning → CSV
- What you need: Cosmos-Reason2-8B weights + Kaggle 2COOOL dataset
Analyze a single dashcam video via a browser interface. See 100_app/README.md for details.
- Install FFmpeg + uv
- Run vllm_cosmos_reason2/setup.sh
- Place model weights under /data/models/nvidia/Cosmos-Reason2-8B
- Download dataset
- Run stages in order: 001 → server → 002 → 003 → 004
- Compare submissions with 005_2coool-studio
| Category | Details |
|---|---|
| OS | Ubuntu 22.04.4 LTS |
| GPU (Recommended) | NVIDIA H100 80GB HBM3 × 8 |
| CUDA / Driver | CUDA 12.8 / NVIDIA driver 535.x |
| Key dependencies | vLLM (Cosmos-Reason2-8B), FFmpeg (ffmpeg, ffprobe) |
Note: vllm_cosmos_reason2 provisions its own virtual environment via uv.
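To see which of the command-line tools listed above are already available before installing anything, a small check along these lines may help. It is only a sketch: the tool names come from the table and note above, and `missing_tools` is a hypothetical helper, not part of this repo.

```python
import shutil

# Command-line tools named in the requirements above.
REQUIRED_TOOLS = ("ffmpeg", "ffprobe", "uv")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```

An empty result only means the executables are on PATH; it says nothing about their versions.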
# FFmpeg (Ubuntu)
sudo apt-get update && sudo apt-get install -y ffmpeg
# uv (Ubuntu)
curl -LsSf https://astral.sh/uv/install.sh | sh

cd vllm_cosmos_reason2
bash setup.sh

Default model path expected by scripts:
Cosmos-Reason2-8B: /data/models/nvidia/Cosmos-Reason2-8B
Place model weights at the above path, or update run_server*.sh to point to your local path (or create a symlink under /data/models).
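If the weights already live elsewhere, the symlink option could be scripted roughly as follows. This is a minimal sketch, assuming you have write permission for the target directory; `link_model_dir` is a hypothetical helper and the source path in the usage note is a placeholder.

```python
from pathlib import Path


def link_model_dir(local_weights, expected):
    """Create a symlink at `expected` pointing to `local_weights`.

    An existing file, directory, or symlink at `expected` is left untouched.
    """
    src = Path(local_weights).resolve()
    dst = Path(expected)
    if not src.is_dir():
        raise FileNotFoundError(f"No model directory at {src}")
    if dst.exists() or dst.is_symlink():
        return dst  # Something is already there; do not overwrite it.
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.symlink_to(src, target_is_directory=True)
    return dst
```

For example, `link_model_dir("/path/to/your/Cosmos-Reason2-8B", "/data/models/nvidia/Cosmos-Reason2-8B")` would expose a local copy at the default path without moving any files.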
Download from the official Kaggle page: https://www.kaggle.com/competitions/2coool/data
source ./vllm_cosmos_reason2/.venv/bin/activate

Convert videos into frames. If you have gaze heatmaps, also prepare vertically stacked MP4s and their frame PNGs.
cd 001_video2frames
# MP4 → PNG (dashcam videos and/or heatmaps depending on --input-dirs)
python ./src/mp4_to_png.py \
--input-dirs <Competition Data> \
--output-root gdrive_png
# vertically stack heatmaps + videos into a single MP4 per video_id
python ./src/vstack_mp4_pairs_ffprobe.py \
--left-dir /data/dataset/2coool/gdrive/heatmaps/ \
--right-dir /data/dataset/2coool/gdrive/videos/ \
--out-dir mp4_vstack
# MP4(vstack) → PNG
python ./src/mp4_to_png.py \
--input-dirs mp4_vstack \
--output-root mp4_vstack_png

Expected directory layout:

001_video2frames/
|-gdrive_png/
|  |-videos/<video_id>/000001.png ...
|  |-heatmaps/<video_id>/000001.png ...
|-mp4_vstack/<video_id>.mp4
|-mp4_vstack_png/<video_id>/000001.png ...

In a separate terminal:
cd ./vllm_cosmos_reason2
bash ./run_server_8gpu.sh # 8 GPUs (ports 8000-8007)

Captions are generated every 10 frames.
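As a rough illustration of that sampling interval, the sketch below picks every 10th PNG from one frame directory. It assumes zero-padded names such as 000001.png (so filename order equals frame order) and that sampling starts at the first frame; the repo's script may use a different offset. `sample_frames` is a hypothetical helper.

```python
from pathlib import Path


def sample_frames(frame_dir, step=10):
    """Return every `step`-th PNG in `frame_dir`, in filename order."""
    frames = sorted(Path(frame_dir).glob("*.png"))
    return frames[::step]
```

With 25 frames in a directory, this would select 000001.png, 000011.png, and 000021.png.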
cd 002_frame_captioning
bash ./run_cosmos_frame_captioning_parallel.sh # uses 8 GPUs

Analyze the generated captions and identify incident or hazard frames.
cd 003_frame_detection
bash ./run_cosmos_frame_detection_parallel.sh # uses 8 GPUs

Runs Count → Text → Reconcile over frames around the detected incident frame.
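One way to picture "frames around the detected incident frame" is a fixed window of frame numbers centered on the detection. The sketch below is an assumption for illustration only: the window sizes (`before`, `after`), the spacing (`step`), and the function name `frames_around` are hypothetical and may not match what the repo's scripts select.

```python
def frames_around(incident_frame, last_frame, before=30, after=30, step=10):
    """List frame numbers from `incident_frame - before` to
    `incident_frame + after`, spaced by `step` and clipped to [1, last_frame]."""
    start = incident_frame - before
    stop = incident_frame + after
    return [f for f in range(start, stop + 1, step) if 1 <= f <= last_frame]
```

The incident frame itself is included whenever `before` is a multiple of `step`; near the start or end of a video the window is simply shorter.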
cd 004_description
bash ./run_cosmos_description_from_csv_parallel.sh # uses 8 GPUs

Compare multiple submission.csv candidates with: 005_2coool-studio
Final CSV is created at: 004_description/results/run_cosmos_reason2_parallel/submit_filled.csv
Columns:
| Column | Type | Description |
|---|---|---|
| video | int | Video ID |
| Incident window start frame | int | Frame number where the incident begins |
| Incident Detection | int (-1, 0, 1) | -1 = no incident, 0 = hazard, 1 = accident |
| Crash Severity | string | Severity label |
| Ego-car involved | int (0, 1) | 0 = not involved, 1 = involved |
| Label | string | Incident type label |
| Number of Bicyclists/Scooters | int | Count of involved cyclists/scooters |
| Number of animals involved | int | Count of involved animals |
| Number of pedestrians involved | int | Count of involved pedestrians |
| Number of vehicles involved (excluding ego-car) | int | Count of other vehicles involved |
| Caption Before Incident | string | Scene description before the incident |
| Reason of Incident | string | Cause-and-effect explanation |
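A quick sanity check of the two enumerated columns could look like the sketch below. It assumes the CSV header uses exactly the column names listed above and that integers are written without a decimal point (for example "1", not "1.0"); `check_rows` and `check_csv` are hypothetical helpers, not part of the repo.

```python
import csv

# Allowed values, taken from the column descriptions above.
ALLOWED = {
    "Incident Detection": {"-1", "0", "1"},
    "Ego-car involved": {"0", "1"},
}


def check_rows(rows):
    """Return (row_index, column, value) for every value outside ALLOWED."""
    problems = []
    for i, row in enumerate(rows):
        for column, allowed in ALLOWED.items():
            value = (row.get(column) or "").strip()
            if value not in allowed:
                problems.append((i, column, value))
    return problems


def check_csv(path):
    """Run check_rows over a CSV file that has a header row."""
    with open(path, newline="", encoding="utf-8") as f:
        return check_rows(csv.DictReader(f))
```

An empty list from `check_csv("004_description/results/run_cosmos_reason2_parallel/submit_filled.csv")` would mean both columns contain only the documented values; missing columns show up as empty-string values.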
This pipeline can be applied to other dashcam video datasets for incident analysis. The input is MP4 dashcam video files with optional gaze heatmap videos.
Note: The current configuration expects gaze heatmap videos (vertically stacked with dashcam frames) in the Frame Captioning stage. To run without heatmaps, modify the prompt in 002_frame_captioning/configs/default.yaml to remove heatmap-specific instructions and skip the vstack_mp4_pairs_ffprobe.py step.
All VLM prompts live in YAML configs:
| Stage | Config file |
|---|---|
| Frame Captioning | 002_frame_captioning/configs/default.yaml |
| Frame Detection | 003_frame_detection/configs/default.yaml |
| Incident Description | 004_description/configs/default.yaml |
Typical workflow:
- Copy config: cp default.yaml my_domain.yaml
- Edit prompts (incident taxonomy, severity rules, counting rules)
- Run with --config configs/my_domain.yaml
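The copy step can also be scripted so that an existing custom config is never overwritten by accident. The following is a minimal sketch, assuming each stage directory has a configs/default.yaml as listed in the table above; `copy_config` is a hypothetical helper.

```python
import shutil
from pathlib import Path


def copy_config(stage_dir, new_name="my_domain.yaml"):
    """Copy <stage_dir>/configs/default.yaml to <stage_dir>/configs/<new_name>."""
    configs = Path(stage_dir) / "configs"
    target = configs / new_name
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    shutil.copyfile(configs / "default.yaml", target)
    return target
```

Calling it once per stage directory (002_frame_captioning, 003_frame_detection, 004_description) gives each stage its own editable copy.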
If you change the output schema (column names, constraints, required fields), update:
| File | What to change |
|---|---|
| 002_frame_captioning/src/cosmos_frame_captioning_vllm.py | preferred_keys in export_text_json_to_csv() |
| 004_description/src/cosmos_multi_image_infer_vllm_sc.py | JSON_KEYS, COUNT_KEYS, TEXT_KEYS, detect_contradictions() |
| 004_description/src/aggregate_ans_jsons_to_csv.py | HEADER, NUMERIC_KEYS |
| 004_description/src/postprocess_fill_nulls.py | HEADER, FALLBACK_FIELDS |
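Because the same column names are repeated across several files, a small consistency check can catch a rename that was applied in one place but not another. The sketch below is illustrative only: the `HEADER` and `NUMERIC_KEYS` values are reconstructed from the Columns table in this README and are assumptions, so the actual constants in the repo's source files may differ. `check_schema` is a hypothetical helper.

```python
# Assumed values, reconstructed from the Columns table; not copied from the repo.
HEADER = [
    "video",
    "Incident window start frame",
    "Incident Detection",
    "Crash Severity",
    "Ego-car involved",
    "Label",
    "Number of Bicyclists/Scooters",
    "Number of animals involved",
    "Number of pedestrians involved",
    "Number of vehicles involved (excluding ego-car)",
    "Caption Before Incident",
    "Reason of Incident",
]

NUMERIC_KEYS = [
    "video",
    "Incident window start frame",
    "Incident Detection",
    "Ego-car involved",
    "Number of Bicyclists/Scooters",
    "Number of animals involved",
    "Number of pedestrians involved",
    "Number of vehicles involved (excluding ego-car)",
]


def check_schema(header, numeric_keys):
    """Return a list of human-readable problems; empty if consistent."""
    problems = []
    duplicates = sorted({name for name in header if header.count(name) > 1})
    if duplicates:
        problems.append(f"duplicate columns: {duplicates}")
    unknown = [key for key in numeric_keys if key not in header]
    if unknown:
        problems.append(f"numeric keys missing from header: {unknown}")
    return problems
```

Running `check_schema` against the real constants after an edit would flag, for instance, a numeric key whose column was renamed only in the header.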