Hierarchical Reasoning Framework for Dashcam Incident Analysis with Cosmos-Reason2-8B

A Physical AI system that automatically analyzes dashcam videos — detecting incidents, classifying severity, and explaining causes — through a multi-stage hierarchical reasoning pipeline. All inference is performed by a single model (Cosmos-Reason2-8B) with no fine-tuning.

Features & Details: see project_features.md.

Reference (original approach): https://arxiv.org/abs/2510.12190

What this repo does: MP4 → frames → captions → incident frame detection → 3-stage reasoning → CSV
What you need: Cosmos-Reason2-8B weights + Kaggle 2COOOL dataset

Quick Start

Option A: Web App (single video analysis)

Analyze a single dashcam video via browser interface. See 100_app/README.md for details.

Option B: Batch Pipeline (dataset-scale processing)

1) Install FFmpeg + uv
2) Run vllm_cosmos_reason2/setup.sh
3) Place model weights under /data/models/nvidia/Cosmos-Reason2-8B
4) Download the dataset
5) Run the stages in order: 001 → start the vLLM server → 002 → 003 → 004
6) Compare submissions with 005_2coool-studio (optional)

Requirements

Category	Details
OS	Ubuntu 22.04.4 LTS
GPU (Recommended)	NVIDIA H100 80GB HBM3 × 8
CUDA / Driver	CUDA 12.8 / NVIDIA driver 535.x
Key dependencies	vLLM (Cosmos-Reason2-8B), FFmpeg (`ffmpeg`, `ffprobe`)

Note: vllm_cosmos_reason2 provisions its own virtual environment via uv.

Setup

1) Install prerequisites

# FFmpeg (Ubuntu)
sudo apt-get update && sudo apt-get install -y ffmpeg
# uv (Ubuntu)
curl -LsSf https://astral.sh/uv/install.sh | sh

2) Create the project environment

cd vllm_cosmos_reason2
bash setup.sh

3) Download and place model weights

Default model path expected by scripts:

Cosmos-Reason2-8B: /data/models/nvidia/Cosmos-Reason2-8B

Place model weights at the above path, or update run_server*.sh to point to your local path (or create a symlink under /data/models).
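If the weights already live elsewhere, the symlink option can be scripted. The following is a minimal sketch, not part of this repo: `link_model_weights` is a hypothetical helper, and it assumes you have write permission on the parent of the expected path.

```python
from pathlib import Path

# Default path expected by the run_server*.sh scripts (from this README).
DEFAULT_MODEL_DIR = Path("/data/models/nvidia/Cosmos-Reason2-8B")


def link_model_weights(local_dir, expected_dir=DEFAULT_MODEL_DIR):
    """Symlink expected_dir -> local_dir unless expected_dir already exists.

    Hypothetical helper for illustration; the repo itself does not ship it.
    """
    local_dir = Path(local_dir).resolve()
    if not local_dir.is_dir():
        raise FileNotFoundError(f"model directory not found: {local_dir}")

    expected_dir = Path(expected_dir)
    if expected_dir.exists():
        # Weights (or a working link) are already in place; leave them alone.
        return expected_dir
    if expected_dir.is_symlink():
        # A dangling link from an earlier attempt; replace it.
        expected_dir.unlink()

    expected_dir.parent.mkdir(parents=True, exist_ok=True)
    expected_dir.symlink_to(local_dir, target_is_directory=True)
    return expected_dir
```

For example, `link_model_weights("/mnt/weights/Cosmos-Reason2-8B")` would create the link at the default path (the source directory here is only a placeholder).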

4) Download the dataset (2COOOL)

Download from the official Kaggle page: https://www.kaggle.com/competitions/2coool/data

Running the Pipeline

0) Activate environment

source ./vllm_cosmos_reason2/.venv/bin/activate

1) Video → Frames

Convert videos into frames. If you have gaze heatmaps, also prepare vertically stacked MP4s and their frame PNGs.

cd 001_video2frames

# MP4 → PNG (dashcam videos and/or heatmaps depending on --input-dirs)
python ./src/mp4_to_png.py \
    --input-dirs <Competition Data> \
    --output-root gdrive_png

# vertically stack heatmaps + videos into a single MP4 per video_id
python ./src/vstack_mp4_pairs_ffprobe.py \
    --left-dir /data/dataset/2coool/gdrive/heatmaps/ \
    --right-dir /data/dataset/2coool/gdrive/videos/ \
    --out-dir mp4_vstack

# MP4(vstack) → PNG
python ./src/mp4_to_png.py \
    --input-dirs mp4_vstack \
    --output-root mp4_vstack_png

Expected directory layout:

001_video2frames/
|-gdrive_png/
  |-videos/<video_id>/000001.png ...
  |-heatmaps/<video_id>/000001.png ...
|-mp4_vstack/<video_id>.mp4
|-mp4_vstack_png/<video_id>/000001.png ...
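Before starting the server, it can help to confirm that every stacked MP4 actually produced frames. The sketch below checks the layout shown above; `frame_counts` and `missing_frames` are hypothetical helper names, not functions from this repo.

```python
from pathlib import Path


def frame_counts(frames_root):
    """Map each <video_id> directory under frames_root to its PNG frame count."""
    return {
        d.name: sum(1 for _ in d.glob("*.png"))
        for d in sorted(Path(frames_root).iterdir())
        if d.is_dir()
    }


def missing_frames(video_dir, frames_root):
    """Return video_ids that have an MP4 in video_dir but no PNG frames."""
    counts = frame_counts(frames_root)
    video_ids = sorted(p.stem for p in Path(video_dir).glob("*.mp4"))
    return [vid for vid in video_ids if counts.get(vid, 0) == 0]
```

Running `missing_frames("001_video2frames/mp4_vstack", "001_video2frames/mp4_vstack_png")` should return an empty list when extraction succeeded for every video.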

2) Start vLLM Server (8 GPUs)

In a separate terminal:

cd ./vllm_cosmos_reason2
bash ./run_server_8gpu.sh   # 8 GPUs (ports 8000-8007)
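Model loading can take a while, so it is worth waiting until all eight servers respond before launching the next stage. This is a minimal sketch that assumes the servers listen on localhost and expose vLLM's `/health` endpoint; `server_urls` and `is_ready` are hypothetical helpers.

```python
import http.client
import urllib.request


def server_urls(host="localhost", first_port=8000, count=8):
    """Base URLs for the servers on ports 8000-8007 (host is an assumption)."""
    return [f"http://{host}:{first_port + i}" for i in range(count)]


def is_ready(base_url, timeout=2.0):
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or HTTP error: treat as not ready yet.
        return False


if __name__ == "__main__":
    for url in server_urls():
        print(url, "ready" if is_ready(url) else "not ready")
```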

3) Stage 2: Frame Captioning

Captions are generated every 10 frames.

cd 002_frame_captioning
bash ./run_cosmos_frame_captioning_parallel.sh   # uses 8 GPUs
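To estimate how many captioning requests a video will generate, the 10-frame stride can be reproduced in a few lines. This sketch assumes sampling starts at the first frame and relies on the zero-padded file names sorting in frame order; the actual script may pick a different offset. `sample_frames` is a hypothetical name.

```python
from pathlib import Path


def sample_frames(frame_dir, step=10):
    """Return every `step`-th PNG in frame_dir, starting with the first frame."""
    frames = sorted(Path(frame_dir).glob("*.png"))  # zero-padded names sort numerically
    return frames[::step]
```

A directory holding frames 000001.png through 000025.png would yield 000001.png, 000011.png, and 000021.png.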

4) Stage 3: Incident/Hazard Frame Detection

Analyze the generated captions and identify incident or hazard frames.

cd 003_frame_detection
bash ./run_cosmos_frame_detection_parallel.sh   # uses 8 GPUs

5) Stage 4: Incident/Hazard Description (3-stage reasoning)

Runs three reasoning passes (Count → Text → Reconcile) over the frames around the detected incident frame.

cd 004_description
bash ./run_cosmos_description_from_csv_parallel.sh   # uses 8 GPUs
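One way to picture "frames around the detected incident frame" is a fixed window centered on it. The sketch below is purely illustrative: the window radius, stride, and the helper name `frames_around` are assumptions, and the values the repo actually uses come from its scripts and configs.

```python
def frames_around(center, radius=30, step=10, first=1, last=None):
    """Frame numbers within `radius` of `center`, spaced `step` apart.

    All defaults are illustrative assumptions, not the repo's settings.
    """
    start = max(first, center - radius)
    stop = center + radius if last is None else min(last, center + radius)

    # Walk outward from the center so the incident frame is always included.
    before = list(range(center, start - 1, -step))[::-1]
    after = list(range(center + step, stop + 1, step))
    return before + after
```

With the defaults, `frames_around(45)` gives `[15, 25, 35, 45, 55, 65, 75]`, and the window is clipped at the start of the video, so `frames_around(5)` gives `[5, 15, 25, 35]`.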

6) Stage 5: Blind A/B Test (optional)

Compare multiple submission.csv candidates with 005_2coool-studio.

Output

Final CSV is created at: 004_description/results/run_cosmos_reason2_parallel/submit_filled.csv

Columns:

Column	Type	Description
`video`	int	Video ID
`Incident window start frame`	int	Frame number where the incident begins
`Incident Detection`	int (-1, 0, 1)	-1 = no incident, 0 = hazard, 1 = accident
`Crash Severity`	string	Severity label
`Ego-car involved`	int (0, 1)	0 = not involved, 1 = involved
`Label`	string	Incident type label
`Number of Bicyclists/Scooters`	int	Count of involved cyclists/scooters
`Number of animals involved`	int	Count of involved animals
`Number of pedestrians involved`	int	Count of involved pedestrians
`Number of vehicles involved (excluding ego-car)`	int	Count of other vehicles involved
`Caption Before Incident`	string	Scene description before the incident
`Reason of Incident`	string	Cause-and-effect explanation
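A quick sanity check of the final CSV against the table above can catch schema problems before submission. This is a sketch under stated assumptions: it assumes integer columns are written without a decimal point, and `EXPECTED_COLUMNS`, `validate_rows`, and `validate_csv` are hypothetical names rather than code from this repo.

```python
import csv

# Column names copied from the table above.
EXPECTED_COLUMNS = [
    "video",
    "Incident window start frame",
    "Incident Detection",
    "Crash Severity",
    "Ego-car involved",
    "Label",
    "Number of Bicyclists/Scooters",
    "Number of animals involved",
    "Number of pedestrians involved",
    "Number of vehicles involved (excluding ego-car)",
    "Caption Before Incident",
    "Reason of Incident",
]
ALLOWED_VALUES = {
    "Incident Detection": {"-1", "0", "1"},
    "Ego-car involved": {"0", "1"},
}
COUNT_COLUMNS = [c for c in EXPECTED_COLUMNS if c.startswith("Number of")]


def validate_rows(rows):
    """Return a list of (row_number, message) for rows that break the schema."""
    problems = []
    for row_number, row in enumerate(rows, start=2):  # row 1 is the header
        for column, allowed in ALLOWED_VALUES.items():
            value = (row.get(column) or "").strip()
            if value not in allowed:
                problems.append((row_number, f"{column}: unexpected value {value!r}"))
        for column in COUNT_COLUMNS:
            value = (row.get(column) or "").strip()
            if not value.isdigit():
                problems.append((row_number, f"{column}: not a non-negative integer"))
    return problems


def validate_csv(path):
    """Check the header first, then every row, of the CSV at `path`."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
        if missing:
            return [(1, f"missing columns: {missing}")]
        return validate_rows(reader)
```

Calling `validate_csv("004_description/results/run_cosmos_reason2_parallel/submit_filled.csv")` returns an empty list when every row passes these checks.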

Adapting to Other Dashcam Datasets

This pipeline can be applied to other dashcam video datasets for incident analysis. The input is MP4 dashcam video files with optional gaze heatmap videos.

Note: The current configuration expects gaze heatmap videos (vertically stacked with dashcam frames) in the Frame Captioning stage. To run without heatmaps, modify the prompt in 002_frame_captioning/configs/default.yaml to remove heatmap-specific instructions and skip the vstack_mp4_pairs_ffprobe.py step.

What you usually change (prompts)

All VLM prompts live in YAML configs:

Stage	Config file
Frame Captioning	`002_frame_captioning/configs/default.yaml`
Frame Detection	`003_frame_detection/configs/default.yaml`
Incident Description	`004_description/configs/default.yaml`

Typical workflow:

1) Copy the config: cp default.yaml my_domain.yaml
2) Edit the prompts (incident taxonomy, severity rules, counting rules)
3) Run with --config configs/my_domain.yaml

When you need code changes (schema / keys)

If you change output schema (column names, constraints, required fields), update:

File	What to change
`002_frame_captioning/src/cosmos_frame_captioning_vllm.py`	`preferred_keys` in `export_text_json_to_csv()`
`004_description/src/cosmos_multi_image_infer_vllm_sc.py`	`JSON_KEYS`, `COUNT_KEYS`, `TEXT_KEYS`, `detect_contradictions()`
`004_description/src/aggregate_ans_jsons_to_csv.py`	`HEADER`, `NUMERIC_KEYS`
`004_description/src/postprocess_fill_nulls.py`	`HEADER`, `FALLBACK_FIELDS`
