# Proposal: Reasoning band (1 Hz scenario-description / VLM student–teacher, edge-case handling)

## Summary

This proposes a concrete, lightweight design for the **Reasoning band** (the yellow 1 Hz lane in the 24/06 architecture) and, more importantly, asks the WG to settle the **supervision**, which is the main open design choice. It builds on the System-2 causal head already merged in #81.

Flow from the 24/06 design sketch by @m-zain-khawaja:
```
Encoded Visual History (from the World Model, #85/#93)
   --> Scenario Description --> Predicted Scenario
   --> Video-Language-Model loss (student/teacher)        # front camera only, 1 Hz
```
**Objective** (from the 24/06 notes): *help the policy handle **edge cases**.* The band classifies the driving scenario w.r.t. the ODD and emits **(a)** a classification vector to the Trajectory Planner (to modulate the trajectory) and/or **(b)** scenario-description text/tokens, learned **without explicit labels besides the trajectory** (a student/teacher VLM setup). Edge cases to stress-test come from the **KIT long-tail** set.

This is a design sketch for discussion — corrections and advice very welcome.

## Design principles
- **Cheap & 1 Hz.** Small trainable heads on top of the already-computed Encoded Visual History; no extra backbone pass.
- **Reuse.** Lean on #81 (`SceneContext` + causal head, already merged) and #85's Encoded Visual History (896-dim).
- **Opt-in, decoupled.** Default off → the Reactive/World paths are unchanged; the band is additive with its own loss module (matches the "separate loss modules per branch" action item from 24/06).

## Current state — what exists / what's missing
**Exists:**
- **#81** (merged in `main`): typed `SceneContext` + System-2 causal head (structured classification output, not free-form text). **Merged but not wired** to the planner.
- **#85** (open, not yet in `main`): the Encoded Visual History (896-dim) this band consumes.

**Missing (this proposal):**
- No part of the Reasoning band is wired today. The 24/06 meeting redefined its supervision as a **scenario-description VLM (student/teacher)**, which is not yet specified or built.

## Proposed design (modular, opt-in)
- **C1 — Scenario encoder.** Consume the Encoded Visual History `[B, 896]` (1 Hz) → small MLP/attention head → scenario latent.
- **C2 — Classification vector (reuse #81).** Map the scenario latent to the typed `SceneContext` from #81 → the planner-facing signal.
- **C3 — Scenario-description head (optional).** A light decoder emitting scenario-description tokens, trained against a teacher (see open questions).
- **C4 — Planner coupling.** Feed C2's vector into the Trajectory Planner via a **zero-init adaptive gate** (FiLM-style; no-op at init so the reactive baseline is unchanged — see decisions).
- **C5 — `reasoning_loss`** module (separate per-branch loss): student/teacher distillation for C3 (+ optional classification supervision for C2).
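
To make the "no-op at init" property of C4 concrete, here is a minimal sketch of a zero-init FiLM-style gate in plain Python (in the repo this would presumably be a small `torch.nn.Module`). The class name `ZeroInitFiLMGate`, the dimensions, and the `h * (1 + gamma) + beta` parametrisation are illustrative assumptions, not the design agreed by the WG.

```python
from typing import List


class ZeroInitFiLMGate:
    """FiLM-style modulation of planner features by a context vector.

    out = feat * (1 + gamma(ctx)) + beta(ctx), where gamma and beta are
    linear projections of ctx. Both projections start at zero, so the gate
    returns feat unchanged until training moves the weights.
    (Hypothetical sketch; names and shapes are assumptions.)
    """

    def __init__(self, ctx_dim: int, feat_dim: int) -> None:
        self.ctx_dim = ctx_dim
        self.feat_dim = feat_dim
        # Zero initialisation is what makes the gate a no-op at the start.
        self.w_gamma = [[0.0] * ctx_dim for _ in range(feat_dim)]
        self.w_beta = [[0.0] * ctx_dim for _ in range(feat_dim)]

    @staticmethod
    def _project(weight: List[List[float]], ctx: List[float]) -> List[float]:
        # Plain matrix-vector product: one output per weight row.
        return [sum(w * c for w, c in zip(row, ctx)) for row in weight]

    def __call__(self, feat: List[float], ctx: List[float]) -> List[float]:
        if len(feat) != self.feat_dim or len(ctx) != self.ctx_dim:
            raise ValueError("unexpected feature or context width")
        gamma = self._project(self.w_gamma, ctx)
        beta = self._project(self.w_beta, ctx)
        return [h * (1.0 + g) + b for h, g, b in zip(feat, gamma, beta)]


# At initialisation the planner features pass through untouched.
gate = ZeroInitFiLMGate(ctx_dim=3, feat_dim=4)
planner_feat = [0.5, -1.0, 2.0, 0.0]
scene_vec = [0.2, 0.7, 0.1]
assert gate(planner_feat, scene_vec) == planner_feat
```

Because the gate is an exact identity at init, enabling the flag should leave the reactive baseline's outputs unchanged until `reasoning_loss` gradients start updating the projections.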

### Reuse map
| Reasoning piece | Reuse from |
|---|---|
| Classification output (typed) | `SceneContext` + causal head (#81) |
| Input (Encoded Visual History) | World Model (#85 / #93) |
| Loss-module pattern | `losses/` (per-branch modules) |

## Key design decisions — proposed defaults (to confirm with the WG)
Each decision comes with a default grounded in recent literature so we can converge fast; any of them can change.

| Decision | Options | **Proposed default** | Rationale |
|---|---|---|---|
| **Teacher signal** | frozen open VLM captioner (Qwen2-VL / InternVL) · Alpamayo CoC autolabeler · other | **open-weights VLM, OFFLINE auto-labeller, train-only (removed at inference)**; CoC autolabeler as v2 | deployability (DriveVLM-RL: VLM only at train); open weights; no human labels |
| **Student output v1** | typed classification vector (reuse #81) · free-form text · both | **typed classification vector** (#81); free-form text = v2, never raw into the planner | verifiability/safety (VLA survey); cost; consistent with #81 |
| **Loss** | CE / KL distillation · contrastive / feature-matching | **KL distillation on the typed vector + auxiliary CLIP-style image–text alignment**; separate weighted `reasoning_loss`, starting at 1:1 | label-free alignment (CLG / VLM-RL); "separate loss modules" (24/06); same policy as JEPA (#13) |
| **Planner coupling** | concat · adaptive gate (think-vs-act) | **adaptive gate, zero-init (FiLM-style)** → no-op by default, modulates only on edge cases | Counterfactual VLA (think-vs-act); FiLM; repo's `ResidualMapFusion` alpha=0 pattern (won't destabilise) |
| **Scope** | front cam only, 1 Hz, KIT long-tail | **confirmed for v1**; multi-cam / longer horizon = v2 | LINGO front-cam; 1 Hz aligns with the World Model; KIT = the edge cases |
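
As a rough illustration of the proposed **Loss** default, the sketch below combines a KL distillation term with a symmetric CLIP-style (InfoNCE) alignment term at 1:1 weighting. It is plain Python for readability and makes several assumptions: the typed vector is treated as a single categorical distribution, the teacher provides soft probabilities, and the function names, the temperature of 0.07, and the batch layout are all hypothetical.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    # Numerically stable log-softmax via the max-shift trick.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_distillation(teacher_probs: List[float], student_logits: List[float]) -> float:
    """KL(teacher || student) for one sample; zero-probability classes contribute 0."""
    log_q = log_softmax(student_logits)
    return sum(p * (math.log(p) - lq) for p, lq in zip(teacher_probs, log_q) if p > 0.0)


def _normalize(v: List[float]) -> List[float]:
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def clip_alignment(img: List[List[float]], txt: List[List[float]],
                   temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch; img[i] is the positive pair of txt[i]."""
    img = [_normalize(v) for v in img]
    txt = [_normalize(v) for v in txt]
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt] for i in img]
    n = len(sim)
    # Image-to-text: softmax over each row; text-to-image: over each column.
    i2t = -sum(log_softmax(sim[k])[k] for k in range(n)) / n
    t2i = -sum(log_softmax([sim[r][k] for r in range(n)])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


def reasoning_loss(teacher_probs: List[List[float]], student_logits: List[List[float]],
                   img: List[List[float]], txt: List[List[float]],
                   w_kl: float = 1.0, w_align: float = 1.0) -> float:
    """Weighted sum of batch-mean KL distillation and CLIP-style alignment."""
    kl = sum(kl_distillation(p, s) for p, s in zip(teacher_probs, student_logits))
    kl /= len(teacher_probs)
    return w_kl * kl + w_align * clip_alignment(img, txt)


# Toy batch of two samples with two scenario classes.
teacher = [[0.9, 0.1], [0.2, 0.8]]
student = [[2.0, 0.0], [0.0, 1.0]]
emb = [[1.0, 0.0], [0.0, 1.0]]
print(reasoning_loss(teacher, student, emb, emb))
```

Keeping `w_kl` and `w_align` as explicit arguments is one way to expose the balance that question 3 below asks about.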

## Implementation plan (phased, additive)
1. **Module skeleton + synthetic test** — `ReasoningHead` with the I/O contract, tested on random tensors; default off / zeros fallback so the rest is unchanged.
2. **C1–C2** — scenario encoder + classification vector (reuse #81) → planner-facing output.
3. **C4** — wire the classification vector into the planner behind a flag.
4. **C3 + C5** — scenario-description head + student/teacher `reasoning_loss`, **once the teacher/supervision is fixed** (open questions below).
5. Edge-case eval on the KIT long-tail set.
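
Step 1 (skeleton, I/O contract, zeros fallback, synthetic test) could look roughly like the sketch below. It is written in plain Python so it runs anywhere; the real module would presumably be a PyTorch head. `ReasoningConfig`, `num_classes=8`, and the single linear layer standing in for C1–C2 are hypothetical placeholders; only the 896-wide input comes from the proposal.

```python
import random
from dataclasses import dataclass
from typing import List

HISTORY_DIM = 896  # width of the Encoded Visual History, per the proposal


@dataclass
class ReasoningConfig:
    enabled: bool = False  # opt-in: default off
    num_classes: int = 8   # hypothetical size of the classification vector


class ReasoningHead:
    """Skeleton head: [B, 896] history in, [B, num_classes] vector out."""

    def __init__(self, cfg: ReasoningConfig, seed: int = 0) -> None:
        self.cfg = cfg
        rng = random.Random(seed)
        # A single linear layer as a stand-in for the C1-C2 stack.
        self.weight = [[rng.gauss(0.0, 0.02) for _ in range(HISTORY_DIM)]
                       for _ in range(cfg.num_classes)]
        self.bias = [0.0] * cfg.num_classes

    def __call__(self, history: List[List[float]]) -> List[List[float]]:
        for row in history:
            if len(row) != HISTORY_DIM:
                raise ValueError(f"expected {HISTORY_DIM} features, got {len(row)}")
        if not self.cfg.enabled:
            # Zeros fallback: downstream consumers see a constant, inert signal.
            return [[0.0] * self.cfg.num_classes for _ in history]
        return [[sum(w * x for w, x in zip(w_row, row)) + b
                 for w_row, b in zip(self.weight, self.bias)]
                for row in history]


# Synthetic test on random inputs, batch size 2.
rng = random.Random(1)
batch = [[rng.uniform(-1.0, 1.0) for _ in range(HISTORY_DIM)] for _ in range(2)]
off = ReasoningHead(ReasoningConfig())(batch)
on = ReasoningHead(ReasoningConfig(enabled=True))(batch)
assert off == [[0.0] * 8, [0.0] * 8]
assert len(on) == 2 and all(len(r) == 8 for r in on)
```

With the default config the head returns zeros of the right shape, so later steps can wire it in without changing the behaviour of the Reactive/World paths.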

This would land as a separate PR after the World Model (#85) merges, built the same way the World Model was.

## Open questions (supervision — need WG / @m-zain-khawaja input)
Defaults proposed above; the points that really need your call:
1. **Teacher signal** — OK with an open-weights VLM (Qwen2-VL / InternVL) as an offline, train-only auto-labeller, or do you prefer the Alpamayo CoC autolabeler from the start?
2. **Student output v1** — agree to start with the typed classification vector (#81) and defer free-form text to v2?
3. **Loss** — KL distillation + auxiliary CLIP-style alignment, weighted 1:1 — any preference on the balance?
4. **Planner coupling** — OK with a zero-init adaptive gate (FiLM-style), or do you want plain concat first?
5. **Scope** — confirm front cam only / 1 Hz / KIT long-tail for v1.

If you confirm the teacher signal + the v1 student output, I'll start with C1–C2 + the (zero-init) planner gate, reusing #81.

## References
- 24/06 architecture diagram (Reasoning Model @1Hz band); #81 (System-2 causal head, this builds on it); #85/#93 (Encoded Visual History, the input).
- Wayve LINGO line (scenario description → planner, flagged on 24/06), with the actual papers:
  - **Driving with LLMs** (the LINGO-1 basis): https://arxiv.org/abs/2310.01957
  - **SimLingo** — vision-only closed-loop driving with **language–action alignment**: https://arxiv.org/abs/2503.09594 *(closest published analog to this band)*
  - **CarLLaVA** — camera-only closed-loop VLM driving: https://arxiv.org/abs/2406.10165
  - **LINGO-2** (press): https://wayve.ai/press/introducing-lingo-2/
- **LingoQA** (benchmark cited on 24/06): https://arxiv.org/abs/2312.14115
- **DriveVLM**: https://arxiv.org/abs/2402.12289 · **Senna**: https://arxiv.org/abs/2410.22313 — VLM-for-driving / scene description.
- **A Survey on Vision-Language-Action Models for Autonomous Driving**: https://arxiv.org/abs/2506.24044 — taxonomy (explainer → reasoning) + eval protocols for this kind of band. Curated repo: https://github.com/JohnsonJiang1996/Awesome-VLA4AD
- **DriveVLM-RL**: https://arxiv.org/abs/2603.18315 — precedent for keeping the **VLM only at training time and removing it at inference** (asynchronous), which is the deployability argument for the student/teacher setup proposed here (the teacher VLM is train-only; the student stays in the network).
- **Counterfactual VLA** (CVPR'26): adaptive *think-vs-act* gating — relevant to question 4.

