Proposal: Reasoning band (1 Hz scenario-description / VLM student–teacher, edge-case handling) #98

@gcordova10

Summary

This proposes a concrete, lightweight design for the Reasoning band (the yellow @1Hz lane in the 24/06 architecture), and — more importantly — asks the WG to settle the supervision, which is the open design choice. It builds on the System-2 causal head already merged in #81.

Flow from the 24/06 design sketch by @m-zain-khawaja:

Encoded Visual History (from the World Model, #85/#93)
   --> Scenario Description --> Predicted Scenario
   --> Video-Language-Model loss (student/teacher)        # front camera only, 1 Hz

Objective (from the 24/06 notes): help the policy handle edge cases. The band classifies the driving scenario w.r.t. the ODD and emits (a) a classification vector to the Trajectory Planner (to modulate the trajectory) and/or (b) scenario-description text/tokens, learned without explicit labels besides the trajectory (a student/teacher VLM setup). Edge cases to stress-test come from the KIT long-tail set.

This is a design sketch for discussion — corrections and advice very welcome.

Design principles

Current state — what exists / what's missing

Exists in main:

Missing (this proposal):

  • No part of the Reasoning band is wired today. The 24/06 meeting redefined its supervision as a scenario-description VLM (student/teacher) setup, which is not yet specified or built.

Proposed design (modular, opt-in)

  • C1: Scenario encoder. Consume the Encoded Visual History [B, 896] (1 Hz) → small MLP/attention head → scenario latent.
  • C2: Classification vector (reuse #81). Map the scenario latent to the typed SceneContext from #81, which becomes the planner-facing signal.
  • C3: Scenario-description head (optional). A light decoder emitting scenario-description tokens, trained against a teacher (see open questions).
  • C4: Planner coupling. Feed C2's vector into the Trajectory Planner via a zero-init adaptive gate (FiLM-style). The gate is a no-op at init, so the reactive baseline is unchanged (see decisions).
  • C5: reasoning_loss module (separate per-branch loss). Student/teacher distillation for C3, plus optional classification supervision for C2.
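To make the "no-op at init" property of C4 concrete, here is a minimal sketch of a zero-init FiLM-style gate. It is written in plain Python so it runs without dependencies; a real implementation would presumably use the repo's tensor library. The class name `ZeroInitFiLMGate`, the helper `linear`, and the dimensions are hypothetical and not taken from the codebase.

```python
import random


def linear(x, weight, bias):
    """Compute y = W x + b for one vector x; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


class ZeroInitFiLMGate:
    """Hypothetical FiLM-style gate: out = feat * (1 + gamma) + beta.

    gamma and beta are linear functions of the scene vector. All of their
    weights and biases start at zero, so the gate is an exact identity at
    initialisation and the planner features pass through unchanged.
    """

    def __init__(self, scene_dim, feat_dim):
        self.w_gamma = [[0.0] * scene_dim for _ in range(feat_dim)]
        self.b_gamma = [0.0] * feat_dim
        self.w_beta = [[0.0] * scene_dim for _ in range(feat_dim)]
        self.b_beta = [0.0] * feat_dim

    def __call__(self, feat, scene_vec):
        gamma = linear(scene_vec, self.w_gamma, self.b_gamma)
        beta = linear(scene_vec, self.w_beta, self.b_beta)
        return [f * (1.0 + g) + b for f, g, b in zip(feat, gamma, beta)]


random.seed(0)
scene_dim, feat_dim = 4, 6  # illustrative sizes, not the repo's
gate = ZeroInitFiLMGate(scene_dim, feat_dim)
planner_feat = [random.gauss(0.0, 1.0) for _ in range(feat_dim)]
scene_vec = [random.gauss(0.0, 1.0) for _ in range(scene_dim)]

# At init the gate returns the planner features untouched.
out_at_init = gate(planner_feat, scene_vec)

# Simulate a training update that moves one weight off zero:
# only the first feature channel is now modulated.
gate.w_gamma[0][0] = 0.5
out_after = gate(planner_feat, scene_vec)
```

In this form the zero-initialised weights still receive gradients, because the output depends on gamma through the (non-zero) planner features and on the weights through the scene vector, so the gate can learn to modulate when the data calls for it.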

Reuse map

| Reasoning piece | Reuse from |
| --- | --- |
| Classification output (typed) | SceneContext + causal head (#81) |
| Input (Encoded Visual History) | World Model (#85 / #93) |
| Loss-module pattern | losses/ (per-branch modules) |

Key design decisions — proposed defaults (to confirm with the WG)

Each comes with a SOTA-grounded default so we can converge fast; happy to change any.

| Decision | Options | Proposed default | Rationale |
| --- | --- | --- | --- |
| Teacher signal | frozen open VLM captioner (Qwen2-VL / InternVL) · Alpamayo CoC autolabeler · other | open-weights VLM, OFFLINE auto-labeller, train-only (removed at inference); CoC autolabeler as v2 | deployability (DriveVLM-RL: VLM only at train); open weights; no human labels |
| Student output v1 | typed classification vector (reuse #81) · free-form text · both | typed classification vector (#81); free-form text = v2, never raw into the planner | verifiability/safety (VLA survey); cost; consistent with #81 |
| Loss | CE distillation · contrastive / feature-matching | KL distillation on the typed vector + auxiliary CLIP-style image–text alignment; separate weighted reasoning_loss, 1:1 start | label-free alignment (CLG / VLM-RL); "separate loss modules" (24/06); same policy as JEPA (#13) |
| Planner coupling | concat · adaptive gate (think-vs-act) | adaptive gate, zero-init (FiLM-style) → no-op by default, modulates only on edge cases | Counterfactual VLA (think-vs-act); FiLM; repo's ResidualMapFusion alpha=0 pattern (won't destabilise) |
| Scope | front cam only, 1 Hz, KIT long-tail | confirmed for v1; multi-cam / longer horizon = v2 | LINGO front-cam; 1 Hz aligns with the World Model; KIT = the edge cases |
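The proposed loss default (KL distillation on the typed vector plus an auxiliary CLIP-style alignment term, weighted 1:1) could look roughly like the sketch below. It is plain Python for portability and is not the repo's loss module: the function names, the `w_kl` / `w_align` parameters, the temperature value, and the toy inputs are all assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def kl_distillation(student_logits, teacher_probs):
    """KL(teacher || student) for one sample of the classification vector."""
    log_q = log_softmax(student_logits)
    return sum(p * (math.log(p) - lq)
               for p, lq in zip(teacher_probs, log_q) if p > 0.0)


def clip_alignment(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch; matching pairs share an index."""
    def normalise(v):
        n = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / n for x in v]

    img = [normalise(v) for v in image_emb]
    txt = [normalise(v) for v in text_emb]
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
           for i in img]
    n = len(sim)
    # Image-to-text: each row should pick its own column, and vice versa.
    i2t = -sum(log_softmax(sim[k])[k] for k in range(n)) / n
    t2i = -sum(log_softmax([sim[r][k] for r in range(n)])[k]
               for k in range(n)) / n
    return 0.5 * (i2t + t2i)


def reasoning_loss(student_logits, teacher_probs, image_emb, text_emb,
                   w_kl=1.0, w_align=1.0):
    """Weighted sum of the two terms; 1:1 is the proposed starting balance."""
    kl = sum(kl_distillation(s, t)
             for s, t in zip(student_logits, teacher_probs)) / len(student_logits)
    align = clip_alignment(image_emb, text_emb)
    return w_kl * kl + w_align * align, {"kl": kl, "align": align}


# Toy batch of two samples with three hypothetical scenario classes.
teacher_probs = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
student_logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 1.5]]
image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb = [[0.9, 0.1], [0.2, 0.8]]
total, parts = reasoning_loss(student_logits, teacher_probs, image_emb, text_emb)
```

Returning the per-term values alongside the total makes it easy to log each branch separately, which may help when tuning the balance away from 1:1.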

Implementation plan (phased, additive)

  1. Module skeleton + synthetic test: ReasoningHead with the I/O contract, tested on random tensors; default off with a zeros fallback so the rest of the stack is unchanged.
  2. C1–C2: scenario encoder + classification vector (reuse #81) → planner-facing output.
  3. C4: wire the classification vector into the planner behind a flag.
  4. C3 + C5: scenario-description head + student/teacher reasoning_loss, once the teacher/supervision is fixed (open questions below).
  5. Edge-case eval on the KIT long-tail set.
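As a rough illustration of step 1, the skeleton and its synthetic test might be shaped like this. The sketch is plain Python and only demonstrates the I/O contract and the default-off zeros fallback. `ReasoningHeadStub` and `NUM_SCENARIO_CLASSES` are hypothetical: the real output size would come from the typed SceneContext in #81, and the random projection used when the head is enabled is a stand-in for C1–C2, not a proposed model.

```python
import random
from dataclasses import dataclass

VISUAL_HISTORY_DIM = 896  # from the proposal: Encoded Visual History [B, 896]
NUM_SCENARIO_CLASSES = 8  # hypothetical; the real size comes from SceneContext (#81)


@dataclass
class ReasoningHeadStub:
    """Skeleton with the I/O contract: [B, 896] in, [B, num_classes] out."""

    enabled: bool = False  # default off, so the reactive baseline is untouched
    num_classes: int = NUM_SCENARIO_CLASSES

    def __post_init__(self):
        # Placeholder weights standing in for the C1-C2 encoder and head.
        rng = random.Random(0)
        self._w = [[rng.gauss(0.0, 0.02) for _ in range(VISUAL_HISTORY_DIM)]
                   for _ in range(self.num_classes)]

    def forward(self, visual_history):
        for row in visual_history:
            if len(row) != VISUAL_HISTORY_DIM:
                raise ValueError(
                    f"expected feature dim {VISUAL_HISTORY_DIM}, got {len(row)}")
        if not self.enabled:
            # Zeros fallback: downstream consumers see a neutral signal.
            return [[0.0] * self.num_classes for _ in visual_history]
        return [[sum(w * x for w, x in zip(w_row, row)) for w_row in self._w]
                for row in visual_history]


# Synthetic test on random inputs (batch of 3).
rng = random.Random(1)
batch = [[rng.gauss(0.0, 1.0) for _ in range(VISUAL_HISTORY_DIM)] for _ in range(3)]
off = ReasoningHeadStub().forward(batch)
on = ReasoningHeadStub(enabled=True).forward(batch)
```

With the head disabled, every output row is all zeros, so a zero-init planner gate or a plain concat would both see a neutral input.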

Would land as a separate PR after the World Model (#85) merges, the same way the World Model was built.

Open questions (supervision — need WG / @m-zain-khawaja input)

Defaults proposed above; the points that really need your call:

  1. Teacher signal: OK with an open-weights VLM (Qwen2-VL / InternVL) as an offline, train-only auto-labeller, or do you prefer the Alpamayo CoC autolabeler from the start?
  2. Student output v1: agree to start with the typed classification vector (#81) and defer free-form text to v2?
  3. Loss: KL distillation + auxiliary CLIP-style alignment, weighted 1:1. Any preference on the balance?
  4. Planner coupling: OK with a zero-init adaptive gate (FiLM-style), or do you want plain concat first?
  5. Scope: confirm front cam only / 1 Hz / KIT long-tail for v1.

If you confirm the teacher signal + the v1 student output, I'll start with C1–C2 + the (zero-init) planner gate, reusing #81.

References
