feat(data_parsing): L2D 1 Hz sequential multi-view windows for the World Model (#16) #95
Merged
m-zain-khawaja merged 1 commit on Jun 30, 2026
…rld Model (autowarefoundation#16)

Implements the 'sequential frame access (needed for the feature reconstruction loss autowarefoundation#13)' item of autowarefoundation#16, on the agreed LeRobot L2D dataset.

Problem: L2DDataset loaded only the current frame and returned visual_history as zeros, so the World Model had no past context and no future targets for the JEPA loss.

- world_model_windows.py (new, dataset-agnostic + unit-testable without lerobot): stride_for_hz (10 Hz -> 1 Hz = stride 10; parametrised for e.g. 30 fps), window_offsets, required_margins, build_windows(load_frame, row, ep_start, ep_end, N, stride) -> (history_frames, future_frames), each [N, V, 3, H, W], oldest->newest, with episode-boundary checks (no cross-episode leakage).
- L2DDataset: opt-in include_world_model_windows (default OFF -> byte-identical). Emits history_frames/future_frames [N,7,3,H,W]; valid-index margins take the max of egomotion (64/64) and the World-Model window; _load_multiview_frame refactored and reused. Feeds train_il's JEPA term directly.
- 14 tests (pure windowing logic, no dataset download); mypy/ruff clean.

Not done (flagged): camera calibration extraction (BEV works via pseudo_projection; L2D calib API unverified offline) and pre-extraction/caching for the heavy multi-view video decode. Both are follow-ups; functional correctness is independent.

Signed-off-by: GABRIELA CORDOVA <100548769@alumnos.uc3m.es>
CI green. This implements the "Sequential frame access (needed for #13)" item from #16: 1 Hz multi-view past/future windows for the World Model.
Feeds the JEPA directly: #13's …
Kept honest about scope (details in the PR body): BEV camera calibration is left as a documented TODO in …
m-zain-khawaja (Member) approved these changes on Jun 30, 2026 and left a comment:
approved - thanks @gcordova10
What this adds
`L2DDataset` currently loads only the current frame and leaves `visual_history` as zeros, with no future frames, so the World Model / JEPA (#13, #85) has neither
history nor targets. This PR adds sequential 1 Hz multi-view windows (the
"Sequential frame access (needed for #13)" checklist item in #16).
Design (testable without `lerobot`)

`data_parsing/l2d/world_model_windows.py` (new, pure / dataset-agnostic):
- `stride_for_hz(source_hz, wm_hz)`: L2D 10 Hz → 1 Hz = stride 10.
- `window_offsets(N, stride)`: past `[-(N-1)s..0]` (oldest→newest, current last) and future `[s..N*s]`.
- `required_margins(N, stride)` = `((N-1)*s, N*s)`.
- `build_windows(load_frame, row, ep_start, ep_end, num_frames=4, stride=10)` → `(history_frames, future_frames)`, each `[N, V, 3, H, W]`; raises `IndexError` if a window would cross the episode boundary. Takes a `load_frame(row) -> [V,3,H,W]` callable, so it has no dataset/`lerobot` dependency and is unit-tested with a synthetic loader.
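
For readers who want to see the windowing arithmetic concretely, here is a minimal pure-Python sketch of the four helpers listed above. It is not the PR's code. The function names and default arguments come from the description; everything else is an assumption made for illustration: `ep_end` is treated as exclusive, non-integer rate ratios are rejected, and windows are returned as plain lists rather than stacked `[N, V, 3, H, W]` tensors.

```python
from typing import Callable, List, Tuple, TypeVar

F = TypeVar("F")  # whatever load_frame returns, e.g. a [V, 3, H, W] tensor


def stride_for_hz(source_hz: float, wm_hz: float) -> int:
    """Rows between consecutive window frames (10 Hz -> 1 Hz gives 10)."""
    stride = round(source_hz / wm_hz)
    # Assumption: non-integer ratios are rejected instead of rounded silently.
    if stride < 1 or abs(stride * wm_hz - source_hz) > 1e-6:
        raise ValueError(f"{source_hz} Hz is not an integer multiple of {wm_hz} Hz")
    return stride


def window_offsets(n: int, stride: int) -> Tuple[List[int], List[int]]:
    """Past offsets [-(n-1)s..0] (current frame last) and future offsets [s..n*s]."""
    past = [-(n - 1 - i) * stride for i in range(n)]
    future = [(i + 1) * stride for i in range(n)]
    return past, future


def required_margins(n: int, stride: int) -> Tuple[int, int]:
    """Rows needed before and after the current row for a complete window."""
    return (n - 1) * stride, n * stride


def build_windows(
    load_frame: Callable[[int], F],
    row: int,
    ep_start: int,
    ep_end: int,
    num_frames: int = 4,
    stride: int = 10,
) -> Tuple[List[F], List[F]]:
    """Return (history, future) frame lists, oldest to newest, within one episode."""
    before, after = required_margins(num_frames, stride)
    # Assumption: ep_end is exclusive (one past the last row of the episode).
    if row - before < ep_start or row + after >= ep_end:
        raise IndexError(f"window around row {row} crosses the episode boundary")
    past, future = window_offsets(num_frames, stride)
    history = [load_frame(row + o) for o in past]
    targets = [load_frame(row + o) for o in future]
    return history, targets
```

With a synthetic loader such as `lambda r: [r]`, calling `build_windows(loader, 30, 0, 71)` reads rows 0, 10, 20, 30 for the history and 40, 50, 60, 70 for the future, while row 29 raises `IndexError` because its oldest history frame would fall before `ep_start`. A 30 FPS source only changes the stride: `stride_for_hz(30.0, 1.0)` is 30.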

`data_parsing/l2d/dataset.py` (opt-in, default OFF → byte-identical to before):
- `include_world_model_windows=False, wm_num_frames=4, wm_hz=1.0, source_hz=10.0`;
- `_load_multiview_frame(row)` refactor, reused for the current frame and every window frame;
- `_build_sample_index` margins take the max of the egomotion window (64/64) and the World Model windows, so a valid frame always has a complete window;
- `__getitem__` emits `history_frames` / `future_frames` when enabled; `L2DSample` gains the two optional keys (`NotRequired`, version-guarded import for py3.10/3.12).
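
The margin rule in `_build_sample_index` can be illustrated with a short sketch. This is an assumption-laden illustration, not the dataset's implementation: `combined_margins` and `valid_rows` are hypothetical helper names, the 64/64 egomotion margins are taken to be counted in rows, and `ep_end` is again treated as exclusive.

```python
from typing import Tuple

# Egomotion window margins quoted in the PR (assumed to be counted in rows).
EGO_PAST, EGO_FUTURE = 64, 64


def combined_margins(wm_num_frames: int, stride: int, include_wm: bool) -> Tuple[int, int]:
    """Per-side maximum of the egomotion margins and the World Model margins."""
    if not include_wm:
        return EGO_PAST, EGO_FUTURE
    wm_past = (wm_num_frames - 1) * stride
    wm_future = wm_num_frames * stride
    return max(EGO_PAST, wm_past), max(EGO_FUTURE, wm_future)


def valid_rows(ep_start: int, ep_end: int, margins: Tuple[int, int]) -> range:
    """Rows of one episode whose windows all stay inside [ep_start, ep_end)."""
    past, future = margins
    # An empty range falls out naturally when the episode is too short.
    return range(ep_start + past, ep_end - future)
```

Under these assumptions the defaults (`wm_num_frames=4`, stride 10) need only 30 rows of past and 40 of future, so the 64/64 egomotion margins still dominate and the set of valid rows does not change when the option is enabled. The World Model window becomes the binding constraint only for longer windows, for example 8 frames at stride 10, which need 70 and 80 rows.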
How it fits
- `train_il`: `compute_step_loss` consumes exactly `batch["history_frames"]` / `["future_frames"]` → `encode_history → predict_future → jepa_loss` (these were synthetic before).
- `history_frames` feeds `WorldActionModel`; `future_frames` are the JEPA targets.
- … `data_parsing`).

1 Hz / 10 Hz
L2D is 10 Hz (`egomotion.py`, `_DT=0.1`), so `source_hz=10.0` (stride 10). Parametrized: a 30 FPS source → `source_hz=30`.

Scope / not in this PR (kept honest)
- Camera calibration (`camera_params`): #16 (TODO: Dataset Selection and DataLoader Implementation) lists it, but I keep this PR to the sequential windows. `l2d/camera.py` already has `make_camera_params_placeholder()` (identity) with a documented TODO to parse `intrinsic @ extrinsic → [3,4]` per view from L2D's `extrinsic_RDF.yaml`; BEV fusion runs on its learnable pseudo-projection meanwhile (not blocking). Follow-up: the real YAML parser.
- Video decode cost: decoding `N×2` multi-view frames per sample is expensive; for real training prefer pre-extraction (`data_parsing/pre_extracted.py`) or caching. The functional correctness here is independent.
Tests
`pytest Model/tests/test_world_model_windows.py` → 14 passed. ruff + mypy clean.
Diff: 2 new files + 2 edited (`dataset.py`, `__init__.py`).