feat (data_parsing): add Kitscenes dataset parser and forward pass test#41

Merged
m-zain-khawaja merged 17 commits into
autowarefoundation:mainfrom
bharatwrrr:kitscenes-data-parsing
Jun 18, 2026

@bharatwrrr

@bharatwrrr bharatwrrr commented Jun 10, 2026

Contributor

Key difference: Vectorized HD maps

Unlike NVIDIA, KIT Scenes ships production-grade Lanelet2 HD maps as the primary map representation. This raises the question: how to feed vectorized road geometry into a vision model backbone?

Solution: map.py — rasterize maps to semantic RGB tiles. Each scene's Lanelet2 map is queried for lanelets in an ego-centric region, then rasterized using OpenCV into a (H, W, 3) uint8 RGB image. The tile includes:

  • Road borders (green), curbstones (dark/light green), guard rails, fences, walls (browns)
  • Lane dividers (gray background + coloured stroke)
  • Centerlines (dashed dark red)
  • Stop lines (red)
  • Pedestrian crossings (yellow)

The tile is ego-centric (forward → up, left → left) and rotated so that the ego's heading always points straight up. It is then passed through the same backbone transform (resize + ImageNet normalisation) as the camera frames, so all 8 views share identical shape and normalisation.
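
As a rough illustration of the ego-centric convention described above (forward → up, left → left), the world-to-pixel mapping could be sketched as follows. This is not the code in map.py: the function name `world_to_tile_pixels`, its parameters, and the assumption that yaw is in radians measured counter-clockwise from the world x-axis are all hypothetical. The default `radius_meters=60.0` is only an assumption chosen to match the 120 m × 120 m window mentioned later in this thread.

```python
import numpy as np


def world_to_tile_pixels(points_xy, ego_xy, ego_yaw, radius_meters=60.0, tile_size=256):
    """Map world-frame (x, y) points to (col, row) pixels of an ego-centric tile.

    Hypothetical helper: assumes ego_yaw is in radians, counter-clockwise
    from the world x-axis, and that the ego sits at the tile centre.
    """
    points_xy = np.asarray(points_xy, dtype=np.float64)
    delta = points_xy - np.asarray(ego_xy, dtype=np.float64)

    # Unit vectors of the ego frame expressed in world coordinates.
    forward_axis = np.array([np.cos(ego_yaw), np.sin(ego_yaw)])
    left_axis = np.array([-np.sin(ego_yaw), np.cos(ego_yaw)])

    forward = delta @ forward_axis  # metres ahead of the ego
    left = delta @ left_axis        # metres to the left of the ego

    scale = tile_size / (2.0 * radius_meters)  # pixels per metre
    centre = tile_size / 2.0

    cols = centre - left * scale     # left of ego -> smaller column index
    rows = centre - forward * scale  # ahead of ego -> smaller row index
    return np.stack([cols, rows], axis=-1)


# Ego at (10, 5) facing +y; one point 30 m ahead, one point 30 m to the left.
pixels = world_to_tile_pixels(
    [[10.0, 35.0], [-20.0, 5.0]], ego_xy=(10.0, 5.0), ego_yaw=np.pi / 2
)
```

The resulting coordinates could then be rounded to int32 and drawn onto a (H, W, 3) uint8 canvas, for example with OpenCV's `cv2.polylines`.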

Design tradeoff: Rasterization vs. alternative representations

Rasterizing is not free. At scale, the per-tile cost adds up. Two directions to consider:

1. Keep current approach (rasterize → backbone image encoder)

  • Pros: Consistent with camera pipeline; maps are just "another view" to the model
  • Cons: Lanelet2 is inherently vectorized; rasterization loses graph structure (lane topology, connectivity)

2. Vectorized map encoder with spatial output (not v1)

  • Keep maps as polylines/graphs; encode via GNN/transformer on lanelet graph, output (3, H, W) directly
  • Lanelet graph (nodes, edges, lane adjacency) → GNN → decoder projects to spatial → (3, H, W) → stacks as view 8
  • The 3 channels encode learned semantics (e.g., drivable area, lane boundaries, regulatory elements) instead of rasterized RGB
  • Pros: No rasterization cost; preserves graph structure; model contract stays clean (still visual_tiles (B, 8, 3, H, W))
  • Cons: Encoder complexity (variable-size graphs → fixed spatial output); need to design lanelet featurization and channel semantics
  • This is data parsing + encoder design, still clean separation from AutoE2E. Viable for v2 if rasterization bottleneck is confirmed.

Current choice & next steps

Current: rasterize to RGB, apply backbone transform, feed as slot 7 to the same CNN backbone. This is simple and works.

Next step: collect profiling data on rasterization cost during the first training run. If rasterization turns out to be a bottleneck, pivot to option 2 (vectorized encoder) for v2.

Map Transforms

Currently using timm.data.create_transform (backbone-driven ImageNet normalization). Alternatives:

  • Task-specific transforms: Histogram equalization or CLAHE (contrast) for map clarity (maps are simpler than camera images; might benefit from different preprocessing)
  • Non-ImageNet normalization: ImageNet statistics may not make sense for raster maps.

Keeping ImageNet norm for now (simple, well-studied), but this is tunable.
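
If the non-ImageNet route is explored later, dataset-specific statistics for the raster tiles could be estimated along these lines. This is only a sketch: `channel_stats` is a hypothetical helper, and it assumes the tiles are available as a stacked (N, H, W, 3) uint8 array.

```python
import numpy as np


def channel_stats(tiles_uint8):
    """Per-channel mean and std of (N, H, W, 3) uint8 tiles, scaled to [0, 1].

    Hypothetical helper for estimating map-specific normalisation constants.
    """
    tiles = np.asarray(tiles_uint8, dtype=np.float64) / 255.0
    mean = tiles.mean(axis=(0, 1, 2))  # reduce over tiles, rows, columns
    std = tiles.std(axis=(0, 1, 2))
    return mean, std


# Toy input: two 4x4 tiles, the first fully green, the second black.
tiles = np.zeros((2, 4, 4, 3), dtype=np.uint8)
tiles[0, ..., 1] = 255
mean, std = channel_stats(tiles)
```

The resulting values would replace the ImageNet mean/std in whatever normalisation step is applied to the map tile only; the camera views would keep the backbone's own statistics.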

Testing

forward_pass_test.py validates: dataset → batch → forward pass.

Example smoke test on one scene:

python forward_pass_test.py --dataset_root $KITSCENES_ROOT --scene_id <uuid> --no-pretrained

@bharatwrrr

Contributor Author

Adding this as a reference to the discussion about removing the map tile from the camera input and having a separate path that feeds the encoded maps to the BEV feature fusion module. The idea is to allow either a rasterized map or a vectorized map based on availability.

Screenshot 2026-06-10 at 10 32 19 AM

@riita10069

Collaborator

Great work @bharatwrrr — this is a solid implementation. A few highlights and one question:

What I really like

The Lanelet2 BEV map rasterization (map.py) is excellent. In my original comment on #3, I suggested "option 1 (zero placeholder) for initial implementation, with option 2 (rasterize from Lanelet2) as a follow-up" — but you went ahead and delivered the full rasterization with semantic color coding, ego-centric rotation, and graceful fallback to blank tile when the map extra isn't installed. This is exactly what the model needs to reason about road geometry. The design decision to pass it through the same backbone transform as camera frames (so all 8 views share identical shape/normalization) is clean.

The egomotion derivation with np.unwrap(yaw_angle) before differencing is also correct — this avoids the ±π wraparound spikes that would otherwise corrupt the curvature signal.
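
To make the wraparound issue concrete, here is a small sketch (not the PR's code; `yaw_rate` and the sample values are made up for illustration) comparing a naive difference against one taken after `np.unwrap`:

```python
import numpy as np


def yaw_rate(yaw, dt):
    """Yaw rate in rad/s from a yaw series in radians, robust to ±π wraparound.

    Hypothetical helper: assumes uniformly sampled yaw with spacing dt seconds.
    """
    yaw = np.asarray(yaw, dtype=np.float64)
    return np.diff(np.unwrap(yaw)) / dt


# Heading crosses the +π / -π boundary between the 2nd and 3rd samples.
yaw = np.array([3.0, 3.1, -3.1, -3.0])
dt = 0.1

naive = np.diff(yaw) / dt  # contains a spike of roughly -2π / dt
fixed = yaw_rate(yaw, dt)  # stays close to the true turn rate
```

Any quantity derived from the yaw rate, such as a curvature estimate obtained by dividing by speed, inherits the spike unless the unwrap happens first.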

Question: Camera calibration / extrinsics

KIT Scenes provides full calibration in calib.json (intrinsic K [3×3] and extrinsic T_camera_to_reference [4×4] per camera). I don't see camera_params being computed or returned in this PR.

For fusion_mode="concat" and "cross_attn" this is fine (they don't use camera geometry). But for fusion_mode="bev", the BEV fusion module uses camera_params ([B, V, 3, 4] projection matrices) to project 3D reference points onto camera image planes. Without them, BEV fusion falls back to learned pseudo-projection, which works but loses the geometric prior that real calibration provides.

Is this intentional for v1 (add calibration later), or is it something you'd like to include here? The computation would be:

calib = loader.get_camera_calibration(cam_name)
K_scaled = scale_intrinsic(calib.intrinsic, original_res, target_res=256)
T_ref_to_cam = np.linalg.inv(calib.extrinsic)
P = K_scaled @ T_ref_to_cam[:3, :]  # (3, 4) projection matrix
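
A self-contained version of that computation could look like the sketch below. The helpers here are hypothetical stand-ins, not the KIT Scenes SDK or this repository's API: `scale_intrinsic` is assumed to take `original_res` as (width, height) and to model a direct, non-cropping resize to a square target.

```python
import numpy as np


def scale_intrinsic(K, original_res, target_res):
    """Rescale K for a direct resize from (width, height) to target_res x target_res.

    Hypothetical helper: valid only when no crop or padding is applied.
    """
    orig_w, orig_h = original_res
    S = np.diag([target_res / orig_w, target_res / orig_h, 1.0])
    return S @ np.asarray(K, dtype=np.float64)


def projection_matrix(K, T_camera_to_reference):
    """Build the (3, 4) matrix mapping reference-frame points to pixels."""
    T_ref_to_cam = np.linalg.inv(T_camera_to_reference)
    return K @ T_ref_to_cam[:3, :]


def project(P, point_xyz):
    """Project one 3D point (reference frame) to (u, v) pixel coordinates."""
    uvw = P @ np.append(np.asarray(point_xyz, dtype=np.float64), 1.0)
    return uvw[:2] / uvw[2]


# Toy calibration: 1920x1080 camera whose frame coincides with the reference frame.
K = np.array([[1000.0, 0.0, 960.0], [0.0, 1000.0, 540.0], [0.0, 0.0, 1.0]])
P = projection_matrix(scale_intrinsic(K, (1920, 1080), 256), np.eye(4))
uv = project(P, [0.0, 0.0, 10.0])  # a point on the optical axis
```

A point on the optical axis should land at the centre of the resized image, which is a cheap sanity check for the scaling.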

Note on LiDAR

KIT Scenes ships LiDAR point clouds, which aren't needed for the current forward pass but will be valuable later for BEV segmentation supervision (#17) or depth distillation. Not a blocker for this PR — just noting it as a follow-up opportunity.

Thanks again for the thorough implementation!

@bharatwrrr

Contributor Author

Thanks @riita10069. The kitscenes SDK has a map renderer, but it uses matplotlib, which I thought would be overkill for our use case, apart from the symbols it uses for traffic lights, etc.

I also think adding camera parameters is a good idea. I have added the implementation for that in my latest commit. Do let me know what you think.

@m-zain-khawaja this PR still retains the map tile together with the camera tiles. Once #47 is closed (or maybe even as part of #47), dataset.py and camera.py can be refactored to completely exclude the map tile and move the loading to map.py.

Just for reference, this is what the map tiles look like when rendered from the lanelet2 object into a 1024 x 1024 px image (they will look pixelated at 256 x 256):
image

@bharatwrrr bharatwrrr marked this pull request as ready for review June 11, 2026 00:10
@m-zain-khawaja

Member

Thanks @bharatwrrr - may I please request that we remove some extra details from the map render:

  1. Text keys on the right
  2. Title on the top
  3. North pointer on the left

This should give us a clean view of just the road geometry which can fill the entire rendered image and should hopefully allow us to have a better view of the road layout.

Although we will most likely feed this into a vectorized encoder, it may still be interesting to see how a rasterized HD map works with the network.

Also, do you know the dimensions of the rendered view in metres (e.g. forward/back and left/right)?

@riita10069

Collaborator

As far as I can tell from reading the source code, the actual generate_bev_map_tile() draws no unnecessary decorations.
The decorations are only added on the visualise_bev_tile() side.
The current field of view is 120 m × 120 m.

@bharatwrrr Is it correct?

@bharatwrrr

Contributor Author

Yes @riita10069, that is correct.
The current field of view is 120 m x 120 m. This is determined by the radius_meters argument of the generate_bev_map_tile call, where radius_meters is the half-width of the observation window in metres. This forces a square observation window, which prevents stretching when the tile is resized to 256 x 256. It is not front-biased as we discussed earlier, simply to stay consistent with the kitscenes implementation. Let me know if you'd like it to be front-biased and I can refactor it.

The legend and the decorations (including the marker for the ego position) are only inside the visualize_bev_tile function. The map tile that is fed into the model is clean.

@m-zain-khawaja

@riita10069

Collaborator

Thanks for the update.
The camera calibration addition is a good step. compute_camera_projection_matrices() now computes projection matrices from KITScenes calibration and the dataset caches/returns camera_params per scene. That addresses the previous concern in the right direction.

I still think a few issues need to be fixed before merging:

  1. ClipSample is missing camera_params.

__getitem__() now returns camera_params, but the ClipSample TypedDict still does not declare it. Please add camera_params: torch.Tensor to the sample contract.
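
A sketch of what the extended contract might look like is below. Apart from `camera_params`, the field names are assumptions borrowed from the arguments used in the forward pass test, and the shape comment is inferred from the [B, V, 3, 4] batch shape mentioned earlier; the real ClipSample may declare different or additional fields. String annotations are used only so that the snippet runs without torch installed.

```python
from typing import TYPE_CHECKING, TypedDict

if TYPE_CHECKING:  # torch is only needed by static type checkers here
    import torch


class ClipSample(TypedDict):
    """Hypothetical sample contract; field names other than camera_params are assumed."""

    visual_tiles: "torch.Tensor"
    visual_history: "torch.Tensor"
    egomotion_history: "torch.Tensor"
    # Per-view projection matrices, presumably (V, 3, 4) before batching.
    camera_params: "torch.Tensor"
```

Declaring the key in the TypedDict lets mypy flag any code path in `__getitem__()` that forgets to populate it.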

  1. forward_pass_test.py reads camera_params but does not pass it into the model.

The test currently does:

camera_params = batch["camera_params"].to(device)
...
trajectory_, compressed_, future_ = model(
    visual_tiles, visual_history, egomotion_history
)

so the calibrated BEV path is still not exercised. If this PR is adding camera calibration support, the smoke test should include a path that calls:

model(
    visual_tiles,
    visual_history,
    egomotion_history,
    camera_params=camera_params,
)

at least for the BEV fusion case.

  1. Please verify that scale_intrinsic() matches the actual timm preprocessing.

The current implementation scales intrinsics as if the native image is directly resized to a 256x256 square. But the actual image transform comes from timm.data.create_transform(), which may include aspect-preserving resize and center crop depending on the backbone config. If there is any crop or nontrivial resize policy, the projection matrix will not match the transformed image coordinates. We should either derive the exact resize/crop transform and update K accordingly, or document that the transform must be a direct square resize.
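
For reference, a minimal sketch of how K would change under an aspect-preserving resize followed by a center crop is shown below. The function name and parameters are hypothetical, and the sketch ignores integer rounding of the resized size and half-pixel conventions, so the exact steps should be read from the actual timm transform rather than taken from here.

```python
import numpy as np


def adjust_intrinsic_resize_center_crop(K, original_res, resize_short, crop_size):
    """Update K for 'resize shorter side to resize_short, then center crop'.

    Hypothetical helper: original_res is (width, height); rounding of the
    resized dimensions and half-pixel offsets are deliberately ignored.
    """
    orig_w, orig_h = original_res
    s = resize_short / min(orig_w, orig_h)  # uniform scale keeps the aspect ratio
    resized_w, resized_h = orig_w * s, orig_h * s

    # Pixels trimmed from the left and top edges by the center crop.
    off_x = (resized_w - crop_size) / 2.0
    off_y = (resized_h - crop_size) / 2.0

    scale = np.diag([s, s, 1.0])
    shift = np.array([[1.0, 0.0, -off_x], [0.0, 1.0, -off_y], [0.0, 0.0, 1.0]])
    return shift @ scale @ np.asarray(K, dtype=np.float64)


K = np.array([[1000.0, 0.0, 960.0], [0.0, 1000.0, 540.0], [0.0, 0.0, 1.0]])
K_cropped = adjust_intrinsic_resize_center_crop(K, (1920, 1080), 256, 256)
```

Unlike a direct square resize, the focal lengths here scale by the same factor in both axes and only the principal point is shifted by the crop, which is why the two policies produce different projection matrices.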

  1. README should be updated to the current model contract.

The README still refers to DrivingPolicy and the usage example does not include camera_params. Since current main uses TrajectoryPlanner, and this PR now returns camera_params, the documentation should be updated accordingly.

Overall, I like the direction and the parser is valuable. I would hold merge until the camera_params contract is tested end-to-end and the 7-camera vs 8-tile behavior is made explicit.

@bharatwrrr bharatwrrr marked this pull request as draft June 14, 2026 02:48
@bharatwrrr

Contributor Author

@riita10069 Thanks for pointing these out. The timm transform for SwinV2 is indeed multi-step. Moving this back to draft until #55 merges to avoid the 7-view vs 8-view back-and-forth, and it makes more sense to write the forward pass test once against the final contract rather than patching it twice.

@m-zain-khawaja

Member

Hi @bharatwrrr - I have merged your other PR, so there should be no conflicts on 7-view vs 8-view. Are there any changes you would like to submit?

@immel-f

immel-f commented Jun 16, 2026

Hi, I am Fabian, one of the lead authors of the KITScenes dataset.
It is awesome to see the quick adoption and usage of the dataset and I am excited for the new applications in E2E driving in Autoware!

Our research group has been working with HD maps and online HD map construction models for a long time and for KITScenes, the intended interface for working with the Lanelet2 maps in machine learning use cases is the lanelet2_ml_converter module, which is included in the custom wheel of lanelet2 in the KITScenes API repo. In the near future, lanelet2_ml_converter will also be included in the proper lanelet2 release. The module can not just render images with matplotlib, but you also get all map information in a local area, including relations between elements, cropped and transformed as numpy tensors which are easily usable for downstream tasks. It takes care of polyline merging along paths, many other tricky edge cases and removes most of the "annotation artifacts" that are properties of the Lanelet2 format but not of the road itself (like lanelets for example). It is also fully written in C++ with Python bindings, so it is really fast with negligible overhead for the map processing (~2 ms for one frame). I saw that in your implementation, you used the default version of Lanelet2, is there a specific reason for this? I believe this is a use case where lanelet2_ml_converter would fit really well. It would also fully pass through all features of the map like traffic lights and signs and all line types, which are currently missing from this implementation.

Also in our experience, a vectorized map encoder, while initially more tricky to work with, works very well and fits the original representation of the map better than a rendered image. The area of trajectory prediction (of other traffic participants) has been dealing with the problem of HD map encoding for a long time, and there vectorized map encoders became dominant quickly with works like VectorNet.
Furthermore, "non-road" elements like traffic lights and signs are tricky to represent in a BEV image, but much easier in a vectorized map encoder. Most of the tricky issues when working with vectorized elements are taken care of in lanelet2_ml_converter, and there are various options for the architecture of the vectorized map encoder module. Just as an example, the SD map encoder architecture of SDTagNet (shameless plug 😄 ) could, with some slight modifications, also be applied to encode the output of lanelet2_ml_converter and would be able to fully pass on all information of the map, including relations.

We want to be more involved with the Autoware community in the future, so please do not hesitate to contact me or the other lead authors if you have any questions!

…ataset

Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
…taset

Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
…Multimodal dataset

Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
…del inputs, and usage examples

Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
…fix starting canvas size, and removing unused color entries

Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
…nd forward pass test

Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
…r transform steps and add camera parameters to dataset

Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
…ra parameters

Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
…t to 7

Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
…s and update dataset structure

Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
… and testing modes

Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
@bharatwrrr bharatwrrr force-pushed the kitscenes-data-parsing branch from 5f7185a to 3c53ffd Compare June 17, 2026 05:14
@bharatwrrr

bharatwrrr commented Jun 17, 2026

Contributor Author

@immel-f Hi Fabian,

Thank you for this, and for the work behind KITScenes.

I agree that a vectorized map encoder would fit the original map representation better than a rendered raster image. The current architectural plan is to accept either a map image or a vectorized map because some prospective datasets that the team is interested in for training the model do not provide HD maps. With the KITScenes dataset, however, I think the plan would be to encode the HD maps using a vectorized encoder, which is yet to be designed. So presently, the only way to test a forward pass with the KITScenes dataset would be by rendering a map tile using the HD map data and encoding it using a rasterized map encoder.

The lanelet2_ml_converter integration looks to me like exactly the right long-term path, especially for passing through relational map structures that rasterization loses. I had checked it out earlier, and the reason I went with the default Lanelet2 in this PR is a hard dependency constraint: the custom wheels in the KITScenes repo require glibc >= 2.31 for Python 3.10+, and our project currently requires Python 3.11+ due to the numpy version. This makes the current wheels uninstallable on machines with an older glibc, and I was wary of changing the package dependency. Now that you mentioned the lanelet2_ml_converter will be included in the standard release, that constraint should go away, and that would be the best time to migrate. The current renderer was largely motivated by the matplotlib renderer that the SDK uses.

Would love to know any thoughts you might have on this.

@m-zain-khawaja I have pushed a few more commits to make it compatible with changes from #55. Here is a summary:

  • Updated camera intrinsic scaling to match timm transforms
  • Enhanced BEV map rendering with increased line thickness
  • Enhanced README with dataset setup instructions and model input specifications
  • Updated docstrings for compatibility with changes in feat(map_encoder): add raster map encoder and map-image feature fusion #55
  • Added data directory to .gitignore
  • Configured 7-view camera setup as default
  • Added runtime rasterization option for BEV map tiles to enable benchmarking with zero tensor map tiles
  • Appended ClipSample structure with camera parameters and map tile

Let me know if you want me to make any further changes.

@bharatwrrr bharatwrrr marked this pull request as ready for review June 17, 2026 05:29
@immel-f

immel-f commented Jun 17, 2026

Hi @bharatwrrr, thanks for your quick response, it seems we agree on most points 😄

One small thing is that when looking at this implementation, it seems that not all elements of the KITScenes API visualization like traffic lights / signs, symbols / arrows etc. are implemented, do you think it would make sense to also include those here? With the visualization code of the KITScenes API we tried to represent as much of the map information as possible.

Regarding the version conflict of glibc with lanelet2_ml_converter, I am surprised to hear that you do not have this conflict with the regular Lanelet2, as both wheels are built with the same CI (the wheels in the KITScenes API repo are just taken from here). The current main repo wheels (e.g. here) also use manylinux_2_31 and therefore already require glibc 2.31 for Python 3.10+. This was required due to some dependency conflicts with these newer Python versions. Do you maybe build lanelet2 from source? In that case you can also try building from the source yourself, which is also public. However, glibc 2.31 is supported starting from Ubuntu 20.04 (which is already EOL 😄 ). Do you know why an older glibc is required here?

@bharatwrrr

Contributor Author

Hi @immel-f,

That is a completely fair point. The current rasterizer implementation is intentionally constrained to the core road topology (boundaries, centerlines, crosswalks, and intersection stop lines), which would provide baseline context to the model about the road structure and serve as a lean baseline for validating our model's initial forward pass and seeing how training responds to the map inputs.

About your point on the lanelet2_ml_converter, you are probably right that one would see a similar conflict with the base Lanelet2 (or pip would simply throw a No distribution found error). So I installed it via conda using the RoboStack channel (conda install -c robostack-staging ros-humble-lanelet2-python), which apparently enforces a very conservative glibc baseline. Regarding the older glibc requirement: while it is definitely an older standard for personal machines, some clusters (like the one I tested this on) rely on a long-term enterprise Linux distribution (like RHEL), which intentionally locks core libraries such as glibc to one version (2.28 in this case) for the entire lifecycle to guarantee runtime predictability across shared nodes. While we could compile ml_converter from source locally, the rasterizer currently implemented in this PR is able to generate the core structures needed to validate the model's initial forward pass. The preferred pipeline for training the model with KITScenes would be to use a vectorized map encoder, in which case compiling the ml_converter will be necessary to use the SDK.

Thanks.

@m-zain-khawaja m-zain-khawaja merged commit 575d853 into autowarefoundation:main Jun 18, 2026
1 check passed

@m-zain-khawaja m-zain-khawaja left a comment

Member

Approved - thanks @bharatwrrr

riita10069 added a commit to riita10069/auto_e2e that referenced this pull request Jun 21, 2026
- Merge upstream/main (kit_scenes parser) and resolve .gitignore conflict
- ruff: remove unused imports, split multi-imports, rename ambiguous 'l'
- Remove superseded modular pipeline files (workflows.py is the canonical,
  registered entry point); they had a Py3.11-invalid nested f-string
- mypy: fix _get_rgb subtype annotation to str|None (pre-existing in autowarefoundation#41)
- Add __all__ to evaluation/__init__.py for intentional re-exports

ruff check: clean | mypy: 63 files clean | pytest: 207 passed, 1 skipped

Signed-off-by: riita10069 <ryota10069.tech5.jizi@gmail.com>
riita10069 added a commit to riita10069/auto_e2e that referenced this pull request Jun 21, 2026
For each dataset parser that can load in the current environment, run a
short optimisation loop on real samples and assert that loss trends
downward. Datasets without data or code are automatically skipped.

Currently exercised: L2D (HuggingFace), nvidia_physical_ai (local).
Kit_scenes skips until PR autowarefoundation#41 merges.

Signed-off-by: riita10069 <ryota10069.tech5.jizi@gmail.com>

4 participants