feat (data_parsing): add Kitscenes dataset parser and forward pass test #41
|
Great work @bharatwrrr. This is a solid implementation. A few highlights and one question:

What I really like
The Lanelet2 BEV map rasterization (
The egomotion derivation with

Question: Camera calibration / extrinsics
KIT Scenes provides full calibration in
For
Is this intentional for v1 (add calibration later), or is it something you'd like to include here? The computation would be:

    calib = loader.get_camera_calibration(cam_name)
    K_scaled = scale_intrinsic(calib.intrinsic, original_res, target_res=256)
    T_ref_to_cam = np.linalg.inv(calib.extrinsic)
    P = K_scaled @ T_ref_to_cam[:3, :]  # (3, 4) projection matrix

Note on LiDAR
KIT Scenes ships LiDAR point clouds, which aren't needed for the current forward pass but will be valuable later for BEV segmentation supervision (#17) or depth distillation. Not a blocker for this PR; just noting it as a follow-up opportunity.

Thanks again for the thorough implementation! |
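For reference, the computation proposed above can be sketched end to end with NumPy. This is a minimal illustration, not the parser's code: `scale_intrinsic` here is a hypothetical stand-in that takes `(width, height)` tuples, the extrinsic is assumed to be a 4x4 camera-to-reference transform (which is what inverting it in the snippet above implies), and the calibration values are made up. It also assumes a direct resize with no cropping.

```python
import numpy as np


def scale_intrinsic(K, original_res, target_res):
    """Scale a 3x3 pinhole intrinsic matrix for a direct resize (no crop).

    original_res and target_res are (width, height) tuples; this argument
    layout is an assumption made for the sketch.
    """
    sx = target_res[0] / original_res[0]
    sy = target_res[1] / original_res[1]
    # Scaling the first two rows rescales fx, cx by sx and fy, cy by sy.
    return np.diag([sx, sy, 1.0]) @ K


def projection_matrix(K, T_cam_to_ref):
    """Build the (3, 4) matrix mapping homogeneous reference-frame points to pixels."""
    T_ref_to_cam = np.linalg.inv(T_cam_to_ref)
    return K @ T_ref_to_cam[:3, :]


# Made-up calibration: 1920x1080 camera, identity rotation, offset from the reference origin.
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
T_cam_to_ref = np.eye(4)
T_cam_to_ref[:3, 3] = [1.0, 0.0, 1.5]

K_scaled = scale_intrinsic(K, (1920, 1080), (256, 256))
P = projection_matrix(K_scaled, T_cam_to_ref)

# A reference-frame point lying 10 m along the camera's optical axis
# should land on the (scaled) principal point.
point_ref = np.array([1.0, 0.0, 11.5, 1.0])
u, v, w = P @ point_ref
pixel = (u / w, v / w)
```

The sketch ignores lens distortion and half-pixel centre conventions, both of which may matter for a real calibration.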
|
Thanks @riita10069. The kitscenes SDK had a map renderer, but it used matplotlib, which I thought would be overkill for our use case, apart from the symbols it uses for traffic lights, etc. I also think adding camera parameters is a good idea, and I have added the implementation for that in my latest commit. Do let me know what you think.

@m-zain-khawaja this PR still retains the map tile together with the camera tiles. Once #47 is closed (or maybe even as part of #47), 'dataset.py

Just for reference, this is what the map tiles look like when rendered from the lanelet2 object into a 1024 x 1024 px image (they will look pixelated at 256 x 256): |
|
Thanks @bharatwrrr - may I please request that we remove some extra details from the map render:
This should give us a clean view of just the road geometry, which can fill the entire rendered image and should hopefully give us a better view of the road layout. Although we will most likely feed this into a vectorized encoder, it may still be interesting to see how a rasterized HD map works with the network. Also, do you know the dimensions of the rendered view in metres (e.g. forward/back and left/right)? |
|
I think the actual generate_bev_map_tile() has no unnecessary decorations. @bharatwrrr, is that correct? |
|
Yes @riita10069, that is correct. The legend and the decorations (including the marker for ego position) are only inside the |
|
Thanks for the update. I still think a few issues need to be fixed before merging:

The test currently does:

    camera_params = batch["camera_params"].to(device)
    ...
    trajectory_, compressed_, future_ = model(
        visual_tiles, visual_history, egomotion_history
    )

so the calibrated BEV path is still not exercised. If this PR is adding camera calibration support, the smoke test should include a path that calls:

    model(
        visual_tiles,
        visual_history,
        egomotion_history,
        camera_params=camera_params,
    )

at least for the BEV fusion case.

The current implementation scales intrinsics as if the native image is directly resized to a 256x256 square. But the actual image transform comes from

The README still refers to

Overall, I like the direction and the parser is valuable. I would hold merge until the |
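To illustrate why the direct-resize assumption matters, here is a rough sketch of how pinhole intrinsics change under a two-step transform. It assumes the transform is a shorter-side resize followed by a center crop; the actual steps and sizes should be read from the transform the backbone uses, and the function name and example values are hypothetical.

```python
def adjust_intrinsics_resize_center_crop(fx, fy, cx, cy, width, height,
                                         resize_short, crop):
    """Track (fx, fy, cx, cy) through resize-shorter-side, then center crop.

    Assumed pipeline for this sketch only: the image is resized so that its
    shorter side equals `resize_short` (aspect ratio kept, sizes rounded),
    then a `crop` x `crop` window is cut from the centre.
    """
    scale = resize_short / min(width, height)
    new_w = round(width * scale)
    new_h = round(height * scale)

    # Per-axis scale factors after rounding the resized dimensions.
    sx = new_w / width
    sy = new_h / height

    # Offset of the crop window's top-left corner in the resized image.
    left = round((new_w - crop) / 2.0)
    top = round((new_h - crop) / 2.0)

    # Resizing scales all four parameters; cropping only shifts the principal point.
    return fx * sx, fy * sy, cx * sx - left, cy * sy - top


# Made-up 1920x1080 camera, resized to a 256-pixel short side and cropped to 256x256.
fx, fy, cx, cy = adjust_intrinsics_resize_center_crop(
    1000.0, 1000.0, 960.0, 540.0,
    width=1920, height=1080, resize_short=256, crop=256,
)

# Direct-resize focal length for comparison (what a plain 1920 -> 256 squash gives).
fx_direct = 1000.0 * 256 / 1920
```

In this example the horizontal focal length under resize-then-crop is noticeably larger than under a direct squash to 256x256, because the crop discards the sides of the image instead of compressing them.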
|
@riita10069 Thanks for pointing these out. The timm transform for SwinV2 is indeed multi-step. Moving this back to draft until #55 merges to avoid the 7-view vs 8-view back-and-forth, and it makes more sense to write the forward pass test once against the final contract rather than patching it twice. |
|
Hi @bharatwrrr - I have merged your other PR, so there should be no conflicts on 7-view vs 8-view. Are there any changes you would like to submit? |
|
Hi, I am Fabian, one of the lead authors of the KITScenes dataset. Our research group has been working with HD maps and online HD map construction models for a long time and for KITScenes, the intended interface for working with the Lanelet2 maps in machine learning use cases is the Also in our experience, a vectorized map encoder, while initially more tricky to work with, works very well and fits the original representation of the map better than a rendered image. The area of trajectory prediction (of other traffic participants) has been dealing with the problem of HD map encoding for a long time, and there vectorized map encoders became dominant quickly with works like VectorNet. We want to be more involved with the Autoware community in the future, so please do not hesitate to contact me or the other lead authors if you have any questions! |
…ataset Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
…taset Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
…Multimodal dataset Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
…del inputs, and usage examples Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
…fix starting canvas size, and removing unused color entries Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
…nd forward pass test Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
…r transform steps and add camera parameters to dataset Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
…ra parameters Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
…t to 7 Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
…s and update dataset structure Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
… and testing modes Signed-off-by: bharatwrrr <bharatwrrr@gmail.com>
5f7185a to 3c53ffd
|
@immel-f Hi Fabian, Thank you for this, and for the work behind KITScenes. I agree that a vectorized map encoder would fit the original map representation better than a rendered raster image. The current architectural plan is to accept either a map image or a vectorized map because some prospective datasets that the team is interested in for training the model do not provide HD maps. With the KITScenes dataset, however, I think the plan would be to encode the HD maps using a vectorized encoder, which is yet to be designed. So presently, the only way to test a forward pass with the KITScenes dataset would be by rendering a map tile using the HD map data and encoding it using a rasterized map encoder. The Would love to know any thoughts you might have on this. @m-zain-khawaja I have pushed a few more commits to make it compatible with changes from #55. Here is a summary:
Let me know if you want me to make any further changes. |
|
Hi @bharatwrrr, thanks for your quick response, it seems we agree on most points 😄 One small thing is that when looking at this implementation, it seems that not all elements of the KITScenes API visualization like traffic lights / signs, symbols / arrows etc. are implemented, do you think it would make sense to also include those here? With the visualization code of the KITScenes API we tried to represent as much of the map information as possible. Regarding the version conflict of glibc with |
|
Hi @immel-f, That is a completely fair point. The current rasterizer implementation is intentionally constrained to the core road topology (boundaries, centerlines, crosswalks, and intersection stop lines), which would provide baseline context to the model about the road structure and serve as a lean baseline for validating our model's initial forward pass and seeing how training responds to the map inputs. About your point on the Thanks. |
m-zain-khawaja
left a comment
Approved - thanks @bharatwrrr
- Merge upstream/main (kit_scenes parser) and resolve .gitignore conflict
- ruff: remove unused imports, split multi-imports, rename ambiguous 'l'
- Remove superseded modular pipeline files (workflows.py is the canonical, registered entry point); they had a Py3.11-invalid nested f-string
- mypy: fix _get_rgb subtype annotation to str|None (pre-existing in autowarefoundation#41)
- Add __all__ to evaluation/__init__.py for intentional re-exports

ruff check: clean | mypy: 63 files clean | pytest: 207 passed, 1 skipped

Signed-off-by: riita10069 <ryota10069.tech5.jizi@gmail.com>
For each dataset parser that can load in the current environment, run a short optimisation loop on real samples and assert that loss trends downward. Datasets without data or code are automatically skipped.

Currently exercised: L2D (HuggingFace), nvidia_physical_ai (local). Kit_scenes skips until PR autowarefoundation#41 merges.

Signed-off-by: riita10069 <ryota10069.tech5.jizi@gmail.com>


Key difference: Vectorized HD maps
Unlike NVIDIA, KIT Scenes ships production-grade Lanelet2 HD maps as the primary map representation. This raises the question: how to feed vectorized road geometry into a vision model backbone?
Solution:
map.py: rasterize maps to semantic RGB tiles. Each scene's Lanelet2 map is queried for lanelets in an ego-centric region, then rasterized using OpenCV into a (H, W, 3) uint8 RGB image. The tile includes:
The tile is ego-centric (forward → up, left → left) and rotated so the ego's heading always points straight up. It's then passed through the same backbone transform (resize + ImageNet normalisation) as the camera frames, so all 8 views share identical shape and normalisation.
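The ego-centric convention described above (forward → up, left → left, heading pointing straight up) can be sketched as a world-to-pixel mapping. This is an illustration only: the function name, the 64 m extent, the 256 px tile size, and placing the ego at the tile centre are all assumptions for the sketch, not values taken from map.py.

```python
import math


def world_to_tile_pixel(point_xy, ego_xy, ego_yaw, tile_size=256, extent_m=64.0):
    """Map a world-frame (x, y) point to (col, row) in an ego-centric tile.

    Assumptions for this sketch: ego_yaw is the heading in radians measured
    counter-clockwise from the world x-axis, the ego sits at the tile centre,
    and the tile covers extent_m metres along each side.
    """
    dx = point_xy[0] - ego_xy[0]
    dy = point_xy[1] - ego_xy[1]

    # Rotate the offset into the ego frame: x forward, y to the left.
    cos_y, sin_y = math.cos(ego_yaw), math.sin(ego_yaw)
    forward = dx * cos_y + dy * sin_y
    left = -dx * sin_y + dy * cos_y

    # Forward maps to smaller row indices (up), left to smaller column indices.
    px_per_m = tile_size / extent_m
    col = tile_size / 2.0 - left * px_per_m
    row = tile_size / 2.0 - forward * px_per_m
    return col, row


# Ego at (10, 5) facing the world +y direction.
ego_xy, ego_yaw = (10.0, 5.0), math.pi / 2
ahead = world_to_tile_pixel((10.0, 15.0), ego_xy, ego_yaw)   # 10 m ahead of the ego
to_left = world_to_tile_pixel((0.0, 5.0), ego_xy, ego_yaw)   # 10 m to the ego's left
```

With these assumed values, the point ahead of the ego lands above the tile centre and the point to its left lands left of centre, whatever the ego's world heading is. Polylines transformed this way could then be drawn with OpenCV calls such as `cv2.polylines`.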
Design tradeoff: Rasterization vs. alternative representations
Rasterizing is not free. At scale, the per-tile cost adds up. Three directions to consider:
1. Keep current approach (rasterize → backbone image encoder)
2. Vectorized map encoder with spatial output (not v1)
(3, H, W) directly
(3, H, W) → stacks as view 8
visual_tiles (B, 8, 3, H, W))
Current choice & next steps
Current: rasterize to RGB, apply backbone transform, feed as slot 7 to the same CNN backbone. This is simple and works.
For v2, we could collect profiling data on the rasterization bottleneck during training. If rasterization becomes a bottleneck after the first training run, pivot to option 2 (vectorized encoder).
Map Transforms
Currently using timm.data.create_transform (backbone-driven ImageNet normalization). Alternatives:
Keeping ImageNet norm for now (simple, well-studied), but this is tunable.
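As a rough picture of what ImageNet normalisation does to a map tile, here is a NumPy-only sketch using the standard ImageNet mean and standard deviation. It leaves out the resize step and does not call timm; `normalise_tile` is a hypothetical helper for illustration.

```python
import numpy as np

# Standard ImageNet per-channel statistics (RGB order, for inputs scaled to [0, 1]).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalise_tile(tile_uint8):
    """Convert an (H, W, 3) uint8 RGB tile to a (3, H, W) float32 normalised array."""
    x = tile_uint8.astype(np.float32) / 255.0   # scale to [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD      # per-channel standardisation
    return np.transpose(x, (2, 0, 1))           # HWC -> CHW


# A synthetic tile: pure red everywhere, so green and blue are zero.
tile = np.zeros((256, 256, 3), dtype=np.uint8)
tile[..., 0] = 255
out = normalise_tile(tile)
```

One thing this makes visible: a zero-valued channel maps to roughly -2 after normalisation, so a mostly black map tile has a very different value distribution from a natural camera image.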
Testing
forward_pass_test.py validates: dataset → batch → forward pass.
Example smoke test on one scene: