feat: Add LibriSpeech dataset builder(audio dataset) by Anurag9Dhiman · Pull Request #63 · galilai-group/stable-datasets

Anurag9Dhiman · 2026-04-17T07:36:44Z

Added Arrow-backed BaseDatasetBuilder implementation for LibriSpeech
Supports train-clean-100 and test-clean splits from OpenSLR
Parses both audio (.flac via soundfile) and transcriptions
Added comprehensive unit tests and exposed via timeseries module

What does this PR do?

Fixes # (issue)

This PR adds native support in stable-datasets for downloading and extracting the LibriSpeech dataset, built on the BaseDatasetBuilder architecture.

Changes Made:

Added LibriSpeech class in stable_datasets/timeseries/librispeech.py implementing the official base interface.
Audio Decoding: Decodes .flac files into float32 waveform arrays using the soundfile library (a lighter dependency than the full PyTorch audio stack).
Transcriptions: Maps text transcriptions from the .trans.txt files nested inside the multi-directory .tar.gz archive to their audio.
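
For reviewers unfamiliar with the archive layout, here is a minimal sketch of how the two parsing steps above could fit together. It is not the code in this PR: the function names `parse_transcripts`, `decode_flac`, and `iter_examples` are hypothetical, and it assumes the standard LibriSpeech layout in which each `.trans.txt` line is an utterance ID followed by its transcript and each `.flac` file is named after its utterance ID.

```python
import io
import tarfile


def parse_transcripts(text):
    """Map utterance IDs to transcripts from the contents of one .trans.txt file."""
    transcripts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumed line format: "<utterance-id> <TRANSCRIPT>"; split on the first space only.
        utt_id, _, transcript = line.partition(" ")
        transcripts[utt_id] = transcript
    return transcripts


def decode_flac(raw_bytes):
    """Decode FLAC bytes into a float32 array and its sample rate (needs soundfile)."""
    import soundfile as sf  # imported lazily so transcript parsing works without it

    waveform, sample_rate = sf.read(io.BytesIO(raw_bytes), dtype="float32")
    return waveform, sample_rate


def iter_examples(archive, decode=decode_flac):
    """Yield (utterance_id, decoded_audio, transcript) from an open tar archive."""
    transcripts = {}
    audio_members = []
    # First pass: read every transcript file, because a .trans.txt member may
    # appear after the .flac members it describes.
    for member in archive.getmembers():
        if not member.isfile():
            continue
        if member.name.endswith(".trans.txt"):
            text = archive.extractfile(member).read().decode("utf-8")
            transcripts.update(parse_transcripts(text))
        elif member.name.endswith(".flac"):
            audio_members.append(member)
    # Second pass: decode audio, skipping files that have no transcript.
    for member in audio_members:
        utt_id = member.name.rsplit("/", 1)[-1][: -len(".flac")]
        if utt_id not in transcripts:
            continue
        raw = archive.extractfile(member).read()
        yield utt_id, decode(raw), transcripts[utt_id]
```

With a downloaded archive this could be driven by `with tarfile.open(path, "r:gz") as archive: ...`, where `path` is wherever the split was saved. Scanning all members up front keeps the sketch short; a real builder working on the full train-clean-100 archive would likely stream instead.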

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

Anurag9Dhiman wants to merge 1 commit into galilai-group:main from Anurag9Dhiman:feature/librispeech
