feat: Add LibriSpeech dataset builder (audio dataset) #63

Open
Anurag9Dhiman wants to merge 1 commit into galilai-group:main from Anurag9Dhiman:feature/librispeech

@Anurag9Dhiman

  • Added Arrow-backed BaseDatasetBuilder implementation for LibriSpeech
  • Supports train-clean-100 and test-clean splits from openslr
  • Parses both audio (.flac via soundfile) and transcriptions
  • Added comprehensive unit tests and exposed via timeseries module

What does this PR do?

Fixes # (issue)

This PR adds support for downloading and extracting the LibriSpeech dataset directly in stable-datasets, built on the Arrow-backed BaseDatasetBuilder architecture.

Changes Made:

  • Added a LibriSpeech class in stable_datasets/timeseries/librispeech.py that implements the BaseDatasetBuilder interface.
  • Audio decoding: decodes .flac files into float32 waveform arrays using the soundfile library (a lighter dependency than the full PyTorch audio stack).
  • Transcriptions: maps each utterance to its text transcription from the .trans.txt files stored in the nested directories of the .tar.gz archive.
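
For reviewers unfamiliar with the archive layout, here is a minimal sketch of the decoding and transcription steps described above. It is not the code in this PR: the function names `parse_transcripts` and `iter_examples` and the output field names are hypothetical, and it assumes the standard LibriSpeech layout in which each `.trans.txt` line is an utterance ID followed by a space and the transcript, with audio stored as `<utterance-id>.flac`. It reads the archive in two sequential passes for simplicity, which decompresses it twice.

```python
import io
import tarfile


def parse_transcripts(text):
    """Map utterance IDs to transcripts from the contents of one .trans.txt file."""
    transcripts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line looks like "<speaker>-<chapter>-<utterance> TRANSCRIPT TEXT".
        utt_id, _, sentence = line.partition(" ")
        transcripts[utt_id] = sentence
    return transcripts


def iter_examples(archive_path):
    """Yield one dict per utterance (hypothetical field names)."""
    import soundfile as sf  # third-party; imported lazily so parsing works without it

    # Pass 1: collect every transcript in the archive.
    transcripts = {}
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".trans.txt"):
                raw = tar.extractfile(member).read().decode("utf-8")
                transcripts.update(parse_transcripts(raw))

    # Pass 2: decode each .flac file and attach its transcript.
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".flac"):
                utt_id = member.name.rsplit("/", 1)[-1][: -len(".flac")]
                data = io.BytesIO(tar.extractfile(member).read())
                audio, sample_rate = sf.read(data, dtype="float32")
                yield {
                    "id": utt_id,
                    "audio": audio,
                    "sampling_rate": sample_rate,
                    "text": transcripts.get(utt_id, ""),
                }
```

Passing `dtype="float32"` to `soundfile.read` returns the waveform as a float32 NumPy array directly, so no separate conversion step is needed.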

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
