datasets/datasets#22

Open
balaboom123 wants to merge 11 commits into main from datasets/datasets

@balaboom123

Owner

No description provided.

…tion

  - add first-class LSA64 configs, bundled class map, and documentation
  - implement variant-aware raw/cut resolution with explicit conflict validation
  - resolve dataset-specific videos_dir through PipelineContext for manifest reuse
  - add regression tests for LSA64 adapter, config loading, and pipeline path handling
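
As a rough illustration of the variant-aware raw/cut resolution with conflict validation described in the list above, the sketch below infers a variant from an explicit videos path and rejects a configuration in which the explicit variant and the path disagree. The function names, the assumption that each variant lives in a directory named `raw` or `cut`, and the `raw` fallback are all hypothetical and are not taken from the PR's code.

```python
from pathlib import Path
from typing import Optional

# Assumption: each variant lives in a directory named after it.
KNOWN_VARIANTS = ("raw", "cut")


def infer_variant(videos_dir: Path) -> Optional[str]:
    """Return the variant named by the last path component, if any."""
    name = videos_dir.name.lower()
    return name if name in KNOWN_VARIANTS else None


def resolve_variant(variant: Optional[str], videos_dir: Optional[Path]) -> str:
    """Combine an explicit variant with one inferred from the videos path."""
    if variant is not None and variant not in KNOWN_VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    inferred = infer_variant(videos_dir) if videos_dir is not None else None
    if variant and inferred and variant != inferred:
        # Both sources are explicit and they disagree: refuse to guess.
        raise ValueError(
            f"variant {variant!r} conflicts with videos path {videos_dir} "
            f"(which looks like {inferred!r})"
        )
    # Assumed fallback when neither source specifies a variant.
    return variant or inferred or "raw"
```

Raising on a conflict, rather than letting one setting silently win, keeps a mismatched path from producing a manifest that claims the wrong variant.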
…tched manifest reuse

  - require an explicit LSA64 release_dir or paths.videos before loader defaults
  - validate reused manifests against the configured LSA64 variant
  - persist SOURCE_VARIANT in generated LSA64 manifests
  - preserve literal class-map labels like "None" in TSV loading
  - add regression coverage for config loading, manifest reuse, and class-map parsing
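
The class-map item above concerns labels such as "None" that a table loader may coerce into a missing value. One way to sidestep that, sketched here with the standard-library `csv` module, is to use a reader that never interprets cell contents. The `CLASS_ID` and `LABEL` column names and the `load_class_map` helper are assumptions for illustration; the PR's loader is not shown on this page. With pandas, passing `keep_default_na=False` to `read_csv` is the usual way to get a similar effect.

```python
import csv
import io
from typing import Dict, TextIO


def load_class_map(handle: TextIO) -> Dict[int, str]:
    """Read CLASS_ID<TAB>LABEL rows, keeping every label as a literal string."""
    reader = csv.DictReader(handle, delimiter="\t")
    class_map: Dict[int, str] = {}
    for row in reader:
        # csv never converts cells, so "None" or "NA" stay ordinary strings.
        class_map[int(row["CLASS_ID"])] = row["LABEL"]
    return class_map


sample = io.StringIO("CLASS_ID\tLABEL\n1\tNone\n2\tNA\n3\tRed\n")
print(load_class_map(sample))  # {1: 'None', 2: 'NA', 3: 'Red'}
```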
…ifest handling

  - move LSA64 raw-input validation out of config loader into the dataset adapter base contract
  - simplify LSA64 adapter source/video-dir resolution and pass the resolved video dir into manifest building
  - make manifest building reuse DEFAULT_FPS and clean up logger initialization
  - add typed LSA64 variant handling and centralize mixed-variant manifest checks
  - direct manifest building with conflicting explicit variant paths
  - loaded manifest validation with conflicting explicit variant paths
  - PipelineRunner execution with direct Config(...) and conflicting variant paths
  - add the CSL dataset adapter and manifest/source handling
  - support local CSL release validation and frame-folder materialization
  - add base/job configs for MediaPipe, MMPose, and video pipelines
  - document CSL setup, configuration, and research references
  - add unit and integration coverage for CSL registration and config loading
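
The centralized mixed-variant manifest check mentioned in the list above could look roughly like the following. This is a minimal sketch that assumes a manifest is a sequence of row dictionaries, each carrying a `SOURCE_VARIANT` value; the helper name `check_manifest_variant` is hypothetical.

```python
from typing import Dict, Iterable


def check_manifest_variant(rows: Iterable[Dict[str, str]], configured: str) -> None:
    """Raise ValueError unless every row matches the configured variant."""
    found = sorted({row["SOURCE_VARIANT"] for row in rows})
    if len(found) > 1:
        # The manifest itself was assembled from more than one variant.
        raise ValueError(f"manifest mixes source variants: {found}")
    if found and found[0] != configured:
        # The manifest is consistent but belongs to a different variant.
        raise ValueError(
            f"manifest was built from {found[0]!r} but the config requests {configured!r}"
        )
```

Keeping the check in one helper means the manifest builder and the reused-manifest path can report the same error for the same problem.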
…lidation

  - support multiple frame glob patterns during frame-to-video materialization
  - prevent mixed-case CSL frame extensions from dropping frames on case-sensitive filesystems
  - centralize CSL source validation and reject non-positive video_fps values
  - fail early when CSL frame materialization produces no usable videos
  - remove unused CSL dictionary_file config
  - add regression coverage for mixed-case frame folders and shared media materialization
  - restore ordinal fallback for CSL samples whose materialized clip names do not contain numeric signer tokens
  - normalize zero-based signer and variation ids so legacy sample_00_00-style layouts still map to canonical sample ids
  - keep improved signer parsing for single-token and repeated-take CSL filenames
  - add regression coverage for non-numeric sample names, zero-based ids, and mixed explicit/implicit variation numbering
  - simplify CSL dataset documentation and README copy around the official USTC access flow and Split I / Split II manifest rules
  - add AUTSL dataset configs for mediapipe, mmpose, and video jobs
  - complete AUTSL adapter, source validation, and manifest generation
  - resolve AUTSL-specific source override paths in config loading
  - document AUTSL setup and usage in README and dataset/config docs
  - add unit and config-loader coverage for AUTSL
  - fail fast on missing explicit AUTSL label override files
  Restore CSL manifest SAMPLE_ID generation to the legacy signer_sentence_variation format so existing split-spec TSVs and downstream artifacts continue to resolve custom splits correctly. Update CSL unit tests to assert the restored sample ID format.
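
For the multi-pattern frame collection described in the list above, a minimal sketch is shown below. It gathers the files matching several glob patterns, removes duplicates, and returns them in a deterministic order. The helper name `collect_frames` and the choice to sort by file name are assumptions, not the PR's implementation.

```python
from pathlib import Path
from typing import Iterable, List, Set


def collect_frames(frame_dir: Path, patterns: Iterable[str]) -> List[Path]:
    """Return files matching any pattern, de-duplicated and sorted by name."""
    seen: Set[Path] = set()
    for pattern in patterns:
        # A set keeps a file from being counted twice when patterns overlap.
        seen.update(p for p in frame_dir.glob(pattern) if p.is_file())
    return sorted(seen, key=lambda p: p.name)
```

Calling it with both `"*.jpg"` and `"*.JPG"` picks up mixed-case extensions on a case-sensitive filesystem, where a single lowercase pattern would skip the uppercase files.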
Copilot AI review requested due to automatic review settings May 7, 2026 08:43
@github-project-automation github-project-automation Bot moved this to Backlog in signdata-slt May 7, 2026

Copilot AI left a comment

Contributor

Pull request overview

Adds first-class support for additional manually downloaded datasets (AUTSL, CSL, LSA64) by introducing dataset adapters, job configs, and manifest/build tooling, plus a small pipeline enhancement that lets datasets control video-root resolution and validate reused manifests.

Changes:

  • Added new dataset adapters + manifests/sources for AUTSL, CSL, and LSA64, including an LSA64 bundled class-map asset.
  • Updated pipeline path resolution to use dataset.resolve_videos_dir() and added a validate_loaded_manifest() dataset hook for reused manifests.
  • Added job/base configs and expanded unit/integration test coverage and documentation for the new datasets.
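
The hook arrangement summarized above can be pictured with the sketch below. The hook names `resolve_videos_dir` and `validate_loaded_manifest` come from the review summary, but their signatures, the `VariantDataset` adapter, and the `prepare_run` driver are assumptions made for illustration.

```python
from pathlib import Path
from typing import Dict, List, Optional

Manifest = List[Dict[str, str]]


class BaseDataset:
    """Hook defaults are no-ops, so adapters that do not override them are unaffected."""

    def resolve_videos_dir(self, configured: Path) -> Path:
        return configured

    def validate_loaded_manifest(self, manifest: Manifest) -> None:
        return None


class VariantDataset(BaseDataset):
    """Hypothetical adapter that keeps videos under <release_dir>/<variant>."""

    def __init__(self, release_dir: Path, variant: str) -> None:
        self.release_dir = release_dir
        self.variant = variant

    def resolve_videos_dir(self, configured: Path) -> Path:
        return self.release_dir / self.variant

    def validate_loaded_manifest(self, manifest: Manifest) -> None:
        stale = [row for row in manifest if row.get("SOURCE_VARIANT") != self.variant]
        if stale:
            raise ValueError(f"{len(stale)} manifest rows do not match {self.variant!r}")


def prepare_run(dataset: BaseDataset, configured: Path,
                reused_manifest: Optional[Manifest] = None) -> Path:
    """Ask the dataset for its video root, then vet any reused manifest."""
    videos_dir = dataset.resolve_videos_dir(configured)
    if reused_manifest is not None:
        dataset.validate_loaded_manifest(reused_manifest)
    return videos_dir
```

Because the base-class hooks simply return their input or do nothing, the pipeline can call them unconditionally without special-casing individual datasets.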

Reviewed changes

Copilot reviewed 42 out of 42 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/unit/pipeline/test_registry.py | Extends dataset registry assertions to include csl. |
| tests/unit/pipeline/test_pipeline.py | Adds tests for dataset-specific video dir resolution and LSA64 reused-manifest variant validation. |
| tests/unit/datasets/test_media.py | Tests multi-pattern frame materialization helper behavior. |
| tests/unit/datasets/test_lsa64.py | Comprehensive unit tests for LSA64 config validation, download, manifest building, and reused-manifest validation. |
| tests/unit/datasets/test_csl.py | Comprehensive unit tests for CSL preparation/materialization and manifest building logic. |
| tests/unit/datasets/test_autsl.py | Comprehensive unit tests for AUTSL config validation, download validation, and manifest building. |
| tests/integration/jobs/test_config_loader.py | Verifies new job configs load correctly and new dataset source-path fields resolve relative to project root. |
| src/signdata/pipeline/runner.py | Validates reused manifests via dataset.validate_loaded_manifest() after loading. |
| src/signdata/pipeline/context.py | Sets videos_dir via dataset-specific resolve_videos_dir(). |
| src/signdata/datasets/lsa64/source.py | Adds typed source config, variant/path inference, and reused-manifest variant validation helpers. |
| src/signdata/datasets/lsa64/manifest.py | Emits SOURCE_VARIANT, uses DEFAULT_FPS, and accepts optional injected video_dir/logger. |
| src/signdata/datasets/lsa64/adapter.py | Implements LSA64 adapter with raw-input validation, variant-aware video-dir resolution, and reused-manifest validation. |
| src/signdata/datasets/lsa64/__init__.py | Exports LSA64Dataset and LSA64SourceConfig. |
| src/signdata/datasets/csl/source.py | Adds CSL preparation/materialization logic and runtime video-dir selection. |
| src/signdata/datasets/csl/manifest.py | Builds canonical CSL manifest from discovered runtime videos + corpus with split logic and timing columns. |
| src/signdata/datasets/csl/adapter.py | Implements CSL adapter validation, preparation, runtime video-dir resolution, and manifest stats. |
| src/signdata/datasets/csl/__init__.py | Exports CSLDataset and CSLSourceConfig. |
| src/signdata/datasets/base.py | Adds validate_raw_inputs(), resolve_videos_dir(), and validate_loaded_manifest() hooks. |
| src/signdata/datasets/autsl/source.py | Adds split/modality validation, split-dir discovery, and label override handling. |
| src/signdata/datasets/autsl/manifest.py | Builds AUTSL manifest across selected splits with timing and optional unlabeled handling. |
| src/signdata/datasets/autsl/adapter.py | Implements AUTSL adapter with raw-input validation, config validation, and release-root video-dir resolution. |
| src/signdata/datasets/autsl/__init__.py | Exports AUTSLDataset and AUTSLSourceConfig. |
| src/signdata/datasets/_ingestion/media.py | Extends frame materialization to support multiple glob patterns deterministically. |
| src/signdata/datasets/_ingestion/classmap.py | Adjusts class-map CSV loading NA handling. |
| src/signdata/config/loader.py | Resolves additional dataset.source path keys and invokes dataset validate_raw_inputs() before applying defaults. |
| README.md | Documents new datasets in the supported list and provides example job invocations. |
| docs/research-preprocessing.md | Adds CSL citation entry. |
| docs/datasets.md | Adds dataset setup/usage docs for AUTSL, CSL, and LSA64. |
| docs/configuration.md | Documents new dataset.source options (release_dir, protocol, prepare_mode, labels/class files, etc.). |
| configs/jobs/lsa64/video.yaml | Adds LSA64 video (video2crop) job config. |
| configs/jobs/lsa64/mmpose.yaml | Adds LSA64 MMPose job config. |
| configs/jobs/lsa64/mediapipe.yaml | Adds LSA64 MediaPipe job config. |
| configs/jobs/csl/video.yaml | Adds CSL video (video2crop) job config. |
| configs/jobs/csl/mmpose.yaml | Adds CSL MMPose job config. |
| configs/jobs/csl/mediapipe.yaml | Adds CSL MediaPipe job config. |
| configs/jobs/autsl/video.yaml | Adds AUTSL video (video2crop) job config. |
| configs/jobs/autsl/mmpose.yaml | Adds AUTSL MMPose job config. |
| configs/jobs/autsl/mediapipe.yaml | Adds AUTSL MediaPipe job config. |
| configs/base/datasets/lsa64.yaml | Adds LSA64 base dataset config. |
| configs/base/datasets/csl.yaml | Adds CSL base dataset config. |
| configs/base/datasets/autsl.yaml | Adds AUTSL base dataset config. |
| assets/lsa64_class_map.tsv | Adds bundled LSA64 class-map asset. |

Comment on lines +115 to 120
 def resolve_rgb_dir(release_dir: Path, source: CSLSourceConfig) -> Path:
     if source.rgb_subdir:
         candidate = release_dir / source.rgb_subdir
-        if candidate.is_dir():
+        if candidate.exists():
             return candidate
     return release_dir
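
The snippet above contains two different checks on `candidate`: `is_dir()` and `exists()`. They disagree only when the path names something that is not a directory, such as a regular file. The standard-library demonstration below shows the difference; the helper name `check_candidate` and the file names are illustrative only.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def check_candidate(candidate: Path) -> Tuple[bool, bool]:
    """Return (exists, is_dir) for a path."""
    return candidate.exists(), candidate.is_dir()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "rgb").mkdir()
    (root / "rgb.txt").write_text("not a directory")

    print(check_candidate(root / "rgb"))      # (True, True)
    print(check_candidate(root / "rgb.txt"))  # (True, False)
    print(check_candidate(root / "missing"))  # (False, False)
```

A function that is meant to return a directory and guards with `exists()` alone would therefore also accept a regular file that happens to share the configured name.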
Comment on lines +153 to +167
def iter_sample_frame_dirs(root: Path) -> Iterable[tuple[Path, Path]]:
    if not root.exists():
        return

    sentence_dirs = sorted(
        path for path in root.iterdir()
        if path.is_dir() and _looks_like_sentence_dir(path.name)
    )
    for sentence_dir in sentence_dirs:
        sample_dirs = sorted(
            path for path in sentence_dir.iterdir()
            if path.is_dir() and _contains_frame_files(path)
        )
        for sample_dir in sample_dirs:
            yield sentence_dir, sample_dir
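
To make the traversal above runnable in isolation, the sketch below supplies stand-ins for the two helpers it calls. The bodies of `_looks_like_sentence_dir` and `_contains_frame_files` are not shown in the diff, so the versions here (digit-only folder names, at least one image file per sample folder) are assumptions.

```python
from pathlib import Path
from typing import Iterator, Tuple

FRAME_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed frame extensions


def _looks_like_sentence_dir(name: str) -> bool:
    # Assumption: sentence folders are named with digits only, e.g. "000001".
    return name.isdigit()


def _contains_frame_files(path: Path) -> bool:
    # Assumption: a sample folder holds at least one image file.
    return any(
        child.is_file() and child.suffix.lower() in FRAME_SUFFIXES
        for child in path.iterdir()
    )


def iter_sample_frame_dirs(root: Path) -> Iterator[Tuple[Path, Path]]:
    """Yield (sentence_dir, sample_dir) pairs in sorted order."""
    if not root.exists():
        return  # a bare return in a generator produces an empty iterator
    for sentence_dir in sorted(
        p for p in root.iterdir() if p.is_dir() and _looks_like_sentence_dir(p.name)
    ):
        for sample_dir in sorted(
            p for p in sentence_dir.iterdir() if p.is_dir() and _contains_frame_files(p)
        ):
            yield sentence_dir, sample_dir
```

Sorting at both levels makes the order independent of how the filesystem happens to list entries, which keeps any ordinal numbering derived from it stable across machines.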
  - add a new BOBSL dataset adapter with subtitle_slt and isolated_signs views
  - add a thin BSL-1K compatibility adapter backed by the public BOBSL release
  - add base configs and runnable MediaPipe/MMPose/video jobs for both datasets
  - generalize dataset.source path resolution for file/dir/root/json/csv/tsv keys
  - document BOBSL and BSL-1K setup, config options, and job usage
  - add unit and config-loader coverage for the new adapters and paths
  - fix manual subtitle auto-discovery so manual runs do not select the wrong subtitle tree
  - fix isolated-sign CLASS_ID backfill to avoid collisions with existing explicit IDs
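
The CLASS_ID backfill fix in the list above amounts to handing out fresh IDs without reusing any that were set explicitly. A minimal sketch, assuming labels arrive as `(label, class_id_or_None)` pairs and that fresh IDs are the smallest unused non-negative integers (both assumptions, as is the helper name):

```python
from itertools import count
from typing import Dict, Iterable, Optional, Tuple


def backfill_class_ids(pairs: Iterable[Tuple[str, Optional[int]]]) -> Dict[str, int]:
    """Keep explicit ids and give every other label the next unused id."""
    pairs = list(pairs)
    assigned: Dict[str, int] = {}
    # Pass 1: reserve every explicit id before generating any new one.
    for label, class_id in pairs:
        if class_id is not None:
            assigned[label] = class_id
    used = set(assigned.values())
    fresh = (i for i in count() if i not in used)
    # Pass 2: fill the gaps in first-seen order.
    for label, _ in pairs:
        if label not in assigned:
            assigned[label] = next(fresh)
    return assigned


print(backfill_class_ids([("hello", None), ("book", 0), ("tree", None), ("car", 2)]))
# {'book': 0, 'car': 2, 'hello': 1, 'tree': 3}
```

Reserving all explicit IDs in a first pass is what prevents the collision: a single-pass counter would give `hello` the ID 0 before seeing that `book` already claims it.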
…o roots

  - reject split=challenge for current BOBSL/BSL-1K manifest workflows because the public release does not include the required subtitle or isolated-sign annotation files
  - rewrite the implicit BSL-1K videos dir to <release_dir>/videos while preserving explicit paths.videos overrides
  - add regression coverage for BOBSL validation, BSL-1K path handling, and direct base-config loading
  - update dataset docs to reflect supported BOBSL splits and BSL-1K's default video root behavior
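
The BSL-1K path rule and the challenge-split rejection from the list above can be sketched as follows. The function names and error wording are hypothetical; only the `<release_dir>/videos` default, the precedence of an explicit `paths.videos`, and the rejection of `split=challenge` come from the commit message.

```python
from pathlib import Path
from typing import Optional


def resolve_bsl1k_videos_dir(release_dir: Path, explicit_videos: Optional[Path]) -> Path:
    """Prefer an explicit paths.videos override, else use <release_dir>/videos."""
    if explicit_videos is not None:
        return explicit_videos
    return release_dir / "videos"


def validate_split(split: str) -> None:
    """Reject the split whose annotation files are not in the public release."""
    if split == "challenge":
        raise ValueError(
            "split='challenge' is not supported: the public release does not "
            "include the subtitle or isolated-sign annotation files it needs"
        )
```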