[#77] Select multiple expected languages for mixed-language meetings #81

Merged
julien731 merged 12 commits into main from feature/77-multilingual-language-selection
Jun 10, 2026

Conversation

@julien731

Member

Closes #77

Summary

Adds multilingual transcription support so a mixed English/Thai (or any 2+ language) meeting is transcribed passage by passage in the correct language, instead of being forced into one.

  • The upload form's single language dropdown becomes a language multi-select with nothing pre-selected.
  • POST /api/meetings accepts a set of expected_languages; the meeting metadata records it.
  • Transcription routes on the set size (BR-2, BR-3):
    • 0 or 1 language → the existing single-language pipeline, unchanged (no per-chunk detection, no added time — EC-7).
    • 2+ languages → a new per-VAD-chunk path: each speech chunk's language is detected (constrained to the selected set, BR-5) and transcribed in that language; chunks too short or ambiguous fall back to the meeting's dominant (duration-weighted) language (BR-6).
  • Every multilingual transcript segment records the language it was transcribed as.
  • Retrying a meeting created with 2+ languages re-processes under the multilingual path (it reuses the stored set).
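The routing rule in the list above can be sketched as follows. This is a minimal illustration, not the code in transcriber.py: the names `Mode`, `sanitize_languages`, and `select_mode` are hypothetical, and the sanitization rules (trim, lowercase, de-duplicate) are an assumption, since the PR only states that expected_languages is sanitized and stored.

```python
from enum import Enum
from typing import Iterable, List


class Mode(str, Enum):
    """Hypothetical labels for the three cases the upload form distinguishes."""

    AUTO_DETECT = "auto_detect"    # 0 languages: single pipeline, language auto-detected
    SINGLE = "single"              # 1 language: single pipeline, language forced
    MULTILINGUAL = "multilingual"  # 2+ languages: per-VAD-chunk detection


def sanitize_languages(raw: Iterable[str]) -> List[str]:
    """Trim, lowercase, and de-duplicate codes while preserving order.

    Assumed rules for illustration only; the real sanitization may differ.
    """
    seen = set()
    cleaned = []
    for item in raw:
        code = item.strip().lower()
        if code and code not in seen:
            seen.add(code)
            cleaned.append(code)
    return cleaned


def select_mode(expected_languages: Iterable[str]) -> Mode:
    """Route on the size of the sanitized set: 0 or 1 stay on the single path."""
    langs = sanitize_languages(expected_languages)
    if len(langs) >= 2:
        return Mode.MULTILINGUAL
    if len(langs) == 1:
        return Mode.SINGLE
    return Mode.AUTO_DETECT
```

Routing on the sanitized set rather than the raw input means a duplicated selection such as `["en", " EN "]` stays on the single-language path. A retry would call the same function with the stored set.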

Out of scope for this slice (later stories): per-language word alignment, the per-segment language badge UI, and per-segment emotion analysis. Multilingual timestamps stay at chunk level and speakers are left UNKNOWN until the alignment story.

Approach

backend/services/multilingual_transcriber.py reproduces WhisperX's VAD chunking and uses faster-whisper's detect_language (argmax over the selected set, renormalized) plus per-chunk transcribe(..., vad_filter=False). The classification policy (duration ≥ 1.5s, renormalized confidence ≥ 0.70, raw floor ≥ 0.5) is heuristic and tunable. The single- and multilingual pipelines were extracted into helper functions in transcriber.py to keep routing small; the single-language extraction is behavior-preserving. All ML imports stay lazy so the pure classification helpers (and the module) import without whisperx/torch/numpy.
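As a rough sketch of the classification policy described above, the pure-Python helpers below take a language-to-probability mapping (such as one built from a language detector's output) and apply the three gates. The function names and the input shape are hypothetical. Treating the "raw floor" as the winning language's probability before renormalization, and requiring all three gates to pass, are assumptions about how the thresholds combine.

```python
from typing import Dict, List, Optional, Tuple

# Thresholds quoted in the PR description; heuristic and tunable.
MIN_DURATION_S = 1.5
MIN_RENORM_CONF = 0.70
MIN_RAW_PROB = 0.5


def constrained_detect(
    probs: Dict[str, float], selected: List[str]
) -> Tuple[str, float, float]:
    """Argmax over the selected set only.

    Returns (language, renormalized confidence, raw probability).
    """
    restricted = {lang: probs.get(lang, 0.0) for lang in selected}
    total = sum(restricted.values())
    best = max(selected, key=lambda lang: restricted[lang])
    raw = restricted[best]
    renorm = raw / total if total > 0 else 0.0
    return best, renorm, raw


def classify_chunk(
    duration_s: float, probs: Dict[str, float], selected: List[str]
) -> Optional[str]:
    """Return a language, or None when the chunk is too short or ambiguous."""
    if duration_s < MIN_DURATION_S:
        return None
    lang, renorm, raw = constrained_detect(probs, selected)
    if renorm < MIN_RENORM_CONF or raw < MIN_RAW_PROB:
        return None
    return lang


def dominant_language(
    decisions: List[Tuple[float, Optional[str]]], selected: List[str]
) -> str:
    """Duration-weighted vote over the confidently classified chunks."""
    totals = {lang: 0.0 for lang in selected}
    for duration_s, lang in decisions:
        if lang is not None:
            totals[lang] += duration_s
    return max(selected, key=lambda lang: totals[lang])


def resolve_languages(
    chunks: List[Tuple[float, Dict[str, float]]], selected: List[str]
) -> List[str]:
    """Assign a language to every chunk, using the dominant one as fallback."""
    decisions = [(d, classify_chunk(d, p, selected)) for d, p in chunks]
    fallback = dominant_language(decisions, selected)
    return [lang if lang is not None else fallback for _, lang in decisions]
```

Because the argmax only ever ranges over `selected`, an unselected language cannot be returned even when it has the highest raw probability. The raw floor then catches the case where renormalization would inflate a weak match into a confident-looking one.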

Approach and scope decisions were reviewed by the Argus architect at plan stage (the speaker/diarization scope was deliberately deferred to the alignment story) and the diff passed architect code review. See docs/plans/77-multilingual-language-selection.md.

Verification

  • uv run pytest -q → 269 passed, 2 skipped. New coverage: tests/unit/test_multilingual_transcriber.py (constrained detection never returns an unselected language, confidence/duration gating, duration-weighted dominant fallback, orchestrator per-segment tagging + timestamp offsetting + failure handling, VAD glue), routing matrix in tests/unit/test_transcriber.py (0/1 force the single path and selected language; 2+ run multilingual with no align/diarize and UNKNOWN speakers; audio analysis receives the dominant language and no diarize turns), schema defaults, and expected_languages storage/sanitization in tests/integration/test_meetings.py.
  • uv run ruff check . and uv run ruff format --check . → clean.
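The per-segment tagging and timestamp offsetting covered by the new tests could look roughly like the sketch below. The `Segment` dataclass and `offset_and_tag` helper are hypothetical stand-ins, not the project's actual types. The sketch assumes each chunk's transcription yields timestamps relative to the chunk start, which then need shifting onto the meeting timeline.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Segment:
    """Hypothetical segment shape; the real schema may carry more fields."""

    start: float
    end: float
    text: str
    language: str = ""
    speaker: str = "UNKNOWN"  # speakers stay UNKNOWN on the multilingual path


def offset_and_tag(
    chunk_start: float, language: str, segments: List[Segment]
) -> List[Segment]:
    """Shift chunk-relative timestamps to meeting time and record the language."""
    return [
        replace(
            seg,
            start=seg.start + chunk_start,
            end=seg.end + chunk_start,
            language=language,
        )
        for seg in segments
    ]
```

For example, a chunk starting at 12.5 s and transcribed as Thai would map a segment spanning 1.0 to 2.5 s within the chunk to 13.5 to 15.0 s in the meeting, tagged `th`.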

@julien731 julien731 added the feature New feature or enhancement label Jun 8, 2026
@julien731 julien731 self-assigned this Jun 8, 2026
@julien731 julien731 force-pushed the feature/77-multilingual-language-selection branch 2 times, most recently from b32d04c to 10569de, June 10, 2026 04:37
@julien731 julien731 force-pushed the feature/77-multilingual-language-selection branch from 10569de to 7dca0bc, June 10, 2026 05:17
Replace the language checkbox group with a searchable Tom Select multi-select
(vendored locally, Apache-2.0) and make the auto-detect behavior explicit with
a live mode indicator: empty = auto-detect a single language, one = single
language, two or more = multilingual per-passage transcription.
@julien731 julien731 merged commit b05bc75 into main Jun 10, 2026
2 checks passed
@julien731 julien731 deleted the feature/77-multilingual-language-selection branch June 10, 2026 05:40
