[#77] Select multiple expected languages for mixed-language meetings by julien731 · Pull Request #81 · nimblehq/audio-transcriber

julien731 · 2026-06-08T11:45:31Z

Closes #77

Summary

Adds multilingual transcription support so a mixed English/Thai (or any 2+ language) meeting is transcribed passage by passage in the correct language, instead of being forced into one.

- The upload form's single language dropdown becomes a language multi-select with nothing pre-selected.
- POST /api/meetings accepts a set of expected_languages; the meeting metadata records it.
- Transcription routes on the set size (BR-2, BR-3):
  - 0 or 1 language → the existing single-language pipeline, unchanged (no per-chunk detection, no added time; EC-7).
  - 2+ languages → a new per-VAD-chunk path: each speech chunk's language is detected (constrained to the selected set, BR-5) and transcribed in that language; chunks too short or ambiguous fall back to the meeting's dominant (duration-weighted) language (BR-6).
- Every multilingual transcript segment records the language it was transcribed as.
- Retrying a meeting created with 2+ languages re-processes under the multilingual path (it reuses the stored set).
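
The routing rule above can be sketched as follows. This is a minimal illustration, not the code in transcriber.py: the names `Route`, `normalize_languages`, and `route_transcription` are hypothetical, and the normalization step (trim, lower-case, de-duplicate) is an assumption about what "sanitization" of expected_languages involves.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Route:
    pipeline: str             # "single" or "multilingual"
    language: Optional[str]   # forced language on the single path; None means auto-detect
    languages: List[str]      # normalized expected-language set


def normalize_languages(raw: Iterable[str]) -> List[str]:
    """Trim, lower-case, and de-duplicate codes, keeping first-seen order (assumed policy)."""
    seen: List[str] = []
    for code in raw:
        code = code.strip().lower()
        if code and code not in seen:
            seen.append(code)
    return seen


def route_transcription(expected_languages: Iterable[str]) -> Route:
    """Route on the size of the expected-language set."""
    langs = normalize_languages(expected_languages)
    if len(langs) >= 2:
        # 2+ languages: per-chunk detection constrained to the selected set.
        return Route("multilingual", None, langs)
    # 0 or 1 language: existing single-language pipeline, unchanged.
    return Route("single", langs[0] if langs else None, langs)
```

Because a retry reuses the stored set, calling the same routing function on the stored value would send a 2+ language meeting down the multilingual path again.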

Out of scope for this slice (later stories): per-language word alignment, the per-segment language badge UI, and per-segment emotion analysis. Multilingual timestamps stay at chunk level and speakers are left UNKNOWN until the alignment story.
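
To make the shape of a multilingual segment concrete, here is a small sketch under stated assumptions. The `Segment` dataclass, its field names, and `offset_chunk_segments` are hypothetical; the sketch only assumes that per-chunk transcription yields timestamps relative to the chunk, which are then shifted by the chunk's start time, with the speaker left as UNKNOWN.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    start: float              # seconds from the start of the meeting audio
    end: float
    text: str
    language: str             # language the chunk was transcribed as
    speaker: str = "UNKNOWN"  # no diarization on the multilingual path in this slice


def offset_chunk_segments(
    chunk_start: float, language: str, raw_segments: List[Dict]
) -> List[Segment]:
    """Shift chunk-relative timestamps to meeting time and tag each segment's language."""
    return [
        Segment(
            start=chunk_start + seg["start"],
            end=chunk_start + seg["end"],
            text=seg["text"].strip(),
            language=language,
        )
        for seg in raw_segments
    ]
```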

Approach

backend/services/multilingual_transcriber.py reproduces WhisperX's VAD chunking and uses faster-whisper's detect_language (argmax over the selected set, renormalized) plus per-chunk transcribe(..., vad_filter=False). The classification policy (duration ≥ 1.5s, renormalized confidence ≥ 0.70, raw floor ≥ 0.5) is heuristic and tunable. The single- and multilingual pipelines were extracted into helper functions in transcriber.py to keep routing small; the single-language extraction is behavior-preserving. All ML imports stay lazy so the pure classification helpers (and the module) import without whisperx/torch/numpy.
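
The classification policy can be illustrated with pure helpers that take a language-probability mapping as input, so no ML imports are needed. This is a sketch, not the module's code: the function names are hypothetical, the "raw floor" is read here as the unrenormalized probability of the winning language (an assumption), and the `default` used when no chunk is confident is also an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

MIN_DURATION_S = 1.5                # chunks shorter than this are not classified
MIN_RENORMALIZED_CONFIDENCE = 0.70  # winner's share of the selected set's probability mass
MIN_RAW_PROBABILITY = 0.5           # assumed: winner's raw (unrenormalized) probability


def classify_chunk(
    duration_s: float, probs: Dict[str, float], selected: Sequence[str]
) -> Optional[str]:
    """Return a language from `selected`, or None when the chunk should fall back."""
    if duration_s < MIN_DURATION_S:
        return None
    # Constrain to the selected set: unselected languages can never win.
    candidates = {lang: probs.get(lang, 0.0) for lang in selected}
    total = sum(candidates.values())
    if total <= 0.0:
        return None
    best = max(candidates, key=candidates.get)
    raw = candidates[best]
    if raw < MIN_RAW_PROBABILITY or raw / total < MIN_RENORMALIZED_CONFIDENCE:
        return None
    return best


def dominant_language(
    chunks: Sequence[Tuple[float, Optional[str]]], default: str
) -> str:
    """Duration-weighted most common language among confidently classified chunks."""
    weights: Dict[str, float] = defaultdict(float)
    for duration_s, lang in chunks:
        if lang is not None:
            weights[lang] += duration_s
    return max(weights, key=weights.get) if weights else default


def resolve_languages(
    chunks: Sequence[Tuple[float, Optional[str]]], default: str
) -> List[str]:
    """Replace each unclassified chunk's language with the meeting's dominant language."""
    dominant = dominant_language(chunks, default)
    return [lang if lang is not None else dominant for _, lang in chunks]
```

With English and Thai selected, a chunk whose raw probabilities are mostly French is not forced into either language with false confidence: the winner's raw probability falls below the floor, so the chunk falls back to the dominant language.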

Approach and scope decisions were reviewed by the Argus architect at plan stage (the speaker/diarization scope was deliberately deferred to the alignment story) and the diff passed architect code review. See docs/plans/77-multilingual-language-selection.md.

Verification

uv run pytest -q → 269 passed, 2 skipped. New coverage: tests/unit/test_multilingual_transcriber.py (constrained detection never returns an unselected language, confidence/duration gating, duration-weighted dominant fallback, orchestrator per-segment tagging + timestamp offsetting + failure handling, VAD glue), routing matrix in tests/unit/test_transcriber.py (0/1 force the single path and selected language; 2+ run multilingual with no align/diarize and UNKNOWN speakers; audio analysis receives the dominant language and no diarize turns), schema defaults, and expected_languages storage/sanitization in tests/integration/test_meetings.py.
uv run ruff check . and uv run ruff format --check . → clean.

Replace the language checkbox group with a searchable Tom Select multi-select (vendored locally, Apache-2.0) and make the auto-detect behavior explicit with a live mode indicator: empty = auto-detect a single language, one = single language, two or more = multilingual per-passage transcription.

julien731 added the feature label (New feature or enhancement) Jun 8, 2026

julien731 self-assigned this Jun 8, 2026

julien731 force-pushed the feature/77-multilingual-language-selection branch 2 times, most recently from b32d04c to 10569de, June 10, 2026 04:37

julien731 added 8 commits June 10, 2026 12:17

[#77] Add multilingual transcription spec and implementation plan

587bb68

[#77] Add expected_languages and per-segment language to schemas

11db1ed

[#77] Accept expected_languages set on meeting upload

d1de79c

[#77] Add per-chunk multilingual transcription engine

156af92

[#77] Route transcription to single or multilingual pipeline

507a75b

[#77] Add language multi-select to the upload form

ed8fbe5

[#77] Record plan deviation: extracted pipeline helpers

7ad885f

[#77] Add EC-1 test: 2+ languages, all-one-language audio

7dca0bc

julien731 force-pushed the feature/77-multilingual-language-selection branch from 10569de to 7dca0bc, June 10, 2026 05:17

julien731 added 4 commits June 10, 2026 12:34

[#77] Hide language placeholder once a language is selected

25d5d43

[#77] List English and Thai first in language options

faac09e

[#77] Note longer processing time for multilingual meetings

56883ae

julien731 merged commit b05bc75 into main Jun 10, 2026
2 checks passed

julien731 deleted the feature/77-multilingual-language-selection branch June 10, 2026 05:40

julien731 merged 12 commits into main from feature/77-multilingual-language-selection
