A clean, terminal-based video and audio transcription tool powered by OpenAI Whisper — fully open-source, runs locally, no API keys required.
- 🎬 Transcribe video or audio files directly from the terminal
- 📁 Batch-transcribe all supported media files in a folder with one shared set of settings
- ⚡ Choose from 5 Whisper model sizes — from blazing-fast to highly accurate
- 📄 Export to plain text, SRT, WebVTT, or JSON (with timestamps)
- 🌍 Automatic language detection — no configuration needed
- 🔒 Runs entirely offline — your files never leave your machine
- 💅 Clean, interactive UI powered by Rich
- Python 3.8+
- ffmpeg (system-level)
git clone https://github.com/WiseArts/transcribe.git
cd transcribe

# macOS
brew install ffmpeg
# Ubuntu / Debian
sudo apt install ffmpeg
# Windows (via Chocolatey)
choco install ffmpeg

python3 -m venv .venv
# macOS / Linux
source .venv/bin/activate
# Windows (PowerShell)
.venv\Scripts\Activate.ps1

python -m pip install --upgrade pip
pip install openai-whisper rich

Note: Use the tool from inside the activated virtual environment each time. The first time you run it, Whisper will automatically download the selected model weights and cache them locally. This is a one-time download per model.
# macOS / Linux
source .venv/bin/activate
python transcribe.py

On Windows PowerShell, activate it with .venv\Scripts\Activate.ps1 before running python transcribe.py.
The tool lets you choose between two modes:
- Single file mode — transcribe one video/audio file
- Folder batch mode — transcribe all supported files in one folder using the same model + output format
After choosing a mode, it walks you through:
- Input — file path or folder path (drag & drop into the terminal works on most systems)
- Model — pick a size based on how fast vs. accurate you need it
- Output format — choose how you want transcripts saved
The output file is saved alongside your source file (e.g. interview.mp4 → interview.srt).
In folder mode, each file is saved next to its source file (e.g. clip01.mp4 → clip01.srt).
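The naming rule above can be sketched with `pathlib`. This is a minimal illustration, not the tool's actual code; the helper name `output_path_for` is hypothetical.

```python
from pathlib import Path


def output_path_for(source: str, extension: str) -> Path:
    """Return the transcript path that sits next to the source file.

    Hypothetical helper: the real script may build this path differently.
    """
    # with_suffix replaces only the final extension, so the folder and
    # the file stem are preserved.
    return Path(source).with_suffix("." + extension.lstrip("."))


print(output_path_for("interview.mp4", "srt"))
```

Because only the suffix changes, the same helper would cover folder mode: each source file maps to a transcript in its own directory.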
| # | Model | Speed | Quality | VRAM | Best for |
|---|---|---|---|---|---|
| 1 | tiny | ██████████ | ███░░░░░░░ | ~1 GB | Quick drafts, fast machines |
| 2 | base | ████████░░ | █████░░░░░ | ~1 GB | Everyday use (default) |
| 3 | small | ██████░░░░ | ███████░░░ | ~2 GB | Better accuracy, still fast |
| 4 | medium | ████░░░░░░ | █████████░ | ~5 GB | High quality, multilingual |
| 5 | large | ██░░░░░░░░ | ██████████ | ~10 GB | Best possible accuracy |
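As a rough sketch of how a menu like this could map a typed choice to a Whisper model name, assuming an empty answer falls back to the default `base`. The names `MODEL_CHOICES` and `resolve_model` are hypothetical and not taken from the script.

```python
# Menu numbers from the table, mapped to Whisper model names.
MODEL_CHOICES = {"1": "tiny", "2": "base", "3": "small", "4": "medium", "5": "large"}
DEFAULT_MODEL = "base"  # marked as the default in the table


def resolve_model(choice: str) -> str:
    """Turn user input (a menu number or a model name) into a model name."""
    choice = choice.strip().lower()
    if not choice:
        # Assumption: pressing Enter selects the default model.
        return DEFAULT_MODEL
    if choice in MODEL_CHOICES:
        return MODEL_CHOICES[choice]
    if choice in MODEL_CHOICES.values():
        return choice
    raise ValueError(f"Unknown model choice: {choice!r}")
```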
Video: .mp4 .mov .avi .mkv .webm .flv
Audio: .mp3 .wav .m4a .aac .ogg .flac
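Folder batch mode needs to decide which files count as supported. One way to do that, assuming a case-insensitive match on the file extension and no recursion into subfolders (both are assumptions, and the function names are hypothetical):

```python
from pathlib import Path
from typing import List

VIDEO_EXTENSIONS = {".mp4", ".mov", ".avi", ".mkv", ".webm", ".flv"}
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".aac", ".ogg", ".flac"}
SUPPORTED_EXTENSIONS = VIDEO_EXTENSIONS | AUDIO_EXTENSIONS


def is_supported(path: Path) -> bool:
    """True if the file extension is in the supported list (case-insensitive)."""
    return path.suffix.lower() in SUPPORTED_EXTENSIONS


def find_media_files(folder: Path) -> List[Path]:
    """List supported media files directly inside the folder, sorted by name."""
    return sorted(p for p in folder.iterdir() if p.is_file() and is_supported(p))
```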
| Format | Extension | Description |
|---|---|---|
| Plain text | .txt | Clean transcript, one line per segment |
| SRT | .srt | Subtitles with timestamps (video players, Premiere, etc.) |
| WebVTT | .vtt | Web subtitles for HTML5 <video> tags |
| JSON | .json | Full Whisper output with segment-level confidence data |
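To make the subtitle formats concrete, here is a small sketch that turns Whisper-style segments (dicts with `start`, `end`, and `text`, as found in the `segments` list of a transcription result) into SRT text. The function names are hypothetical and the script's own writer may differ. SRT uses a comma before the milliseconds, while WebVTT uses a period.

```python
def format_timestamp(seconds: float, decimal_marker: str = ",") -> str:
    """Format seconds as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (WebVTT)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text: a counter, a time range, then the text for each segment."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)
```

For WebVTT, the same timestamps would be produced with `decimal_marker="."`, and the file would start with a `WEBVTT` header line.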
- CPU vs GPU: The script uses CPU by default (fp16=False) so it works on any machine. If you have an NVIDIA GPU with CUDA, remove the fp16=False flag in the transcribe() call for a significant speedup.
- Speed: As a rough guide on CPU, base transcribes at roughly 4–8× real-time speed. A 10-minute video takes around 2–3 minutes.
- Accuracy: Whisper performs best on clear speech with minimal background noise. The medium and large models handle accents and technical vocabulary noticeably better.
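Instead of editing the flag by hand, the `fp16` setting could be chosen from whether CUDA is available. The sketch below is one possible approach, not what the script currently does. It uses `whisper.load_model` and the `fp16` option of `transcribe()` from openai-whisper; the helper names are hypothetical.

```python
def transcribe_options(cuda_available: bool) -> dict:
    """Pick Whisper decode options for the current hardware."""
    # Half precision is only useful on a CUDA GPU; on CPU it stays off.
    return {"fp16": cuda_available}


def transcribe_file(path: str, model_name: str = "base") -> dict:
    """Transcribe one file, enabling fp16 only when a CUDA GPU is present."""
    # Imported lazily so transcribe_options can be used (and tested)
    # without loading the heavy dependencies.
    import torch
    import whisper

    model = whisper.load_model(model_name)
    return model.transcribe(path, **transcribe_options(torch.cuda.is_available()))
```

Calling `transcribe_file("interview.mp4")` would then behave the same on CPU-only machines and pick up the GPU speedup where one exists.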
| Package | Purpose |
|---|---|
| openai-whisper | Speech-to-text transcription |
| rich | Terminal UI |
| ffmpeg | Audio extraction from video files |
MIT — do whatever you like with it.