135 changes: 68 additions & 67 deletions README.md

Large diffs are not rendered by default.

22 changes: 22 additions & 0 deletions openai-whisper-large-v3-turbo/VitisAI/LICENSE
@@ -0,0 +1,22 @@
MIT License

Copyright (c) 2022 OpenAI
Copyright (C) [2026] Advanced Micro Devices, Inc. All Rights Reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
78 changes: 78 additions & 0 deletions openai-whisper-large-v3-turbo/VitisAI/README.md
@@ -0,0 +1,78 @@
# Whisper on WinML — Speech-to-Text on the Edge with ONNX Runtime Vitis EP

---

## Installation

### 1. Create and activate conda environment

```bash
conda create -n winml_whisper python=3.11
conda activate winml_whisper
pip install --pre -r requirements.txt
```

### 2. Check installed WASDK version

```bash
conda list | findstr wasdk
```

> **Expected Output:**
> ```
> wasdk-microsoft-windows-ai-machinelearning 1.8.260209005 pypi_0 pypi
> wasdk-microsoft-windows-applicationmodel-dynamicdependency-bootstrap 1.8.260209005 pypi_0 pypi
> ```
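
The same check can be scripted in Python, for example as a preflight step before inference. The sketch below is illustrative and not part of this recipe: `wasdk_versions` and `versions_agree` are hypothetical helper names, and the only assumption is that the packages are installed under distribution names starting with `wasdk`, as in the expected output above.

```python
from importlib import metadata


def wasdk_versions(distributions=None):
    """Return {name: version} for installed distributions whose name starts with 'wasdk'."""
    if distributions is None:
        distributions = metadata.distributions()
    found = {}
    for dist in distributions:
        name = dist.metadata["Name"] or ""
        if name.lower().startswith("wasdk"):
            found[name] = dist.version
    return found


def versions_agree(found):
    """True when at least one wasdk package is installed and all share one version."""
    return len(found) > 0 and len(set(found.values())) == 1


if __name__ == "__main__":
    found = wasdk_versions()
    for name, version in sorted(found.items()):
        print(name, version)
    print("versions consistent:", versions_agree(found))
```

If the versions disagree, reinstalling from `requirements.txt` is probably the simplest way to bring them back in line before installing the matching Windows App SDK runtime.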

### 3. Install Windows App SDK

Download and install the Windows App SDK matching your `wasdk` version (e.g., `1.8.260209005`):

```bash
curl -L -o windowsappruntimeinstall-x86.exe "https://aka.ms/windowsappsdk/1.8/1.8.260209005/windowsappruntimeinstall-x86.exe"
windowsappruntimeinstall-x86.exe --quiet
```

### 4. Download ONNX encoder and Vitis AI cache (`.rai`)

Download the encoder ONNX model from the Hugging Face repo:

| Model | Hugging Face repo |
|-------|-------------------|
| **Large-v3-turbo** | [amd/whisper-large-turbo-onnx-npu](https://huggingface.co/amd/whisper-large-turbo-onnx-npu/tree/main) — get `encoder_model.onnx`, `encoder_model.onnx.data` (if present) |

Place the encoder ONNX file(s) in your working directory (or a path you will pass to `--enc_onnx`).
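
As an alternative to downloading through the browser, the files can be fetched with the `huggingface_hub` package. This is a sketch under assumptions: `huggingface_hub` is not listed in `requirements.txt` (it is usually installed as a dependency of `transformers`), `pick_encoder_files` and `download_encoder` are hypothetical helper names, and the exact file layout of the repo is not verified here, so the files are matched by basename.

```python
from pathlib import PurePosixPath

REPO_ID = "amd/whisper-large-turbo-onnx-npu"  # repo named in the table above
WANTED = ("encoder_model.onnx", "encoder_model.onnx.data")


def pick_encoder_files(repo_files):
    """Return the repo paths whose basename is one of the encoder files."""
    return [f for f in repo_files if PurePosixPath(f).name in WANTED]


def download_encoder(target_dir="."):
    """Download the encoder file(s) into target_dir and return their local paths."""
    # Assumption: huggingface_hub is available in the environment.
    from huggingface_hub import hf_hub_download, list_repo_files

    files = pick_encoder_files(list_repo_files(REPO_ID))
    return [hf_hub_download(repo_id=REPO_ID, filename=f, local_dir=target_dir) for f in files]
```

Calling `download_encoder(".")` needs network access. If the repo stores the files in a subfolder, `local_dir` keeps that subfolder, so pass the returned path to `--enc_onnx`.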

---

## Run Inference and Transcribe

Use the encoder ONNX from the Hugging Face repo above. Set `--enc_onnx` to the path of that model's `encoder_model.onnx` and `--model` to the matching Whisper model name (`turbo` for the large-v3-turbo encoder).

```bash
python run_whisper.py --audio ".\audio_sample.wav" --model turbo --enc_onnx ".\encoder_model.onnx" --ep_policy NPU
```
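
`run_whisper.py` reads the WAV file with SciPy, downmixes it to mono, and resamples it to 16 kHz, so most PCM WAV files should work as input. To see what a file contains before running inference, a small standard-library check like the one below may help. It is a sketch, not part of the recipe: `describe_wav` is a hypothetical helper, and the `wave` module used here reads PCM WAV files only.

```python
import wave


def describe_wav(path):
    """Return basic format facts for a PCM WAV file using only the standard library."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "sample_rate": rate,
            "channels": wf.getnchannels(),
            "bits_per_sample": wf.getsampwidth() * 8,
            "duration_sec": frames / rate,
        }
```

For example, `describe_wav("audio_sample.wav")` reports whether the script will need to resample (sample rate other than 16000) or downmix (more than one channel).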
---
## Notes
In this experiment, the encoder runs through ONNX Runtime (ORT) and the decoder runs in PyTorch with a KV cache.

## Command-Line Arguments

---

| Argument | Required | Default | Description |
|---|---|---|---|
| `--audio` | Yes | — | Path to input audio WAV file |
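| `--model` | No | `small` | Whisper model name: `small`, `medium`, `turbo` (must match the encoder ONNX; pass `turbo` for the large-v3-turbo encoder) |
| `--enc_onnx` | No | `encoder_model.onnx` | Path to encoder ONNX model file |
| `--ep_policy` | No | `NPU` | Execution provider policy (`NPU`, `CPU`, `GPU`, `DEFAULT`) |
| `--download_root` | No | script directory | Directory used to download and cache the Whisper PyTorch model |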

---

## Credits

This project builds on [OpenAI Whisper](https://github.com/openai/whisper). We have borrowed code from that repository and extended it with ONNX Runtime Vitis AI EP integration, WinML execution providers, and the changes documented in this README. Whisper is licensed under the [MIT License](https://github.com/openai/whisper/blob/main/LICENSE).
Binary file not shown.
6 changes: 6 additions & 0 deletions openai-whisper-large-v3-turbo/VitisAI/info.yml
@@ -0,0 +1,6 @@
arch: whisper
recipes:
  - name: openai-whisper-large-v3-turbo-vitisai
    file: run_whisper.py
    devices: npu
    eps: VitisAIExecutionProvider
37 changes: 37 additions & 0 deletions openai-whisper-large-v3-turbo/VitisAI/requirements.txt
@@ -0,0 +1,37 @@
# MIT License
#
# Copyright (c) 2022 OpenAI
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

--extra-index-url https://pypi.org/simple
# Copyright (C) [2026] Advanced Micro Devices, Inc. All Rights Reserved.


--index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple

numpy
onnxruntime-winml
onnxscript
scipy==1.17.1
tiktoken==0.12.0
torch
transformers
wasdk-Microsoft.Windows.AI.MachineLearning[all]==1.8.260209005
wasdk-Microsoft.Windows.ApplicationModel.DynamicDependency.Bootstrap[all]==1.8.260209005
191 changes: 191 additions & 0 deletions openai-whisper-large-v3-turbo/VitisAI/run_whisper.py
@@ -0,0 +1,191 @@
# MIT License
#
# Copyright (C) 2026, Advanced Micro Devices, Inc
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

# Copyright (C) [2026] Advanced Micro Devices, Inc. All Rights Reserved.

import time
import os
import argparse
import json
import sys
import numpy as np
import whisper
import subprocess
import onnxruntime as ort
from scipy.io import wavfile
from scipy.signal import resample

SAMPLE_RATE = 16000 # Whisper expects 16 kHz

_SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))


def register_execution_providers(script_dir=None):
    """Register WinML execution providers. script_dir: directory containing winml.py (default: this script's dir)."""
    base = script_dir or _SCRIPT_DIR
    worker_script = os.path.join(base, "winml.py")
    # winml.py prints a JSON object mapping provider name -> library path.
    result = subprocess.check_output([sys.executable, worker_script], text=True)
    paths = json.loads(result)
    for name, lib_path in paths.items():
        if not lib_path or not os.path.exists(lib_path):
            continue
        ort.register_execution_provider_library(name, lib_path)

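def load_audio_no_ffmpeg(path: str) -> np.ndarray:
    """Load WAV as mono 16 kHz float32 (no ffmpeg, no torchcodec). Uses scipy."""
    sr, data = wavfile.read(path)
    # Scale integer PCM to [-1.0, 1.0); float WAVs are assumed to be in range already.
    if data.dtype == np.int16:
        data = data.astype(np.float32) / 32768.0
    elif data.dtype == np.int32:
        data = data.astype(np.float32) / 2147483648.0
    elif data.dtype == np.uint8:
        # 8-bit WAV is unsigned with a midpoint of 128.
        data = (data.astype(np.float32) - 128.0) / 128.0
    if data.ndim == 2:
        # Downmix multi-channel audio to mono.
        data = data.mean(axis=1)
    if sr != SAMPLE_RATE:
        n = int(len(data) * SAMPLE_RATE / sr)
        data = resample(data, n)
    # Always return float32, including for float64 input and the unresampled path.
    return data.astype(np.float32)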


def run_whisper(
    audio_path: str | None = None,
    audio: np.ndarray | None = None,
    *,
    enc_onnx: str = "encoder_model.onnx",
    enc_cache_dir: str = "cacheDir",
    vitisai_config: str = "vitisai_config.json",
    model: str = "small",
    download_root: str | None = None,
    ep_policy: str = "NPU",
    use_npu: bool = True,
) -> dict:
    """
    Run Whisper E2E: load audio, run encoder (ONNX on NPU/CPU) + decoder, return transcription and metrics.

    Provide either audio_path (path to WAV) or audio (float32 mono 16 kHz array). All other args are optional.

    Returns dict with: text, detected_language, rtf, elapsed_sec, audio_duration_sec.
    """
    if audio_path is None and audio is None:
        raise ValueError("Provide either audio_path or audio")
    if audio_path is not None and audio is not None:
        raise ValueError("Provide only one of audio_path or audio")

    if download_root is None:
        download_root = _SCRIPT_DIR

    if audio_path is not None:
        audio = load_audio_no_ffmpeg(audio_path)
    audio = whisper.pad_or_trim(audio)

    model_obj = whisper.load_model(model, download_root=download_root)
    mel = whisper.log_mel_spectrogram(audio, n_mels=model_obj.dims.n_mels).to(model_obj.device)

    _, probs = model_obj.detect_language(mel)
    detected_language = max(probs, key=probs.get)

    if use_npu and ep_policy == "NPU":
        register_execution_providers()
    session_options = ort.SessionOptions()
    policy_map = {
        "NPU": ort.OrtExecutionProviderDevicePolicy.PREFER_NPU,
        "CPU": ort.OrtExecutionProviderDevicePolicy.PREFER_CPU,
        "GPU": ort.OrtExecutionProviderDevicePolicy.PREFER_GPU,
        "DEFAULT": ort.OrtExecutionProviderDevicePolicy.DEFAULT,
    }
    policy = policy_map.get(ep_policy)
    # Compare against None so a valid policy is never skipped by a truthiness check.
    if policy is not None:
        session_options.set_provider_selection_policy(policy)
    # NOTE: session_options is not passed to whisper.decode() below, so this
    # policy does not reach the encoder session from this function.

    cache_key = "encoder_model"
    options = whisper.DecodingOptions(
        enc_use_onnx=bool(enc_onnx),
        enc_onnx_fname=enc_onnx or "",
        use_winml=use_npu,
        enc_use_vitis=use_npu,
        enc_cache_dir=enc_cache_dir if use_npu else "",
        enc_cache_key=cache_key,
        # Use the vitisai_config argument rather than a hard-coded file name.
        enc_config_json=vitisai_config if use_npu else "",
    )

    # pad_or_trim() above pads or trims to Whisper's fixed window (30 s in upstream
    # Whisper), so this is the window length rather than the original file length.
    audio_duration_sec = audio.shape[0] / SAMPLE_RATE
    t0 = time.perf_counter()
    result = whisper.decode(model_obj, mel, options)
    elapsed_sec = time.perf_counter() - t0
    # Real-time factor: decode time divided by audio duration (lower is faster).
    rtf = elapsed_sec / audio_duration_sec

    return {
        "text": result.text,
        "detected_language": detected_language,
        "rtf": rtf,
        "elapsed_sec": elapsed_sec,
        "audio_duration_sec": audio_duration_sec,
    }

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--enc_onnx",
        type=str,
        default="encoder_model.onnx",
        help="Path to encoder ONNX model file",
    )
    parser.add_argument("--audio", type=str, required=True, help="Path to input audio WAV file")
    parser.add_argument(
        "--ep_policy",
        type=str,
        default="NPU",
        choices=["NPU", "CPU", "GPU", "DEFAULT"],
        help="Execution provider policy (default: NPU)",
    )
    parser.add_argument(
        "--download_root",
        type=str,
        default=None,
        help="Directory to download/cache Whisper PyTorch model (default: script directory)",
    )
    parser.add_argument(
        "--model",
        type=str,
        default="small",
        choices=["small", "medium", "turbo"],
        help="Whisper model name for load_model (default: small)",
    )
    args = parser.parse_args()

    out = run_whisper(
        audio_path=args.audio,
        enc_onnx=args.enc_onnx,
        model=args.model,
        download_root=args.download_root,
        ep_policy=args.ep_policy,
        use_npu=True,
    )
    print("\n")
    print("Transcription results:")
    print(f"Detected language: {out['detected_language']}")
    print(f"RTF: {out['rtf']:.4f} (decode: {out['elapsed_sec']:.3f}s, audio: {out['audio_duration_sec']:.3f}s)")
    print(out["text"])


if __name__ == "__main__":
    main()
25 changes: 25 additions & 0 deletions openai-whisper-large-v3-turbo/VitisAI/vitisai_config.json
@@ -0,0 +1,25 @@
{
  "passes": [
    {
      "name": "init",
      "plugin": "vaip-pass_init"
    },
    {
      "name": "vaiml_partition",
      "plugin": "vaip-pass_vaiml_partition",
      "vaiml_config": {
        "fe_experiment": "use-accurate-mode=LayerNorm2PassAdf"
      }
    }
  ],
  "target": "VAIML",
  "targets": [
    {
      "name": "VAIML",
      "pass": [
        "init",
        "vaiml_partition"
      ]
    }
  ]
}
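
In the config above, each entry in `targets` lists passes by name, and those names correspond to entries in the top-level `passes` list. When editing the file by hand, a quick consistency check can catch a misspelled pass name. The following is a minimal sketch based only on the structure visible in this file: `undefined_passes` is a hypothetical helper and not part of Vitis AI, and the real loader may enforce different or additional rules.

```python
import json


def undefined_passes(config):
    """Return (target, pass) pairs where a target references a pass missing from "passes"."""
    defined = {p["name"] for p in config.get("passes", [])}
    missing = []
    for target in config.get("targets", []):
        for name in target.get("pass", []):
            if name not in defined:
                missing.append((target["name"], name))
    return missing


if __name__ == "__main__":
    with open("vitisai_config.json", encoding="utf-8") as fh:
        problems = undefined_passes(json.load(fh))
    print("undefined passes:", problems)
```

An empty list means every pass referenced by a target is defined in the file.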