microsoft · NiranjanRaviGit · May 19, 2026 · May 19, 2026 · May 26, 2026
diff --git a/README.md b/README.md
diff --git a/openai-whisper-large-v3-turbo/VitisAI/LICENSE b/openai-whisper-large-v3-turbo/VitisAI/LICENSE
@@ -0,0 +1,22 @@
+MIT License
+
+Copyright (c) 2022 OpenAI
+Copyright (C) [2026] Advanced Micro Devices, Inc. All Rights Reserved.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/openai-whisper-large-v3-turbo/VitisAI/README.md b/openai-whisper-large-v3-turbo/VitisAI/README.md
@@ -0,0 +1,78 @@
+# Whisper on WinML — Speech-to-Text on the Edge with ONNX Runtime Vitis EP
+
+---
+
+## Installation
+
+### 1. Create and activate conda environment
+
+```bash
+conda create -n winml_whisper python==3.11
+conda activate winml_whisper
+pip install --pre -r requirements.txt
+```
+
+### 2. Check installed WASDK version
+
+```bash
+conda list | findstr wasdk
+```
+
+> **Expected Output:**
+> ```
+> wasdk-microsoft-windows-ai-machinelearning              1.8.260209005   pypi_0   pypi
+> wasdk-microsoft-windows-applicationmodel-dynamicdependency-bootstrap   1.8.260209005   pypi_0   pypi
+> ```
+
+### 3. Install Windows App SDK
+
+Download and install the Windows App SDK matching your `wasdk` version (e.g., `1.8.260209005`):
+
+```bash
+curl -L -o windowsappruntimeinstall-x86.exe "https://aka.ms/windowsappsdk/1.8/1.8.260209005/windowsappruntimeinstall-x86.exe"
+windowsappruntimeinstall-x86.exe --quiet
+```
+
+### 4. Download ONNX encoder and Vitis AI cache (`.rai`)
+
+Download the encoder ONNX model from the Hugging Face repo:
+
+| Model | Hugging Face repo |
+|-------|-------------------|
+| **Large-v3-turbo** | [amd/whisper-large-turbo-onnx-npu](https://huggingface.co/amd/whisper-large-turbo-onnx-npu/tree/main) — get `encoder_model.onnx`, `encoder_model.onnx.data` (if present) |
+
+Place the encoder ONNX file(s) in your working directory (or a path you will pass to `--enc_onnx`).
+
+---
+
+## Run Inference and Transcribe
+
+Use the encoder ONNX from the Hugging Face repo above: set `--enc_onnx` to the path of `encoder_model.onnx` and `--model` to the matching Whisper model name (`turbo` for large-v3-turbo).
+
+```bash
+python run_whisper.py ^
+  --audio ".\audio_sample.wav" ^
+  --model turbo ^
+  --enc_onnx ".\encoder_model.onnx" ^
+  --ep_policy NPU
+```
+---
+## Notes
+This recipe runs the encoder through ONNX Runtime (ORT) and the decoder in PyTorch with a KV cache.
+
+## Command-Line Arguments
+
+---
+
+| Argument | Required | Default | Description |
+|---|---|---|---|
+| `--audio` | Yes | — | Path to input audio WAV file |
+| `--model` | No | `small` | Whisper model name: `small`, `medium`, `turbo` (must match the encoder ONNX; use `turbo` with the large-v3-turbo encoder) |
+| `--enc_onnx` | No | `encoder_model.onnx` | Path to encoder ONNX model file |
+| `--ep_policy` | No | `NPU` | Execution provider policy (`NPU`, `CPU`, `GPU`, `DEFAULT`) |
+
+---
+
+## Credits
+
+This project builds on [OpenAI Whisper](https://github.com/openai/whisper). We have borrowed code from that repository and extended it with ONNX Runtime Vitis AI EP integration, WinML execution providers, and the changes documented in this README. Whisper is licensed under the [MIT License](https://github.com/openai/whisper/blob/main/LICENSE).
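
Installation steps 2 and 3 of the README ask for a Windows App SDK runtime that matches the installed `wasdk` packages. A minimal sketch of that check, not part of the diff, using only the standard library; the helper names `wasdk_versions` and `single_version` are hypothetical, and the `wasdk-` prefix is assumed from the package names in `requirements.txt`:

```python
from importlib import metadata
from typing import Optional


def wasdk_versions(prefix: str = "wasdk-") -> dict:
    """Map each installed distribution whose name starts with `prefix` to its version."""
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name", "") or ""
        if name.lower().startswith(prefix):
            found[name] = dist.version
    return found


def single_version(versions: dict) -> Optional[str]:
    """Return the shared version if every package agrees on one, else None."""
    unique = set(versions.values())
    return unique.pop() if len(unique) == 1 else None


if __name__ == "__main__":
    versions = wasdk_versions()
    for name, version in sorted(versions.items()):
        print(f"{name}  {version}")
    # The Windows App SDK runtime installer should carry this same version.
    print("Runtime version to install:", single_version(versions) or "ambiguous or not installed")
```

If the two `wasdk` packages report different versions, `single_version` returns `None`, which is a cue to reinstall from `requirements.txt` before downloading the runtime installer.
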
diff --git a/openai-whisper-large-v3-turbo/VitisAI/audio_sample.wav b/openai-whisper-large-v3-turbo/VitisAI/audio_sample.wav
diff --git a/openai-whisper-large-v3-turbo/VitisAI/info.yml b/openai-whisper-large-v3-turbo/VitisAI/info.yml
@@ -0,0 +1,6 @@
+arch: whisper
+recipes:
+  - name: openai-whisper-large-v3-turbo-vitisai
+    file: run_whisper.py
+    devices: npu
+    eps: VitisAIExecutionProvider
diff --git a/openai-whisper-large-v3-turbo/VitisAI/requirements.txt b/openai-whisper-large-v3-turbo/VitisAI/requirements.txt
@@ -0,0 +1,37 @@
+# MIT License
+#
+# Copyright (c) 2022 OpenAI
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+
+--extra-index-url https://pypi.org/simple
+# Copyright (C) [2026] Advanced Micro Devices, Inc. All Rights Reserved.
+
+
+--index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple
+
+numpy
+onnxruntime-winml
+onnxscript
+scipy==1.17.1
+tiktoken==0.12.0
+torch
+transformers
+wasdk-Microsoft.Windows.AI.MachineLearning[all]==1.8.260209005
+wasdk-Microsoft.Windows.ApplicationModel.DynamicDependency.Bootstrap[all]==1.8.260209005
diff --git a/openai-whisper-large-v3-turbo/VitisAI/run_whisper.py b/openai-whisper-large-v3-turbo/VitisAI/run_whisper.py
@@ -0,0 +1,191 @@
+# MIT License
+#
+# Copyright (C) 2026, Advanced Micro Devices, Inc
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+
+# Copyright (C) [2026] Advanced Micro Devices, Inc. All Rights Reserved.
+
+import time
+import os
+import argparse
+import json
+import sys
+import numpy as np
+import whisper
+import subprocess
+import onnxruntime as ort
+from scipy.io import wavfile
+from scipy.signal import resample
+
+SAMPLE_RATE = 16000  # Whisper expects 16 kHz
+
+_SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
+
+
+def register_execution_providers(script_dir=None):
+    """Register WinML execution providers. script_dir: directory containing winml.py (default: this script's dir)."""
+    base = script_dir or _SCRIPT_DIR
+    worker_script = os.path.join(base, "winml.py")
+    result = subprocess.check_output([sys.executable, worker_script], text=True)
+    paths = json.loads(result)
+    for name, lib_path in paths.items():
+        if not lib_path or not os.path.exists(lib_path):
+            continue
+        ort.register_execution_provider_library(name, lib_path)
+
+def load_audio_no_ffmpeg(path: str) -> np.ndarray:
+    """Load WAV as mono 16 kHz float32 (no ffmpeg, no torchcodec). Uses scipy."""
+    sr, data = wavfile.read(path)
+    if data.dtype == np.int16:
+        data = data.astype(np.float32) / 32768.0
+    elif data.dtype == np.int32:
+        data = data.astype(np.float32) / 2147483648.0
+    if data.ndim == 2:
+        data = data.mean(axis=1)
+    if sr != SAMPLE_RATE:
+        n = int(len(data) * SAMPLE_RATE / sr)
+        data = resample(data, n)
+    return data.astype(np.float32)  # float64 or already-16 kHz input is also cast to float32
+
+
+def run_whisper(
+    audio_path: str = None,
+    audio: np.ndarray = None,
+    *,
+    enc_onnx: str = "encoder_model.onnx",
+    enc_cache_dir: str = "cacheDir",
+    vitisai_config: str = "vitisai_config.json",
+    model: str = "small",
+    download_root: str = None,
+    ep_policy: str = "NPU",
+    use_npu: bool = True,
+) -> dict:
+    """
+    Run Whisper E2E: load audio, run encoder (ONNX on NPU/CPU) + decoder, return transcription and metrics.
+
+    Provide either audio_path (path to WAV) or audio (float32 mono 16 kHz array). All other args are optional.
+
+    Returns dict with: text, detected_language, rtf, elapsed_sec, audio_duration_sec.
+    """
+    if audio_path is None and audio is None:
+        raise ValueError("Provide either audio_path or audio")
+    if audio_path is not None and audio is not None:
+        raise ValueError("Provide only one of audio_path or audio")
+
+    if download_root is None:
+        download_root = _SCRIPT_DIR
+
+    if audio_path is not None:
+        audio = load_audio_no_ffmpeg(audio_path)
+    audio = whisper.pad_or_trim(audio)
+
+    model_obj = whisper.load_model(model, download_root=download_root)
+    mel = whisper.log_mel_spectrogram(audio, n_mels=model_obj.dims.n_mels).to(model_obj.device)
+
+    _, probs = model_obj.detect_language(mel)
+    detected_language = max(probs, key=probs.get)
+
+    if use_npu:  # apply the requested policy for every --ep_policy value, not only NPU
+        register_execution_providers()
+        session_options = ort.SessionOptions()
+        policy_map = {
+            "NPU": ort.OrtExecutionProviderDevicePolicy.PREFER_NPU,
+            "CPU": ort.OrtExecutionProviderDevicePolicy.PREFER_CPU,
+            "GPU": ort.OrtExecutionProviderDevicePolicy.PREFER_GPU,
+            "DEFAULT": ort.OrtExecutionProviderDevicePolicy.DEFAULT,
+        }
+        policy = policy_map.get(ep_policy)
+        if policy is not None:
+            session_options.set_provider_selection_policy(policy)
+
+    cache_key = "encoder_model"
+    enc_config_path = vitisai_config  # honor the vitisai_config argument instead of a hard-coded path
+    options = whisper.DecodingOptions(
+        enc_use_onnx=bool(enc_onnx),
+        enc_onnx_fname=enc_onnx or "",
+        use_winml=use_npu,
+        enc_use_vitis=use_npu,
+        enc_cache_dir=enc_cache_dir if use_npu else "",
+        enc_cache_key=cache_key,
+        enc_config_json=enc_config_path if use_npu else "",
+    )
+
+    audio_duration_sec = audio.shape[0] / SAMPLE_RATE  # length after pad_or_trim, not the original clip length
+    t0 = time.perf_counter()
+    result = whisper.decode(model_obj, mel, options)
+    elapsed_sec = time.perf_counter() - t0
+    rtf = elapsed_sec / audio_duration_sec
+
+    return {
+        "text": result.text,
+        "detected_language": detected_language,
+        "rtf": rtf,
+        "elapsed_sec": elapsed_sec,
+        "audio_duration_sec": audio_duration_sec,
+    }
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--enc_onnx",
+        type=str,
+        default="encoder_model.onnx",
+        help="Path to encoder ONNX model file",
+    )
+    parser.add_argument("--audio", type=str, required=True, help="Path to input audio WAV file")
+    parser.add_argument(
+        "--ep_policy",
+        type=str,
+        default="NPU",
+        choices=["NPU", "CPU", "GPU", "DEFAULT"],
+        help="Execution provider policy (default: NPU)",
+    )
+    parser.add_argument(
+        "--download_root",
+        type=str,
+        default=None,
+        help="Directory to download/cache Whisper PyTorch model (default: script directory)",
+    )
+    parser.add_argument(
+        "--model",
+        type=str,
+        default="small",
+        choices=["small", "medium", "turbo"],
+        help="Whisper model name for load_model (default: small)",
+    )
+    args = parser.parse_args()
+
+    out = run_whisper(
+        audio_path=args.audio,
+        enc_onnx=args.enc_onnx,
+        model=args.model,
+        download_root=args.download_root,
+        ep_policy=args.ep_policy,
+        use_npu=True,
+    )
+    print("\n")
+    print("Transcription results:")
+    print(f"Detected language: {out['detected_language']}")
+    print(f"RTF: {out['rtf']:.4f}  (decode: {out['elapsed_sec']:.3f}s, audio: {out['audio_duration_sec']:.3f}s)")
+    print(out["text"])
+
+
+if __name__ == "__main__":
+    main()
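
`run_whisper` computes the real-time factor (RTF) as `elapsed_sec / audio_duration_sec`, with the duration taken from the array returned by `whisper.pad_or_trim`. A minimal sketch, not part of the diff, of measuring the original clip length instead, assuming the input is a PCM WAV that the standard-library `wave` module can open; `wav_duration_sec` and `real_time_factor` are hypothetical helper names:

```python
import wave


def wav_duration_sec(path: str) -> float:
    """Duration of a PCM WAV file in seconds, read from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(elapsed_sec: float, audio_duration_sec: float) -> float:
    """RTF = processing time / audio duration; below 1.0 means faster than real time."""
    if audio_duration_sec <= 0:
        raise ValueError("audio_duration_sec must be positive")
    return elapsed_sec / audio_duration_sec
```

Which denominator is appropriate depends on what the metric is meant to show: the padded length reflects the fixed-size window the encoder actually processes, while the original clip length reflects the audio the user supplied.
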
diff --git a/openai-whisper-large-v3-turbo/VitisAI/vitisai_config.json b/openai-whisper-large-v3-turbo/VitisAI/vitisai_config.json
@@ -0,0 +1,25 @@
+{
+  "passes": [
+    {
+      "name": "init",
+      "plugin": "vaip-pass_init"
+    },
+    {
+      "name": "vaiml_partition",
+      "plugin": "vaip-pass_vaiml_partition",
+      "vaiml_config": {
+        "fe_experiment": "use-accurate-mode=LayerNorm2PassAdf"
+      }
+    }
+  ],
+  "target": "VAIML",
+  "targets": [
+    {
+      "name": "VAIML",
+      "pass": [
+        "init",
+        "vaiml_partition"
+      ]
+    }
+  ]
+}
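
`vitisai_config.json` names its passes in `passes`, lists them again per target under `targets[].pass`, and selects a target with `target`. A minimal sketch, not part of the diff, of a consistency check over those cross-references. The interpretation that `pass` entries refer to `passes[].name` and that `target` selects from `targets[].name` is an assumption read off the file's shape, not taken from Vitis AI documentation; `check_vitisai_config` is a hypothetical helper:

```python
import json

# Copy of the config added in this diff, embedded so the sketch is self-contained.
SAMPLE_CONFIG = json.loads("""
{
  "passes": [
    {"name": "init", "plugin": "vaip-pass_init"},
    {
      "name": "vaiml_partition",
      "plugin": "vaip-pass_vaiml_partition",
      "vaiml_config": {"fe_experiment": "use-accurate-mode=LayerNorm2PassAdf"}
    }
  ],
  "target": "VAIML",
  "targets": [
    {"name": "VAIML", "pass": ["init", "vaiml_partition"]}
  ]
}
""")


def check_vitisai_config(config: dict) -> list:
    """Return human-readable problems; an empty list means all cross-references resolve."""
    problems = []
    pass_names = {p.get("name") for p in config.get("passes", [])}
    target_names = set()
    for target in config.get("targets", []):
        target_names.add(target.get("name"))
        for name in target.get("pass", []):
            if name not in pass_names:
                problems.append(
                    f"target {target.get('name')!r} references undefined pass {name!r}"
                )
    if config.get("target") not in target_names:
        problems.append(f"selected target {config.get('target')!r} is not listed in 'targets'")
    return problems


if __name__ == "__main__":
    print(check_vitisai_config(SAMPLE_CONFIG) or "config cross-references resolve")
```

Running such a check before starting a session can surface a renamed or misspelled pass early, rather than as a failure inside the execution provider.
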