[Bug]: VCN encode → compute fence deadlock hangs RX 7900 XT (RDNA3), no TDR

**Describe the bug**

On an RX 7900 XT (RDNA3 / Navi 31), the VCN/AMF hardware video-encode engine deadlocks against the compute queue: the encoder backs up first, then a LiquidVR compute wait times out and never recovers, hanging the PCVR runtime. The codec/encode engine stays busy-but-**not saturated** (~60%) while the compute queue sits ~93% **idle** failing a multi-second wait — i.e. a **never-signaled fence (stall), not overload**. Retries fire on an **exact 2.000 s cadence** (deadlock retry loop). It is a **soft-hang**: no TDR / kernel Event 4101 fires, so Radeon GPU Detective cannot instrument it.

The most important point: the **identical** codec-busy / compute-starved deadlock also reproduces on a path that uses **no AMF at all** — a Pimax headset over wired DisplayPort DSC (the Pimax runtime logs `amd direct mode fence wait failed:2`). Two different runtimes, two different driver code paths (AMF encode vs "amd direct mode" DSC), and **two different driver builds** all hit the same fence-wait timeout. That points the root at the **core RDNA3 codec/encode → compute fence/semaphore signaling path**, not the AMF API surface or any one runtime.

**To Reproduce**
Steps to reproduce the behavior:
1. RX 7900 XT (RDNA3), latest Adrenalin, Windows 11.
2. Start a sustained PCVR session that drives AMF HW encode + LiquidVR compute interop continuously — e.g. Meta Quest 3 over **wired** Quest Link (Oculus runtime `OVRServer_x64`), under a steady/heavy encode load.
3. Run for an extended period (this reproduces near-nightly; it does not yet reproduce outside a full VR runtime — no minimal AMF-sample repro isolated yet).
4. Observe the sequence: encoder begins backing up (`dropped due to encoder backup`) → **~115 s / ~10,000 frames later** the LiquidVR compute wait starts timing out (`CPUWaitOnCompute ... -1003`), retrying on an **exact 2.000 s** cadence → the session wedges and does not self-recover. **No TDR fires.**
5. Cross-check (no AMF in the pipeline): the same deadlock reproduces on a Pimax Dream Air SE over wired DisplayPort DSC DirectMode (`pi_server`), which performs **no network video encode** — see the Debug Log and Additional context. This is what isolates the fault to the core driver fence-sync rather than AMF.

**Setup (please complete the following information):**
 - OS: Windows 11 Pro (build 26200)
 - Driver Version: **32.0.31019.2002 (Adrenalin 26.6.1)** — also reproduced on the prior **32.0.31007.5012 (Adrenalin 26.5.2)**, so it persists across ≥2 driver releases
 - GPU: Radeon **RX 7900 XT** (RDNA3 / Navi 31)
 - Which component has the issue: **Encoder** — VCN/AMF hardware encode completion fence vs the **compute queue** (LiquidVR interop) it should signal

**Debug Log (please upload or paste):**
```
=== Stack A — Meta Quest 3 / wired Quest Link (Oculus OVRServer_x64; AMF HW encode + LiquidVR) ===
# Oculus Service_*.txt:
... dropped due to encoder backup                                    # first at ~01:12:46
LiquidVRComputeContext::CPUWaitOnCompute: LiquidVR wait timed out  (-1003)   # first at ~01:14:41 (~115 s after encoder backup)
Flush ... -1006
# 58 CPUWaitOnCompute timeouts at an EXACT 2.000 s rearm cadence (deadlock retry loop, not slowness)
# AMD's own fault handler "amdfendrsr" fired ~69x over the Quest-era period (2026-06-05..06-17)

# Windows "GPU Engine" perf counters, sampled LIVE mid-hang:
#   video codec engine ~60-66%   |   3D ~39%   |   compute ~7%   <-- a 93%-idle compute queue failing a >700 s wait = never-signaled fence

=== Stack B — Pimax Dream Air SE / wired DisplayPort DSC, DirectMode (pi_server; NO AMF, no network encode) ===
# pvr_srv_log_*.txt  (level tag [err][PSRV]):
[err][PSRV] amd direct mode fence wait failed:2
[err][PSRV] QueueSemaphoreWait failed with err:2
[err][PSRV] QueueSemaphoreSignal failed with err:2
[err][PSRV] QueueTimestamp failed for begin with err:2
[err][PSRV] failed to Flush with err=2
# repeating ~5-line block @ ~5 ms/loop; sharp onset 708 lines in first 9 s; up to 29,539 lines in ~2.5 min
# 6 confirmed occurrences over 2026-06-21..06-23 (essentially nightly)

# Windows "GPU Engine" perf counters, sampled LIVE mid-hang:
#   video codec engine ~55%   |   3D ~55%   |   copy ~14%   |   compute <1% (starved, does not register)

NOTE: there is NO canonical AMF debug-log error line and NO TDR (no kernel 4101/4099, no dxgkrnl, no WHEA)
for this hang. It is a sub-threshold busy-wait the engine reports as "executing", so the OS TDR watchdog
never resets it and Radeon GPU Detective produces only a structurally-complete-but-EMPTY .rgd
(parsing with rgd.exe 1.6.3.7 -> "execution marker information missing [UMD]"). The evidence is therefore
the per-engine telemetry above (codec-busy / compute-starved) plus the cross-runtime + cross-build invariance.
I can capture a GPUView/ETW trace of the live hang on request (does not require a TDR).
```

**Expected behavior**

Under sustained encode/DSC backpressure, the VCN/codec-engine completion fence should signal back to the compute queue so the `CPUWaitOnCompute` / direct-mode fence wait completes and rendering continues. If a completion is genuinely lost, a per-engine watchdog should recover the stuck op rather than leaving the compute queue in an unbounded busy-wait that never trips TDR and forces a full runtime restart (which tears down the live VR session).

**Screenshots**

Per-engine GPU utilization captures (`gpu_engine.json` / Task Manager "GPU Engine" view) showing the codec engine ~55–66% busy while compute sits <1–7% idle for the entire stall are available on request — this is the smoking gun for "deadlock, not overload." Raw runtime logs for any listed occurrence can also be attached.

**Additional context**

- **Why this is the AMD driver (the decisive argument):**
  - Two different vendors' runtimes (Oculus `OVRServer_x64` vs Pimax `pi_server`) fail with the same codec-busy / compute-starved profile.
  - Two different driver code paths — AMF HW-encode + LiquidVR vs "amd direct mode" DSC-over-DP — hit the same fence-wait timeout.
  - Stack B has **no AMF / no network encoder** and does **not** route through Meta's or Valve's compositor, yet still deadlocks → root is **core GPU-queue/fence synchronization**, broader than AMF.
  - Survives **clean GPU-driver reinstalls** and reproduces on **two driver versions** (26.5.2 and 26.6.1) — so it is not a corrupt install or a single bad version. The AMD GPU driver is the **one component present in every failure**.
- **Ruled out** (disclosed): not overload (compute 93–99% idle), not VRAM (11.7/20 GB), not RAM (10–24 GB free), not CPU (40–77%), not disk paging (queue ~0 at onset; paging rises only after the wedge), not Meta/Valve (different runtimes fail identically), not the Pimax runtime (it merely **times out** the dead AMD-owned fence with err:2; a `pi_server` build downgrade 55→54 didn't help). Possible VCN co-tenant (a screen recorder occasionally shares the codec engine) is disclosed but the storm predates it and occurs with it absent.
- **Extensive mitigations that did NOT help** — each rules out a confound, further isolating the AMD driver. **Nothing tried has fixed or even changed the deadlock:**
  - **Clean GPU-driver reinstall** (full DDU uninstall + reinstall) and **multiple driver versions** tried — not a corrupt install; reproduces on 26.5.2 *and* 26.6.1.
  - **Bought two different-brand VR headsets** specifically to exercise different display/driver paths — Meta Quest 3 (AMF stream-encode) and Pimax Dream Air SE (wired DisplayPort DSC). The deadlock **followed across both brands** (this is the cross-runtime evidence above) — so it is not a single headset or its vendor runtime.
  - **GPU under-clock and over-clock** both tried — no effect (rules out core/memory clock instability, thermal, and voltage/boost behavior).
  - **Chipset drivers reinstalled** and **VR runtime/drivers reinstalled** — no effect.
  - **Cable swaps** (DisplayPort / USB / Link cables) — no effect (rules out a bad physical link).
  - **Pimax runtime downgrade** (build 55 → 54); **`TdrDelay` 10 s → 2 s + reboot** (confirmed `TdrDelay=0x2` live; deadlock 4 min after boot still produced no TDR); **SteamVR stable → beta**.
  - Net: among everything that was swapped (headset brand, runtime, cables, GPU clocks, chipset driver, GPU-driver reinstalls and versions), the **AMD GPU driver is the one component present in every single failure** — and it stayed broken through reinstalls and across versions, which points at a defect in the driver code rather than a bad install or one bad version.
- **A second, related variant** on the Pimax path (which I'm filing as its own separate issue): GPU pinned 100% with delivered fps free-falling 72 → 0.25, `err:19`, ending in a HID-service disconnect and `rpc exception:1722` storm — possibly the same fence deadlock surfacing differently.
- **Related (not duplicate) precedents** confirming an AMD VCN codec path can wedge/crash the driver — each checked against this signature and distinct from it:
  - **#327** — H265 game recording (OBS/ReLive) *semi-regularly crashes the display driver* with **"encoder usage looks nominal, and GPU memory shows no signs of being overloaded"** — i.e. an encode-path driver fault that is explicitly *not* an overload. Distinct from this report only in that it is a hard driver crash, not a no-TDR soft-hang.
  - **#12** — driver crash inside `SubmitInput`: *"Occasionally a BSOD ... `THREAD_STUCK_IN_DEVICE_DRIVER`"* — the encode submit path wedging the driver.
  - **#482** — HEVC/AV1 *decode* → driver-timeout exception in `amfrtdrv64.dll` (an AMD VCN codec path faulting the driver).
  - **#516** — encoder intermittently stalls with `INPUT_QUEUE_FULL` / `AMF_REPEAT` (a single-threaded API-queue synchronization issue, not a GPU fence).
  - **#364** — Oculus AirLink encoding-speed cap (VR-adjacent, but a throughput cap, not a hang).
  - External: **HandBrake #4444** — AMD VCE HEVC preset → driver timeout (`SubmitInput` error 18) in a third-party app.
- **I searched this tracker (2026-06-23)** for fence / deadlock / VCN / compute / `CPUWaitOnCompute` / LiquidVR / "direct mode" / VR / AirLink / driver-timeout / semaphore. **No existing issue matches this signature** — encode→compute fence deadlock with the compute queue *starved*, no TDR, also reproducing on a non-AMF DisplayPort-DSC path. The adjacent issues above are crashes, perf caps, or config-specific encode hangs, not this — so I'm filing fresh. A parallel report via the in-driver AMD Bug Report Tool is being submitted; happy to cross-link the case and to run any capture (GPUView/ETW, extended per-engine counters, Medal-off A/B, driver-rollback A/B) you want.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bug]: VCN encode → compute fence deadlock hangs RX 7900 XT (RDNA3), no TDR #599

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

[Bug]: VCN encode → compute fence deadlock hangs RX 7900 XT (RDNA3), no TDR #599

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions