[Bug]: VCN encode → compute fence deadlock hangs RX 7900 XT (RDNA3), no TDR #599

Description

@Sedation6612

Describe the bug

On an RX 7900 XT (RDNA3 / Navi 31), the VCN/AMF hardware video-encode engine deadlocks against the compute queue: the encoder backs up first, then a LiquidVR compute wait times out and never recovers, hanging the PCVR runtime. The codec/encode engine stays busy but not saturated (~60%) while the compute queue sits ~93% idle and still fails a multi-second wait, which indicates a never-signaled fence (a stall), not overload. Retries fire on an exact 2.000 s cadence (a deadlock retry loop). It is a soft hang: no TDR / kernel Event 4101 fires, so Radeon GPU Detective cannot instrument it.

The most important point: the identical codec-busy / compute-starved deadlock also reproduces on a path that uses no AMF at all, a Pimax headset over wired DisplayPort DSC (the Pimax runtime logs amd direct mode fence wait failed:2). Two different runtimes, two different driver code paths (AMF encode vs "amd direct mode" DSC), and two different driver builds all hit the same fence-wait timeout. That points to a root cause in the core RDNA3 codec/encode → compute fence/semaphore signaling path, not in the AMF API surface or any one runtime.

To Reproduce
Steps to reproduce the behavior:

  1. RX 7900 XT (RDNA3), latest Adrenalin, Windows 11.
  2. Start a sustained PCVR session that drives AMF HW encode + LiquidVR compute interop continuously — e.g. Meta Quest 3 over wired Quest Link (Oculus runtime OVRServer_x64), under a steady/heavy encode load.
  3. Run for an extended period (this reproduces near-nightly; it does not yet reproduce outside a full VR runtime — no minimal AMF-sample repro isolated yet).
  4. Observe the sequence: encoder begins backing up (dropped due to encoder backup) → ~115 s / ~10,000 frames later the LiquidVR compute wait starts timing out (CPUWaitOnCompute ... -1003), retrying on an exact 2.000 s cadence → the session wedges and does not self-recover. No TDR fires.
  5. Cross-check (no AMF in the pipeline): the same deadlock reproduces on a Pimax Dream Air SE over wired DisplayPort DSC DirectMode (pi_server), which performs no network video encode — see the Debug Log and Additional context. This is what isolates the fault to the core driver fence-sync rather than AMF.

Setup (please complete the following information):

  • OS: Windows 11 Pro (build 26200)
  • Driver Version: 32.0.31019.2002 (Adrenalin 26.6.1) — also reproduced on the prior 32.0.31007.5012 (Adrenalin 26.5.2), so it persists across ≥2 driver releases
  • GPU: Radeon RX 7900 XT (RDNA3 / Navi 31)
  • Which component has the issue: Encoder — VCN/AMF hardware encode completion fence vs the compute queue (LiquidVR interop) it should signal

Debug Log (please upload or paste):

=== Stack A — Meta Quest 3 / wired Quest Link (Oculus OVRServer_x64; AMF HW encode + LiquidVR) ===
# Oculus Service_*.txt:
... dropped due to encoder backup                                    # first at ~01:12:46
LiquidVRComputeContext::CPUWaitOnCompute: LiquidVR wait timed out  (-1003)   # first at ~01:14:41 (~115 s after encoder backup)
Flush ... -1006
# 58 CPUWaitOnCompute timeouts at an EXACT 2.000 s rearm cadence (deadlock retry loop, not slowness)
# AMD's own fault handler "amdfendrsr" fired ~69x over the Quest-era period (2026-06-05..06-17)

# Windows "GPU Engine" perf counters, sampled LIVE mid-hang:
#   video codec engine ~60-66%   |   3D ~39%   |   compute ~7%   <-- a 93%-idle compute queue failing a >700 s wait = never-signaled fence

=== Stack B — Pimax Dream Air SE / wired DisplayPort DSC, DirectMode (pi_server; NO AMF, no network encode) ===
# pvr_srv_log_*.txt  (level tag [err][PSRV]):
[err][PSRV] amd direct mode fence wait failed:2
[err][PSRV] QueueSemaphoreWait failed with err:2
[err][PSRV] QueueSemaphoreSignal failed with err:2
[err][PSRV] QueueTimestamp failed for begin with err:2
[err][PSRV] failed to Flush with err=2
# repeating ~5-line block @ ~5 ms/loop; sharp onset 708 lines in first 9 s; up to 29,539 lines in ~2.5 min
# 6 confirmed occurrences over 2026-06-21..06-23 (essentially nightly)

# Windows "GPU Engine" perf counters, sampled LIVE mid-hang:
#   video codec engine ~55%   |   3D ~55%   |   copy ~14%   |   compute <1% (starved, does not register)

NOTE: there is NO canonical AMF debug-log error line and NO TDR (no kernel 4101/4099, no dxgkrnl, no WHEA)
for this hang. It is a sub-threshold busy-wait the engine reports as "executing", so the OS TDR watchdog
never resets it and Radeon GPU Detective produces only a structurally-complete-but-EMPTY .rgd
(parsing with rgd.exe 1.6.3.7 -> "execution marker information missing [UMD]"). The evidence is therefore
the per-engine telemetry above (codec-busy / compute-starved) plus the cross-runtime + cross-build invariance.
I can capture a GPUView/ETW trace of the live hang on request (does not require a TDR).
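
For anyone who wants to check the "exact 2.000 s cadence" claim against their own Service_*.txt, here is a minimal Python sketch. It assumes each log line begins with an HH:MM:SS.mmm timestamp token, which is a hypothetical layout chosen for illustration; the function names, the tolerance, and the sample lines are likewise made up for the example and are not taken from the real logs.

```python
from datetime import datetime

def timeout_intervals(lines, marker="CPUWaitOnCompute", ts_format="%H:%M:%S.%f"):
    """Return the seconds between consecutive lines that contain `marker`.

    Assumption: the first whitespace-delimited token of each line is a
    timestamp in `ts_format`. Logs that cross midnight are not handled.
    """
    stamps = []
    for line in lines:
        if marker not in line:
            continue
        token = line.split(maxsplit=1)[0]
        stamps.append(datetime.strptime(token, ts_format))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]

def is_fixed_cadence(intervals, expected=2.0, tolerance=0.005):
    """True if every interval is within `tolerance` seconds of `expected`."""
    return bool(intervals) and all(abs(i - expected) <= tolerance for i in intervals)

if __name__ == "__main__":
    # Illustrative lines only (hypothetical timestamps and layout).
    sample = [
        "01:14:41.000 LiquidVRComputeContext::CPUWaitOnCompute: LiquidVR wait timed out (-1003)",
        "01:14:41.500 Flush ... -1006",
        "01:14:43.000 LiquidVRComputeContext::CPUWaitOnCompute: LiquidVR wait timed out (-1003)",
        "01:14:45.000 LiquidVRComputeContext::CPUWaitOnCompute: LiquidVR wait timed out (-1003)",
    ]
    gaps = timeout_intervals(sample)
    print(gaps, is_fixed_cadence(gaps))
```

A fixed interval with near-zero jitter is what a timer-driven retry on a dead fence would look like; a merely slow queue would be expected to produce varying gaps.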

Expected behavior

Under sustained encode/DSC backpressure, the VCN/codec-engine completion fence should signal back to the compute queue so the CPUWaitOnCompute / direct-mode fence wait completes and rendering continues. If a completion is genuinely lost, a per-engine watchdog should recover the stuck op rather than leaving the compute queue in an unbounded busy-wait that never trips TDR and forces a full runtime restart (which tears down the live VR session).
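
To make the expected recovery behavior concrete, here is a user-space analogy in Python. It is only a sketch of "bounded wait, then recover" and does not describe how the driver, LiquidVR, or any AMD watchdog is actually implemented; `wait_with_watchdog`, its parameters, and the `recover` callback are hypothetical names, and a `threading.Event` stands in for the completion fence.

```python
import threading

def wait_with_watchdog(fence, retry_timeout_s, max_retries, recover):
    """Wait on `fence` in bounded slices; invoke `recover` if it never signals.

    Returns ("signaled", attempt) when the fence fires, or
    ("recovered", max_retries) after the retry budget is exhausted.
    """
    for attempt in range(1, max_retries + 1):
        # Event.wait returns True as soon as the event is set,
        # or False once the timeout elapses.
        if fence.wait(timeout=retry_timeout_s):
            return ("signaled", attempt)
    # The completion was lost: reset the stuck operation instead of
    # retrying forever on a fence that will never signal.
    recover()
    return ("recovered", max_retries)

if __name__ == "__main__":
    lost_fence = threading.Event()  # never set, i.e. a never-signaled fence
    result = wait_with_watchdog(
        lost_fence,
        retry_timeout_s=0.05,   # short slice for the demo; the logs show 2.000 s
        max_retries=3,
        recover=lambda: print("watchdog: resetting stuck op"),
    )
    print(result)
```

The observed behavior corresponds to the loop above with no retry budget and no `recover` step: the wait rearms every 2.000 s indefinitely.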

Screenshots

Per-engine GPU utilization captures (gpu_engine.json / Task Manager "GPU Engine" view) are available on request. They show the codec engine ~55–66% busy while compute sits at <1–7% utilization (93–99% idle) for the entire stall, which is the smoking gun for "deadlock, not overload." Raw runtime logs for any listed occurrence can also be attached.
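
The "deadlock, not overload" reading of those counters can be written down as a simple rule. The Python sketch below is illustrative only: the dictionary keys and the three thresholds are assumptions picked for the example (they are not the schema of gpu_engine.json or any AMD/Windows definition), and the two samples reuse the approximate percentages quoted in the Debug Log.

```python
def classify_stall(sample, busy_floor=50.0, saturated=90.0, idle_ceiling=10.0):
    """Classify one per-engine utilization sample (percent busy per engine).

    Assumed keys: "video_codec" and "compute". Thresholds are hypothetical.
    """
    codec = sample["video_codec"]
    compute = sample["compute"]
    if codec >= saturated or compute >= saturated:
        # An engine is pegged: plain overload is a plausible explanation.
        return "overload"
    if codec >= busy_floor and compute <= idle_ceiling:
        # Codec is working but has headroom, compute is nearly idle, yet the
        # compute-side wait still fails: consistent with a never-signaled fence.
        return "codec-busy / compute-starved"
    return "inconclusive"

if __name__ == "__main__":
    stack_a = {"video_codec": 60.0, "3d": 39.0, "compute": 7.0}               # Quest Link sample
    stack_b = {"video_codec": 55.0, "3d": 55.0, "copy": 14.0, "compute": 0.5}  # Pimax sample (<1%)
    for name, sample in (("Stack A", stack_a), ("Stack B", stack_b)):
        print(name, "->", classify_stall(sample))
```

Both samples land in the same bucket even though they come from different runtimes and driver paths, which is the invariance the report relies on.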

Additional context

  • Why this is the AMD driver (the decisive argument):
    • Two different vendors' runtimes (Oculus OVRServer_x64 vs Pimax pi_server) fail with the same codec-busy / compute-starved profile.
    • Two different driver code paths — AMF HW-encode + LiquidVR vs "amd direct mode" DSC-over-DP — hit the same fence-wait timeout.
    • Stack B has no AMF / no network encoder and does not route through Meta's or Valve's compositor, yet still deadlocks → root is core GPU-queue/fence synchronization, broader than AMF.
    • Survives clean GPU-driver reinstalls and reproduces on two driver versions (26.5.2 and 26.6.1) — so it is not a corrupt install or a single bad version. The AMD GPU driver is the one component present in every failure.
  • Ruled out (disclosed): not overload (compute 93–99% idle), not VRAM (11.7/20 GB), not RAM (10–24 GB free), not CPU (40–77%), not disk paging (queue ~0 at onset; paging rises only after the wedge), not Meta/Valve (different runtimes fail identically), not the Pimax runtime (it merely times out the dead AMD-owned fence with err:2; a pi_server build downgrade 55→54 didn't help). Possible VCN co-tenant (a screen recorder occasionally shares the codec engine) is disclosed but the storm predates it and occurs with it absent.
  • Extensive mitigations that did NOT help — each rules out a confound, further isolating the AMD driver. Nothing tried has fixed or even changed the deadlock:
    • Clean GPU-driver reinstall (full DDU uninstall + reinstall) and multiple driver versions tried — not a corrupt install; reproduces on 26.5.2 and 26.6.1.
    • Bought two different-brand VR headsets specifically to exercise different display/driver paths — Meta Quest 3 (AMF stream-encode) and Pimax Dream Air SE (wired DisplayPort DSC). The deadlock followed across both brands (this is the cross-runtime evidence above) — so it is not a single headset or its vendor runtime.
    • GPU under-clock and over-clock both tried — no effect (rules out core/memory clock instability, thermal, and voltage/boost behavior).
    • Chipset drivers reinstalled and VR runtime/drivers reinstalled — no effect.
    • Cable swaps (DisplayPort / USB / Link cables) — no effect (rules out a bad physical link).
    • Pimax runtime downgrade (build 55 → 54), TdrDelay reduced 10 s → 2 s plus reboot (confirmed TdrDelay=0x2 live; a deadlock 4 min after boot still produced no TDR), and SteamVR stable → beta: no effect from any of the three.
    • Net: among everything that was swapped (headset brand, runtime, cables, GPU clocks, chipset driver, GPU-driver reinstalls and versions), the AMD GPU driver is the one component present in every single failure — and it stayed broken through reinstalls and across versions, which points at a defect in the driver code rather than a bad install or one bad version.
  • A second, related variant on the Pimax path (which I'm filing as its own separate issue): GPU pinned 100% with delivered fps free-falling 72 → 0.25, err:19, ending in a HID-service disconnect and rpc exception:1722 storm — possibly the same fence deadlock surfacing differently.
  • Related (not duplicate) precedents confirming an AMD VCN codec path can wedge/crash the driver — each checked against this signature and distinct from it:
  • I searched this tracker (2026-06-23) for fence / deadlock / VCN / compute / CPUWaitOnCompute / LiquidVR / "direct mode" / VR / AirLink / driver-timeout / semaphore. No existing issue matches this signature — encode→compute fence deadlock with the compute queue starved, no TDR, also reproducing on a non-AMF DisplayPort-DSC path. The adjacent issues above are crashes, perf caps, or config-specific encode hangs, not this — so I'm filing fresh. A parallel report via the in-driver AMD Bug Report Tool is being submitted; happy to cross-link the case and to run any capture (GPUView/ETW, extended per-engine counters, Medal-off A/B, driver-rollback A/B) you want.
