Skip to content

os_dep/linux: usb_ops_linux: decrement per-AC counter on USB submit failure#200

Closed
breakneck-git wants to merge 1 commit into
morrownr:mainfrom
breakneck-git:fix-tx-counter-leak
Closed

os_dep/linux: usb_ops_linux: decrement per-AC counter on USB submit failure#200
breakneck-git wants to merge 1 commit into
morrownr:mainfrom
breakneck-git:fix-tx-counter-leak

Conversation

@breakneck-git

Copy link
Copy Markdown
Contributor

Summary

Fixes a permanent TX stall caused by per-AC queue counter leak on USB submit failure. Counter inflates by one on every submit error; once it crosses the threshold that triggers netif_stop_subqueue(), TX is stuck until module reload (the wake path lives in the completion callback that the failed URB will never reach).

Root cause

usb_write_port() increments the per-access-category counter (voq_cnt / viq_cnt / beq_cnt / bkq_cnt) under pxmitpriv->lock before calling usb_submit_urb(). The matching decrement lives only in the completion callback usb_write_port_complete(), which is invoked when the URB completes — including kill / disconnect / surprise-removed paths via usb_kill_urb().

_enter_critical(&pxmitpriv->lock, &irqL);
switch (addr) {
case VO_QUEUE_INX: pxmitpriv->voq_cnt++; pxmitbuf->flags = VO_QUEUE_INX; break;
...
}
_exit_critical(&pxmitpriv->lock, &irqL);

...

status = usb_submit_urb(purb, GFP_ATOMIC);
if (!status) {
    /* URB queued; usb_write_port_complete() will decrement on completion */
} else {
    rtw_sctx_done_err(...);
    RTW_INFO("usb_write_port, status=%d\n", status);
    /* ↑ counter NOT decremented here, but callback will NEVER run */
    switch (status) {
    case -ENODEV: rtw_set_drv_stopped(padapter); break;
    default: break;
    }
    goto exit;
}

When usb_submit_urb() returns non-zero (-ENOMEM, -ENODEV, -EPIPE, -ESHUTDOWN, transient USB stack errors), the URB was not queued. The completion callback will not run. The counter is permanently inflated by one.

Each subsequent submit failure compounds the leak. Once any of {voq,viq,beq,bkq}_cnt exceeds the threshold checked in rtw_os_need_stop_queue(), the TX path calls netif_stop_subqueue(). The corresponding wake also lives in usb_write_port_complete() — which won't run because there's no in-flight URB to complete. Result: permanent TX stall, recoverable only by rmmod 88XXau / modprobe 88XXau.

This is reachable on any USB link with intermittent errors (marginal cable, heat, EMI, autosuspend races, hub reconnect storms). The driver continues to look "associated" — beacons still arrive, MLME state is fine — only TX dies, which makes the bug particularly user-confusing.

Fix

Decrement the same counter that was incremented at the top of the function, under the same lock, on the submit-failure branch. Counter is selected by pxmitbuf->flags which was set under the lock at increment time, so the choice is unambiguous.

This mirrors exactly what usb_write_port_complete() would do if the URB had been queued, restoring the counter invariant.

Test environment

  • Realtek 8821AU (USB ID 0bda:0811) on Raspberry Pi 2B, kernel 6.12.47
  • Static analysis confirmed:
    • voq_cnt / viq_cnt / beq_cnt / bkq_cnt are touched in exactly 3 places (init=0, inc in this fn, dec in completion) — verified by exhaustive grep
    • usb_kill_urb() triggers the completion callback with error status, so the kill/cancel paths do hit the existing decrement (no double-decrement risk added by this patch)
  • I have not runtime-induced submit failures (would need EMI test or USB error injection); the failure mode is observable in dmesg as usb_write_port, status=-N lines accumulating, but the counter inflation itself is silent until queue stops

Disclosure

This patch is the result of a code-review pass conducted with the help of an LLM (Claude). The reasoning was verified by reading the increment site, the completion callback decrement site, the usb_kill_urb flow, and the Linux USB API contract for usb_submit_urb errors. I am not a kernel engineer.

…ailure

usb_write_port() unconditionally increments the per-access-category
counter (voq_cnt / viq_cnt / beq_cnt / bkq_cnt) under pxmitpriv->lock
before calling usb_submit_urb(). The matching decrement happens *only*
in the completion callback usb_write_port_complete(), which is invoked
when the URB completes (success, peer cancel, kill, etc.).

When usb_submit_urb() itself returns a non-zero error (e.g. -ENOMEM,
-ENODEV, -EPIPE, -ESHUTDOWN, transient USB stack errors), the URB is
*not* queued and the completion callback will *not* run. The function
then takes the error branch and `goto exit` straight to free_xmitbuf,
leaving the counter permanently inflated by one.

Each subsequent submit failure compounds the leak. Once
voq_cnt/viq_cnt/beq_cnt/bkq_cnt exceeds the per-AC threshold checked
in rtw_os_need_stop_queue() (NR_XMITBUFF/4), the TX path calls
netif_stop_subqueue() with no path to wake (since wake also lives in
the completion callback that never runs). Result: TX permanently
stalls and is recoverable only by module reload (rmmod 88XXau /
modprobe 88XXau).

In practice this is reachable on any USB link with intermittent errors
(marginal cable, heat, EMI, autosuspend race, hub disconnect/reconnect
storms). The driver continues to look "associated" because beacons
keep arriving and the MLME layer is unaffected -- only TX dies.

Fix: take the same lock the increment uses and decrement the same
counter (selected by pxmitbuf->flags, which was set under the lock at
increment time) before falling through to the existing -ENODEV
detection / goto exit. Mirrors the cleanup that
usb_write_port_complete() would do if the URB had been queued.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant