Upload Media: Retry transient wasm-vips worker decode failures #79378
adamsilverstein wants to merge 1 commit into
In the cross-origin-isolated, multi-threaded wasm-vips worker, image decode calls (resize, transcode, rotate) can intermittently fail when the worker receives a short or garbled source buffer under heavy main-thread contention. libheif aborts with a "bad seek" / "Bitstream not supported" error that surfaces as a generic IMAGE_TRANSCODING_ERROR and cancels the upload, even though the same bytes decode correctly on a later attempt (and in Node and in manual Chrome).

Wrap the four File-based worker calls in a bounded retry helper that re-reads the source buffer and re-issues the call up to VIPS_MAX_ATTEMPTS times with a short backoff, short-circuiting on an aborted signal. Mechanism-agnostic: whatever yields the transient short buffer, a fresh re-read on a later task recovers; genuinely undecodable images still fail after the final attempt.

Add unit tests covering retry-then-succeed, give-up-after-max, and stop-on-abort.

Issue: WordPress#79377
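The bounded retry described in the commit message could look roughly like the sketch below. It is not the PR's actual code: only the names withVipsRetry and VIPS_MAX_ATTEMPTS (and the value 3, stated in the PR description) come from the source. The ByteSource and AbortLike types, the 50 ms base delay, the linear backoff, and the error thrown on abort are all assumptions made for illustration.

```typescript
// Illustrative sketch only. Apart from withVipsRetry and VIPS_MAX_ATTEMPTS,
// every name and value here is hypothetical.

/** Anything that can hand back its bytes, e.g. a browser File or Blob. */
interface ByteSource {
	arrayBuffer(): Promise<ArrayBuffer>;
}

/** Minimal view of an AbortSignal; a real AbortSignal satisfies it. */
interface AbortLike {
	readonly aborted: boolean;
}

const VIPS_MAX_ATTEMPTS = 3; // value given in the PR description
const VIPS_RETRY_DELAY_MS = 50; // assumed: the PR only says "short backoff"

/** Resolve after roughly `ms` milliseconds. */
function sleep(ms: number): Promise<void> {
	return new Promise((resolve) => setTimeout(resolve, ms));
}

/**
 * Re-read the source bytes and re-run `operation` until it succeeds,
 * the attempt budget is spent, or the signal is aborted.
 */
async function withVipsRetry<T>(
	source: ByteSource,
	operation: (buffer: ArrayBuffer) => Promise<T>,
	signal?: AbortLike,
	delayMs: number = VIPS_RETRY_DELAY_MS
): Promise<T> {
	for (let attempt = 1; attempt <= VIPS_MAX_ATTEMPTS; attempt++) {
		// Covers both "aborted before starting" and "aborted during backoff".
		if (signal?.aborted) {
			throw new Error('Aborted');
		}
		try {
			// A fresh read on every attempt, never a reused buffer.
			const buffer = await source.arrayBuffer();
			return await operation(buffer);
		} catch (error) {
			// Give up on the final attempt, or at once if the caller aborted.
			if (attempt === VIPS_MAX_ATTEMPTS || signal?.aborted) {
				throw error;
			}
			await sleep(delayMs * attempt); // assumed linear backoff
		}
	}
	// Not reachable: the final attempt either returns or throws.
	throw new Error('unreachable');
}
```

Re-reading inside the loop is the point of the design: if the first read yielded a short buffer, reusing it would make every retry fail in the same way. A genuinely undecodable image still throws on each attempt, so its error reaches the caller after the third try.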
67ce4dc to e6fbda8
Heads-up after root-causing the CSM Chrome 148+ CI failures this PR was meant to address:

Those failures were not a wasm-vips decode race. They were test-expectation mismatches in

The actual fixes (correct sub-size vs main-file format expectations, CI-robust

So this retry change is not required to fix that CI. It may still have narrow standalone value as resilience against genuinely transient libheif decode hiccups under load, but I couldn't confirm any current test actually needs it (the one deterministic local AVIF decode failure I saw was not helped by the retry, and AVIF passes in CI).

Recommend closing this unless we want to keep it purely as decode-resilience hardening, in which case the description should be re-scoped away from the Chrome 148 framing.
Closing as not needed.
What?
Make wasm-vips worker image-decode operations (resize, transcode, rotate) resilient to a transient failure by retrying a small, bounded number of times before giving up.
Fixes #79377.
Why?
When client-side media processing (CSM) is active (cross-origin isolated, multi-threaded wasm-vips), an upload can intermittently fail with a generic IMAGE_TRANSCODING_ERROR/IMAGE_ROTATION_ERROR. The real error from the worker is a libheif decode abort ("bad seek" / "Bitstream not supported").

Every failing seek is just past the source file's real EOF, i.e. the worker transiently receives a short or wrong-length buffer. The same wasm-vips build decodes the same fixtures correctly in Node and in manual Chrome, and the failure is timing-sensitive (it flips with unrelated main-thread work), so it is a race in the worker decode path, not a decoder, fixture, or Document-Isolation-Policy (DIP) bug. It became visible when the e2e browser moved to Chrome >= 148, the first CI browser to support DIP, which is what activates CSM in CI.
How?
packages/upload-media/src/store/utils/index.ts: wrap the four File-based worker calls in withVipsRetry(), which re-reads file.arrayBuffer() and re-issues the call up to VIPS_MAX_ATTEMPTS (3) times with a short backoff, short-circuiting on an aborted signal. It is mechanism-agnostic: whatever produces the transient short buffer, a fresh re-read on a later task recovers. Genuinely undecodable images still fail after the final attempt.
Testing Instructions
npm run test:unit packages/upload-media/src/store/test/vips.ts - new cases cover retry-then-succeed, give-up-after-max, and stop-on-abort.
Screenshots or screencast
N/A - no UI change.