
Upload Media: Retry transient wasm-vips worker decode failures #79378

Closed
adamsilverstein wants to merge 1 commit into WordPress:trunk from adamsilverstein:fix/csm-vips-decode-retry

@adamsilverstein adamsilverstein commented Jun 21, 2026

Member

What?

Make wasm-vips worker image-decode operations (resize, transcode, rotate) resilient to a transient failure by retrying a small, bounded number of times before giving up.

Fixes #79377.

Why?

When client-side media processing is active (cross-origin isolated, multi-threaded wasm-vips), an upload can intermittently fail with a generic IMAGE_TRANSCODING_ERROR / IMAGE_ROTATION_ERROR. The real error from the worker is a libheif decode abort:

unable to call thumbnail_buffer
source: bad seek to 1109
heif: Invalid input: Bitstream not supported by this decoder (2.0)

Every failing seek is just past the source file's real EOF - i.e. the worker transiently receives a short / wrong-length buffer. The same wasm-vips decodes the same fixtures correctly in Node and in manual Chrome, and the failure is timing-sensitive (it flips with unrelated main-thread work), so it is a race in the worker decode path - not a decoder, fixture, or Document-Isolation-Policy bug. It became visible when the e2e browser moved to Chrome >= 148, the first CI browser to support DIP, which is what activates CSM in CI.

How?

packages/upload-media/src/store/utils/index.ts: wrap the four File-based worker calls in withVipsRetry(), which re-reads file.arrayBuffer() and re-issues the call up to VIPS_MAX_ATTEMPTS (3) times with a short backoff, short-circuiting on an aborted signal. It is mechanism-agnostic: whatever produces the transient short buffer, a fresh re-read on a later task recovers. Genuinely undecodable images still fail after the final attempt.
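The retry behaviour described above can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's actual code: only the names `withVipsRetry` and `VIPS_MAX_ATTEMPTS` (and its value of 3) come from the description. The signature, the `SourceFile` and `AbortLike` types, the `sleep` helper, the linear backoff, and the 50 ms base delay are all assumptions.

```typescript
// Taken from the PR description: at most three attempts.
const VIPS_MAX_ATTEMPTS = 3;
// Assumed value: the description only says "a short backoff".
const VIPS_RETRY_DELAY_MS = 50;

// Hypothetical minimal shapes so the sketch needs no DOM typings.
// A browser File and an AbortSignal both satisfy them structurally.
interface SourceFile {
	arrayBuffer(): Promise<ArrayBuffer>;
}
interface AbortLike {
	readonly aborted: boolean;
}

function sleep(ms: number): Promise<void> {
	return new Promise((resolve) => setTimeout(resolve, ms));
}

async function withVipsRetry<T>(
	file: SourceFile,
	operation: (buffer: ArrayBuffer) => Promise<T>,
	signal?: AbortLike,
	delayMs: number = VIPS_RETRY_DELAY_MS
): Promise<T> {
	let lastError: unknown;
	for (let attempt = 1; attempt <= VIPS_MAX_ATTEMPTS; attempt++) {
		// Short-circuit: never start (or restart) work for a cancelled upload.
		if (signal?.aborted) {
			break;
		}
		try {
			// Re-read the source on every attempt so a transient short
			// buffer from an earlier read is not reused.
			const buffer = await file.arrayBuffer();
			return await operation(buffer);
		} catch (error) {
			lastError = error;
			// Back off before the next attempt, but not after the last one.
			if (attempt < VIPS_MAX_ATTEMPTS && !signal?.aborted) {
				await sleep(delayMs * attempt);
			}
		}
	}
	// Either every attempt failed or the signal was aborted.
	throw lastError ?? new Error('Operation aborted');
}
```

A caller would pass the worker call as the `operation` callback, for example `withVipsRetry(file, (buffer) => resizeInWorker(buffer), signal)`, where `resizeInWorker` is a placeholder for whichever worker function is being wrapped. Because the last error is rethrown unchanged, a genuinely undecodable image fails the same way it would without the wrapper, just after three attempts.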

Testing Instructions

  • npm run test:unit packages/upload-media/src/store/test/vips.ts - new cases cover retry-then-succeed, give-up-after-max, and stop-on-abort.
  • Full validation of the race fix requires running the client-side-media e2e suite on Chrome >= 148 with CSM active (e.g. on the Playwright-upgrade branch), where the suite is currently skip-gated. The race is fragile and contention-dependent and does not reproduce reliably on an idle dev machine, so the unit tests plus a CI run on Chrome >= 148 are the validation path.

Screenshots or screencast

N/A - no UI change.

In the cross-origin-isolated, multi-threaded wasm-vips worker, image decode
calls (resize, transcode, rotate) can intermittently fail when the worker
receives a short or garbled source buffer under heavy main-thread
contention. libheif aborts with a "bad seek" / "Bitstream not supported"
error that surfaces as a generic IMAGE_TRANSCODING_ERROR and cancels the
upload, even though the same bytes decode correctly on a later attempt (and
in Node and in manual Chrome).

Wrap the four File-based worker calls in a bounded retry helper that
re-reads the source buffer and re-issues the call up to VIPS_MAX_ATTEMPTS
times with a short backoff, short-circuiting on an aborted signal.
Mechanism-agnostic: whatever yields the transient short buffer, a fresh
re-read on a later task recovers; genuinely undecodable images still fail
after the final attempt. Add unit tests covering retry-then-succeed,
give-up-after-max, and stop-on-abort.

Issue: WordPress#79377
@adamsilverstein adamsilverstein force-pushed the fix/csm-vips-decode-retry branch from 67ce4dc to e6fbda8 Compare June 21, 2026 04:53
@adamsilverstein adamsilverstein marked this pull request as ready for review June 21, 2026 15:41
@github-actions

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message.

Co-authored-by: adamsilverstein <adamsilverstein@git.wordpress.org>

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@adamsilverstein

Member Author

Heads-up after root-causing the CSM Chrome 148+ CI failures this PR was meant to address:

Those failures were not a wasm-vips decode race. They were test-expectation mismatches in client-side-media-processing.spec.js, exposed for the first time because the suite only runs once the browser is cross-origin isolated (which Document-Isolation-Policy enables) — it had always silently skipped on Chrome < 148, so the mismatches were never caught. wasm-vips itself processes correctly on Chrome 149 across main-thread, nested-worker, COOP/COEP and Document-Isolation-Policy: isolate-and-credentialless isolation (verified with isolated harnesses + instrumented e2e runs).
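For context, the "only runs once the browser is cross-origin isolated" condition can be approximated with a capability check like the one below. This is a sketch under stated assumptions: `crossOriginIsolated` and `SharedArrayBuffer` are standard globals, but the function name, the injectable `env` parameter, and the idea that the spec file gates on exactly these two checks are hypothetical.

```typescript
// Hypothetical view of the globals the check reads; injectable for testing.
interface IsolationEnv {
	crossOriginIsolated?: boolean;
	SharedArrayBuffer?: unknown;
}

// Returns true only when the context is cross-origin isolated and
// SharedArrayBuffer is available, the usual prerequisites for
// multi-threaded WebAssembly.
function canUseThreadedWasm(
	env: IsolationEnv = globalThis as unknown as IsolationEnv
): boolean {
	return (
		env.crossOriginIsolated === true &&
		typeof env.SharedArrayBuffer !== 'undefined'
	);
}
```

A suite that skips whenever such a check returns false will report no failures on a browser that never satisfies it, which is consistent with the mismatches going unnoticed until the CI browser changed.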

The actual fixes (correct sub-size vs main-file format expectations, CI-robust wasm-vips import, srcset/block-URL assertion, EXIF rotation marked fixme as a real gap) plus dropping the incorrect Chromium >= 148 skip gate have landed on #79342, and the CSM suite passes on Chrome 148+ there (14 pass, 1 skip).

So this retry change is not required to fix that CI. It may still have narrow standalone value as resilience against genuinely transient libheif decode hiccups under load, but I couldn't confirm any current test actually needs it (the one deterministic local AVIF decode failure I saw was not helped by the retry, and AVIF passes in CI). Recommend closing this unless we want to keep it purely as decode-resilience hardening, in which case the description should be re-scoped away from the Chrome 148 framing.

@adamsilverstein adamsilverstein added the [Type] Bug label (an existing feature does not function as intended) Jun 21, 2026
@adamsilverstein

Member Author

Closing as not needed.


Labels

[Type] Bug An existing feature does not function as intended


Development

Successfully merging this pull request may close these issues.

Client-side media: intermittent wasm-vips decode failure under automation on cross-origin-isolated Chrome ≥148

1 participant