perf: read ahead optimization for OSC/DCS/APC PUT #5832

Open
homanp wants to merge 4 commits into xtermjs:master from homanp:perf/payload-loop-unrolling

Conversation

@homanp

homanp commented Apr 23, 2026

Contributor

Follow-up from #5825
Replaces the multi-condition comparison chains in OSC_PUT, DCS_PUT, and APC_PUT inner loops with static Uint8Array lookup tables and clean while loops.
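
A minimal sketch of the table idea, not the PR's actual code: the exit test for a payload byte is precomputed once into a 256-entry `Uint8Array`, so the inner loop does a single indexed read instead of walking a comparison chain. The names `isExitByChain`, `buildExitTable`, `DCS_EXIT` and `scanBytes` are hypothetical, and the exit set is assumed from the DCS_PUT conditions quoted later in this thread. The sketch only covers byte-sized input.

```typescript
// Hypothetical names throughout. The exit set (0x18, 0x1a, 0x1b, 0x80-0x9f)
// is assumed from the DCS_PUT conditions quoted later in this thread.
const NON_ASCII_PRINTABLE = 0xa0;

// Comparison-chain form: several branches per payload byte.
function isExitByChain(code: number): boolean {
  return code === 0x18 || code === 0x1a || code === 0x1b
    || (code > 0x7f && code < NON_ASCII_PRINTABLE);
}

// Table form: classify every byte value once, up front.
function buildExitTable(): Uint8Array {
  const table = new Uint8Array(256);
  for (let b = 0; b < 256; b++) {
    table[b] = isExitByChain(b) ? 1 : 0;
  }
  return table;
}

const DCS_EXIT = buildExitTable();

// Inner loop over byte-sized input: one table read per element.
// Returns the index of the first exit byte, or `end` if there is none.
function scanBytes(data: Uint8Array, start: number, end: number): number {
  let c = start;
  while (c < end && DCS_EXIT[data[c]] === 0) c++;
  return c;
}
```

Whether the table beats the chain depends on the engine and on payload length, which is what the benchmark numbers in this comment are measuring.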

Parser benchmark:

| benchmark | master | this PR | change |
|---|---|---|---|
| OSC string (short) | 123.8 | 107.1 | -13% |
| OSC string (long) | 134.0 | 151.6 | +13% |
| OSC class (short) | 157.3 | 147.7 | -6% |
| OSC class (long) | 536.0 | 544.6 | +2% |
| DCS string (short) | 91.9 | 88.0 | -4% |
| DCS string (long) | 125.0 | 153.0 | +22% |
| DCS class (short) | 116.9 | 119.6 | +2% |
| DCS class (long) | 431.5 | 540.4 | +25% |

Long payloads improve by 13-25%, except OSC class (long) at +2%. Short OSC regresses by 6-13% from the table indirection, but short OSC payloads are rare in real terminal data (most carry window titles, URLs, or image data). DCS is positive except for a 4% dip on short strings.

@homanp

homanp commented Apr 23, 2026

Contributor Author

@jerch I tried a couple of different approaches here and landed on a lookup table as an option. Happy to get feedback on this.

@jerch

jerch commented Apr 23, 2026

Member

Oh wow - that is an interesting approach. Tbh, I would not have thought that another table could give an advantage here. Lemme play around with this for a bit.

@homanp

homanp commented Apr 23, 2026

Contributor Author

> Oh wow - that is an interesting approach. Tbh, I would not have thought that another table could give an advantage here. Lemme play around with this for a bit.

Couldn't find a better way without nuking short-string performance.

@jerch

jerch commented Apr 23, 2026

Member

@homanp The table eval is actually slightly wrong, as it only respects byte values. The input is UTF32, thus the final condition against NON_ASCII_PRINTABLE should include all codepoints in 0xa0..0x10FFFF.

homanp force-pushed the perf/payload-loop-unrolling branch from b25ed1e to 6a6157c on April 23, 2026 17:48

@homanp

homanp commented Apr 23, 2026

Contributor Author

> @homanp The table eval is actually slightly wrong, as it only respects byte values. The input is UTF32, thus the final condition against NON_ASCII_PRINTABLE should include all codepoints in 0xa0..0x10FFFF.

Added a fix: codepoints > 0xff now bypass the table lookup
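
A sketch of the problem and the bypass, under stated assumptions: the table polarity (1 = payload, 0 = exit) and the names `PAYLOAD`, `scanTableOnly` and `scanWithBypass` are hypothetical, since the thread does not show the table itself. It illustrates why a 256-entry table is not enough for UTF32 input: an index above 0xff reads past the end of the `Uint8Array` and yields `undefined`.

```typescript
const NON_ASCII_PRINTABLE = 0xa0;

// Assumed polarity: 1 = payload (keep scanning), 0 = leave the PUT loop.
const PAYLOAD = new Uint8Array(256);
for (let b = 0; b < 256; b++) {
  const exit = b === 0x18 || b === 0x1a || b === 0x1b
    || (b > 0x7f && b < NON_ASCII_PRINTABLE);
  PAYLOAD[b] = exit ? 0 : 1;
}

// Table-only scan: a codepoint above 0xff indexes past the table, reads
// `undefined`, and is therefore misclassified as an exit.
function scanTableOnly(data: Uint32Array, start: number, end: number): number {
  let c = start;
  while (c < end && PAYLOAD[data[c]] === 1) c++;
  return c;
}

// Guarded scan: codepoints above 0xff bypass the table and count as payload.
function scanWithBypass(data: Uint32Array, start: number, end: number): number {
  let c = start;
  while (c < end && (data[c] > 0xff || PAYLOAD[data[c]] === 1)) c++;
  return c;
}
```

With the input `[0x41, 0x1f600, 0x42, 0x1b]`, the table-only scan stops at the emoji while the guarded scan runs up to the ESC. The extra comparison per element is one plausible reason the guarded version benchmarks slower.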

@jerch

jerch commented Apr 23, 2026

Member

@homanp This runs slower now for me.

I have played around with DCS_PUT only - this is the fastest I could find:

        case ParserAction.DCS_PUT:
          // inner loop - exit DCS_PUT: 0x18, 0x1a, 0x1b, 0x7f, 0x80 - 0x9f
          // unhook triggered by: 0x1b, 0x9c (success) and 0x18, 0x1a (abort)
          c = i;
          for (; c < l4;) {
            if (
              (data[++c] <= 0x1b || data[c] > 0x7f) &&
              (data[c] === 0x1b || data[c] === 0x1a || data[c] === 0x18 || (data[c] > 0x7f && data[c] < 160))
            ) break;
            if (
              (data[++c] <= 0x1b || data[c] > 0x7f) &&
              (data[c] === 0x1b || data[c] === 0x1a || data[c] === 0x18 || (data[c] > 0x7f && data[c] < 160))
            ) break;
            if (
              (data[++c] <= 0x1b || data[c] > 0x7f) &&
              (data[c] === 0x1b || data[c] === 0x1a || data[c] === 0x18 || (data[c] > 0x7f && data[c] < 160))
            ) break;
            if (
              (data[++c] <= 0x1b || data[c] > 0x7f) &&
              (data[c] === 0x1b || data[c] === 0x1a || data[c] === 0x18 || (data[c] > 0x7f && data[c] < 160))
            ) break;
          }
          if (c >= l4) {
            while (c < length) {
              if (
                (data[c] <= 0x1b || data[c] > 0x7f) &&
                (data[c] === 0x1b || data[c] === 0x1a || data[c] === 0x18 || (data[c] > 0x7f && data[c] < 160))
              ) break;
              c++;
            }
          }
          this._dcsParser.put(data, i, c);
          i = c - 1;
          break;

Idea: Most DCS payloads are ASCII printables (base64 data), so this optimizes for ASCII.

Edit: Simplified code.
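
For readers who want to try the loop outside the parser's `switch`, here is a standalone sketch of the same two-stage scan. It is an illustration under assumptions, not the parser code: `scanDcsPayload` and `isDcsExit` are hypothetical names, the predicate is factored into a helper for readability (the snippet above inlines it), and `l4` is assumed to be `length - 4`.

```typescript
const NON_ASCII_PRINTABLE = 0xa0;

// Two-stage test: the cheap range filter rejects ordinary ASCII payload
// (0x1c-0x7f) before the detailed exit comparison runs.
function isDcsExit(code: number): boolean {
  return (code <= 0x1b || code > 0x7f) &&
    (code === 0x1b || code === 0x1a || code === 0x18
      || (code > 0x7f && code < NON_ASCII_PRINTABLE));
}

// Precondition (as in the parser): data[start] is already known to be payload.
// Returns the first index after `start` holding an exit code, or `length`.
function scanDcsPayload(data: Uint32Array, start: number, length: number): number {
  const l4 = length - 4; // assumption: keeps the 4x unrolled reads in bounds
  let c = start;
  while (c < l4) {
    if (isDcsExit(data[++c])) break;
    if (isDcsExit(data[++c])) break;
    if (isDcsExit(data[++c])) break;
    if (isDcsExit(data[++c])) break;
  }
  if (c >= l4) {
    // Tail: fewer than four elements left, check them one by one.
    while (c < length) {
      if (isDcsExit(data[c])) break;
      c++;
    }
  }
  return c;
}
```

The tail loop re-checks `data[c]` once; that is harmless because the element was either already accepted by the unrolled loop or is the exit code that ended it.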

@homanp

homanp commented Apr 23, 2026

Contributor Author

> @homanp This runs slower now for me.
>
> I have played around with DCS_PUT only - this is the fastest I could find:
>
>         case ParserAction.DCS_PUT:
>           // inner loop - exit DCS_PUT: 0x18, 0x1a, 0x1b, 0x7f, 0x80 - 0x9f
>           // unhook triggered by: 0x1b, 0x9c (success) and 0x18, 0x1a (abort)
>           c = i;
>           for (; c < l4;) {
>             if (
>               (data[++c] <= 0x1b || data[c] > 0x7f) &&
>               (data[c] === 0x1b || data[c] === 0x1a || data[c] === 0x18 || (data[c] > 0x7f && data[c] < 160))
>             ) break;
>             if (
>               (data[++c] <= 0x1b || data[c] > 0x7f) &&
>               (data[c] === 0x1b || data[c] === 0x1a || data[c] === 0x18 || (data[c] > 0x7f && data[c] < 160))
>             ) break;
>             if (
>               (data[++c] <= 0x1b || data[c] > 0x7f) &&
>               (data[c] === 0x1b || data[c] === 0x1a || data[c] === 0x18 || (data[c] > 0x7f && data[c] < 160))
>             ) break;
>             if (
>               (data[++c] <= 0x1b || data[c] > 0x7f) &&
>               (data[c] === 0x1b || data[c] === 0x1a || data[c] === 0x18 || (data[c] > 0x7f && data[c] < 160))
>             ) break;
>           }
>           if (c >= l4) {
>             while (c < length) {
>               if (
>                 (data[c] <= 0x1b || data[c] > 0x7f) &&
>                 (data[c] === 0x1b || data[c] === 0x1a || data[c] === 0x18 || (data[c] > 0x7f && data[c] < 160))
>               ) break;
>               c++;
>             }
>           }
>           this._dcsParser.put(data, i, c);
>           i = c - 1;
>           break;
>
> Idea: Most DCS payloads are ASCII printables (base64 data), so this optimizes for ASCII.
>
> Edit: Simplified code.

Will play around with it in an hour or so.

          // inner loop - exit APC_PUT: 0x18, 0x1a, 0x1b, 0x9c
          for (let j = i + 1; ; ++j) {
            if (j >= length || (code = data[j]) === 0x18 || code === 0x1a || code === 0x1b || code === 0x9c || (code > 0x7f && code < NON_ASCII_PRINTABLE)) {
              this._apcParser.put(data, i, j);

Member

FIXME: The code === 0x9c condition is redundant here.
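
A quick way to check the redundancy claim, sketched under the assumption that `NON_ASCII_PRINTABLE` is 0xa0 (the value implied by the `0xa0..0x10FFFF` remark earlier in the thread): since 0x9c lies inside 0x80..0x9f, the range test already covers it. The function names are hypothetical.

```typescript
const NON_ASCII_PRINTABLE = 0xa0; // assumed value

// Exit test as written in the reviewed line, with the explicit 0x9c check.
function exitWith9c(code: number): boolean {
  return code === 0x18 || code === 0x1a || code === 0x1b || code === 0x9c
    || (code > 0x7f && code < NON_ASCII_PRINTABLE);
}

// Same test with the 0x9c comparison removed.
function exitWithout9c(code: number): boolean {
  return code === 0x18 || code === 0x1a || code === 0x1b
    || (code > 0x7f && code < NON_ASCII_PRINTABLE);
}

// Returns the first codepoint where the two predicates disagree, or -1.
function firstMismatch(): number {
  for (let code = 0; code <= 0x10ffff; code++) {
    if (exitWith9c(code) !== exitWithout9c(code)) return code;
  }
  return -1;
}
```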

Member

Dang, tests & benchmarks are missing for the APC parser 😱
--> #5834

@jerch

jerch commented Apr 23, 2026

Member

And for OSC_PUT I currently have this (very similar to PRINT):

        case ParserAction.OSC_PUT:
          // inner loop: 0x20 (SP) included, 0x7F (DEL) included
          c = i;
          while (c < l4
            && data[++c] >= 0x20 && (data[c] <= 0x7f || data[c] >= NON_ASCII_PRINTABLE)
            && data[++c] >= 0x20 && (data[c] <= 0x7f || data[c] >= NON_ASCII_PRINTABLE)
            && data[++c] >= 0x20 && (data[c] <= 0x7f || data[c] >= NON_ASCII_PRINTABLE)
            && data[++c] >= 0x20 && (data[c] <= 0x7f || data[c] >= NON_ASCII_PRINTABLE)
          ) {}
          if (c >= l4) {
            while (c < length && data[c] >= 0x20 && (data[c] <= 0x7f || data[c] >= NON_ASCII_PRINTABLE)) {
              c++;
            }
          }
          this._oscParser.put(data, i, c);
          i = c - 1;
          break;
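
The same loop as a standalone function, as a sketch for experimenting outside the parser. `scanOscPayload` is a hypothetical name, `l4` is assumed to be `length - 4`, and `NON_ASCII_PRINTABLE` is assumed to be 0xa0. The `&&` chain does the work in the loop condition, so the body stays empty; when any check fails, `c` is left pointing at the offending element.

```typescript
const NON_ASCII_PRINTABLE = 0xa0; // assumed value

// Precondition (as in the parser): data[start] is already known to be payload.
// Returns the first index after `start` that is not OSC payload, or `length`.
function scanOscPayload(data: Uint32Array, start: number, length: number): number {
  const l4 = length - 4; // assumption: keeps the 4x unrolled reads in bounds
  let c = start;
  while (c < l4
    && data[++c] >= 0x20 && (data[c] <= 0x7f || data[c] >= NON_ASCII_PRINTABLE)
    && data[++c] >= 0x20 && (data[c] <= 0x7f || data[c] >= NON_ASCII_PRINTABLE)
    && data[++c] >= 0x20 && (data[c] <= 0x7f || data[c] >= NON_ASCII_PRINTABLE)
    && data[++c] >= 0x20 && (data[c] <= 0x7f || data[c] >= NON_ASCII_PRINTABLE)
  ) { /* all work happens in the condition */ }
  if (c >= l4) {
    // Tail: fewer than four elements left, check them one by one.
    while (c < length && data[c] >= 0x20
      && (data[c] <= 0x7f || data[c] >= NON_ASCII_PRINTABLE)) {
      c++;
    }
  }
  return c;
}
```

For the payload `0;title` followed by BEL (0x07), the function returns the index of the BEL.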

homanp force-pushed the perf/payload-loop-unrolling branch from 6a6157c to bfefc1d on April 23, 2026 20:31

@homanp

homanp commented Apr 23, 2026

Contributor Author

@jerch Switched to your approach: a two-stage condition with an ASCII fast filter for DCS/APC and PRINT-style unrolling for OSC. Dropped the lookup tables.

parser benchmark:

| benchmark | master | this PR | change |
|---|---|---|---|
| OSC string (short) | 123.8 | 116.6 | -6% |
| OSC string (long) | 134.0 | 150.6 | +12% |
| OSC class (short) | 157.3 | 143.9 | -9% |
| OSC class (long) | 536.0 | 547.2 | +2% |
| DCS string (short) | 91.9 | 87.1 | -5% |
| DCS string (long) | 125.0 | 150.4 | +20% |
| DCS class (short) | 116.9 | 118.0 | +1% |
| DCS class (long) | 431.5 | 522.6 | +21% |

Apply loop unrolling with two-stage condition checks optimized for
ASCII payloads. DCS/APC use a fast filter (data[c] <= 0x1b || data[c] > 0x7f)
that skips the detailed exit check for common ASCII bytes. OSC uses
the same pattern as PRINT. Shared c and l4 variables across the parse
method avoid repeated declarations.

homanp force-pushed the perf/payload-loop-unrolling branch from bfefc1d to a3dda7f on April 23, 2026 20:34

homanp requested a review from jerch on April 23, 2026 20:44

@homanp

homanp commented Apr 23, 2026

Contributor Author

Maybe we should hold off on this until the new benchmarks and tests are done?

@jerch

jerch commented Apr 23, 2026

Member

@homanp Not as overwhelming as the other perf optimizations, but better than before. The tiny penalty on short OSC is ok; the real burners for OSC, DCS & APC are the image sequences, where one sequence can have >10 MB of payload. There, even a 20% speedup makes a huge difference.

@homanp

homanp commented Apr 23, 2026

Contributor Author

> @homanp Not as overwhelming as the other perf optimizations, but better than before. The tiny penalty on short OSC is ok; the real burners for OSC, DCS & APC are the image sequences, where one sequence can have >10 MB of payload. There, even a 20% speedup makes a huge difference.

I've been thinking about the buffer and what type of optimizations could be done there. I'm guessing it's heavily optimized already, but still, it would be interesting to take a stab at that next.

@jerch

jerch commented Apr 23, 2026

Member

> I've been thinking about the buffer and what type of optimizations could be done there. I'm guessing it's heavily optimized already, but still, it would be interesting to take a stab at that next.

Sure, feel free to give it a go. I think the primitives in Bufferline.ts do not have much room for speedup. Perf-wise, these are more interesting during data input:

  • Inputhandler.print - takes a huge runtime portion (was it >60% of the input chain? don't remember)
  • UTF32 encoding in src/common/input/TextDecoder.ts
  • other often used handlers like SGR (no clue about its runtime)
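
For context on the UTF32 item, a minimal sketch of what such a decoding step does. This is not the code in src/common/input/TextDecoder.ts; `decodeToUtf32` is a hypothetical name. It converts a JS (UTF-16) string into one codepoint per `Uint32Array` element, which is why the parser loops in this PR see values up to 0x10FFFF rather than bytes.

```typescript
// Hypothetical helper, not xterm.js's decoder.
// Writes one codepoint per element into `target` and returns the count.
// `target` must have room for at least `input.length` elements.
function decodeToUtf32(input: string, target: Uint32Array): number {
  let size = 0;
  for (let i = 0; i < input.length; i++) {
    let code = input.charCodeAt(i);
    // Combine a high surrogate with the following low surrogate.
    if (code >= 0xd800 && code <= 0xdbff && i + 1 < input.length) {
      const low = input.charCodeAt(i + 1);
      if (low >= 0xdc00 && low <= 0xdfff) {
        code = (code - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
        i++; // the low surrogate has been consumed
      }
    }
    target[size++] = code;
  }
  return size;
}
```

Lone surrogates pass through unchanged here, and a surrogate pair split across two input chunks is not handled; a streaming decoder would need to carry that state between calls.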

@homanp

homanp commented Apr 23, 2026

Contributor Author

> > I've been thinking about the buffer and what type of optimizations could be done there. I'm guessing it's heavily optimized already, but still, it would be interesting to take a stab at that next.
>
> Sure, feel free to give it a go. I think the primitives in Bufferline.ts do not have much room for speedup. Perf-wise, these are more interesting during data input:
>
>   • Inputhandler.print - takes a huge runtime portion (was it >60% of the input chain? don't remember)
>   • UTF32 encoding in src/common/input/TextDecoder.ts
>   • other often used handlers like SGR (no clue about its runtime)

Will look into it.

@jerch

jerch commented Apr 24, 2026

Member

@homanp FYI: the current APC handling is quite off and needs a major rework (see #5834). So we def. need to hold this back until that is fixed.

@homanp

homanp commented Apr 24, 2026

Contributor Author

> @homanp FYI: the current APC handling is quite off and needs a major rework (see #5834). So we def. need to hold this back until that is fixed.

Makes sense, get back to me here when you have it worked out and I can run another pass.

@jerch

jerch commented May 6, 2026

Member

@homanp The APC handling is fixed with #5840, feel free to continue the perf investigations.

@homanp

homanp commented May 6, 2026

Contributor Author

> @homanp The APC handling is fixed with #5840, feel free to continue the perf investigations.

Thanks, will merge into this and keep working on it.

@jerch

jerch commented May 6, 2026

Member

You can update to master, as it is already merged there.

jerch changed the title from "perf: table-driven payload loops for OSC/DCS/APC PUT" to "perf: read ahead optimization for OSC/DCS/APC PUT" on May 10, 2026

@homanp

homanp commented May 11, 2026

Contributor Author

@jerch Taking a stab at this today. Saw you merged the changes.

@homanp

homanp commented May 11, 2026

Contributor Author

@jerch Pushed another pass here.

The main change is a parser-local fast path for simple 7-bit OSC / DCS / APC string payloads. It only takes the fast path for plain printable payloads, and falls back to the existing transition table for the more complex forms.
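
One way such a fast path could be shaped, as a rough sketch only. The actual diff is not shown in this thread, so every name here (`scanPrintable7Bit`, `consumePayload`, `PutHandler`, `SlowPath`) is hypothetical, and treating 0x20..0x7e as the "plain printable" range is an assumption. The idea: consume the longest run of simple 7-bit payload in a tight loop, hand it to the handler in one call, and leave everything else to the general state machine.

```typescript
// Hypothetical types: `put` receives a payload slice, `slowPath` stands in
// for the existing transition-table loop and returns the index it stopped at.
type PutHandler = (data: Uint32Array, start: number, end: number) => void;
type SlowPath = (data: Uint32Array, start: number, end: number) => number;

// Returns the index of the first element outside 0x20..0x7e (assumed range).
function scanPrintable7Bit(data: Uint32Array, start: number, end: number): number {
  let c = start;
  while (c < end && data[c] >= 0x20 && data[c] <= 0x7e) c++;
  return c;
}

// Fast path first; anything that is not plain printable falls back.
function consumePayload(
  data: Uint32Array, start: number, end: number,
  put: PutHandler, slowPath: SlowPath
): number {
  const stop = scanPrintable7Bit(data, start, end);
  if (stop > start) put(data, start, stop);
  return stop < end ? slowPath(data, stop, end) : end;
}
```

The fallback keeps terminators, C1 controls and non-ASCII payload on the existing code path, so only the common case changes behavior-wise.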

I also added a couple of regression tests around the read-ahead behavior.

The APC benchmarks that were added now look much better on my machine:

| Case | Baseline | Current | Change |
|---|---|---|---|
| APC string short | 60.77 MB/s | 156.82 MB/s | +158.04% |
| APC string long | 81.85 MB/s | 114.17 MB/s | +39.49% |
| APC class short | 86.88 MB/s | 279.73 MB/s | +221.98% |
| APC class long | 308.73 MB/s | 545.83 MB/s | +76.80% |

Happy to get your feedback on this.

I tried a couple of different approaches with minimal perf gains. I think I might be on to something here, but in all honesty the diff is too big for a PR for my taste :)

@homanp

homanp commented May 11, 2026

Contributor Author

Here's the full benchmark:

| Case | Baseline | Current | Change | Eval |
|---|---|---|---|---|
| PRINT - a | 529.04 MB/s | 598.81 MB/s | +13.19% | OK |
| EXECUTE - \n | 243.26 MB/s | 265.90 MB/s | +9.31% | OK |
| ESCAPE - ESC E | 58.17 MB/s | 56.23 MB/s | -3.32% | OK |
| ESCAPE with collect - ESC % G | 76.22 MB/s | 68.09 MB/s | -10.67% | FAIL |
| CSI - CSI A | 224.14 MB/s | 282.07 MB/s | +25.85% | OK |
| CSI with collect - CSI ? p | 300.91 MB/s | 290.91 MB/s | -3.32% | OK |
| CSI params short | 180.64 MB/s | 272.68 MB/s | +50.95% | OK |
| CSI params long | 215.47 MB/s | 321.01 MB/s | +48.98% | OK |
| OSC string short | 89.97 MB/s | 133.25 MB/s | +48.11% | OK |
| OSC string long | 106.05 MB/s | 110.88 MB/s | +4.56% | OK |
| OSC class short | 93.33 MB/s | 199.26 MB/s | +113.49% | OK |
| OSC class long | 314.14 MB/s | 651.01 MB/s | +107.24% | OK |
| DCS string short | 67.90 MB/s | 131.69 MB/s | +93.96% | OK |
| DCS string long | 91.06 MB/s | 114.48 MB/s | +25.72% | OK |
| DCS class short | 95.90 MB/s | 419.30 MB/s | +337.24% | OK |
| DCS class long | 263.03 MB/s | 514.02 MB/s | +95.42% | OK |
| APC string short | 60.77 MB/s | 156.82 MB/s | +158.04% | OK |
| APC string long | 81.85 MB/s | 114.17 MB/s | +39.49% | OK |
| APC class short | 86.88 MB/s | 279.73 MB/s | +221.98% | OK |
| APC class long | 308.73 MB/s | 545.83 MB/s | +76.80% | OK |
