perf: read ahead optimization for OSC/DCS/APC PUT #5832
|
@jerch I tried a couple of different approaches here and landed on lookup tables as an option. Happy to get feedback on this. |
|
Oh wow - that is an interesting approach. Tbh, I would not have thought that another table could give an advantage here. Lemme play around with this for a bit. |
Couldn't find a better way without nuking short string. |
|
@homanp The table eval is actually slightly wrong, as it only respects byte values. The input is UTF32, thus the final condition against NON_ASCII_PRINTABLE should include all codepoints in 0xa0..0x10FFFF. |
Force-pushed from b25ed1e to 6a6157c
Added a fix: codepoints > 0xff now bypass the table lookup |
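The table lookup with the bypass for codepoints above 0xff could look roughly like the sketch below. It is not the PR's actual code: `DCS_EXIT_TABLE` and `scanDcsWithTable` are hypothetical names, and `NON_ASCII_PRINTABLE = 0xa0` is assumed from the discussion above.

```typescript
// Assumed value: first printable codepoint after the C1 control range.
const NON_ASCII_PRINTABLE = 0xa0;

// Hypothetical table: 1 = this byte value ends the DCS payload run, 0 = keep scanning.
const DCS_EXIT_TABLE = new Uint8Array(256);
for (const code of [0x18, 0x1a, 0x1b]) {
  DCS_EXIT_TABLE[code] = 1;
}
for (let code = 0x80; code < NON_ASCII_PRINTABLE; ++code) {
  DCS_EXIT_TABLE[code] = 1;
}

/** Returns the index of the first exit codepoint in [start, end), or end if there is none. */
function scanDcsWithTable(data: Uint32Array, start: number, end: number): number {
  let c = start;
  while (c < end) {
    const code = data[c];
    // The input is UTF-32: codepoints above 0xff are payload and must not
    // be looked up in (or truncated to fit) the 256-entry table.
    if (code <= 0xff && DCS_EXIT_TABLE[code] === 1) break;
    c++;
  }
  return c;
}
```

For the input `[0x41, 0x11b, 0x19c, 0x20ac, 0x1b, 0x42]` the scan stops at index 4: 0x11b and 0x19c are ordinary payload even though their low bytes match exit codes.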
|
@homanp This runs slower for me now. I have played around with DCS_PUT only - this is the fastest I could find:
case ParserAction.DCS_PUT:
// inner loop - exit DCS_PUT: 0x18, 0x1a, 0x1b, 0x7f, 0x80 - 0x9f
// unhook triggered by: 0x1b, 0x9c (success) and 0x18, 0x1a (abort)
c = i;
for (; c < l4;) {
if (
(data[++c] <= 0x1b || data[c] > 0x7f) &&
(data[c] === 0x1b || data[c] === 0x1a || data[c] === 0x18 || (data[c] > 0x7f && data[c] < 160))
) break;
if (
(data[++c] <= 0x1b || data[c] > 0x7f) &&
(data[c] === 0x1b || data[c] === 0x1a || data[c] === 0x18 || (data[c] > 0x7f && data[c] < 160))
) break;
if (
(data[++c] <= 0x1b || data[c] > 0x7f) &&
(data[c] === 0x1b || data[c] === 0x1a || data[c] === 0x18 || (data[c] > 0x7f && data[c] < 160))
) break;
if (
(data[++c] <= 0x1b || data[c] > 0x7f) &&
(data[c] === 0x1b || data[c] === 0x1a || data[c] === 0x18 || (data[c] > 0x7f && data[c] < 160))
) break;
}
if (c >= l4) {
while (c < length) {
if (
(data[c] <= 0x1b || data[c] > 0x7f) &&
(data[c] === 0x1b || data[c] === 0x1a || data[c] === 0x18 || (data[c] > 0x7f && data[c] < 160))
) break;
c++;
}
}
this._dcsParser.put(data, i, c);
i = c - 1;
break;
Idea: Most DCS payloads are ASCII printables (base64 data), so this optimizes for ASCII. Edit: Simplified code. |
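Stripped of the unrolling, the two-stage condition can be sketched as below. This is only an illustration: `isDcsExit`, `isDcsExitTwoStage` and `scanDcs` are hypothetical helpers, and `NON_ASCII_PRINTABLE = 0xa0` is assumed (the snippet above writes it as `160`).

```typescript
// Assumed value, matching the literal 160 in the snippet above.
const NON_ASCII_PRINTABLE = 0xa0;

/** Full exit test: ESC, SUB, CAN, or anything in 0x80..0x9f. */
function isDcsExit(code: number): boolean {
  return code === 0x1b || code === 0x1a || code === 0x18
    || (code > 0x7f && code < NON_ASCII_PRINTABLE);
}

/** Same result as isDcsExit, but with a cheap range test in front. */
function isDcsExitTwoStage(code: number): boolean {
  // Stage 1 is false for 0x1c..0x7f, so typical ASCII payload (e.g. base64)
  // short-circuits here and never reaches the detailed comparison chain.
  return (code <= 0x1b || code > 0x7f) && isDcsExit(code);
}

/** Returns the index of the first exit codepoint in [start, end), or end. */
function scanDcs(data: Uint32Array, start: number, end: number): number {
  let c = start;
  while (c < end && !isDcsExitTwoStage(data[c])) c++;
  return c;
}
```

The prefilter does not change which codepoints exit the loop, because every exit code is either `<= 0x1b` or `> 0x7f`. It only changes how much work the common case does.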
Will play around with in an hour or so. |
// inner loop - exit APC_PUT: 0x18, 0x1a, 0x1b, 0x9c
for (let j = i + 1; ; ++j) {
  if (j >= length || (code = data[j]) === 0x18 || code === 0x1a || code === 0x1b || code === 0x9c || (code > 0x7f && code < NON_ASCII_PRINTABLE)) {
    this._apcParser.put(data, i, j);
FIXME: The code === 0x9c condition is redundant here, since 0x9c already falls into the range covered by (code > 0x7f && code < NON_ASCII_PRINTABLE).
dang, for APC parser tests & benchmarks are missing 😱
--> #5834
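The redundancy noted in this review thread can be checked mechanically. The sketch below compares the APC exit condition with and without the explicit 0x9c test over the whole codepoint range. It assumes `NON_ASCII_PRINTABLE` is 0xa0, and the function names are made up for illustration.

```typescript
// Assumed value of the constant used in the reviewed line.
const NON_ASCII_PRINTABLE = 0xa0;

/** Exit condition as written in the reviewed line (without the bounds check). */
function apcExitWith9c(code: number): boolean {
  return code === 0x18 || code === 0x1a || code === 0x1b || code === 0x9c
    || (code > 0x7f && code < NON_ASCII_PRINTABLE);
}

/** Same condition with the explicit 0x9c comparison dropped. */
function apcExitWithout9c(code: number): boolean {
  return code === 0x18 || code === 0x1a || code === 0x1b
    || (code > 0x7f && code < NON_ASCII_PRINTABLE);
}

/** Returns the first codepoint where the two conditions disagree, or -1. */
function firstDifference(): number {
  for (let code = 0; code <= 0x10ffff; ++code) {
    if (apcExitWith9c(code) !== apcExitWithout9c(code)) return code;
  }
  return -1;
}
```

Under that assumption `firstDifference()` finds no disagreement, since 0x9c lies inside 0x80..0x9f.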
|
And for OSC_PUT I currently have this (very similar to PRINT):
case ParserAction.OSC_PUT:
// inner loop: 0x20 (SP) included, 0x7F (DEL) included
c = i;
while (c < l4
&& data[++c] >= 0x20 && (data[c] <= 0x7f || data[c] >= NON_ASCII_PRINTABLE)
&& data[++c] >= 0x20 && (data[c] <= 0x7f || data[c] >= NON_ASCII_PRINTABLE)
&& data[++c] >= 0x20 && (data[c] <= 0x7f || data[c] >= NON_ASCII_PRINTABLE)
&& data[++c] >= 0x20 && (data[c] <= 0x7f || data[c] >= NON_ASCII_PRINTABLE)
) {}
if (c >= l4) {
while (c < length && data[c] >= 0x20 && (data[c] <= 0x7f || data[c] >= NON_ASCII_PRINTABLE)) {
c++;
}
}
this._oscParser.put(data, i, c);
i = c - 1;
break; |
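As a standalone illustration of the unrolling idea, here is a simplified sketch. It is not the parser's code: `isOscPayload` and `scanOsc` are hypothetical, the index handling differs from the pre-increment form above, `unrolledEnd` merely plays a role similar to `l4`, and `NON_ASCII_PRINTABLE = 0xa0` is assumed.

```typescript
// Assumed value of the parser constant.
const NON_ASCII_PRINTABLE = 0xa0;

/** True for codepoints that stay inside the OSC payload (SP and DEL included). */
function isOscPayload(code: number): boolean {
  return code >= 0x20 && (code <= 0x7f || code >= NON_ASCII_PRINTABLE);
}

/** Returns the index of the first non-payload codepoint in [start, end), or end. */
function scanOsc(data: Uint32Array, start: number, end: number): number {
  let c = start;
  // Last index from which four codepoints can be read without a bounds check.
  const unrolledEnd = end - 4;
  // Unrolled part: test four codepoints per loop iteration.
  while (c <= unrolledEnd) {
    if (!isOscPayload(data[c])) return c;
    if (!isOscPayload(data[c + 1])) return c + 1;
    if (!isOscPayload(data[c + 2])) return c + 2;
    if (!isOscPayload(data[c + 3])) return c + 3;
    c += 4;
  }
  // Tail: fewer than four codepoints are left, so check the bound each step.
  while (c < end && isOscPayload(data[c])) c++;
  return c;
}
```

The unrolled loop pays the bounds check once per four codepoints, which is where long payloads benefit. Very short payloads go almost entirely through the tail loop.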
Force-pushed from 6a6157c to bfefc1d
|
@jerch switched to your approach, two-stage condition with ASCII fast filter for DCS/APC and PRINT-style unrolling for OSC. dropped the lookup tables. parser benchmark:
|
Apply loop unrolling with two-stage condition checks optimized for ASCII payloads. DCS/APC use a fast filter (data[c] <= 0x1b || data[c] > 0x7f) that skips the detailed exit check for common ASCII bytes. OSC uses the same pattern as PRINT. Shared c and l4 variables across the parse method avoid repeated declarations.
Force-pushed from bfefc1d to a3dda7f
|
Maybe we should hold on this until the new benchmarks and tests are done? |
|
@homanp Not as overwhelming as the other perf optimizations, but better than before. The tiny penalty on short OSC is ok; the real burners for OSC, DCS & APC are the image sequences, where one sequence can have >10 MB of payload. There, even a 20% speedup makes a huge difference. |
I've been thinking about the buffer and what type of optimizations could be done there. I'm guessing it's heavily optimized already, but still. Would be interesting to take a stab at next. |
Sure feel free to give it a go. I think the primitives in
|
Will look into it. |
|
You can update to master, as it is already merged there. |
|
@jerch taking a stab at this today. saw you merged the changes |
|
@jerch Pushed another pass here. The main change is a parser-local fast path for simple 7-bit I also added a couple of regression tests around the read-ahead behavior. The APC benchmarks that were added now look much better on my machine:
Happy to get your feedback on this. I tried a couple of different approaches with minimal perf gains. I think I might be on to something here, but in all honesty the diff is too big for a PR for my taste :) |
|
Here's the full benchmark:
|
Follow-up from #5825
Replaces the multi-condition comparison chains in OSC_PUT, DCS_PUT, and APC_PUT inner loops with static Uint8Array lookup tables and clean while loops.
Parser benchmark:
Long payloads improve by 13-25%. Short OSC regresses slightly from the table indirection, but short OSC payloads are rare in real terminal data (most carry window titles, URLs, or image data). DCS is positive across the board.