Add LuaJIT interpreter on latest main #42
Draft
dalehamel wants to merge 5 commits into
I kept running into issues. I haven't had time to deep dive on it yet, but I threw gpt 5.4 at it and I'm trying to get it to tweak things until it works. I suspect that the amount of traffic the load balancer gets is causing some buffer to fill faster than it gets drained. If I can get things working, I'll comment on the upstream PR.
Scale issue, I guess.
f9edc5f to ee5dd66
LuaJIT trace prefixes consume ~65K entries per nginx worker. With 16+ workers on ingress-nginx nodes, the default 524K map is exhausted, blocking native process mappings and killing all profiling on the node. Raise the BPF map from 2^19 (524K) to 2^22 (4M) and the Go-side loader override from 2^20 to 2^22 to match.
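The arithmetic behind this change can be checked with a short sketch. It assumes the "~65K" figure is exactly 2^16 entries per worker and counts only trace-prefix entries; the names `entriesPerWorker`, `demand`, and `fits` are hypothetical and do not come from the profiler's code.

```go
package main

import "fmt"

// entriesPerWorker approximates the "~65K entries per nginx worker" figure
// from the commit message. Using exactly 1<<16 is an assumption.
const entriesPerWorker = 1 << 16

// demand returns the number of map entries consumed by LuaJIT trace
// prefixes alone for the given number of nginx workers.
func demand(workers int) int {
	return workers * entriesPerWorker
}

// fits reports whether the trace-prefix demand is strictly below the map
// capacity, i.e. whether any room is left for native process mappings.
func fits(workers, capacity int) bool {
	return demand(workers) < capacity
}

func main() {
	for _, capacity := range []int{1 << 19, 1 << 22} {
		fmt.Printf("capacity=%d workers=16 demand=%d fits=%t\n",
			capacity, demand(16), fits(16, capacity))
	}
}
```

Under these assumptions, 16 workers need about 1M entries, which is twice the 2^19 default, while 2^22 leaves room for roughly four times that worker count.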
When the LuaJIT GC relocates a proto object between eBPF sample capture and Go-side symbolization, the debug-info pointers (lineinfo, uvinfo, varinfo) can be out of range. Previously this rejected the entire proto. Now: if the chunkname is still readable, return a degraded proto that provides file name attribution (e.g. '@/etc/nginx/lua/sorting_hat.lua') even without line numbers. This is much more useful than the raw '[ ()]: +0x0' that appeared before. The proto is cached with zeroed debug pointers so downstream code (getLine, getVarname, etc.) safely returns zero/empty.
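A minimal sketch of the fallback described above, not the actual patch: the `proto` type, `inRange`, `sanitize`, and the address bounds are all hypothetical stand-ins, and the real validity check in the unwinder may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// proto is a hypothetical, simplified view of a cached GCproto.
// A zero debug pointer means "not available".
type proto struct {
	chunkname string
	lineinfo  uint64
	uvinfo    uint64
	varinfo   uint64
}

var errInvalidProto = errors.New("invalid GCproto object")

// inRange is a hypothetical validity check: the pointer must fall inside
// the address window [lo, hi) where the proto's debug info is expected.
func inRange(ptr, lo, hi uint64) bool {
	return ptr >= lo && ptr < hi
}

// sanitize keeps a fully valid proto as-is. If any debug pointer is out of
// range but the chunkname was still readable, it returns a degraded proto
// with zeroed debug pointers instead of rejecting the whole object.
func sanitize(p proto, lo, hi uint64) (proto, error) {
	if inRange(p.lineinfo, lo, hi) && inRange(p.uvinfo, lo, hi) && inRange(p.varinfo, lo, hi) {
		return p, nil
	}
	if p.chunkname == "" {
		return proto{}, errInvalidProto
	}
	return proto{chunkname: p.chunkname}, nil
}

// hasLineInfo lets downstream lookups return zero/empty safely.
func (p proto) hasLineInfo() bool {
	return p.lineinfo != 0
}

func main() {
	stale := proto{
		chunkname: "@/etc/nginx/lua/sorting_hat.lua",
		lineinfo:  0x10, // out of range: simulates a relocated proto
		uvinfo:    0x5000,
		varinfo:   0x5100,
	}
	p, err := sanitize(stale, 0x4000, 0x6000)
	fmt.Println(p.chunkname, p.hasLineInfo(), err)
}
```

Zeroing the pointers before caching means every later consumer sees a consistent "no debug info" state and does not have to re-validate stale addresses.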
The LuaJIT GC can free/relocate GCproto objects between eBPF sample capture and Go-side symbolization (via process_vm_readv), causing 'invalid GCproto object' failures for nearly all Lua-level frames. Fix: during SynchronizeMappings, read the GCtrace.startpt field (the prototype each trace was compiled from) and eagerly cache the GCproto via getGCproto(). Since the traces are alive during sync, their protos are guaranteed valid at that moment. The cached protos are then available for symbolization later, even if the GC recycles the memory. Also read GCtrace.startpt (offset 0x40) in loadTraces so each trace carries its prototype address.
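The eager-caching idea can be illustrated with a simulated process, as a sketch under stated assumptions. Only the `startpt` offset (0x40) comes from the commit message; the `reader` interface, `fakeProcess`, and `cacheStartProtos` are hypothetical and stand in for the real memory reads and for `getGCproto()`.

```go
package main

import "fmt"

// startptOffset is the GCtrace.startpt offset quoted in the commit message.
const startptOffset = 0x40

// reader is a hypothetical stand-in for reading the target process's memory.
type reader interface {
	ptr(addr uint64) (uint64, bool)
	chunkname(protoAddr uint64) (string, bool)
}

// fakeProcess simulates remote memory with two maps.
type fakeProcess struct {
	words map[uint64]uint64 // address -> pointer-sized word
	names map[uint64]string // proto address -> chunkname
}

func (f *fakeProcess) ptr(addr uint64) (uint64, bool) {
	v, ok := f.words[addr]
	return v, ok
}

func (f *fakeProcess) chunkname(protoAddr uint64) (string, bool) {
	v, ok := f.names[protoAddr]
	return v, ok
}

// cacheStartProtos reads each trace's startpt while the traces are known to
// be alive and stores the proto's chunkname, keyed by proto address.
func cacheStartProtos(r reader, traces []uint64, cache map[uint64]string) {
	for _, trace := range traces {
		protoAddr, ok := r.ptr(trace + startptOffset)
		if !ok {
			continue // unreadable trace: skip rather than fail the sync
		}
		if _, seen := cache[protoAddr]; seen {
			continue
		}
		if name, ok := r.chunkname(protoAddr); ok {
			cache[protoAddr] = name
		}
	}
}

func main() {
	proc := &fakeProcess{
		words: map[uint64]uint64{0x1000 + startptOffset: 0x2000},
		names: map[uint64]string{0x2000: "@/etc/nginx/lua/sorting_hat.lua"},
	}
	cache := map[uint64]string{}
	cacheStartProtos(proc, []uint64{0x1000}, cache)

	delete(proc.names, 0x2000) // simulate the GC recycling the proto
	fmt.Println(cache[0x2000]) // still resolvable from the cache
}
```

The cache is populated during the sync, so a later lookup no longer depends on the proto's memory still being intact in the target process.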
c5dc758 to 8b57f24
This PR was marked stale due to lack of activity. It will be closed in 14 days.
LuaJIT unwinder patches for production ingress-nginx
Cherry-pick of upstream PR #1236 plus Shopify-specific fixes, tested on production ingress-nginx nodes with OpenResty/LuaJIT.
Production testing results (2026-04-24)
The upstream LuaJIT unwinder works. Profiles flow correctly, including Lua-level frames. Our initial zero-sample issue was caused by our ingestion side (missing "luajit" frame kind in the ClickHouse sink), not by the unwinder itself. That ingestion fix shipped separately.

We ran A/B/C tests on the same production ingress-nginx node (routing-us-ce1-ej9) with three variants:

[Results table not recovered. Surviving column header: ERROR_4012 error frames in output]

Commits

adb5329
4148ca4
c1dc372: pidPageMappingInfoSize 1M→4M; eliminates ENOSPC log spam at scale
4e5b14a
8b57f24

Only c1dc372 (buffer raise) produced a measurable difference in production testing. The GCproto patches are informational improvements that did not visibly change the flamegraph output.

Upstream comment
Full write-up with log evidence posted on the upstream PR:
open-telemetry#1236 (comment)
Related PRs