
Add LuaJIT interpreter on latest main #42

Draft
dalehamel wants to merge 5 commits into dale/luajit-upstream-base from dale/luajit-unwinder

@dalehamel (Member) commented Apr 23, 2026

LuaJIT unwinder patches for production ingress-nginx

Cherry-pick of upstream PR #1236 plus Shopify-specific fixes, tested on production ingress-nginx nodes with OpenResty/LuaJIT.

Production testing results (2026-04-24)

The upstream LuaJIT unwinder works. Profiles flow correctly, including Lua-level frames. Our initial zero-sample issue was caused by our ingestion side (a missing "luajit" frame kind in the ClickHouse sink), not by the unwinder itself. That ingestion fix shipped separately.

We ran A/B/C tests on the same production ingress-nginx node (routing-us-ce1-ej9) with three variants:

  • A (upstream baseline): profiles flow, but ENOSPC errors in logs and ERROR_4012 error frames in output
  • B (buffer fix only): ENOSPC errors eliminated, ~2x sample count
  • C (all fixes): visually identical to B — GCproto pre-caching did not produce a visible difference

Commits

| Commit | Purpose | Upstream-relevant? |
|---|---|---|
| adb5329 | Fix arm64 lint/test drift from cherry-pick | ✅ Yes |
| 4148ca4 | eBPF verifier fix (FRAMES_PER_WALK 15→8) for kernels ≥6.16 | ✅ Yes |
| c1dc372 | Raise pidPageMappingInfoSize 1M→4M; eliminates ENOSPC log spam at scale | ✅ Yes (main fix) |
| 4e5b14a | Degraded GCproto fallback: extract chunkname when debug pointers are stale | ❌ Non-critical |
| 8b57f24 | GCproto pre-caching via GCtrace.startpt | ❌ Non-critical |

Only c1dc372 (buffer raise) produced a measurable difference in production testing. The GCproto patches are informational improvements that did not visibly change the flamegraph output.

Upstream comment

Full write-up with log evidence posted on the upstream PR:
open-telemetry#1236 (comment)

Related PRs

  • PR #43 — upstream cherry-pick base (target for this PR)

@dalehamel (Member, Author) commented:

I kept running into issues:

{"time":"2026-04-23T22:48:27.124374999Z","level":"ERROR","msg":"Error adding trace(5876): update: no space left on device"}
{"time":"2026-04-23T22:48:27.124381043Z","level":"ERROR","msg":"Error adding trace(1569): update: no space left on device"}
{"time":"2026-04-23T22:48:27.12438688Z","level":"ERROR","msg":"Error adding trace(3960): update: no space left on device"}
{"time":"2026-04-23T22:48:27.124393159Z","level":"ERROR","msg":"Error adding trace(4096): update: no space left on device"}
{"time":"2026-04-23T22:48:27.124399773Z","level":"ERROR","msg":"Error adding trace(5250): update: no space left on device"}
{"time":"2026-04-23T22:48:27.12440761Z","level":"ERROR","msg":"Error adding trace(6331): update: no space left on device"}

I haven't had time to deep dive on it yet but I threw gpt 5.4 at it and I'm trying to get it to tweak things until it works.

I suspect that the amount of traffic the load balancer gets is causing some buffer to fill faster than it gets drained.

If I can get things working I'll comment on the upstream PR.

@dalehamel (Member, Author) commented:

The profiler binary and OTLP reporter work fine: the gRPC connection is established and the report loop runs. The problem is upstream of the reporter: zero samples ever arrive at ReportTraceEvent.

Here's the chain:

1. LuaJIT trace mapping sync consumes the pid_page_to_mapping_info BPF map. Each nginx worker has ~6,500-7,000 JIT traces generating ~65,000 LPM prefixes. With 20+ nginx workers, the first 7-8 workers exhaust the map's 524,288-entry soft limit (half of 1M max_entries).
2. All subsequent processes, including the remaining nginx workers, fail to insert their dummy PID entry via processNewMapping → UpdatePidPageMappingInfo. The log shows this clearly: ERROR_4012: native_no_pid_page_mapping means the eBPF tracer saw a sample from a PID it doesn't recognize.
3. Without the dummy entry, the eBPF perf program drops the sample entirely. It never reaches the userspace perf buffer, so ReportTraceEvent is never called, and the OTLP reporter sends empty batches (which it silently skips at Debug log level).
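To make the "~6,500-7,000 traces generating ~65,000 LPM prefixes" figure more concrete (roughly ten prefixes per trace), here is a minimal sketch of how an arbitrary address range is typically split into longest-prefix-match keys. This is an assumption about the general technique, not the profiler's actual code: the `prefix` type and `rangeToPrefixes` function are hypothetical names, and the decomposition the unwinder really uses is not shown in this thread.

```go
package main

import (
	"fmt"
	"math/bits"
)

// prefix is a hypothetical LPM key: the top `length` bits of `key` are significant.
type prefix struct {
	key    uint64
	length uint32
}

// rangeToPrefixes splits the half-open range [start, end) into aligned
// power-of-two blocks, each of which can be stored as one LPM prefix.
func rangeToPrefixes(start, end uint64) []prefix {
	var out []prefix
	for start < end {
		// Largest block size permitted by the alignment of start.
		shift := bits.TrailingZeros64(start)
		if shift > 63 {
			shift = 63 // start == 0: cap so the shifts below stay in range
		}
		// Shrink the block until it no longer overshoots end.
		for uint64(1)<<shift > end-start {
			shift--
		}
		out = append(out, prefix{key: start, length: uint32(64 - shift)})
		start += uint64(1) << shift
	}
	return out
}

func main() {
	// An unaligned 13-byte range needs several prefixes; an aligned 4 KiB page needs one.
	for _, r := range [][2]uint64{{0x1003, 0x1010}, {0x1000, 0x2000}} {
		ps := rangeToPrefixes(r[0], r[1])
		fmt.Printf("[%#x, %#x): %d prefixes\n", r[0], r[1], len(ps))
		for _, p := range ps {
			fmt.Printf("  key=%#x len=%d\n", p.key, p.length)
		}
	}
}
```

The point of the sketch is that small, unaligned JIT code ranges cost several map entries each, which is why the entry count grows much faster than the trace count.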

Scale issue, I guess.

Commit c1dc372 (raise pidPageMappingInfoSize):
LuaJIT trace prefixes consume ~65K entries per nginx worker. With 16+
workers on ingress-nginx nodes, the default 524K map is exhausted,
blocking native process mappings and killing all profiling on the node.

Raise the BPF map from 2^19 (524K) to 2^22 (4M) and the Go-side
loader override from 2^20 to 2^22 to match.
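
A quick back-of-the-envelope check of the sizes quoted above, as a sketch only. It assumes the ~65K prefixes per worker figure holds uniformly and ignores entries used by every other process on the node, so the real headroom is smaller. `workersThatFit` is a hypothetical helper, not a function in the profiler.

```go
package main

import "fmt"

// workersThatFit reports how many workers can register all of their prefixes
// before a map holding mapEntries entries is full.
func workersThatFit(mapEntries, prefixesPerWorker int) int {
	if prefixesPerWorker <= 0 {
		return 0
	}
	return mapEntries / prefixesPerWorker
}

func main() {
	const prefixesPerWorker = 65_000 // approximate figure quoted in this thread

	// Compare the old (2^19) and new (2^22) map sizes.
	for _, exp := range []uint{19, 22} {
		entries := 1 << exp
		fmt.Printf("2^%d = %d entries: room for %d workers\n",
			exp, entries, workersThatFit(entries, prefixesPerWorker))
	}
}
```

Under these assumptions the old size fits about 8 workers and the new one about 64, which lines up with the observation that the first 7-8 workers exhausted the map.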

Commit 4e5b14a (degraded GCproto fallback):
When the LuaJIT GC relocates a proto object between eBPF sample capture
and Go-side symbolization, the debug-info pointers (lineinfo, uvinfo,
varinfo) can be out of range. Previously this rejected the entire proto.

Now: if the chunkname is still readable, return a degraded proto that
provides file name attribution (e.g. '@/etc/nginx/lua/sorting_hat.lua')
even without line numbers. This is much more useful than the raw
'[ ()]: +0x0' that appeared before.

The proto is cached with zeroed debug pointers so downstream code
(getLine, getVarname, etc.) safely returns zero/empty.
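
The fallback described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the patch itself: `rawProto`, `proto`, `newProto`, and `inRange` are hypothetical names, and the real validity check on lineinfo, uvinfo, and varinfo may differ from a plain range test.

```go
package main

import (
	"errors"
	"fmt"
)

// rawProto is a hypothetical snapshot of GCproto fields read from the target process.
type rawProto struct {
	chunkName string // empty when the chunkname could not be read
	lineInfo  uint64
	uvInfo    uint64
	varInfo   uint64
}

// proto is the cached, symbolization-side view. Zero debug pointers mean
// "no debug info available".
type proto struct {
	chunkName                 string
	lineInfo, uvInfo, varInfo uint64
}

var errInvalidProto = errors.New("invalid GCproto object")

// inRange reports whether ptr lies in [lo, hi).
func inRange(ptr, lo, hi uint64) bool { return ptr >= lo && ptr < hi }

// newProto returns a full proto when all debug pointers are plausible, a
// degraded proto (chunkname only) when they are not, and an error only when
// even the chunkname is unreadable.
func newProto(r rawProto, lo, hi uint64) (*proto, error) {
	if inRange(r.lineInfo, lo, hi) && inRange(r.uvInfo, lo, hi) && inRange(r.varInfo, lo, hi) {
		return &proto{chunkName: r.chunkName, lineInfo: r.lineInfo, uvInfo: r.uvInfo, varInfo: r.varInfo}, nil
	}
	if r.chunkName == "" {
		return nil, errInvalidProto
	}
	// Degraded: keep file attribution, zero the debug pointers.
	return &proto{chunkName: r.chunkName}, nil
}

// degraded reports whether the proto carries no debug info.
func (p *proto) degraded() bool {
	return p.lineInfo == 0 && p.uvInfo == 0 && p.varInfo == 0
}

func main() {
	stale := rawProto{chunkName: "@/etc/nginx/lua/sorting_hat.lua", lineInfo: 0xdeadbeef}
	p, err := newProto(stale, 0x1000, 0x2000)
	fmt.Println(p.chunkName, p.degraded(), err)
}
```

Downstream helpers only need to treat a zero pointer as "nothing to report" for the degraded proto to be safe to cache.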

Commit 8b57f24 (GCproto pre-caching via GCtrace.startpt):
The LuaJIT GC can free/relocate GCproto objects between eBPF sample
capture and Go-side symbolization (via process_vm_readv), causing
'invalid GCproto object' failures for nearly all Lua-level frames.

Fix: during SynchronizeMappings, read the GCtrace.startpt field (the
prototype each trace was compiled from) and eagerly cache the GCproto
via getGCproto(). Since the traces are alive during sync, their protos
are guaranteed valid at that moment. The cached protos are then available
for symbolization later, even if the GC recycles the memory.

Also read GCtrace.startpt (offset 0x40) in loadTraces so each trace
carries its prototype address.
@dalehamel force-pushed the dale/luajit-unwinder branch from c5dc758 to 8b57f24 on April 24, 2026 18:41
@github-actions commented:

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions (bot) added the Stale label on May 23, 2026