Add LuaJIT interpreter on latest main by dalehamel · Pull Request #42 · Shopify/opentelemetry-ebpf-profiler

dalehamel · 2026-04-23T16:37:10Z

LuaJIT unwinder patches for production ingress-nginx

Cherry-pick of upstream PR #1236 plus Shopify-specific fixes, tested on production ingress-nginx nodes with OpenResty/LuaJIT.

Production testing results (2026-04-24)

The upstream LuaJIT unwinder works. Profiles flow correctly including Lua-level frames. Our initial zero-sample issue was caused by our ingestion side (missing "luajit" frame kind in the clickhouse sink), not by the unwinder itself. That ingestion fix shipped separately.

We ran A/B/C tests on the same production ingress-nginx node (routing-us-ce1-ej9) with three variants:

A (upstream baseline): profiles flow, but ENOSPC errors in logs and ERROR_4012 error frames in output
B (buffer fix only): ENOSPC errors eliminated, ~2x sample count
C (all fixes): visually identical to B — GCproto pre-caching did not produce a visible difference

Commits

| Commit | Purpose | Upstream-relevant? |
| --- | --- | --- |
| `adb5329` | Fix arm64 lint/test drift from cherry-pick | ✅ Yes |
| `4148ca4` | eBPF verifier fix (FRAMES_PER_WALK 15→8) for kernels ≥6.16 | ✅ Yes |
| `c1dc372` | Raise `pidPageMappingInfoSize` 1M→4M; eliminates ENOSPC log spam at scale | ✅ Yes (main fix) |
| `4e5b14a` | Degraded GCproto fallback: extract chunkname when debug pointers are stale | ❌ Non-critical |
| `8b57f24` | GCproto pre-caching via GCtrace.startpt | ❌ Non-critical |

Only c1dc372 (buffer raise) produced a measurable difference in production testing. The GCproto patches are informational improvements that did not visibly change the flamegraph output.

Upstream comment

Full write-up with log evidence posted on the upstream PR:
open-telemetry#1236 (comment)

Related PRs

PR #43 — upstream cherry-pick base (target for this PR)

dalehamel · 2026-04-23T23:33:23Z

I kept running into issues:

{"time":"2026-04-23T22:48:27.124374999Z","level":"ERROR","msg":"Error adding trace(5876): update: no space left on device"}
{"time":"2026-04-23T22:48:27.124381043Z","level":"ERROR","msg":"Error adding trace(1569): update: no space left on device"}
{"time":"2026-04-23T22:48:27.12438688Z","level":"ERROR","msg":"Error adding trace(3960): update: no space left on device"}
{"time":"2026-04-23T22:48:27.124393159Z","level":"ERROR","msg":"Error adding trace(4096): update: no space left on device"}
{"time":"2026-04-23T22:48:27.124399773Z","level":"ERROR","msg":"Error adding trace(5250): update: no space left on device"}
{"time":"2026-04-23T22:48:27.12440761Z","level":"ERROR","msg":"Error adding trace(6331): update: no space left on device"}

I haven't had time to deep dive on it yet but I threw gpt 5.4 at it and I'm trying to get it to tweak things until it works.

I suspect the amount of traffic the load balancer gets is causing some buffer to fill faster than it gets drained.

If I can get things working I'll comment on the upstream PR.

dalehamel · 2026-04-24T00:13:30Z

The profiler binary and OTLP reporter work fine; the gRPC connection is established and the report loop runs. The problem is upstream of the reporter: zero samples ever arrive at ReportTraceEvent.

Here's the chain:

1. LuaJIT trace mapping sync consumes the pid_page_to_mapping_info BPF map. Each nginx worker has ~6,500-7,000 JIT traces generating ~65,000 LPM prefixes. With 20+ nginx workers, the first 7-8 workers exhaust the map's 524,288-entry soft limit (half of 1M max_entries).
2. All subsequent processes, including the remaining nginx workers, fail to insert their dummy PID entry via processNewMapping → UpdatePidPageMappingInfo. The log shows this clearly: ERROR_4012: native_no_pid_page_mapping means the eBPF tracer saw a sample from a PID it doesn't recognize.
3. Without the dummy entry, the eBPF perf program drops the sample entirely; it never reaches the userspace perf buffer, so ReportTraceEvent is never called, and the OTLP reporter sends empty batches (which it silently skips at Debug log level).
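For context on why ~6,500-7,000 traces turn into ~65,000 entries (roughly nine to ten per trace): an LPM entry covers a power-of-two-sized, aligned block of addresses, so an arbitrary address range has to be split into several such blocks. The sketch below illustrates that generic decomposition. It is not the profiler's actual code; the `prefix` type and `rangeToPrefixes` function are hypothetical names made up for this example.

```go
package main

import (
	"fmt"
	"math/bits"
)

// prefix is a hypothetical LPM key: the first Bits bits of Key are significant.
type prefix struct {
	Key  uint64
	Bits int
}

// rangeToPrefixes splits the half-open range [start, end) into the minimal
// list of aligned power-of-two blocks, each expressible as one LPM prefix.
func rangeToPrefixes(start, end uint64) []prefix {
	var out []prefix
	for start < end {
		// Largest block that is aligned at the current start address.
		size := uint64(1) << 63
		if start != 0 {
			size = uint64(1) << bits.TrailingZeros64(start)
		}
		// Shrink the block until it no longer overshoots the end.
		for size > end-start {
			size >>= 1
		}
		out = append(out, prefix{Key: start, Bits: 64 - bits.TrailingZeros64(size)})
		start += size
	}
	return out
}

func main() {
	// Illustrative addresses only; real JIT trace ranges will differ.
	for _, p := range rangeToPrefixes(0x1003, 0x1010) {
		fmt.Printf("key=%#x prefixlen=%d\n", p.Key, p.Bits)
	}
}
```

Poorly aligned start and end addresses need more blocks, which is one plausible reason the per-trace entry count is as high as it is.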

Scale issue, I guess.

LuaJIT trace prefixes consume ~65K entries per nginx worker. With 16+ workers on ingress-nginx nodes, the default 524K map is exhausted, blocking native process mappings and killing all profiling on the node. Raise the BPF map from 2^19 (524K) to 2^22 (4M) and the Go-side loader override from 2^20 to 2^22 to match.
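A back-of-the-envelope check of those numbers, as a sketch: it treats the stated map sizes as fully usable by LuaJIT prefixes and ignores entries needed by every other process on the node, so real headroom is lower. `workersThatFit` is a hypothetical helper, not something in the profiler.

```go
package main

import "fmt"

// workersThatFit returns how many workers can register all of their LPM
// prefixes before a map with mapEntries slots is full.
func workersThatFit(mapEntries, prefixesPerWorker uint64) uint64 {
	return mapEntries / prefixesPerWorker
}

func main() {
	const prefixesPerWorker = 65_000 // approximate figure from the comment above

	for _, shift := range []uint{19, 22} {
		entries := uint64(1) << shift
		fmt.Printf("2^%d = %d entries: room for %d workers\n",
			shift, entries, workersThatFit(entries, prefixesPerWorker))
	}
}
```

Under these assumptions the 2^19 limit fits 8 workers, which matches the 7-8 observed before inserts start failing, and 2^22 fits 64.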

When the LuaJIT GC relocates a proto object between eBPF sample capture and Go-side symbolization, the debug-info pointers (lineinfo, uvinfo, varinfo) can be out of range. Previously this rejected the entire proto. Now: if the chunkname is still readable, return a degraded proto that provides file name attribution (e.g. '@/etc/nginx/lua/sorting_hat.lua') even without line numbers. This is much more useful than the raw '[ ()]: +0x0' that appeared before. The proto is cached with zeroed debug pointers so downstream code (getLine, getVarname, etc.) safely returns zero/empty.
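The fallback logic described above can be sketched roughly as follows. This is an illustration only: the `rawProto` and `proto` types, the `buildProto` function, and the `[lo, hi)` bounds check are all hypothetical stand-ins, and the real patch's structures and validity checks may look quite different.

```go
package main

import (
	"errors"
	"fmt"
)

// rawProto holds values as read from the target process (hypothetical layout).
type rawProto struct {
	Chunkname                 string // empty if the chunkname was unreadable
	LineInfo, UVInfo, VarInfo uint64 // remote debug-info pointers
}

// proto is what gets cached. Zeroed debug pointers mean "no debug info".
type proto struct {
	Chunkname                 string
	LineInfo, UVInfo, VarInfo uint64
	Degraded                  bool
}

func inRange(p, lo, hi uint64) bool { return p >= lo && p < hi }

// buildProto accepts a proto whose debug pointers all fall in [lo, hi).
// If they do not, it falls back to a chunkname-only proto rather than
// rejecting the object outright.
func buildProto(r rawProto, lo, hi uint64) (*proto, error) {
	if inRange(r.LineInfo, lo, hi) && inRange(r.UVInfo, lo, hi) && inRange(r.VarInfo, lo, hi) {
		return &proto{Chunkname: r.Chunkname, LineInfo: r.LineInfo,
			UVInfo: r.UVInfo, VarInfo: r.VarInfo}, nil
	}
	if r.Chunkname != "" {
		// Degraded: keep file attribution, zero the stale pointers so that
		// downstream lookups can detect them and return zero/empty.
		return &proto{Chunkname: r.Chunkname, Degraded: true}, nil
	}
	return nil, errors.New("invalid GCproto object")
}

func main() {
	stale := rawProto{Chunkname: "@/etc/nginx/lua/sorting_hat.lua",
		LineInfo: 0x9000, UVInfo: 0x1100, VarInfo: 0x1200}
	p, err := buildProto(stale, 0x1000, 0x2000)
	fmt.Println(p.Chunkname, p.Degraded, err)
}
```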

The LuaJIT GC can free/relocate GCproto objects between eBPF sample capture and Go-side symbolization (via process_vm_readv), causing 'invalid GCproto object' failures for nearly all Lua-level frames. Fix: during SynchronizeMappings, read the GCtrace.startpt field (the prototype each trace was compiled from) and eagerly cache the GCproto via getGCproto(). Since the traces are alive during sync, their protos are guaranteed valid at that moment. The cached protos are then available for symbolization later, even if the GC recycles the memory. Also read GCtrace.startpt (offset 0x40) in loadTraces so each trace carries its prototype address.

github-actions · 2026-05-23T08:55:24Z

This PR was marked stale due to lack of activity. It will be closed in 14 days.

dalehamel mentioned this pull request Apr 24, 2026

Add LuaJIT interpreter open-telemetry/opentelemetry-ebpf-profiler#1236

Draft

dalehamel changed the base branch from main to dale/luajit-upstream-base April 24, 2026 12:41

dalehamel mentioned this pull request Apr 24, 2026

Cherry-pick upstream LuaJIT interpreter onto Shopify main #43

Draft

dalehamel force-pushed the dale/luajit-unwinder branch from f9edc5f to ee5dd66 Compare April 24, 2026 12:59

dalehamel added 5 commits April 24, 2026 14:30

Fix Linux arm64 lint and test drift

adb5329

Keep LuaJIT verifier work below the later-kernel limit

4148ca4

dalehamel force-pushed the dale/luajit-unwinder branch from c5dc758 to 8b57f24 Compare April 24, 2026 18:41

github-actions Bot added the Stale label May 23, 2026

dalehamel wants to merge 5 commits into dale/luajit-upstream-base from dale/luajit-unwinder
