Problem
When an agent process is OOM-killed by the kernel, AgentSight cannot distinguish it from a normal crash (SIGKILL, segfault, etc.). There is no mechanism to attribute the crash to OOM in the agent_crash interruption event.
Additionally, if AgentSight itself is OOM-killed and restarts, historical OOM events that occurred during downtime are lost.
Proposed Solution
Implement a dual-path OOM detection mechanism (a fast drain path and a slower backup path), plus a recovery pass at startup:
- Real-time drain path (1s cycle): After persisting pending calls for dead PIDs in the trace process, immediately query dmesg to confirm the OOM kill. Generates an agent_crash event with oom:true and source:drain+dmesg.
- HealthChecker backup path (30s cycle): When the serve process detects that an agent has disappeared, check dmesg for OOM confirmation as a fallback.
- Startup recovery path: On AgentSight restart, scan dmesg history to recover OOM events that occurred while AgentSight was down.
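As a rough illustration of the drain path, the sketch below matches dead PIDs against dmesg output and builds the crash events. It is not AgentSight's actual code: the helper names (oom_killed_pids, build_crash_events) and the "type" and "pid" event fields are assumptions made for illustration. Only agent_crash, oom:true and source:drain+dmesg come from the description above.

```python
import re

# Matches the kernel OOM-killer log line, for example:
# "... Out of memory: Killed process 4321 (agent) total-vm:..."
KILLED_RE = re.compile(r"Killed process (\d+) \(([^)]*)\)")


def oom_killed_pids(dmesg_text):
    """Return the set of PIDs that the kernel log reports as OOM-killed."""
    return {int(m.group(1)) for m in KILLED_RE.finditer(dmesg_text)}


def build_crash_events(dead_pids, dmesg_text, source="drain+dmesg"):
    """Build one agent_crash event per dead PID, flagging confirmed OOM kills.

    The event layout is hypothetical; only the oom and source fields are
    named in the proposal.
    """
    killed = oom_killed_pids(dmesg_text)
    events = []
    for pid in dead_pids:
        is_oom = pid in killed
        event = {"type": "agent_crash", "pid": pid, "oom": is_oom}
        if is_oom:
            # Record which detection path confirmed the OOM kill.
            event["source"] = source
        events.append(event)
    return events
```

Matching on PID alone can misattribute a kill if the PID was reused, so a real implementation would probably also compare the log timestamp or process name against the agent's lifetime.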
Also removes stale integration test scripts from version control (test artifacts that should not be tracked).
Component
sight(agentsight)
Additional Context
- Uses dmesg -T to scan for "Killed process " lines
- The drain path (1s) is much faster than HealthChecker (30s) for catching OOM events
- Verified end-to-end on a remote machine with cgroup memory limits triggering an OOM kill
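For the startup recovery scan, a minimal sketch of parsing dmesg -T output could look like the following. The function names and the returned dictionary layout are hypothetical, and the timestamp parsing assumes an English locale. Keep in mind that reading dmesg may require elevated privileges when kernel.dmesg_restrict is enabled, that -T timestamps can be inaccurate after suspend/resume, and that the kernel ring buffer is finite, so older kills may already have been overwritten.

```python
import re
import subprocess
from datetime import datetime

# Example line produced by `dmesg -T`:
# [Mon Jan  1 12:00:00 2024] Out of memory: Killed process 100 (agent) total-vm:...
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\].*Killed process (?P<pid>\d+) \((?P<comm>[^)]*)\)"
)


def read_dmesg():
    """Return the kernel log with human-readable timestamps."""
    result = subprocess.run(
        ["dmesg", "-T"], capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_oom_kills(dmesg_text, since=None):
    """Extract OOM kills, optionally keeping only those at or after `since`."""
    kills = []
    for line in dmesg_text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue
        try:
            ts = datetime.strptime(m.group("ts").strip(), "%a %b %d %H:%M:%S %Y")
        except ValueError:
            ts = None  # Unparseable timestamp: keep the entry, skip the filter.
        if since is not None and ts is not None and ts < since:
            continue
        kills.append(
            {"time": ts, "pid": int(m.group("pid")), "comm": m.group("comm")}
        )
    return kills
```

On restart, something like parse_oom_kills(read_dmesg(), since=last_seen) would return the kills logged during downtime, where last_seen is an assumed timestamp that AgentSight would have to persist before going down.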