[sight] feat(sight): add real-time OOM crash detection for agent processes #778

@Daydreamer-Li

Description

Problem

When an agent process is OOM-killed by the kernel, AgentSight cannot distinguish it from any other crash (a SIGKILL from another source, a segfault, etc.). There is no mechanism to attribute the crash to OOM in the agent_crash interruption event.

Additionally, if AgentSight itself is OOM-killed and restarts, historical OOM events that occurred during downtime are lost.

Proposed Solution

Implement an OOM detection mechanism with two live detection paths plus a startup recovery path:

  1. Real-time drain path (1s cycle): After persisting pending calls for dead PIDs in the trace process, immediately query dmesg to check whether the kernel OOM-killed them. On a match, generate an agent_crash event with oom:true and source:drain+dmesg.

  2. HealthChecker backup path (30s cycle): When the serve process detects an agent disappearing, check dmesg for OOM confirmation as a fallback.

  3. Startup recovery path: On AgentSight restart, scan dmesg history to recover OOM events that occurred while AgentSight was down.
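The dmesg confirmation step shared by paths 1 and 2 could look roughly like the sketch below. It is only an illustration, not the actual AgentSight code: the function names and the "type" and "pid" event fields are hypothetical, and only agent_crash, oom and source:drain+dmesg come from this proposal. It assumes the kernel log contains lines of the form "Killed process <pid> (<comm>) ...".

```python
import re

# Matches kernel log lines such as:
#   "Memory cgroup out of memory: Killed process 4242 (python3) total-vm:..."
KILLED_RE = re.compile(r"Killed process (\d+) \(([^)]*)\)")


def oom_killed_pids(dmesg_text):
    """Map each PID named in a "Killed process" line to its command name."""
    killed = {}
    for line in dmesg_text.splitlines():
        match = KILLED_RE.search(line)
        if match:
            killed[int(match.group(1))] = match.group(2)
    return killed


def build_crash_events(dead_pids, dmesg_text, source="drain+dmesg"):
    """Build one agent_crash event per dead PID, marking OOM kills.

    The "type" and "pid" keys are hypothetical; only oom and source
    are named in the proposal.
    """
    killed = oom_killed_pids(dmesg_text)
    events = []
    for pid in dead_pids:
        event = {"type": "agent_crash", "pid": pid, "oom": pid in killed}
        if event["oom"]:
            event["source"] = source
        events.append(event)
    return events
```

Matching on PID alone can misattribute a crash if the kernel has reused the PID, so a real implementation would probably also compare the command name or the kill timestamp.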

Also removes stale integration test scripts from version control (test artifacts that should not be tracked).

Component

  • sight (agentsight)

Additional Context

  • Uses dmesg -T to scan for "Killed process " lines
  • The drain path checks every 1s, so it catches OOM events far sooner than the HealthChecker's 30s cycle
  • Verified end-to-end on a remote machine, using cgroup memory limits to trigger an OOM kill
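For the startup recovery scan, one possible approach is to parse the human-readable timestamps that dmesg -T prints and keep only the "Killed process " lines logged since AgentSight went down. The sketch below assumes the usual "[Tue Mar 12 10:15:42 2024]" prefix and an English locale. The function names, the down_since parameter, the event fields other than oom, and the "startup+dmesg" source value are all hypothetical.

```python
import re
import subprocess
from datetime import datetime

# "[Tue Mar 12 10:15:42 2024] ... Killed process 4242 (python3) ..."
LINE_RE = re.compile(r"^\[([^\]]+)\].*Killed process (\d+) \(([^)]*)\)")


def read_dmesg():
    """Return the output of `dmesg -T`; may require elevated privileges."""
    result = subprocess.run(
        ["dmesg", "-T"], capture_output=True, text=True, check=True
    )
    return result.stdout


def recover_oom_events(dmesg_text, down_since):
    """Collect OOM kills logged at or after `down_since` (a naive datetime)."""
    events = []
    for line in dmesg_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        try:
            when = datetime.strptime(match.group(1).strip(), "%a %b %d %H:%M:%S %Y")
        except ValueError:
            continue  # skip lines whose timestamp is not in the expected form
        if when >= down_since:
            events.append({
                "type": "agent_crash",
                "pid": int(match.group(2)),
                "comm": match.group(3),
                "oom": True,
                "source": "startup+dmesg",  # hypothetical value
                "time": when.isoformat(),
            })
    return events
```

A caller would pass read_dmesg() output together with the last time AgentSight was known to be running. The kernel ring buffer has a fixed size, so kills from a long outage may already have been overwritten, and dmesg -T timestamps can drift after suspend/resume.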
