feat(ai): collect agent activity (raw lines + client-side redaction)#1259

Draft
amascia-gg wants to merge 14 commits into main from amascia/agent-activity-metadata

@amascia-gg

Member

Context

Builds on the MCP-history framework (#1244 / #1257). Today ggshield ai discover --history ships only MCP PreToolUse events (~5% of agent activity). This adds collection of per-event usage metadata for the whole session, to power agentic-AI usage dashboards.

Design: AI Agent Activity Pipeline.

What has been done

  • Per-agent activity sources (Claude Code, Codex, Cursor) walk the on-disk transcripts/SQLite and emit one event per record via client.send_agent_activity(...).
  • Metadata-only, fail-closed. Each source's serialize() builds a NEW dict of allow-listed safe fields (event_type, tool name, model, timestamps, …) and drops every free-text field (prompts, command strings, tool inputs/outputs, file contents). The base serialize() raises by default, so a source can never accidentally ship raw content — no secret or PII ever leaves the machine (no server-side redaction needed).
  • PII/size hardening: home paths anonymised (/Users/x → ~); per-record size cap + byte-batching.
  • Renamed the package raw_history → agent_activity (it's metadata, not raw).
  • Review fixes to the original framework: GGClient typing + Detail error handling in the orchestrator; home-path leak fix in source_path_for.
  • Bumps pygitguardian to the commit adding send_agent_activity() (feat: add send_agent_activity endpoint py-gitguardian#175).
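
The fail-closed, allow-list behaviour and the home-path anonymisation described in this list can be sketched as follows. This is a minimal illustration, not the PR's code: `ActivitySource`, `ExampleSource`, `SAFE_FIELDS` and `anonymise_home` are hypothetical names, and the field list only mirrors the examples given above (event_type, tool name, model, timestamps).

```python
from typing import Any, Dict

# Hypothetical allow-list, modelled on the fields named in the description.
SAFE_FIELDS = ("event_type", "tool_name", "model", "timestamp")


class ActivitySource:
    """Base class: refuses to serialize unless a subclass opts in."""

    def serialize(self, record: Dict[str, Any]) -> Dict[str, Any]:
        # Fail closed: a source that forgets to override ships nothing.
        raise NotImplementedError("serialize() must be overridden with an allow-list")


class ExampleSource(ActivitySource):
    def serialize(self, record: Dict[str, Any]) -> Dict[str, Any]:
        # Build a NEW dict from allow-listed keys only; free-text fields
        # (prompts, commands, tool inputs/outputs) are never copied.
        return {key: record[key] for key in SAFE_FIELDS if key in record}


def anonymise_home(path: str, home: str) -> str:
    """Replace a leading home directory with "~" (POSIX-style paths assumed)."""
    if path == home:
        return "~"
    prefix = home.rstrip("/") + "/"
    if path.startswith(prefix):
        return "~/" + path[len(prefix):]
    return path
```

Copying allow-listed keys into a fresh dict, rather than deleting known-bad keys from the original, means an unexpected new field in a transcript is dropped by default.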

Validation

  • pytest tests/unit/verticals/ai/ → 387 passing.
  • Verified end-to-end against a local GIM (ward-runs-app): ggshield ai discover --history → POST /v1/nhi/ai/activity → ClickHouse agent_activity. Stored content is metadata only (e.g. {"type":"assistant","model":"…","tools":["Bash"]}); a table-wide sweep confirmed prompts/commands/outputs are absent.

Related / depends on

PR check list

  • As much as possible, the changes include tests (unit and/or functional)
  • Changelog entry — TODO before un-drafting.

Draft. Based on pbeslin/collect-ai-agent-history (#1257); registering the agent sources turns on real data shipping, so this is the "sources + metadata" layer on top of the framework.

🤖 Generated with Claude Code

@amascia-gg amascia-gg force-pushed the amascia/agent-activity-metadata branch from 47fe3b3 to a83e1b2 Compare June 2, 2026 08:54
@codecov

codecov Bot commented Jun 2, 2026


Codecov Report

❌ Patch coverage is 95.20548% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.43%. Comparing base (1e6b8da) to head (567dd79).

Files with missing lines | Patch % | Lines
ggshield/verticals/ai/agent_activity/readers.py | 82.05% | 7 Missing ⚠️
ggshield/verticals/ai/agent_activity/sources.py | 95.83% | 3 Missing ⚠️
ggshield/cmd/ai/discover.py | 90.47% | 2 Missing ⚠️
...shield/verticals/ai/agent_activity/orchestrator.py | 97.10% | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1259      +/-   ##
==========================================
+ Coverage   93.38%   93.43%   +0.05%     
==========================================
  Files         189      194       +5     
  Lines       10782    11067     +285     
==========================================
+ Hits        10069    10341     +272     
- Misses        713      726      +13     
Flag Coverage Δ
unittests 93.43% <95.20%> (+0.05%) ⬆️


@amascia-gg amascia-gg force-pushed the amascia/agent-activity-metadata branch 5 times, most recently from e7f8375 to 6013a56 Compare June 2, 2026 10:58
@amascia-gg amascia-gg changed the title feat(ai): collect agent-activity metadata in ai discover --history feat(ai): collect agent activity (raw lines + client-side redaction) Jun 2, 2026
@amascia-gg amascia-gg force-pushed the amascia/agent-activity-metadata branch from 6013a56 to 21c6f29 Compare June 2, 2026 11:19
@amascia-gg amascia-gg self-assigned this Jun 2, 2026
@amascia-gg amascia-gg force-pushed the amascia/agent-activity-metadata branch from 44898bb to a29f94b Compare June 3, 2026 08:57
@amascia-gg amascia-gg changed the base branch from pbeslin/collect-ai-agent-history to main June 3, 2026 08:57
@amascia-gg amascia-gg force-pushed the amascia/agent-activity-metadata branch from 1170e1f to 52f4a66 Compare June 3, 2026 20:22

@xblanchot-gg xblanchot-gg left a comment

Member

First batch of comments; I haven't finished the review yet because the PR is quite long. Overall it looks good, but I think there is a bug somewhere: when testing on my machine, only 10k of 63k events were actually sent to GitGuardian.

Comment thread ggshield/verticals/ai/agent_activity/orchestrator.py Outdated
Comment thread ggshield/verticals/ai/agent_activity/orchestrator.py Outdated
Comment thread ggshield/verticals/ai/agent_activity/readers.py Outdated
Comment thread ggshield/verticals/ai/agent_activity/orchestrator.py Outdated
Comment thread ggshield/verticals/ai/agents/copilot.py Outdated
Comment thread tests/unit/verticals/ai/agent_activity/test_models.py Outdated
Comment thread tests/unit/verticals/ai/agent_activity/test_orchestrator.py Outdated

@xblanchot-gg xblanchot-gg left a comment

Member

Second batch, a few other comments, nothing really blocking on my side

@paulpetit-gg-ext paulpetit-gg-ext force-pushed the amascia/agent-activity-metadata branch from d39e520 to 121240c Compare June 9, 2026 08:30
amascia-gg and others added 2 commits June 11, 2026 17:30
Ship full AI-agent session activity to GitGuardian, not just MCP tool calls.
ggshield discovers each agent's on-disk transcripts / databases and sends the
raw records verbatim; GitGuardian scans them and strips secrets server-side
before storing them, so the client stays "dumb" and the data shape never
depends on the ggshield version.

Sources cover Claude Code, Codex, Cursor, Copilot CLI and VSCode (Copilot
Chat). Records are batched (by count and by bytes) and posted to the
agent-activity endpoint; the server deduplicates per record.

Issue: NHI-1628

Co-Authored-By: Paul Beslin <paul.beslin-ext@gitguardian.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Embedded images and thinking-block signatures are large opaque base64 that
carry no analytical value — on real transcripts they were ~a third of the
payload and needlessly enlarge the server-side secret scan. Replace any
pure-base64 string value over 4 KB with a short placeholder, keeping the rest
of the record (and the blob's sibling metadata) verbatim. Textual fields like
stdout have whitespace, so they are never matched.

Issue: NHI-1628

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
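
The blob stripping described in the commit message above might look roughly like this. It is a sketch under assumptions: the function name `strip_blobs`, the placeholder text, and reading "4 KB" as 4096 characters are all hypothetical, not taken from the PR.

```python
import re
from typing import Any

# Assumption: "over 4 KB" is interpreted as more than 4096 characters.
MAX_BLOB_CHARS = 4096
# Pure base64: alphabet characters only, optional "=" padding, no whitespace.
_BASE64_RE = re.compile(r"[A-Za-z0-9+/]+={0,2}")


def strip_blobs(value: Any) -> Any:
    """Return a copy of value with large pure-base64 strings replaced."""
    if isinstance(value, str):
        if len(value) > MAX_BLOB_CHARS and _BASE64_RE.fullmatch(value):
            # Hypothetical placeholder format; keeps the original length.
            return f"<base64 blob removed: {len(value)} chars>"
        return value
    if isinstance(value, dict):
        # Sibling keys (e.g. a media type next to the blob) are kept verbatim.
        return {key: strip_blobs(item) for key, item in value.items()}
    if isinstance(value, list):
        return [strip_blobs(item) for item in value]
    return value
```

Because the pattern must match the whole string, any field containing a space or newline (stdout, prompts, diffs) fails the match and passes through unchanged.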
@paulpetit-gg-ext paulpetit-gg-ext force-pushed the amascia/agent-activity-metadata branch from 567dd79 to 28f4470 Compare June 11, 2026 15:31
amascia-gg and others added 7 commits June 11, 2026 17:32
Send the discovered UserInfo (machine_id + hostname/username/email) with every
agent-activity batch so GitGuardian can attribute the records and correlate
them with the machine inventory (the machine_id matches the machine scan).
Reuses the user already gathered for discovery.

Issue: NHI-1628

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The server reports how many records it could not scan and dropped (never
stored). Thread that count through AgentActivityBatchResult and the report, and
print it in the `ai discover --history` summary when non-zero, so drops are
visible instead of silently making the counts not add up.

Issue: NHI-1628

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Each batch is one HTTP request and one ClickHouse insert (one part). 500 meant
~126 requests/parts for a 63k collection; server-side scanning is sub-ms per
record so a 5000-record batch still completes well under the request timeout.
The 5 MB byte cap stays the real guard against oversized requests.

Issue: NHI-1628

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
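
To illustrate the batching trade-off in the commit message above, here is a minimal sketch of splitting records by count and by bytes. `BATCH_SIZE` and `MAX_BATCH_BYTES` are names mentioned in this PR; `iter_batches`, the JSON-based size estimate, and reading "5 MB" as 5 * 1024 * 1024 bytes are assumptions made for the example.

```python
import json
import math
from typing import Any, Dict, Iterable, Iterator, List

BATCH_SIZE = 5000
MAX_BATCH_BYTES = 5 * 1024 * 1024  # assumption: "5 MB" taken as 5 MiB


def iter_batches(
    records: Iterable[Dict[str, Any]],
    batch_size: int = BATCH_SIZE,
    max_bytes: int = MAX_BATCH_BYTES,
) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of records bounded by both a count cap and a byte cap."""
    batch: List[Dict[str, Any]] = []
    batch_bytes = 0
    for record in records:
        size = len(json.dumps(record).encode("utf-8"))
        # Close the current batch before it would exceed either cap.
        # A single record larger than max_bytes still goes out alone.
        if batch and (len(batch) >= batch_size or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(record)
        batch_bytes += size
    if batch:
        yield batch


# Request counts for a 63,000-record collection at each batch size.
requests_at_500 = math.ceil(63_000 / 500)    # 126
requests_at_5000 = math.ceil(63_000 / 5000)  # 13
```

With small records the count cap decides the batch boundaries; with large records the byte cap triggers first, which is why it remains the real guard against oversized requests.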
Address review feedback on agent-activity collection:

- Wrap each agent's history walk in try/except so one agent failing does
  not abort collection for the agents that follow.
- Encapsulate the send buffer in a small _Batch value object, which drops
  the nonlocal closure and renames the misleading flush() (it performs an
  API call) to send_batch().
- Rename the two-letter `fp` to `handle` in iter_jsonl.
- Patch BATCH_SIZE / MAX_BATCH_BYTES to small values in the batching tests
  instead of allocating thousands of events, and drop the to_dict test that
  only exercised dataclasses.asdict.

Issue: NHI-1628

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
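
The per-agent isolation from the first bullet of the commit message above can be sketched like this. `collect_all` and the `walkers` mapping are hypothetical names; the real orchestrator's structure is not shown on this page.

```python
import logging
from typing import Callable, Dict, Iterable, List

logger = logging.getLogger(__name__)


def collect_all(walkers: Dict[str, Callable[[], Iterable[dict]]]) -> List[dict]:
    """Run every agent's history walk; a failing agent is logged and skipped."""
    collected: List[dict] = []
    for agent_name, walk in walkers.items():
        try:
            # Materialize first so a mid-walk failure contributes nothing
            # rather than a partial set of records.
            records = list(walk())
        except Exception:
            logger.exception("history walk failed for %s; continuing", agent_name)
            continue
        collected.extend(records)
    return collected
```

Catching `Exception` per agent is deliberately broad in this sketch: the goal is that a corrupt transcript for one agent never prevents collection for the agents that follow.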
Subagents spawned via the Task tool record their transcript in a separate file
(projects/*/<session>/subagents/agent-*.jsonl) that is never inlined into the
main session transcript, so all subagent activity was being missed. Add a
dedicated source with its own `subagent_transcript` kind.

Issue: NHI-1628

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Both sources hardcoded the Linux ~/.config base, so on macOS and Windows —
where VSCode-family editors keep their User data under
~/Library/Application Support and %APPDATA% — agent-activity collection found
nothing. Resolve the editor user-data dir per OS instead; VSCode records its
source_path relative to that dir so the shipped path stays OS-independent.

Issue: NHI-1628

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
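
Resolving the editor user-data directory per OS, as the commit message above describes, could be sketched as below. `editor_user_dir` and its parameters are hypothetical; the three base locations are the ones named in the commit message, and the `User` subdirectory follows the VSCode-family layout.

```python
import os
import sys
from pathlib import Path
from typing import Optional


def editor_user_dir(
    app_name: str,
    platform: str = sys.platform,
    home: Optional[Path] = None,
    appdata: Optional[str] = None,
) -> Path:
    """Return the 'User' data dir of a VSCode-family editor for this OS."""
    home = home or Path.home()
    if platform == "darwin":
        return home / "Library" / "Application Support" / app_name / "User"
    if platform.startswith("win"):
        # Fall back to the conventional roaming profile if APPDATA is unset.
        base = Path(appdata or os.environ.get("APPDATA") or home / "AppData" / "Roaming")
        return base / app_name / "User"
    # Linux and other POSIX systems: the previously hardcoded location.
    return home / ".config" / app_name / "User"
```

Passing `platform`, `home` and `appdata` as parameters keeps the function testable on any machine; for example, `editor_user_dir("Code", platform="darwin", home=Path("/Users/u"))` resolves under `/Users/u/Library/Application Support/Code/User`.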
Each subagents/agent-*.jsonl transcript has a sibling agent-*.meta.json holding
the subagent's agentType, description and the parent's toolUseId — the only
place that ties a subagent run back to the Task call that spawned it. Ship it
verbatim under a new `subagent_meta` kind.

Issue: NHI-1628

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
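
The two subagent commits above (transcripts, then their sibling metadata) suggest a discovery step along these lines. This is an illustrative sketch: `iter_subagent_files` is a hypothetical name, and `root` stands for whatever directory contains `projects/`, which the commit messages do not spell out.

```python
from pathlib import Path
from typing import Iterator, Optional, Tuple


def iter_subagent_files(root: Path) -> Iterator[Tuple[Path, Optional[Path]]]:
    """Yield (transcript, meta) pairs; meta is None when the sibling is absent."""
    # Mirrors projects/*/<session>/subagents/agent-*.jsonl from the commit message.
    for transcript in sorted(root.glob("projects/*/*/subagents/agent-*.jsonl")):
        # agent-<id>.jsonl -> agent-<id>.meta.json in the same directory.
        meta = transcript.with_name(transcript.stem + ".meta.json")
        yield transcript, (meta if meta.is_file() else None)
```

Yielding the pair together keeps the link between a subagent run and the metadata that ties it back to the parent's Task call, even though the two are shipped under different kinds.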
amascia-gg and others added 5 commits June 11, 2026 17:32
Cursor and VSCode keep their state.vscdb open in WAL mode, so a plain mode=ro
connection fails to open it ("unable to open database file") — Cursor activity
collection returned nothing whenever Cursor was running (i.e. almost always).
Open with immutable=1 to read the file as a static snapshot regardless of the
writer's lock; we never write back.

Issue: NHI-1628

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
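
The `immutable=1` approach from the commit message above can be sketched with the standard `sqlite3` module and a SQLite URI filename. `read_snapshot` is a hypothetical helper, and the query is left to the caller because the schema of `state.vscdb` is not described on this page.

```python
import sqlite3
from pathlib import Path
from typing import Any, List, Tuple


def read_snapshot(db_path: Path, query: str) -> List[Tuple[Any, ...]]:
    """Run a read-only query against db_path opened as an immutable snapshot."""
    # as_uri() yields a percent-encoded file: URI; immutable=1 tells SQLite
    # the file will not change, so it skips locking and change detection.
    uri = f"{db_path.resolve().as_uri()}?immutable=1"
    conn = sqlite3.connect(uri, uri=True)
    try:
        return conn.execute(query).fetchall()
    finally:
        conn.close()
```

One caveat worth keeping in mind: the SQLite documentation warns that marking a file immutable while another process is actually modifying it may produce incorrect results or corruption errors, so a reader built this way should probably treat query failures as "skip this database" rather than as fatal.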
@paulpetit-gg-ext paulpetit-gg-ext force-pushed the amascia/agent-activity-metadata branch from 28f4470 to e6d19e6 Compare June 11, 2026 15:32