feat(ai): collect agent activity (raw lines + client-side redaction) by amascia-gg · Pull Request #1259 · GitGuardian/ggshield

amascia-gg · 2026-06-01T15:15:36Z

Context

Builds on the MCP-history framework (#1244 / #1257). Today ggshield ai discover --history ships only MCP PreToolUse events (~5% of agent activity). This adds collection of per-event usage metadata for the whole session, to power agentic-AI usage dashboards.

Design: AI Agent Activity Pipeline.

What has been done

Per-agent activity sources (Claude Code, Codex, Cursor) walk the on-disk transcripts/SQLite and emit one event per record via client.send_agent_activity(...).
Metadata-only, fail-closed. Each source's serialize() builds a NEW dict of allow-listed safe fields (event_type, tool name, model, timestamps, …) and drops every free-text field (prompts, command strings, tool inputs/outputs, file contents). The base serialize() raises by default, so a source can never accidentally ship raw content — no secret or PII ever leaves the machine (no server-side redaction needed).
PII/size hardening: home paths anonymised (/Users/x → ~); per-record size cap + byte-batching.
Renamed the package raw_history → agent_activity (it's metadata, not raw).
Review fixes to the original framework: GGClient typing + Detail error handling in the orchestrator; home-path leak fix in source_path_for.
Bumps pygitguardian to the commit adding send_agent_activity() (feat: add send_agent_activity endpoint py-gitguardian#175).
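The "metadata-only, fail-closed" rule in the list above can be illustrated with a minimal sketch. The class names, the `SAFE_FIELDS` tuple, and `anonymise_home` are hypothetical and only mirror the description; they are not taken from the PR's code.

```python
from typing import Any, Dict

# Hypothetical allow-list; the real field names live in the PR's sources.
SAFE_FIELDS = ("event_type", "tool_name", "model", "timestamp")


class ActivitySource:
    """Base class: refuses to serialize unless a subclass opts in."""

    def serialize(self, record: Dict[str, Any]) -> Dict[str, Any]:
        # Fail closed: a source that forgets to override ships nothing.
        raise NotImplementedError("sources must define an explicit allow-list")


class ExampleSource(ActivitySource):
    def serialize(self, record: Dict[str, Any]) -> Dict[str, Any]:
        # Build a NEW dict; any key that is not allow-listed is dropped.
        return {key: record[key] for key in SAFE_FIELDS if key in record}


def anonymise_home(path: str, home: str) -> str:
    """Replace a leading home directory with "~" (POSIX-style paths only)."""
    if path == home or path.startswith(home + "/"):
        return "~" + path[len(home):]
    return path
```

Copying allow-listed keys into a fresh dict, rather than deleting known-bad keys from the original, means an unexpected free-text field is dropped by default.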

Validation

pytest tests/unit/verticals/ai/ → 387 passing.
Verified end-to-end against a local GIM (ward-runs-app): ggshield ai discover --history → POST /v1/nhi/ai/activity → ClickHouse agent_activity. Stored content is metadata only (e.g. {"type":"assistant","model":"…","tools":["Bash"]}); a table-wide sweep confirmed prompts/commands/outputs are absent.

Related / depends on

SDK: feat: add send_agent_activity endpoint py-gitguardian#175 (send_agent_activity). Re-pin to the merged commit before landing.
Server: GIM ward-runs-app MR (endpoint + agent_activity ClickHouse table).

PR check list

As much as possible, the changes include tests (unit and/or functional)
Changelog entry — TODO before un-drafting.

Draft. Based on pbeslin/collect-ai-agent-history (#1257); registering the agent sources turns on real data shipping, so this is the "sources + metadata" layer on top of the framework.

🤖 Generated with Claude Code

codecov · 2026-06-02T08:55:29Z

Codecov Report

❌ Patch coverage is 95.20548% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.43%. Comparing base (1e6b8da) to head (567dd79).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ggshield/verticals/ai/agent_activity/readers.py | 82.05% | 7 Missing ⚠️ |
| ggshield/verticals/ai/agent_activity/sources.py | 95.83% | 3 Missing ⚠️ |
| ggshield/cmd/ai/discover.py | 90.47% | 2 Missing ⚠️ |
| ...shield/verticals/ai/agent_activity/orchestrator.py | 97.10% | 2 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1259      +/-   ##
==========================================
+ Coverage   93.38%   93.43%   +0.05%     
==========================================
  Files         189      194       +5     
  Lines       10782    11067     +285     
==========================================
+ Hits        10069    10341     +272     
- Misses        713      726      +13

| Flag | Coverage Δ | |
|---|---|---|
| unittests | `93.43% <95.20%> (+0.05%)` | ⬆️ |


xblanchot-gg

First batch of comments. I have not finished the review yet, as the PR is quite long to read. Overall it looks good, but I think there is a bug somewhere: when testing on my machine, only 10k of 63k events were actually sent to GitGuardian.

xblanchot-gg

Second batch: a few more comments, nothing really blocking on my side.

Ship full AI-agent session activity to GitGuardian, not just MCP tool calls. ggshield discovers each agent's on-disk transcripts / databases and sends the raw records verbatim; GitGuardian scans them and strips secrets server-side before storing them, so the client stays "dumb" and the data shape never depends on the ggshield version. Sources cover Claude Code, Codex, Cursor, Copilot CLI and VSCode (Copilot Chat). Records are batched (by count and by bytes) and posted to the agent-activity endpoint; the server deduplicates per record. Issue: NHI-1628 Co-Authored-By: Paul Beslin <paul.beslin-ext@gitguardian.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Embedded images and thinking-block signatures are large opaque base64 that carry no analytical value — on real transcripts they were ~a third of the payload and needlessly enlarge the server-side secret scan. Replace any pure-base64 string value over 4 KB with a short placeholder, keeping the rest of the record (and the blob's sibling metadata) verbatim. Textual fields like stdout have whitespace, so they are never matched. Issue: NHI-1628 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
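A minimal sketch of the blob replacement described above, assuming a recursive walk over the decoded JSON record. The function name, regular expression, and placeholder text are illustrative assumptions; only the 4 KB threshold and the "pure base64" criterion come from the commit message.

```python
import re
from typing import Any

BLOB_THRESHOLD = 4 * 1024  # "over 4 KB", per the commit message
# Pure base64: only the base64 alphabet plus optional padding, no whitespace.
_BASE64_RE = re.compile(r"[A-Za-z0-9+/]+={0,2}")
PLACEHOLDER = "<base64 blob removed>"  # hypothetical placeholder text


def strip_blobs(value: Any) -> Any:
    """Return a copy of value with large pure-base64 strings replaced."""
    if isinstance(value, str):
        if len(value) > BLOB_THRESHOLD and _BASE64_RE.fullmatch(value):
            return PLACEHOLDER
        return value
    if isinstance(value, dict):
        return {key: strip_blobs(item) for key, item in value.items()}
    if isinstance(value, list):
        return [strip_blobs(item) for item in value]
    return value  # numbers, booleans and None pass through unchanged
```

Because `fullmatch` must consume the whole string, any value containing a space or newline (such as captured stdout) is left untouched, and sibling keys like a media type are kept verbatim.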

Send the discovered UserInfo (machine_id + hostname/username/email) with every agent-activity batch so GitGuardian can attribute the records and correlate them with the machine inventory (the machine_id matches the machine scan). Reuses the user already gathered for discovery. Issue: NHI-1628 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The server reports how many records it could not scan and dropped (never stored). Thread that count through AgentActivityBatchResult and the report, and print it in the `ai discover --history` summary when non-zero, so drops are visible instead of silently making the counts not add up. Issue: NHI-1628 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Each batch is one HTTP request and one ClickHouse insert (one part). At the previous batch size of 500, a 63k-record collection meant ~126 requests/parts; server-side scanning is sub-ms per record, so a 5000-record batch still completes well under the request timeout. The 5 MB byte cap remains the real guard against oversized requests. Issue: NHI-1628 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Address review feedback on agent-activity collection:
- Wrap each agent's history walk in try/except so one agent failing does not abort collection for the agents that follow.
- Encapsulate the send buffer in a small _Batch value object, which drops the nonlocal closure and renames the misleading flush() (it performs an API call) to send_batch().
- Rename the two-letter `fp` to `handle` in iter_jsonl.
- Patch BATCH_SIZE / MAX_BATCH_BYTES to small values in the batching tests instead of allocating thousands of events, and drop the to_dict test that only exercised dataclasses.asdict.

Issue: NHI-1628 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
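The batching and per-agent isolation described in the commits above might look roughly like the sketch below. `iter_batches` and `collect_all` are hypothetical helpers, not the PR's `_Batch` or `send_batch()`; the constants reuse the 5000-record and 5 MB figures quoted in the commit messages, and measuring a record by its JSON encoding is an assumption.

```python
import json
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

BATCH_SIZE = 5000                  # records per request
MAX_BATCH_BYTES = 5 * 1024 * 1024  # 5 MB cap per request


def iter_batches(
    records: Iterable[dict],
    max_records: int = BATCH_SIZE,
    max_bytes: int = MAX_BATCH_BYTES,
) -> Iterator[List[dict]]:
    """Group records so that no batch exceeds either limit."""
    batch: List[dict] = []
    batch_bytes = 0
    for record in records:
        record_bytes = len(json.dumps(record).encode("utf-8"))
        full = len(batch) >= max_records or batch_bytes + record_bytes > max_bytes
        if batch and full:
            yield batch
            batch, batch_bytes = [], 0
        # An oversized single record still goes out, alone in its batch.
        batch.append(record)
        batch_bytes += record_bytes
    if batch:
        yield batch


def collect_all(
    sources: Dict[str, Callable[[], Iterable[dict]]],
) -> Tuple[List[dict], Dict[str, str]]:
    """Walk every agent; one failing agent does not abort the others."""
    records: List[dict] = []
    errors: Dict[str, str] = {}
    for name, walk in sources.items():
        try:
            records.extend(walk())
        except Exception as exc:
            errors[name] = str(exc)
    return records, errors
```

Passing the limits as parameters keeps tests cheap: a test can call `iter_batches(records, max_records=2)` instead of allocating thousands of events.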

Subagents spawned via the Task tool record their transcript in a separate file (projects/*/<session>/subagents/agent-*.jsonl) that is never inlined into the main session transcript, so all subagent activity was being missed. Add a dedicated source with its own `subagent_transcript` kind. Issue: NHI-1628 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Both sources hardcoded the Linux ~/.config base, so on macOS and Windows — where VSCode-family editors keep their User data under ~/Library/Application Support and %APPDATA% — agent-activity collection found nothing. Resolve the editor user-data dir per OS instead; VSCode records its source_path relative to that dir so the shipped path stays OS-independent. Issue: NHI-1628 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
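One way to resolve the editor user-data directory per OS is sketched below. The function names and the `app_name` argument (for example `"Code"`) are assumptions for illustration; the three base directories are the ones named in the commit message, with a home-relative fallback added in case `APPDATA` is unset.

```python
import os
import sys
from pathlib import Path
from typing import Optional


def editor_user_dir(
    app_name: str,
    platform: str = sys.platform,
    home: Optional[str] = None,
    appdata: Optional[str] = None,
) -> Path:
    """Return <per-OS base>/<app_name>/User for a VSCode-family editor."""
    home_dir = Path(home) if home is not None else Path.home()
    if platform == "darwin":
        base = home_dir / "Library" / "Application Support"
    elif platform.startswith("win"):
        appdata = appdata or os.environ.get("APPDATA")
        # Fallback when APPDATA is missing (an assumption, not from the PR).
        base = Path(appdata) if appdata else home_dir / "AppData" / "Roaming"
    else:
        base = home_dir / ".config"
    return base / app_name / "User"


def relative_source_path(path: Path, user_dir: Path) -> str:
    """Express path relative to the user-data dir, with forward slashes."""
    return path.relative_to(user_dir).as_posix()
```

Reporting the path relative to the resolved directory is what keeps the shipped `source_path` identical across operating systems.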

Each subagents/agent-*.jsonl transcript has a sibling agent-*.meta.json holding the subagent's agentType, description and the parent's toolUseId — the only place that ties a subagent run back to the Task call that spawned it. Ship it verbatim under a new `subagent_meta` kind. Issue: NHI-1628 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
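The transcript and metadata layout described in the two subagent commits can be walked with a short sketch like the one below. The function name and the `root` parameter are hypothetical; the glob pattern follows the path quoted in the commit message.

```python
from pathlib import Path
from typing import Iterator, Optional, Tuple


def iter_subagent_files(root: Path) -> Iterator[Tuple[Path, Optional[Path]]]:
    """Yield (transcript, sibling meta file or None) for each subagent run.

    root is assumed to be the directory that contains "projects/".
    """
    pattern = "projects/*/*/subagents/agent-*.jsonl"
    for transcript in sorted(root.glob(pattern)):
        # agent-abc.jsonl -> agent-abc.meta.json in the same directory
        meta = transcript.with_name(transcript.stem + ".meta.json")
        yield transcript, (meta if meta.is_file() else None)
```

The `.jsonl` suffix in the pattern keeps `agent-*.meta.json` files out of the transcript stream, so each kind can be shipped under its own label.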

Cursor and VSCode keep their state.vscdb open in WAL mode, so a plain mode=ro connection fails to open it ("unable to open database file") — Cursor activity collection returned nothing whenever Cursor was running (i.e. almost always). Open with immutable=1 to read the file as a static snapshot regardless of the writer's lock; we never write back. Issue: NHI-1628 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
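A minimal sketch of opening a database as a static read-only snapshot with Python's standard `sqlite3` module, using SQLite's URI filename parameters. The helper name is hypothetical, and combining `mode=ro` with `immutable=1` is a choice made here rather than something the commit states.

```python
import sqlite3
from pathlib import Path


def open_snapshot(db_path: Path) -> sqlite3.Connection:
    """Open db_path read-only, treating the file as immutable."""
    # as_uri() yields a percent-encoded file:///... URI; immutable=1 tells
    # SQLite to skip locking and change detection for this file.
    uri = f"{db_path.resolve().as_uri()}?mode=ro&immutable=1"
    return sqlite3.connect(uri, uri=True)
```

Since SQLite assumes an immutable file never changes, a snapshot read this way may not reflect writes the editor is making at the same moment; that trade-off is worth verifying against the SQLite documentation before relying on it.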

amascia-gg force-pushed the amascia/agent-activity-metadata branch from 47fe3b3 to a83e1b2 Compare June 2, 2026 08:54

amascia-gg force-pushed the amascia/agent-activity-metadata branch 5 times, most recently from e7f8375 to 6013a56 Compare June 2, 2026 10:58

amascia-gg changed the title ~~feat(ai): collect agent-activity metadata in ai discover --history~~ feat(ai): collect agent activity (raw lines + client-side redaction) Jun 2, 2026

amascia-gg mentioned this pull request Jun 2, 2026

feat: add send_agent_activity endpoint GitGuardian/py-gitguardian#175

Open

amascia-gg force-pushed the amascia/agent-activity-metadata branch from 6013a56 to 21c6f29 Compare June 2, 2026 11:19

amascia-gg mentioned this pull request Jun 2, 2026

feat(ai): skip already-shipped agent activity via a local cursor #1261

Draft

amascia-gg self-assigned this Jun 2, 2026

amascia-gg force-pushed the amascia/agent-activity-metadata branch from 44898bb to a29f94b Compare June 3, 2026 08:57

amascia-gg changed the base branch from pbeslin/collect-ai-agent-history to main June 3, 2026 08:57

amascia-gg force-pushed the amascia/agent-activity-metadata branch from 1170e1f to 52f4a66 Compare June 3, 2026 20:22

xblanchot-gg reviewed Jun 8, 2026


Comment thread tests/unit/verticals/ai/agent_activity/test_models.py Outdated

xblanchot-gg reviewed Jun 8, 2026


Comment thread tests/unit/verticals/ai/agent_activity/test_orchestrator.py Outdated

xblanchot-gg reviewed Jun 8, 2026


paulpetit-gg-ext force-pushed the amascia/agent-activity-metadata branch from d39e520 to 121240c Compare June 9, 2026 08:30

amascia-gg and others added 2 commits June 11, 2026 17:30

paulpetit-gg-ext force-pushed the amascia/agent-activity-metadata branch from 567dd79 to 28f4470 Compare June 11, 2026 15:31

amascia-gg and others added 7 commits June 11, 2026 17:32

amascia-gg and others added 5 commits June 11, 2026 17:32

fix(ai): send session metadata for Cursor

b625e42

fix(ai): have record_offsets work if sorted lexicographically

830da27

fix(ai): enforce replay order by lexicographically sorting source_kinds

05d3fd2

feat(ai): send VSCode transcripts

e6d19e6

paulpetit-gg-ext force-pushed the amascia/agent-activity-metadata branch from 28f4470 to e6d19e6 Compare June 11, 2026 15:32


feat(ai): collect agent activity (raw lines + client-side redaction) #1259
amascia-gg wants to merge 14 commits into main from amascia/agent-activity-metadata

4 participants
