feat(ai): collect agent activity (raw lines + client-side redaction) #1259
Draft
amascia-gg wants to merge 14 commits into
47fe3b3 to
a83e1b2
Compare
Codecov Report
❌ Patch coverage is …
@@            Coverage Diff             @@
##             main    #1259      +/-   ##
==========================================
+ Coverage   93.38%   93.43%   +0.05%
==========================================
  Files         189      194       +5
  Lines       10782    11067     +285
==========================================
+ Hits        10069    10341     +272
- Misses        713      726      +13
e7f8375 to
6013a56
Compare
6013a56 to
21c6f29
Compare
44898bb to
a29f94b
Compare
1170e1f to
52f4a66
Compare
xblanchot-gg
reviewed
Jun 8, 2026
xblanchot-gg
left a comment
Member
First batch of comments; I haven't finished the review yet because the PR is quite long. Overall it looks good, but I think there is a bug somewhere: when testing on my machine, only 10k of 63k events were actually sent to GitGuardian.
xblanchot-gg
reviewed
Jun 8, 2026
xblanchot-gg
left a comment
Member
Second batch: a few other comments, nothing really blocking on my side.
d39e520 to
121240c
Compare
Ship full AI-agent session activity to GitGuardian, not just MCP tool calls. ggshield discovers each agent's on-disk transcripts / databases and sends the raw records verbatim; GitGuardian scans them and strips secrets server-side before storing them, so the client stays "dumb" and the data shape never depends on the ggshield version. Sources cover Claude Code, Codex, Cursor, Copilot CLI and VSCode (Copilot Chat). Records are batched (by count and by bytes) and posted to the agent-activity endpoint; the server deduplicates per record.
Issue: NHI-1628
Co-Authored-By: Paul Beslin <paul.beslin-ext@gitguardian.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Embedded images and thinking-block signatures are large opaque base64 strings that carry no analytical value; on real transcripts they made up about a third of the payload and needlessly enlarged the server-side secret scan. Replace any pure-base64 string value over 4 KB with a short placeholder, keeping the rest of the record (and the blob's sibling metadata) verbatim. Textual fields like stdout contain whitespace, so they are never matched.
Issue: NHI-1628
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
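The blob-stripping rule described in this commit can be sketched as follows. This is only an illustration, not the PR's code: the function name `strip_blobs`, the placeholder text, and reading "over 4 KB" as "more than 4096 characters" are all assumptions, and the regex only covers the standard base64 alphabet.

```python
import re
from typing import Any

# Assumption: "over 4 KB" is read as "more than 4096 characters".
BLOB_THRESHOLD = 4096
# Standard base64 alphabet plus optional padding. Any whitespace or other
# character makes fullmatch() fail, so textual fields are left alone.
_BASE64_RE = re.compile(r"[A-Za-z0-9+/]+={0,2}")


def strip_blobs(value: Any) -> Any:
    """Return a copy of `value` with large pure-base64 strings replaced."""
    if isinstance(value, str):
        if len(value) > BLOB_THRESHOLD and _BASE64_RE.fullmatch(value):
            # Hypothetical placeholder format; the real one is not shown in the PR text.
            return f"<base64 blob removed: {len(value)} chars>"
        return value
    if isinstance(value, dict):
        # Sibling keys (media type, ids, ...) are kept verbatim.
        return {key: strip_blobs(item) for key, item in value.items()}
    if isinstance(value, list):
        return [strip_blobs(item) for item in value]
    return value
```

Walking the record recursively keeps every non-blob field untouched, which matches the stated goal of shipping the rest of the record verbatim.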
567dd79 to
28f4470
Compare
Send the discovered UserInfo (machine_id + hostname/username/email) with every agent-activity batch so GitGuardian can attribute the records and correlate them with the machine inventory (the machine_id matches the machine scan). Reuses the user already gathered for discovery.
Issue: NHI-1628
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The server reports how many records it could not scan and dropped (never stored). Thread that count through AgentActivityBatchResult and the report, and print it in the `ai discover --history` summary when non-zero, so drops are visible instead of silently making the counts not add up.
Issue: NHI-1628
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Each batch is one HTTP request and one ClickHouse insert (one part). A batch size of 500 meant ~126 requests/parts for a 63k-record collection; server-side scanning is sub-ms per record, so a 5000-record batch still completes well under the request timeout. The 5 MB byte cap stays the real guard against oversized requests.
Issue: NHI-1628
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
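A minimal sketch of batching by both record count and byte size, under stated assumptions: the helper `iter_batches` is hypothetical, the constant names are borrowed from the review-feedback commit in this PR, and the byte cap is assumed to be measured on the JSON-encoded record.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List

BATCH_SIZE = 5000                  # record cap per request, per the commit above
MAX_BATCH_BYTES = 5 * 1024 * 1024  # 5 MB cap; assumed to apply to JSON bytes


def iter_batches(
    records: Iterable[Dict[str, Any]],
    batch_size: int = BATCH_SIZE,
    max_bytes: int = MAX_BATCH_BYTES,
) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of records that respect both the count and the byte cap."""
    batch: List[Dict[str, Any]] = []
    batch_bytes = 0
    for record in records:
        size = len(json.dumps(record).encode("utf-8"))
        # Close the current batch before adding a record that would break a cap.
        if batch and (len(batch) >= batch_size or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        # Note: a single record larger than max_bytes is still emitted alone;
        # a per-record size cap would have to be enforced elsewhere.
        batch.append(record)
        batch_bytes += size
    if batch:
        yield batch
```

For the 63k-record collection mentioned above, a count cap of 500 gives 126 batches and a cap of 5000 gives 13, provided the byte cap is not hit first.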
Address review feedback on agent-activity collection:
- Wrap each agent's history walk in try/except so one agent failing does not abort collection for the agents that follow.
- Encapsulate the send buffer in a small _Batch value object, which drops the nonlocal closure and renames the misleading flush() (it performs an API call) to send_batch().
- Rename the two-letter `fp` to `handle` in iter_jsonl.
- Patch BATCH_SIZE / MAX_BATCH_BYTES to small values in the batching tests instead of allocating thousands of events, and drop the to_dict test that only exercised dataclasses.asdict.
Issue: NHI-1628
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
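The first item, isolating each agent's history walk, might look roughly like this. The function `collect_all` and the shape of `walkers` (agent name mapped to a zero-argument callable) are assumptions made for the sketch, not names from the PR.

```python
import logging
from typing import Callable, Dict, Iterable, List

logger = logging.getLogger(__name__)


def collect_all(walkers: Dict[str, Callable[[], Iterable[dict]]]) -> List[dict]:
    """Run every agent's walker; a failing agent is logged and skipped."""
    collected: List[dict] = []
    for agent_name, walk in walkers.items():
        try:
            # list() forces lazy generators inside the try block, so errors
            # raised midway through iteration are caught here as well.
            # Records already yielded by a failing agent are discarded.
            records = list(walk())
        except Exception:
            logger.exception("History collection failed for %s; continuing", agent_name)
            continue
        collected.extend(records)
    return collected
```

Catching broadly is a deliberate choice in this sketch: the goal stated in the commit is that one agent's failure never aborts the agents that follow.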
Subagents spawned via the Task tool record their transcript in a separate file (projects/*/<session>/subagents/agent-*.jsonl) that is never inlined into the main session transcript, so all subagent activity was being missed. Add a dedicated source with its own `subagent_transcript` kind.
Issue: NHI-1628
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Both sources hardcoded the Linux ~/.config base, so on macOS and Windows (where VSCode-family editors keep their User data under ~/Library/Application Support and %APPDATA%) agent-activity collection found nothing. Resolve the editor user-data dir per OS instead; VSCode records its source_path relative to that dir so the shipped path stays OS-independent.
Issue: NHI-1628
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
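Per-OS resolution of the editor user-data directory could be sketched as below. The helper names, the `User` subfolder layout, the `XDG_CONFIG_HOME` fallback on Linux, and the injectable `platform`/`env`/`home` parameters (added so the logic can be tested on any machine) are assumptions, not details taken from the PR.

```python
import os
import sys
from pathlib import Path
from typing import Mapping, Optional


def editor_user_data_dir(
    editor_dir_name: str,  # e.g. "Code" or "Cursor" (assumed folder names)
    platform: str = sys.platform,
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> Path:
    """Return the editor's `User` data directory for the given platform."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home
    if platform == "darwin":
        base = home / "Library" / "Application Support"
    elif platform == "win32":
        base = Path(env.get("APPDATA", home / "AppData" / "Roaming"))
    else:
        # Linux and other Unix-likes; honoring XDG_CONFIG_HOME is an assumption.
        base = Path(env.get("XDG_CONFIG_HOME", home / ".config"))
    return base / editor_dir_name / "User"


def relative_source_path(path: Path, user_dir: Path) -> str:
    """Path relative to the user-data dir, with forward slashes on every OS."""
    return path.relative_to(user_dir).as_posix()
```

Shipping the path relative to the user-data dir means the same logical file produces the same `source_path` whether it was found under `~/.config`, `~/Library/Application Support` or `%APPDATA%`.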
Each subagents/agent-*.jsonl transcript has a sibling agent-*.meta.json holding the subagent's agentType, description and the parent's toolUseId, which is the only place that ties a subagent run back to the Task call that spawned it. Ship it verbatim under a new `subagent_meta` kind.
Issue: NHI-1628
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
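Discovering subagent transcripts together with their sibling meta files might be done along these lines. The glob mirrors the path pattern quoted in the commit messages; the function name `iter_subagent_files` and the `claude_dir` root argument are hypothetical.

```python
from pathlib import Path
from typing import Iterator, Optional, Tuple


def iter_subagent_files(claude_dir: Path) -> Iterator[Tuple[Path, Optional[Path]]]:
    """Yield (transcript, meta) pairs; meta is None when no sibling file exists."""
    pattern = "projects/*/*/subagents/agent-*.jsonl"
    for transcript in sorted(claude_dir.glob(pattern)):
        # agent-<id>.jsonl -> agent-<id>.meta.json in the same directory
        stem = transcript.name[: -len(".jsonl")]
        meta = transcript.with_name(stem + ".meta.json")
        yield transcript, (meta if meta.is_file() else None)
```

Pairing the two files at discovery time keeps the link between a transcript and its `toolUseId` metadata explicit, even though each would be shipped under its own kind.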
Cursor and VSCode keep their state.vscdb open in WAL mode, so a plain mode=ro
connection fails to open it ("unable to open database file") — Cursor activity
collection returned nothing whenever Cursor was running (i.e. almost always).
Open with immutable=1 to read the file as a static snapshot regardless of the
writer's lock; we never write back.
Issue: NHI-1628
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
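Opening the database as an immutable snapshot can be sketched with the standard `sqlite3` module and a SQLite URI. The function name is hypothetical, and the `ItemTable (key, value)` layout is an assumption about `state.vscdb` rather than something stated in this PR.

```python
import sqlite3
from pathlib import Path
from typing import Any, List, Tuple


def read_item_table(db_path: Path) -> List[Tuple[str, Any]]:
    """Read all rows from a state database without taking any lock."""
    # immutable=1 tells SQLite the file will not change, so it skips locking
    # and change detection. uri=True is required for the query string to be parsed.
    uri = f"{db_path.resolve().as_uri()}?immutable=1"
    conn = sqlite3.connect(uri, uri=True)
    try:
        # Assumed table name and columns; adjust to the real schema.
        return conn.execute("SELECT key, value FROM ItemTable").fetchall()
    finally:
        conn.close()
```

One trade-off worth keeping in mind: because the file is treated as static, such a read may not reflect writes that the running editor has not yet checkpointed into the main database file.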
28f4470 to
e6d19e6
Compare
Context
Builds on the MCP-history framework (#1244 / #1257). Today `ggshield ai discover --history` ships only MCP `PreToolUse` events (~5% of agent activity). This adds collection of per-event usage metadata for the whole session, to power agentic-AI usage dashboards.
Design: AI Agent Activity Pipeline.
What has been done
- … `client.send_agent_activity(...)`.
- `serialize()` builds a NEW dict of allow-listed safe fields (event_type, tool name, model, timestamps, …) and drops every free-text field (prompts, command strings, tool inputs/outputs, file contents). The base `serialize()` raises by default, so a source can never accidentally ship raw content: no secret or PII ever leaves the machine (no server-side redaction needed).
- … `/Users/x` → `~`); per-record size cap + byte-batching.
- … `raw_history` → `agent_activity` (it's metadata, not raw).
- `GGClient` typing + `Detail` error handling in the orchestrator; home-path leak fix in `source_path_for`.
- … `pygitguardian` to the commit adding `send_agent_activity()` (feat: add send_agent_activity endpoint py-gitguardian#175).
Validation
- `pytest tests/unit/verticals/ai/` → 387 passing.
- … `ward-runs-app`): `ggshield ai discover --history` → `POST /v1/nhi/ai/activity` → ClickHouse `agent_activity`. Stored content is metadata only (e.g. `{"type":"assistant","model":"…","tools":["Bash"]}`); a table-wide sweep confirmed prompts/commands/outputs are absent.
Related / depends on
- … `send_agent_activity`). Re-pin to the merged commit before landing.
- … `ward-runs-app` MR (endpoint + `agent_activity` ClickHouse table).
PR check list
🤖 Generated with Claude Code
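The allow-list approach described under "What has been done" (a base `serialize()` that raises, per-source serializers that build a new dict from safe fields, and home-path redaction) could be sketched as follows. Every class name, function name, and field name here is hypothetical and chosen only for illustration.

```python
from typing import Any, Dict

# Hypothetical allow-list; the PR text names event_type, tool name, model
# and timestamps as examples of safe fields.
SAFE_FIELDS = ("event_type", "tool_name", "model", "timestamp")


class ActivitySource:
    """Base class: refuses to serialize unless a subclass opts in explicitly."""

    def serialize(self, record: Dict[str, Any]) -> Dict[str, Any]:
        raise NotImplementedError("each source must define its own allow-list")


class ExampleSource(ActivitySource):
    def serialize(self, record: Dict[str, Any]) -> Dict[str, Any]:
        # Build a NEW dict from allow-listed keys only; free-text fields such
        # as prompts or tool inputs are never copied.
        return {key: record[key] for key in SAFE_FIELDS if key in record}


def redact_home(path: str, home: str) -> str:
    """Replace a leading home directory with `~` (e.g. /Users/x -> ~)."""
    home = home.rstrip("/")
    if path == home or path.startswith(home + "/"):
        return "~" + path[len(home):]
    return path
```

Copying from an allow-list, rather than deleting known-sensitive keys, means a field added by a future agent version is dropped by default instead of leaked by default.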