feat(secrets): vendor rules, scan UX, candidate visibility#548
Open
wesm wants to merge 20 commits into
Open
Conversation
Approved-via-brainstorming design for a single PR adding six vendor rules (OpenAI, GitLab, npm, PyPI, HuggingFace, SendGrid), fixing the base64 padding capture in high-entropy-assignment, plumbing a definite/candidate breakdown through the scan summary, rewriting scan help text, and adding a candidate-visibility hint after non-JSON scan output. One rulesAlgorithmVersion bump (5 -> 6). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Definite vs candidate rule wording: motivation paragraph now correctly attributes JWT, basic-auth-url, and high-entropy-assignment to the candidate tier. - High-entropy padding fix: highEntropyValue must TrimRight '=' before measuring length and Shannon entropy, otherwise appended padding can push a borderline body below the 3.5-bit floor. Validator-padding regression test added. v6 doc comment updated. - PyPI boundary test: the 85-char floor applies to the base64 body after pypi-, with the 16-char macaroon-header anchor counted in. Correct boundary cases are 68-char tail (must NOT match) and 69-char tail (MUST match). - openai-key mask: replace fixed 8-char prefix preservation with a maskOpenAIKey helper that picks the matched prefix length, so sk-svcacct- (11) and sk-admin- (9) are not truncated mid-word. - openai-key citation: phrase the prefix patterns as scanner knowledge derived from observed token format, not a 2024 announcement URL we cannot cite cleanly. Project/service-account/admin concepts are cited to OpenAI's API platform docs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Twelve TDD-driven tasks covering the approved spec: per-vendor-rule commits (openai-key, gitlab-pat, npm-token, pypi-token, huggingface- token, sendgrid-key), high-entropy padding capture + validator trim, SecretScanSummary breakdown plumbing, CLI help/hint UX, rulesAlgorithmVersion bump, verification pass, and scratch-file cleanup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a definite-tier rule for OpenAI project, service-account, and admin key prefixes. Includes maskOpenAIKey helper that preserves the matched prefix length (sk-svcacct- and sk-admin- differ from sk-proj-), and notOpenAIKeyPlaceholder validator using the existing bodyLooksRandom + notTrailingRunRepeat structural checks. Legacy sk- + 48-char keys are deliberately not matched. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces the misleading maskKeepEnds(s, 3, 4) fallback in maskOpenAIKey with a panic — the regex guarantees one of the three known prefixes, so silently producing wrong output if a future rule change desyncs the two is worse than crashing. Rewrites the prefix-ordering comment to be honest about the framing: today no prefix shadows another, so the longest-first order is defensive against a hypothetical future addition, not protection against a current risk.
Definite-tier rule for GitLab personal/project access tokens. Uses the Google-key terminator pattern (capture-then-non-body-byte) because the token alphabet includes '-'. Validator reuses bodyLooksRandom on the trimmed body. Source: GitLab token prefix docs.
A 20-char body with 16 random-looking chars followed by 'AAAA' passes bodyLooksRandom (entropy ~3.92, AAAA covers only 20% of the body), so without notTrailingRunRepeat the validator reports glpat-...AAAA as a definite leak. Mirrors the anthropic-key and openai-key pattern.
Definite-tier rule for npm access tokens in the 2021+ format: npm_ + 36 base62 chars (30 payload + 6 CRC). Validator uses bodyHasNoPlaceholderShape on the fixed-length body (entropy gate is over-restrictive at this length, like AWS access keys). Legacy classic-format tokens (bare hex) are deliberately not matched.
Same false-positive shape that prompted the gitlab-pat fix (commit 158384f): a 36-char body with 32 random-looking chars followed by 'AAAA' passes bodyHasNoPlaceholderShape (5/36 = 14% single-byte coverage, below the 75% threshold), so without notTrailingRunRepeat the validator reports npm_...AAAA as a definite leak.
Definite-tier rule for PyPI API tokens. Anchors on the literal base64-encoded macaroon header bytes (pypi-AgEIcHlwaS5vcmcC) that every PyPI token starts with — a stronger anchor than just pypi-. The 85-char body minimum from PyPI's docs translates to a 69-char variable tail after the 16-char anchor. Validator includes notTrailingRunRepeat(s, 4) up front to catch the random-body-with-trailing-AAAA placeholder shape that bodyLooksRandom alone doesn't reject (same gap addressed in Tasks 2 and 3 for gitlab-pat and npm-token). Source: docs.pypi.org/api/secrets. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Definite-tier rule for Hugging Face access tokens. Base62 body, minimum 30 chars. Validator uses notTrailingRunRepeat + bodyLooksRandom on the trimmed body, matching the pattern established for the other variable- body vendor rules in this PR. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous body was 29 chars, below the regex's 30-char minimum, so the test passed because the regex rejected it before the validator was called. Padding by one char makes the body 30 chars, so the regex matches and notTrailingRunRepeat is the active gate.
Definite-tier rule for SendGrid API keys. Two fixed-length base64url segments separated by a literal '.'. Validator splits on the internal dot, applies bodyHasNoPlaceholderShape to the 22-char identifier (short, like AWS) and notTrailingRunRepeat + bodyLooksRandom to the 43-char secret (matching the trailing-run guard pattern in the other variable- body vendor rules in this PR).
Widens the high-entropy-assignment capture group to include trailing '=' padding so the reported span and redacted display match the literal value. Updates highEntropyValue to TrimRight the padding before measuring length and Shannon entropy: '=' is not in the body charset, so trimming guarantees padding can't affect the gate. Adds tests for the span behavior and validator trim. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds DefiniteFindings and CandidateFindings to sync.SecretScanSummary, mirrors them on service.SecretScanSummary, passes them through both direct and HTTP backends, and surfaces them in the CLI human-output line. The existing WithSecrets field keeps its meaning (sessions with ≥1 definite finding); the human label changes from 'with secrets' to 'with definite leaks' since the new line co-displays candidate counts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rewrites 'secrets scan' Short and adds a Long that explains the inline-vs-full two-tier model and the --confidence-all step needed to see candidate findings. Retightens the --backfill help. Adds a smart hint printed after the scan summary when candidate findings exist and output is human-formatted; JSON callers get candidate_findings in the payload and decide their own UX. Includes hint-shown, hint-suppressed, and JSON-no-hint regression tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Marks the ruleset change in this PR (six new vendor rules, base64 padding capture widening, highEntropyValue trim-padding). Sessions previously scanned at v5 are flagged stale and will be re-scanned by the next 'secrets scan' or 'secrets scan --backfill', which is the designed behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds SHA-256 hashes for the 14 new test fixture values introduced by the six new vendor rules (openai-key, gitlab-pat, npm-token, pypi-token, huggingface-token, sendgrid-key). These synthetic tokens from rules_test.go were appearing as 447 definite findings across development conversations; the deny-list reduces that to 39 (all pre-existing rules, all expected). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Per the repo convention, design and plan files under docs/superpowers/ are internal scratch and are removed before the feature branch is pushed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces commit-era phrasing with a description of the rule category that stays meaningful after the PR merges. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
roborev: Combined Review (
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds six well-anchored definite-tier rules to the secret scanner with placeholder validators following the existing rigor:
openai-keymatchessk-proj-,sk-svcacct-, andsk-admin-prefixes. AmaskOpenAIKeyhelper preserves the matched prefix length when redacting.gitlab-patmatchesglpat-personal/project access tokens.npm-tokenmatches the post-2021npm_+ 36 base62 format (legacy classic-hex tokens are deliberately not matched).pypi-tokenanchors on the literalpypi-AgEIcHlwaS5vcmcCmacaroon-header bytes and enforces PyPI's documented 85-char body minimum (69-char variable tail).huggingface-tokenmatcheshf_+ base62 body.sendgrid-keymatches the two-segmentSG.<22>.<43>format, validating each segment.Every new validator chains
notTrailingRunRepeat(_, 4)beforebodyLooksRandom/bodyHasNoPlaceholderShapeso random-looking bodies with trailingAAAAplaceholder shapes don't slip through.Fixes the
high-entropy-assignmentrule so trailing base64=padding is included in the captured span (was clipped before), and updateshighEntropyValuetoTrimRightthe padding before measuring length and entropy.Splits
SecretScanSummaryintoDefiniteFindingsandCandidateFindingscounts acrosssync,service, and CLI layers. The CLI human-output line now readswith definite leaks; N findings (X definite, Y candidate). JSON consumers get the new fields automatically.Rewrites
secrets scanhelp text to surface the inline-sync (definite-only) vs full-scan (definite + candidate) two-tier model, and adds a hint after the human-format summary pointing users atsecrets list --confidence allwhen candidate findings exist. JSON callers don't see the hint and can render their own UX.Bumps
rulesAlgorithmVersionfrom 5 to 6. Adds fixture-deny-list entries for the synthetic test bodies the new rules correctly identify.🤖 Generated with Claude Code