fix(worker,service): UTF-8-safe content truncation #163
Open
marco-spagn wants to merge 1 commit into
✅ Test coverage:
| Package | Stmts | Covered | Coverage |
|---|---|---|---|
| config | 64 | 57 | ✅ 89.1% |
| crypto | 99 | 87 | ✅ 87.9% |
| database | 16 | 16 | ✅ 100.0% |
| embedding | 125 | 118 | ✅ 94.4% |
| event | 277 | 209 | ✅ 75.5% |
| eventschema | 23 | 21 | ✅ 91.3% |
| graph | 266 | 149 | ✅ 56.0% |
| grpc | 821 | 625 | ✅ 76.1% |
| handler | 983 | 820 | ✅ 83.4% |
| metrics | 29 | 29 | ✅ 100.0% |
| middleware | 180 | 154 | ✅ 85.6% |
| model | 95 | 78 | ✅ 82.1% |
| ratelimit | 38 | 28 | ✅ 73.7% |
| repository | 953 | 782 | ✅ 82.1% |
| service | 349 | 274 | ✅ 78.5% |
| telemetry | 26 | 24 | ✅ 92.3% |
| webhook | 174 | 145 | ✅ 83.3% |
| worker | 700 | 516 | ✅ 73.7% |
| Global | 5218 | 4132 | ✅ 79.2% |
Generated by scripts/ci_coverage_check.sh — bump floors in the same PR that lifts coverage.
ConsolidationWorker.runConsolidation clamped the joined memory content with a
byte slice (content[:16000]); SummarizeService.extractiveSummary did the same
(combined[:maxLen]). A byte cut can land inside a multi-byte UTF-8 rune and
yield an invalid byte sequence.
The consolidation result is persisted via repo.Store, so truncating non-ASCII
content in the middle of a rune produces bytes PostgreSQL rejects on INSERT
("invalid byte sequence for encoding UTF8"). The consolidation run then fails
permanently for that prefix; any non-ASCII content (accented text, CJK, emoji)
that straddles the limit can trip it. The summarize path is less severe (JSON
marshaling sanitizes invalid bytes to U+FFFD) but still emitted a garbled
trailing character.
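The failure mode is easy to reproduce in isolation. The sketch below is only an illustration: the function names are hypothetical and not part of this PR, and it demonstrates the Go side only (validity after a byte cut, and how encoding/json escapes the broken byte), not the PostgreSQL rejection.

```go
package main

import (
	"encoding/json"
	"fmt"
	"unicode/utf8"
)

// validAfterByteCut reports whether s[:n] is still valid UTF-8.
// Hypothetical helper, used here only to demonstrate the hazard.
func validAfterByteCut(s string, n int) bool {
	return utf8.ValidString(s[:n])
}

// marshalByteCut JSON-encodes s[:n] and returns the encoded text.
// Hypothetical helper, used here only to demonstrate the hazard.
func marshalByteCut(s string, n int) string {
	out, err := json.Marshal(s[:n])
	if err != nil {
		panic(err)
	}
	return string(out)
}

func main() {
	s := "ab€" // "€" is three bytes in UTF-8, so s is five bytes long

	fmt.Println(validAfterByteCut(s, 2)) // true: the cut lands on a rune boundary
	fmt.Println(validAfterByteCut(s, 3)) // false: the cut lands inside "€"

	// json.Marshal does not fail on the broken string; it escapes the
	// invalid byte as U+FFFD.
	fmt.Println(marshalByteCut(s, 3)) // "ab\ufffd"
}
```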
Add a small truncateUTF8 helper in each package that backs up to a rune
boundary, dropping the partial trailing rune so the result is always valid
UTF-8 and within the byte budget. Both call sites use it.
Tests: table-driven helper tests (incl. cutting a multibyte string at every
byte offset → always valid UTF-8) and a summarize case proving a fully-CJK
input truncates to valid UTF-8 with the ellipsis suffix.
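A minimal sketch of what such a helper and its every-offset check could look like follows. It is not the code from this PR: the body of truncateUTF8 is one plausible implementation, everyOffsetValid and summarizeClamp are hypothetical stand-ins for the tests and the summarize call site, and the validity guarantee assumes the input was valid UTF-8 to begin with.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateUTF8 returns s cut to at most maxBytes bytes without splitting a
// rune. Sketch only; the PR's actual helper may differ in detail.
func truncateUTF8(s string, maxBytes int) string {
	if maxBytes <= 0 {
		return ""
	}
	if len(s) <= maxBytes {
		return s
	}
	// Back up until the byte at the cut point starts a rune, so the
	// partial trailing rune is dropped entirely.
	cut := maxBytes
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}

// everyOffsetValid truncates s at every byte budget (including negative and
// oversized ones) and checks the result is valid UTF-8 and within budget.
// Hypothetical check, mirroring the property the commit message describes.
func everyOffsetValid(s string) bool {
	for n := -1; n <= len(s)+1; n++ {
		t := truncateUTF8(s, n)
		if !utf8.ValidString(t) {
			return false
		}
		if n <= 0 && t != "" {
			return false
		}
		if n > 0 && len(t) > n {
			return false
		}
	}
	return true
}

// summarizeClamp is a hypothetical stand-in for the summarize call site:
// clamp to maxLen bytes and append an ellipsis when something was cut.
func summarizeClamp(combined string, maxLen int) string {
	if len(combined) <= maxLen {
		return combined
	}
	return truncateUTF8(combined, maxLen) + "…"
}

func main() {
	fmt.Printf("%q\n", truncateUTF8("aé€", 2)) // "a": the cut would split "é"
	fmt.Printf("%q\n", truncateUTF8("aé€", 5)) // "aé": the cut would split "€"
	fmt.Println(everyOffsetValid("héllo €uro 日本語 🙂"))
	fmt.Println(summarizeClamp("日本語のテキスト", 10)) // 日本語…
}
```

Backing up with utf8.RuneStart touches at most three bytes per call, so the clamp stays cheap even on a 16000-byte budget.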
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Problem
Two call sites clamp text with a byte slice that can split a multi-byte UTF-8 rune:

- `ConsolidationWorker.runConsolidation`: `content = content[:16000]`
- `SummarizeService.extractiveSummary`: `combined[:maxLen] + "…"`

A byte cut landing mid-rune produces an invalid UTF-8 byte sequence.
The consolidation result is persisted via `repo.Store`, so for any non-ASCII content (accented text, CJK, emoji…) whose 16000th byte falls inside a rune, PostgreSQL rejects the `INSERT` with `invalid byte sequence for encoding UTF8`. The consolidation run is then marked `failed` and never succeeds for that prefix. The summarize path is less severe (`json.Marshal` sanitizes invalid UTF-8 to `U+FFFD`) but still emits a garbled trailing character.
Fix
Add a small `truncateUTF8(s, maxBytes)` helper (one per affected package; no shared util grab-bag exists) that backs up to a rune boundary, dropping the partial trailing rune. The result is always valid UTF-8 and within the byte budget. Both call sites use it.
Tests
- Table-driven helper tests covering 2-byte (é) and 3-byte (€) runes cut on and off boundaries, plus zero/negative budgets.
- `extractiveSummary` with a fully-CJK input truncates to valid UTF-8 with the ellipsis suffix; existing summarize tests still pass.
- `go build ./...`, `go vet`, and the full `internal/worker` + `internal/service` suites pass locally.

🤖 Generated with Claude Code