fix(worker,service): UTF-8-safe content truncation #163

Open
marco-spagn wants to merge 1 commit into main from fix/utf8-safe-truncation
@marco-spagn

Owner

Problem

Two call sites clamp text with a byte slice that can split a multi-byte UTF-8 rune:

  • ConsolidationWorker.runConsolidation: content = content[:16000]
  • SummarizeService.extractiveSummary: combined[:maxLen] + "…"

A byte cut landing mid-rune produces an invalid UTF-8 byte sequence.

The consolidation result is persisted via repo.Store, so for any non-ASCII content (accented text, CJK, emoji…) where the 16000-byte cut falls inside a rune, PostgreSQL rejects the INSERT with "invalid byte sequence for encoding UTF8". The consolidation run is then marked failed and never succeeds for that prefix. The summarize path is less severe, since json.Marshal sanitizes invalid UTF-8 to U+FFFD, but it still emits a garbled trailing character.

Fix

Add a small truncateUTF8(s, maxBytes) helper (one per affected package — no shared util grab-bag exists) that backs up to a rune boundary, dropping the partial trailing rune. The result is always valid UTF-8 and within the byte budget. Both call sites use it.

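// truncateUTF8 returns a prefix of s of at most maxBytes bytes, never splitting a rune.
func truncateUTF8(s string, maxBytes int) string {
	if maxBytes <= 0 {
		return ""
	}
	if len(s) <= maxBytes {
		return s
	}
	b := maxBytes
	for b > 0 && !utf8.RuneStart(s[b]) { // back up while s[b] is a continuation byte
		b--
	}
	return s[:b]
}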

Tests

  • Table-driven helper tests covering ASCII, 2-byte (é) and 3-byte runes cut on and off boundaries, plus zero/negative budgets.
  • Fuzz-style loop: cutting a multibyte string at every byte offset always yields valid UTF-8 within budget.
  • extractiveSummary with a fully-CJK input truncates to valid UTF-8 with the ellipsis suffix; existing summarize tests still pass.
  • go build ./..., go vet, and the full internal/worker + internal/service suites pass locally.
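
The every-offset check from the second bullet could look roughly like the sketch below. It is written as a plain program rather than a _test.go file so that it is self-contained; checkEveryOffset and the sample strings are assumptions for illustration, not the PR's actual test code.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateUTF8 is the helper from the PR description, reproduced so the
// example compiles on its own.
func truncateUTF8(s string, maxBytes int) string {
	if maxBytes <= 0 {
		return ""
	}
	if len(s) <= maxBytes {
		return s
	}
	b := maxBytes
	for b > 0 && !utf8.RuneStart(s[b]) {
		b--
	}
	return s[:b]
}

// checkEveryOffset (hypothetical) truncates s at every budget from -1 to
// len(s)+1 and returns the first violated property, or nil if none.
// It assumes s is valid UTF-8.
func checkEveryOffset(s string) error {
	for budget := -1; budget <= len(s)+1; budget++ {
		got := truncateUTF8(s, budget)
		limit := budget
		if limit < 0 {
			limit = 0
		}
		if !utf8.ValidString(got) {
			return fmt.Errorf("budget %d: invalid UTF-8 %q", budget, got)
		}
		if len(got) > limit {
			return fmt.Errorf("budget %d: got %d bytes", budget, len(got))
		}
		// For valid input, at most one partial rune (up to 3 bytes) is dropped.
		if budget >= 0 && budget < len(s) && budget-len(got) >= utf8.UTFMax {
			return fmt.Errorf("budget %d: dropped too much, kept %d bytes", budget, len(got))
		}
	}
	return nil
}

func main() {
	samples := []string{"plain ascii", "h\u00e9llo w\u00f6rld", "日本語のテキスト", "mixed 🙂 emoji"}
	for _, s := range samples {
		if err := checkEveryOffset(s); err != nil {
			fmt.Println("FAIL:", err)
			return
		}
	}
	fmt.Println("all offsets ok")
}
```

The last property is optional, but it guards against a helper that trivially satisfies the first two by returning an empty string.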

🤖 Generated with Claude Code

github-actions Bot commented Jun 22, 2026

✅ Test coverage: 79.19%

🔗 View workflow run

Global statement coverage; threshold = 41% (scripts/ci_coverage_check.sh).

Package      Stmts  Covered  Coverage
config          64       57  ✅ 89.1%
crypto          99       87  ✅ 87.9%
database        16       16  ✅ 100.0%
embedding      125      118  ✅ 94.4%
event          277      209  ✅ 75.5%
eventschema     23       21  ✅ 91.3%
graph          266      149  ✅ 56.0%
grpc           821      625  ✅ 76.1%
handler        983      820  ✅ 83.4%
metrics         29       29  ✅ 100.0%
middleware     180      154  ✅ 85.6%
model           95       78  ✅ 82.1%
ratelimit       38       28  ✅ 73.7%
repository     953      782  ✅ 82.1%
service        349      274  ✅ 78.5%
telemetry       26       24  ✅ 92.3%
webhook        174      145  ✅ 83.3%
worker         700      516  ✅ 73.7%
Global        5218     4132  ✅ 79.2%

Generated by scripts/ci_coverage_check.sh — bump floors in the same PR that lifts coverage.

ConsolidationWorker.runConsolidation clamped the joined memory content with a
byte slice (content[:16000]); SummarizeService.extractiveSummary did the same
(combined[:maxLen]). A byte cut can land inside a multi-byte UTF-8 rune and
yield an invalid byte sequence.

The consolidation result is persisted via repo.Store, so truncating non-ASCII
content inside a rune produces bytes PostgreSQL rejects on INSERT
("invalid byte sequence for encoding UTF8"); the consolidation run then fails
permanently for that prefix (any non-English/emoji content near the limit can
trip it). The summarize path is less severe (JSON marshaling sanitizes to
U+FFFD) but still emitted a garbled trailing character.

Add a small truncateUTF8 helper in each package that backs up to a rune
boundary, dropping the partial trailing rune so the result is always valid
UTF-8 and within the byte budget. Both call sites use it.

Tests: table-driven helper tests (incl. cutting a multibyte string at every
byte offset → always valid UTF-8) and a summarize case proving a fully-CJK
input truncates to valid UTF-8 with the ellipsis suffix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@marco-spagn marco-spagn force-pushed the fix/utf8-safe-truncation branch from 597fb0a to ae45b6f Compare June 25, 2026 12:04