fix(worker,service): UTF-8-safe content truncation by marco-spagn · Pull Request #163 · marco-spagn/pcmi

marco-spagn · 2026-06-22T15:07:26Z

Problem

Two call sites clamp text with a byte slice that can split a multi-byte UTF-8 rune:

ConsolidationWorker.runConsolidation — content = content[:16000]
SummarizeService.extractiveSummary — combined[:maxLen] + "…"

A byte cut landing mid-rune produces an invalid UTF-8 byte sequence.

The consolidation result is persisted via repo.Store, so for any non-ASCII content (accented text, CJK, emoji…) whose 16000th byte falls inside a rune, PostgreSQL rejects the INSERT with invalid byte sequence for encoding UTF8. The consolidation run is then marked failed and never succeeds for that prefix. The summarize path is less severe — json.Marshal sanitizes invalid UTF-8 to U+FFFD — but still emits a garbled trailing character.
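
The failure mode can be reproduced in a few lines. This is a minimal sketch, not code from the repository: `byteClamp` and `demo` are hypothetical names, and a 4-byte budget stands in for the real 16000-byte one.

```go
package main

import (
	"encoding/json"
	"fmt"
	"unicode/utf8"
)

// byteClamp reproduces the unsafe pattern: a plain byte slice that ignores
// rune boundaries. (Hypothetical name, for illustration only.)
func byteClamp(s string, maxBytes int) string {
	if len(s) <= maxBytes {
		return s
	}
	return s[:maxBytes]
}

// demo reports whether the clamped string is valid UTF-8 and how
// json.Marshal encodes it.
func demo(s string, maxBytes int) (valid bool, encoded string) {
	clamped := byteClamp(s, maxBytes)
	out, err := json.Marshal(clamped)
	if err != nil {
		return false, ""
	}
	return utf8.ValidString(clamped), string(out)
}

func main() {
	// "é" takes 2 bytes in UTF-8, so "café" is 5 bytes and a cut at 4 splits it.
	valid, encoded := demo("café", 4)
	fmt.Println(valid)   // false: the string ends in a lone lead byte
	fmt.Println(encoded) // json.Marshal substitutes U+FFFD for the stray byte
}
```

A string like the clamped one is what PostgreSQL refuses on INSERT, while the JSON path quietly swaps in the replacement character.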

Fix

Add a small truncateUTF8(s, maxBytes) helper (one per affected package — no shared util grab-bag exists) that backs up to a rune boundary, dropping the partial trailing rune. The result is always valid UTF-8 and within the byte budget. Both call sites use it.

func truncateUTF8(s string, maxBytes int) string {
	if maxBytes <= 0 {
		return ""
	}
	if len(s) <= maxBytes {
		return s
	}
	b := maxBytes
	for b > 0 && !utf8.RuneStart(s[b]) {
		b--
	}
	return s[:b]
}
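
For orientation, the two call sites might look roughly like the sketch below. The wrapper names `clampForStore` and `clampSummary` and the constant `maxContentBytes` are hypothetical; only the 16000-byte limit, the `maxLen` parameter, and the "…" suffix come from the description above. It also assumes the ellipsis is appended only when a cut actually happens.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// truncateUTF8 is the helper from this PR, repeated so the sketch compiles alone.
func truncateUTF8(s string, maxBytes int) string {
	if maxBytes <= 0 {
		return ""
	}
	if len(s) <= maxBytes {
		return s
	}
	b := maxBytes
	for b > 0 && !utf8.RuneStart(s[b]) {
		b--
	}
	return s[:b]
}

// maxContentBytes mirrors the 16000-byte clamp; the constant name is assumed.
const maxContentBytes = 16000

// clampForStore stands in for the consolidation call site (hypothetical name).
func clampForStore(content string) string {
	return truncateUTF8(content, maxContentBytes)
}

// clampSummary stands in for the summarize call site (hypothetical name).
// As in the original expression, the ellipsis is added on top of maxLen.
func clampSummary(combined string, maxLen int) string {
	if len(combined) <= maxLen {
		return combined
	}
	return truncateUTF8(combined, maxLen) + "…"
}

func main() {
	// Each CJK character is 3 bytes; a 7-byte budget keeps two whole runes.
	fmt.Println(clampSummary("日本語テキスト", 7)) // 日本…

	// One ASCII byte shifts every "é" to an odd offset, so byte 16000 is mid-rune.
	stored := clampForStore("a" + strings.Repeat("é", 9000))
	fmt.Println(len(stored), utf8.ValidString(stored)) // 15999 true
}
```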

Tests

Table-driven helper tests covering ASCII, 2-byte (é) and 3-byte (€) runes cut on and off boundaries, plus zero/negative budgets.
Fuzz-style loop: cutting a multibyte string at every byte offset always yields valid UTF-8 within budget.
extractiveSummary with a fully-CJK input truncates to valid UTF-8 with the ellipsis suffix; existing summarize tests still pass.
go build ./..., go vet, and the full internal/worker + internal/service suites pass locally.
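
The every-offset check could be sketched as follows. This is an illustration of the property being tested, not the test file from the PR; `checkAllOffsets` is a hypothetical name, and the helper is repeated so the block compiles on its own.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// truncateUTF8 is the helper from this PR, repeated so the sketch compiles alone.
func truncateUTF8(s string, maxBytes int) string {
	if maxBytes <= 0 {
		return ""
	}
	if len(s) <= maxBytes {
		return s
	}
	b := maxBytes
	for b > 0 && !utf8.RuneStart(s[b]) {
		b--
	}
	return s[:b]
}

// checkAllOffsets truncates s at every budget from -1 to len(s)+1 and
// verifies three properties: the result is valid UTF-8, it fits the budget,
// and it is a prefix of the input. (Hypothetical name.)
func checkAllOffsets(s string) error {
	for n := -1; n <= len(s)+1; n++ {
		got := truncateUTF8(s, n)
		budget := n
		if budget < 0 {
			budget = 0
		}
		switch {
		case !utf8.ValidString(got):
			return fmt.Errorf("budget %d: invalid UTF-8 %q", n, got)
		case len(got) > budget:
			return fmt.Errorf("budget %d: result has %d bytes", n, len(got))
		case !strings.HasPrefix(s, got):
			return fmt.Errorf("budget %d: %q is not a prefix of the input", n, got)
		}
	}
	return nil
}

func main() {
	// Mixes 1-, 2-, 3-, and 4-byte runes so every kind of mid-rune cut occurs.
	fmt.Println(checkAllOffsets("aé€😀z"))
}
```

In a real test file the same loop would sit inside a `func TestXxx(t *testing.T)` and report failures with `t.Errorf` instead of returning an error.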

🤖 Generated with Claude Code

github-actions · 2026-06-22T15:10:04Z

✅ Test coverage: `79.19%`

🔗 View workflow run

Global statement coverage; threshold = 41% (scripts/ci_coverage_check.sh).

| Package | Stmts | Covered | Coverage |
|---|---:|---:|---:|
| `config` | 64 | 57 | ✅ 89.1% |
| `crypto` | 99 | 87 | ✅ 87.9% |
| `database` | 16 | 16 | ✅ 100.0% |
| `embedding` | 125 | 118 | ✅ 94.4% |
| `event` | 277 | 209 | ✅ 75.5% |
| `eventschema` | 23 | 21 | ✅ 91.3% |
| `graph` | 266 | 149 | ✅ 56.0% |
| `grpc` | 821 | 625 | ✅ 76.1% |
| `handler` | 983 | 820 | ✅ 83.4% |
| `metrics` | 29 | 29 | ✅ 100.0% |
| `middleware` | 180 | 154 | ✅ 85.6% |
| `model` | 95 | 78 | ✅ 82.1% |
| `ratelimit` | 38 | 28 | ✅ 73.7% |
| `repository` | 953 | 782 | ✅ 82.1% |
| `service` | 349 | 274 | ✅ 78.5% |
| `telemetry` | 26 | 24 | ✅ 92.3% |
| `webhook` | 174 | 145 | ✅ 83.3% |
| `worker` | 700 | 516 | ✅ 73.7% |
| Global | 5218 | 4132 | ✅ 79.2% |

_Generated by scripts/ci_coverage_check.sh; bump floors in the same PR that lifts coverage._

ConsolidationWorker.runConsolidation clamped the joined memory content with a byte slice (content[:16000]); SummarizeService.extractiveSummary did the same (combined[:maxLen]). A byte cut can land inside a multi-byte UTF-8 rune and yield an invalid byte sequence.

The consolidation result is persisted via repo.Store, so truncating non-ASCII content in the middle of a rune produces bytes PostgreSQL rejects on INSERT ("invalid byte sequence for encoding UTF8"). The consolidation run then fails permanently for that prefix; any non-ASCII content (accented text, CJK, emoji) straddling the limit can trip it. The summarize path is less severe (JSON marshaling sanitizes to U+FFFD) but still emitted a garbled trailing character.

Add a small truncateUTF8 helper in each package that backs up to a rune boundary, dropping the partial trailing rune so the result is always valid UTF-8 and within the byte budget. Both call sites use it.

Tests: table-driven helper tests (incl. cutting a multibyte string at every byte offset, which always yields valid UTF-8) and a summarize case proving a fully-CJK input truncates to valid UTF-8 with the ellipsis suffix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

marco-spagn force-pushed the fix/utf8-safe-truncation branch from 597fb0a to ae45b6f Compare June 25, 2026 12:04

marco-spagn wants to merge 1 commit into main from fix/utf8-safe-truncation
