fix(repair): point index-read failures to repair --mode from-sqlite (#1843) #1847
mvalentsev wants to merge 1 commit into
…emPalace#1843)
When the chromadb compactor cannot apply the WAL into the drawers HNSW segment (InternalError: Failed to apply logs to the hnsw segment writer), the legacy repair paths fail on their first Collection.count() read and advise re-mining from source files. The drawer rows are intact in chroma.sqlite3, so repair --mode from-sqlite rebuilds them; re-mining silently drops drawers added via the MCP server and diary entries that have no source file.
Both legacy read-failure sites (cmd_repair and rebuild_index) now emit shared guidance pointing at the from-sqlite recovery, worded conditionally so it also covers a live server or mine still holding the palace open.
Co-Authored-By: undeadindustries <9536461+undeadindustries@users.noreply.github.com>
Code Review
This pull request improves the recovery guidance provided to users when a drawer-index read fails (e.g., due to a chromadb HNSW compactor failure). Instead of advising users to re-mine from source files—which can cause data loss for drawers added via the MCP server or diary entries—the CLI and index rebuilder now recommend rebuilding the index from SQLite using mempalace repair --mode from-sqlite --archive-existing. Unit tests have been added to verify this behavior. There are no review comments, so I have no feedback to provide.
Addresses #1843.
What does this PR do?
When the chromadb compactor cannot apply the WAL into the drawers HNSW segment (InternalError: Failed to apply logs to the hnsw segment writer), mempalace repair (default legacy mode) and the library rebuild_index both fail on their first read (Collection.count()) and then print advice to re-mine from source files.
That advice is harmful: the drawer rows are intact in chroma.sqlite3, and repair --mode from-sqlite rebuilds the index from them. Re-mining instead silently drops drawers added through the MCP server and diary entries, which have no source file to mine. That is the data loss #1843 reports.
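A toy model can illustrate why the two recovery routes differ. This is only a sketch: the Drawer class, its fields, and both recover_* functions are hypothetical stand-ins invented for illustration, not mempalace code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Drawer:
    """Hypothetical stand-in for a drawer row stored in chroma.sqlite3."""
    drawer_id: str
    source_file: Optional[str]  # None for MCP-added drawers and diary entries


def recover_by_remining(rows):
    """Re-mining can only recreate drawers that have a source file to mine."""
    return [d for d in rows if d.source_file is not None]


def recover_from_sqlite(rows):
    """Rebuilding from the SQLite rows keeps every drawer, whatever its origin."""
    return list(rows)


rows = [
    Drawer("d1", "notes/project.md"),
    Drawer("d2", None),  # added through the MCP server (assumed example)
    Drawer("d3", None),  # diary entry (assumed example)
]

remined = recover_by_remining(rows)
rebuilt = recover_from_sqlite(rows)

# Drawers that a re-mine would silently lose.
lost = [d.drawer_id for d in rows if d not in remined]
```

In this model the re-mine loses every drawer without a source file, while the rebuild from the stored rows keeps all of them.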
Both legacy read-failure sites now emit shared guidance pointing at the working recovery, mempalace repair --mode from-sqlite --archive-existing.
The message is worded conditionally (it also covers a live server/mine still holding the palace open) because the except Exception handler cannot prove which case it caught.
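The shared-guidance pattern might look roughly like the sketch below. The helper name, its wording, the read_drawer_count wrapper, and the BrokenCollection fake are all assumptions made for illustration; the actual message text and call sites in the PR are not reproduced here. Only the repair command itself is taken from the PR.

```python
def index_read_recovery_guidance():
    """Hypothetical shared helper: one message for every index-read failure site.

    The wording is illustrative and conditional on purpose, because a broad
    exception handler cannot tell a damaged index from a palace held open.
    """
    return (
        "Could not read the drawer index.\n"
        "If a server or mine is still running against this palace, "
        "stop it and retry.\n"
        "If the index itself is damaged, the drawer rows may still be intact "
        "in chroma.sqlite3. Rebuild from them with:\n"
        "  mempalace repair --mode from-sqlite --archive-existing\n"
    )


class BrokenCollection:
    """Fake collection that simulates the HNSW compaction failure on read."""

    def count(self):
        raise RuntimeError("Failed to apply logs to the hnsw segment writer")


def read_drawer_count(collection):
    """Return (count, None) on success or (None, guidance) on any read failure."""
    try:
        return collection.count(), None
    except Exception:
        # The cause is unknown at this point, so the guidance stays conditional.
        return None, index_read_recovery_guidance()


count, guidance = read_drawer_count(BrokenCollection())
```

Routing both failure sites through one helper keeps the advice consistent and gives a regression test a single string to assert against.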
Scope
This is the "at minimum, change the message" part of #1843's first request.
The larger asks (auto-fallback to from-sqlite, MCP reconnect resilience,
decoupling the durable SQLite write from HNSW indexing in add_drawer) are out of scope here; the compaction failure itself lives in chromadb's core.
Two other repair.py re-mine messages are left as-is on purpose: check_extraction_safety (a row-count truncation abort, not an index read) and the RebuildCollectionError no-backup-after-live-swap path (where the collection is genuinely gone and re-mining is a legitimate last resort).
How to test
New regression tests:
test_cmd_repair_error_reading_points_to_from_sqlite_not_remine
test_rebuild_index_read_failure_points_to_from_sqlite
test_index_read_recovery_guidance_recommends_from_sqlite
Checklist
python -m pytest tests/ -v
ruff check .