Summary
After a mempalace mine is interrupted/killed mid-write, every MCP tool call hangs indefinitely (mempalace_status, mempalace_search, mempalace_reconnect, mempalace_add_drawer) — no timeout, no error. The CLI keeps working (mempalace status, mempalace repair-status return in seconds). The hang survives full reboots because the broken state is on disk. Root cause is a damaged HNSW segment writer + pending rows in chromadb's embeddings_queue that the chromadb-1.x compactor cannot apply.
Environment
- mempalace 3.4.0, chromadb 1.5.9 (Rust core / "compactor")
- Windows 11, Python 3.13, onnxruntime-gpu, embedding model embeddinggemma (384-dim), backend chroma
- Palace size ~268k drawers (~1.4 GB sqlite + ~450 MB HNSW)
Symptom
- MCP: any tool that opens the chroma collection hangs forever. The harness logs: Tool 'mempalace_status' still running (720s elapsed). The hang is in native code: it ignores SIGINT.
- CLI: mempalace status and mempalace repair-status work fine (they read sqlite counts and never open a chromadb client).
Root cause
An interrupted mine leaves unflushed rows in chroma's embeddings_queue plus a damaged HNSW segment writer. Any operation that opens the collection (count(), query, add) makes the chromadb-1.x compactor try to backfill the queue into the HNSW segment, which fails:
chromadb.errors.InternalError: Error executing plan:
Error sending backfill request to compactor:
Failed to apply logs to the hnsw segment writer
In the long-lived MCP server this surfaces as an infinite hang; called directly from Python it raises the error above.
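Since the Python-side failure carries a stable message, a caller could recognise it by text and turn it into an actionable hint. A minimal sketch, assuming substring matching on the exception message is acceptable; `explain_chroma_error` and the hint wording are hypothetical, not part of mempalace or chromadb:

```python
# Marker text copied from the error shown above.
HNSW_WRITER_MARKER = "Failed to apply logs to the hnsw segment writer"


def explain_chroma_error(exc):
    """Return a repair hint if exc looks like the broken segment writer, else None.

    Hypothetical helper: it only inspects str(exc), so it does not need to
    import chromadb and works for any exception type.
    """
    if HNSW_WRITER_MARKER.lower() in str(exc).lower():
        return (
            "chroma could not backfill embeddings_queue into the HNSW segment; "
            "try: mempalace repair --mode from-sqlite"
        )
    return None
```

A tool handler could call this in its `except` branch and return the hint as the tool result instead of re-raising.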
How to confirm (read-only)
import chromadb

# Opening the collection and calling count() is what triggers the compactor backfill.
c = chromadb.PersistentClient(path=r"<HOME>/.mempalace/palace")
for col in c.list_collections():
    print(col.name, col.count())  # -> InternalError: ...hnsw segment writer
Note: mempalace repair-status passes (sqlite vs HNSW counts match within flush-lag) because it never opens the chroma client — so it does not catch a broken segment writer.
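A complementary check that never opens a chroma client is to count the rows still sitting in embeddings_queue directly in sqlite. This is a sketch under assumptions: that the database file is `chroma.sqlite3` directly under the palace path, and that the queue table is named `embeddings_queue` as in the error analysis above. `pending_queue_rows` is a hypothetical helper. A non-zero count is not proof of damage on its own (normal flush lag also leaves rows), but a queue that never drains after an interrupted mine is a strong hint.

```python
import sqlite3
from pathlib import Path


def pending_queue_rows(palace_path, db_name="chroma.sqlite3"):
    """Count rows in embeddings_queue without opening a chromadb client.

    Assumptions: db_name and the table name match the on-disk layout.
    The connection is opened read-only (mode=ro) so the probe cannot
    modify the palace, and timeout=5.0 bounds how long it waits on a lock.
    """
    db = Path(palace_path) / db_name
    uri = db.resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True, timeout=5.0)
    try:
        (count,) = conn.execute("SELECT COUNT(*) FROM embeddings_queue").fetchone()
        return count
    finally:
        conn.close()
```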
What it was NOT (ruled out, each with evidence)
- Stale lock files: cleaned ~/.mempalace/locks/mine_palace_*.lock + hook_state/mine_pids/*.pid; still hangs. (Separate leak: 2200+ empty locks/*.lock accumulate.)
- HNSW corruption: repair-status OK; link_lists.bin tiny (~2 MB), no blowup.
- HF Hub network: model is cached locally; HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 still hangs.
- GPU/CUDA/TensorRT: embedding_device: cpu still hangs (the hang is before embedding).
- max_seq_id poison: values sane (max_seq_id == embeddings count).
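On the separate lock leak noted in the first item: a stale-lock check could combine "file is empty" with "recorded PID is dead". The sketch below assumes a lock file holds the holder PID as plain text, which may not match the real on-disk format; `pid_alive` and `lock_is_stale` are hypothetical names. As far as I can tell, `os.kill(pid, 0)` is only a safe existence probe on POSIX (on Windows, 0 is `CTRL_C_EVENT`), so the Windows branch uses OpenProcess instead.

```python
import os
import sys
from pathlib import Path


def pid_alive(pid):
    """Best-effort check that a process with this PID currently exists."""
    if pid <= 0:
        return False
    if sys.platform == "win32":
        import ctypes
        from ctypes import wintypes

        PROCESS_QUERY_LIMITED_INFORMATION = 0x1000
        STILL_ACTIVE = 259
        ERROR_ACCESS_DENIED = 5
        k32 = ctypes.WinDLL("kernel32", use_last_error=True)
        k32.OpenProcess.restype = wintypes.HANDLE
        k32.OpenProcess.argtypes = [wintypes.DWORD, wintypes.BOOL, wintypes.DWORD]
        k32.GetExitCodeProcess.argtypes = [wintypes.HANDLE, ctypes.POINTER(wintypes.DWORD)]
        k32.CloseHandle.argtypes = [wintypes.HANDLE]
        handle = k32.OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION, False, pid)
        if not handle:
            # Access denied means the process exists but we may not query it.
            return ctypes.get_last_error() == ERROR_ACCESS_DENIED
        try:
            code = wintypes.DWORD()
            ok = k32.GetExitCodeProcess(handle, ctypes.byref(code))
            return bool(ok) and code.value == STILL_ACTIVE
        finally:
            k32.CloseHandle(handle)
    try:
        os.kill(pid, 0)  # POSIX: signal 0 checks existence, sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, owned by another user
    return True


def lock_is_stale(lock_path):
    """True if the lock records no live holder (assumed format: PID as text)."""
    try:
        text = Path(lock_path).read_text().strip()
    except FileNotFoundError:
        return False  # nothing to reclaim
    if not text.isdigit():
        # Empty or unparseable. A real implementation would also need to
        # tolerate a holder that has created the file but not yet written.
        return True
    return not pid_alive(int(text))
```

PID reuse can still make a dead holder look alive, which is why a lease (timestamp plus expiry) is the more robust option.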
Fix that worked
repair --mode from-sqlite rebuilds the HNSW from the sqlite ground truth, bypassing the broken client/segment:
# In-place fails on Windows (see gotcha below), so rebuild into a new path
# reading the live sqlite read-only:
mempalace --palace <HOME>/.mempalace/palace_rebuilt repair \
--mode from-sqlite --source <HOME>/.mempalace/palace --yes
mempalace --palace <HOME>/.mempalace/palace_rebuilt palace set-embedder --model embeddinggemma
# then point config.json "palace_path" at palace_rebuilt and restart the MCP server
Result: rebuilt 269,491 rows; count() OK, repair-status divergence 0, end-to-end MCP status/search/add_drawer work again.
Windows gotcha
In-place repair --mode from-sqlite --archive-existing fails with:
PermissionError: [WinError 32] ... data_level0.bin is used by another process
because the running MCP server holds the HNSW segment files open, so the archive-rename can't proceed. Either stop the MCP server first, or rebuild into a separate path from the live sqlite (read-only) as above.
Suggested hardening
- Fail fast, don't hang. Wrap collection-open / compaction with an operation deadline (and a SQLite busy_timeout) so a stuck compactor returns a structured error instead of hanging the whole MCP server forever.
- Surface the real cause. Detect the "Failed to apply logs to the hnsw segment writer" compactor error and point the user to repair --mode from-sqlite instead of an opaque infinite hang.
- mine crash-safety. Make queue/WAL application atomic or recoverable so an interrupted mine can't leave an unbackfillable segment; on startup, detect a dirty queue and offer a guarded rebuild.
- Lock hygiene. mine_palace_lock (palace.py ~L945–984) never liveness-checks the recorded holder PID (no os.kill(pid,0) / lease) and leaks empty locks/*.lock. Add PID-liveness/lease + cleanup-on-release.
- Deeper repair-status. Since it never opens the chroma client, it can't see a broken segment writer; consider an optional guarded count() probe so the broken state is detectable without a 6-hour investigation.
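One way the deadline and the guarded count() probe could fit together: because the hang is in native code and ignores SIGINT, an in-process timeout likely cannot interrupt it, but a child process can be killed. A minimal sketch, assuming the probe runs the same chromadb calls as the confirmation snippet; `guarded_probe`, `PROBE_SCRIPT` and the result dict shape are hypothetical:

```python
import subprocess
import sys

# Hypothetical probe: same calls as the read-only confirmation above,
# with the palace path passed as the first argument.
PROBE_SCRIPT = """
import sys, chromadb
client = chromadb.PersistentClient(path=sys.argv[1])
for col in client.list_collections():
    print(col.name, col.count())
"""


def guarded_probe(palace_path, deadline_s=30.0, script=PROBE_SCRIPT):
    """Run the probe in a child process and return a structured result.

    subprocess.run kills the child when the timeout expires, so a stuck
    native call cannot take the calling server down with it.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", script, str(palace_path)],
            capture_output=True,
            text=True,
            timeout=deadline_s,
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "detail": f"no answer within {deadline_s}s"}
    if proc.returncode != 0:
        lines = proc.stderr.strip().splitlines()
        # Last stderr line is usually the most specific part of a traceback.
        return {"status": "error", "detail": lines[-1] if lines else ""}
    return {"status": "ok", "detail": proc.stdout.strip()}
```

repair-status could run this behind an opt-in flag, and the MCP server could run it once at startup and refuse to open the collection in-process unless the result is "ok".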