MCP tools hang forever after an interrupted mine — chromadb 1.x compactor cannot backfill WAL into the HNSW segment (CLI unaffected) #1845

Description

@KeilerHirsch

Summary

After a mempalace mine is interrupted/killed mid-write, every MCP tool call hangs indefinitely (mempalace_status, mempalace_search, mempalace_reconnect, mempalace_add_drawer) — no timeout, no error. The CLI keeps working (mempalace status, mempalace repair-status return in seconds). The hang survives full reboots because the broken state is on disk. Root cause is a damaged HNSW segment writer + pending rows in chromadb's embeddings_queue that the chromadb-1.x compactor cannot apply.

Environment

  • mempalace 3.4.0, chromadb 1.5.9 (Rust core / "compactor")
  • Windows 11, Python 3.13, onnxruntime-gpu, embedding model embeddinggemma (384-dim), backend chroma
  • Palace size ~268k drawers (~1.4 GB sqlite + ~450 MB HNSW)

Symptom

  • MCP: any tool that opens the chroma collection hangs forever. The harness logs Tool 'mempalace_status' still running (720s elapsed). The hang is in native code — it ignores SIGINT.
  • CLI: mempalace status and mempalace repair-status work fine (they read sqlite counts and never open a chromadb client).

Root cause

An interrupted mine leaves unflushed rows in chroma's embeddings_queue plus a damaged HNSW segment writer. Any operation that opens the collection (count(), query, add) makes the chromadb-1.x compactor try to backfill the queue into the HNSW segment, which fails:

chromadb.errors.InternalError: Error executing plan:
  Error sending backfill request to compactor:
  Failed to apply logs to the hnsw segment writer

In the long-lived MCP server this surfaces as an infinite hang; called directly from Python it raises the error above.
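
One way to check for leftover queue rows without opening a chromadb client is to read the sqlite file directly in read-only mode. This is a minimal sketch, not mempalace code: it assumes the palace directory contains chroma's usual chroma.sqlite3 file with an embeddings_queue table, and pending_queue_rows is a hypothetical helper name. A nonzero count is only a hint, since chroma may keep already-applied rows in the queue until it purges them.

```python
import sqlite3
from pathlib import Path


def pending_queue_rows(palace_dir: str) -> int:
    """Count rows in chroma's embeddings_queue without opening a chroma client."""
    # Assumption: the persistent store lives in <palace_dir>/chroma.sqlite3.
    db_path = Path(palace_dir) / "chroma.sqlite3"
    # mode=ro opens the file read-only and fails if it does not exist,
    # so this never creates or modifies the database.
    uri = db_path.resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        (count,) = conn.execute("SELECT COUNT(*) FROM embeddings_queue").fetchone()
        return count
    finally:
        conn.close()
```

Called as pending_queue_rows(r"<HOME>/.mempalace/palace"), it returns the row count or raises sqlite3.OperationalError if the file or table is missing.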

How to confirm (read-only)

import chromadb
c = chromadb.PersistentClient(path=r"<HOME>/.mempalace/palace")
for col in c.list_collections():
    print(col.name, col.count())   # -> InternalError: ...hnsw segment writer

Note: mempalace repair-status passes (sqlite vs HNSW counts match within flush-lag) because it never opens the chroma client — so it does not catch a broken segment writer.
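
Because the hang sits in native code and ignores SIGINT, the same probe can be run in a child process that is killed after a deadline. The sketch below is an illustration under assumptions, not an existing mempalace feature: probe_palace, the status strings, and the 60-second default are all hypothetical. It reuses the count() loop from above and recognizes the compactor error text quoted in this report.

```python
import subprocess
import sys

# The probe from "How to confirm", run in a separate interpreter.
PROBE = """
import sys, chromadb
client = chromadb.PersistentClient(path=sys.argv[1])
for col in client.list_collections():
    print(col.name, col.count())
"""

# Error text quoted in this report; used to suggest the from-sqlite rebuild.
SEGMENT_WRITER_ERROR = "Failed to apply logs to the hnsw segment writer"


def probe_palace(path: str, timeout_s: float = 60.0, code: str = PROBE) -> tuple[str, str]:
    """Run the probe in a child process so a native hang can be killed."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code, path],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # on expiry the child is killed, then TimeoutExpired is raised
        )
    except subprocess.TimeoutExpired:
        return "hung", f"no answer within {timeout_s}s"
    if proc.returncode != 0:
        if SEGMENT_WRITER_ERROR in proc.stderr:
            return "needs_rebuild", "segment writer broken; try repair --mode from-sqlite"
        return "error", proc.stderr.strip()
    return "ok", proc.stdout.strip()
```

A caller would branch on the first element of the returned pair; the code parameter exists only so the wrapper can be exercised without a real palace.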

What it was NOT (ruled out, each with evidence)

  • Stale lock files — cleaned ~/.mempalace/locks/mine_palace_*.lock + hook_state/mine_pids/*.pid; still hangs. (Separate leak: 2200+ empty locks/*.lock accumulate.)
  • HNSW corruption: repair-status OK; link_lists.bin tiny (~2 MB), no blowup.
  • HF Hub network — model is cached locally; HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 still hangs.
  • GPU/CUDA/TensorRT: with embedding_device: cpu it still hangs (the hang occurs before embedding).
  • max_seq_id poison — values sane (max_seq_id == embeddings count).

Fix that worked

repair --mode from-sqlite rebuilds the HNSW from the sqlite ground truth, bypassing the broken client/segment:

# In-place fails on Windows (see gotcha below), so rebuild into a new path
# reading the live sqlite read-only:
mempalace --palace <HOME>/.mempalace/palace_rebuilt repair \
          --mode from-sqlite --source <HOME>/.mempalace/palace --yes
mempalace --palace <HOME>/.mempalace/palace_rebuilt palace set-embedder --model embeddinggemma
# then point config.json "palace_path" at palace_rebuilt and restart the MCP server

Result: rebuilt 269,491 rows; count() OK, repair-status divergence 0, end-to-end MCP status/search/add_drawer work again.

Windows gotcha

In-place repair --mode from-sqlite --archive-existing fails with:

PermissionError: [WinError 32] ... data_level0.bin is used by another process

because the running MCP server holds the HNSW segment files open, so the archive-rename can't proceed. Either stop the MCP server first, or rebuild into a separate path from the live sqlite (read-only) as above.

Suggested hardening

  1. Fail fast, don't hang. Wrap collection-open / compaction with an operation deadline (and a SQLite busy_timeout) so a stuck compactor returns a structured error instead of hanging the whole MCP server forever.
  2. Surface the real cause. Detect the Failed to apply logs to the hnsw segment writer compactor error and point the user to repair --mode from-sqlite instead of an opaque infinite hang.
  3. mine crash-safety. Make queue/WAL application atomic or recoverable so an interrupted mine can't leave an unbackfillable segment; on startup, detect a dirty queue and offer a guarded rebuild.
  4. Lock hygiene. mine_palace_lock (palace.py ~L945–984) never liveness-checks the recorded holder PID (no os.kill(pid,0) / lease) and leaks empty locks/*.lock. Add PID-liveness/lease + cleanup-on-release.
  5. Deeper repair-status. Since it never opens the chroma client, it can't see a broken segment writer — consider an optional guarded count() probe so the broken state is detectable without a 6-hour investigation.
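
For item 4, a PID-liveness check could look like the sketch below. It is an assumption-laden illustration, not the code in palace.py, and pid_alive is a hypothetical name. One caveat for this Windows setup: per the Python documentation, os.kill(pid, 0) on Windows terminates the target process rather than probing it, so the Windows branch queries the process handle instead.

```python
import os
import sys

if sys.platform == "win32":
    import ctypes
    from ctypes import wintypes

    _k32 = ctypes.WinDLL("kernel32", use_last_error=True)
    _k32.OpenProcess.argtypes = [wintypes.DWORD, wintypes.BOOL, wintypes.DWORD]
    _k32.OpenProcess.restype = wintypes.HANDLE
    _k32.GetExitCodeProcess.argtypes = [wintypes.HANDLE, wintypes.LPDWORD]
    _k32.GetExitCodeProcess.restype = wintypes.BOOL
    _k32.CloseHandle.argtypes = [wintypes.HANDLE]
    _k32.CloseHandle.restype = wintypes.BOOL

    _PROCESS_QUERY_LIMITED_INFORMATION = 0x1000
    _ERROR_ACCESS_DENIED = 5
    _STILL_ACTIVE = 259


def pid_alive(pid: int) -> bool:
    """Best-effort check that a process with this PID is currently running."""
    if pid <= 0:
        return False
    if sys.platform == "win32":
        handle = _k32.OpenProcess(_PROCESS_QUERY_LIMITED_INFORMATION, False, pid)
        if not handle:
            # Access denied means the process exists but belongs to someone else.
            return ctypes.get_last_error() == _ERROR_ACCESS_DENIED
        try:
            exit_code = wintypes.DWORD()
            ok = _k32.GetExitCodeProcess(handle, ctypes.byref(exit_code))
            return bool(ok) and exit_code.value == _STILL_ACTIVE
        finally:
            _k32.CloseHandle(handle)
    try:
        os.kill(pid, 0)  # POSIX: signal 0 checks existence without sending a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True
```

PIDs can be reused after a crash, so a check like this is probably best paired with a lease or timestamp in the lock record rather than used alone.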
