MCP tools hang forever after an interrupted mine — chromadb 1.x compactor cannot backfill WAL into the HNSW segment (CLI unaffected)

## Summary
After a `mempalace mine` is interrupted/killed mid-write, **every MCP tool call hangs indefinitely** (`mempalace_status`, `mempalace_search`, `mempalace_reconnect`, `mempalace_add_drawer`) — no timeout, no error. The **CLI keeps working** (`mempalace status`, `mempalace repair-status` return in seconds). The hang **survives full reboots** because the broken state is on disk. Root cause is a damaged HNSW segment writer + pending rows in chromadb's `embeddings_queue` that the chromadb-1.x compactor cannot apply.

## Environment
- mempalace **3.4.0**, chromadb **1.5.9** (Rust core / "compactor")
- Windows 11, Python 3.13, onnxruntime-gpu, embedding model `embeddinggemma` (384-dim), backend `chroma`
- Palace size ~268k drawers (~1.4 GB sqlite + ~450 MB HNSW)

## Symptom
- MCP: any tool that opens the chroma collection hangs forever. The harness logs `Tool 'mempalace_status' still running (720s elapsed)`. The hang is in **native code** — it ignores SIGINT.
- CLI: `mempalace status` and `mempalace repair-status` work fine (they read sqlite counts and **never open a chromadb client**).

## Root cause
An interrupted `mine` leaves unflushed rows in chroma's `embeddings_queue` plus a damaged HNSW **segment writer**. Any operation that opens the collection (`count()`, query, add) makes the chromadb-1.x **compactor** try to backfill the queue into the HNSW segment, which fails:

```
chromadb.errors.InternalError: Error executing plan:
  Error sending backfill request to compactor:
  Failed to apply logs to the hnsw segment writer
```

In the long-lived MCP server this surfaces as an **infinite hang**; called directly from Python it raises the error above.

## How to confirm (read-only)
```python
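import chromadb

# Point this at the affected palace directory.
client = chromadb.PersistentClient(path=r"<HOME>/.mempalace/palace")
for col in client.list_collections():
    # count() opens the collection, which triggers the compactor backfill.
    print(col.name, col.count())   # -> InternalError: ...hnsw segment writer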
```
Note: `mempalace repair-status` passes (sqlite vs HNSW counts match within flush-lag) because it **never opens the chroma client** — so it does **not** catch a broken segment writer.
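
Because the hang is in native code and ignores SIGINT, it may be safer to run the same probe in a child process that can be killed from outside. The following is a minimal sketch using only the standard library plus the snippet above. `probe_palace`, its state labels, and the 60-second default are illustrative choices, not part of mempalace or chromadb.

```python
import subprocess
import sys

# Marker text copied from the traceback in "Root cause".
SEGMENT_WRITER_ERROR = "Failed to apply logs to the hnsw segment writer"

# Child script: the same read-only probe, with the path passed as argv[1].
PROBE_SCRIPT = """
import sys
import chromadb
client = chromadb.PersistentClient(path=sys.argv[1])
for col in client.list_collections():
    print(col.name, col.count())
"""


def probe_palace(palace_path, timeout_s=60.0, script=PROBE_SCRIPT):
    """Run the count() probe in a child process; return (state, detail)."""
    cmd = [sys.executable, "-c", script, palace_path]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before re-raising.
        return "timeout", f"no answer after {timeout_s}s"
    if proc.returncode == 0:
        return "ok", proc.stdout.strip()
    if SEGMENT_WRITER_ERROR in proc.stderr:
        return "broken_segment_writer", "consider `repair --mode from-sqlite`"
    return "error", proc.stderr.strip()
```

Calling `probe_palace(r"<HOME>/.mempalace/palace")` would then return a state string instead of blocking the caller. A stuck compactor shows up as `"timeout"`, and the error above as `"broken_segment_writer"`.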

## What it was NOT (ruled out, each with evidence)
- **Stale lock files** — cleaned `~/.mempalace/locks/mine_palace_*.lock` + `hook_state/mine_pids/*.pid`; still hangs. (Separate leak: 2200+ empty `locks/*.lock` accumulate.)
- **HNSW corruption** — `repair-status` OK; `link_lists.bin` tiny (~2 MB), no blowup.
- **HF Hub network** — model is cached locally; `HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1` still hangs.
- **GPU/CUDA/TensorRT** — `embedding_device: cpu` still hangs (the hang is *before* embedding).
- **`max_seq_id` poison** — values sane (`max_seq_id == embeddings count`).
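
Checks like the `max_seq_id` comparison can be made without touching the chroma client by opening the sqlite file read-only. This is a rough sketch using the standard `sqlite3` module. `count_rows` is a hypothetical helper, and the database file name and table list in the usage note are assumptions to verify against your own palace directory.

```python
import sqlite3
from pathlib import Path


def count_rows(db_file, tables):
    """Return {table: row count, or None if the table is missing}, read-only."""
    # mode=ro makes sqlite refuse any write; timeout sets the busy timeout.
    uri = Path(db_file).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True, timeout=5.0)
    try:
        existing = {
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            )
        }
        counts = {}
        for table in tables:
            if table in existing:
                # Safe to interpolate: the name was just read from sqlite_master.
                query = f'SELECT COUNT(*) FROM "{table}"'
                counts[table] = conn.execute(query).fetchone()[0]
            else:
                counts[table] = None
        return counts
    finally:
        conn.close()
```

For example, `count_rows(r"<HOME>/.mempalace/palace/chroma.sqlite3", ["embeddings_queue", "embeddings"])` (file name assumed) would show how many rows are still pending in the queue next to the embeddings count.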

## Fix that worked
`repair --mode from-sqlite` rebuilds the HNSW from the sqlite ground truth, bypassing the broken client/segment:

```bash
# In-place fails on Windows (see gotcha below), so rebuild into a new path
# reading the live sqlite read-only:
mempalace --palace <HOME>/.mempalace/palace_rebuilt repair \
          --mode from-sqlite --source <HOME>/.mempalace/palace --yes
mempalace --palace <HOME>/.mempalace/palace_rebuilt palace set-embedder --model embeddinggemma
# then point config.json "palace_path" at palace_rebuilt and restart the MCP server
```
Result: rebuilt 269,491 rows; `count()` OK, `repair-status` **divergence 0**, end-to-end MCP `status`/`search`/`add_drawer` work again.

### Windows gotcha
In-place `repair --mode from-sqlite --archive-existing` fails with:
```
PermissionError: [WinError 32] ... data_level0.bin is used by another process
```
because the **running MCP server holds the HNSW segment files open**, so the archive-rename can't proceed. Either stop the MCP server first, or rebuild into a separate path from the live sqlite (read-only) as above.

## Suggested hardening
1. **Fail fast, don't hang.** Wrap collection-open / compaction with an operation deadline (and a SQLite `busy_timeout`) so a stuck compactor returns a structured error instead of hanging the whole MCP server forever.
2. **Surface the real cause.** Detect the `Failed to apply logs to the hnsw segment writer` compactor error and point the user to `repair --mode from-sqlite` instead of an opaque infinite hang.
3. **mine crash-safety.** Make queue/WAL application atomic or recoverable so an interrupted `mine` can't leave an unbackfillable segment; on startup, detect a dirty queue and offer a guarded rebuild.
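4. **Lock hygiene.** `mine_palace_lock` (palace.py ~L945–984) never liveness-checks the recorded holder PID (no `os.kill(pid,0)` / lease) and leaks empty `locks/*.lock`. Add PID-liveness/lease + cleanup-on-release. Note that `os.kill(pid,0)` is a safe existence check only on POSIX; on Windows, `os.kill` with any signal other than `CTRL_C_EVENT`/`CTRL_BREAK_EVENT` terminates the target process, so the Windows path needs a handle-based check instead.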
5. **Deeper `repair-status`.** Since it never opens the chroma client, it can't see a broken segment writer — consider an optional guarded `count()` probe so the broken state is detectable without a 6-hour investigation.
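
For item 4, a PID-liveness check could look roughly like the sketch below. It is not taken from `palace.py`: `pid_is_alive` and `lock_is_stale` are hypothetical names, and the lock-file format (holder PID stored as decimal text, empty file meaning no recorded holder) is an assumption. The Windows branch uses Win32 calls through `ctypes`, because `os.kill(pid, 0)` would terminate the target there.

```python
import ctypes
import os
import sys


def pid_is_alive(pid):
    """Best-effort check that a process with this PID currently exists."""
    if pid <= 0:
        return False
    if sys.platform == "win32":
        from ctypes import wintypes

        PROCESS_QUERY_LIMITED_INFORMATION = 0x1000
        ERROR_ACCESS_DENIED = 5
        STILL_ACTIVE = 259
        k32 = ctypes.WinDLL("kernel32", use_last_error=True)
        k32.OpenProcess.restype = wintypes.HANDLE
        k32.OpenProcess.argtypes = [wintypes.DWORD, wintypes.BOOL, wintypes.DWORD]
        k32.GetExitCodeProcess.argtypes = [wintypes.HANDLE, ctypes.POINTER(wintypes.DWORD)]
        k32.CloseHandle.argtypes = [wintypes.HANDLE]
        handle = k32.OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION, False, pid)
        if not handle:
            # Access denied means the process exists but cannot be queried.
            return ctypes.get_last_error() == ERROR_ACCESS_DENIED
        try:
            code = wintypes.DWORD()
            ok = k32.GetExitCodeProcess(handle, ctypes.byref(code))
            return bool(ok) and code.value == STILL_ACTIVE
        finally:
            k32.CloseHandle(handle)
    try:
        os.kill(pid, 0)  # POSIX: signal 0 only checks existence/permission
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def lock_is_stale(lock_file):
    """Assumed format: the file holds the holder PID as decimal text."""
    with open(lock_file, encoding="ascii") as fh:
        text = fh.read().strip()
    if not text.isdigit():
        return True  # empty or unparseable: no recorded holder
    return not pid_is_alive(int(text))
```

Liveness alone can be fooled by PID reuse after a reboot, which is one reason to pair it with a lease (for example, a timestamp the holder refreshes) as the item suggests.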

