11 changes: 11 additions & 0 deletions scripts/verify-specs.sh
@@ -0,0 +1,11 @@
#!/usr/bin/env bash
set -euo pipefail
find specs -name '*.md' | while read -r f; do
n="$(grep -coE '\([^)]+:L[0-9]+' "$f" || true)"
printf '%4s %s\n' "${n:-0}" "$f"
done | sort -n | tee /tmp/spec-citation-report.txt
echo "--- coverage ---"
echo "source modules: $(git ls-files 'mempalace/*.py' | wc -l) source specs: $(find specs/src -name '*.md' 2>/dev/null | wc -l)"
echo "test files: $(git ls-files 'tests/*.py' | wc -l) test specs: $(find specs/tests -name '*.md' 2>/dev/null | wc -l)"
echo "--- specs with ZERO citations (must be empty) ---"
awk '$1==0{print $2}' /tmp/spec-citation-report.txt
Comment on lines +3 to +11

medium

Issues:

  1. Insecure Temporary File Creation (CWE-377): Writing directly to a hardcoded path in `/tmp` (`/tmp/spec-citation-report.txt`) can lead to symlink attacks, or to conflicts if multiple users or concurrent processes run the script. Using `mktemp` is much safer.
  2. Robust Filename Handling: Piping `find` into `while read -r` without `-print0` and `IFS=` can misbehave if filenames contain newlines or leading/trailing whitespace. (`read -r` already keeps backslashes and interior spaces intact.)

Suggestion:

Use `mktemp` to securely create a temporary file, set up a `trap` to clean it up on exit, and use `find -print0` with `read -d ''` to robustly handle any special characters in filenames.

Suggested change
-find specs -name '*.md' | while read -r f; do
-n="$(grep -coE '\([^)]+:L[0-9]+' "$f" || true)"
-printf '%4s %s\n' "${n:-0}" "$f"
-done | sort -n | tee /tmp/spec-citation-report.txt
-echo "--- coverage ---"
-echo "source modules: $(git ls-files 'mempalace/*.py' | wc -l) source specs: $(find specs/src -name '*.md' 2>/dev/null | wc -l)"
-echo "test files: $(git ls-files 'tests/*.py' | wc -l) test specs: $(find specs/tests -name '*.md' 2>/dev/null | wc -l)"
-echo "--- specs with ZERO citations (must be empty) ---"
-awk '$1==0{print $2}' /tmp/spec-citation-report.txt
+report_file=$(mktemp)
+trap 'rm -f "$report_file"' EXIT
+find specs -name '*.md' -print0 | while IFS= read -r -d '' f; do
+n="$(grep -coE '\([^)]+:L[0-9]+' "$f" || true)"
+printf '%4s %s\n' "${n:-0}" "$f"
+done | sort -n | tee "$report_file"
+echo "--- coverage ---"
+echo "source modules: $(git ls-files 'mempalace/*.py' | wc -l) source specs: $(find specs/src -name '*.md' 2>/dev/null | wc -l)"
+echo "test files: $(git ls-files 'tests/*.py' | wc -l) test specs: $(find specs/tests -name '*.md' 2>/dev/null | wc -l)"
+echo "--- specs with ZERO citations (must be empty) ---"
+awk '$1==0{print $2}' "$report_file"
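
For comparison, the two concerns in this review (an unpredictable temporary file, and filename handling that never goes through line-based parsing) can be sketched in Python. This is an illustrative sketch only and not part of the PR: `citation_report` and `write_report` are hypothetical names. The per-file count deliberately mirrors `grep -c`, which counts matching lines rather than individual matches.

```python
import re
import tempfile
from pathlib import Path

# Same pattern the script passes to grep -E.
CITATION = re.compile(r"\([^)]+:L[0-9]+")


def citation_report(spec_root: Path) -> list[tuple[int, str]]:
    """Return (matching_line_count, path) pairs sorted by count, then path."""
    rows = []
    # rglob yields Path objects, so odd characters in filenames are never re-parsed.
    for path in spec_root.rglob("*.md"):
        text = path.read_text(encoding="utf-8", errors="replace")
        count = sum(1 for line in text.splitlines() if CITATION.search(line))
        rows.append((count, str(path)))
    return sorted(rows)


def write_report(rows: list[tuple[int, str]]) -> Path:
    """Write the report to a freshly created temporary file and return its path."""
    # mkstemp creates the file itself, with an unpredictable name, readable
    # and writable only by the current user.
    fd, name = tempfile.mkstemp(prefix="spec-citation-report-", suffix=".txt")
    with open(fd, "w", encoding="utf-8") as fh:
        for count, path in rows:
            fh.write(f"{count:4d} {path}\n")
    return Path(name)
```

The caller is responsible for deleting the returned file, which plays the role of the `trap ... EXIT` cleanup in the shell version.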

87 changes: 87 additions & 0 deletions specs/src/__init__.md
@@ -0,0 +1,87 @@
# Behavior Specification: `mempalace/__init__.py`

This file is the package initialization module for MemPalace. It runs at import
time and performs environment hygiene and telemetry suppression before exposing
the package version (`mempalace/__init__.py:L1-L1`).

## Public Surface

The package exports exactly one public symbol: `__version__`, a version string
re-exported from the package's version module (`mempalace/__init__.py:L38-L38`,
`mempalace/__init__.py:L60-L60`). No other names are part of the public API
contract; the public name list contains only `__version__`
(`mempalace/__init__.py:L60-L60`).

## Import-Time Side Effects (Ordering Guarantees)

On import, the module executes the following steps in this exact order:

1. Strips leaked interpreter search-path entries originating from the
`PYTHONPATH` environment variable (`mempalace/__init__.py:L36-L36`).
2. Imports and binds the version string (`mempalace/__init__.py:L38-L38`).
3. Silences a specific telemetry logger by raising its level to the most
critical threshold (`mempalace/__init__.py:L44-L44`).

The path-stripping step (1) must run before any further imports so that
subsequent imports resolve packages only from the environment's own
installation rather than from externally-injected paths
(`mempalace/__init__.py:L9-L24`, `mempalace/__init__.py:L36-L38`).

## Behavior: PYTHONPATH Search-Path Sanitization

A helper function implements the following observable contract when the package
is imported (`mempalace/__init__.py:L8-L33`):

- **Input:** The current value of the `PYTHONPATH` environment variable and the
interpreter's current module search path list (`mempalace/__init__.py:L25-L25`,
`mempalace/__init__.py:L33-L33`).
- **No-op condition:** If `PYTHONPATH` is unset or empty, the function returns
immediately and the search path is left unchanged
(`mempalace/__init__.py:L25-L27`).
- **Action:** The `PYTHONPATH` value is split on the platform path separator
into individual entries; empty entries are discarded
(`mempalace/__init__.py:L32-L32`). Each search-path entry is then removed from
the interpreter search path if it matches one of those `PYTHONPATH`-derived
entries (`mempalace/__init__.py:L33-L33`).
- **Matching rule:** Comparison is performed on a normalized form of each path
that collapses case differences and path-separator/normalization quirks, so
that case-insensitive filesystems and trailing-separator differences are
treated as equal (`mempalace/__init__.py:L29-L30`, `mempalace/__init__.py:L13-L14`).
- **Preservation invariant:** The empty-string entry on the search path (the
marker representing the implicit current working directory) is always
preserved and never removed, even if `PYTHONPATH` contains a value referring
to the current directory (`mempalace/__init__.py:L15-L17`,
`mempalace/__init__.py:L33-L33`).
- **Environment invariant:** The `PYTHONPATH` environment variable itself is NOT
modified by this function. Only the in-process search path is altered. This
keeps an embedding host application's `PYTHONPATH` intact for its own
unrelated subprocesses (`mempalace/__init__.py:L19-L24`). (Entry-point
programs separately drop `PYTHONPATH` from the environment themselves; that
behavior is external to this file — `mempalace/__init__.py:L19-L22`.)

## Behavior: Telemetry Logger Suppression

The logger named `chromadb.telemetry.product.posthog` has its level set to the
most-critical (highest) severity threshold at import time, which suppresses
noisy telemetry-related warning output on the standard error stream
(`mempalace/__init__.py:L40-L44`).
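
With the standard `logging` module this amounts to a single call. The sketch below assumes that "most-critical threshold" means `logging.CRITICAL`; the constant `TELEMETRY_LOGGER` is introduced here for illustration only.

```python
import logging

TELEMETRY_LOGGER = "chromadb.telemetry.product.posthog"

# Records below CRITICAL (including WARNING and ERROR) are dropped by this logger.
logging.getLogger(TELEMETRY_LOGGER).setLevel(logging.CRITICAL)
```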

## Edge Cases

- Empty or unset `PYTHONPATH`: search path untouched
(`mempalace/__init__.py:L26-L27`).
- `PYTHONPATH` containing only empty segments (e.g. a lone separator): those
empty segments are filtered out, so no real entries are matched, but the
current-directory marker on the search path is still preserved
(`mempalace/__init__.py:L32-L33`).
- Paths differing only by letter case or by trailing separators are still
matched and removed due to the normalization rule
(`mempalace/__init__.py:L29-L30`).

## Notes for Reimplementation

The version string is the single externally observable output of this module
(`mempalace/__init__.py:L38-L38`, `mempalace/__init__.py:L60-L60`). Everything
else is environment and process hygiene: the search-path scrubbing is a side
effect on interpreter-global state and returns nothing
(`mempalace/__init__.py:L8-L8`).
51 changes: 51 additions & 0 deletions specs/src/__main__.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
# Behavior Spec: `mempalace/__main__.py`

## Purpose

This file is the package execution entry point. It makes the package runnable as
an executable module (e.g. `python -m mempalace`), delegating all behavior to the
package's CLI dispatcher (mempalace/__main__.py:L1-L5).

## Public Surface

This module exposes no functions, classes, or constants of its own. Its only
observable behavior is the side effect produced when the module is loaded/executed
as the program entry point (mempalace/__main__.py:L3-L5).

## Behavior

On execution, the module obtains the CLI dispatcher entry point named `main` from
the package's CLI component and invokes it with no arguments
(mempalace/__main__.py:L3-L5). All command-line argument parsing, dispatch,
input/output, exit codes, and side effects are therefore defined entirely by that
CLI entry point, not by this file (mempalace/__main__.py:L3-L5).

The invocation occurs unconditionally at module load time — invoking the module as
the program entry point runs the CLI immediately (mempalace/__main__.py:L5).

## Inputs / Outputs

- Inputs: none consumed directly by this module; it forwards no explicit arguments
to the CLI entry point (mempalace/__main__.py:L5).
- Outputs / exit code: this module returns or produces nothing of its own; the
process exit code and all output are determined by the delegated CLI entry point
(mempalace/__main__.py:L3-L5).

## Invariants / Ordering

- The CLI entry point is resolved before it is called (import precedes invocation)
(mempalace/__main__.py:L3-L5).
- Exactly one CLI invocation happens per module execution
(mempalace/__main__.py:L5).

## Error / Edge-Case Behavior

This module adds no error handling of its own. Any failure to resolve the CLI
entry point, or any error raised by it, propagates unchanged to the caller
(mempalace/__main__.py:L3-L5).

## Side Effects

No filesystem, network, process, or environment side effects originate in this
file; all such effects are those of the delegated CLI entry point
(mempalace/__main__.py:L3-L5).
58 changes: 58 additions & 0 deletions specs/src/_stdio.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
# Behavior Spec: `_stdio.py` — Stdio UTF-8 Reconfiguration Helper

## Purpose

This module provides a single shared routine that forces the process's standard I/O streams (stdin, stdout, stderr) to use UTF-8 encoding on Windows, so that non-Latin / non-ASCII UTF-8 text is not corrupted ("mojibaked") by the platform's default ANSI codepage. On all non-Windows platforms the routine does nothing (mempalace/_stdio.py:L1-L9, mempalace/_stdio.py:L49-L50).

## Public Surface

A single public function:

`reconfigure_stdio_utf8_on_windows(*, stdin_errors, stdout_errors, stderr_errors, on_failure) -> None` (mempalace/_stdio.py:L31-L37).

### Parameters (all keyword-only, all optional)

- `stdin_errors`: string error-handling policy applied when reconfiguring stdin. Default value is `"surrogateescape"` (mempalace/_stdio.py:L33). The default ensures malformed bytes from a redirected file or misbehaving client survive as lone surrogates rather than aborting the read with a decode error (mempalace/_stdio.py:L19-L22).
- `stdout_errors`: string error-handling policy applied when reconfiguring stdout. Default value is `"strict"` (mempalace/_stdio.py:L34).
- `stderr_errors`: string error-handling policy applied when reconfiguring stderr. Default value is `"strict"` (mempalace/_stdio.py:L35).
- `on_failure`: optional callback invoked as `on_failure(stream_name, exception)` for any stream whose reconfiguration raises an error. If not provided (i.e. `None`), a default failure behavior is used instead (mempalace/_stdio.py:L36-L47).

### Return value

Returns nothing / no value (mempalace/_stdio.py:L37-L38).

## Behavior

### Platform gating

If the current platform is not Windows (`win32`), the function returns immediately and performs no reconfiguration and no side effects (mempalace/_stdio.py:L49-L50).

### Reconfiguration order and processing

On Windows, the function processes exactly three streams in this fixed order: stdin first, then stdout, then stderr. Each stream is paired with its caller-chosen error policy (mempalace/_stdio.py:L52-L57).

For each stream, in order:
1. The stream object is looked up by name on the standard I/O namespace; if the named stream is absent it is treated as missing (mempalace/_stdio.py:L58).
2. The stream's reconfigure capability is looked up; if the stream does not support reconfiguration, that stream is skipped entirely (no error, no callback) and processing continues to the next stream (mempalace/_stdio.py:L59-L61).
3. Otherwise the stream is reconfigured to encoding UTF-8 using that stream's error policy (mempalace/_stdio.py:L62-L63).

### Error handling per stream

If reconfiguring a given stream raises any exception, the failure is isolated to that stream and does not stop processing of the remaining streams (mempalace/_stdio.py:L62-L71). On such a failure:

- If an `on_failure` callback was supplied, it is invoked with the stream name and the raised exception (mempalace/_stdio.py:L65-L66).
- If no callback was supplied, a warning line is written to the standard error stream in the exact form `WARNING: Could not reconfigure {name} to UTF-8: {exc}`, where `{name}` is the stream name (one of `stdin`, `stdout`, `stderr`) and `{exc}` is the textual rendering of the exception (mempalace/_stdio.py:L67-L71).
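
Taken together, the platform gate, the fixed stream order, and the per-stream error isolation could look like the sketch below. It is a reconstruction from this spec, not the module's source. The `_platform` parameter is a hypothetical hook added so the Windows branch can be exercised on any platform; it is not part of the documented signature.

```python
import sys
from typing import Callable, Optional


def reconfigure_stdio_utf8_on_windows(
    *,
    stdin_errors: str = "surrogateescape",
    stdout_errors: str = "strict",
    stderr_errors: str = "strict",
    on_failure: Optional[Callable[[str, Exception], None]] = None,
    _platform: Optional[str] = None,  # hypothetical test hook, not in the spec
) -> None:
    platform = sys.platform if _platform is None else _platform
    if platform != "win32":
        return  # non-Windows: no reconfiguration, no side effects

    # Fixed order: stdin, stdout, stderr, each with its caller-chosen policy.
    plan = (
        ("stdin", stdin_errors),
        ("stdout", stdout_errors),
        ("stderr", stderr_errors),
    )
    for name, errors in plan:
        stream = getattr(sys, name, None)
        reconfigure = getattr(stream, "reconfigure", None)
        if reconfigure is None:
            continue  # missing or non-reconfigurable stream: skip silently
        try:
            reconfigure(encoding="utf-8", errors=errors)
        except Exception as exc:  # isolate the failure to this stream
            if on_failure is not None:
                on_failure(name, exc)
            else:
                print(
                    f"WARNING: Could not reconfigure {name} to UTF-8: {exc}",
                    file=sys.stderr,
                )
```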

## Caller-policy contract (documented intent)

The per-stream error policy is intentionally caller-chosen so callers can align behavior across entry points (mempalace/_stdio.py:L11-L22):
- A server emitting only self-controlled JSON-RPC is expected to use `strict` on stdout/stderr so any encode failure surfaces loudly as a bug (mempalace/_stdio.py:L13-L15).
- A CLI or tool that prints verbatim text possibly containing round-tripped surrogate halves is expected to use `replace` on stdout/stderr to avoid crashing mid-print (mempalace/_stdio.py:L16-L18).
- All callers are expected to use `surrogateescape` on stdin so a single malformed byte does not kill the read loop (mempalace/_stdio.py:L19-L22).
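
The three policies named above are standard Python codec error handlers, so their effect can be shown without touching the real streams. In this small illustration the helper names `decode_stdin_bytes` and `encode_for_stdout` are hypothetical.

```python
def decode_stdin_bytes(raw: bytes) -> str:
    """Decode as stdin would under the surrogateescape policy."""
    # Undecodable bytes become lone surrogates instead of raising.
    return raw.decode("utf-8", errors="surrogateescape")


def encode_for_stdout(text: str, errors: str) -> bytes:
    """Encode as stdout/stderr would under the given policy."""
    return text.encode("utf-8", errors=errors)


malformed = b"caf\xe9 \x80"  # not valid UTF-8
text = decode_stdin_bytes(malformed)  # the read survives

# "replace" lets a CLI keep printing: each lone surrogate becomes "?".
printable = encode_for_stdout(text, "replace")

# "strict" makes a server fail loudly on the same text.
try:
    encode_for_stdout(text, "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```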

## Invariants and Edge Cases

- No-op off Windows: calling the function on any non-Windows platform has no effect, however many times it is called (mempalace/_stdio.py:L49-L50).
- Missing or non-reconfigurable streams are silently skipped without raising or invoking the failure callback (mempalace/_stdio.py:L58-L61).
- A reconfiguration failure on one stream never prevents the remaining streams from being attempted (loop continues over all three) (mempalace/_stdio.py:L57-L71).
- The only side effects are: (a) reconfiguring the three standard streams to UTF-8 on Windows, and (b) on a failure with no callback, writing one warning line per failing stream to standard error (mempalace/_stdio.py:L62-L71). The function performs no filesystem, network, or environment access.
42 changes: 42 additions & 0 deletions specs/src/backends/__init__.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
# Spec: `mempalace/backends/__init__.py`

## Purpose

This is the public package facade for MemPalace storage backends (RFC 001). It defines no behavior of its own; it aggregates and re-exports the public surface from sibling modules so that consumers import everything from one stable namespace (`mempalace/backends/__init__.py:L1-L15`). An implementation in any language should expose a single package/namespace that surfaces the symbols listed below, sourced from the corresponding submodules.

## Re-exported Contract Symbols (from the `base` submodule)

The package re-exports the following abstract contract and value types from `base` (`mempalace/backends/__init__.py:L17-L37`):

- Abstract contracts: `BaseBackend` (per-palace factory contract) and `BaseCollection` (per-collection read/write contract) (`mempalace/backends/__init__.py:L5-L6`, `mempalace/backends/__init__.py:L21-L22`).
- Value/identity object: `PalaceRef` — identifies a palace for a backend (`mempalace/backends/__init__.py:L7`, `mempalace/backends/__init__.py:L32`).
- Typed read returns: `QueryResult`, `GetResult` (`mempalace/backends/__init__.py:L8`, `mempalace/backends/__init__.py:L26`, `mempalace/backends/__init__.py:L33`).
- Lexical/health/maintenance result types: `HealthStatus`, `LexicalHit`, `LexicalResult`, `MaintenanceResult` (`mempalace/backends/__init__.py:L27-L30`).
- Error classes: `BackendError`, `BackendClosedError`, `BackendMismatchError`, `CollectionNotInitializedError`, `DimensionMismatchError`, `EmbedderIdentityMismatchError`, `PalaceNotFoundError`, `UnsupportedCapabilityError`, `UnsupportedFilterError`, `UnsupportedMaintenanceKindError` (`mempalace/backends/__init__.py:L18-L36`).

## Re-exported Concrete Backends (from backend submodules)

The package re-exports concrete backend implementations and their collection classes, one pair per storage engine (`mempalace/backends/__init__.py:L38-L41`):

- `ChromaBackend` / `ChromaCollection` — the in-tree default backend (`mempalace/backends/__init__.py:L14`, `mempalace/backends/__init__.py:L38`).
- `PgVectorBackend` / `PgVectorCollection` (`mempalace/backends/__init__.py:L39`).
- `QdrantBackend` / `QdrantCollection` (`mempalace/backends/__init__.py:L40`).
- `SQLiteExactBackend` / `SQLiteExactCollection` (`mempalace/backends/__init__.py:L41`).

## Re-exported Registry Functions (from the `registry` submodule)

The package re-exports the backend registry API (`mempalace/backends/__init__.py:L42-L52`):

- `get_backend`, `get_backend_class` — resolve a backend instance / class (`mempalace/backends/__init__.py:L46-L47`).
- `register`, `unregister`, `reset_backends` — mutate the registry of available backends (`mempalace/backends/__init__.py:L48-L49`, `mempalace/backends/__init__.py:L51`).
- `available_backends` — enumerate registered backends (`mempalace/backends/__init__.py:L43`).
- `detect_backend_for_path`, `detect_backends_for_path` — infer the backend(s) for a given on-disk palace path (`mempalace/backends/__init__.py:L44-L45`).
- `resolve_backend_for_palace` — resolve the backend for a palace reference (`mempalace/backends/__init__.py:L50`).

## Public Surface Invariant

The exported public namespace is explicitly enumerated and is the authoritative list of symbols this package promises to consumers; it contains exactly the 36 names listed in the sections above (19 from `base`, 8 concrete backend and collection classes, and 9 registry functions), and they must all be importable from the package root (`mempalace/backends/__init__.py:L54-L91`). The enumerated public list is a superset of the docstring summary: every concrete backend pair, every registry function, and every contract/error/result type re-exported above appears in it (`mempalace/backends/__init__.py:L54-L91`). Any symbol not present in this list is not part of the package's public contract.
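
One way to check this invariant mechanically is sketched below, assuming the enumerated list is the module's `__all__` (the spec does not name it). The helper `missing_public_names` is hypothetical and works for any importable module.

```python
import importlib


def missing_public_names(module_name: str) -> list[str]:
    """Return names listed in a module's __all__ that it does not actually expose."""
    module = importlib.import_module(module_name)
    declared = getattr(module, "__all__", [])
    return [name for name in declared if not hasattr(module, name)]
```

In an environment where the package is installed, `missing_public_names("mempalace.backends")` would be expected to return an empty list if the invariant holds.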

## Side Effects

Importing the package transitively imports the `base`, `chroma`, `pgvector`, `qdrant`, `sqlite_exact`, and `registry` submodules; any import-time side effects of those modules (e.g. backend registration) occur as a consequence of loading this facade (`mempalace/backends/__init__.py:L17-L52`). This file itself performs no filesystem, network, process, or environment access beyond importing those submodules.