-
-
Notifications
You must be signed in to change notification settings - Fork 0
Architecture Test Framework
Spector's testing strategy goes beyond traditional unit tests. The project implements a comprehensive E2E test framework with a novel LLM-as-Judge system that uses a language model to semantically validate recall results β catching semantic drift and relevance degradation that deterministic assertions cannot detect.
Semantic search systems face a fundamental testing challenge: correct behavior is subjective. When you query "PostgreSQL connection pool exhaustion" and get back a memory about "HikariCP timeout configuration," is that relevant? A traditional assertion checking for exact string matches would miss it. A human would consider it highly relevant.
The LLM Judge bridges this gap by asking a language model to evaluate whether test results are semantically relevant to the query β the same way a human reviewer would, but automated and repeatable.
flowchart LR
subgraph Traditional["Traditional Testing"]
T1["assertThat(results).isNotEmpty()"] --> T2["assertThat(results.get(0).id()).isEqualTo('db-001')"]
end
subgraph LLMJudge["LLM Judge Testing"]
L1["llmAssertRecall(query, results)"] --> L2[".warnIfIrrelevant('database connection')"]
L2 --> L3[".hasGoodRanking()"]
L3 --> L4[".coversTopics('database', 'pool')"]
end
Traditional --> Brittle["β Brittle: breaks when<br/>model changes ranking"]
LLMJudge --> Semantic["β
Semantic: validates<br/>meaning, not IDs"]
The testing infrastructure spans two Maven modules:
| Module | Purpose |
|---|---|
spector-test-support |
Shared LLM Judge framework (module-agnostic) |
spector-memory (test scope) |
E2E test classes, seed data, context management |
graph TB
subgraph TestClasses["E2E Test Classes (11)"]
Ingestion["IngestionE2ETest"]
Scoring["ScoringPipelineE2ETest"]
Decay["DecayAndLtpE2ETest"]
Graph["GraphE2ETest"]
Inhibition["InhibitionE2ETest"]
Persistence["PersistenceE2ETest"]
Prospective["ProspectiveMemoryE2ETest"]
Lifecycle["FullLifecycleE2ETest"]
Negative["NegativeTestingE2ETest"]
Profiles["CognitiveProfileE2ETest"]
Agentic["AgenticConversationE2ETest"]
end
subgraph Infrastructure["Test Infrastructure"]
Abstract["AbstractE2ETest<br/>Base class"]
Context["E2EMemoryContext<br/>Singleton lifecycle"]
SeedData["E2ESeedData<br/>Markdown parser"]
Assertions["E2EAssertions<br/>Helper methods"]
EntityExtractor["TestEntityExtractor<br/>Mock entities"]
end
subgraph JudgeFramework["LLM Judge Framework"]
LlmAssertions["LlmAssertions<br/>Fluent API"]
LlmTestJudge["LlmTestJudge<br/>Core engine"]
JudgePrompts["JudgePromptTemplates<br/>3 prompt types"]
JudgeVerdict["JudgeVerdict<br/>Structured result"]
JudgeConfig["LlmJudgeConfig<br/>Environment config"]
end
subgraph Providers["LLM Providers"]
Ollama["OllamaLlmProvider<br/>Ollama REST API"]
EmbedAPI["TextGenerationProvider<br/>Interface"]
end
subgraph SeedFiles["Seed Data (16 files, 213 memories)"]
S01["01-project-architecture"]
S02["02-database-work"]
S03["..."]
S14["14-agentic-conversation"]
S15["15-elevate-x-conversation"]
S16["16-spector-engine-conversation"]
end
TestClasses --> Abstract
Abstract --> Context
Abstract --> LlmAssertions
Context --> SeedData
SeedData --> SeedFiles
Context --> Ollama
LlmAssertions --> LlmTestJudge
LlmTestJudge --> JudgePrompts
LlmTestJudge --> JudgeVerdict
LlmTestJudge --> EmbedAPI
JudgeConfig --> Context
Ollama -.-> EmbedAPI
All E2E tests share a single SpectorMemory instance through the E2EMemoryContext singleton. This ensures:
- Seed data is loaded once β 213 memories are embedded and ingested at the start of the test suite
- Tests are independent β each test queries the shared memory and validates results
-
Ollama gating β tests are skipped unless
OLLAMA_LIVE=trueis set
sequenceDiagram
participant Runner as JUnit 5
participant Abstract as AbstractE2ETest
participant Context as E2EMemoryContext
participant Ollama as Ollama (embeddings)
participant Memory as SpectorMemory
participant Judge as LlmTestJudge
Runner->>Abstract: @BeforeAll initContext()
Abstract->>Abstract: Check OLLAMA_LIVE=true
Abstract->>Context: E2EMemoryContext.get()
Context->>Ollama: Create EmbeddingProvider
Context->>Memory: Create SpectorMemory
Context->>Context: Parse seed markdown files
Context->>Ollama: Embed all 213 memories
Context->>Memory: Ingest all memories
opt LLM_JUDGE=true
Context->>Judge: Create LlmTestJudge
end
Runner->>Abstract: Execute test methods
Abstract->>Memory: memory.recall(query, options)
Memory-->>Abstract: List of CognitiveResult
opt LLM Judge enabled
Abstract->>Judge: judgeRelevance(query, results, criteria)
Judge->>Ollama: Generate judgment prompt
Ollama-->>Judge: JSON verdict
Judge-->>Abstract: JudgeVerdict
end
Seed memories are authored in Markdown with YAML front matter. The E2ESeedData parser reads all .md files from src/test/resources/e2e/memories/:
---
id: db-001
type: EPISODIC
source: OBSERVED
tags: database, postgresql, connection-pool
valence: -10
---
The PostgreSQL connection pool kept exhausting under load.
Increased HikariCP maximum pool size from 10 to 25 and added
connection timeout of 30 seconds. The root cause was a missing
connection release in the batch processing loop.Supported fields:
| Field | Required | Values | Default |
|---|---|---|---|
id |
β | Unique string identifier | β |
type |
β |
EPISODIC, SEMANTIC, PROCEDURAL, PROSPECTIVE
|
β |
source |
β |
OBSERVED, REFLECTED, IMAGINED
|
OBSERVED |
tags |
β | Comma-separated tag list | [] |
valence |
β | Integer from -20 to 20 | 0 |
The test suite uses 16 seed data files organized by domain:
| # | File | Memories | Domain |
|---|---|---|---|
| 01 | project-architecture |
10 | Java project structure, Spring Boot, microservices |
| 02 | database-work |
13 | PostgreSQL, HikariCP, migrations, query optimization |
| 03 | deployment-cicd |
10 | Docker, Kubernetes, GitHub Actions, CI/CD |
| 04 | authentication-security |
8 | OAuth2, JWT, security incidents |
| 05 | known-facts |
14 | Factual Java knowledge (patterns, frameworks) |
| 06 | procedures |
6 | Step-by-step runbooks |
| 07 | entity-relationships |
8 | People, teams, projects with entity links |
| 08 | surprise-and-lateral |
10 | Unexpected discoveries, lateral connections |
| 09 | preferences-and-context |
9 | User preferences, tool choices |
| 10 | edge-cases |
20 | Adversarial: duplicates, contradictions, multilingual |
| 11 | temporal-sequences |
12 | Time-ordered Redis migration chain |
| 12 | ambiguous-queries |
11 | Multi-meaning terms ("pool", "spring", "node") |
| 13 | negative-evidence |
7 | Anti-patterns, failed approaches |
| 14 | agentic-conversation |
17 | Real Promptly app development sessions |
| 15 | elevate-x-conversation |
14 | Real Elevate-X fitness app sessions |
| 16 | spector-engine-conversation |
26 | Real Spector engine debugging sessions |
Total: 213 memories across all categories.
| Class | Tests | Focus |
|---|---|---|
IngestionE2ETest |
4 | Memory ingestion, embedding, storage verification |
ScoringPipelineE2ETest |
21 | 6-phase scoring, valence filtering, deduplication |
DecayAndLtpE2ETest |
5 | Temporal decay, long-term potentiation |
GraphE2ETest |
7 | Hebbian graph, entity-aware recall, co-activation |
InhibitionE2ETest |
7 | Suppression, habituation, retrieval-induced forgetting |
PersistenceE2ETest |
5 | WAL, disk persistence, crash recovery |
ProspectiveMemoryE2ETest |
3 | Future intents, deadline tracking |
FullLifecycleE2ETest |
23 | 13-step lifecycle from ingestion to reflection |
NegativeTestingE2ETest |
18 | Adversarial: gibberish, wrong domain, empty results |
CognitiveProfileE2ETest |
12 | Profile auto-detection, scoring weight verification |
AgenticConversationE2ETest |
11 | Agentic conversation recall, cross-domain isolation |
The LLM Judge follows a simple pipeline:
flowchart TB
A["Test calls<br/>llmAssertRecall(query, results)"] --> B["LlmAssertions<br/>builds fluent chain"]
B --> C{".warnIfIrrelevant()<br/>.isRelevantTo()<br/>.hasGoodRanking()<br/>.coversTopics()"}
C --> D["JudgePromptTemplates<br/>builds structured prompt"]
D --> E["LlmTestJudge<br/>calls LLM with low temperature"]
E --> F["Ollama<br/>generates JSON verdict"]
F --> G["Parse JSON<br/>extract relevant, confidence, reasoning"]
G --> H{Verdict?}
H -->|"RELEVANT β
"| I["Log success"]
H -->|"NOT_RELEVANT β"| J{"Assertion type?"}
J -->|isRelevantTo| K["β Fail test<br/>with reasoning"]
J -->|warnIfIrrelevant| L["β οΈ Log warning<br/>test continues"]
Each judgment type uses a carefully engineered prompt that:
- Sets the role: "You are a test validation judge"
- Provides context: The query, relevance criteria, and truncated results
- Defines leniency: "If at least 30% of results are relevant, judge as relevant"
- Forces structured output: "Respond ONLY with this exact JSON format"
- Limits scope: Max 10 results, 200 chars each to fit token budgets
Example prompt for relevance judgment:
You are a test validation judge. Your job is to determine whether
a set of memory recall results is relevant to a given query.
QUERY: "PostgreSQL connection pool exhaustion timeout"
RELEVANCE CRITERIA: Results should contain memories about database
connection issues
RESULTS:
- The PostgreSQL connection pool kept exhausting under load...
- Increased HikariCP maximum pool size from 10 to 25...
- Switched from Flyway to Liquibase for database migrations...
Respond ONLY with this exact JSON format, no other text:
{"relevant": true, "confidence": 0.85, "reasoning": "Brief explanation"}
The LlmTestJudge handles common LLM output quirks:
-
Thinking tags: Strips
<think>...</think>blocks (qwen3 models) -
Markdown fences: Extracts JSON from
```jsonblocks - Extra text: Uses regex to find the JSON object anywhere in the response
- Retry logic: Up to 2 retries on parse failure before giving up gracefully
llmAssertRecall(query, results)
.isRelevantTo("Results must contain security-related memories");
// β Fails test if LLM judges NOT_RELEVANTUse when: The semantic relationship is a hard business requirement. Example: security queries must never return cooking recipes.
llmAssertRecall(query, results)
.warnIfIrrelevant("Results should relate to database connection pooling");
// β Logs: β LLM Judge [llama3.1]: NOT_RELEVANT (confidence=0.75) β ...Use when: Semantic quality is important but model-dependent ranking makes hard assertions flaky.
llmAssertRecall(query, results)
.hasGoodRanking();
// β Warns if #5 is clearly more relevant than #1Use when: Verifying that the scoring pipeline produces sensible ordering.
llmAssertRecall(query, results)
.coversTopics("database", "connection pool", "timeout");
// β Warns if expected topics are missing from resultsUse when: Ensuring recall results span the expected knowledge domains.
All assertions return this for fluent chaining:
if (isLlmJudgeEnabled()) {
llmAssertRecall("AI safety guardrails workout generation", results)
.isRelevantTo("Results must describe fitness safety validation")
.hasGoodRanking()
.coversTopics("safety", "workout", "calorie");
}| Variable | Default | Description |
|---|---|---|
OLLAMA_LIVE |
false |
Gate for all E2E tests (embedding + recall) |
LLM_JUDGE |
false |
Enable LLM-based semantic validation |
LLM_JUDGE_MODEL |
llama3.1 |
Ollama model for judging |
LLM_JUDGE_URL |
http://localhost:11434 |
Ollama server URL |
LLM_JUDGE_CONFIDENCE |
0.6 |
Minimum confidence threshold |
# Run all E2E tests with Ollama embeddings (no LLM judge)
mvn test -pl spector-memory -DOLLAMA_LIVE=true
# Run with LLM judge enabled
mvn test -pl spector-memory -DOLLAMA_LIVE=true -DLLM_JUDGE=true
# Run specific test class
mvn test -pl spector-memory -DOLLAMA_LIVE=true \
-Dtest=AgenticConversationE2ETest
# Run with custom model
mvn test -pl spector-memory -DOLLAMA_LIVE=true \
-DLLM_JUDGE=true -DLLM_JUDGE_MODEL=qwen3:0.6bLLM models are non-deterministic β the same prompt can produce different verdicts across runs, especially with smaller models. Making all LLM assertions hard-fail by default would create flaky tests. Instead:
-
warnIfIrrelevantis the recommended default β it provides semantic signal in test logs without blocking CI/CD -
isRelevantTois reserved for invariants that should never be violated (e.g., cross-domain isolation) - The
LLM_JUDGE_FAIL_ON_REJECTflag exists for strict validation environments
- Local-first: No API keys, no network dependency, no cost
- Reproducible: Pin to a specific model version for consistent results
- Fast: Small models (qwen3:0.6b) produce judgments in < 500ms
- CI-friendly: Ollama runs as a sidecar container in CI pipelines
spector-test-support is a standalone module rather than test-scoped code inside spector-memory because:
- Reusability: Any Spector module can depend on it for LLM-based testing
-
Clean dependencies: The judge framework depends on
spector-embed-apiandspector-embed-ollama, not onspector-memory - Independent versioning: Test infrastructure evolves on its own schedule
| Metric | Value |
|---|---|
| Seed memory files | 16 |
| Total seed memories | 213 |
| E2E test classes | 11 |
| Total E2E tests | 116 |
| LLM assertions | 19 |
Hard assertions (isRelevantTo) |
4 |
Soft assertions (warnIfIrrelevant) |
12 |
Topic coverage (coversTopics) |
2 |
Ranking checks (hasGoodRanking) |
1 |
- Home
- Getting Started
-
Cognitive Memory
- Overview
- Getting Started
- Use Cases & Configuration
- API Reference
- Architecture
- The 6-Phase Scoring Pipeline
- Cognitive Profiles
- Salience & Importance
-
Biological Systems
- Overview
- Cortex β Tier Stores
- Hippocampus β Sleep Consolidation
- Synapse β Tags & Scoring
- Dopamine β Surprise Detection
- Amygdala β Emotional Valence
- 3-Layer Cognitive Graph
- Habituation β Anti-Filter Bubble
- Inhibition β Suppression
- Interference β Deduplication
- Prospective β Future Intents
- Metamemory β Self-Reflection
- Sync β Persistence & Replication
- Performance & Internals
- Cognitive Evaluation
-
Architecture
- System Overview
- Core Concepts
- Ingestion Pipeline
- RAG Pipeline
- MCP Integration
- Distributed Mode
- Event Notifications
- Namespace Sharding
- GPU Acceleration
- Performance Tuning
- Test Framework & LLM Judge
- JDK API Status
- Security & Data
- Deep Dives
- Cortex Dashboard
-
Community
- Contributing
- FAQ
- Roadmap
- π¬ Labs