An intelligent routing and caching layer that acts as a drop-in replacement for the Anthropic Messages API.
Point your ANTHROPIC_BASE_URL at it and get automatic model selection, multi-layer caching, quality escalation, and a cost dashboard — transparently.
Your app → POST /v1/messages → Complexity Classifier → Cache Check → Claude API
↓
Quality Validator
↓
Escalate if uncertain
| Feature | Description |
|---|---|
| Smart routing | Routes each query to the cheapest model that can handle it (Haiku → Sonnet → Opus) |
| Hybrid classifier | 3-layer: heuristic (0 ms) → embedding similarity (5 ms) → Haiku self-classification (300 ms) |
| Exact cache | SHA-256 keyed; in-memory or Redis; free repeat answers |
| Semantic cache | Embedding similarity (cosine ≥ 0.95); catches paraphrased duplicates across teams |
| Quality escalation | Detects uncertainty language and automatically retries with a stronger model |
| Cost dashboard | Live HTML dashboard at /dashboard — charts, KPI cards, recent requests table |
Pricing as of May 2026:
| Tier | Model | Input / 1M tokens | Output / 1M tokens |
|---|---|---|---|
| Fast | claude-haiku-4-5-20251001 |
$1.00 | $5.00 |
| Balanced | claude-sonnet-4-6 |
$3.00 | $15.00 |
| Premium | claude-opus-4-7 |
$5.00 | $25.00 |
Estimated savings vs. always using Opus (1,000 queries/day, ~200 input / ~300 output tokens avg):
| Scenario | Daily cost | Monthly |
|---|---|---|
| All Opus (baseline) | ~$8.50 | ~$255 |
| Smart routing (60% Haiku / 30% Sonnet / 10% Opus) | ~$3.40 | ~$102 |
| + Semantic cache (40% hit rate) | ~$2.05 | ~$62 |
| Total savings | ~$193/mo (76%) |
claude-llm-gateway/
├── gateway/
│ ├── main.py # FastAPI app — endpoints and startup
│ ├── router.py # Orchestrates classifier → cache → API → escalation
│ ├── classifier.py # Hybrid 3-layer complexity classifier
│ ├── cache.py # ExactCache + SemanticCache + CacheLayer
│ ├── validator.py # Escalation signal detection
│ ├── metrics.py # RouteDecision, MetricsStore, cost math
│ └── dashboard.py # HTML dashboard served at /dashboard
├── test_gateway.py # End-to-end tests (routing, cache, escalation)
├── config.yaml # All tunable settings
├── requirements.txt
└── .env.example
# 1. Install dependencies
pip install -r requirements.txt
# 2. Set your API key
export ANTHROPIC_API_KEY=sk-ant-...
# 3. Start the server
uvicorn gateway.main:app --reload --port 8000The gateway is now running at http://localhost:8000.
Copy .env.example to .env and set ANTHROPIC_API_KEY.
All other settings live in config.yaml:
| Key | Default | Description |
|---|---|---|
use_haiku_classifier |
true |
Use Haiku API call as 3rd-layer classifier fallback |
max_escalations |
2 |
Max quality-driven tier upgrades per request |
cache.redis_url |
null |
Redis URL for persistent exact cache (omit for in-memory) |
cache.exact_ttl |
86400 |
Exact cache TTL in seconds (24 h) |
cache.use_semantic_cache |
true |
Enable embedding similarity cache |
cache.semantic_threshold |
0.95 |
Cosine similarity required for a semantic cache hit |
cache.semantic_ttl |
604800 |
Semantic cache TTL in seconds (7 days) |
cache.embedding_model |
all-MiniLM-L6-v2 |
Local sentence-transformers model |
| Method | Path | Description |
|---|---|---|
POST |
/v1/messages |
Drop-in replacement for Anthropic Messages API |
GET |
/dashboard |
Live HTML dashboard — charts, KPI cards, recent requests |
GET |
/v1/metrics |
Aggregated cost and routing stats (JSON) |
GET |
/v1/metrics/recent?n=50 |
Last N routing decisions (JSON) |
GET |
/health |
Health check |
Every /v1/messages response includes:
X-Gateway-Model: claude-haiku-4-5-20251001
X-Gateway-Cache: none | exact | semantic
X-Gateway-Saved-USD: 0.000041
X-Gateway-Classifier: heuristic | embedding | haiku | n/a
X-Gateway-Query-ID: a3f9c12b
| Package | Required | Purpose |
|---|---|---|
anthropic |
Yes | Claude API calls |
fastapi + uvicorn |
Yes | HTTP server |
pyyaml |
Yes | Config file parsing |
redis |
No | Persistent exact cache (falls back to in-memory) |
sentence-transformers + numpy |
No | Semantic cache (disabled if missing) |
The gateway starts and works with only the required packages installed. Optional packages are detected at runtime; missing ones silently disable that feature.
Query
│
├─ Heuristic (0 ms, free)
│ High-confidence patterns → route directly (covers ~70% of queries)
│
├─ Embedding similarity (5 ms, free)
│ Compare to labeled examples using local model
│ cosine ≥ 0.90 → route directly
│
└─ Haiku self-classification (~300 ms, ~$0.0001)
Ask claude-haiku-4-5 to label as SIMPLE/MODERATE/COMPLEX
Only reached for genuinely ambiguous queries
Every escalation is logged as a training signal. You can periodically review GET /v1/metrics/recent to tune classifier thresholds.
When a response comes back from a model, the validator checks it before returning to the caller. If it looks uncertain, it retries transparently with the next tier.
Triggers:
| Signal | Example |
|---|---|
| Response truncated | stop_reason == "max_tokens" |
| Uncertainty phrase | "I'm not sure", "I cannot", "I don't have", "I'm unable" |
| Excessive hedging | >5 occurrences of "perhaps", "possibly", "might", "allegedly" |
Flow:
Query → Haiku answers → validator detects uncertainty
→ escalate to Sonnet → validator checks again
→ escalate to Opus → stop (max_escalations = 2)
The X-Gateway-Model header always reflects the model that gave the final answer, not the one the classifier initially chose. Escalation count and whether it happened are visible in /v1/metrics/recent and on the dashboard.
Run the included end-to-end test suite (gateway must be running):
python test_gateway.pyTests covered:
- Routing to Haiku / Sonnet / Opus based on query complexity
- Exact cache hit on repeated queries
- Escalation via
max_tokenstruncation - Escalation via uncertainty phrase detection
- System prompt passthrough
- Metrics summary
