Claude LLM Gateway

An intelligent routing and caching layer that acts as a drop-in replacement for the Anthropic Messages API. Point your ANTHROPIC_BASE_URL at it and get automatic model selection, multi-layer caching, quality escalation, and a cost dashboard — transparently.

What it does

Your app  →  POST /v1/messages  →  Complexity Classifier  →  Cache Check  →  Claude API
                                                                                  ↓
                                                                       Quality Validator
                                                                                  ↓
                                                                    Escalate if uncertain

Feature	Description
Smart routing	Routes each query to the cheapest model that can handle it (Haiku → Sonnet → Opus)
Hybrid classifier	3-layer: heuristic (0 ms) → embedding similarity (5 ms) → Haiku self-classification (300 ms)
Exact cache	SHA-256 keyed; in-memory or Redis; free repeat answers
Semantic cache	Embedding similarity (cosine ≥ 0.95); catches paraphrased duplicates across teams
Quality escalation	Detects uncertainty language and automatically retries with a stronger model
Cost dashboard	Live HTML dashboard at `/dashboard` — charts, KPI cards, recent requests table

Dashboard

Model tiers and costs

Pricing as of May 2026:

Tier	Model	Input / 1M tokens	Output / 1M tokens
Fast	`claude-haiku-4-5-20251001`	$1.00	$5.00
Balanced	`claude-sonnet-4-6`	$3.00	$15.00
Premium	`claude-opus-4-7`	$5.00	$25.00

Estimated savings vs. always using Opus (1,000 queries/day, ~200 input / ~300 output tokens avg):

Scenario	Daily cost	Monthly
All Opus (baseline)	~$8.50	~$255
Smart routing (60% Haiku / 30% Sonnet / 10% Opus)	~$3.40	~$102
+ Semantic cache (40% hit rate)	~$2.05	~$62
Total savings		~$193/mo (76%)

Architecture

claude-llm-gateway/
├── gateway/
│   ├── main.py          # FastAPI app — endpoints and startup
│   ├── router.py        # Orchestrates classifier → cache → API → escalation
│   ├── classifier.py    # Hybrid 3-layer complexity classifier
│   ├── cache.py         # ExactCache + SemanticCache + CacheLayer
│   ├── validator.py     # Escalation signal detection
│   ├── metrics.py       # RouteDecision, MetricsStore, cost math
│   └── dashboard.py     # HTML dashboard served at /dashboard
├── test_gateway.py      # End-to-end tests (routing, cache, escalation)
├── config.yaml          # All tunable settings
├── requirements.txt
└── .env.example

Quick start

# 1. Install dependencies
pip install -r requirements.txt

# 2. Set your API key
export ANTHROPIC_API_KEY=sk-ant-...

# 3. Start the server
uvicorn gateway.main:app --reload --port 8000

The gateway is now running at http://localhost:8000.

Configuration

Copy .env.example to .env and set ANTHROPIC_API_KEY.
All other settings live in config.yaml:

Key	Default	Description
`use_haiku_classifier`	`true`	Use Haiku API call as 3rd-layer classifier fallback
`max_escalations`	`2`	Max quality-driven tier upgrades per request
`cache.redis_url`	`null`	Redis URL for persistent exact cache (omit for in-memory)
`cache.exact_ttl`	`86400`	Exact cache TTL in seconds (24 h)
`cache.use_semantic_cache`	`true`	Enable embedding similarity cache
`cache.semantic_threshold`	`0.95`	Cosine similarity required for a semantic cache hit
`cache.semantic_ttl`	`604800`	Semantic cache TTL in seconds (7 days)
`cache.embedding_model`	`all-MiniLM-L6-v2`	Local sentence-transformers model

Endpoints

Method	Path	Description
`POST`	`/v1/messages`	Drop-in replacement for Anthropic Messages API
`GET`	`/dashboard`	Live HTML dashboard — charts, KPI cards, recent requests
`GET`	`/v1/metrics`	Aggregated cost and routing stats (JSON)
`GET`	`/v1/metrics/recent?n=50`	Last N routing decisions (JSON)
`GET`	`/health`	Health check

Response headers

Every /v1/messages response includes:

X-Gateway-Model:      claude-haiku-4-5-20251001
X-Gateway-Cache:      none | exact | semantic
X-Gateway-Saved-USD:  0.000041
X-Gateway-Classifier: heuristic | embedding | haiku | n/a
X-Gateway-Query-ID:   a3f9c12b

Dependencies

Package	Required	Purpose
`anthropic`	Yes	Claude API calls
`fastapi` + `uvicorn`	Yes	HTTP server
`pyyaml`	Yes	Config file parsing
`redis`	No	Persistent exact cache (falls back to in-memory)
`sentence-transformers` + `numpy`	No	Semantic cache (disabled if missing)

The gateway starts and works with only the required packages installed. Optional packages are detected at runtime; missing ones silently disable that feature.

How the classifier works

Query
  │
  ├─ Heuristic (0 ms, free)
  │    High-confidence patterns → route directly (covers ~70% of queries)
  │
  ├─ Embedding similarity (5 ms, free)
  │    Compare to labeled examples using local model
  │    cosine ≥ 0.90 → route directly
  │
  └─ Haiku self-classification (~300 ms, ~$0.0001)
       Ask claude-haiku-4-5 to label as SIMPLE/MODERATE/COMPLEX
       Only reached for genuinely ambiguous queries

Every escalation is logged as a training signal. You can periodically review GET /v1/metrics/recent to tune classifier thresholds.

How escalation works

When a response comes back from a model, the validator checks it before returning to the caller. If it looks uncertain, it retries transparently with the next tier.

Triggers:

Signal	Example
Response truncated	`stop_reason == "max_tokens"`
Uncertainty phrase	`"I'm not sure"`, `"I cannot"`, `"I don't have"`, `"I'm unable"`
Excessive hedging	>5 occurrences of `"perhaps"`, `"possibly"`, `"might"`, `"allegedly"`

Flow:

Query → Haiku answers → validator detects uncertainty
      → escalate to Sonnet → validator checks again
      → escalate to Opus   → stop (max_escalations = 2)

The X-Gateway-Model header always reflects the model that gave the final answer, not the one the classifier initially chose. Escalation count and whether it happened are visible in /v1/metrics/recent and on the dashboard.

Testing

Run the included end-to-end test suite (gateway must be running):

python test_gateway.py

Tests covered:

Routing to Haiku / Sonnet / Opus based on query complexity
Exact cache hit on repeated queries
Escalation via max_tokens truncation
Escalation via uncertainty phrase detection
System prompt passthrough
Metrics summary

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
assets		assets
gateway		gateway
.env.example		.env.example
.gitignore		.gitignore
HOW_TO_USE.md		HOW_TO_USE.md
README.md		README.md
config.yaml		config.yaml
requirements.txt		requirements.txt
test_gateway.py		test_gateway.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Claude LLM Gateway

What it does

Dashboard

Model tiers and costs

Architecture

Quick start

Configuration

Endpoints

Response headers

Dependencies

How the classifier works

How escalation works

Testing

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Claude LLM Gateway

What it does

Dashboard

Model tiers and costs

Architecture

Quick start

Configuration

Endpoints

Response headers

Dependencies

How the classifier works

How escalation works

Testing

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages