Skip to content

sunilp303/claude-cost-gateway

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Claude LLM Gateway

An intelligent routing and caching layer that acts as a drop-in replacement for the Anthropic Messages API. Point your ANTHROPIC_BASE_URL at it and get automatic model selection, multi-layer caching, quality escalation, and a cost dashboard — transparently.


What it does

Your app  →  POST /v1/messages  →  Complexity Classifier  →  Cache Check  →  Claude API
                                                                                  ↓
                                                                       Quality Validator
                                                                                  ↓
                                                                    Escalate if uncertain
Feature Description
Smart routing Routes each query to the cheapest model that can handle it (Haiku → Sonnet → Opus)
Hybrid classifier 3-layer: heuristic (0 ms) → embedding similarity (5 ms) → Haiku self-classification (300 ms)
Exact cache SHA-256 keyed; in-memory or Redis; free repeat answers
Semantic cache Embedding similarity (cosine ≥ 0.95); catches paraphrased duplicates across teams
Quality escalation Detects uncertainty language and automatically retries with a stronger model
Cost dashboard Live HTML dashboard at /dashboard — charts, KPI cards, recent requests table

Dashboard

Dashboard


Model tiers and costs

Pricing as of May 2026:

Tier Model Input / 1M tokens Output / 1M tokens
Fast claude-haiku-4-5-20251001 $1.00 $5.00
Balanced claude-sonnet-4-6 $3.00 $15.00
Premium claude-opus-4-7 $5.00 $25.00

Estimated savings vs. always using Opus (1,000 queries/day, ~200 input / ~300 output tokens avg):

Scenario Daily cost Monthly
All Opus (baseline) ~$8.50 ~$255
Smart routing (60% Haiku / 30% Sonnet / 10% Opus) ~$3.40 ~$102
+ Semantic cache (40% hit rate) ~$2.05 ~$62
Total savings ~$193/mo (76%)

Architecture

claude-llm-gateway/
├── gateway/
│   ├── main.py          # FastAPI app — endpoints and startup
│   ├── router.py        # Orchestrates classifier → cache → API → escalation
│   ├── classifier.py    # Hybrid 3-layer complexity classifier
│   ├── cache.py         # ExactCache + SemanticCache + CacheLayer
│   ├── validator.py     # Escalation signal detection
│   ├── metrics.py       # RouteDecision, MetricsStore, cost math
│   └── dashboard.py     # HTML dashboard served at /dashboard
├── test_gateway.py      # End-to-end tests (routing, cache, escalation)
├── config.yaml          # All tunable settings
├── requirements.txt
└── .env.example

Quick start

# 1. Install dependencies
pip install -r requirements.txt

# 2. Set your API key
export ANTHROPIC_API_KEY=sk-ant-...

# 3. Start the server
uvicorn gateway.main:app --reload --port 8000

The gateway is now running at http://localhost:8000.


Configuration

Copy .env.example to .env and set ANTHROPIC_API_KEY.
All other settings live in config.yaml:

Key Default Description
use_haiku_classifier true Use Haiku API call as 3rd-layer classifier fallback
max_escalations 2 Max quality-driven tier upgrades per request
cache.redis_url null Redis URL for persistent exact cache (omit for in-memory)
cache.exact_ttl 86400 Exact cache TTL in seconds (24 h)
cache.use_semantic_cache true Enable embedding similarity cache
cache.semantic_threshold 0.95 Cosine similarity required for a semantic cache hit
cache.semantic_ttl 604800 Semantic cache TTL in seconds (7 days)
cache.embedding_model all-MiniLM-L6-v2 Local sentence-transformers model

Endpoints

Method Path Description
POST /v1/messages Drop-in replacement for Anthropic Messages API
GET /dashboard Live HTML dashboard — charts, KPI cards, recent requests
GET /v1/metrics Aggregated cost and routing stats (JSON)
GET /v1/metrics/recent?n=50 Last N routing decisions (JSON)
GET /health Health check

Response headers

Every /v1/messages response includes:

X-Gateway-Model:      claude-haiku-4-5-20251001
X-Gateway-Cache:      none | exact | semantic
X-Gateway-Saved-USD:  0.000041
X-Gateway-Classifier: heuristic | embedding | haiku | n/a
X-Gateway-Query-ID:   a3f9c12b

Dependencies

Package Required Purpose
anthropic Yes Claude API calls
fastapi + uvicorn Yes HTTP server
pyyaml Yes Config file parsing
redis No Persistent exact cache (falls back to in-memory)
sentence-transformers + numpy No Semantic cache (disabled if missing)

The gateway starts and works with only the required packages installed. Optional packages are detected at runtime; missing ones silently disable that feature.


How the classifier works

Query
  │
  ├─ Heuristic (0 ms, free)
  │    High-confidence patterns → route directly (covers ~70% of queries)
  │
  ├─ Embedding similarity (5 ms, free)
  │    Compare to labeled examples using local model
  │    cosine ≥ 0.90 → route directly
  │
  └─ Haiku self-classification (~300 ms, ~$0.0001)
       Ask claude-haiku-4-5 to label as SIMPLE/MODERATE/COMPLEX
       Only reached for genuinely ambiguous queries

Every escalation is logged as a training signal. You can periodically review GET /v1/metrics/recent to tune classifier thresholds.


How escalation works

When a response comes back from a model, the validator checks it before returning to the caller. If it looks uncertain, it retries transparently with the next tier.

Triggers:

Signal Example
Response truncated stop_reason == "max_tokens"
Uncertainty phrase "I'm not sure", "I cannot", "I don't have", "I'm unable"
Excessive hedging >5 occurrences of "perhaps", "possibly", "might", "allegedly"

Flow:

Query → Haiku answers → validator detects uncertainty
      → escalate to Sonnet → validator checks again
      → escalate to Opus   → stop (max_escalations = 2)

The X-Gateway-Model header always reflects the model that gave the final answer, not the one the classifier initially chose. Escalation count and whether it happened are visible in /v1/metrics/recent and on the dashboard.


Testing

Run the included end-to-end test suite (gateway must be running):

python test_gateway.py

Tests covered:

  • Routing to Haiku / Sonnet / Opus based on query complexity
  • Exact cache hit on repeated queries
  • Escalation via max_tokens truncation
  • Escalation via uncertainty phrase detection
  • System prompt passthrough
  • Metrics summary

About

An intelligent gateway for Claude APIs that dynamically routes requests to the most cost-efficient model, caches responses, and escalates based on confidence signals — reducing LLM spend without sacrificing quality.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages