Headroom Gateway

Centralized context compression service for enterprise LLM usage. Reduces token costs across all users automatically — no client-side installation, no user changes.

Powered by headroom (Apache 2.0, 28k+ stars).

The Problem

LLM coding agents (Claude Code, Codex, Copilot) generate massive context through tool calls — file reads, build logs, API responses, search results. A typical coding session reaches 150K+ tokens, of which 80% is tool output. Every request sends the full conversation history. Token costs scale linearly.

Most of that tool output is old, repetitive, and compressible without affecting the model's reasoning. But compression must happen transparently — users shouldn't have to think about it.

How It Works

The headroom compression service integrates with the MaaS platform via a IPP plugin. Users point at MaaS as they do today — compression is automatic.

Client (Claude Code / Codex / Cursor)
  │
  ▼
MaaS Gateway (Envoy + Istio)
  ├─ Kuadrant: API key validation, user identification
  ├─ IPP ext-proc (payload-processing):
  │   ├─ model-extractor: model → X-Gateway-Model-Name
  │   ├─ metering-check: balance check, user tracking
  │   ├─ model-provider-resolver: resolve ExternalModel → provider
  │   ├─ headroom plugin ──── POST /v1/compress ──▶ Headroom Service
  │   │                    ◀── compressed messages ─┘  (this repo)
  │   ├─ api-translation: passthrough
  │   └─ apikey-injection: inject provider key
  │
  ▼
Provider (api.anthropic.com / api.openai.com)

How It Fits Together

This repo is one component in a larger platform. Here's what each piece does and who owns it:

Component	Repo	What It Does	Relationship to Headroom
MaaS Gateway	Envoy + Istio (infra)	Routes requests, TLS termination	Headroom service sits behind it as an internal ClusterIP service
Kuadrant	kuadrant.io (operator)	API key validation, user identity, rate limiting	Identifies the user → `x-maas-username` header flows to headroom for per-user stats
IPP / payload-processing	ai-gateway-payload-processing	Plugin chain for request/response processing (ext-proc)	The headroom plugin lives here — it calls our service at `POST /v1/compress`
Headroom plugin (Go)	yossiovadia/ai-gateway-payload-processing branch `feat/headroom-on-metering`	Sends messages to headroom service, replaces with compressed versions	3 files, 15 tests. Reads username from CycleState (set by metering plugin)
Headroom service (Python)	This repo	Compression engine — wraps headroom's `compress()` with stats, persistence, pricing	Standalone deployment. Only dependency: metering Postgres for per-model pricing
Metering service	noyitz/ai-gateway-metering-service	Token usage tracking, per-user billing, model pricing DB	Headroom reads `model_pricing` table for cost calculations. Compression tab embedded via iframe
MaaS controller	models-as-a-service	Generates AuthPolicy, ExternalModel CRDs, manages API keys	Configures the auth pipeline that identifies users. Not running on dogfood cluster — patched manually
IPP config (ConfigMap)	Deployed on cluster	Defines the plugin chain order	Headroom must be listed after `model-provider-resolver` and before `api-translation`

Build independence: This repo (headroom service + dashboard) deploys independently — no IPP rebuild needed. The Go plugin is part of the payload-processing image build. The only shared contract is POST /v1/compress with {messages, model} → {messages, tokens_before, tokens_after, tokens_saved, compression_ratio}.

Zero user changes. Same MaaS URL, same API key:

export ANTHROPIC_BASE_URL=https://maas.company.com/llm/ext-opus
export ANTHROPIC_API_KEY=<MaaS-key>
claude --model claude-opus-4-8

Compression Results

From production deployment on OpenShift:

Content Type	Savings	Engine
Kubernetes API responses (JSON)	60-74%	SmartCrusher
Build/test logs	65-80%	LogCompressor / SearchCompressor
Search/grep results	70-80%	SearchCompressor
Free text / documentation	10-35%	Kompress ML
Git diffs	40-60%	DiffCompressor
Code editing sessions (mixed)	5-15%	Kompress ML (most content is excluded — see below)

Compression Engines

Headroom's ContentRouter auto-detects content type and routes to the best compressor:

Engine	Active	What It Does	Typical Savings
SmartCrusher	YES	Compresses JSON arrays — deduplicates structure, keeps important items and anomalies. Rust-backed via PyO3.	60-74%
Kompress ML	YES	Custom ML model (ModernBERT tokenizer + ONNX) trained on AI agent traces. Scores token importance, removes low-value tokens.	10-35%
SearchCompressor	YES	Compresses grep/ripgrep results — keeps matching lines, deduplicates context.	70-80%
LogCompressor	YES	Compresses build logs and test output — keeps errors/failures, deduplicates repetitive lines.	65-80%
DiffCompressor	YES	Compresses git diffs — keeps hunks, deduplicates similar changes.	40-60%
HTMLExtractor	YES	Extracts meaningful text from HTML, strips tags and boilerplate.	varies
CodeAwareCompressor	NO	AST-based code compression (Python, JS, Go, Rust, Java, C++). Disabled by default upstream — headroom recommends code graph MCP tools instead.	30-50%
ImageCompressor	N/A	ML router for image size reduction. Not applicable in our text-only pipeline.	40-90%

When Each Engine Fires

The ContentRouter selects an engine based on the shape of the content, not its meaning:

Engine	Triggers When	Examples
SmartCrusher	Content starts with `[` (JSON array) or is a JSON object containing arrays	`kubectl get pods -o json`, `pip list --format=json`, any API returning a list of items
LogCompressor	Lines start with timestamp patterns (`2026-06-24T...`, `[INFO]`, `[ERROR]`) — structured, repetitive	`docker build` output, `pytest -v` with PASSED/FAILED lines, application logs
SearchCompressor	Content looks like grep/ripgrep results with file paths, line numbers, and match context	`grep -rn "pattern" .`, `rg "pattern"`, search result outputs
DiffCompressor	Content matches unified diff format (`---`, `+++`, `@@`, `+`/`-` prefixed lines)	`git diff`, `git show`, patch files
HTMLExtractor	Content contains HTML tags (`<html>`, `<div>`, `<p>`)	Web scraping results, API responses with HTML bodies
Kompress ML	Fallback — fires when nothing above matches. Any free-form text.	Documentation, meeting notes, prose, descriptions
None (skipped)	Content is from an excluded tool (Read, Write, Edit), is a user/system message, is an error with stack traces, or is below 500 tokens	File reads, code edits, short responses, error tracebacks

Content Protection

Not everything gets compressed. Headroom protects content that must remain exact:

Content	Action	Reason
User messages	Protected	Model needs exact user input
System prompts	Protected	Cache-hot instruction bytes
Read tool outputs (fresh)	Excluded	Exact file content needed for code editing
Write/Edit tool outputs	Excluded	Mutation records must be exact to prevent duplicate edits
Error outputs	Protected	Tracebacks and stack traces preserved verbatim for debugging
Recent tool outputs	Protected	Last N turns kept verbatim
Stale Read outputs	Compressed	File was edited after reading — content is factually wrong
Superseded Read outputs	Compressed	File was re-read later — content is redundant

Why Code Editing Sessions Show Lower Savings

Coding sessions are dominated by Read/Edit tool outputs which headroom deliberately excludes. Without retrieval support (CCR), compressing a file read would mean the LLM works from a summary instead of exact code — risking wrong edits. Headroom's ReadLifecycle catches stale reads (67% of all reads) and superseded reads (12%), but fresh reads (20%) stay verbatim. DevOps/debugging sessions with logs, JSON, and command output see much higher savings (40-70%).

Features

Per-user session cache — avoids re-compressing content seen in earlier requests (same user, same session)
Per-model pricing — real costs from metering Postgres, not hardcoded estimates
Durable savings tracking — proxy_savings.json on PVC, lifetime stats survive restarts
CCR (Compress-Cache-Retrieve) — originals stored on PVC, LLM can retrieve via headroom_retrieve tool
Image compression — optional, reduces image token cost by 40-90%
GPU acceleration — onnxruntime-gpu auto-detects NVIDIA GPUs, falls back to CPU
Built-in dashboard — /dashboard with compression stats, cache hits, agent usage, savings breakdown
Playground + CCR demo — interactive compression demo with retrieval visualization
Prometheus metrics — /metrics endpoint

Deployment

Prerequisites (not ours — must exist first)

These are deployed by the MaaS platform team before headroom:

#	Component	Purpose	How to verify
1	OpenShift cluster	Infrastructure	`oc whoami` succeeds
2	Istio / Envoy gateway	Traffic routing, TLS	Gateway pods running
3	Kuadrant	API key validation, auth	`oc get authpolicy -A` returns policies
4	MaaS controller + maas-api	API key management, user identity	`oc get pods -n opendatahub` shows maas-api
5	IPP / payload-processing	Plugin chain (metering, api-translation, apikey-injection)	`oc get deployment payload-processing -n openshift-ingress`
6	Metering service + Postgres	Usage tracking, model pricing	`oc get statefulset metering-postgresql -n openshift-ingress`

Headroom does NOT require any IPP plugin or ConfigMap changes. It sits before MaaS as a transparent proxy. The IPP pipeline runs after MaaS as usual.

Deploy headroom proxy

# 1. Clone this repo
git clone https://github.com/yossiovadia/headroom-gateway.git
cd headroom-gateway

# 2. Deploy (one command)
./scripts/deploy-proxy.sh --hf-token hf_xxx

# Optional: custom MaaS URL
./scripts/deploy-proxy.sh --maas-url https://maas.company.com/llm/ext-opus

The script:

Builds the headroom proxy image on-cluster (CUDA base + headroom-ai + models)
Creates a PVC for persistent stats and CCR store
Deploys the proxy with correct env vars
Creates a Service + Route
Runs a smoke test

What gets deployed

headroom-proxy (Deployment)
  ├── image: headroom proxy with GPU + image support
  ├── port: 8787
  ├── env: ANTHROPIC_TARGET_API_URL → MaaS gateway
  └── volume: PVC at /opt/app/.headroom (savings + CCR store)

headroom-proxy (Service)
  └── ClusterIP port 8787

headroom-proxy (Route)
  └── TLS edge → https://headroom-proxy-<namespace>.apps.<cluster>

No dashboard pod needed — the proxy serves /dashboard built-in.

Point users at headroom

# Users set ONE env var (instead of pointing at MaaS directly)
export ANTHROPIC_BASE_URL="https://headroom-proxy-<namespace>.apps.<cluster>"
export ANTHROPIC_API_KEY="<MaaS-API-key>"       # same key as before
claude --model claude-opus-4-8                    # works normally

GPU Support

GPU is auto-detected at runtime. No flags needed.

Resource	CPU-only	With GPU
Kompress ML latency	~3s	<100ms
Concurrent users	~50	~200+
GPU required	—	1x NVIDIA L4/T4
Memory	4-8Gi	4-8Gi

Emergency Controls

Rollback — users change one env var to bypass headroom entirely:

export ANTHROPIC_BASE_URL="https://maas.company.com/llm/ext-opus"  # direct to MaaS

No deployment changes needed. Compression stops, everything else works.

Scale to zero — stop headroom without affecting MaaS:

oc scale deployment/headroom-proxy --replicas=0

API Endpoints

Endpoint	Method	Purpose
`/v1/compress`	POST	Compress messages (IPP plugin calls this)
`/stats`	GET	Stats for dashboard (aggregates, per-user, recent)
`/stats/insights`	GET	Engine breakdown, by-model, hourly trends
`/stats-history`	GET	Lifetime stats with history
`/pricing`	GET	Per-model pricing from metering Postgres
`/sessions`	GET	Active user sessions and cache hit rates
`/readyz`	GET	Readiness probe (checks SQLite)
`/health`	GET	Liveness probe
`/metrics`	GET	Prometheus counters

Testing

37 tests run against the live deployed service:

# Fast tests (compression, stats, health, dashboard)
HEADROOM_TEST_URL=https://headroom-service-... pytest tests/ -v

# Including persistence tests (restarts pod, ~60s)
HEADROOM_TEST_URL=https://headroom-service-... pytest tests/ -v --run-slow

# Benchmark compression latency
./scripts/benchmark-compression.sh --requests 10

Project Structure

headroom-gateway/
├── service/
│   ├── headroom_service.py     # FastAPI compression service
│   └── Dockerfile              # CUDA + headroom-ai + onnxruntime-gpu
├── dashboard/
│   └── index.html              # Dashboard + Playground (stats, per-user, pipeline viz)
├── scripts/
│   ├── deploy-headroom.sh      # Idempotent OpenShift deployment
│   ├── benchmark-compression.sh # Latency benchmark
│   └── ab-cache-test.py        # A/B cache hit comparison
├── tests/
│   ├── test_compress.py        # 14 compression tests
│   ├── test_stats.py           # 9 stats tests
│   ├── test_dashboard.py       # 7 dashboard contract tests
│   ├── test_health.py          # 5 health/metrics tests
│   └── test_persistence.py     # 2 persistence tests (pod restart)
├── docs/
│   └── architecture.md         # Full design doc with gap analysis
├── LICENSE                     # Apache 2.0
└── NOTICE                      # Third-party attributions

Related Repos

Repo	Purpose
headroom	Upstream compression library (Apache 2.0)
ai-gateway-payload-processing branch `feat/headroom-on-metering`	IPP Go plugin
ai-gateway-metering-service	MaaS dashboard (compression tab embedded)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Headroom Gateway

The Problem

How It Works

How It Fits Together

Compression Results

Compression Engines

When Each Engine Fires

Content Protection

Why Code Editing Sessions Show Lower Savings

Features

Deployment

Prerequisites (not ours — must exist first)

Deploy headroom proxy

What gets deployed

Point users at headroom

GPU Support

Emergency Controls

API Endpoints

Testing

Project Structure

Related Repos

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 68 Commits
dashboard		dashboard
deploy		deploy
docs		docs
scripts		scripts
service		service
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
docker-compose.yaml		docker-compose.yaml
docker-compose.yaml.tmp		docker-compose.yaml.tmp

Folders and files

Latest commit

History

Repository files navigation

Headroom Gateway

The Problem

How It Works

How It Fits Together

Compression Results

Compression Engines

When Each Engine Fires

Content Protection

Why Code Editing Sessions Show Lower Savings

Features

Deployment

Prerequisites (not ours — must exist first)

Deploy headroom proxy

What gets deployed

Point users at headroom

GPU Support

Emergency Controls

API Endpoints

Testing

Project Structure

Related Repos

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages