llm-d Planner

From business requirements to running llm-d deployments — sizing, estimation, and deployment in one workflow.

Overview

Deploying LLMs in production means navigating a fragmented landscape of model selection, GPU sizing, performance estimation, and Kubernetes configuration. Platform teams typically piece together separate tools and manual calculations, leading to expensive trial-and-error.

llm-d Planner unifies these concerns into a single platform with three capabilities:

Planner — The main workflow. Guides users from business requirements to a running LLM deployment through conversational requirements gathering, SLO-driven recommendations, and one-click Kubernetes deployment.
Capacity Planner — Estimate GPU memory requirements for any HuggingFace model: model weights, KV cache, activation memory, and system overhead. Determine minimum GPU count, maximum context length, and concurrent request capacity for a given hardware configuration.
GPU Recommender — Estimate inference performance (TTFT, ITL, throughput) across GPU types without running actual benchmarks. Compare latency and performance tradeoffs to find the optimal GPU for your workload.

Planner is the main workflow — it makes recommendations based on available benchmarks, and leverages the Capacity Planner and GPU Recommender to fill gaps and expand the search surface beyond what benchmarks alone can cover. The Capacity Planner and GPU Recommender are also useful on their own for deeper analysis in their respective areas, each with its own UI page, CLI command, and API endpoint.

Key Features

Conversational Requirements Gathering - Describe your AI-powered use case in natural language
SLO-Driven Capacity Planning - Translate business needs into technical specifications (traffic profiles, latency targets, cost constraints)
GPU Memory Estimation - Calculate model weight memory, KV cache, activation overhead, and minimum GPU requirements for any HuggingFace model
Performance Estimation - Estimate TTFT, ITL, and throughput across GPU types without running benchmarks
SLO-Driven Recommendations - Get optimal model + GPU configurations backed by real benchmarks, with estimated performance fallback
Multi-Criteria Ranking - Score configurations on accuracy, price, latency, and complexity with 5 ranked views
What-If Analysis - Explore alternatives and compare cost vs. latency tradeoffs
One-Click Deployment - Generate production-ready KServe/vLLM YAML and deploy to Kubernetes
Performance Monitoring - Track actual deployment status and test inference in real-time
CLI and API - Use the planner CLI or REST API for programmatic access to all capabilities
GPU-Free Development - vLLM simulator enables local testing without GPU hardware

How It Works

Planner flow:

Extract Intent - LLM-powered analysis converts your description into structured requirements
Map to Traffic Profile - Map use case to appropriate GuideLLM benchmark configurations
Set SLO Targets - Auto-generate TTFT, ITL, and E2E targets based on experience class
Query Benchmarks - Exact match on (model, GPU, traffic profile) from benchmark database
Estimate if Missing - For models/GPUs without benchmarks, generate estimated performance via GPU Recommender
Filter by Latency - Find configurations meeting all latency targets
Plan Capacity - Calculate required replicas based on throughput requirements
Score and Rank Solutions - Score and rank solutions for quality, cost, and latency.
Present to User - Present top options to user in each category.
Generate & Deploy - Create validated Kubernetes YAML and deploy to local or production clusters

Capacity analysis flow:

Fetch Model Config - Retrieve architecture details from HuggingFace (layers, heads, quantization)
Calculate Memory - Model weights, KV cache per token, activation memory, system overhead
Evaluate Hardware Fit - Determine valid tensor parallelism values and whether the model fits on a given GPU
Report Capacity - Maximum context length, concurrent requests, KV cache blocks for the configuration

Performance analysis flow:

Select GPUs - Choose from a catalog of GPU types (H100, A100, L40, etc.) or evaluate all
Estimate Performance - Estimate TTFT, ITL, and throughput per GPU (currently via BentoML LLM Optimizer roofline model, with additional backends planned)
Apply Constraints - Filter by max TTFT, max ITL, max latency, and GPU count limits
Rank Results - Compare qualifying GPUs sorted by performance with cost details

Prerequisites

Required before running make setup:

macOS or Linux (Windows via WSL2)
Docker or Podman (must be running)
Python 3.13 - brew install python@3.13 (Linux: use your package manager or pyenv)
uv - curl -LsSf https://astral.sh/uv/install.sh | sh
Ollama - brew install ollama (Linux: ollama.com/download)
kubectl - brew install kubectl (Linux: kubernetes.io/docs/tasks/tools)
KIND - brew install kind (Linux: kind.sigs.k8s.io/docs/user/quick-start)

Quick Start

Get up and running in 4 commands:

make setup          # Install dependencies, pull Ollama model
make start          # Start all services (DB + Ollama + Backend + UI)
make db-load-blis   # Load BLIS benchmark data
make cluster-start  # Optional: Create local KIND cluster with vLLM simulator for testing deployments

Then open http://localhost:8501 in your browser.

Note: PostgreSQL runs as a Docker container (planner-postgres) with benchmark data. All db-load-* commands append to existing data. Use make db-reset first for a clean database. Benchmark data can also be uploaded and managed via the UI's Configuration tab or the REST API (/api/v1/db/upload-benchmarks).

Stop everything:

make stop           # Stop Backend + UI (leaves Ollama and DB running)
make stop-all       # Stop all services including Ollama and DB
make cluster-stop   # Delete cluster (optional)

Using Planner

Recommendations (main page):

Describe your use case in the chat interface
- Example: "I need a customer service chatbot for 5000 users with low latency"
Analyze use case - Click "Analyze Use Case" to extract intent
Generate specification - Click "Generate Specification" to create traffic profile and SLO targets
Review specification - Edit SLO targets, priorities, or constraints if needed
Generate recommendations - Click "Generate Recommendations" to find recommended configurations
Select a recommendation - Review ranked options and click "Select"
Deploy - Go to the "Deployment" tab to review, copy, or download generated deployment files, or deploy to Kubernetes cluster if connected

Capacity Planner (sidebar page): Enter a HuggingFace model ID and GPU configuration to see memory breakdown (model weights, KV cache, activation, overhead), maximum context length, and concurrent request capacity.

GPU Recommender (sidebar page): Enter a model ID and workload parameters (input/output token lengths) to get estimated inference performance across GPU types, ranked by cost.

Architecture

See docs/ARCHITECTURE.md for detailed system design.

Implemented Features

✅ Planner: Intent extraction, traffic profiling, multi-criteria model + GPU recommendation, SLO-driven capacity planning
✅ Capacity Planner: GPU memory estimation (model weights, KV cache for MHA/GQA/MQA/MLA, activation, overhead), model-fits-GPU analysis, maximum context length and concurrent request calculations
✅ GPU Recommender: Performance estimation (TTFT, ITL, throughput) across GPU types, cost comparison, latency constraint filtering
✅ Estimated Performance Fallback: When benchmarks are unavailable, roofline estimates fill the gap in the recommendation pipeline (visually differentiated as "Estimated" vs "Benchmarked")
✅ Traffic Profile Framework: GuideLLM standard configurations with experience-driven SLOs (9 use cases, 5 experience classes, latency targets)
✅ FastAPI Backend: REST endpoints for recommendations, capacity planning, GPU estimation, and database management (/api/v1/model-info, /api/v1/calculate, /api/v1/estimate)
✅ Streamlit UI: Chat interface, recommendation display, specification editor, Capacity Planner page, GPU Recommender page
✅ CLI: planner plan for capacity planning, planner estimate for GPU performance estimation
✅ Deployment Automation: YAML generation (KServe/vLLM/HPA/ServiceMonitor), Kubernetes deployment
✅ Local Kubernetes: KIND cluster support, KServe installation, cluster management
✅ vLLM Simulator: GPU-free development mode with realistic latency simulation
✅ Monitoring & Testing: Real-time deployment status, inference testing UI, cluster observability
✅ Database Management: PostgreSQL benchmark storage, upload/reset via REST API or UI Configuration tab

Key Technologies

Component	Technology
Backend	FastAPI, Pydantic
Frontend	Streamlit
LLM	Ollama (qwen2.5:7b)
Data	PostgreSQL
Performance Estimation	BentoML llm-optimizer (roofline model)
Model Configs	HuggingFace Hub API, transformers
YAML Generation	Jinja2 templates
Kubernetes	KIND (local), KServe v0.14.0
Deployment	kubectl

CLI

The planner CLI provides direct access to the Capacity Planner and GPU Recommender without running the web services.

# Capacity planning — memory breakdown for a model
planner plan --model Qwen/Qwen3-32B

# With GPU memory specified — shows allocatable KV cache, max concurrent requests
planner plan --model Qwen/Qwen3-32B --gpu-memory 80 --tp 4

# Auto-calculate maximum context length that fits
planner plan --model Qwen/Qwen3-32B --gpu-memory 80 --max-model-len -1

# GPU performance estimation — estimate TTFT/ITL/throughput across GPUs
planner estimate --model Qwen/Qwen3-32B --input-len 512 --output-len 128

# Estimate with specific GPUs and latency constraints
planner estimate --model Qwen/Qwen3-32B --input-len 512 --output-len 128 \
  --gpu-list H100,A100,L40 --max-ttft 100 --max-itl 10

# Human-readable output sorted by cost
planner estimate --model Qwen/Qwen3-32B --input-len 512 --output-len 128 --pretty

Run planner --help, planner plan --help, or planner estimate --help for all options.

Development Commands

make help                    # Show all available commands
make start                   # Start all services (DB + Ollama + Backend + UI)
make stop                    # Stop Backend + UI (leaves Ollama and DB running)
make stop-all                # Stop everything including Ollama and DB
make restart                 # Restart all services
make logs-backend            # Show backend logs
make logs-ui                 # Show UI logs

# Database (PostgreSQL)
make db-start                # Start PostgreSQL (initializes schema on first run)
make db-load-blis            # Load BLIS benchmark data (appends)
make db-load-estimated       # Load estimated performance data (appends)
make db-load-interpolated    # Load interpolated benchmark data (appends)
make db-load-guidellm        # Load benchmark data created with GuideLLM (not included in repo yet) (appends)
make db-reset                # Reset database (remove all data and reinitialize)
make db-shell                # Open PostgreSQL shell
make db-query-models         # Query available models in database
make db-query-traffic        # Query traffic patterns in database

# Kubernetes
make cluster-status          # Check Kubernetes cluster status
make clean-deployments       # Delete all InferenceServices

# Testing
make test                    # Run all tests (requires DB and Ollama)
make test-unit               # Run unit tests (no external dependencies)
make test-db                 # Run database tests (requires PostgreSQL)
make test-integration        # Run integration tests (requires Ollama and DB)

make clean                   # Remove generated files

vLLM Simulator Mode

Planner includes a GPU-free simulator for local development:

No GPU required - Run deployments on any laptop
OpenAI-compatible API - /v1/completions and /v1/chat/completions
Realistic latency - Uses benchmark data to simulate TTFT/ITL
Fast deployment - Pods become Ready in ~10-15 seconds

The deployment mode defaults to production (real vLLM with GPUs). Switch between production and simulator modes at runtime using the Configuration tab in the UI, or via the REST API:

GET /api/v1/deployment-mode - Check current mode
PUT /api/v1/deployment-mode - Set mode ({"mode": "simulator"} or {"mode": "production"})

See docs/DEVELOPER_GUIDE.md for details.

Documentation

Developer Guide - Development workflows, testing, debugging
Architecture - Detailed system design and component specifications
Traffic and SLOs - Traffic profile framework and experience-driven SLOs (Phase 2)
Logging Guide - Logging system and debugging
Claude Code Guidance - AI assistant instructions for contributors

Future Enhancements

Prefill/Decode Disaggregation - Support P/D disaggregation as a deployment topology option
llm-d Stack Configuration - Expose stack-level configuration including routing, P/D disaggregation, and fine-grained vLLM parameters
vLLM Parameter Search - Expand configuration search to include serving-stack knobs (batch size, scheduling policy, memory utilization)
Security Hardening - YAML validation, RBAC, network policies

Some of these enhancements may involve collaboration with other llm-d SIGs.

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

See CLAUDE.md for AI assistant guidance when making changes.

License

This project is licensed under Apache License 2.0. See the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 625 Commits
.github		.github
.streamlit		.streamlit
config		config
data		data
deploy/kubernetes		deploy/kubernetes
docs		docs
generated_configs		generated_configs
scripts		scripts
simulator		simulator
src/planner		src/planner
tests		tests
ui		ui
.dockerignore		.dockerignore
.gitignore		.gitignore
.python-version		.python-version
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
OWNERS		OWNERS
README.md		README.md
_typos.toml		_typos.toml
docker-compose.dev.yml		docker-compose.dev.yml
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

llm-d Planner

Overview

Key Features

How It Works

Prerequisites

Quick Start

Using Planner

Architecture

Implemented Features

Key Technologies

CLI

Development Commands

vLLM Simulator Mode

Documentation

Future Enhancements

Contributing

License

About

Uh oh!

Releases 1

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

llm-d Planner

Overview

Key Features

How It Works

Prerequisites

Quick Start

Using Planner

Architecture

Implemented Features

Key Technologies

CLI

Development Commands

vLLM Simulator Mode

Documentation

Future Enhancements

Contributing

License

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages