llm-d · llm-d-router-release-notes · Jun 16, 2026 · Jun 16, 2026
@@ -1,3 +1,45 @@
+RELEASE v0.9.0-rc.2 2026-06-16
+2026-05-25 https://github.com/llm-d/llm-d-router/pull/1030 InFlightLoadProducer now reliably tracks global token and request counts in the presence of timeouts, disconnects, and long-lived streams.
+2026-05-25 https://github.com/llm-d/llm-d-router/pull/1218 EPP can now run without a Kubernetes cluster. When `dataLayer.discovery.pluginRef` is set, the runner skips controller-manager setup and drives endpoint discovery through the file-discovery plugin ("file-discovery"). See docs/discovery.md and pkg/epp/framework/plugins/datalayer/discovery/file/README.md for more details.
+2026-05-25 https://github.com/llm-d/llm-d-router/pull/1247 The EndpointPickerConfig API has been refactored to group configuration fields more logically, which clarifies the schema and gives future features a cleaner foundation. Existing YAML/JSON configuration files must be updated to the new nested structure; the old fields continue to work for two releases. Changes: - Saturation Detector Migration: the SaturationDetector field has moved from the top-level configuration into the FlowControl block. - Parser Encapsulation: a new requestHandler struct houses request-handler component configuration, and the parser field has moved into it.
+2026-05-25 https://github.com/llm-d/llm-d-router/pull/1276 Fix cached prompt-token usage extraction in the EPP OpenAI parser. `cached_tokens` is now read from `usage.prompt_tokens_details` (and from `usage.input_tokens_details` for the Responses API), so prompt cache-hit metrics are recorded instead of being silently dropped.
+2026-05-26 https://github.com/llm-d/llm-d-router/pull/1244 Add `agent-identity` plugin that derives `FairnessID` from agent session headers (Claude Code, OpenCode, Codex).
+2026-05-26 https://github.com/llm-d/llm-d-router/pull/1302 The Helm charts are now released as part of the llm-d router project. They include several structural changes and validation safeguards for users migrating from gateway-api-inference-extension; see the "Migrating from gateway-api-inference-extension" section in https://github.com/llm-d/llm-d-router/blob/main/config/charts/README.md for the migration guide.
+2026-05-27 https://github.com/llm-d/llm-d-router/pull/1234 Update the default vLLM and simulator images, and remove the UDS Tokenizer and the `UDS_TOKENIZER_IMAGE` environment variable. Use the `VLLM_RENDER_IMAGE` environment variable to set the render image name.
+2026-05-27 https://github.com/llm-d/llm-d-router/pull/1372 Added a `session-id-producer` DataProducer plugin (type: `session-id-producer`), which extracts a session identifier from a configured request header or cookie and publishes it as the `SessionID` attribute on the request attribute store for use by future affinity-aware scorers and filters.
+2026-05-28 https://github.com/llm-d/llm-d-router/pull/1248 A new `/inference/v1/generate` endpoint accepts pre-tokenized prompts (`token_ids`) and optional multimodal features (image/audio/video hashes and placeholder ranges). To enable it, configure the new `vllmhttp-parser` (Helm value `router.epp.parser=vllmhttp-parser`, or set `parser: vllmhttp-parser` in the EPP configuration). The parser handles `/inference/v1/generate` locally and delegates all other paths to the OpenAI parser, so a single instance covers both vLLM-specific and OpenAI-compatible HTTP traffic on the same endpoint. Existing `openai-parser` deployments are unaffected and need no changes unless `/inference/v1/generate` support is desired.
+2026-05-30 https://github.com/llm-d/llm-d-router/pull/1402 Fix encode disaggregation not triggering for requests with the `audio_url` content type.
+2026-05-31 https://github.com/llm-d/llm-d-router/pull/1418 Remove the deprecated pkg/epp/backend/metrics package and the enableLegacyMetrics feature gate. All metrics collection now goes through the datalayer pipeline exclusively. Configurations referencing the enableLegacyMetrics feature gate should remove it.
+2026-06-02 https://github.com/llm-d/llm-d-router/pull/1121 `precise-prefix-cache-scorer` is now a thin compatibility wrapper around `precise-prefix-cache-producer` and the `prefix-cache-scorer`. Existing configurations continue to work. Deployments without `endpoint-notification-source` wired must add it (or use global socket mode); the legacy in-Score subscriber discovery path is removed. The plugin is deprecated; configure `precise-prefix-cache-producer` + `prefix-cache-scorer` with `prefixMatchInfoProducerName: precise-prefix-cache-producer` directly for new deployments.
+2026-06-02 https://github.com/llm-d/llm-d-router/pull/1160 The approximate prefix-cache plugin's autotune path now clamps blockSizeTokens at a minimum of 64 to bound EPP indexer memory. Manually configured values below 64 are still honored but log a deploy-time warning. This is a deliberate routing-precision / memory-stability tradeoff: the routing scorer measures prefix matches at coarser granularity than the model server's true block size.
+2026-06-02 https://github.com/llm-d/llm-d-router/pull/1288 - When a request carries the standard HTTP `Prefer: if-available` header (RFC 7240), the EPP routes to a decode worker only if its KV cache already covers the prompt; otherwise it returns HTTP 412 Precondition Failed so the coordinator restarts the pipeline at encode/prefill/decode. - The cache check reads `PrefixCacheMatchInfo` from the chosen endpoint using the default-named approximate-prefix producer's key. Deployments using the auto-created `approx-prefix-cache-producer` (the canonical decode-EPP recipe) get the optimization. Deployments using a custom-named approx producer or `precise-prefix-cache-producer` write under a different key, so the gate misses and the coordinator receives 412 on every conditional-decode request — falling back to the full pipeline (correctness preserved, optimization effectively disabled) for those configurations.
+2026-06-02 https://github.com/llm-d/llm-d-router/pull/1436 Metrics emitted by plugins will have `plugin_name` and `plugin_type` labels.
+2026-06-02 https://github.com/llm-d/llm-d-router/pull/1449 If both legacy (inference.networking.x-k8s.io) and new (llm-d.ai) InferenceObjective/InferenceModelRewrite CRDs are installed, EPP reconciles only the new group and ignores legacy resources.
+2026-06-05 https://github.com/llm-d/llm-d-router/pull/1475 EPP now supports configuring multiple parsers under `requestHandler.parsers` in the `EndpointPickerConfig`. The router matches the request path suffix to select the appropriate parser (first match wins for duplicate parsers supporting suffix match).
+2026-06-05 https://github.com/llm-d/llm-d-router/pull/1488 Enable openai, anthropic, and vllmhttp parsers by default in EPP.
+2026-06-05 https://github.com/llm-d/llm-d-router/pull/1493 `inflight-load-producer`: a new `prefixMatchInfoProducerName` parameter selects which prefix-cache producer supplies the cached-prefix discount - the approximate-prefix producer by default, or a precise-prefix-cache producer when set.
+2026-06-06 https://github.com/llm-d/llm-d-router/pull/1509 `anthropic-parser`: supports the `/v1/messages/count_tokens` endpoint; the body is forwarded unchanged as a raw payload.
+2026-06-07 https://github.com/llm-d/llm-d-router/pull/1426 Requests that omit the model field can now be handled by generic InferenceModelRewrite rules instead of being rejected with BadRequest.
+2026-06-08 https://github.com/llm-d/llm-d-router/pull/1513 Consolidated tracing initialization and tracer retrieval. Added a `--tracing` flag to `pd-sidecar` (defaulting to `false`) to allow conditionally enabling tracing and avoiding unwanted OTLP connection attempts by default.
+2026-06-08 https://github.com/llm-d/llm-d-router/pull/1515 EPP: route /v1/chat/completions/render and /v1/completions/render through the OpenAI parser.
+2026-06-09 https://github.com/llm-d/llm-d-router/pull/1444 New EC-NIXL encoder disaggregation connector: add `--ec-connector=ec-nixl` to the sidecar options to route multimodal encoder requests through NIXL prior to the prefill phase. OpenTelemetry (OTel) span attribute renaming: span attributes emitted by the encoder-disaggregation path have been renamed from the llm_d.epd_proxy.* namespace to llm_d.ec_proxy.*.
+2026-06-09 https://github.com/llm-d/llm-d-router/pull/1536 EPP: added `modality` label to encoder_cache_queries_total and encoder_cache_hits_total.
+2026-06-09 https://github.com/llm-d/llm-d-router/pull/1539 EPP trace spans now consistently carry the build version and commit SHA across all instrumentation scopes, so a full request trace can be attributed to a single build.
+2026-06-10 https://github.com/llm-d/llm-d-router/pull/1554 Approximate prefix cache affinity routing now considers tools.
+2026-06-11 https://github.com/llm-d/llm-d-router/pull/1548 The approximate prefix-cache producer (approx-prefix-cache-producer) now defaults maxPrefixTokensToMatch to 131072 (128K tokens), matching the context window of large production models such as gpt-oss 120b. This token-based cap takes precedence over maxPrefixBlocksToMatch, so by default up to 131072 / blockSizeTokens prefix blocks are matched per request instead of the previous 256-block cap. Set maxPrefixTokensToMatch: 0 to restore the block-based cap.
+2026-06-11 https://github.com/llm-d/llm-d-router/pull/1575 The standalone Helm chart now supports `router.proxy.mode=service`, deploying the Envoy proxy as a separate horizontally scalable Service (instead of an EPP sidecar) that reaches EPP over the EPP Service with fail-open ext_proc for active/passive resiliency. Default remains `sidecar`.
+2026-06-12 https://github.com/llm-d/llm-d-router/pull/1206 sidecar: replace fallback-to-decode on prefill failure with configurable retry logic (--prefill-max-retries, --prefill-retry-backoff). Prefill errors are now returned to the client instead of silently falling back to unaccelerated decode.
+2026-06-12 https://github.com/llm-d/llm-d-router/pull/1603 Added an opt-in `--enable-grpc-stream-metrics` flag to the EPP exposing ext_proc gRPC stream metrics: in-flight stream count, hold duration, and completions by gRPC status code.
+2026-06-12 https://github.com/llm-d/llm-d-router/pull/1607 The EPP gRPC health check port is configurable via `router.epp.grpcHealthPort` (default 9003).
+2026-06-12 https://github.com/llm-d/llm-d-router/pull/1608 Fixed a regression where a priority band defined only via an InferenceObjective (not in the static EPP config) could be garbage-collected after a period of inactivity, causing subsequent requests at that priority to be rejected with "priority band not found".
+2026-06-12 https://github.com/llm-d/llm-d-router/pull/1626 EPP now records an error status on the gateway.request and gateway.request_orchestration trace spans when a request fails, so failed requests can be filtered by error status in the trace backend.
+2026-06-12 https://github.com/llm-d/llm-d-router/pull/1631 - Add a new `session-affinity-filter` scheduling plugin: pins a session to its previously selected pod as a hard constraint, falling back to all candidates when that pod is no longer available. Complements the existing `session-affinity-scorer` (soft preference). - `session-affinity-scorer` and `session-affinity-filter` now accept an optional `headerName` parameter to carry the session token on a custom request/response header instead of the default `x-session-token`.
+2026-06-13 https://github.com/llm-d/llm-d-router/pull/1550 Add a bundled Grafana dashboard for EPP, inference pool, vLLM, and flow-control metrics.
+2026-06-15 https://github.com/llm-d/llm-d-router/pull/1429 Added `llm_d_router_epp_encoder_cache_hit_ratio` histogram metric.
+2026-06-15 https://github.com/llm-d/llm-d-router/pull/1651 `llm_d_router_epp_plugin_duration_seconds` was previously limited to scheduler plugins; it now covers all plugins.
+2026-06-15 https://github.com/llm-d/llm-d-router/pull/1653 The session affinity filter and scorer can optionally choose the scheduling profile from which the routed endpoint is injected. This enables P/D disaggregation support.
+2026-06-16 https://github.com/llm-d/llm-d-router/pull/1661 The metrics prefix has changed to `llm_d_epp`.
+
 RELEASE pre-fragments 2026-05-24
 2026-05-24 https://github.com/llm-d/llm-d-router/pull/1134 EPP now strictly parses plugin configurations — unknown fields cause plugin initialization to fail with a clear error rather than being silently ignored. Deprecated fields continue to be accepted with a warning until they are removed.
 2026-05-24 https://github.com/llm-d/llm-d-router/pull/1079 deprecated UDS-backend in `token-producer`
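
Reviewer note on the `maxPrefixTokensToMatch` entry (pull/1548): the arithmetic behind "131072 / blockSizeTokens prefix blocks" can be sketched as below. This is only an illustration, assuming plain integer division; the release note does not say how the router rounds, and the function names `blocks_matched` and `old_token_coverage` are hypothetical helpers, not part of the project.

```shell
#!/usr/bin/env bash
# Assumption: plain integer division; the router's actual rounding is not stated.
max_prefix_tokens=131072   # new default for maxPrefixTokensToMatch
old_block_cap=256          # previous block-based cap

# Number of prefix blocks matched under the token-based cap (hypothetical helper).
blocks_matched() {
  local block_size_tokens=$1
  echo $(( max_prefix_tokens / block_size_tokens ))
}

# Tokens that the previous 256-block cap covered at a given block size (hypothetical helper).
old_token_coverage() {
  local block_size_tokens=$1
  echo $(( old_block_cap * block_size_tokens ))
}

for bs in 16 64 256; do
  echo "blockSizeTokens=${bs}: $(blocks_matched "$bs") blocks now, old cap covered $(old_token_coverage "$bs") tokens"
done
```

At the autotune minimum of 64 tokens per block, the new default matches up to 2048 blocks per request, whereas the old 256-block cap covered 16384 tokens.
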

@@ -26,7 +26,7 @@ IMAGE_REGISTRY=${IMAGE_REGISTRY:-ghcr.io/llm-d}
 AGENTGATEWAY_TAG=${AGENTGATEWAY_TAG:-${EXTRA_TAG}}
 CHART_SUFFIX=${CHART_SUFFIX:-""}
 EPP_RELEASE_IMAGE_REPOSITORY=${EPP_RELEASE_IMAGE_REPOSITORY:-llm-d-router-endpoint-picker}
-LATENCY_PREDICTOR_TAG=${LATENCY_PREDICTOR_TAG:-latest}
+LATENCY_PREDICTOR_TAG=${LATENCY_PREDICTOR_TAG:-"v0.8.0-rc.1"}
 export EXTRA_TAG AGENTGATEWAY_TAG IMAGE_REGISTRY EPP_RELEASE_IMAGE_REPOSITORY LATENCY_PREDICTOR_TAG CHART_SUFFIX
 
 HELM_CHART_REPO=${HELM_CHART_REPO:-${IMAGE_REGISTRY}/charts}
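
Reviewer note on the `LATENCY_PREDICTOR_TAG` change: the `${VAR:-default}` expansion keeps any value the caller exports and falls back to the pinned tag only when the variable is unset or empty. A minimal sketch of that behavior, where `resolve_tag` is a hypothetical helper used purely for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the default expansion used in the script.
resolve_tag() {
  local tag=${LATENCY_PREDICTOR_TAG:-"v0.8.0-rc.1"}
  echo "$tag"
}

# Case 1: variable unset, so the pinned default applies.
unset LATENCY_PREDICTOR_TAG
default_tag=$(resolve_tag)

# Case 2: caller overrides the tag, so the override wins.
LATENCY_PREDICTOR_TAG=latest
override_tag=$(resolve_tag)

# Case 3: variable set but empty; ':-' treats this like unset.
LATENCY_PREDICTOR_TAG=""
empty_tag=$(resolve_tag)

echo "unset:    ${default_tag}"
echo "override: ${override_tag}"
echo "empty:    ${empty_tag}"
```

Anyone who relied on the previous `latest` default would need to export `LATENCY_PREDICTOR_TAG=latest` explicitly to keep that behavior.
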