42 changes: 42 additions & 0 deletions RELEASE-NOTES.md
@@ -1,3 +1,45 @@
RELEASE v0.9.0-rc.2 2026-06-16
2026-05-25 https://github.com/llm-d/llm-d-router/pull/1030 InFlightLoadProducer now reliably tracks global token and request counts in the presence of timeouts, disconnects, and long-lived streams.
2026-05-25 https://github.com/llm-d/llm-d-router/pull/1218 EPP can now run without a Kubernetes cluster. When `dataLayer.discovery.pluginRef` is set, the runner skips controller-manager setup and drives endpoint discovery through the file-discovery plugin ("file-discovery"). See docs/discovery.md and pkg/epp/framework/plugins/datalayer/discovery/file/README.md for more details.
2026-05-25 https://github.com/llm-d/llm-d-router/pull/1247 The EndpointPickerConfig API has been refactored to group configuration fields more logically, which clarifies the schema and gives a cleaner foundation for future feature extensions. Existing YAML/JSON configuration files must be updated to reflect the new nested structure; the old fields will continue to work for two releases. - Saturation Detector Migration: The SaturationDetector field has been moved from the top-level configuration into the FlowControl block. - Parser Encapsulation: A new requestHandler struct has been introduced to house request-handler component configurations. The parser field has been moved into this new block.
2026-05-25 https://github.com/llm-d/llm-d-router/pull/1276 Fix cached prompt-token usage extraction in the EPP OpenAI parser. `cached_tokens` is now read from `usage.prompt_tokens_details` (and from `usage.input_tokens_details` for the Responses API), so prompt cache-hit metrics are recorded instead of being silently dropped.
2026-05-26 https://github.com/llm-d/llm-d-router/pull/1244 Add `agent-identity` plugin that derives `FairnessID` from agent session headers (Claude Code, OpenCode, Codex).
2026-05-26 https://github.com/llm-d/llm-d-router/pull/1302 The Helm charts are now released as part of the llm-d router project. They include several structural changes and validation safeguards for users migrating from gateway-api-inference-extension; see the "Migrating from gateway-api-inference-extension" section in https://github.com/llm-d/llm-d-router/blob/main/config/charts/README.md for the migration guide.
2026-05-27 https://github.com/llm-d/llm-d-router/pull/1234 Update the default vLLM and simulator images, and remove the UDS Tokenizer and the `UDS_TOKENIZER_IMAGE` environment variable. Use `VLLM_RENDER_IMAGE` environment variable to define the render image name.
2026-05-27 https://github.com/llm-d/llm-d-router/pull/1372 Added a `session-id-producer` DataProducer plugin (type: `session-id-producer`), which extracts a session identifier from a configured request header or cookie and publishes it as the `SessionID` attribute on the request attribute store for use by future affinity-aware scorers and filters.
2026-05-28 https://github.com/llm-d/llm-d-router/pull/1248 A new `/inference/v1/generate` endpoint accepts pre-tokenized prompts (`token_ids`) and optional multimodal features (image/audio/video hashes and placeholder ranges). To enable it, configure the new `vllmhttp-parser` (Helm value `router.epp.parser=vllmhttp-parser`, or set `parser: vllmhttp-parser` in the EPP configuration). The parser handles `/inference/v1/generate` locally and delegates all other paths to the OpenAI parser, so a single instance covers both vLLM-specific and OpenAI-compatible HTTP traffic on the same endpoint. Existing `openai-parser` deployments are unaffected and need no changes unless `/inference/v1/generate` support is desired.
2026-05-30 https://github.com/llm-d/llm-d-router/pull/1402 Fix encode disaggregation not triggering for requests with the `audio_url` content type.
2026-05-31 https://github.com/llm-d/llm-d-router/pull/1418 Remove the deprecated pkg/epp/backend/metrics package and the enableLegacyMetrics feature gate. All metrics collection now goes through the datalayer pipeline exclusively. Configurations referencing the enableLegacyMetrics feature gate should remove it.
2026-06-02 https://github.com/llm-d/llm-d-router/pull/1121 `precise-prefix-cache-scorer` is now a thin compatibility wrapper around `precise-prefix-cache-producer` and the `prefix-cache-scorer`. Existing configurations continue to work. Deployments without `endpoint-notification-source` wired must add it (or use global socket mode); the legacy in-Score subscriber discovery path is removed. The plugin is deprecated; configure `precise-prefix-cache-producer` + `prefix-cache-scorer` with `prefixMatchInfoProducerName: precise-prefix-cache-producer` directly for new deployments.
2026-06-02 https://github.com/llm-d/llm-d-router/pull/1160 The approximate prefix-cache plugin's autotune path now clamps blockSizeTokens at a minimum of 64 to bound EPP indexer memory. Manually configured values below 64 are still honored but log a deploy-time warning. This is a deliberate routing-precision / memory-stability tradeoff: the routing scorer measures prefix matches at coarser granularity than the model server's true block size.
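
The clamp described in the entry above can be illustrated with a small shell sketch. This is not the router's code: the function name `effective_block_size` and its `autotune`/`manual` mode argument are hypothetical, and only the minimum of 64 and the warn-but-honor behavior for manual values come from the note.

```shell
# Illustrative only: effective_block_size and its mode argument are hypothetical.
MIN_AUTOTUNE_BLOCK_SIZE=64

# effective_block_size <autotune|manual> <block_size_tokens>
# Prints the block size that would be used under the rule in the note.
effective_block_size() {
  if [ "$1" = "autotune" ] && [ "$2" -lt "$MIN_AUTOTUNE_BLOCK_SIZE" ]; then
    # Autotuned values are raised to the minimum to bound indexer memory.
    echo "$MIN_AUTOTUNE_BLOCK_SIZE"
    return 0
  fi
  if [ "$2" -lt "$MIN_AUTOTUNE_BLOCK_SIZE" ]; then
    # Manual values are honored, but a warning is logged.
    echo "warning: blockSizeTokens=$2 is below $MIN_AUTOTUNE_BLOCK_SIZE" >&2
  fi
  echo "$2"
}

effective_block_size autotune 16   # prints 64
effective_block_size manual 16     # prints 16, with a warning on stderr
```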
2026-06-02 https://github.com/llm-d/llm-d-router/pull/1288 - When a request carries the standard HTTP `Prefer: if-available` header (RFC 7240), the EPP routes to a decode worker only if its KV cache already covers the prompt; otherwise it returns HTTP 412 Precondition Failed so the coordinator restarts the pipeline at encode/prefill/decode. - The cache check reads `PrefixCacheMatchInfo` from the chosen endpoint using the default-named approximate-prefix producer's key. Deployments using the auto-created `approx-prefix-cache-producer` (the canonical decode-EPP recipe) get the optimization. Deployments using a custom-named approx producer or `precise-prefix-cache-producer` write under a different key, so the gate misses and the coordinator receives 412 on every conditional-decode request — falling back to the full pipeline (correctness preserved, optimization effectively disabled) for those configurations.
2026-06-02 https://github.com/llm-d/llm-d-router/pull/1436 Metrics emitted by plugins will have `plugin_name` and `plugin_type` labels.
2026-06-02 https://github.com/llm-d/llm-d-router/pull/1449 If both the legacy (inference.networking.x-k8s.io) and new (llm-d.ai) InferenceObjective/InferenceModelRewrite CRDs are installed, EPP reconciles only the new group and IGNORES legacy resources.
2026-06-05 https://github.com/llm-d/llm-d-router/pull/1475 EPP now supports configuring multiple parsers under `requestHandler.parsers` in the `EndpointPickerConfig`. The router matches the request path suffix to select the appropriate parser (first match wins for duplicate parsers supporting suffix match).
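
A minimal sketch of suffix-based parser selection with first-match-wins ordering, assuming a simple ordered list of suffix/parser pairs. The parser names appear in these notes, but the suffix table, the `shadowed-parser` entry, and the `select_parser` function are hypothetical and do not reflect the actual `requestHandler.parsers` schema.

```shell
# Hypothetical ordered table of "suffix=parser" pairs; earlier entries win.
PARSERS='/inference/v1/generate=vllmhttp-parser
/v1/messages=anthropic-parser
/completions=openai-parser
/completions=shadowed-parser'

# select_parser <request_path>
# Prints the first parser whose suffix matches the end of the path.
# Returns 1 and prints nothing when no suffix matches.
select_parser() {
  path=$1
  while IFS='=' read -r suffix parser; do
    case "$path" in
      *"$suffix") echo "$parser"; return 0 ;;
    esac
  done <<EOF
$PARSERS
EOF
  return 1
}

select_parser /v1/chat/completions    # prints openai-parser
select_parser /inference/v1/generate  # prints vllmhttp-parser
```

Because the loop returns on the first hit, the second `/completions` entry is never selected, which is the duplicate-suffix behavior the note describes.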
2026-06-05 https://github.com/llm-d/llm-d-router/pull/1488 Enable openai, anthropic, and vllmhttp parsers by default in EPP.
2026-06-05 https://github.com/llm-d/llm-d-router/pull/1493 `inflight-load-producer`: a new `prefixMatchInfoProducerName` parameter selects which prefix-cache producer supplies the cached-prefix discount - the approximate-prefix producer by default, or a precise-prefix-cache producer when set.
2026-06-06 https://github.com/llm-d/llm-d-router/pull/1509 `anthropic-parser`: supports the `/v1/messages/count_tokens` endpoint; the body is forwarded unchanged as a raw payload.
2026-06-07 https://github.com/llm-d/llm-d-router/pull/1426 Requests that omit the model field can now be handled by generic InferenceModelRewrite rules instead of being rejected with BadRequest.
2026-06-08 https://github.com/llm-d/llm-d-router/pull/1513 Consolidated tracing initialization and tracer retrieval. Added a `--tracing` flag to `pd-sidecar` (defaulting to `false`) to allow conditionally enabling tracing and avoiding unwanted OTLP connection attempts by default.
2026-06-08 https://github.com/llm-d/llm-d-router/pull/1515 EPP: route /v1/chat/completions/render and /v1/completions/render through the OpenAI parser.
2026-06-09 https://github.com/llm-d/llm-d-router/pull/1444 New EC-NIXL encoder disaggregation connector: add `--ec-connector=ec-nixl` to the sidecar options to route multimodal encoder requests through NIXL prior to the prefill phase. OpenTelemetry (OTel) span attribute renaming: span attributes emitted by the encoder-disaggregation path have been renamed from the llm_d.epd_proxy.* namespace to llm_d.ec_proxy.*.
2026-06-09 https://github.com/llm-d/llm-d-router/pull/1536 EPP: added `modality` label to encoder_cache_queries_total and encoder_cache_hits_total.
2026-06-09 https://github.com/llm-d/llm-d-router/pull/1539 EPP trace spans now consistently carry the build version and commit SHA across all instrumentation scopes, so a full request trace can be attributed to a single build.
2026-06-10 https://github.com/llm-d/llm-d-router/pull/1554 Approximate prefix cache affinity routing now considers tools.
2026-06-11 https://github.com/llm-d/llm-d-router/pull/1548 The approximate prefix-cache producer (approx-prefix-cache-producer) now defaults maxPrefixTokensToMatch to 131072 (128K tokens), matching the context window of large production models such as gpt-oss 120b. This token-based cap takes precedence over maxPrefixBlocksToMatch, so by default up to 131072 / blockSizeTokens prefix blocks are matched per request instead of the previous 256-block cap. Set maxPrefixTokensToMatch: 0 to restore the block-based cap.
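
The effect of the new token-based cap on the number of matched blocks can be worked out directly. The sketch below only evaluates 131072 / blockSizeTokens for a few illustrative block sizes; the helper name `max_prefix_blocks` is hypothetical.

```shell
# max_prefix_blocks <max_prefix_tokens> <block_size_tokens>
# Integer division, as in "131072 / blockSizeTokens" from the note.
max_prefix_blocks() {
  echo $(( $1 / $2 ))
}

for block_size in 16 64 256; do
  echo "blockSizeTokens=$block_size -> $(max_prefix_blocks 131072 "$block_size") blocks"
done
# blockSizeTokens=16 -> 8192 blocks
# blockSizeTokens=64 -> 2048 blocks
# blockSizeTokens=256 -> 512 blocks
```

Each of these values exceeds the previous 256-block cap, so deployments that relied on that cap may see more blocks matched per request.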
2026-06-11 https://github.com/llm-d/llm-d-router/pull/1575 The standalone Helm chart now supports `router.proxy.mode=service`, deploying the Envoy proxy as a separate horizontally scalable Service (instead of an EPP sidecar) that reaches EPP over the EPP Service with fail-open ext_proc for active/passive resiliency. Default remains `sidecar`.
2026-06-12 https://github.com/llm-d/llm-d-router/pull/1206 sidecar: replace fallback-to-decode on prefill failure with configurable retry logic (--prefill-max-retries, --prefill-retry-backoff). Prefill errors are now returned to the client instead of silently falling back to unaccelerated decode.
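
The retry behavior can be sketched as a bounded loop that surfaces the error once retries are exhausted instead of falling back. This is an assumption-laden illustration: the function `retry_prefill` is hypothetical, the backoff is assumed to be a fixed delay, and the retry count is assumed to exclude the initial attempt. The note does not specify either detail.

```shell
# retry_prefill <max_retries> <backoff_seconds> <command> [args...]
# Runs the command, retrying up to <max_retries> times after the first
# attempt. Assumes a fixed backoff; the sidecar's actual policy may differ.
retry_prefill() {
  max_retries=$1
  backoff=$2
  shift 2
  attempt=0
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_retries" ]; then
      # No fallback to decode: the failure is reported to the caller.
      echo "prefill failed after $((attempt + 1)) attempt(s)" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$backoff"
  done
}

# Usage: retry a (hypothetical) prefill command twice, one second apart.
# retry_prefill 2 1 send_prefill_request
```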
2026-06-12 https://github.com/llm-d/llm-d-router/pull/1603 Added an opt-in `--enable-grpc-stream-metrics` flag to the EPP exposing ext_proc gRPC stream metrics: in-flight stream count, hold duration, and completions by gRPC status code.
2026-06-12 https://github.com/llm-d/llm-d-router/pull/1607 The EPP gRPC health check port is configurable via `router.epp.grpcHealthPort` (default 9003).
2026-06-12 https://github.com/llm-d/llm-d-router/pull/1608 Fixed a regression where a priority band defined only via an InferenceObjective (not in the static EPP config) could be garbage-collected after a period of inactivity, causing subsequent requests at that priority to be rejected with "priority band not found".
2026-06-12 https://github.com/llm-d/llm-d-router/pull/1626 EPP now records an error status on the gateway.request and gateway.request_orchestration trace spans when a request fails, so failed requests can be filtered by error status in the trace backend.
2026-06-12 https://github.com/llm-d/llm-d-router/pull/1631 - Add a new `session-affinity-filter` scheduling plugin: pins a session to its previously selected pod as a hard constraint, falling back to all candidates when that pod is no longer available. Complements the existing `session-affinity-scorer` (soft preference). - `session-affinity-scorer` and `session-affinity-filter` now accept an optional `headerName` parameter to carry the session token on a custom request/response header instead of the default `x-session-token`.
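
The hard-constraint-with-fallback behavior of the filter can be sketched as follows. The function `filter_by_session` and the pod names are hypothetical; the sketch only shows the selection rule stated in the note, not how the plugin reads the session token.

```shell
# filter_by_session <pinned_pod> <candidate>...
# Prints only the pinned pod when it is among the candidates,
# otherwise prints all candidates (the fallback case).
filter_by_session() {
  pinned=$1
  shift
  for pod in "$@"; do
    if [ "$pod" = "$pinned" ]; then
      echo "$pod"
      return 0
    fi
  done
  echo "$*"
}

filter_by_session pod-b pod-a pod-b pod-c  # prints pod-b
filter_by_session pod-x pod-a pod-b        # prints pod-a pod-b
```

A scorer, by contrast, would keep every candidate and only raise the score of the previously selected pod.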
2026-06-13 https://github.com/llm-d/llm-d-router/pull/1550 Add a bundled Grafana dashboard for EPP, inference pool, vLLM, and flow-control metrics.
2026-06-15 https://github.com/llm-d/llm-d-router/pull/1429 Added `llm_d_router_epp_encoder_cache_hit_ratio` histogram metric.
2026-06-15 https://github.com/llm-d/llm-d-router/pull/1651 `llm_d_router_epp_plugin_duration_seconds`, previously limited to scheduler plugins, now records durations for all plugins.
2026-06-15 https://github.com/llm-d/llm-d-router/pull/1653 Session affinity filter/scorer can optionally pick the scheduling profile to inject the routed endpoint from. This enables P/D disaggregation support.
2026-06-16 https://github.com/llm-d/llm-d-router/pull/1661 The metrics prefix changed to llm_d_epp

RELEASE pre-fragments 2026-05-24
2026-05-24 https://github.com/llm-d/llm-d-router/pull/1134 EPP now strictly parses plugin configurations — unknown fields cause plugin initialization to fail with a clear error rather than being silently ignored. Deprecated fields continue to be accepted with a warning until they are removed.
2026-05-24 https://github.com/llm-d/llm-d-router/pull/1079 Deprecated the UDS backend in `token-producer`.
2 changes: 1 addition & 1 deletion hack/push-chart.sh
@@ -26,7 +26,7 @@ IMAGE_REGISTRY=${IMAGE_REGISTRY:-ghcr.io/llm-d}
AGENTGATEWAY_TAG=${AGENTGATEWAY_TAG:-${EXTRA_TAG}}
CHART_SUFFIX=${CHART_SUFFIX:-""}
EPP_RELEASE_IMAGE_REPOSITORY=${EPP_RELEASE_IMAGE_REPOSITORY:-llm-d-router-endpoint-picker}
-LATENCY_PREDICTOR_TAG=${LATENCY_PREDICTOR_TAG:-latest}
+LATENCY_PREDICTOR_TAG=${LATENCY_PREDICTOR_TAG:-"v0.8.0-rc.1"}
export EXTRA_TAG AGENTGATEWAY_TAG IMAGE_REGISTRY EPP_RELEASE_IMAGE_REPOSITORY LATENCY_PREDICTOR_TAG CHART_SUFFIX

HELM_CHART_REPO=${HELM_CHART_REPO:-${IMAGE_REGISTRY}/charts}
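
The hunk above changes only the fallback used by the `${VAR:-default}` expansion, so an explicitly set `LATENCY_PREDICTOR_TAG` still takes precedence. A small sketch of that behavior, using a hypothetical wrapper function for illustration:

```shell
# resolve_latency_predictor_tag is a hypothetical wrapper around the same
# expansion the script uses; it is not part of hack/push-chart.sh.
resolve_latency_predictor_tag() {
  echo "${LATENCY_PREDICTOR_TAG:-"v0.8.0-rc.1"}"
}

unset LATENCY_PREDICTOR_TAG
resolve_latency_predictor_tag   # prints v0.8.0-rc.1 (variable unset)

LATENCY_PREDICTOR_TAG=
resolve_latency_predictor_tag   # prints v0.8.0-rc.1 (":-" also covers empty)

LATENCY_PREDICTOR_TAG=latest
resolve_latency_predictor_tag   # prints latest (explicit value wins)
```

Callers that depended on the old implicit `latest` default would need to export `LATENCY_PREDICTOR_TAG=latest` themselves.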
7 changes: 0 additions & 7 deletions release-notes.d/unreleased/1030.md

This file was deleted.

7 changes: 0 additions & 7 deletions release-notes.d/unreleased/1121.md

This file was deleted.

7 changes: 0 additions & 7 deletions release-notes.d/unreleased/1160.md

This file was deleted.

7 changes: 0 additions & 7 deletions release-notes.d/unreleased/1206.md

This file was deleted.

7 changes: 0 additions & 7 deletions release-notes.d/unreleased/1218.md

This file was deleted.

7 changes: 0 additions & 7 deletions release-notes.d/unreleased/1234.md

This file was deleted.

7 changes: 0 additions & 7 deletions release-notes.d/unreleased/1244.md

This file was deleted.

7 changes: 0 additions & 7 deletions release-notes.d/unreleased/1247.md

This file was deleted.

7 changes: 0 additions & 7 deletions release-notes.d/unreleased/1248.md

This file was deleted.

7 changes: 0 additions & 7 deletions release-notes.d/unreleased/1276.md

This file was deleted.

7 changes: 0 additions & 7 deletions release-notes.d/unreleased/1288.md

This file was deleted.

7 changes: 0 additions & 7 deletions release-notes.d/unreleased/1302.md

This file was deleted.

7 changes: 0 additions & 7 deletions release-notes.d/unreleased/1372.md

This file was deleted.

7 changes: 0 additions & 7 deletions release-notes.d/unreleased/1402.md

This file was deleted.
