Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
72 changes: 72 additions & 0 deletions docs/metrics.md
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,78 @@ Metrics defined by llm-d Router are in addition to Inference Gateway metrics. Fo
> This metric is maintained for backward compatibility with the deprecated
> `pd-profile-handler`. New deployments should use `disagg_decision_total`.

## Flow Control Metrics

Exposed when the `flowControl` feature gate is enabled. All carry the `llm_d_epp_` prefix.

### `flow_control_request_queue_duration_seconds`

* **Type:** Histogram
* **Labels:**
* `fairness_id`: string (the tenant or flow identifier for fairness rotation)
* `priority`: string (the priority band, e.g., "0", "10")
* `outcome`: string (`Dispatched`, `RejectedCapacity`, `RejectedOther`, `EvictedTTL`, `EvictedContextCancelled`, `EvictedOther`)
* `inference_pool`: string
* `model_name`: string
* `target_model_name`: string
* **Release Stage:** ALPHA
* **Description:** Total time a request spends in the Flow Control layer, from enqueue to final outcome.
* **Usage:** Primary latency signal for flow control. Rising p99 indicates backends are saturated or capacity limits are too tight.

### `flow_control_dispatch_cycle_duration_seconds`

* **Type:** Histogram
* **Release Stage:** ALPHA
* **Description:** Time taken for each internal dispatch cycle.
* **Usage:** Measures the overhead of the dispatch loop itself. Rising values indicate increasing cost per cycle from saturation detection, priority band iteration, or fairness evaluation.

### `flow_control_request_enqueue_duration_seconds`

* **Type:** Histogram
* **Labels:**
* `fairness_id`: string (the tenant or flow identifier)
* `priority`: string (the priority band)
* `outcome`: string
* **Release Stage:** ALPHA
* **Description:** Time taken to enqueue a request into the Flow Control layer.
* **Usage:** Measures the time spent in capacity checks and queue insertion within the processor.

### `flow_control_queue_size`

* **Type:** Gauge
* **Labels:**
* `fairness_id`: string (the tenant or flow identifier)
* `priority`: string (the priority band)
* `inference_pool`: string
* `model_name`: string
* `target_model_name`: string
* **Release Stage:** ALPHA
* **Description:** Current number of requests actively held in the Flow Control queue.
* **Usage:** Tracks queue depth per priority band and tenant. A steadily growing value indicates the dispatch rate is lower than the arrival rate.

### `flow_control_queue_bytes`

* **Type:** Gauge
* **Labels:**
* `fairness_id`: string (the tenant or flow identifier)
* `priority`: string (the priority band)
* `inference_pool`: string
* `model_name`: string
* `target_model_name`: string
* **Release Stage:** ALPHA
* **Description:** Current total size in bytes of requests actively held in the Flow Control queue.
* **Usage:** Tracks memory pressure from queued requests. Compare against the configured `maxBytes` capacity to gauge how close a band is to rejecting new requests.

### `flow_control_pool_saturation`

* **Type:** Gauge
* **Labels:**
* `inference_pool`: string
* **Release Stage:** ALPHA
* **Description:** Current saturation level of the inference pool (0.0 = empty, 1.0 = fully saturated).
* **Usage:** When saturation reaches the usage limit threshold, the dispatch cycle skips dispatching and requests remain queued. Sustained 1.0 indicates all backends are at capacity.


## Opt-in ext_proc Stream Metrics

Three metrics covering ext_proc gRPC stream lifecycle. Disabled by default; enable with `--enable-grpc-stream-metrics`. These metrics are emitted under the `llm_d_epp_` prefix (separate from `llm_d_inference_scheduler_*`).
Expand Down
Loading