diff --git a/docs/metrics.md b/docs/metrics.md
index 5b968f8c65..0bd916804a 100644
--- a/docs/metrics.md
+++ b/docs/metrics.md
@@ -42,6 +42,78 @@ Metrics defined by llm-d Router are in addition to Inference Gateway metrics. Fo
 > This metric is maintained for backward compatibility with the deprecated
 > `pd-profile-handler`. New deployments should use `disagg_decision_total`.
 
+## Flow Control Metrics
+
+Exposed when the `flowControl` feature gate is enabled. All carry the `llm_d_epp_` prefix.
+
+### `flow_control_request_queue_duration_seconds`
+
+* **Type:** Histogram
+* **Labels:**
+  * `fairness_id`: string (the tenant or flow identifier for fairness rotation)
+  * `priority`: string (the priority band, e.g., "0", "10")
+  * `outcome`: string (`Dispatched`, `RejectedCapacity`, `RejectedOther`, `EvictedTTL`, `EvictedContextCancelled`, `EvictedOther`)
+  * `inference_pool`: string
+  * `model_name`: string
+  * `target_model_name`: string
+* **Release Stage:** ALPHA
+* **Description:** Total time a request spends in the Flow Control layer, from enqueue to final outcome.
+* **Usage:** Primary latency signal for flow control. Rising p99 indicates backends are saturated or capacity limits are too tight.
+
+### `flow_control_dispatch_cycle_duration_seconds`
+
+* **Type:** Histogram
+* **Release Stage:** ALPHA
+* **Description:** Time taken for each internal dispatch cycle.
+* **Usage:** Measures the overhead of the dispatch loop itself. Rising values indicate increasing cost per cycle from saturation detection, priority band iteration, or fairness evaluation.
+
+### `flow_control_request_enqueue_duration_seconds`
+
+* **Type:** Histogram
+* **Labels:**
+  * `fairness_id`: string (the tenant or flow identifier)
+  * `priority`: string (the priority band)
+  * `outcome`: string
+* **Release Stage:** ALPHA
+* **Description:** Time taken to enqueue a request into the Flow Control layer.
+* **Usage:** Measures the time spent in capacity checks and queue insertion within the processor.
+
+### `flow_control_queue_size`
+
+* **Type:** Gauge
+* **Labels:**
+  * `fairness_id`: string (the tenant or flow identifier)
+  * `priority`: string (the priority band)
+  * `inference_pool`: string
+  * `model_name`: string
+  * `target_model_name`: string
+* **Release Stage:** ALPHA
+* **Description:** Current number of requests actively held in the Flow Control queue.
+* **Usage:** Tracks queue depth per priority band and tenant. A steadily growing value indicates the dispatch rate is lower than the arrival rate.
+
+### `flow_control_queue_bytes`
+
+* **Type:** Gauge
+* **Labels:**
+  * `fairness_id`: string (the tenant or flow identifier)
+  * `priority`: string (the priority band)
+  * `inference_pool`: string
+  * `model_name`: string
+  * `target_model_name`: string
+* **Release Stage:** ALPHA
+* **Description:** Current total size in bytes of requests actively held in the Flow Control queue.
+* **Usage:** Tracks memory pressure from queued requests. Compare against the configured `maxBytes` capacity to gauge how close a band is to rejecting new requests.
+
+### `flow_control_pool_saturation`
+
+* **Type:** Gauge
+* **Labels:**
+  * `inference_pool`: string
+* **Release Stage:** ALPHA
+* **Description:** Current saturation level of the inference pool (0.0 = empty, 1.0 = fully saturated).
+* **Usage:** When saturation reaches the usage limit threshold, the dispatch cycle skips dispatching and requests remain queued. Sustained 1.0 indicates all backends are at capacity.
+
+
 ## Opt-in ext_proc Stream Metrics
 
 Three metrics covering ext_proc gRPC stream lifecycle. Disabled by default; enable with `--enable-grpc-stream-metrics`. These metrics are emitted under the `llm_d_epp_` prefix (separate from `llm_d_inference_scheduler_*`).