Performance optimizations for the vLLM render API in precise prefix-cache aware routing for multimodal workloads

**What would you like to be added**:
Performance optimizations for precise prefix-cache aware routing, specifically aimed at reducing the overhead and latency introduced by the vLLM render API during tokenizer loads for multimodal workloads.

**Why is this needed**:

While evaluating precise prefix-cache aware routing with multimodal workloads, the functionality works as intended: the expected prefix cache hits are observed. However, the vLLM render API is a significant bottleneck. The latency it adds causes overall latency spikes and throughput degradation.

During benchmarking (8 vLLM replicas on H200 GPUs with the [multimodal optimized-baseline workloads](https://github.com/llm-d/llm-d/blob/main/guides/multimodal/optimized-baseline/benchmark-templates/guide.yaml)), the default 30s `vllm.mmTimeout` resulted in tokenproducer timeout errors at 30 and 35 RPS. Even after increasing the timeout to 120s to avoid these errors, the model server metrics show that the render API overhead remains a primary bottleneck compared to approximate routing. For details, see https://github.com/llm-d/llm-d-router/issues/1289#issuecomment-4654415540

Reducing the vLLM render API overhead in precise prefix-cache aware routing is necessary to make it performant and viable at scale, especially for high-throughput multimodal deployments.
