Performance optimizations for vLLM render API in precise prefix-cache aware routing for multimodal workloads #1552

Description

@capri-xiyue

What would you like to be added:
Performance optimizations for precise prefix-cache aware routing, specifically aimed at reducing the overhead and latency introduced by the vLLM render API during tokenizer loads for multimodal workloads.

Why is this needed:

In our evaluation of precise prefix-cache aware routing with multimodal workloads, the functionality works as intended (expected prefix cache hits are observed). However, the vLLM render API is a significant bottleneck: the latency it adds causes end-to-end latency spikes and throughput degradation.

During benchmarking (8 vLLM replicas on H200 GPUs with the multimodal optimized-baseline workloads), the default 30s vllm.mmTimeout resulted in tokenproducer timeout errors at 30 and 35 RPS. Even after increasing the timeout to 120s to avoid those errors, the model server metrics show that the render API overhead remains a primary bottleneck compared to approximate routing. See details in #1289 (comment).

Reducing the vLLM render API overhead for precise prefix-cache aware routing is necessary to make it performant and viable at scale, especially for high-throughput multimodal deployments.

Labels: [mm] Multimodal, needs-triage
