This repository was archived by the owner on Oct 15, 2025. It is now read-only.

Getting 504 HTTP connection timeout errors for requests waiting more than 300 s #339

Description

@pallavijaini0525

Component

Other

Describe the bug

Deployed the llm-d setup for a Llama 3.1 70B model with 1 prefill instance, 1 decode instance, and Redis as the cache server. All components are up.

Triggered the benchmark test using:

```bash
python3 benchmark_serving.py \
  --port 80 \
  --seed $(date +%s) \
  --host llm-d-inference-gateway.llm-d.svc.cluster.local \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tokenizer /models/hub/models--meta-llama--Llama-3.1-70B-Instruct/snapshots/1605565b47bb9346c5515c34102e054115b4f98b \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 256 \
  --num-prompts 256 \
  --request-rate 3.6 \
  --metric-percentiles 95 \
  --burstiness 100 \
  --backend openai \
  --endpoint /v1/completions \
  --ignore-eos | tee benchmark_1P1D_withPD.log
```
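To put `--num-prompts 256`, `--request-rate 3.6`, and `--burstiness 100` in perspective, the sketch below generates the client's request submission schedule. It assumes `benchmark_serving.py` is vLLM's benchmark script, which (to our understanding) draws the gap after each request from a gamma distribution with shape `burstiness` and mean `1 / request_rate`. The helper `arrival_times` is hypothetical and not part of that script.

```python
import random


def arrival_times(num_prompts: int, request_rate: float, burstiness: float,
                  seed: int = 0) -> list:
    """Return the submission offset (seconds) of each request.

    Assumption: mirrors the arrival process of vLLM's benchmark_serving.py,
    where each gap ~ Gamma(shape=burstiness, scale=1/(request_rate*burstiness)),
    giving a mean gap of 1/request_rate.
    """
    rng = random.Random(seed)
    scale = 1.0 / (request_rate * burstiness)
    t = 0.0
    times = []
    for _ in range(num_prompts):
        times.append(t)
        t += rng.gammavariate(burstiness, scale)  # (shape, scale)
    return times


times = arrival_times(num_prompts=256, request_rate=3.6, burstiness=100)
span = times[-1] - times[0]
print(f"all {len(times)} requests submitted within {span:.1f} s")
```

With these settings the whole load is offered in roughly 256 / 3.6 ≈ 71 s, and a burstiness of 100 makes the gaps nearly uniform (about 0.28 s each). The waits of more than 300 s therefore occur after the client has already stopped submitting new requests.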

EPP log entries for one of the requests that failed with 504:

```
Line 6980: {"level":"Level(-4)","ts":"2025-06-25T01:05:33Z","caller":"scheduling/scheduler.go:176","msg":"Running scorer","x-request-id":"02e75c5c-a0cd-4fa3-8ee4-5284bcdb7c6a","request":"TargetModel: meta-llama/Llama-3.1-70B-Instruct, Critical: false, PromptLength: 12404, Headers: map[:authority:llm-d-inference-gateway.llm-d.svc.cluster.local :method:POST :path:/v1/completions :scheme:http accept:*/* accept-encoding:gzip, deflate authorization:Bearer None content-length:13174 content-type:application/json user-agent:Python/3.10 aiohttp/3.11.18 x-envoy-external-address:100.68.248.44 x-forwarded-for:100.68.248.44 x-forwarded-proto:http x-request-id:02e75c5c-a0cd-4fa3-8ee4-5284bcdb7c6a]","scorer":"load-aware-scorer"}

Line 10805: {"level":"Level(-4)","ts":"2025-06-25T01:10:33Z","caller":"requestcontrol/director.go:150","msg":"LLM response assembled","x-request-id":"02e75c5c-a0cd-4fa3-8ee4-5284bcdb7c6a","response":{"RequestId":"02e75c5c-a0cd-4fa3-8ee4-5284bcdb7c6a","Headers":{":status":"504","content-length":"14","content-type":"text/plain"},"Body":"","IsStreaming":false,"EndOfStream":false}}

Line 10806: {"level":"error","ts":"2025-06-25T01:10:33Z","caller":"handlers/server.go:290","msg":"Error unmarshaling request body","x-request-id":"02e75c5c-a0cd-4fa3-8ee4-5284bcdb7c6a","error":"invalid character 's' looking for beginning of value","stacktrace":"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/handlers.(*StreamingServer).Process\n\t/go/pkg/mod/sigs.k8s.io/gateway-api-inference-extension@v0.0.0-20250515212313-6e8a2effa41c/pkg/epp/handlers/server.go:290\ngithub.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3._ExternalProcessor_Process_Handler\n\t/go/pkg/mod/github.com/envoyproxy/go-control-plane/envoy@v1.32.4/service/ext_proc/v3/external_processor_grpc.pb.go:106\ngoogle.golang.org/grpc.(*Server).processStreamingRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.72.0/server.go:1695\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.72.0/server.go:1819\ngoogle.golang.org/grpc.(*Server).serveStreams.func2.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.72.0/server.go:1035"}
```

Searching the EPP log for "504" gives 338 hits. For this request, the scorer ran at 01:05:33 and the 504 response (`content-type: text/plain`, `content-length: 14`) was assembled at 01:10:33, exactly 300 s later. Immediately afterwards, the EPP logged `Error unmarshaling request body`.
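A text search for "504" also matches any line that merely contains that substring, so the 338 hits are not necessarily the number of failed requests. The sketch below is a hypothetical helper, not part of llm-d. It assumes the EPP log holds one JSON entry per line, groups entries by `x-request-id`, and reports each request's final `:status` together with the time from its first EPP entry to the "LLM response assembled" entry.

```python
import json
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    # EPP timestamps look like "2025-06-25T01:05:33Z"; datetime.fromisoformat
    # only accepts a trailing "Z" from Python 3.11, so normalize it first.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def summarize(lines):
    """Map x-request-id -> (HTTP status, seconds from first entry to response)."""
    first_seen = {}
    summary = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # not a JSON log entry
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or malformed entry
        rid = entry.get("x-request-id")
        if rid is None or "ts" not in entry:
            continue
        ts = parse_ts(entry["ts"])
        first_seen.setdefault(rid, ts)  # earliest entry seen for this request
        if entry.get("msg") == "LLM response assembled":
            headers = entry.get("response", {}).get("Headers", {})
            elapsed = (ts - first_seen[rid]).total_seconds()
            summary[rid] = (headers.get(":status"), elapsed)
    return summary


# Abridged copies of the two EPP entries quoted in this issue.
sample = [
    '{"ts":"2025-06-25T01:05:33Z","msg":"Running scorer",'
    '"x-request-id":"02e75c5c-a0cd-4fa3-8ee4-5284bcdb7c6a"}',
    '{"ts":"2025-06-25T01:10:33Z","msg":"LLM response assembled",'
    '"x-request-id":"02e75c5c-a0cd-4fa3-8ee4-5284bcdb7c6a",'
    '"response":{"Headers":{":status":"504"}}}',
]
summary = summarize(sample)
timeouts = [rid for rid, (status, _) in summary.items() if status == "504"]
print(summary)
print(f"{len(timeouts)} request(s) ended with 504")
```

On the attached log this would run as `with open("70b_updated_600_epp.log") as f: summary = summarize(f)`. Elapsed time is measured from the first EPP entry seen for each request ID, which may be earlier than the "Running scorer" entry.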

Verified the HTTPRoute: `timeouts` is set to `0s` for both `request` and `backendRequest`.
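The sketch below reads HTTPRoute rules from a parsed manifest (a plain dict) and converts the Gateway API `timeouts.request` and `timeouts.backendRequest` duration strings to seconds. It is a hypothetical helper, not part of llm-d or kgateway. It assumes the Gateway API convention that a zero duration such as `0s` disables the timeout, and reports that case as infinity. An unset field is reported as `None`, meaning the implementation's default applies.

```python
import math
import re

# Gateway API duration: 1-4 groups of up to 5 digits plus a unit (h, m, s, ms).
_DURATION = re.compile(r"(?:\d{1,5}(?:ms|h|m|s)){1,4}")
_PART = re.compile(r"(\d{1,5})(ms|h|m|s)")  # "ms" first so it is not read as "m"
_UNIT_MS = {"h": 3_600_000, "m": 60_000, "s": 1_000, "ms": 1}


def parse_duration(value: str) -> float:
    """Convert a duration string such as "1h30m" or "500ms" to seconds."""
    if not _DURATION.fullmatch(value):
        raise ValueError(f"not a Gateway API duration: {value!r}")
    total_ms = sum(int(n) * _UNIT_MS[unit] for n, unit in _PART.findall(value))
    return total_ms / 1000


def route_timeouts(route: dict) -> list:
    """Effective per-rule timeouts in seconds.

    math.inf -> explicitly disabled ("0s"); None -> unset (implementation default).
    """
    rows = []
    for rule in route.get("spec", {}).get("rules", []):
        timeouts = rule.get("timeouts", {})
        row = {}
        for key in ("request", "backendRequest"):
            raw = timeouts.get(key)
            if raw is None:
                row[key] = None
            else:
                seconds = parse_duration(raw)
                row[key] = math.inf if seconds == 0 else seconds
        rows.append(row)
    return rows


# Hypothetical route mirroring the settings described in this issue.
route = {"spec": {"rules": [{"timeouts": {"request": "0s", "backendRequest": "0s"}}]}}
print(route_timeouts(route))
```

If both fields really are `0s`, neither route-level timeout should end a request at 300 s. That is consistent with looking for a separate idle timeout elsewhere in the proxy path.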

I was not able to update the Envoy config to change the idle timeout. I also tried updating the CRDs and applying the HTTP connection settings from https://kgateway.dev/docs/resiliency/connection/#http, with no luck.

Attached logs:

- 70b_updated_600_epp.log
- 70b_updated_600_routing_proxy.log
- 70b_updated_600_decode.log
- 70b_updated_600_prefill.log


Steps to reproduce

Same deployment and benchmark command as described under "Describe the bug" above.

Additional context or screenshots

No response

Metadata

- Assignees: No one assigned
- Labels: No labels
- Type: Bug (no fields configured)
- Projects: No projects
- Milestone: No milestone
- Relationships: None yet
- Development: No branches or pull requests