Getting 504 HTTP connection timeout errors for requests waiting for > 300 sec #339

### Component

Other

### Describe the bug

I deployed the llm-d setup for a Llama 3.1 70B model with 1 prefill instance, 1 decode instance, and Redis as the cache server. All components are up.

I triggered the benchmark test using:

`python3 benchmark_serving.py --port 80 --seed $(date +%s) --host llm-d-inference-gateway.llm-d.svc.cluster.local --model meta-llama/Llama-3.1-70B-Instruct --tokenizer /models/hub/models--meta-llama--Llama-3.1-70B-Instruct/snapshots/1605565b47bb9346c5515c34102e054115b4f98b --dataset-name random --random-input-len 2048 --random-output-len 256 --num-prompts 256 --request-rate 3.6 --metric-percentiles 95 --burstiness 100 --backend openai --endpoint /v1/completions --ignore-eos | tee benchmark_1P1D_withPD.log`

The EPP logs for one example request that failed with 504 show the following (the `Line N:` prefixes are line numbers within the log file):

`Line 6980: {"level":"Level(-4)","ts":"2025-06-25T01:05:33Z","caller":"scheduling/scheduler.go:176","msg":"Running scorer","x-request-id":"02e75c5c-a0cd-4fa3-8ee4-5284bcdb7c6a","request":"TargetModel: meta-llama/Llama-3.1-70B-Instruct, Critical: false, PromptLength: 12404, Headers: map[:authority:llm-d-inference-gateway.llm-d.svc.cluster.local :method:POST :path:/v1/completions :scheme:http accept:*/* accept-encoding:gzip, deflate authorization:Bearer None content-length:13174 content-type:application/json user-agent:Python/3.10 aiohttp/3.11.18 x-envoy-external-address:100.68.248.44 x-forwarded-for:100.68.248.44 x-forwarded-proto:http x-request-id:02e75c5c-a0cd-4fa3-8ee4-5284bcdb7c6a]","scorer":"load-aware-scorer"}`

`Line 10805: {"level":"Level(-4)","ts":"2025-06-25T01:10:33Z","caller":"requestcontrol/director.go:150","msg":"LLM response assembled","x-request-id":"02e75c5c-a0cd-4fa3-8ee4-5284bcdb7c6a","response":{"RequestId":"02e75c5c-a0cd-4fa3-8ee4-5284bcdb7c6a","Headers":{":status":"504","content-length":"14","content-type":"text/plain"},"Body":"","IsStreaming":false,"EndOfStream":false}}`

`Line 10806: {"level":"error","ts":"2025-06-25T01:10:33Z","caller":"handlers/server.go:290","msg":"Error unmarshaling request body","x-request-id":"02e75c5c-a0cd-4fa3-8ee4-5284bcdb7c6a","error":"invalid character 's' looking for beginning of value","stacktrace":"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/handlers.(*StreamingServer).Process\n\t/go/pkg/mod/sigs.k8s.io/gateway-api-inference-extension@v0.0.0-20250515212313-6e8a2effa41c/pkg/epp/handlers/server.go:290\ngithub.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3._ExternalProcessor_Process_Handler\n\t/go/pkg/mod/github.com/envoyproxy/go-control-plane/envoy@v1.32.4/service/ext_proc/v3/external_processor_grpc.pb.go:106\ngoogle.golang.org/grpc.(*Server).processStreamingRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.72.0/server.go:1695\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.72.0/server.go:1819\ngoogle.golang.org/grpc.(*Server).serveStreams.func2.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.72.0/server.go:1035"}`

A search for "504" in the log file returned 338 hits.
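The two timestamps for this request (`01:05:33Z` when the scorer ran, `01:10:33Z` when the 504 response was assembled) are exactly 300 seconds apart. To check whether the other failing requests show the same gap, a small script along these lines could be run over the EPP log. This is only a sketch: the helper name `seconds_until_504` is hypothetical, the sample entries are trimmed copies of the log lines above, and it assumes each line of the raw log is a single JSON object with the `ts`, `x-request-id`, and `response.Headers.":status"` fields shown in the excerpt.

```python
import json
from datetime import datetime

# Two EPP log entries for the same request, trimmed from the excerpt above.
SAMPLE_LINES = [
    '{"ts":"2025-06-25T01:05:33Z","msg":"Running scorer",'
    '"x-request-id":"02e75c5c-a0cd-4fa3-8ee4-5284bcdb7c6a"}',
    '{"ts":"2025-06-25T01:10:33Z","msg":"LLM response assembled",'
    '"x-request-id":"02e75c5c-a0cd-4fa3-8ee4-5284bcdb7c6a",'
    '"response":{"Headers":{":status":"504"}}}',
]


def parse_ts(value):
    # Timestamps in the excerpt are UTC with a trailing "Z".
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def seconds_until_504(lines):
    """Map each x-request-id to the seconds between its first log entry
    and the entry that carries a 504 response status."""
    first_seen = {}
    durations = {}
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log entry
        if not isinstance(entry, dict):
            continue
        request_id = entry.get("x-request-id")
        if request_id is None or "ts" not in entry:
            continue
        ts = parse_ts(entry["ts"])
        first_seen.setdefault(request_id, ts)
        response = entry.get("response")
        if isinstance(response, dict):
            status = response.get("Headers", {}).get(":status")
            if status == "504":
                delta = ts - first_seen[request_id]
                durations[request_id] = delta.total_seconds()
    return durations


if __name__ == "__main__":
    for request_id, seconds in seconds_until_504(SAMPLE_LINES).items():
        print(f"{request_id}: {seconds:.0f} s")
```

If most failing requests come out at 300 s, that would suggest a fixed timeout somewhere in the request path rather than a slow backend, although the script itself cannot say which component enforces it.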

I verified the HTTPRoute timeouts: both `request` and `backendRequest` are set to `0s`.

I was not able to change the idle timeout in the Envoy config. I also tried updating the CRDs and creating the resource described at https://kgateway.dev/docs/resiliency/connection/#http, with no luck.

[70b_updated_600_epp.log](https://github.com/user-attachments/files/20914628/70b_updated_600_epp.log)
[70b_updated_600_routing_proxy.log](https://github.com/user-attachments/files/20914630/70b_updated_600_routing_proxy.log)
[70b_updated_600_decode.log](https://github.com/user-attachments/files/20914627/70b_updated_600_decode.log)
[70b_updated_600_prefill.log](https://github.com/user-attachments/files/20914629/70b_updated_600_prefill.log)

[70b_updated_600_epp.log](https://github.com/user-attachments/files/20914658/70b_updated_600_epp.log)
[70b_updated_600_routing_proxy.log](https://github.com/user-attachments/files/20914657/70b_updated_600_routing_proxy.log)
[70b_updated_600_decode.log](https://github.com/user-attachments/files/20914656/70b_updated_600_decode.log)
[70b_updated_600_prefill.log](https://github.com/user-attachments/files/20914659/70b_updated_600_prefill.log)



### Steps to reproduce

Deploy the llm-d setup for a Llama 3.1 70B model with 1 prefill instance, 1 decode instance, and Redis as the cache server, and confirm that all components are up.

Trigger the benchmark test using:

`python3 benchmark_serving.py --port 80 --seed $(date +%s) --host llm-d-inference-gateway.llm-d.svc.cluster.local --model meta-llama/Llama-3.1-70B-Instruct --tokenizer /models/hub/models--meta-llama--Llama-3.1-70B-Instruct/snapshots/1605565b47bb9346c5515c34102e054115b4f98b --dataset-name random --random-input-len 2048 --random-output-len 256 --num-prompts 256 --request-rate 3.6 --metric-percentiles 95 --burstiness 100 --backend openai --endpoint /v1/completions --ignore-eos | tee benchmark_1P1D_withPD.log`

### Additional context or screenshots

_No response_
