Component
Other
Describe the bug
Deployed the llm-d setup for a Llama 3.1 70B model with 1 prefill instance, 1 decode instance, and Redis as the cache server. All components are up.
Triggered the benchmark with:
python3 benchmark_serving.py --port 80 --seed $(date +%s) --host llm-d-inference-gateway.llm-d.svc.cluster.local --model meta-llama/Llama-3.1-70B-Instruct --tokenizer /models/hub/models--meta-llama--Llama-3.1-70B-Instruct/snapshots/1605565b47bb9346c5515c34102e054115b4f98b --dataset-name random --random-input-len 2048 --random-output-len 256 --num-prompts 256 --request-rate 3.6 --metric-percentiles 95 --burstiness 100 --backend openai --endpoint /v1/completions --ignore-eos | tee benchmark_1P1D_withPD.log
EPP log entries for one example request that failed with a 504:
`Line 6980: {"level":"Level(-4)","ts":"2025-06-25T01:05:33Z","caller":"scheduling/scheduler.go:176","msg":"Running scorer","x-request-id":"02e75c5c-a0cd-4fa3-8ee4-5284bcdb7c6a","request":"TargetModel: meta-llama/Llama-3.1-70B-Instruct, Critical: false, PromptLength: 12404, Headers: map[:authority:llm-d-inference-gateway.llm-d.svc.cluster.local :method:POST :path:/v1/completions :scheme:http accept:/ accept-encoding:gzip, deflate authorization:Bearer None content-length:13174 content-type:application/json user-agent:Python/3.10 aiohttp/3.11.18 x-envoy-external-address:100.68.248.44 x-forwarded-for:100.68.248.44 x-forwarded-proto:http x-request-id:02e75c5c-a0cd-4fa3-8ee4-5284bcdb7c6a]","scorer":"load-aware-scorer"}
Line 10805: {"level":"Level(-4)","ts":"2025-06-25T01:10:33Z","caller":"requestcontrol/director.go:150","msg":"LLM response assembled","x-request-id":"02e75c5c-a0cd-4fa3-8ee4-5284bcdb7c6a","response":{"RequestId":"02e75c5c-a0cd-4fa3-8ee4-5284bcdb7c6a","Headers":{":status":"504","content-length":"14","content-type":"text/plain"},"Body":"","IsStreaming":false,"EndOfStream":false}}
Line 10806: {"level":"error","ts":"2025-06-25T01:10:33Z","caller":"handlers/server.go:290","msg":"Error unmarshaling request body","x-request-id":"02e75c5c-a0cd-4fa3-8ee4-5284bcdb7c6a","error":"invalid character 's' looking for beginning of value","stacktrace":"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/handlers.(*StreamingServer).Process\n\t/go/pkg/mod/sigs.k8s.io/gateway-api-inference-extension@v0.0.0-20250515212313-6e8a2effa41c/pkg/epp/handlers/server.go:290\ngithub.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3._ExternalProcessor_Process_Handler\n\t/go/pkg/mod/github.com/envoyproxy/go-control-plane/envoy@v1.32.4/service/ext_proc/v3/external_processor_grpc.pb.go:106\ngoogle.golang.org/grpc.(*Server).processStreamingRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.72.0/server.go:1695\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.72.0/server.go:1819\ngoogle.golang.org/grpc.(*Server).serveStreams.func2.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.72.0/server.go:1035"}
Search "504" (338 hits in 1 file of 1 searched) [Normal]
`
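The two entries quoted above for request `02e75c5c-a0cd-4fa3-8ee4-5284bcdb7c6a` are stamped 01:05:33Z and 01:10:33Z. A minimal sketch for measuring that gap across the whole EPP log follows. It assumes one JSON object per line with the `ts`, `x-request-id`, and `response.Headers[":status"]` fields seen in the excerpt; the function names and the trimmed sample lines are illustrative, not part of llm-d. It uses the first entry seen for each request ID as the start time, which may be slightly later than the actual arrival.

```python
import json
from datetime import datetime


def parse_entry(line):
    """Parse one log line, tolerating a leading backtick or 'Line N: ' prefix."""
    start = line.find("{")
    if start < 0:
        return None
    try:
        return json.loads(line[start:])
    except json.JSONDecodeError:
        return None


def seconds_to_504(lines):
    """Map x-request-id to seconds between its first log entry and its 504 response."""
    first_seen = {}
    elapsed = {}
    for line in lines:
        entry = parse_entry(line)
        if not entry or "x-request-id" not in entry or "ts" not in entry:
            continue
        rid = entry["x-request-id"]
        ts = datetime.strptime(entry["ts"], "%Y-%m-%dT%H:%M:%SZ")
        first_seen.setdefault(rid, ts)
        response = entry.get("response")
        status = None
        if isinstance(response, dict):
            status = response.get("Headers", {}).get(":status")
        if status == "504":
            elapsed[rid] = (ts - first_seen[rid]).total_seconds()
    return elapsed


# Sample lines trimmed from the excerpt above (only the fields the sketch reads).
sample = [
    'Line 6980: {"ts":"2025-06-25T01:05:33Z","msg":"Running scorer",'
    '"x-request-id":"02e75c5c-a0cd-4fa3-8ee4-5284bcdb7c6a"}',
    'Line 10805: {"ts":"2025-06-25T01:10:33Z","msg":"LLM response assembled",'
    '"x-request-id":"02e75c5c-a0cd-4fa3-8ee4-5284bcdb7c6a",'
    '"response":{"Headers":{":status":"504"}}}',
]

print(seconds_to_504(sample))
```

For the quoted request this gives 300 seconds. That would be consistent with a 5-minute timeout somewhere in the request path, although the logs alone do not show which component enforces it. To run it on the full log, pass an open file handle for the attached EPP log in place of `sample`.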
Verified that the httpRequests timeouts are set to 0s for both request and backendRequest.
I was not able to update the Envoy config to change the idle timeout. I also tried updating the CRDs and creating the configuration described at https://kgateway.dev/docs/resiliency/connection/#http, with no luck.
70b_updated_600_epp.log
70b_updated_600_routing_proxy.log
70b_updated_600_decode.log
70b_updated_600_prefill.log
Steps to reproduce
Deploy the llm-d setup for a Llama 3.1 70B model with 1 prefill instance, 1 decode instance, and Redis as the cache server, and wait until all components are up.
Run the benchmark with:
python3 benchmark_serving.py --port 80 --seed $(date +%s) --host llm-d-inference-gateway.llm-d.svc.cluster.local --model meta-llama/Llama-3.1-70B-Instruct --tokenizer /models/hub/models--meta-llama--Llama-3.1-70B-Instruct/snapshots/1605565b47bb9346c5515c34102e054115b4f98b --dataset-name random --random-input-len 2048 --random-output-len 256 --num-prompts 256 --request-rate 3.6 --metric-percentiles 95 --burstiness 100 --backend openai --endpoint /v1/completions --ignore-eos | tee benchmark_1P1D_withPD.log
Additional context or screenshots
No response