Parameterize inference-perf workload profiles#1516
Conversation
Replace hardcoded load stages and conversation_replay settings in the
agentic_code_generation and guide_predicted-latency-routing profiles with
${NUM_REQUESTS}, ${CONCURRENCY_LEVEL}/${CONCURRENCY}, and ${SEED}
environment variables so the profiles can be driven dynamically.
Signed-off-by: Kaushik Mitra <kaushikmitra@google.com>
8d0dbb2 to
47e51c1
Compare
There was a problem hiding this comment.
Pull request overview
This PR aims to make the inference-perf workload profiles configurable at runtime by replacing hardcoded load.stages and conversation_replay values with environment-variable placeholders (e.g. ${NUM_REQUESTS}, ${CONCURRENCY_LEVEL} / ${CONCURRENCY}, ${SEED}).
Changes:
- Replaced hardcoded
num_requests/concurrency_levelvalues with${…}placeholders in two inference-perf profiles. - Replaced fixed
conversation_replay.seedandnum_conversationsvalues with${…}placeholders. - Collapsed
agentic_code_generationfrom a multi-stage sweep to a single stage.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| workload/profiles/inference-perf/guide_predicted-latency-routing_1.yaml | Parameterizes stage sizing and conversation replay seed via ${…} placeholders. |
| workload/profiles/inference-perf/agentic_code_generation.yaml.in | Replaces the prior multi-stage sweep with a single stage and introduces ${…} placeholders for stage sizing and replay settings. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The ${NUM_REQUESTS}/${CONCURRENCY}/${SEED} placeholders were never
expanded: the profile renderer only substitutes REPLACE_ENV_* tokens and
the inference-perf harness wrapper does not run envsubst, so literal
${...} strings reached inference-perf where integers are required.
Switch the agentic_code_generation and guide_predicted-latency-routing
profiles to the supported REPLACE_ENV_LLMDBENCH_RUN_{NUM_REQUESTS,
CONCURRENCY,SEED} tokens, register them in PROFILE_TOKENS (resolved from
experiment.{numRequests,concurrency,seed} with integer defaults), add a
default fallback to TokenDef so profiles render to valid integers even
without config, and document the knobs in experiment defaults.
Signed-off-by: Kaushik Mitra <kaushikmitra@google.com>
Expose the inference-perf load knobs as 'llmdbenchmark run' CLI flags (--num-requests, --concurrency, --seed; env LLMDBENCH_NUM_REQUESTS, LLMDBENCH_CONCURRENCY, LLMDBENCH_SEED) so a single profile can be swept per-invocation without editing config. The flags thread through ExecutionContext into step_05's runtime_values, which take precedence over experiment.* config and the REPLACE_ENV token defaults (CLI flag > config > default). Signed-off-by: Kaushik Mitra <kaushikmitra@google.com>
Vezio
left a comment
There was a problem hiding this comment.
Thanks for this flexibility! I see all major CI passing - but the linter because of a broken link in a md file - I'm ok if we merge around that and in a separate PR fix that
|
@kaushikmitr on a second look - it looks like these are Something like I'm not sure if this is just going to be more work since we already do have the For example: -o/--overrides "key=value,key=value,..." is per-field override it accepts a comma-separated key=value pairs. These get applied as a single inline "treatment" against the selected profile at step_05_render_profiles.py. Useful when you want to keep the profile but tweak a few knobs. Which is what I think you're after, right @kaushikmitr ? |
|
@kaushikmitr first of all, thanks for the contribution. I can see the need for the "operational convenience" of altering a single parameter in a |
Thanks for the thoughtful review, and I fully agree with the goal — consistency across harnesses and less surface area to maintain is the right north star. A couple of things make the pure --overrides route harder than it looks for this particular case, and I want to lay them out so we can decide together.
A flat --overrides collapses all of this into independent scalar assignments and pushes the responsibility for keeping seed, concurrency_level, and num_conversations mutually consistent onto whoever writes the invocation — on a long, per-profile dotted path (it's data.conversation_replay.seed in one profile, tokenizer.data.conversation_replay.seed in the other). A mistake there doesn't error; it silently corrupts the benchmark. |
|
@kaushikmitr Thanks for that! Ah okay - we are not actively testing So assuming this is fixed, would that be sufficient to your needs, or are you suggesting that would not be sufficient? |
Yes, once --overrides can reach list elements (load.stages[0].*), that's fully sufficient for our needs. No need for the harness-specific flags; I'll drop them and rely on the universal --overrides. I will just update the guides to override the relevant variables (num_conversations, num_request, seed, concurrency): https://github.com/llm-d/llm-d/tree/main/guides/predicted-latency-routing |
Summary
Makes the
num_requests,concurrency_level,seed, andnum_conversationssettings in two inference-perf profiles configurable instead of hardcoded, using
the project's existing
REPLACE_ENV_*template-token mechanism, and exposes themas
llmdbenchmark runCLI flags (--num-requests,--concurrency,--seed)for per-run sweeps.
Motivation
The
agentic_code_generationandguide_predicted-latency-routing_1profileshad load parameters baked in (and the guide profile carried
${CONCURRENCY}placeholders that were never actually expanded). An initial attempt used
shell-style
${NUM_REQUESTS}/${CONCURRENCY}/${SEED}placeholders, but:REPLACE_ENV_*tokens(
llmdbenchmark/utilities/profile_renderer.py), andworkload/harnesses/inference-perf-llm-d-benchmark.sh)does not run
envsubst.So those
${...}strings would have reached inference-perf literally, whereintegers are required. This PR switches to the supported token convention so the
values are resolved at render time.
Changes
llmdbenchmark/utilities/profile_renderer.pydefaultfield toTokenDefso profiles render to valid integerseven when a config omits the value.
LLMDBENCH_RUN_NUM_REQUESTS,LLMDBENCH_RUN_CONCURRENCY,LLMDBENCH_RUN_SEED— resolved fromexperiment.numRequests/experiment.concurrency/experiment.seed,with defaults
192/32/42.build_env_mapto apply the registered default when noconfig/runtime value resolves.
config/templates/values/defaults.yaml— documented the new knobs underthe
experiment:section.workload/profiles/inference-perf/agentic_code_generation.yaml.inandworkload/profiles/inference-perf/guide_predicted-latency-routing_1.yaml—replaced the hardcoded values / unexpanded
${...}placeholders with the newREPLACE_ENV_LLMDBENCH_RUN_*tokens.llmdbenchmark/interface/run.py,llmdbenchmark/cli.py,llmdbenchmark/executor/context.py,llmdbenchmark/run/steps/step_05_render_profiles.py— added--num-requests/--concurrency/--seedflags (env:LLMDBENCH_NUM_REQUESTS/LLMDBENCH_CONCURRENCY/LLMDBENCH_SEED) to therunsubcommand, threaded throughExecutionContext, and fed into therenderer's
runtime_valuesso they override config/profile defaults.How to use
Pass the knobs directly on the run command (highest precedence):
llmdbenchmark \ --spec guides/predicted-latency-routing \ run \ --endpoint-url "${ENDPOINT_URL}" \ --gateway-class "${GATEWAY_CLASS}" \ --model "Qwen/Qwen3-32B" \ --namespace "${NAMESPACE}" \ --harness inference-perf \ --workload guide_predicted-latency-routing_1.yaml \ --num-requests 500 \ --concurrency 50 \ --seed 7 \ --analyzeOr set them in an experiment/plan config:
Precedence is CLI flag >
experiment.*config > token default (192/32/
42). Omitting them everywhere falls back to the defaults.Testing
Rendered both profiles through
build_env_map+render_profile:num_requestsandconcurrency_levelparse as YAMLintegers (
192,32) — not strings.experimentoverride, the values flow through (500/50/7).${...}or strayREPLACE_ENVtokens remain (only themodel/endpoint tokens that are resolved at runtime, as before).