Parameterize inference-perf workload profiles by kaushikmitr · Pull Request #1516 · llm-d/llm-d-benchmark

kaushikmitr · 2026-06-17T19:31:12Z

Summary

Makes the num_requests, concurrency_level, seed, and num_conversations
settings in two inference-perf profiles configurable instead of hardcoded, using
the project's existing REPLACE_ENV_* template-token mechanism, and exposes them
as llmdbenchmark run CLI flags (--num-requests, --concurrency, --seed)
for per-run sweeps.

Motivation

The agentic_code_generation and guide_predicted-latency-routing_1 profiles
had load parameters baked in (and the guide profile carried ${CONCURRENCY}
placeholders that were never actually expanded). An initial attempt used
shell-style ${NUM_REQUESTS} / ${CONCURRENCY} / ${SEED} placeholders, but:

the profile renderer only substitutes REPLACE_ENV_* tokens
(llmdbenchmark/utilities/profile_renderer.py), and
the harness wrapper (workload/harnesses/inference-perf-llm-d-benchmark.sh)
does not run envsubst.

So those ${...} strings would have reached inference-perf literally, where
integers are required. This PR switches to the supported token convention so the
values are resolved at render time.

Changes

llmdbenchmark/utilities/profile_renderer.py
- Added a default field to TokenDef so profiles render to valid integers
  even when a config omits the value.
- Registered three tokens — LLMDBENCH_RUN_NUM_REQUESTS,
  LLMDBENCH_RUN_CONCURRENCY, LLMDBENCH_RUN_SEED — resolved from
  experiment.numRequests / experiment.concurrency / experiment.seed,
  with defaults 192 / 32 / 42.
- Updated build_env_map to apply the registered default when no
  config/runtime value resolves.
config/templates/values/defaults.yaml — documented the new knobs under
the experiment: section.
workload/profiles/inference-perf/agentic_code_generation.yaml.in and
workload/profiles/inference-perf/guide_predicted-latency-routing_1.yaml —
replaced the hardcoded values / unexpanded ${...} placeholders with the new
REPLACE_ENV_LLMDBENCH_RUN_* tokens.
llmdbenchmark/interface/run.py, llmdbenchmark/cli.py,
llmdbenchmark/executor/context.py,
llmdbenchmark/run/steps/step_05_render_profiles.py — added
--num-requests / --concurrency / --seed flags (env:
LLMDBENCH_NUM_REQUESTS / LLMDBENCH_CONCURRENCY / LLMDBENCH_SEED) to the
run subcommand, threaded through ExecutionContext, and fed into the
renderer's runtime_values so they override config/profile defaults.
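The token-with-default behaviour listed above can be sketched roughly as follows. This is an illustration only, not the code in profile_renderer.py: the TokenDef fields, the lookup helper, and the signatures of build_env_map and render_profile are assumptions made for the example. Only the token names, config paths, and defaults (192 / 32 / 42) come from the description.

```python
import re
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class TokenDef:
    """Hypothetical shape; the real TokenDef may carry other fields."""
    name: str                      # token name without the REPLACE_ENV_ prefix
    config_path: str               # dotted path into the config
    default: Optional[Any] = None  # fallback when the config omits the value


PROFILE_TOKENS = [
    TokenDef("LLMDBENCH_RUN_NUM_REQUESTS", "experiment.numRequests", 192),
    TokenDef("LLMDBENCH_RUN_CONCURRENCY", "experiment.concurrency", 32),
    TokenDef("LLMDBENCH_RUN_SEED", "experiment.seed", 42),
]


def lookup(config: Mapping[str, Any], dotted: str) -> Any:
    """Walk a dotted path through nested mappings; None if any key is missing."""
    node: Any = config
    for key in dotted.split("."):
        if not isinstance(node, Mapping) or key not in node:
            return None
        node = node[key]
    return node


def build_env_map(config: Mapping[str, Any], tokens=PROFILE_TOKENS) -> dict:
    """Resolve each token from config, falling back to its registered default."""
    env = {}
    for tok in tokens:
        value = lookup(config, tok.config_path)
        if value is None:
            value = tok.default
        if value is not None:
            env[tok.name] = str(value)
    return env


TOKEN_RE = re.compile(r"REPLACE_ENV_([A-Z0-9_]+)")


def render_profile(text: str, env: Mapping[str, str]) -> str:
    """Substitute known REPLACE_ENV_* tokens; leave everything else untouched."""
    return TOKEN_RE.sub(lambda m: env.get(m.group(1), m.group(0)), text)
```

In this sketch a shell-style ${SEED} placeholder passes through render_profile unchanged, which mirrors the problem described under Motivation: only REPLACE_ENV_* tokens are substituted.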

How to use

Pass the knobs directly on the run command (highest precedence):

llmdbenchmark \
    --spec           guides/predicted-latency-routing \
    run \
    --endpoint-url   "${ENDPOINT_URL}" \
    --gateway-class  "${GATEWAY_CLASS}" \
    --model          "Qwen/Qwen3-32B" \
    --namespace      "${NAMESPACE}" \
    --harness        inference-perf \
    --workload       guide_predicted-latency-routing_1.yaml \
    --num-requests   500 \
    --concurrency    50 \
    --seed           7 \
    --analyze

Or set them in an experiment/plan config:

experiment:
  numRequests: 500
  concurrency: 50
  seed: 7

Precedence is CLI flag > experiment.* config > token default (192 / 32
/ 42). Omitting them everywhere falls back to the defaults.
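As a rough illustration of that precedence chain, the resolution for a single knob might look like the sketch below. The function name resolve_knob and the two dictionaries it takes are hypothetical; only the ordering (CLI flag, then experiment.* config, then default) and the default values are taken from the description above.

```python
from typing import Any, Mapping

# Token defaults as stated in the PR description.
DEFAULTS = {"numRequests": 192, "concurrency": 32, "seed": 42}


def resolve_knob(
    name: str,
    cli_values: Mapping[str, Any],
    experiment_config: Mapping[str, Any],
    defaults: Mapping[str, int] = DEFAULTS,
) -> int:
    """Return the effective value: CLI flag > experiment.* config > default."""
    cli = cli_values.get(name)
    if cli is not None:  # compare against None so an explicit 0 is honoured
        return int(cli)
    cfg = experiment_config.get(name)
    if cfg is not None:
        return int(cfg)
    return defaults[name]
```

Comparing against None rather than relying on truthiness matters for the seed, where 0 is a plausible explicit value.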

Testing

Rendered both profiles through build_env_map + render_profile:

With defaults, num_requests and concurrency_level parse as YAML
integers (192, 32) — not strings.
With an experiment override, the values flow through (500 / 50 / 7).
No literal ${...} or stray REPLACE_ENV tokens remain (only the
model/endpoint tokens that are resolved at runtime, as before).
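A check along these lines could be automated with something like the sketch below. It is not the test that was run for this PR: check_rendered is a hypothetical helper, and it uses regular expressions as a stand-in for parsing the rendered YAML, so it only catches the specific failure modes listed above.

```python
import re

SHELL_PLACEHOLDER = re.compile(r"\$\{[^}]*\}")
TOKEN = re.compile(r"REPLACE_ENV_[A-Z0-9_]+")
# Matches lines such as "  - num_requests: 192" or "    seed: 42".
INT_FIELD = re.compile(
    r"^[ \t]*(?:-[ \t]*)?"
    r"(num_requests|concurrency_level|seed|num_conversations):"
    r"[ \t]*(\S+)[ \t]*$",
    re.M,
)


def check_rendered(text: str, allowed_tokens=frozenset()) -> list:
    """Return a list of human-readable problems found in a rendered profile."""
    problems = []
    for m in SHELL_PLACEHOLDER.finditer(text):
        problems.append(f"unexpanded placeholder {m.group(0)}")
    for m in TOKEN.finditer(text):
        if m.group(0) not in allowed_tokens:
            problems.append(f"unresolved token {m.group(0)}")
    for m in INT_FIELD.finditer(text):
        field, value = m.groups()
        # A quoted "192" would load as a YAML string, so require bare digits.
        if not re.fullmatch(r"-?\d+", value):
            problems.append(f"{field} is not a bare integer: {value}")
    return problems
```

The allowed_tokens argument is where tokens that are intentionally resolved later at runtime would be listed, so that they are not reported as leftovers.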

Replace hardcoded load stages and conversation_replay settings in the agentic_code_generation and guide_predicted-latency-routing profiles with ${NUM_REQUESTS}, ${CONCURRENCY_LEVEL}/${CONCURRENCY}, and ${SEED} environment variables so the profiles can be driven dynamically. Signed-off-by: Kaushik Mitra <kaushikmitra@google.com>

Copilot

Pull request overview

This PR aims to make the inference-perf workload profiles configurable at runtime by replacing hardcoded load.stages and conversation_replay values with environment-variable placeholders (e.g. ${NUM_REQUESTS}, ${CONCURRENCY_LEVEL} / ${CONCURRENCY}, ${SEED}).

Changes:

Replaced hardcoded num_requests / concurrency_level values with ${…} placeholders in two inference-perf profiles.
Replaced fixed conversation_replay.seed and num_conversations values with ${…} placeholders.
Collapsed agentic_code_generation from a multi-stage sweep to a single stage.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File	Description
workload/profiles/inference-perf/guide_predicted-latency-routing_1.yaml	Parameterizes stage sizing and conversation replay seed via `${…}` placeholders.
workload/profiles/inference-perf/agentic_code_generation.yaml.in	Replaces the prior multi-stage sweep with a single stage and introduces `${…}` placeholders for stage sizing and replay settings.

The ${NUM_REQUESTS}/${CONCURRENCY}/${SEED} placeholders were never expanded: the profile renderer only substitutes REPLACE_ENV_* tokens and the inference-perf harness wrapper does not run envsubst, so literal ${...} strings reached inference-perf where integers are required. Switch the agentic_code_generation and guide_predicted-latency-routing profiles to the supported REPLACE_ENV_LLMDBENCH_RUN_{NUM_REQUESTS, CONCURRENCY,SEED} tokens, register them in PROFILE_TOKENS (resolved from experiment.{numRequests,concurrency,seed} with integer defaults), add a default fallback to TokenDef so profiles render to valid integers even without config, and document the knobs in experiment defaults. Signed-off-by: Kaushik Mitra <kaushikmitra@google.com>

Expose the inference-perf load knobs as 'llmdbenchmark run' CLI flags (--num-requests, --concurrency, --seed; env LLMDBENCH_NUM_REQUESTS, LLMDBENCH_CONCURRENCY, LLMDBENCH_SEED) so a single profile can be swept per-invocation without editing config. The flags thread through ExecutionContext into step_05's runtime_values, which take precedence over experiment.* config and the REPLACE_ENV token defaults (CLI flag > config > default). Signed-off-by: Kaushik Mitra <kaushikmitra@google.com>

Vezio

Thanks for this flexibility! All major CI checks are passing except the linter, which fails because of a broken link in an md file. I'm OK if we merge around that and fix it in a separate PR.

Vezio · 2026-06-18T13:34:40Z

@kaushikmitr on a second look, it looks like these are harness-specific. Could you try to add the ability to scope the CLI flags to that specific harness, so that it's not confusing for other harnesses that don't use them?

Something like if inferenceperf then <cli avail options> would be sufficient

I'm not sure if this is just going to be more work since we already do have the --overrides option as part of run

For example:

-o/--overrides "key=value,key=value,..." is a per-field override; it accepts comma-separated key=value pairs. These get applied as a single inline "treatment" against the selected profile in step_05_render_profiles.py. It is useful when you want to keep the profile but tweak a few knobs, which is what I think you're after, right @kaushikmitr?

llmdbenchmark --spec guides/pd-disaggregation run \
  --workload guide_pd-disaggregation_1.yaml \
  --overrides "load.stages.0.rate=10,load.stages.0.duration=60" \
  --endpoint-url "$ENDPOINT_URL" --gateway-class epponly
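To make the dotted-path format concrete, here is a minimal sketch of parsing such a string and applying it to a nested profile. It is not the project's apply_overrides, and it makes no claim about how that function treats list elements. The helper names, the type coercion, and the choice to raise on an unmatched path are all assumptions made for illustration. Numeric segments such as the 0 in load.stages.0.rate are treated as list indices.

```python
from typing import Any


def parse_overrides(spec: str) -> list[tuple[list[str], str]]:
    """Split "a.b=1,c.0.d=2" into [(["a", "b"], "1"), (["c", "0", "d"], "2")]."""
    pairs = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {item!r}")
        pairs.append((key.strip().split("."), value.strip()))
    return pairs


def coerce(value: str) -> Any:
    """Turn "10" into 10 and "0.5" into 0.5; leave other strings alone."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def apply_overrides(profile: Any, spec: str) -> None:
    """Apply each override in place, walking dict keys and list indices."""
    for path, raw in parse_overrides(spec):
        node = profile
        for i, seg in enumerate(path):
            if isinstance(node, list):
                if not seg.isdigit() or int(seg) >= len(node):
                    raise KeyError(".".join(path))
                key: Any = int(seg)
            elif isinstance(node, dict):
                if seg not in node:
                    raise KeyError(".".join(path))
                key = seg
            else:
                raise KeyError(".".join(path))
            if i == len(path) - 1:
                node[key] = coerce(raw)
            else:
                node = node[key]
```

Raising KeyError when a path matches nothing is a deliberate choice in this sketch: a typo in a long dotted path then fails loudly instead of leaving the profile unchanged. Values that themselves contain commas are not supported by this simple split.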

Vezio

^ See above ^

maugustosilva · 2026-06-18T13:46:51Z

@kaushikmitr first of all, thanks for the contribution. I can see the need for the "operational convenience" of altering a single parameter in a scenario or profile via a CLI parameter. That being said, I believe it is important, for usability and consistency, to avoid, if possible, parameters that are specific to one of the many harnesses we have to support. As @Vezio mentioned in his last comment, we do have a "universal" --overrides option for this exact purpose, and I recommend we use it instead. I see three main upsides here: a) it is more consistent in terms of usability, b) there is far less code for us to maintain, and c) it achieves exactly what you (very appropriately) aimed for.

kaushikmitr · 2026-06-18T21:32:46Z

Thanks for the thoughtful review, and I fully agree with the goal — consistency across harnesses and less surface area to maintain is the right north star. A couple of things make the pure --overrides route harder than it looks for this particular case, and I want to lay them out so we can decide together.

1. --overrides can't currently address the values in question.
   The inference-perf load knobs live under a list: load.stages[0].num_requests / load.stages[0].concurrency_level. The current apply_overrides only walks dict keys, so --overrides load.stages.0.num_requests=500 is silently ignored (I verified this: it parses, matches nothing, and the profile renders unchanged). So --overrides doesn't actually reach these fields today; making it work requires extending apply_overrides to support numeric list indices. I'm happy to do that, since it's harness-agnostic and benefits everyone, but I want to flag that "use the existing knob" isn't zero-change here.
2. The bigger issue: seed must vary with concurrency in conversation_replay sweeps, and that coupling can't be expressed as a static override.
   This is the part I'd really like to preserve. conversation_replay deterministically generates the synthetic conversations from seed. In a concurrency sweep, if the seed is held constant across concurrency points, every point replays the identical set of prompts/conversations. Because the prompts are byte-for-byte identical (shared system prompt + the same generated turns), vLLM's automatic prefix caching serves cached KV for those shared prefixes, so the higher-concurrency runs get artificially inflated cache-hit rates and depressed TTFT that would never happen under organic traffic. The net effect is that the concurrency levels are no longer comparable, because the cache contamination muddles the outcome. That's exactly why the profile originally tied the seed to concurrency (seed: ${CONCURRENCY}): each concurrency point needs an independent prompt set to stay cache-clean and statistically valid. Likewise, num_conversations must track concurrency_level.

A flat --overrides collapses all of this into independent scalar assignments and pushes the responsibility for keeping seed, concurrency_level, and num_conversations mutually consistent onto whoever writes the invocation — on a long, per-profile dotted path (it's data.conversation_replay.seed in one profile, tokenizer.data.conversation_replay.seed in the other). A mistake there doesn't error; it silently corrupts the benchmark.
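One way to keep those fields consistent is to generate the override string for each sweep point instead of writing it by hand. The sketch below is only an illustration under stated assumptions: coupled_overrides is a hypothetical helper, setting num_conversations equal to the concurrency is one reading of "must track", and the num_conversations path is assumed to sit next to seed under conversation_replay. The seed-equals-concurrency coupling and the two profile-specific prefixes come from the discussion above.

```python
def coupled_overrides(
    concurrency: int,
    num_requests: int,
    replay_prefix: str = "data.conversation_replay",
) -> str:
    """Build one --overrides string in which seed and num_conversations
    follow the concurrency level, so each sweep point gets its own prompt set."""
    if concurrency <= 0 or num_requests <= 0:
        raise ValueError("concurrency and num_requests must be positive")
    fields = {
        "load.stages.0.num_requests": num_requests,
        "load.stages.0.concurrency_level": concurrency,
        # Seed tied to concurrency, as in the original seed: ${CONCURRENCY}.
        f"{replay_prefix}.seed": concurrency,
        # Assumption: num_conversations simply equals the concurrency level.
        f"{replay_prefix}.num_conversations": concurrency,
    }
    return ",".join(f"{key}={value}" for key, value in fields.items())


def sweep(levels, num_requests, replay_prefix="data.conversation_replay"):
    """Return one override string per concurrency level."""
    return [coupled_overrides(c, num_requests, replay_prefix) for c in levels]
```

For the profile whose replay settings live under the other path, replay_prefix would be passed as "tokenizer.data.conversation_replay".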

Vezio · 2026-06-22T13:22:08Z

@kaushikmitr Thanks for that! Ah okay, we are not actively testing --overrides in CI/CD, so I suppose we had a major regression there. We should absolutely fix this; thank you for spotting that for us.

So assuming this is fixed, would that be sufficient for your needs, or are you suggesting that it would not be?

kaushikmitr · 2026-06-22T17:50:16Z

Yes, once --overrides can reach list elements (load.stages[0].*), that's fully sufficient for our needs. No need for the harness-specific flags; I'll drop them and rely on the universal --overrides. I will just update the guides to override the relevant variables (num_conversations, num_requests, seed, concurrency_level):

https://github.com/llm-d/llm-d/tree/main/guides/predicted-latency-routing
https://github.com/llm-d/llm-d/blob/main/guides/agentic-serving/qwen3-coder-480b-tpu.md

Copilot AI review requested due to automatic review settings June 17, 2026 19:31

kaushikmitr requested review from Vezio, achandrasekar, kalantar, maugustosilva, mengmeiye and namasl as code owners June 17, 2026 19:31

Copilot started reviewing on behalf of kaushikmitr June 17, 2026 19:31

kaushikmitr force-pushed the update-inference-perf-profiles-params branch from 8d0dbb2 to 47e51c1 Compare June 17, 2026 19:32

Copilot AI reviewed Jun 17, 2026

kaushikmitr added 2 commits June 17, 2026 19:46

Vezio approved these changes Jun 18, 2026

Vezio requested changes Jun 18, 2026

kaushikmitr wants to merge 3 commits into llm-d:main from kaushikmitr:update-inference-perf-profiles-params

4 participants
