Parameterize inference-perf workload profiles#1516

Open
kaushikmitr wants to merge 3 commits into
llm-d:mainfrom
kaushikmitr:update-inference-perf-profiles-params

@kaushikmitr kaushikmitr commented Jun 17, 2026

Summary

Makes the num_requests, concurrency_level, seed, and num_conversations
settings in two inference-perf profiles configurable instead of hardcoded, using
the project's existing REPLACE_ENV_* template-token mechanism, and exposes them
as llmdbenchmark run CLI flags (--num-requests, --concurrency, --seed)
for per-run sweeps.

Motivation

The agentic_code_generation and guide_predicted-latency-routing_1 profiles
had load parameters baked in (and the guide profile carried ${CONCURRENCY}
placeholders that were never actually expanded). An initial attempt used
shell-style ${NUM_REQUESTS} / ${CONCURRENCY} / ${SEED} placeholders, but:

  • the profile renderer only substitutes REPLACE_ENV_* tokens
    (llmdbenchmark/utilities/profile_renderer.py), and
  • the harness wrapper (workload/harnesses/inference-perf-llm-d-benchmark.sh)
    does not run envsubst.

So those ${...} strings would have reached inference-perf literally, where
integers are required. This PR switches to the supported token convention so the
values are resolved at render time.
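
The failure mode described above can be illustrated with a minimal sketch. It assumes a renderer that matches only `REPLACE_ENV_<NAME>` tokens; the regex, the `render` function, and the sample profile text are hypothetical and are not taken from `profile_renderer.py`.

```python
import re

# Hypothetical token pattern; the real renderer's matching rules may differ.
TOKEN_RE = re.compile(r"REPLACE_ENV_([A-Z0-9_]+)")


def render(text: str, env: dict[str, str]) -> str:
    """Replace REPLACE_ENV_<NAME> tokens; everything else passes through."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        # Tokens without a value are left in place rather than blanked out.
        return env.get(name, match.group(0))

    return TOKEN_RE.sub(substitute, text)


profile = (
    "num_requests: REPLACE_ENV_LLMDBENCH_RUN_NUM_REQUESTS\n"
    "concurrency_level: ${CONCURRENCY}\n"
)
rendered = render(profile, {"LLMDBENCH_RUN_NUM_REQUESTS": "192"})
print(rendered)
```

In this sketch the supported token becomes `192`, while the shell-style `${CONCURRENCY}` survives rendering unchanged, which is the string that would then reach inference-perf.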

Changes

  • llmdbenchmark/utilities/profile_renderer.py
    • Added a default field to TokenDef so profiles render to valid integers
      even when a config omits the value.
    • Registered three tokens — LLMDBENCH_RUN_NUM_REQUESTS,
      LLMDBENCH_RUN_CONCURRENCY, LLMDBENCH_RUN_SEED — resolved from
      experiment.numRequests / experiment.concurrency / experiment.seed,
      with defaults 192 / 32 / 42.
    • Updated build_env_map to apply the registered default when no
      config/runtime value resolves.
  • config/templates/values/defaults.yaml — documented the new knobs under
    the experiment: section.
  • workload/profiles/inference-perf/agentic_code_generation.yaml.in and
    workload/profiles/inference-perf/guide_predicted-latency-routing_1.yaml
    replaced the hardcoded values / unexpanded ${...} placeholders with the new
    REPLACE_ENV_LLMDBENCH_RUN_* tokens.
  • llmdbenchmark/interface/run.py, llmdbenchmark/cli.py,
    llmdbenchmark/executor/context.py,
    llmdbenchmark/run/steps/step_05_render_profiles.py — added
    --num-requests / --concurrency / --seed flags (env:
    LLMDBENCH_NUM_REQUESTS / LLMDBENCH_CONCURRENCY / LLMDBENCH_SEED) to the
    run subcommand, threaded through ExecutionContext, and fed into the
    renderer's runtime_values so they override config/profile defaults.
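
A rough sketch of the default-fallback idea in the first bullet, under stated assumptions: the token names, config paths, and defaults come from the description above, but the shape of `TokenDef`, the `lookup` helper, and the `build_env_map` signature are guesses for illustration and may not match the real module.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class TokenDef:
    # Dotted path into the config, e.g. "experiment.seed" (assumed field name).
    config_path: str
    # Fallback used when the config omits the value.
    default: Optional[str] = None


PROFILE_TOKENS = {
    "LLMDBENCH_RUN_NUM_REQUESTS": TokenDef("experiment.numRequests", "192"),
    "LLMDBENCH_RUN_CONCURRENCY": TokenDef("experiment.concurrency", "32"),
    "LLMDBENCH_RUN_SEED": TokenDef("experiment.seed", "42"),
}


def lookup(config: dict, dotted: str) -> Any:
    """Walk nested dicts along a dotted path; return None on any miss."""
    node: Any = config
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def build_env_map(config: dict) -> dict[str, str]:
    """Resolve every registered token from config, else from its default."""
    env: dict[str, str] = {}
    for name, token in PROFILE_TOKENS.items():
        value = lookup(config, token.config_path)
        if value is None:
            value = token.default
        if value is not None:
            env[name] = str(value)
    return env


env_map = build_env_map({"experiment": {"concurrency": 50}})
print(env_map)
```

Here only `concurrency` is configured, so the other two tokens fall back to their registered defaults and the profile can still render to integers.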

How to use

Pass the knobs directly on the run command (highest precedence):

llmdbenchmark \
    --spec           guides/predicted-latency-routing \
    run \
    --endpoint-url   "${ENDPOINT_URL}" \
    --gateway-class  "${GATEWAY_CLASS}" \
    --model          "Qwen/Qwen3-32B" \
    --namespace      "${NAMESPACE}" \
    --harness        inference-perf \
    --workload       guide_predicted-latency-routing_1.yaml \
    --num-requests   500 \
    --concurrency    50 \
    --seed           7 \
    --analyze

Or set them in an experiment/plan config:

experiment:
  numRequests: 500
  concurrency: 50
  seed: 7

Precedence is CLI flag > experiment.* config > token default (192 / 32
/ 42). Omitting them everywhere falls back to the defaults.
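
The precedence rule can be read as "first source that supplies a value wins". The sketch below is only an illustration of that ordering; the `resolve` function and the dict-based inputs are hypothetical stand-ins for however the run step actually merges its sources.

```python
# Defaults as stated in this PR description.
DEFAULTS = {"numRequests": 192, "concurrency": 32, "seed": 42}


def resolve(knob: str, cli: dict, config: dict) -> int:
    """Return the knob value: CLI flag > experiment.* config > token default."""
    for source in (cli, config.get("experiment", {})):
        value = source.get(knob)
        if value is not None:
            return int(value)
    return DEFAULTS[knob]


cli_values = {"concurrency": 50}  # e.g. from --concurrency 50
config = {"experiment": {"concurrency": 16, "seed": 7}}
resolved = {knob: resolve(knob, cli_values, config) for knob in DEFAULTS}
print(resolved)
```

In this example `concurrency` comes from the CLI, `seed` from the config, and `numRequests` from the default.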

Testing

Rendered both profiles through build_env_map + render_profile:

  • With defaults, num_requests and concurrency_level parse as YAML
    integers (192, 32) — not strings.
  • With an experiment override, the values flow through (500 / 50 / 7).
  • No literal ${...} or stray REPLACE_ENV tokens remain (only the
    model/endpoint tokens that are resolved at runtime, as before).

Replace hardcoded load stages and conversation_replay settings in the
agentic_code_generation and guide_predicted-latency-routing profiles with
${NUM_REQUESTS}, ${CONCURRENCY_LEVEL}/${CONCURRENCY}, and ${SEED}
environment variables so the profiles can be driven dynamically.

Signed-off-by: Kaushik Mitra <kaushikmitra@google.com>
@kaushikmitr kaushikmitr force-pushed the update-inference-perf-profiles-params branch from 8d0dbb2 to 47e51c1 Compare June 17, 2026 19:32

Copilot AI left a comment

Contributor

Pull request overview

This PR aims to make the inference-perf workload profiles configurable at runtime by replacing hardcoded load.stages and conversation_replay values with environment-variable placeholders (e.g. ${NUM_REQUESTS}, ${CONCURRENCY_LEVEL} / ${CONCURRENCY}, ${SEED}).

Changes:

  • Replaced hardcoded num_requests / concurrency_level values with ${…} placeholders in two inference-perf profiles.
  • Replaced fixed conversation_replay.seed and num_conversations values with ${…} placeholders.
  • Collapsed agentic_code_generation from a multi-stage sweep to a single stage.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| workload/profiles/inference-perf/guide_predicted-latency-routing_1.yaml | Parameterizes stage sizing and conversation replay seed via ${…} placeholders. |
| workload/profiles/inference-perf/agentic_code_generation.yaml.in | Replaces the prior multi-stage sweep with a single stage and introduces ${…} placeholders for stage sizing and replay settings. |

Comment thread workload/profiles/inference-perf/agentic_code_generation.yaml.in Outdated
Comment thread workload/profiles/inference-perf/agentic_code_generation.yaml.in Outdated
Comment thread workload/profiles/inference-perf/agentic_code_generation.yaml.in Outdated
Comment thread workload/profiles/inference-perf/guide_predicted-latency-routing_1.yaml Outdated
Comment thread workload/profiles/inference-perf/guide_predicted-latency-routing_1.yaml Outdated
The ${NUM_REQUESTS}/${CONCURRENCY}/${SEED} placeholders were never
expanded: the profile renderer only substitutes REPLACE_ENV_* tokens and
the inference-perf harness wrapper does not run envsubst, so literal
${...} strings reached inference-perf where integers are required.

Switch the agentic_code_generation and guide_predicted-latency-routing
profiles to the supported REPLACE_ENV_LLMDBENCH_RUN_{NUM_REQUESTS,
CONCURRENCY,SEED} tokens, register them in PROFILE_TOKENS (resolved from
experiment.{numRequests,concurrency,seed} with integer defaults), add a
default fallback to TokenDef so profiles render to valid integers even
without config, and document the knobs in experiment defaults.

Signed-off-by: Kaushik Mitra <kaushikmitra@google.com>
Expose the inference-perf load knobs as 'llmdbenchmark run' CLI flags
(--num-requests, --concurrency, --seed; env LLMDBENCH_NUM_REQUESTS,
LLMDBENCH_CONCURRENCY, LLMDBENCH_SEED) so a single profile can be swept
per-invocation without editing config.

The flags thread through ExecutionContext into step_05's runtime_values,
which take precedence over experiment.* config and the REPLACE_ENV token
defaults (CLI flag > config > default).

Signed-off-by: Kaushik Mitra <kaushikmitra@google.com>

@Vezio Vezio left a comment

Collaborator

Thanks for this flexibility! I see all major CI checks passing except the linter, which fails because of a broken link in an md file. I'm OK if we merge around that and fix it in a separate PR.

Vezio commented Jun 18, 2026

Collaborator

@kaushikmitr on a second look, these flags appear to be harness-specific. Could you add the ability to scope the CLI options to that specific harness, so that they are not confusing for other harnesses that don't use them?

Something like "if inference-perf, then <available CLI options>" would be sufficient.

I'm not sure whether this would just be more work, since we already have the --overrides option as part of run.

For example:

-o/--overrides "key=value,key=value,..." is a per-field override that accepts comma-separated key=value pairs. These are applied as a single inline "treatment" against the selected profile in step_05_render_profiles.py. It is useful when you want to keep the profile but tweak a few knobs, which is what I think you're after, right @kaushikmitr?

llmdbenchmark --spec guides/pd-disaggregation run \
  --workload guide_pd-disaggregation_1.yaml \
  --overrides "load.stages.0.rate=10,load.stages.0.duration=60" \
  --endpoint-url "$ENDPOINT_URL" --gateway-class epponly
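
To make the "comma-separated key=value pairs" format concrete, here is a minimal parsing sketch. The `parse_overrides` function is hypothetical and is not the project's parser; in particular, it assumes values never contain commas and keeps every value as a string.

```python
def parse_overrides(spec: str) -> dict[str, str]:
    """Split "k1=v1,k2=v2" into a dict, rejecting items without '='."""
    pairs: dict[str, str] = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {item!r}")
        pairs[key.strip()] = value.strip()
    return pairs


overrides = parse_overrides("load.stages.0.rate=10,load.stages.0.duration=60")
print(overrides)
```

Each resulting key is a dotted path into the profile, which is why the way those paths are walked matters for fields that live inside lists.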

@Vezio Vezio left a comment

Collaborator

^ See above ^

@maugustosilva

Collaborator

@kaushikmitr first of all, thanks for the contribution. I can see the need for the "operational convenience" of altering a single parameter in a scenario or profile via a CLI parameter. That being said, I believe it is important, for usability and consistency, to avoid, where possible, parameters that are specific to one of the many harnesses we have to support. As @Vezio mentioned in his last comment, we already have a "universal" --overrides option for this exact purpose, and I recommend we use it instead. I see three main upsides here: a) more consistent usability, b) far less code for us to maintain, and c) it achieves exactly what you (very appropriately) aimed for.

@kaushikmitr

Author

Thanks for the thoughtful review, and I fully agree with the goal — consistency across harnesses and less surface area to maintain is the right north star. A couple of things make the pure --overrides route harder than it looks for this particular case, and I want to lay them out so we can decide together.

  1. --overrides can't currently address the values in question.
    The inference-perf load knobs live under a list: load.stages[0].num_requests / load.stages[0].concurrency_level. The current apply_overrides only walks dict keys, so --overrides load.stages.0.num_requests=500 is silently ignored (I verified this — it parses, matches nothing, and the profile renders unchanged). So --overrides doesn't actually reach these fields today; making it work requires extending apply_overrides to support numeric list indices. I'm happy to do that — it's harness-agnostic and benefits everyone — but I want to flag that "use the existing knob" isn't zero-change here.

  2. The bigger issue: seed must vary with concurrency in conversation_replay sweeps — and that coupling can't be expressed as a static override.
    This is the part I'd really like to preserve. conversation_replay deterministically generates the synthetic conversations from seed. In a concurrency sweep, if the seed is held constant across concurrency points, every point replays the identical set of prompts/conversations. Because the prompts are byte-for-byte identical (shared system prompt + the same generated turns), vLLM's automatic prefix caching serves cached KV for those shared prefixes, so the higher-concurrency runs get artificially inflated cache-hit rates and depressed TTFT that would never happen under organic traffic. The net effect is that the concurrency levels are no longer comparable — the cache contamination muddles the outcome. That's exactly why the profile originally tied the seed to concurrency (seed: ${CONCURRENCY}): each concurrency point needs an independent prompt set to stay cache-clean and statistically valid. Likewise num_conversations must track concurrency_level.

A flat --overrides collapses all of this into independent scalar assignments and pushes the responsibility for keeping seed, concurrency_level, and num_conversations mutually consistent onto whoever writes the invocation — on a long, per-profile dotted path (it's data.conversation_replay.seed in one profile, tokenizer.data.conversation_replay.seed in the other). A mistake there doesn't error; it silently corrupts the benchmark.
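
For point 1, one possible shape for list-aware overrides is sketched below. This is an assumption-laden illustration, not the project's `apply_overrides`: the `apply_override` and `_child_key` names are hypothetical, numeric path segments are treated as list indices, and a path that matches nothing raises instead of being silently ignored. It is also stricter than a real implementation might want, since it only sets fields that already exist.

```python
from typing import Any


def _child_key(node: Any, segment: str, path: str) -> Any:
    """Translate one path segment into a valid key or index for `node`."""
    if isinstance(node, list):
        if not segment.isdigit() or int(segment) >= len(node):
            raise KeyError(f"{path}: no list element {segment!r}")
        return int(segment)
    if isinstance(node, dict):
        if segment not in node:
            raise KeyError(f"{path}: no key {segment!r}")
        return segment
    raise KeyError(f"{path}: cannot descend into a scalar at {segment!r}")


def apply_override(profile: Any, path: str, value: Any) -> None:
    """Set one existing field in place; fail loudly when nothing matches."""
    *parents, leaf = path.split(".")
    node = profile
    for segment in parents:
        node = node[_child_key(node, segment, path)]
    node[_child_key(node, leaf, path)] = value


profile = {"load": {"stages": [{"num_requests": 192, "concurrency_level": 32}]}}
apply_override(profile, "load.stages.0.num_requests", 500)
print(profile)
```

Raising on an unmatched path is a deliberate choice in this sketch: it turns the silent no-op described in point 1 into an immediate error.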

Vezio commented Jun 22, 2026

Collaborator

@kaushikmitr Thanks for that! Ah, okay. We are not actively testing --overrides in CI/CD, so I suppose we had a major regression there. We should absolutely fix this; thank you for spotting it for us.

So, assuming this is fixed, would that be sufficient for your needs, or are you suggesting it would not be?

kaushikmitr commented Jun 22, 2026

Author

Yes, once --overrides can reach list elements (load.stages[0].*), that's fully sufficient for our needs. There is no need for the harness-specific flags; I'll drop them and rely on the universal --overrides. I will just update the guides to override the relevant variables (num_conversations, num_requests, seed, concurrency):

https://github.com/llm-d/llm-d/tree/main/guides/predicted-latency-routing
https://github.com/llm-d/llm-d/blob/main/guides/agentic-serving/qwen3-coder-480b-tpu.md
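
Because the seed, concurrency_level, and num_conversations values need to stay mutually consistent across a sweep, a guide could generate the override string rather than have it typed by hand. The helper below is a hypothetical sketch: `coupled_overrides` is not part of llmdbenchmark, it assumes seed and num_conversations are simply set equal to the concurrency level, it assumes num_conversations sits next to seed under the same conversation_replay prefix, and it presumes --overrides can address list elements.

```python
def coupled_overrides(
    concurrency: int,
    replay_prefix: str = "data.conversation_replay",
) -> str:
    """Build one --overrides value with seed and num_conversations tied to concurrency."""
    fields = {
        "load.stages.0.concurrency_level": concurrency,
        # A distinct seed per concurrency point gives each point its own prompt set.
        f"{replay_prefix}.seed": concurrency,
        f"{replay_prefix}.num_conversations": concurrency,
    }
    return ",".join(f"{key}={value}" for key, value in fields.items())


for level in (8, 16, 32):
    print(coupled_overrides(level))
```

The `replay_prefix` argument exists because, as noted earlier in the thread, the dotted path to conversation_replay differs between the two profiles.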

4 participants