Add minimal support for llm-d manifest generation #274

Open
jgchn wants to merge 7 commits into llm-d-incubation:main from jgchn:feat/llm-d-config-generation

@jgchn jgchn commented Jun 24, 2026

Collaborator

Adds the ability to generate llm-d stack deployment configs as an alternative to standalone vLLM (KServe InferenceService) configs.

When users select "llm-d" as the deployment stack, the planner generates:

  • Kustomize overlay (kustomization.yaml + patch-vllm.yaml) for model server deployment, referencing the llm-d base manifests at guides/recipes/modelserver/base/single-host/default/
  • Helm values (values.yaml) for the EPP (Endpoint Picker Pod) + InferencePool, using the standalone chart from oci://registry.k8s.io/gateway-api-inference-extension/charts/standalone

The default router config uses the standard EPP plugins: prefix-cache-scorer, decode-filter, max-score-picker, single-profile-handler.

Key design decisions:

  • Not a new DeploymentMode — vLLM vs llm-d is an output format choice, orthogonal to simulator/production mode
  • Separate generator class (LlmdDeploymentGenerator) since the output shape is structurally different from KServe InferenceService
  • No routing intelligence yet — uses default EPP config; routing profile selection is future work
  • Choice at generation time via ?stack=llm-d query param on /api/v1/deploy
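
A minimal sketch of how the stack choice could stay orthogonal to the deployment mode. Only the two `stack` values and the three llm-d output names (`kustomization`, `patch_vllm`, `helm_values`) come from this PR; the function name, the `inference_service` key, and the placeholder file contents are assumptions for illustration.

```python
from typing import Literal, get_args

# The two stack values accepted by the `stack` query parameter in this PR.
Stack = Literal["vllm", "llm-d"]


def generate_files(stack: str, model_id: str) -> dict[str, str]:
    """Return placeholder file contents keyed by output name (hypothetical helper)."""
    if stack not in get_args(Stack):
        raise ValueError(f"unsupported stack: {stack!r}")
    if stack == "llm-d":
        # llm-d output: a kustomize overlay (two files) plus Helm values.
        return {
            "kustomization": "kind: Kustomization\n",
            "patch_vllm": f"# patch for {model_id}\n",
            "helm_values": "# EPP + InferencePool values\n",
        }
    # Standalone vLLM output: a single KServe InferenceService manifest.
    return {"inference_service": "kind: InferenceService\n"}
```

The explicit membership check stands in for the validation that a `Literal`-typed query parameter provides at the API layer.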

How Has This Been Tested?

  • 17 new unit tests covering generator output structure, kustomization content, patch content, helm values, API endpoint routing, and YAML injection prevention
  • Full unit test suite passes (361 tests)
  • Manual UI testing: toggling between vLLM and llm-d stacks correctly generates and displays the appropriate files
  • Verified model_id input validation prevents YAML injection attacks
cd src && uv run pytest ../tests/unit/test_llmd_generator.py -v  # 17 pass
cd src && uv run pytest ../tests/unit/ -v                        # 361 pass
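
The `model_id` check mentioned above could look roughly like the sketch below. The regex actually used in the PR is not shown in this conversation, so the pattern and the function name here are assumptions.

```python
import re

# Assumed pattern: "name" or "org/name" built from letters, digits, '.', '_' and '-'.
# The real regex in LlmdDeploymentGenerator._prepare_context() may differ.
_MODEL_ID_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*(?:/[A-Za-z0-9][A-Za-z0-9._-]*)?")


def validate_model_id(model_id: str) -> str:
    """Return model_id unchanged if it is safe to interpolate into a YAML template."""
    # fullmatch (rather than match with '$') also rejects a trailing newline.
    if not _MODEL_ID_RE.fullmatch(model_id):
        raise ValueError(f"invalid model_id: {model_id!r}")
    return model_id
```

Rejecting newlines, colons, and spaces outright keeps a crafted `model_id` from adding keys to the rendered manifest.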

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

jgchn added 6 commits June 23, 2026 14:27
Remove old disaggregated templates (scheduler.values, patch-decode,
patch-prefill, objectives, deploy.sh) and replace with unified
kustomization + patch + helm values that generate a single deployment
topology with decode-only mode.

New templates:
- kustomization.yaml.j2: references llm-d base recipe, applies namePrefix
  and labels
- patch-vllm.yaml.j2: patches decode deployment with model, replicas,
  tensor_parallel, and GPU resources
- values.yaml.j2: helm values for llm-d-inference-scheduler with
  inferencePool selector

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Jing Chen <jing.chen2@ibm.com>
Simplified LlmdDeploymentGenerator to render the three new llm-d templates:
- kustomization.yaml.j2 (kustomize base reference + patches)
- patch-vllm.yaml.j2 (vLLM deployment with model/GPU/replica config)
- values.yaml.j2 (EPP + InferencePool Helm values)

Removed complex routing topology logic and BLIS-specific features.
Now focused on core model serving parameters: model_id, gpu_count,
tensor_parallel, replicas.

Tests rewritten to validate:
- 3 output files (kustomization, patch_vllm, helm_values)
- Valid YAML rendering for all files
- Template variable substitution (model_id, tensor_parallel, replicas, etc.)
- EPP config structure in Helm values

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Jing Chen <jing.chen2@ibm.com>
Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Jing Chen <jing.chen2@ibm.com>
Add regex validation for model_id in LlmdDeploymentGenerator._prepare_context()
to prevent YAML injection attacks. Change stack parameter in /api/v1/deploy
endpoint from str to Literal["vllm", "llm-d"] for automatic validation.

Add test_invalid_model_id_raises to verify ValueError is raised for malicious
model_id formats.

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Jing Chen <jing.chen2@ibm.com>
Add radio button in deployment tab to choose between vLLM (standalone) and
llm-d (router + pool) deployment stacks. The selected stack is passed as a
query parameter to the backend API and determines which YAML files are
generated and displayed.

Changes:
- Add stack radio selector in deployment.py before YAML generation
- Update deploy_and_generate_yaml() in api_client.py to accept stack param
- Pass stack from session state when selecting recommendations
- Display correct YAML file labels based on selected stack

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Jing Chen <jing.chen2@ibm.com>
When the user switches between vLLM and llm-d, clear the previously
generated YAML so it regenerates with the correct stack. Without this,
switching to llm-d after selecting a recommendation would show no files
because the stored YAML had vllm keys but the display expected llm-d keys.

Signed-off-by: Jing Chen <jing.chen2@ibm.com>
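
A minimal sketch of the stale-YAML fix described in the last commit above. A plain mapping stands in for the UI session state, and the key names `stack` and `generated_yaml` are assumptions, not the names used in `deployment.py`.

```python
from typing import Any, MutableMapping


def select_stack(state: MutableMapping[str, Any], new_stack: str) -> None:
    """Record the chosen stack and drop YAML generated for a different one."""
    if state.get("stack") != new_stack:
        # Stale files carry the other stack's keys, so force regeneration.
        state.pop("generated_yaml", None)
    state["stack"] = new_stack
```

Re-selecting the same stack leaves the stored YAML in place, so only a real switch triggers regeneration.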
@jgchn jgchn requested review from amito and anfredette and removed request for amito June 24, 2026 01:45

@amito amito left a comment

Collaborator

Hi Jing,

Thank you for this important work.
Please see my comments.

Amit

Comment thread src/planner/api/routes/configuration.py Outdated
Comment thread src/planner/api/routes/configuration.py Outdated
Comment thread src/planner/api/routes/configuration.py Outdated
detail=f"Generated YAML validation failed: {str(e)}",
) from e
if stack == "llm-d":
result = llmd_generator.generate_all(

Collaborator

YAML validation is skipped for llm-d.

Comment thread src/planner/configuration/templates/llmd/patch-vllm.yaml.j2 Outdated
Comment thread src/planner/configuration/llmd_generator.py Outdated
image:
  registry: ghcr.io
  repository: llm-d/llm-d-inference-scheduler
  tag: v0.8.0

Collaborator

Will we pin this version per release of llm-d-planner?

Collaborator Author

Good point. For now I'm just using the latest available versions of the core components. We probably want to record somewhere which pinned versions we generate.

Comment thread src/planner/configuration/templates/llmd/kustomization.yaml.j2 Outdated
)


class TestLlmdGeneratorOutput:

Collaborator

Tests are missing marks (@pytest.mark.unit etc.).

Collaborator Author

Done.

Comment thread tests/unit/test_llmd_generator.py Outdated
@jgchn jgchn requested a review from amito June 26, 2026 13:36

@amito amito left a comment

Collaborator

Hi Jing,
Great work, thanks for addressing my comments.
Please address these few minor comments - mostly around testing.
Thanks,
Amit

Comment on lines +123 to +124
gpu_count=2,
tensor_parallel=2,

Collaborator

Setting different values for gpu_count and tensor_parallel here would help catch regressions.
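
To illustrate the point, the sketch below uses a deliberately buggy, hypothetical template that renders `gpu_count` where `tensor_parallel` belongs. The template text and helper names are assumptions, not the contents of `patch-vllm.yaml.j2`.

```python
from string import Template

# Hypothetical templates: the buggy one uses gpu_count for the tensor-parallel flag.
BUGGY_PATCH = Template("args: [--tensor-parallel-size=$gpu_count]\n")
FIXED_PATCH = Template("args: [--tensor-parallel-size=$tensor_parallel]\n")


def render(template: Template, gpu_count: int, tensor_parallel: int) -> str:
    """Render a patch; substitute() ignores keywords the template does not use."""
    return template.substitute(gpu_count=gpu_count, tensor_parallel=tensor_parallel)


def uses_tensor_parallel(template: Template, gpu_count: int, tensor_parallel: int) -> bool:
    """The check a unit test would make on the rendered patch."""
    expected = f"--tensor-parallel-size={tensor_parallel}]"
    return expected in render(template, gpu_count, tensor_parallel)
```

With `gpu_count=2, tensor_parallel=2` the buggy template passes the check; with `gpu_count=4, tensor_parallel=2` it fails, which is the regression the test should catch.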

resources:
  requests:
-   nvidia.com/gpu: "{{ gpu_count }}"
+   nvidia.com/gpu: "{{ gpus_per_replica }}"

Collaborator

Do we still need to populate gpu_count now that it's not used in the templates?

Collaborator Author

Done

Comment thread tests/unit/test_llmd_generator.py Outdated
"""Create a test client with mocked app state (no DB required)."""
app = FastAPI()
# Mock app state without requiring DB connection
app.state.deployment_generator = DeploymentGenerator(simulator_mode=False)

Collaborator

This still constructs several objects that are not mocked and that touch the real disk. For example, L#54 in src/planner/configuration/generator.py runs self._catalog = ModelCatalog(), and other code paths create output directories.
Maybe this whole thing can be mocked?

Collaborator Author

Done
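
One way to mock the whole generator, sketched with the standard library only. The `DeploymentGenerator` below is a stand-in whose constructor fails on purpose, `SimpleNamespace` stands in for `app.state`, and the `generate_all` signature and return shape are assumptions; the fix actually made in the PR is not shown in this conversation.

```python
from types import SimpleNamespace
from unittest.mock import create_autospec


class DeploymentGenerator:
    """Stand-in for the real generator, whose constructor touches the disk."""

    def __init__(self, simulator_mode: bool = False) -> None:
        raise RuntimeError("real constructor would load the catalog and create dirs")

    def generate_all(self, recommendation: dict) -> dict:
        raise NotImplementedError


def make_mocked_state() -> SimpleNamespace:
    """Build an app-state stand-in whose generator never runs __init__."""
    # create_autospec copies the method signatures without constructing the class,
    # so nothing on disk is read or created.
    generator = create_autospec(DeploymentGenerator, instance=True)
    generator.generate_all.return_value = {"kustomization": "kind: Kustomization\n"}
    return SimpleNamespace(deployment_generator=generator)
```

Because the mock is autospecced, calling `generate_all` with the wrong arguments still raises `TypeError`, so the test keeps checking the call shape.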

kind: Kustomization

resources:
# TODO: pin to release tag (e.g. ?ref=v0.1.0) per llm-d-planner release

Collaborator

We need to consider automating this in the release workflow.

Collaborator Author

Agreed. As a follow-up item, when we cut a release, the CI/release workflow should inject a pinned ref tag for both the kustomize base resource and the EPP image tag in values.yaml.j2. For now it tracks latest.
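
A rough sketch of the kind of helper a release workflow could use to inject the pinned ref into the kustomize base resource. The function name and the example URL are hypothetical; only the `?ref=<tag>` convention comes from the TODO above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def pin_ref(resource_url: str, tag: str) -> str:
    """Return resource_url with its `ref` query parameter set to tag."""
    parts = urlsplit(resource_url)
    # Drop any existing ref, keep other query parameters, then append the pin.
    query = [(key, value) for key, value in parse_qsl(parts.query) if key != "ref"]
    query.append(("ref", tag))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

The same tag could then be written into the EPP image tag in `values.yaml.j2`, so both references move together at release time.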

@amito amito left a comment

Collaborator

Hi Jing,
Thanks for addressing all of my comments, this looks great.
Please squash the commits when merging, either into one commit or into a few grouped by topic (feature, tests, ...), each with a short commit message.
Thanks :)
Amit

amito commented Jun 30, 2026

Collaborator

I see that only some commits have verified signatures; it would be better to squash locally and sign the new set of commits.

Signed-off-by: Jing Chen <jing.chen2@ibm.com>
@jgchn jgchn force-pushed the feat/llm-d-config-generation branch from 0a4c969 to e550453 Compare June 30, 2026 13:21

2 participants