Add minimal support for llm-d manifest generation#274
Remove old disaggregated templates (scheduler.values, patch-decode, patch-prefill, objectives, deploy.sh) and replace with unified kustomization + patch + helm values that generate a single deployment topology with decode-only mode.
New templates:
- kustomization.yaml.j2: references llm-d base recipe, applies namePrefix and labels
- patch-vllm.yaml.j2: patches decode deployment with model, replicas, tensor_parallel, and GPU resources
- values.yaml.j2: helm values for llm-d-inference-scheduler with inferencePool selector
Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Jing Chen <jing.chen2@ibm.com>
Simplified LlmdDeploymentGenerator to render the three new llm-d templates:
- kustomization.yaml.j2 (kustomize base reference + patches)
- patch-vllm.yaml.j2 (vLLM deployment with model/GPU/replica config)
- values.yaml.j2 (EPP + InferencePool Helm values)
Removed complex routing topology logic and BLIS-specific features. Now focused on core model serving parameters: model_id, gpu_count, tensor_parallel, replicas.
Tests rewritten to validate:
- 3 output files (kustomization, patch_vllm, helm_values)
- Valid YAML rendering for all files
- Template variable substitution (model_id, tensor_parallel, replicas, etc.)
- EPP config structure in Helm values
Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Jing Chen <jing.chen2@ibm.com>
Assisted-by: Claude <noreply@anthropic.com> Signed-off-by: Jing Chen <jing.chen2@ibm.com>
Add regex validation for model_id in LlmdDeploymentGenerator._prepare_context() to prevent YAML injection attacks. Change stack parameter in /api/v1/deploy endpoint from str to Literal["vllm", "llm-d"] for automatic validation. Add test_invalid_model_id_raises to verify ValueError is raised for malicious model_id formats. Assisted-by: Claude <noreply@anthropic.com> Signed-off-by: Jing Chen <jing.chen2@ibm.com>
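A minimal sketch of the validation this commit describes. The pattern, the `Stack` alias, and both helper names are assumptions made for illustration; the actual regex used in `LlmdDeploymentGenerator._prepare_context()` is not shown in this conversation and may differ.

```python
import re
from typing import Literal, get_args

# Hypothetical allow-list: "name" or "org/name", built only from letters,
# digits, '.', '_' and '-'. Anything else (spaces, colons, newlines) is rejected.
MODEL_ID_PATTERN = re.compile(
    r"[A-Za-z0-9][A-Za-z0-9._-]*(/[A-Za-z0-9][A-Za-z0-9._-]*)?"
)

# Mirrors the Literal["vllm", "llm-d"] annotation mentioned in the commit.
Stack = Literal["vllm", "llm-d"]


def validate_model_id(model_id: str) -> str:
    """Return model_id unchanged, or raise ValueError if it is not allow-listed."""
    # fullmatch requires the whole string to match, so an embedded newline
    # followed by extra YAML keys cannot slip through.
    if not MODEL_ID_PATTERN.fullmatch(model_id):
        raise ValueError(f"Invalid model_id: {model_id!r}")
    return model_id


def validate_stack(stack: str) -> str:
    """Manual equivalent of the check a Literal annotation gives an endpoint."""
    if stack not in get_args(Stack):
        raise ValueError(f"Unsupported stack: {stack!r}")
    return stack
```

With a `Literal` annotation on the endpoint parameter, the framework performs the second check itself, so `validate_stack` is only here to show what is being enforced.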
Add radio button in deployment tab to choose between vLLM (standalone) and llm-d (router + pool) deployment stacks. The selected stack is passed as a query parameter to the backend API and determines which YAML files are generated and displayed.
Changes:
- Add stack radio selector in deployment.py before YAML generation
- Update deploy_and_generate_yaml() in api_client.py to accept stack param
- Pass stack from session state when selecting recommendations
- Display correct YAML file labels based on selected stack
Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Jing Chen <jing.chen2@ibm.com>
When the user switches between vLLM and llm-d, clear the previously generated YAML so it regenerates with the correct stack. Without this, switching to llm-d after selecting a recommendation would show no files because the stored YAML had vllm keys but the display expected llm-d keys. Signed-off-by: Jing Chen <jing.chen2@ibm.com>
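The fix can be sketched roughly as follows. The state is modeled as a plain mapping, and the keys `stack` and `generated_yaml` as well as the function name are hypothetical; the UI code in this PR may store these values differently.

```python
from typing import Any, MutableMapping


def select_stack(state: MutableMapping[str, Any], new_stack: str) -> None:
    """Record the chosen stack and drop YAML that was generated for another one."""
    if state.get("stack") != new_stack:
        # Stale output holds the other stack's file keys, so the display
        # code would find nothing to show. Removing it forces regeneration.
        state.pop("generated_yaml", None)
    state["stack"] = new_stack
```

Re-selecting the same stack leaves the stored YAML in place, so regeneration happens only on an actual switch.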
amito left a comment
Hi Jing,
Thank you for this important work.
Please see my comments.
Amit
detail=f"Generated YAML validation failed: {str(e)}",
) from e
if stack == "llm-d":
    result = llmd_generator.generate_all(
YAML validation is skipped for llm-d.
image:
  registry: ghcr.io
  repository: llm-d/llm-d-inference-scheduler
  tag: v0.8.0
Will we pin this version per release of llm-d-planner?
Good point. Now I'm just using the latest available for the core components. We probably want to store the pinned versions we're generating somewhere.
)

class TestLlmdGeneratorOutput:
Tests are missing marks (@pytest.mark.unit etc.).
amito left a comment
Hi Jing,
Great work, thanks for addressing my comments.
Please address these few minor comments - mostly around testing.
Thanks,
Amit
gpu_count=2,
tensor_parallel=2,
Setting different values for gpu_count and tensor_parallel here would help catch regressions.
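A small sketch of why distinct values matter. `string.Template` stands in for the Jinja2 templates so the example needs only the standard library, and the `--tensor-parallel-size` line is an assumed example of what a template might render, not a quote from patch-vllm.yaml.j2.

```python
from string import Template

# Stand-in template lines (assumed, not taken from the PR's templates).
CORRECT = Template("--tensor-parallel-size=$tensor_parallel")
SWAPPED = Template("--tensor-parallel-size=$gpu_count")  # simulated regression


def render(template: Template, gpu_count: int, tensor_parallel: int) -> str:
    """Render one template line with both variables available."""
    return template.substitute(gpu_count=gpu_count, tensor_parallel=tensor_parallel)


def check_passes(template: Template, gpu_count: int, tensor_parallel: int) -> bool:
    """Return True if the rendered line carries the tensor_parallel value."""
    expected = f"--tensor-parallel-size={tensor_parallel}"
    return render(template, gpu_count, tensor_parallel) == expected


# Equal inputs: the swapped template renders the same text, so the bug is hidden.
hidden = check_passes(SWAPPED, gpu_count=2, tensor_parallel=2)
# Distinct inputs: the swapped template renders the wrong number and the check fails.
caught = not check_passes(SWAPPED, gpu_count=4, tensor_parallel=2)
```

With `gpu_count=4` and `tensor_parallel=2`, a template that reads the wrong variable no longer produces the expected text, which is the regression signal the comment asks for.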
  resources:
    requests:
-     nvidia.com/gpu: "{{ gpu_count }}"
+     nvidia.com/gpu: "{{ gpus_per_replica }}"
Do we still need to populate gpu_count now that it's not used in the templates?
"""Create a test client with mocked app state (no DB required)."""
app = FastAPI()
# Mock app state without requiring DB connection
app.state.deployment_generator = DeploymentGenerator(simulator_mode=False)
This still creates several objects that are not mocked and that touch real files on disk. For example, line 54 in src/planner/configuration/generator.py does self._catalog = ModelCatalog(), and other code creates output directories.
Maybe this whole thing can be mocked?
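One possible shape for a fully mocked state, sketched with `unittest.mock` only. The attribute name `llmd_generator` on the state object and the return values are assumptions; `SimpleNamespace` stands in for `app.state` so the sketch runs without FastAPI.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


def make_mock_state() -> SimpleNamespace:
    """Build a stand-in for app.state that never touches the disk or a DB."""
    # MagicMock replaces the real generators, so no ModelCatalog is
    # constructed and no output directories are created.
    deployment_generator = MagicMock(name="DeploymentGenerator")
    llmd_generator = MagicMock(name="LlmdDeploymentGenerator")
    # Hypothetical return shape, using the three output names listed
    # in the commit message (kustomization, patch_vllm, helm_values).
    llmd_generator.generate_all.return_value = {
        "kustomization": "kind: Kustomization\n",
        "patch_vllm": "kind: Deployment\n",
        "helm_values": "inferencePool: {}\n",
    }
    return SimpleNamespace(
        deployment_generator=deployment_generator,
        llmd_generator=llmd_generator,
    )
```

In a real test, the returned objects would be assigned onto `app.state` before creating the test client, and the endpoint test would then assert on how `generate_all` was called.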
kind: Kustomization

resources:
# TODO: pin to release tag (e.g. ?ref=v0.1.0) per llm-d-planner release
We need to consider automating this in the release workflow.
Agreed. As a follow-up item, when we cut a release, the CI/release workflow should inject a pinned ref tag for both the kustomize base resource and the EPP image tag in values.yaml.j2. For now it tracks latest.
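A rough sketch of what the injection step for the kustomize resource could look like. The function name and the example URL are hypothetical, and the sketch assumes `ref` is the only query parameter on the resource URL.

```python
import re


def pin_ref(resource_url: str, tag: str) -> str:
    """Return resource_url with its ?ref= query set to tag (added if missing)."""
    # Drop any existing ?ref=... so the release tag replaces it.
    base = re.sub(r"\?ref=[^&\s]*", "", resource_url)
    return f"{base}?ref={tag}"


# Placeholder URL for illustration only, not the base referenced by this PR.
unpinned = "https://github.com/example-org/example-repo//base/path"
pinned = pin_ref(unpinned, "v0.1.0")
```

A release job could run a step like this over the rendered kustomization.yaml and apply the same tag to the EPP image in the Helm values.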
I see that only some commits have verified signatures. It would be better to squash locally and sign the new set of commits.
Signed-off-by: Jing Chen <jing.chen2@ibm.com>
0a4c969 to e550453
Adds the ability to generate llm-d stack deployment configs as an alternative to standalone vLLM (KServe InferenceService) configs.
When users select "llm-d" as the deployment stack, the planner generates:
- Kustomize files (kustomization.yaml + patch-vllm.yaml) for model server deployment, referencing the llm-d base manifests at guides/recipes/modelserver/base/single-host/default/
- Helm values (values.yaml) for the EPP (Endpoint Picker Pod) + InferencePool, using the standalone chart from oci://registry.k8s.io/gateway-api-inference-extension/charts/standalone
The default router config uses the standard EPP plugins: prefix-cache-scorer, decode-filter, max-score-picker, single-profile-handler.
Key design decisions:
- A separate generator (LlmdDeploymentGenerator), since the output shape is structurally different from KServe InferenceService
- The ?stack=llm-d query param on /api/v1/deploy selects the llm-d stack
How Has This Been Tested?
Merge criteria: