Validate capacity planner accuracy and compile extensive report with next steps by jgchn · Pull Request #205 · llm-d-incubation/llm-d-planner

jgchn · 2026-04-24T01:18:16Z

Description

Addresses #194

Blog article draft: https://github.com/jgchn/llm-d-planner/blob/accuracy/accuracy/blog-gpu-capacity.md
Included reproducibility guide for validating server deployments and analysis
Included actual vLLM startup logs

How Has This Been Tested?

Does not affect the current UI or API workflow, and has no effect on llm-d-planner functionality.

Merge criteria:

The commits are squashed in a cohesive manner and have meaningful messages.
Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
The developer has manually tested the changes and verified that the changes work

Next steps

Recalibrate activation constants in capacity planner
Support --dtype, --kv-cache-dtype, and --quantization override in UI and API
Fix find_possible_tps(): TP must also evenly divide vocab_size. vocab_size is usually a power of 2 times a multiplier
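To illustrate the last item, here is a minimal sketch of a TP filter that checks both constraints. It is not the planner's actual `find_possible_tps()`; the function name `candidate_tp_sizes`, its parameters, and the `max_tp` cap are hypothetical, and it assumes the only two constraints are that TP evenly divides the attention head count and the vocabulary size.

```python
def candidate_tp_sizes(num_attention_heads: int, vocab_size: int, max_tp: int = 8) -> list[int]:
    """Return tensor-parallel sizes that evenly divide both inputs.

    Hypothetical helper: the real planner may apply further constraints
    (for example on KV heads or available GPU count).
    """
    return [
        tp
        for tp in range(1, max_tp + 1)
        if num_attention_heads % tp == 0 and vocab_size % tp == 0
    ]


# 128256 = 2**8 * 501, so powers of two up to 8 all divide it.
llama_like = candidate_tp_sizes(num_attention_heads=32, vocab_size=128256)

# 24 heads would allow TP=3 and TP=6, but 32000 = 2**8 * 125 is divisible by neither.
odd_heads = candidate_tp_sizes(num_attention_heads=24, vocab_size=32000)
```

The second call shows why the vocabulary check matters: a head-count check alone would admit TP=3 and TP=6.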

Signed-off-by: Jing Chen <jing.chen2@ibm.com>

namasl · 2026-04-24T19:24:00Z

+
+### The headline: accurate where it counts most
+
+**Weight memory: 0.89% mean absolute error** across 53 of the 57 runs. (The remaining 4 used parameters the planner doesn't yet model, float32 dtype and runtime fp8 quantization, and are discussed below.) This is the single largest memory component; for a model like Llama-3.1-8B at TP=1, weights consume about 15 GiB of the 79 GiB available. It's also the hardest to get right across a diverse model set.


When using the planner, will I know if I enter a setup or model that the planner doesn't fully support? Will it give me an error message, or will it proceed to give me results with large errors without warning?
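For context on the numbers in the quoted excerpt, the sketch below shows one way to estimate weight memory and to score predictions against measurements. It is an illustration only, not the planner's code: the parameter count (about 8.03 billion for Llama-3.1-8B), the bytes-per-parameter table, the even split across TP ranks, and the percentage-error definition are all assumptions, and the report may define its error metric differently.

```python
GIB = 1024 ** 3

# Assumed storage size per parameter for a few dtypes (illustrative table).
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "fp8": 1}


def estimate_weight_gib(num_params: float, dtype: str = "bfloat16", tp: int = 1) -> float:
    """Approximate per-GPU weight memory in GiB, assuming an even split across TP ranks."""
    return num_params * BYTES_PER_PARAM[dtype] / tp / GIB


def mean_abs_pct_error(predicted: list[float], measured: list[float]) -> float:
    """Mean of |predicted - measured| / measured, expressed as a percentage."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(p - m) / m for p, m in zip(predicted, measured)]
    return 100.0 * sum(errors) / len(errors)


# Assumed parameter count for Llama-3.1-8B; bfloat16 weights at TP=1.
llama_weights_gib = estimate_weight_gib(8.03e9, dtype="bfloat16", tp=1)

# Made-up numbers, purely to exercise the metric.
example_error = mean_abs_pct_error([9.9, 20.2], [10.0, 20.0])
```

Under these assumptions the bfloat16 estimate lands just under 15 GiB, in line with the figure in the excerpt, while a float32 run of the same model would need roughly twice that, which is one reason unmodeled dtypes produce large errors.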

jgchn added 15 commits April 21, 2026 18:34

Initial accuracy analysis

cb66659

Signed-off-by: Jing Chen <jing.chen2@ibm.com>

Initial results

6417620

Signed-off-by: Jing Chen <jing.chen2@ibm.com>

Update models list

2c89f47

Signed-off-by: Jing Chen <jing.chen2@ibm.com>

Update gemma and codellam res

82b1fdf

Signed-off-by: Jing Chen <jing.chen2@ibm.com>

Update report

30ec05f

Signed-off-by: Jing Chen <jing.chen2@ibm.com>

Fix multiplier

c9f015d

Signed-off-by: Jing Chen <jing.chen2@ibm.com>

Clean up report

fbd6851

Signed-off-by: Jing Chen <jing.chen2@ibm.com>

Update report

a95e9f6

Signed-off-by: Jing Chen <jing.chen2@ibm.com>

More coverage and detailed report

3c876c8

Signed-off-by: Jing Chen <jing.chen2@ibm.com>

Add logs

d3121e0

Signed-off-by: Jing Chen <jing.chen2@ibm.com>

Draft blog

3134768

Signed-off-by: Jing Chen <jing.chen2@ibm.com>

Shorten summary

a95ae28

Signed-off-by: Jing Chen <jing.chen2@ibm.com>

rm temp files

56dc5b7

Signed-off-by: Jing Chen <jing.chen2@ibm.com>

Remove logs

2489dbb

Signed-off-by: Jing Chen <jing.chen2@ibm.com>

Update readme instructions:

60cb93f

Signed-off-by: Jing Chen <jing.chen2@ibm.com>

namasl reviewed Apr 24, 2026

Comment thread accuracy/blog-gpu-capacity.md Outdated

namasl reviewed Apr 24, 2026

Comment thread accuracy/blog-gpu-capacity.md Outdated

namasl reviewed Apr 24, 2026

Comment thread accuracy/blog-gpu-capacity.md Outdated

jgchn added 10 commits April 26, 2026 13:26

Address comments

f0197dd

Signed-off-by: Jing Chen <jing.chen2@ibm.com>

update with mimo results

875a4fd

Signed-off-by: Jing Chen <jing.chen2@ibm.com>

Merge branch 'main' into accuracy

84ad077

Refinements

3ecd91c

Signed-off-by: Jing Chen <jing.chen2@ibm.com>

Fix numbers

f3265ef

Signed-off-by: Jing Chen <jing.chen2@ibm.com>

Address feedback

2fe116c

Signed-off-by: Jing Chen <jing.chen2@ibm.com>

Simplification

427ef31

Signed-off-by: Jing Chen <jing.chen2@ibm.com>

Several revisions

91397a2

Signed-off-by: Jing Chen <jing.chen2@ibm.com>

Tone: story rather than technical report

892cd90

Signed-off-by: Jing Chen <jing.chen2@ibm.com>

Better table

151baf3

Signed-off-by: Jing Chen <jing.chen2@ibm.com>

jgchn wants to merge 25 commits into llm-d-incubation:main from jgchn:accuracy


2 participants

