Validate capacity planner accuracy and compile extensive report with next steps #205

Open
jgchn wants to merge 25 commits into llm-d-incubation:main from jgchn:accuracy

Conversation

@jgchn jgchn commented Apr 24, 2026

Collaborator

Description

Addresses #194

How Has This Been Tested?

This PR does not affect the current UI or API workflow and has no effect on llm-d-planner functionality.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

Next steps

  • Recalibrate activation constants in capacity planner
  • Support --dtype, --kv-cache-dtype, and --quantization overrides in the UI and API
  • Fix find_possible_tps(): the vocabulary size must also be divisible by TP. vocab_size is usually a power of 2 times a multiplier
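
The `find_possible_tps()` item above could be approached roughly as follows. This is a minimal sketch, not the planner's actual code: the helper name `filter_tps_by_vocab` and its parameters are hypothetical, and it assumes the candidate TP values have already passed the planner's existing checks.

```python
def filter_tps_by_vocab(candidate_tps: list[int], vocab_size: int) -> list[int]:
    """Keep only tensor-parallel degrees that evenly divide the vocabulary size.

    Hypothetical helper for illustration; the real find_possible_tps()
    signature is not shown in this PR.
    """
    if vocab_size <= 0:
        raise ValueError("vocab_size must be positive")
    # A TP degree is kept only if the vocabulary splits into equal shards.
    return [tp for tp in candidate_tps if tp > 0 and vocab_size % tp == 0]


# Example: 32000 = 2**8 * 125, so TP=3 is rejected while powers of 2 up to 256 pass.
valid_tps = filter_tps_by_vocab([1, 2, 3, 4, 8], 32000)
```

Because vocabulary sizes tend to be a power of 2 times a multiplier, small power-of-2 TP values usually pass this check, while odd or very large TP values may not.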

jgchn added 15 commits April 21, 2026 18:34
Signed-off-by: Jing Chen <jing.chen2@ibm.com>
Comment thread accuracy/blog-gpu-capacity.md Outdated

### The headline: accurate where it counts most

**Weight memory: 0.89% mean absolute error** across 53 of the 57 runs. (The remaining 4 used parameters the planner doesn't yet model, float32 dtype and runtime fp8 quantization, and are discussed below.) This is the single largest memory component; for a model like Llama-3.1-8B at TP=1, weights consume about 15 GiB of the 79 GiB available. It's also the hardest to get right across a diverse model set.
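
As a rough sanity check on the 15 GiB figure and the error metric, here is a minimal sketch. It assumes about 8.03 billion parameters for Llama-3.1-8B stored at 2 bytes each (bfloat16). The function names are illustrative and are not the planner's API, and the error function reflects one common reading of "mean absolute error" as a percentage of the measured value.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Estimate weight memory in GiB from parameter count and dtype width."""
    return num_params * bytes_per_param / GIB


def mean_abs_pct_error(predicted: list[float], measured: list[float]) -> float:
    """Mean of |predicted - measured| / measured across runs, in percent."""
    if not predicted or len(predicted) != len(measured):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(p - m) / m for p, m in zip(predicted, measured)]
    return 100 * sum(errors) / len(errors)


# Assumed parameter count (~8.03e9) at 2 bytes per parameter.
llama_8b_gib = weight_memory_gib(8.03e9, 2)
```

Under these assumptions the estimate lands just under 15 GiB, which is consistent with the figure quoted above.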

Collaborator

When using the planner, will I know if I enter a setup or model that it doesn't fully support? Will it give me an error message, or will it proceed to give me results with large errors without warning?

Comment thread accuracy/blog-gpu-capacity.md Outdated
Comment thread accuracy/blog-gpu-capacity.md Outdated
Comment thread accuracy/blog-gpu-capacity.md Outdated
jgchn added 10 commits April 26, 2026 13:26
Signed-off-by: Jing Chen <jing.chen2@ibm.com>
 results

Signed-off-by: Jing Chen <jing.chen2@ibm.com>
2 participants