feat(epp): add OpenTelemetry spans for the scheduler scoring path #1834

Open
mvanhorn wants to merge 1 commit into llm-d:main from mvanhorn:feat/1692-epp-scheduler-scoring-spans
@mvanhorn

What type of PR is this?

/kind feature

Summary

Adds fine-grained OpenTelemetry spans to the EPP scheduler scoring path so a single request trace shows which scorers ran, how long each took, and the aggregate score signals each produced.

SchedulerProfile.runScorerPlugins now opens a parent llm_d.epp.scoring span over the scorer chain, and each scorer invocation runs inside an llm_d.epp.scorer.<type> child span. The child spans nest under the parent, and the existing plugin-internal span (e.g. score_prefix_cache) nests under its scorer span. This mirrors the span-emission pattern already used by the precise-prefix-cache scorer.

Attributes are request- and chain-level only:

  • parent: llm_d.epp.scorer.count, llm_d.epp.scoring.candidate_endpoints, plus gen_ai.request.model / gen_ai.request.id when present.
  • per-scorer: llm_d.epp.scorer.type, llm_d.epp.scorer.name, llm_d.epp.scorer.weight, llm_d.epp.scorer.candidate_endpoints, and aggregate llm_d.epp.scorer.score.max / .avg / endpoints_scored derived from the returned score map.

No per-pod / per-endpoint attribute keys are emitted, keeping span cardinality bounded.
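As an illustration of how the per-scorer aggregates could be derived from a returned score map, here is a minimal sketch. The `scoreAggregate` type, the `aggregateScores` helper, and the string endpoint keys are hypothetical names chosen for this example and are not taken from the PR. The point is that only the reduced values leave the function, so no endpoint identity reaches a span attribute, and an empty map reports `ok == false` so the caller can omit the aggregates.

```go
package main

import "fmt"

// scoreAggregate holds chain-level signals derived from one scorer's output.
// The type and its field names are hypothetical, not taken from the PR.
type scoreAggregate struct {
	Max    float64
	Avg    float64
	Scored int
}

// aggregateScores reduces a per-endpoint score map to max, average, and count.
// It reports ok=false for an empty map so the caller can skip the aggregate
// attributes instead of dividing by zero. Map keys are never copied out.
func aggregateScores[K comparable](scores map[K]float64) (agg scoreAggregate, ok bool) {
	if len(scores) == 0 {
		return scoreAggregate{}, false
	}
	first := true
	sum := 0.0
	for _, s := range scores {
		// Seed Max from the first value so negative scores are handled too.
		if first || s > agg.Max {
			agg.Max = s
			first = false
		}
		sum += s
	}
	agg.Scored = len(scores)
	agg.Avg = sum / float64(agg.Scored)
	return agg, true
}

func main() {
	// Assumed example input: three candidate endpoints with made-up scores.
	agg, ok := aggregateScores(map[string]float64{
		"endpoint-a": 0.25,
		"endpoint-b": 0.75,
		"endpoint-c": 0.5,
	})
	fmt.Println(agg.Max, agg.Avg, agg.Scored, ok)

	// An empty score map yields ok=false, so no aggregate would be attached.
	_, ok = aggregateScores(map[string]float64{})
	fmt.Println(ok)
}
```

In a real span the three fields would map onto the aggregate attributes listed above; the exact attribute-setting calls are left out here because they depend on the PR's code.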

Why this matters

Issue #1692 ("Add span for scheduler score") is the scoped sub-task of umbrella #1483, which asks for spans across the scheduling/scoring pipeline so operators can see scorer-level behavior inside a request trace instead of only the top-level gateway.request / gateway.request_orchestration spans. Today the scorer chain is invisible in traces: there is no way to tell which scorer dominated a routing decision or which one was slow.

This package is allocation-sensitive (the surrounding code documents per-request allocation work in runScorerPlugins and runPickerPlugin), so the scoring path stays allocation-free when tracing is disabled. The parent span's IsRecording() is checked once, and all attribute and child-span construction is skipped on the default no-op path; the tracer and span-kind option are resolved once rather than per scorer. BenchmarkSchedule confirms the disabled path holds at the prior baseline.
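The check-once gating described here can be sketched without the OpenTelemetry dependency. In the sketch below, `fakeSpan` is a hypothetical stand-in that only mimics the `IsRecording()` method of an OpenTelemetry span, and `scorer`, `runScorers`, and `demoScorers` are illustrative names, not the PR's code. Combining scores as a weighted sum is also an assumption made so the example has a result to compare.

```go
package main

import "fmt"

// fakeSpan is a hypothetical stand-in for a tracing span. It only mimics
// IsRecording() and attribute storage; it is not the OpenTelemetry type.
type fakeSpan struct {
	recording bool
	attrs     map[string]any
}

func (s *fakeSpan) IsRecording() bool { return s.recording }

// set stores an attribute, allocating the map only on first use.
func (s *fakeSpan) set(key string, value any) {
	if s.attrs == nil {
		s.attrs = make(map[string]any)
	}
	s.attrs[key] = value
}

// scorer is an illustrative plugin shape: a type label, a weight, and a
// scoring function over candidate endpoint names.
type scorer struct {
	typ    string
	weight float64
	score  func(candidates []string) map[string]float64
}

// runScorers checks the parent span once, then skips all attribute and
// child-span work when tracing is disabled. Scoring itself is unaffected.
func runScorers(parent *fakeSpan, scorers []scorer, candidates []string) (map[string]float64, int) {
	recording := parent.IsRecording() // resolved once, not per scorer
	if recording {
		parent.set("llm_d.epp.scorer.count", len(scorers))
		parent.set("llm_d.epp.scoring.candidate_endpoints", len(candidates))
	}
	weighted := make(map[string]float64, len(candidates))
	childSpans := 0
	for _, s := range scorers {
		if recording {
			childSpans++ // placeholder for starting a per-scorer child span
		}
		for ep, v := range s.score(candidates) {
			weighted[ep] += v * s.weight // assumed combination rule
		}
	}
	return weighted, childSpans
}

// constantScorer returns a scoring function that gives every candidate v.
func constantScorer(v float64) func([]string) map[string]float64 {
	return func(candidates []string) map[string]float64 {
		out := make(map[string]float64, len(candidates))
		for _, c := range candidates {
			out[c] = v
		}
		return out
	}
}

// demoScorers builds two made-up scorers for the example.
func demoScorers() []scorer {
	return []scorer{
		{typ: "demo-a", weight: 2, score: constantScorer(0.5)},
		{typ: "demo-b", weight: 1, score: constantScorer(0.25)},
	}
}

func main() {
	candidates := []string{"endpoint-a", "endpoint-b"}
	for _, recording := range []bool{false, true} {
		span := &fakeSpan{recording: recording}
		scores, children := runScorers(span, demoScorers(), candidates)
		fmt.Println(recording, scores["endpoint-a"], children, len(span.attrs))
	}
}
```

With the real tracer, the `recording` flag would come from the parent span returned by the tracer's start call, and the placeholder increment would be replaced by starting and ending a child span around each scorer invocation.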

Testing

  • Parent llm_d.epp.scoring span recorded with one llm_d.epp.scorer.<type> child per scorer, children nested under the parent, with correct weight, candidate-count, and aggregate max/avg attributes.
  • No per-pod / per-endpoint attribute keys present on any emitted span.
  • Scoring results and scorer call counts are unchanged with a no-op tracer; no spans recorded on the disabled path.
  • A scorer returning an empty score map still emits its span and omits aggregates without dividing by zero.

Which issue(s) this PR fixes:

Fixes #1692

Release note:

Add OpenTelemetry spans for the EPP scheduler scoring path: a parent `llm_d.epp.scoring` span over the scorer chain and per-scorer `llm_d.epp.scorer.<type>` child spans carrying scorer identity, weight, candidate count, and aggregate score signals. No per-endpoint attributes are emitted, and spans are skipped entirely when tracing is disabled.

Wrap the scorer chain in a parent llm_d.epp.scoring span and each scorer
invocation in an llm_d.epp.scorer.<type> child span, so a request trace
shows which scorers ran, how long they took, and aggregate score signals.
Span attributes are request- and chain-level only (scorer type/name/weight,
candidate count, score max/avg); no per-endpoint keys are emitted, keeping
span cardinality bounded. Spans are no-ops when tracing is uninitialized.

Signed-off-by: Matt Van Horn <mvanhorn@gmail.com>
@mvanhorn mvanhorn requested a review from a team as a code owner June 27, 2026 11:09
@mvanhorn mvanhorn requested review from ahg-g and liu-cong June 27, 2026 11:09
@github-actions github-actions Bot added kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 27, 2026