Add picker tracing decorator to scheduling pipeline #1708

Open
chethanuk wants to merge 3 commits into llm-d:main from chethanuk:issue-1694-picker-span
@chethanuk

What

Add a single OTel pick_endpoints span over the scheduling picker stage,
carrying the candidate/selected endpoint counts and request-correlation keys.

Implements #1694 (sub-task of #1483).

Design

Inline, single-step span, following the merged tracing convention from #1565
(the repo standard: pick_pd_profile / score_prefix_cache are traced inline,
not via a decorator) and the maintainer guidance on #1693:

  • runPickerPlugin starts one pick_endpoints span (SpanKindInternal) via
    tracing.Tracer(schedplugins.TracerScope), with defer span.End() so it ends
    on every path including a nil result.
  • Attributes: llm_d.epp.picker.candidate_endpoints (scored-endpoint count),
    llm_d.epp.picker.selected_endpoints (len(result.TargetEndpoints),
    nil-guarded), plus the conditional shared gen_ai.request.model /
    gen_ai.request.id keys — matching score_prefix_cache. No type/name
    attrs (the span name identifies the operation).
  • request is threaded into runPickerPlugin (sole caller Run) so the span
    carries the request keys, consistent with the filter and scorer spans. The span
    context is threaded into picker.Pick(...), so inner spans nest. Picking
    behavior and the per-plugin latency metric are unchanged.
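
The attribute rules in the list above can be illustrated with a minimal, stdlib-only sketch. It does not use the OpenTelemetry API or the repository's real types: `ProfileRunResult`, `Request`, and `pickerSpanAttrs` are hypothetical stand-ins, and a plain map takes the place of span attributes. It only shows the nil guard on the selected count and the conditional emission of the gen_ai.* keys.

```go
package main

import "fmt"

// ProfileRunResult is a hypothetical stand-in for the picker result type.
type ProfileRunResult struct {
	TargetEndpoints []string
}

// Request is a hypothetical stand-in carrying the request-correlation fields.
type Request struct {
	TargetModel string
	RequestID   string
}

// pickerSpanAttrs assembles the attributes described in the PR as a plain map.
// A real implementation would set these on an OTel span instead.
func pickerSpanAttrs(candidates int, result *ProfileRunResult, req *Request) map[string]any {
	attrs := map[string]any{
		"llm_d.epp.picker.candidate_endpoints": candidates,
	}

	// Nil guard: a nil result is reported as zero selected endpoints.
	selected := 0
	if result != nil {
		selected = len(result.TargetEndpoints)
	}
	attrs["llm_d.epp.picker.selected_endpoints"] = selected

	// The gen_ai.* keys are conditional: emitted only when the field is non-empty.
	if req != nil {
		if req.TargetModel != "" {
			attrs["gen_ai.request.model"] = req.TargetModel
		}
		if req.RequestID != "" {
			attrs["gen_ai.request.id"] = req.RequestID
		}
	}
	return attrs
}

func main() {
	res := &ProfileRunResult{TargetEndpoints: []string{"pod-a"}}
	req := &Request{TargetModel: "example-model"}
	fmt.Println(pickerSpanAttrs(3, res, req))
}
```

Keeping the guard and the conditionals in one place means the nil-result path and the empty-request path produce a well-formed attribute set without any special casing at the call site.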

Scope: schedplugins.TracerScope = llm-d-router/pkg/epp/framework/plugins/scheduling.

Review feedback addressed

  • Per-request wrapper allocation (gemini, coderabbit): the TracedPicker
    decorator is removed entirely; there is no per-request wrapper.
  • Tracer caching / otel.Tracer() (gemini): now uses tracing.Tracer(...), so
    spans carry the BuildRef / commit-sha instrumentation metadata.
  • Test comment paraphrasing code (codeant): the old decorator test is gone; the
    new test comments capture only non-obvious rationale.
  • Span name, granularity, and attribute shape: aligned to the maintainer's
    decision on #1693 (Add span for filter): single-step span,
    llm_d.epp.picker.* keys, package scope.

Tests

scheduler_profile_picker_tracing_test.go (spans read as a slice via a
tracetest recorder) covers:

  • single selected (candidate 3 -> selected 1), checking name/kind/parent
    and counts;
  • multiple selected (selected_endpoints == 2);
  • nil result: span ended, selected_endpoints == 0, no panic;
  • gen_ai.* omitted when request fields are empty;
  • inner delegate span nests under pick_endpoints.
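
As a rough illustration of the nil-result case, the sketch below models the "span ends on every path" behavior with a hand-rolled recorder rather than the tracetest package. Everything here (`fakeSpan`, `recorder`, `result`, `pick`, `runPicker`) is hypothetical and only mirrors the shape of the check: start one span, defer its end, and record a nil-guarded selected count.

```go
package main

import "fmt"

// fakeSpan is a hypothetical, minimal stand-in for a recorded span.
type fakeSpan struct {
	name  string
	attrs map[string]int
	ended bool
}

func (s *fakeSpan) End() { s.ended = true }

// recorder collects every span that was started, in order.
type recorder struct{ spans []*fakeSpan }

func (r *recorder) start(name string) *fakeSpan {
	s := &fakeSpan{name: name, attrs: map[string]int{}}
	r.spans = append(r.spans, s)
	return s
}

// result is a hypothetical picker result; pick returns nil for no candidates.
type result struct{ targets []string }

func pick(candidates []string) *result {
	if len(candidates) == 0 {
		return nil
	}
	return &result{targets: candidates[:1]}
}

// runPicker starts one span and defers End so it runs on every return path.
func runPicker(r *recorder, candidates []string) *result {
	span := r.start("pick_endpoints")
	defer span.End()

	span.attrs["candidate_endpoints"] = len(candidates)
	res := pick(candidates)

	selected := 0
	if res != nil { // nil guard: do not dereference a nil result
		selected = len(res.targets)
	}
	span.attrs["selected_endpoints"] = selected
	return res
}

func main() {
	rec := &recorder{}
	runPicker(rec, nil) // nil-result path
	s := rec.spans[0]
	fmt.Println(s.name, s.ended, s.attrs["selected_endpoints"])
}
```

The real test reads spans from a tracetest recorder; the sketch only shows why `defer span.End()` plus the nil guard is enough to satisfy the "ended, zero selected, no panic" assertions.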

Gates: go build ./..., go test ./pkg/epp/scheduling/... -race, go vet,
and make lint (new-only) all green.

Refs: #1694, #1483

@chethanuk chethanuk requested a review from a team as a code owner June 22, 2026 16:18
@chethanuk chethanuk requested review from elevran and liu-cong June 22, 2026 16:18
@github-actions github-actions Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 22, 2026
Comment thread pkg/epp/scheduling/scheduler_profile.go Outdated
logger.V(logutil.VERBOSE).Info("Running picker plugin", "plugin", p.picker.TypedName())
logger.V(logutil.DEBUG).Info("Candidate pods for picking", "endpoints-weighted-score", scoredEndpoints)

ctx, span := tracing.Tracer(schedplugins.TracerScope).Start(ctx, "pick_endpoints",

TracerScope should be llm-d-router/pkg/epp/scheduling

Wrap the picker call in runPickerPlugin in a single inline "pick_endpoints" span
via tracing.Tracer(schedplugins.TracerScope) (SpanKindInternal), recording the
candidate and selected endpoint counts (llm_d.epp.picker.candidate_endpoints,
llm_d.epp.picker.selected_endpoints = len(result.TargetEndpoints), nil-guarded)
plus the conditional gen_ai.request.{model,id} keys. request is threaded into
runPickerPlugin (sole caller Run) so the span carries the request keys, matching
the filter and scorer spans. Follows the single-step span convention from llm-d#1565
and llm-d#1693; picking behavior and the per-plugin latency metric are unchanged.

Refs: llm-d#1694, llm-d#1483
Signed-off-by: ChethanUK <chethanuk@outlook.com>
@chethanuk chethanuk force-pushed the issue-1694-picker-span branch from de9d2fc to 59c4b6a Compare June 22, 2026 19:52
2 participants