feat(epp): add topology-extractor datalayer plugin#1678

Open
elevran wants to merge 6 commits into llm-d:main from elevran:topology-extractor

@elevran elevran commented Jun 17, 2026

Collaborator

What type of PR is this?
/kind feature

What this PR does / why we need it:

Adds a topology-extractor endpoint extractor that stamps each endpoint with a
Topology attribute (hostname, zone, region) for use by topology-aware scoring.

The plugin registers for both endpoint lifecycle events and Kubernetes Pod
notification events. The label name for each topology field is configured via the
hostname, zone, and region parameters; when a parameter is omitted, the standard
Kubernetes topology label is used as its default:

  • hostname: kubernetes.io/hostname
  • zone: topology.kubernetes.io/zone
  • region: topology.kubernetes.io/region
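
As an illustration of the defaulting described above, here is a minimal sketch. The `Params` struct, its JSON tags, and `withDefaults` are hypothetical names chosen for this example, not the plugin's actual types; only the three default label strings come from the list above.

```go
package main

import "fmt"

// Params is a hypothetical parameter struct. Each field names the label
// to read for the corresponding topology field.
type Params struct {
	Hostname string `json:"hostname,omitempty"`
	Zone     string `json:"zone,omitempty"`
	Region   string `json:"region,omitempty"`
}

// Default label names, as listed in the PR description.
const (
	defaultHostnameLabel = "kubernetes.io/hostname"
	defaultZoneLabel     = "topology.kubernetes.io/zone"
	defaultRegionLabel   = "topology.kubernetes.io/region"
)

// withDefaults returns a copy of p in which every empty label name is
// replaced by the standard Kubernetes topology label.
func withDefaults(p Params) Params {
	if p.Hostname == "" {
		p.Hostname = defaultHostnameLabel
	}
	if p.Zone == "" {
		p.Zone = defaultZoneLabel
	}
	if p.Region == "" {
		p.Region = defaultRegionLabel
	}
	return p
}

func main() {
	// Only the zone label is overridden; hostname and region use defaults.
	p := withDefaults(Params{Zone: "example.com/zone"})
	fmt.Println(p.Hostname, p.Zone, p.Region)
}
```
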

Labels are read from the endpoint's pod metadata at endpoint event time. When the
hostname label is absent on a pod, the plugin falls back to spec.hostname from
the Pod notification event. Zone and region have no fallback.

Note: zone and region values are not read from k8s Node objects -- that would
require additional RBAC to allow GET/LIST on Nodes. The topology fields are
populated when the pod itself carries the corresponding labels (e.g. propagated
via the Downward API or admission webhook from the node's labels).

This is groundwork for topology-aware scoring in disaggregated inference: a planned
topology-locality-scorer will score endpoints whose Topology.Hostname matches a
key on the request attribute store, enabling locality-aware routing that reduces KV
cache transfer latency across nodes.
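
The scorer is not part of this PR. Going by the first commit message, it would score 1.0 on a hostname match and 0.0 otherwise, with every endpoint scoring 0.0 when the request carries no key. A rough sketch of that rule, using the hypothetical function `scoreByHostname` and a plain map in place of the real endpoint and attribute-store types:

```go
package main

import "fmt"

// scoreByHostname maps each endpoint name to a score. endpointHosts maps an
// endpoint name to its Topology.Hostname. requestKey is the topology key
// found on the request; an empty string means no key was present.
func scoreByHostname(requestKey string, endpointHosts map[string]string) map[string]float64 {
	scores := make(map[string]float64, len(endpointHosts))
	for name, host := range endpointHosts {
		// Without a request key nothing can match, even an endpoint
		// whose hostname was never stamped (empty string).
		if requestKey != "" && host == requestKey {
			scores[name] = 1.0
		} else {
			scores[name] = 0.0
		}
	}
	return scores
}

func main() {
	hosts := map[string]string{"ep-a": "node-1", "ep-b": "node-2"}
	fmt.Println(scoreByHostname("node-1", hosts))
	fmt.Println(scoreByHostname("", hosts))
}
```
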

Which issue(s) this PR fixes:
Refs #545

Test plan:

Release note (write NONE if no user-facing change):

Add `topology-extractor` datalayer plugin: stamps endpoints with a `Topology`
attribute (hostname, zone, region). Label names are configurable; defaults are the
standard Kubernetes topology labels. Hostname falls back to `spec.hostname` when
the label is absent.

@elevran elevran requested a review from a team as a code owner June 17, 2026 12:09
@elevran elevran requested review from liu-cong and vMaroon June 17, 2026 12:09
@github-actions github-actions Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. labels Jun 17, 2026
@elevran elevran force-pushed the topology-extractor branch from 3fd1524 to ec952ab Compare June 17, 2026 12:54
@github-actions github-actions Bot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/feature Categorizes issue or PR as related to a new feature. labels Jun 17, 2026
@elevran elevran marked this pull request as draft June 17, 2026 12:57
@github-actions github-actions Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. labels Jun 18, 2026
Comment thread pkg/epp/framework/plugins/datalayer/attribute/topology/README.md Outdated
Comment thread pkg/epp/framework/plugins/datalayer/extractor/topology/extractor.go Outdated
Comment thread pkg/epp/framework/plugins/datalayer/extractor/topology/extractor.go Outdated
@github-actions github-actions Bot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/feature Categorizes issue or PR as related to a new feature. labels Jun 22, 2026
elevran added 2 commits June 23, 2026 16:38
Introduces a topology-extractor endpoint extractor that stamps each
endpoint with a Topology attribute (hostname) at creation time.

The hostname is resolved once when the endpoint is created:
- Default: the Pod hostname field (EndpointMetadata.PodName).
- With hostnameLabel configured: the value of the named pod label.
  If the label is absent, the attribute is not set.

Works for both k8s-discovered and file-based endpoints. The attribute
is a static Cloneable value stored in the endpoint's AttributeMap.

The extractor self-registers with the endpoint-notification-source,
creating a default source if none is configured.

Planned scorer (not implemented): topology-locality-scorer would
score 1.0 for endpoints whose Topology.Hostname matches a key set
on the request attribute store, and 0.0 otherwise. The request key
would be published by a DataProducer reading a request header
(e.g. x-topology-key) or from EPP node metadata. When no key is
present on the request, all endpoints score 0.0.

Signed-off-by: Etai Lev Ran <elevran@gmail.com>
…fications

- Remove auto-generated release note fragment (generated by CI action)
- Register for both the endpoint notification source and a Pod k8s
  notification source via RegisterDependencies
- With hostnameLabel configured: endpoint handler extracts the label
  value and stamps the Topology attribute; Pod handler is a no-op
- Without hostnameLabel: endpoint handler tracks the live Endpoint in
  an internal map; Pod notification handler reads spec.hostname from the
  Pod object and stamps the matching endpoint
- Maintain the endpoint map under a RWMutex; remove entries on delete

The prior implementation only handled endpoint events, which do not
carry the full Pod object. Pod notifications provide spec.hostname for
the no-label path.

Signed-off-by: Etai Lev Ran <elevran@gmail.com>
elevran added 4 commits June 23, 2026 16:38
…ogy-extractor

The prior implementation keyed endpoint lookup by endpoint NamespacedName
(e.g. worker-1-rank-0), which never matches the pod notification key
(worker-1). Also, pod notifications fire before endpoints are created, so
the attribute was never stamped.

Fix:
- Key both internal maps by pod identity {PodName, Namespace}.
- endpoints map holds a []Endpoint per pod to cover all rank entries.
- hostnames map caches spec.hostname from pod notifications; whichever
  event fires first, the attribute is written once both have been seen.
- Only cache hostnames for ready pods; evict on not-ready or pod delete.

Signed-off-by: Etai Lev Ran <elevran@gmail.com>
README: remove stale 'at endpoint creation' phrasing; clarify that
Hostname is sourced from spec.hostname or a configured pod label.

Tests: drop always-constant parameters from helper functions (unparam);
introduce constants for repeated string literals (goconst).

Signed-off-by: Etai Lev Ran <elevran@gmail.com>
Add zone and region fields to the Topology attribute and the params
struct. Each param names the pod label to read; defaults are the
standard Kubernetes topology labels (corev1.LabelHostname,
LabelTopologyZone, LabelTopologyRegion).

The hostname fallback to spec.hostname is preserved: when the hostname
label is absent, the endpoint is tracked and stamped once the Pod
notification fires. Zone and region have no fallback.

Zone and region values are not read from Node objects -- that would
require RBAC for Node GET/LIST. They are populated when the pod itself
carries the corresponding labels.

Signed-off-by: Etai Lev Ran <elevran@gmail.com>
Change endpoints map inner type from []Endpoint to
map[NamespacedName]Endpoint. EventAddOrUpdate re-fires on
endpoint updates, so the previous append caused duplicate
entries for the same endpoint. Map assignment is idempotent.

Remove the now-unused slices import.

Fix README wording: "pod label" -> "endpoint label".

Signed-off-by: Etai Lev Ran <elevran@gmail.com>
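
The duplicate-entry issue fixed by the commit above can be reproduced in isolation. The sketch below uses plain strings in place of the real Endpoint and NamespacedName types; `trackWithSlice` and `trackWithMap` are hypothetical helpers written for this example.

```go
package main

import "fmt"

// trackWithSlice appends on every event, so a re-fired add-or-update event
// for the same endpoint produces a duplicate entry.
func trackWithSlice(events []string) int {
	var tracked []string
	for _, name := range events {
		tracked = append(tracked, name)
	}
	return len(tracked)
}

// trackWithMap assigns by key, so repeating the same event is idempotent.
func trackWithMap(events []string) int {
	tracked := map[string]struct{}{}
	for _, name := range events {
		tracked[name] = struct{}{}
	}
	return len(tracked)
}

func main() {
	// The same endpoint is reported three times: one add and two updates.
	events := []string{"worker-1-rank-0", "worker-1-rank-0", "worker-1-rank-0"}
	fmt.Println(trackWithSlice(events), trackWithMap(events))
}
```
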
@elevran elevran force-pushed the topology-extractor branch from b232151 to b785485 Compare June 23, 2026 13:39
@github-actions github-actions Bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 23, 2026
Labels

kind/feature Categorizes issue or PR as related to a new feature. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

2 participants