Move gRPC server to Unix socket and scrub introspection data by oliviassss · Pull Request #3739 · aws/amazon-vpc-cni-k8s

oliviassss · 2026-06-23T21:27:13Z

What type of PR is this?

bug
Which issue does this PR fix?:

IPAMD IP Hijack issue, more detail in internal docs.

What does this PR do / Why do we need it?:

- Moves gRPC server to Unix domain socket (/var/run/aws-node/ipamd.sock, permissions 0660). The CNI plugin runs as root (invoked by kubelet) and can connect; unprivileged hostNetwork pods cannot access the socket due to filesystem permissions and container mount namespace isolation.
- Adds TCP fallback for backward compatibility during rolling upgrades. The CNI plugin tries the Unix socket first; if it doesn't exist (old IPAMD), falls back to TCP. Retains the current behavior that TCP bind is mandatory at startup (health probes and waitForIPAM depend on it).
- Scrubs ContainerID and IfName from the /v1/enis introspection response to block the reconnaissance step. The endpoint remains functional for debugging (ENI IDs, IP addresses, pod names) but no longer exposes the fields required to construct a valid DelNetwork call.
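The socket setup described above can be sketched roughly as follows. This is a minimal illustration using only the Go standard library, not the code from this PR: `listenUnixSocket` and `socketModeDemo` are hypothetical names, and the demo binds inside a temporary directory instead of /var/run/aws-node.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"net"
	"os"
	"path/filepath"
)

// listenUnixSocket is a hypothetical helper, not the PR's actual code.
// It removes any stale socket file, binds a Unix domain socket at path,
// and then restricts the socket file to the given permissions.
func listenUnixSocket(path string, perm os.FileMode) (net.Listener, error) {
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return nil, fmt.Errorf("create socket dir: %w", err)
	}
	// A socket file left over from a previous run would make the bind fail.
	if err := os.Remove(path); err != nil && !errors.Is(err, fs.ErrNotExist) {
		return nil, fmt.Errorf("remove stale socket: %w", err)
	}
	l, err := net.Listen("unix", path)
	if err != nil {
		return nil, fmt.Errorf("listen on %s: %w", path, err)
	}
	// Chmod right after creation instead of changing the process-wide umask.
	if err := os.Chmod(path, perm); err != nil {
		l.Close() // closing the listener also unlinks the socket file
		return nil, fmt.Errorf("chmod socket: %w", err)
	}
	return l, nil
}

// socketModeDemo binds a socket in a temp dir and reports its permission bits.
func socketModeDemo() (string, error) {
	dir, err := os.MkdirTemp("", "ipamd-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	sock := filepath.Join(dir, "aws-node", "ipamd.sock")
	l, err := listenUnixSocket(sock, 0o660)
	if err != nil {
		return "", err
	}
	defer l.Close()

	info, err := os.Stat(sock)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%04o", uint32(info.Mode().Perm())), nil
}

func main() {
	mode, err := socketModeDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("socket mode:", mode)
}
```

There is a short window between bind and chmod in which the socket carries the umask-derived mode; this sketch does not try to close that window.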

Testing done on this change:

Unit tests

  $ go test ./pkg/ipamd/... ./cmd/routed-eni-cni-plugin/... -count=1
  ok   github.com/aws/amazon-vpc-cni-k8s/pkg/ipamd          0.332s
  ok   github.com/aws/amazon-vpc-cni-k8s/pkg/ipamd/datastore    32.032s
  ok   github.com/aws/amazon-vpc-cni-k8s/cmd/routed-eni-cni-plugin  0.017s
  ok   github.com/aws/amazon-vpc-cni-k8s/cmd/routed-eni-cni-plugin/driver  0.019s

New tests added

 - TestRunRPCHandler_UnixSocket — verifies socket creation and 0660 permissions
 - TestEniV1RequestHandler_ScrubsSensitiveFields — verifies containerID/ifName removed
 - TestEniV1RequestHandler_ScrubsIPv6Fields — verifies IPv6 path scrubbing
 - TestDialIPAMD_FallsBackToTCPWhenSocketMissing — socket path absent, uses TCP
 - TestDialIPAMD_FallsBackToTCPWhenSocketDialFails — socket exists but Dial fails, uses TCP
 - TestDialIPAMD_ConnectsViaUnixSocket — socket exists and Dial succeeds

Integration tests (IPAMD suite, EKS v1.35.5, m5.xlarge nodes)

  Ran 26 of 26 Specs in 2491.022 seconds
  26 Passed | 0 Failed | 0 Pending | 0 Skipped

Manual verification on live cluster (VPC CNI sec-fix.v0.1)
Details are in the internal doc.
Node logs confirming socket listener:

  {"level":"info","ts":"2026-06-22T22:25:32.576Z","msg":"Serving RPC Handler version  on unix:/var/run/aws-node/ipamd.sock"}
  {"level":"info","ts":"2026-06-22T22:25:32.576Z","msg":"Starting TCP fallback gRPC listener on 127.0.0.1:50051"}

Will this PR introduce any new dependencies?:

No
Will this break upgrades or downgrades? Has updating a running cluster been tested?:
No. Backward compatibility is maintained via TCP fallback:

New IPAMD + new CNI plugin: Uses Unix socket (secure)
New IPAMD + old CNI plugin: Old plugin uses TCP fallback (still works)
Old IPAMD + new CNI plugin: Socket doesn't exist, falls back to TCP (still works)

Rolling upgrade tested on live cluster — pods continued to receive IPs throughout the daemonset rollout.
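The compatibility matrix above comes down to a "socket first, TCP second" dial order. A rough sketch of that order with plain `net` connections follows; the actual plugin dials gRPC, and `dialWithFallback` and `fallbackDemo` are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
	"time"
)

// dialWithFallback is a hypothetical stand-in for the plugin's dial logic.
// It returns the connection and the network ("unix" or "tcp") that was used.
func dialWithFallback(socketPath, tcpAddr string, timeout time.Duration) (net.Conn, string, error) {
	if _, err := os.Stat(socketPath); err == nil {
		conn, err := net.DialTimeout("unix", socketPath, timeout)
		if err == nil {
			return conn, "unix", nil
		}
		// The socket file exists but the dial failed; fall through to TCP.
	}
	conn, err := net.DialTimeout("tcp", tcpAddr, timeout)
	if err != nil {
		return nil, "", fmt.Errorf("dial %s: %w", tcpAddr, err)
	}
	return conn, "tcp", nil
}

// fallbackDemo dials once with a live socket and once with a missing one,
// and reports which network each attempt ended up using.
func fallbackDemo() (string, string, error) {
	dir, err := os.MkdirTemp("", "ipamd-demo")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)

	tcpL, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", "", err
	}
	defer tcpL.Close()

	sock := filepath.Join(dir, "ipamd.sock")
	unixL, err := net.Listen("unix", sock)
	if err != nil {
		return "", "", err
	}
	defer unixL.Close()

	c1, withSocket, err := dialWithFallback(sock, tcpL.Addr().String(), time.Second)
	if err != nil {
		return "", "", err
	}
	c1.Close()

	missing := filepath.Join(dir, "missing.sock")
	c2, withoutSocket, err := dialWithFallback(missing, tcpL.Addr().String(), time.Second)
	if err != nil {
		return "", "", err
	}
	c2.Close()

	return withSocket, withoutSocket, nil
}

func main() {
	a, b, err := fallbackDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("socket present: %s, socket missing: %s\n", a, b)
}
```

Returning the network name alongside the connection makes it easy for a caller to log which path was taken.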

Does this change require updates to the CNI daemonset config files to work?:

Does this PR introduce any user-facing change?:

no

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Copilot

Pull request overview

This PR hardens IPAMD-to-CNI communication and introspection outputs to mitigate local reconnaissance and IP hijack vectors by moving the primary gRPC transport to a Unix domain socket and scrubbing sensitive fields from /v1/enis.

Changes:

- IPAMD gRPC server now listens on a Unix socket (/var/run/aws-node/ipamd.sock) with restrictive permissions, while also starting a TCP listener for upgrade compatibility.
- CNI plugin now dials IPAMD via Unix socket first and falls back to TCP if the socket is unavailable.
- /v1/enis introspection response now clears ContainerID and IfName from returned IP assignment data; added tests for IPv4/IPv6 scrubbing.
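To illustrate the scrubbing step, here is a minimal sketch that clears the two sensitive fields on a copy of the data before it is serialized. The `assignment` struct, its JSON tags, and `scrubAssignments` are assumptions made for this example; the real introspection payload in pkg/ipamd has its own types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// assignment is a hypothetical, simplified view of one IP assignment.
// The real introspection types in pkg/ipamd differ.
type assignment struct {
	PodName     string `json:"podName"`
	IP          string `json:"ip"`
	ContainerID string `json:"containerID,omitempty"`
	IfName      string `json:"ifName,omitempty"`
}

// scrubAssignments returns a copy with ContainerID and IfName cleared,
// leaving the caller's slice untouched.
func scrubAssignments(in []assignment) []assignment {
	out := make([]assignment, len(in))
	copy(out, in)
	for i := range out {
		out[i].ContainerID = ""
		out[i].IfName = ""
	}
	return out
}

// scrubDemo returns the scrubbed JSON and the original ContainerID,
// showing that only the response copy is modified.
func scrubDemo() (string, string, error) {
	orig := []assignment{
		{PodName: "web-0", IP: "10.0.1.23", ContainerID: "abc123", IfName: "eth0"},
	}
	b, err := json.Marshal(scrubAssignments(orig))
	if err != nil {
		return "", "", err
	}
	return string(b), orig[0].ContainerID, nil
}

func main() {
	scrubbed, kept, err := scrubDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(scrubbed) // [{"podName":"web-0","ip":"10.0.1.23"}]
	fmt.Println("datastore copy still has:", kept)
}
```

Working on a copy matters here: the live datastore still needs the container ID and interface name to process real CNI calls.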

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| pkg/ipamd/rpc_handler.go | Switch gRPC server to a Unix socket and add a TCP fallback listener. |
| pkg/ipamd/rpc_handler_test.go | Add a unit test validating Unix socket creation/permissions. |
| pkg/ipamd/introspect.go | Scrub `ContainerID`/`IfName` from `/v1/enis` response payload. |
| pkg/ipamd/introspect_test.go | Add tests verifying sensitive-field scrubbing for IPv4/IPv6. |
| cmd/routed-eni-cni-plugin/cni.go | Prefer dialing IPAMD via Unix socket with TCP fallback for upgrades. |

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

…ss-wide umask
- Make TCP listener bind mandatory at startup (matching pre-existing behavior) since health probes and waitForIPAM depend on port 50051. If TCP fails to bind, the process exits so kubelet can restart it.
- Extract dialIPAMDWithSocketPath to allow test injection of the socket path. Fix TestDialIPAMD_FallsBackToTCPWhenSocketDialFails to actually exercise the "socket exists but Dial fails" branch. Add TestDialIPAMD_ConnectsViaUnixSocket for the happy path.
- Remove process-wide syscall.Umask manipulation that could affect concurrent goroutines; rely on os.Chmod immediately after socket creation instead.

viveksb007

We should get rid of the TCP fallback, which might hide issues in the Unix socket based implementation.

viveksb007 · 2026-06-24T21:40:45Z

+
+	// TODO: Remove TCP fallback once all nodes run the socket-based IPAMD.
+	log.Debugf("Falling back to TCP connection: %s", ipamdAddress)
+	conn, err := grpcClient.Dial(ipamdAddress, grpc.WithTransportCredentials(insecure.NewCredentials()))


CNI-to-IPAMD communication is node-local, so why do we need this fallback?

This is mainly for upgrade safety.
During a DaemonSet rolling update, kubelet may still invoke the old CNI binary while the new IPAMD has already started with the Unix socket. The TCP fallback ensures those in-flight calls succeed.
Additionally, health probes still use TCP :50051, so the server needs to bind to TCP anyway. The fallback is a cheap safeguard for backward compatibility.

I don't think the combination of an old CNI binary and an already-started new IPAMD will ever happen on a node. See https://kubernetes.io/docs/tasks/manage-daemon/update-daemon-set/

An older reference on the DaemonSet update mechanism (delete, then create): the DaemonSet pod is always deleted first and then created with the updated manifest, so there is a short window of unavailability (kubernetes/kubernetes#48841).

When the new pod comes up, IPAMD won't start until the old CNI binary has been replaced, because the init container runs first.

Nice catch on the health probe; we need to change that as well so it dials the Unix socket that IPAMD creates.

You're right that, given how DaemonSet updates work, the upgrade race shouldn't happen in practice. But I'd prefer to keep the TCP fallback in case the socket is unavailable for reasons such as file deletion or permission issues. We can evaluate removal after 1-2 versions.
For health probes, I agree we should ideally migrate to the Unix socket, but that requires Helm chart and addon config changes plus an evaluation of the rollout/rollback strategy. That is unrelated to the security fix, so it is better to decouple it.

> We can evaluate removal after 1-2 versions.

What data will support this?

How would we know whether the CNI binary is using the Unix socket or the TCP fallback?

For health probes, I think we just need to change our gRPC health probe. Can you double-check what changes are needed to release this change via AddOns?
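One possible way to get data on which transport is exercised, offered only as a sketch and not something this PR implements, is to count accepted connections per listener on the long-lived IPAMD side. `countingListener` and `countDemo` are hypothetical names. Since health probes also connect over TCP, a TCP count gathered this way would include probe traffic.

```go
package main

import (
	"fmt"
	"net"
	"sync/atomic"
)

// countingListener is a hypothetical wrapper that counts accepted
// connections, so each transport's usage can be logged or exported.
type countingListener struct {
	net.Listener
	accepted atomic.Int64
}

// Accept delegates to the wrapped listener and counts successful accepts.
func (l *countingListener) Accept() (net.Conn, error) {
	conn, err := l.Listener.Accept()
	if err == nil {
		l.accepted.Add(1)
	}
	return conn, err
}

// countDemo opens two client connections against a wrapped TCP listener
// and returns how many accepts were counted.
func countDemo() (int64, error) {
	base, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	l := &countingListener{Listener: base}
	defer l.Close()

	for i := 0; i < 2; i++ {
		client, err := net.Dial("tcp", base.Addr().String())
		if err != nil {
			return 0, err
		}
		server, err := l.Accept()
		if err != nil {
			client.Close()
			return 0, err
		}
		server.Close()
		client.Close()
	}
	return l.accepted.Load(), nil
}

func main() {
	n, err := countDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("accepted connections:", n)
}
```

Because the wrapper satisfies `net.Listener`, it could in principle be handed to a gRPC server's `Serve` call in place of the raw listener, one wrapper per transport.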

jaydeokar · 2026-06-25T19:34:52Z

 	assert.True(t, resp.MultiNICEnabled)
 }
+
+func TestRunRPCHandler_UnixSocket(t *testing.T) {


Can you add other test cases here, such as removing an existing socket and a failing permission change?

Added tests for stale socket removal and directory creation. Testing the permission failure would need an OS abstraction layer (not used elsewhere in the codebase), which may not be worth adding because the failure handling is straightforward: close listener -> remove socket -> return error.

jaydeokar · 2026-06-25T19:40:53Z

+	// TCP must bind successfully — health probes and waitForIPAM depend on it.
+	tcpListener, err := net.Listen("tcp", ipamdgRPCaddress)
+	if err != nil {
+		listener.Close()


Why do we have to close the Unix socket listener if the TCP one fails?

TCP fails -> health probe fails -> container restarts. Closing the listener just handles the restart gracefully.
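The cleanup being discussed can be sketched as below. This is an illustration with the standard library only; `bindListeners` and `cleanupDemo` are hypothetical names, and the demo forces the TCP bind to fail by occupying the port beforehand.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"net"
	"os"
	"path/filepath"
)

// bindListeners is a hypothetical helper: it binds the Unix socket first,
// then the TCP address, and undoes the first bind if the second fails.
func bindListeners(socketPath, tcpAddr string) (net.Listener, net.Listener, error) {
	unixL, err := net.Listen("unix", socketPath)
	if err != nil {
		return nil, nil, fmt.Errorf("unix listen: %w", err)
	}
	tcpL, err := net.Listen("tcp", tcpAddr)
	if err != nil {
		// Closing a listener created by net.Listen also unlinks the socket
		// file, so a restarted process starts from a clean state.
		unixL.Close()
		return nil, nil, fmt.Errorf("tcp listen: %w", err)
	}
	return unixL, tcpL, nil
}

// cleanupDemo reports whether the bind failed and whether the socket file
// was removed afterwards.
func cleanupDemo() (bool, bool, error) {
	dir, err := os.MkdirTemp("", "ipamd-demo")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	// Occupy a TCP port so the second bind below fails.
	busy, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, false, err
	}
	defer busy.Close()

	sock := filepath.Join(dir, "ipamd.sock")
	unixL, tcpL, bindErr := bindListeners(sock, busy.Addr().String())
	if bindErr == nil {
		unixL.Close()
		tcpL.Close()
	}
	_, statErr := os.Stat(sock)
	return bindErr != nil, errors.Is(statErr, fs.ErrNotExist), nil
}

func main() {
	failed, removed, err := cleanupDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("bind failed: %v, socket removed: %v\n", failed, removed)
}
```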

oliviassss · 2026-06-25T22:57:08Z

The govuln failure is unrelated to this change, and there is no fixed version yet:

    Found in: github.com/containerd/containerd@v1.7.33
    Fixed in: N/A

oliviassss requested a review from a team as a code owner June 23, 2026 21:27

oliviassss requested a review from Copilot June 23, 2026 21:27

Copilot started reviewing on behalf of oliviassss June 23, 2026 21:27

Copilot AI reviewed Jun 23, 2026

Comment thread pkg/ipamd/rpc_handler.go Outdated

Comment thread pkg/ipamd/rpc_handler.go

Comment thread pkg/ipamd/rpc_handler_test.go

Move gRPC server to Unix socket and scrub introspection data

de4370d

oliviassss force-pushed the fix/ipamd-grpc-unix-socket-auth branch from 7c0763e to de4370d Compare June 23, 2026 21:48

oliviassss requested a review from Copilot June 23, 2026 21:49

Copilot started reviewing on behalf of oliviassss June 23, 2026 21:50

Copilot AI reviewed Jun 23, 2026

Comment thread pkg/ipamd/rpc_handler.go Outdated

Comment thread pkg/ipamd/rpc_handler_test.go

Comment thread cmd/routed-eni-cni-plugin/cni_test.go Outdated

Comment thread cmd/routed-eni-cni-plugin/cni_test.go

viveksb007 reviewed Jun 24, 2026

jaydeokar reviewed Jun 25, 2026

Add unit tests for stale socket removal and socket directory creation

99a69ec
