Move gRPC server to Unix socket and scrub introspection data #3739

Open
oliviassss wants to merge 3 commits into aws:master from oliviassss:fix/ipamd-grpc-unix-socket-auth

@oliviassss oliviassss commented Jun 23, 2026

Contributor

What type of PR is this?

bug
Which issue does this PR fix?:

IPAMD IP hijack issue; more detail is in internal docs.

What does this PR do / Why do we need it?:

  1. Moves the gRPC server to a Unix domain socket (/var/run/aws-node/ipamd.sock, permissions 0660). The CNI plugin runs as root (invoked by kubelet) and can connect; unprivileged hostNetwork pods cannot access the socket due to filesystem permissions and container mount namespace isolation.
  2. Adds a TCP fallback for backward compatibility during rolling upgrades. The CNI plugin tries the Unix socket first; if the socket doesn't exist (old IPAMD), it falls back to TCP. The TCP bind remains mandatory at startup, as before, because health probes and waitForIPAM depend on it.
  3. Scrubs ContainerID and IfName from the /v1/enis introspection response to block the reconnaissance step. The endpoint remains functional for debugging (ENI IDs, IP addresses, pod names) but no longer exposes the fields required to construct a valid DelNetwork call.

Testing done on this change:

Unit tests

  $ go test ./pkg/ipamd/... ./cmd/routed-eni-cni-plugin/... -count=1
  ok   github.com/aws/amazon-vpc-cni-k8s/pkg/ipamd          0.332s
  ok   github.com/aws/amazon-vpc-cni-k8s/pkg/ipamd/datastore    32.032s
  ok   github.com/aws/amazon-vpc-cni-k8s/cmd/routed-eni-cni-plugin  0.017s
  ok   github.com/aws/amazon-vpc-cni-k8s/cmd/routed-eni-cni-plugin/driver  0.019s

New tests added

 - TestRunRPCHandler_UnixSocket — verifies socket creation and 0660 permissions
 - TestEniV1RequestHandler_ScrubsSensitiveFields — verifies containerID/ifName removed
 - TestEniV1RequestHandler_ScrubsIPv6Fields — verifies IPv6 path scrubbing
 - TestDialIPAMD_FallsBackToTCPWhenSocketMissing — socket path absent, uses TCP
 - TestDialIPAMD_FallsBackToTCPWhenSocketDialFails — socket exists but Dial fails, uses TCP
 - TestDialIPAMD_ConnectsViaUnixSocket — socket exists and Dial succeeds

Integration tests (IPAMD suite, EKS v1.35.5, m5.xlarge nodes)

  Ran 26 of 26 Specs in 2491.022 seconds
  26 Passed | 0 Failed | 0 Pending | 0 Skipped

Manual verification on live cluster (VPC CNI sec-fix.v0.1); details are in the internal doc.

Node logs confirming the socket listener:

  {"level":"info","ts":"2026-06-22T22:25:32.576Z","msg":"Serving RPC Handler version  on unix:/var/run/aws-node/ipamd.sock"}
  {"level":"info","ts":"2026-06-22T22:25:32.576Z","msg":"Starting TCP fallback gRPC listener on 127.0.0.1:50051"}

Will this PR introduce any new dependencies?:

No
Will this break upgrades or downgrades? Has updating a running cluster been tested?:
No. Backward compatibility is maintained via TCP fallback:

  • New IPAMD + new CNI plugin: Uses Unix socket (secure)
  • New IPAMD + old CNI plugin: Old plugin uses TCP fallback (still works)
  • Old IPAMD + new CNI plugin: Socket doesn't exist, falls back to TCP (still works)

Rolling upgrade tested on live cluster — pods continued to receive IPs throughout the daemonset rollout.

Does this change require updates to the CNI daemonset config files to work?:

Does this PR introduce any user-facing change?:

no


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@oliviassss oliviassss requested a review from a team as a code owner June 23, 2026 21:27
@oliviassss oliviassss requested a review from Copilot June 23, 2026 21:27

Copilot AI left a comment

Pull request overview

This PR hardens IPAMD-to-CNI communication and introspection outputs to mitigate local reconnaissance and IP hijack vectors by moving the primary gRPC transport to a Unix domain socket and scrubbing sensitive fields from /v1/enis.

Changes:

  • IPAMD gRPC server now listens on a Unix socket (/var/run/aws-node/ipamd.sock) with restrictive permissions, while also starting a TCP listener for upgrade compatibility.
  • CNI plugin now dials IPAMD via Unix socket first and falls back to TCP if the socket is unavailable.
  • /v1/enis introspection response now clears ContainerID and IfName from returned IP assignment data; added tests for IPv4/IPv6 scrubbing.
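
To make the scrubbing change concrete, the sketch below clears the two sensitive fields on a copy of the assignment data before it is serialized. The `ipamKey` and `addressInfo` types are simplified, hypothetical stand-ins for the datastore types, and `scrubAssignments` is an illustrative name, not the handler's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ipamKey and addressInfo are simplified, hypothetical stand-ins for the
// datastore types returned by /v1/enis.
type ipamKey struct {
	NetworkName string `json:"networkName"`
	ContainerID string `json:"containerID"`
	IfName      string `json:"ifName"`
}

type addressInfo struct {
	Address string  `json:"address"`
	IPAMKey ipamKey `json:"ipamKey"`
}

// scrubAssignments returns a copy of in with ContainerID and IfName cleared.
// The input slice is left untouched so the live datastore is not modified.
func scrubAssignments(in []addressInfo) []addressInfo {
	out := make([]addressInfo, len(in))
	copy(out, in) // the structs hold no pointers, so this copy is independent
	for i := range out {
		out[i].IPAMKey.ContainerID = ""
		out[i].IPAMKey.IfName = ""
	}
	return out
}

func main() {
	live := []addressInfo{{
		Address: "10.0.0.5",
		IPAMKey: ipamKey{NetworkName: "aws-cni", ContainerID: "abc123", IfName: "eth0"},
	}}
	body, err := json.Marshal(scrubAssignments(live))
	if err != nil {
		fmt.Println("marshal:", err)
		return
	}
	fmt.Println(string(body))
	fmt.Println("live data still has container ID:", live[0].IPAMKey.ContainerID != "")
}
```

Scrubbing a copy rather than the live records matters because the same records are still needed internally to match DelNetwork requests.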

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| pkg/ipamd/rpc_handler.go | Switch gRPC server to a Unix socket and add a TCP fallback listener. |
| pkg/ipamd/rpc_handler_test.go | Add a unit test validating Unix socket creation/permissions. |
| pkg/ipamd/introspect.go | Scrub ContainerID/IfName from /v1/enis response payload. |
| pkg/ipamd/introspect_test.go | Add tests verifying sensitive-field scrubbing for IPv4/IPv6. |
| cmd/routed-eni-cni-plugin/cni.go | Prefer dialing IPAMD via Unix socket with TCP fallback for upgrades. |

Comment thread pkg/ipamd/rpc_handler.go Outdated
Comment thread pkg/ipamd/rpc_handler.go
Comment thread pkg/ipamd/rpc_handler_test.go

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Comment thread pkg/ipamd/rpc_handler.go Outdated
Comment thread pkg/ipamd/rpc_handler_test.go
Comment thread cmd/routed-eni-cni-plugin/cni_test.go Outdated
Comment thread cmd/routed-eni-cni-plugin/cni_test.go
…ss-wide umask

- Make TCP listener bind mandatory at startup (matching pre-existing behavior)
  since health probes and waitForIPAM depend on port 50051. If TCP fails to
  bind, the process exits so kubelet can restart it.

- Extract dialIPAMDWithSocketPath to allow test injection of the socket path.
  Fix TestDialIPAMD_FallsBackToTCPWhenSocketDialFails to actually exercise the
  "socket exists but Dial fails" branch. Add TestDialIPAMD_ConnectsViaUnixSocket
  for the happy path.

- Remove process-wide syscall.Umask manipulation that could affect concurrent
  goroutines; rely on os.Chmod immediately after socket creation instead.

@viveksb007 viveksb007 left a comment

Contributor

We should get rid of the TCP fallback, which might hide issues if anything is wrong with the Unix-socket-based implementation.


// TODO: Remove TCP fallback once all nodes run the socket-based IPAMD.
log.Debugf("Falling back to TCP connection: %s", ipamdAddress)
conn, err := grpcClient.Dial(ipamdAddress, grpc.WithTransportCredentials(insecure.NewCredentials()))

Contributor

CNI-to-IPAMD communication is node-local, so why do we need this fallback?

Contributor Author

This is mainly for upgrade safety.
During a DS rolling update, kubelet may still invoke the old CNI binary while the new IPAMD has started with the Unix socket. The TCP fallback ensures those in-flight calls succeed.
Additionally, the health probe still uses TCP :50051, so the server needs to bind to TCP anyway. The fallback is a cheap safeguard for backward compatibility.

Contributor

I don't think the combination of an old CNI binary with a newly started IPAMD will ever happen on a node. Check this out: https://kubernetes.io/docs/tasks/manage-daemon/update-daemon-set/

There is also an older reference on the DS update mechanism (delete, then create): the DS is unavailable for a short window because the DS pod is always deleted first and then recreated with the updated manifest (kubernetes/kubernetes#48841).

When the new pod comes up, IPAMD won't start until the old CNI binary has been replaced, because the init container runs first.

Nice catch on the health probe; we need to change that as well to dial the Unix socket that IPAMD creates.

Contributor Author

You're right about how DS updates work, and the upgrade race shouldn't happen in practice. But I'd prefer to keep the TCP fallback in case the socket is unavailable for reasons like file deletion or permission issues. We can evaluate removal after 1-2 versions.
For health probes, I agree we should ideally migrate to the Unix socket, but that requires Helm chart and addon config changes plus an evaluation of the rollout/rollback strategy. That's unrelated to the security fix, so it's better to decouple it.

Contributor

> We can evaluate removal after 1-2 versions.

What data will support this?

How would we know whether the CNI binary is exercising the Unix socket or using the TCP fallback?

For health probes, I think we just need to change our gRPC health probe. Can you double-check what changes are needed to release this change via AddOns?

Comment thread pkg/ipamd/introspect.go
assert.True(t, resp.MultiNICEnabled)
}

func TestRunRPCHandler_UnixSocket(t *testing.T) {

Collaborator

Can you add other test cases here? For example, removing existing sockets and the permission change failing.

Contributor Author

Added tests for stale socket removal and directory creation. The permission-change failure case would need an OS abstraction layer (not used elsewhere in the codebase), which may not be worth adding since the failure handling is straightforward: close listener -> remove socket -> return error.

Comment thread pkg/ipamd/rpc_handler.go
// TCP must bind successfully — health probes and waitForIPAM depend on it.
tcpListener, err := net.Listen("tcp", ipamdgRPCaddress)
if err != nil {
listener.Close()

Collaborator

Why do we have to close the Unix socket listener if the TCP one fails?

@oliviassss oliviassss Jun 25, 2026

Contributor Author

TCP bind fails -> health probe fails -> container restarts. Closing the socket listener is just to handle the restart gracefully.

@oliviassss

Contributor Author

The govuln failure is unrelated to this change, and there is no fixed version yet:

    Found in: github.com/containerd/containerd@v1.7.33
    Fixed in: N/A
