Skip to content

test/eni-subnet-discovery: size replicas by instance ENI capacity and harden cleanup#3742

Draft
yash97 wants to merge 1 commit into
aws:masterfrom
yash97:subnet_test
Draft

test/eni-subnet-discovery: size replicas by instance ENI capacity and harden cleanup#3742
yash97 wants to merge 1 commit into
aws:masterfrom
yash97:subnet_test

Conversation

@yash97

@yash97 yash97 commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

What type of PR is this?
testing

Which issue does this PR fix?:
N/A — stabilizes the existing eni-subnet-discovery integration suite (post-MAO networking-addons canary).

What does this PR do / Why do we need it?:
The eni-subnet-discovery integration suite pinned deployments to a single node with hardcoded replica counts (Replicas(50)/(30)/(25)/(15), scaleDeployment(20/25)) and used a hardcoded, account-global IAM policy name. On smaller instance types these counts exceed node pod capacity (e.g. m5.large: maxPods 29, ENILimit 3, 9 usable pods/ENI), causing pods stuck in OutOfpods and deployment-ready timeouts; and the fixed policy name collided with leftover policies from prior/concurrent runs (CreatePolicyEntityAlreadyExists).

Changes:

  • Add two helpers that derive the replica count from the CNI's static per-instance-type limits DB (pkg/vpc.GetIPv4Limit):
    • computeReplicasForSecondaryENI = IPv4Limit - 1 (one ENI's worth of pods).
    • computeReplicasForBothSubnets = IPv4Limit + 1 (fills the primary ENI and overflows onto a secondary ENI, so pods span both the primary and the discovered secondary subnet).
      Specs that must create/populate a secondary ENI without excluding the primary subnet now use computeReplicasForBothSubnets; the rest use computeReplicasForSecondaryENI. Both fail loudly if the instance type is absent from the limits DB rather than guessing.
  • Use a unique per-run IAM policy name and make its cleanup best-effort, so a leaked policy is just an unused detached policy rather than a permanent CreatePolicy collision.
  • Comment out the custom security-group discovery contexts (custom SG / auto-refresh / tag-removal): that feature was reverted in Disabling SG discovery for ESD feature #3720 (RefreshCustomSGIDs now always falls back to the node's primary SGs), so those assertions cannot pass against current VPC CNI. Kept the negative "untagged SG is not applied" spec, which holds under the primary-SG fallback.

Testing done on this change:
Ran the suite against an EKS cluster (m5.large nodes) with the changes:

  • The base "should have subnet in CIDR range" spec and the secondary-exclusion specs pass; observed pod distribution 9 pods in primary subnet + 2 in secondary subnet (computed replicas = 11 for m5.large), and 2 new pods, 0 in excluded secondary subnet for the scale-after-exclusion case. No OutOfpods and no EntityAlreadyExists policy collisions.
  • Focused re-run of the two secondary-exclusion specs: 2 Passed | 0 Failed.
  • go vet and go test -c (compile) pass on the package.

Will this PR introduce any new dependencies?:
No. Uses the existing in-repo pkg/vpc limits DB. No new AWS/IMDS API calls beyond what the suite already makes.

Will this break upgrades or downgrades? Has updating a running cluster been tested?:
No — test-only change.

Does this change require updates to the CNI daemonset config files to work?:
No.

Does this PR introduce any user-facing change?:
No.

NONE

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

… harden cleanup

The eni-subnet-discovery integration suite pinned deployments to a single node
with hardcoded replica counts (Replicas(50)/(30)/(25)/(15), scaleDeployment(20/25))
and used a hardcoded account-global IAM policy name. On smaller instance types
these counts exceed node pod capacity (e.g. m5.large: maxPods 29, ENILimit 3,
9 usable pods/ENI), causing OutOfpods and deployment-ready timeouts, and the
fixed policy name collided with leftover policies from prior runs
(CreatePolicy -> EntityAlreadyExists).

Changes:
- Add computeReplicasForSecondaryENI (IPv4Limit-1, one ENI's worth) and
  computeReplicasForBothSubnets (IPv4Limit+1, fills primary ENI and overflows
  to a secondary ENI) helpers that derive the replica count from the CNI's
  static per-instance-type limits DB (pkg/vpc.GetIPv4Limit). Specs that must
  create/populate a secondary ENI without excluding the primary subnet use
  computeReplicasForBothSubnets; the rest use computeReplicasForSecondaryENI.
- Use a unique per-run IAM policy name and make its cleanup best-effort so a
  leaked policy is just an unused detached policy rather than a permanent
  CreatePolicy collision.
- Comment out the custom security-group discovery contexts: that feature was
  reverted in aws#3720 (RefreshCustomSGIDs now always falls back to primary SGs),
  so those assertions cannot pass against current VPC CNI.

Signed-off-by: yathakka <yathakka@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant