test/eni-subnet-discovery: size replicas by instance ENI capacity and harden cleanup #3742
Draft
yash97 wants to merge 1 commit into
… harden cleanup

The eni-subnet-discovery integration suite pinned deployments to a single node with hardcoded replica counts (Replicas(50)/(30)/(25)/(15), scaleDeployment(20/25)) and used a hardcoded account-global IAM policy name. On smaller instance types these counts exceed node pod capacity (e.g. m5.large: maxPods 29, ENILimit 3, 9 usable pods/ENI), causing OutOfpods and deployment-ready timeouts, and the fixed policy name collided with leftover policies from prior runs (CreatePolicy -> EntityAlreadyExists).

Changes:
- Add computeReplicasForSecondaryENI (IPv4Limit-1, one ENI's worth) and computeReplicasForBothSubnets (IPv4Limit+1, fills primary ENI and overflows to a secondary ENI) helpers that derive the replica count from the CNI's static per-instance-type limits DB (pkg/vpc.GetIPv4Limit). Specs that must create/populate a secondary ENI without excluding the primary subnet use computeReplicasForBothSubnets; the rest use computeReplicasForSecondaryENI.
- Use a unique per-run IAM policy name and make its cleanup best-effort so a leaked policy is just an unused detached policy rather than a permanent CreatePolicy collision.
- Comment out the custom security-group discovery contexts: that feature was reverted in aws#3720 (RefreshCustomSGIDs now always falls back to primary SGs), so those assertions cannot pass against current VPC CNI.

Signed-off-by: yathakka <yathakka@amazon.com>
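The replica sizing described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the map `ipv4LimitByInstanceType`, the helper `ipv4Limit`, and the function signatures are assumptions standing in for the lookup the PR performs through `pkg/vpc.GetIPv4Limit`. The only data point included is the `m5.large` value implied by the PR's own numbers (9 usable pod IPs per ENI, 11 computed replicas).

```go
package main

import (
	"errors"
	"fmt"
)

// ipv4LimitByInstanceType is a hypothetical stand-in for the CNI's static
// per-instance-type limits DB. It holds only the m5.large value implied by
// this PR (10 IPv4 addresses per ENI, one of which is the ENI's own).
var ipv4LimitByInstanceType = map[string]int{
	"m5.large": 10,
}

var errUnknownInstanceType = errors.New("instance type not in limits DB")

// ipv4Limit returns the per-ENI IPv4 limit, or an error when the instance
// type is unknown, so callers fail loudly instead of guessing a count.
func ipv4Limit(instanceType string) (int, error) {
	limit, ok := ipv4LimitByInstanceType[instanceType]
	if !ok {
		return 0, fmt.Errorf("%w: %q", errUnknownInstanceType, instanceType)
	}
	return limit, nil
}

// computeReplicasForSecondaryENI returns one ENI's worth of pods.
func computeReplicasForSecondaryENI(instanceType string) (int, error) {
	limit, err := ipv4Limit(instanceType)
	if err != nil {
		return 0, err
	}
	return limit - 1, nil
}

// computeReplicasForBothSubnets returns enough pods to fill the primary ENI
// and overflow onto a secondary ENI.
func computeReplicasForBothSubnets(instanceType string) (int, error) {
	limit, err := ipv4Limit(instanceType)
	if err != nil {
		return 0, err
	}
	return limit + 1, nil
}

func main() {
	one, err := computeReplicasForSecondaryENI("m5.large")
	if err != nil {
		panic(err)
	}
	both, err := computeReplicasForBothSubnets("m5.large")
	if err != nil {
		panic(err)
	}
	fmt.Printf("m5.large: one ENI = %d replicas, both subnets = %d replicas\n", one, both)
}
```

For `m5.large` this yields 9 and 11 replicas, which stays well under the node's maxPods of 29, unlike the previous hardcoded counts of up to 50.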
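The IAM policy change in the commit message (a unique per-run name plus best-effort cleanup) might look something like the sketch below. Everything named here is hypothetical: the `uniquePolicyName` and `cleanupPolicyBestEffort` helpers, the `policyDeleter` interface, the name prefix, and the placeholder account ID are illustrative and do not come from the PR. The random-suffix scheme is one possible way to make the name unique; the PR does not say which scheme it uses.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
	"log"
)

// uniquePolicyName appends 8 random hex characters to the prefix so that
// repeated or concurrent runs do not reuse the same account-global name.
func uniquePolicyName(prefix string) (string, error) {
	suffix := make([]byte, 4)
	if _, err := rand.Read(suffix); err != nil {
		return "", fmt.Errorf("generating policy name suffix: %w", err)
	}
	return fmt.Sprintf("%s-%s", prefix, hex.EncodeToString(suffix)), nil
}

// policyDeleter is a hypothetical abstraction over whatever IAM client the
// suite uses; it exists only to keep this sketch self-contained.
type policyDeleter interface {
	DeletePolicy(arn string) error
}

// cleanupPolicyBestEffort logs a failed delete instead of failing the run.
// It reports whether the policy was actually removed.
func cleanupPolicyBestEffort(d policyDeleter, arn string) bool {
	if err := d.DeletePolicy(arn); err != nil {
		log.Printf("best-effort cleanup: leaving policy %s behind: %v", arn, err)
		return false
	}
	return true
}

// failingDeleter simulates a delete that always fails.
type failingDeleter struct{}

func (failingDeleter) DeletePolicy(arn string) error {
	return errors.New("simulated DeletePolicy failure")
}

func main() {
	name, err := uniquePolicyName("eni-subnet-discovery-test") // hypothetical prefix
	if err != nil {
		panic(err)
	}
	fmt.Println("policy name for this run:", name)

	// 123456789012 is a placeholder account ID.
	arn := "arn:aws:iam::123456789012:policy/" + name
	if !cleanupPolicyBestEffort(failingDeleter{}, arn) {
		fmt.Println("cleanup failed; continuing without failing the suite")
	}
}
```

With a name that changes every run, a policy left behind by a failed delete is only an unused detached policy and can no longer cause `EntityAlreadyExists` on the next `CreatePolicy` call.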
What type of PR is this?
testing
Which issue does this PR fix?:
N/A. Stabilizes the existing `eni-subnet-discovery` integration suite (post-MAO networking-addons canary).
What does this PR do / Why do we need it?:
The `eni-subnet-discovery` integration suite pinned deployments to a single node with hardcoded replica counts (`Replicas(50)/(30)/(25)/(15)`, `scaleDeployment(20/25)`) and used a hardcoded, account-global IAM policy name. On smaller instance types these counts exceed node pod capacity (e.g. `m5.large`: maxPods 29, ENILimit 3, 9 usable pods/ENI), leaving pods stuck in `OutOfpods` and causing deployment-ready timeouts. The fixed policy name also collided with leftover policies from prior/concurrent runs (`CreatePolicy` → `EntityAlreadyExists`).
Changes:
- Add two helpers that derive the replica count from the CNI's static per-instance-type limits DB (`pkg/vpc.GetIPv4Limit`):
  - `computeReplicasForSecondaryENI` = `IPv4Limit - 1` (one ENI's worth of pods).
  - `computeReplicasForBothSubnets` = `IPv4Limit + 1` (fills the primary ENI and overflows onto a secondary ENI, so pods span both the primary and the discovered secondary subnet).

  Specs that must create/populate a secondary ENI without excluding the primary subnet now use `computeReplicasForBothSubnets`; the rest use `computeReplicasForSecondaryENI`. Both fail loudly if the instance type is absent from the limits DB rather than guessing.
- Use a unique per-run IAM policy name and make its cleanup best-effort, so a leaked policy is just an unused detached policy rather than a permanent `CreatePolicy` collision.
- Comment out the custom security-group discovery contexts: that feature was reverted in aws#3720 (`RefreshCustomSGIDs` now always falls back to the node's primary SGs), so those assertions cannot pass against current VPC CNI. Kept the negative "untagged SG is not applied" spec, which holds under the primary-SG fallback.
Testing done on this change:
Ran the suite against an EKS cluster (`m5.large` nodes) with the changes:
- `9 pods in primary subnet + 2 in secondary subnet` (computed replicas = 11 for `m5.large`), and `2 new pods, 0 in excluded secondary subnet` for the scale-after-exclusion case. No `OutOfpods` and no `EntityAlreadyExists` policy collisions.
- `2 Passed | 0 Failed`.
- `go vet` and `go test -c` (compile) pass on the package.
Will this PR introduce any new dependencies?:
No. Uses the existing in-repo `pkg/vpc` limits DB. No new AWS/IMDS API calls beyond what the suite already makes.
Will this break upgrades or downgrades? Has updating a running cluster been tested?:
No — test-only change.
Does this change require updates to the CNI daemonset config files to work?:
No.
Does this PR introduce any user-facing change?:
No.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.