fix: clean up orphaned mapt VPCs without NAT gws#299
Conversation
a3f6367 to
7bb6288
Compare
Code Review by Qodo
1. EIP release may fail
|
PR Summary by QodoFix mapt cleanup: delete orphaned VPCs without NAT gateways + LB/ENI race handling WalkthroughsDescription• Add a VPC discovery pass for mapt clusters that run without NAT gateways (kind/NatGatewayModeNone). • Only delete tagged VPCs with no active EC2 instances and past the age threshold. • Poll LB deletion and skip ELB-managed ENIs to avoid async cleanup races. Diagramgraph TD
A["delete-mapt-clusters.sh"] --> B["Discover VPCs via NAT GW tags"] --> F["Cleanup dependencies"] --> G["Final VPC delete"]
A --> C["Discover VPCs via VPC tag (origin=mapt)"] --> D["Check EC2 instances (active/terminated+age)"] --> F
F --> E["Delete & poll Load Balancers"]
F --> H["Delete ENIs/EIPs (skip ELB ENIs)"]
High-Level AssessmentThe following are alternative approaches to this PR: 1. Use Resource Groups Tagging API for discovery
2. Move to TTL-based lifecycle management (tag + scheduled janitor)
3. Rely on mapt-managed teardown (make deletion idempotent)
Recommendation: Current approach is appropriate as a pragmatic remediation: it keeps the existing NAT-based path, adds a tag-based backstop for kind clusters, and introduces LB polling/ELB-ENI exclusions to reduce race-related failures. Longer-term, consider adding TTL tags at provision time to reduce reliance on inference (instance presence/age) and simplify discovery. File ChangesBug fix (1)
|
| # --- 3.3 Standalone EIPs tagged with origin=mapt in this VPC --- | ||
| # (EIPs not attached to an ENI won't be caught above) | ||
| standalone_eips=$(aws ec2 describe-addresses --region "$region" --filters "$FILTER" --query "Addresses[?AssociationId==null].AllocationId" --output text 2>/dev/null) | ||
| for eip_alloc_id in $standalone_eips; do | ||
| delete_resource "EIP" "$eip_alloc_id" "$region" "Orphaned standalone EIP" | ||
| regional_eips=$((regional_eips + 1)) | ||
| done |
There was a problem hiding this comment.
1. Unscoped eip deletion 🐞 Bug ≡ Correctness
In the orphan-VPC cleanup path, standalone_eips is computed only by origin=mapt and AssociationId==null, so any unassociated tagged EIP in the *region* can be released while processing an unrelated orphan VPC. This can break other mapt clusters/workflows and will be re-attempted once per orphan VPC processed.
Agent Prompt
## Issue description
The orphan-VPC cleanup loop releases **all** unassociated EIPs tagged `origin=mapt` in the region, even if they do not belong to the orphan VPC currently being cleaned up.
## Issue Context
EIPs do not have a direct VPC linkage when unassociated, so a tag-only regional query can impact unrelated clusters.
## Fix Focus Areas
- scripts/mapt/delete-mapt-clusters.sh[549-555]
## Suggested fix approach
- Remove the per-orphan-VPC regional EIP deletion loop, OR
- Restrict EIP cleanup to EIPs that can be safely attributed to the orphan VPC/project, e.g.:
- Filter by `projectName==$project_name` (if EIPs are tagged with projectName), and/or
- Only delete EIPs that were previously attached to ENIs in the orphan VPC (track those ENI->EIP allocation IDs before deletion), and/or
- Run a single regional pass *after* computing a set of orphan project names, and delete only EIPs whose tags match those orphan projects.
- Ensure the EIP cleanup is not executed repeatedly per VPC (avoid duplicate delete attempts).
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| orphan_vpcs=$( | ||
| aws ec2 describe-vpcs \ | ||
| --region "$region" \ | ||
| --filters "$FILTER" \ | ||
| --query 'Vpcs[].VpcId' \ | ||
| --output text 2>/dev/null | ||
| ) | ||
|
|
||
| for vpc_id in $orphan_vpcs; do | ||
| # Skip VPCs already handled by the NAT gateway path | ||
| if echo "$counted_vpcs_string" | grep -q "|${vpc_id}|"; then | ||
| continue | ||
| fi | ||
|
|
||
| # Check for any non-terminated EC2 instances in this VPC | ||
| active_instances=$( | ||
| aws ec2 describe-instances \ | ||
| --region "$region" \ | ||
| --filters "Name=vpc-id,Values=$vpc_id" \ | ||
| --query 'Reservations[].Instances[?State.Name!=`terminated`].InstanceId' \ | ||
| --output text 2>/dev/null | ||
| ) | ||
| if [ -n "$active_instances" ]; then | ||
| continue | ||
| fi | ||
|
|
||
| # Check for terminated instances to apply the age limit | ||
| terminated_data=$( | ||
| aws ec2 describe-instances \ | ||
| --region "$region" \ | ||
| --filters "Name=vpc-id,Values=$vpc_id" "Name=instance-state-name,Values=terminated" \ | ||
| --query 'Reservations[].Instances[].LaunchTime' \ | ||
| --output text 2>/dev/null | ||
| ) | ||
| if [ -n "$terminated_data" ]; then | ||
| # Use the most recent LaunchTime among terminated instances | ||
| skip_vpc=false | ||
| for launch_time in $terminated_data; do | ||
| age_seconds=$(get_age_seconds "$launch_time") | ||
| if [ "$age_seconds" -le "$AGE_LIMIT_SECONDS" ]; then | ||
| skip_vpc=true | ||
| break | ||
| fi | ||
| done | ||
| if $skip_vpc; then | ||
| continue | ||
| fi | ||
| fi | ||
| # No instances at all (purged from API >1h after termination) or | ||
| # only terminated instances older than the age limit — VPC is orphaned. | ||
|
|
||
| project_name=$( | ||
| aws ec2 describe-vpcs \ | ||
| --region "$region" \ | ||
| --vpc-ids "$vpc_id" \ | ||
| --query "Vpcs[0].Tags[?Key=='$PROJECT_TAG_KEY'].Value" \ | ||
| --output text 2>/dev/null | ||
| ) | ||
| project_name="${project_name:-unknown}" | ||
|
|
||
| echo " --------------------------------------------------" | ||
| echo " ✅ ORPHANED VPC FOUND (no NAT Gateway, no active instances)" | ||
| echo " VPC ID: $vpc_id" | ||
| echo " Project: $project_name" | ||
|
|
||
| regional_vpc_count=$((regional_vpc_count + 1)) | ||
| regional_vpcs_to_delete="$regional_vpcs_to_delete $vpc_id" | ||
| counted_vpcs_string="${counted_vpcs_string}${vpc_id}|" |
There was a problem hiding this comment.
2. Orphan vpc bypasses age 🐞 Bug ≡ Correctness
The orphan-VPC path marks a VPC for deletion whenever there are no active instances, but if the instance API returns no terminated instances (instances purged or never created) it skips the AGE_LIMIT_SECONDS guard entirely. This contradicts the documented “older than 1 day” safety limit and can delete newly created/provisioning origin=mapt VPCs.
Agent Prompt
## Issue description
Orphan VPC deletion can proceed without any age verification when `terminated_data` is empty, allowing deletion of recently created/provisioning VPCs tagged `origin=mapt`.
## Issue Context
AWS VPC APIs do not expose VPC creation time directly. The current logic only applies age checks when terminated instances are visible.
## Fix Focus Areas
- scripts/mapt/delete-mapt-clusters.sh[444-511]
- scripts/mapt/README.md[5-6]
## Suggested fix approach
Implement a reliable age signal for VPCs with no EC2 history, and *only* proceed when that signal is older than `AGE_LIMIT_SECONDS`. Options:
- Preferably look up the VPC creation event time via CloudTrail (if available in this environment) and compute age from that.
- If CloudTrail is not available, use a conservative fallback: **skip deletion** when no age signal is available (log a warning so it can be investigated).
- Keep the existing terminated-instance-based check as an additional guard.
This preserves the documented “older than 1 day” behavior while still allowing safe cleanup when an age source exists.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
mapt kind clusters use NatGatewayModeNone, so the existing NAT-gateway-based VPC discovery never finds them. This left hundreds of orphaned VPCs with their networking resources (IGWs, subnets, SGs, route tables) accumulating across us-west-2, us-east-1, and us-east-2. Add a second discovery pass that finds origin=mapt VPCs directly, skips any with active EC2 instances, and cleans up dependencies. Also add LB deletion polling to avoid async race conditions with ELB-managed ENIs/EIPs. Assisted-by: Cursor
Standalone EIPs were deleted region-wide inside the per-VPC loop, potentially affecting unrelated mapt clusters. Move to a single post-loop pass filtered by orphan projectName tags. Add CloudTrail-based age check for VPCs with no EC2 instance history to prevent deleting recently provisioned VPCs. Missing CloudTrail events (>90d retention) are treated as old and safe to delete. Assisted-by: Cursor
7bb6288 to
b715681
Compare
|
Code review by qodo was updated up to the latest commit b715681 |
| # Check for any non-terminated EC2 instances in this VPC | ||
| active_instances=$( | ||
| aws ec2 describe-instances \ | ||
| --region "$region" \ | ||
| --filters "Name=vpc-id,Values=$vpc_id" \ | ||
| --query 'Reservations[].Instances[?State.Name!=`terminated`].InstanceId' \ | ||
| --output text 2>/dev/null | ||
| ) | ||
| if [ -n "$active_instances" ]; then | ||
| continue | ||
| fi | ||
|
|
||
| # Check for terminated instances to apply the age limit | ||
| terminated_data=$( | ||
| aws ec2 describe-instances \ | ||
| --region "$region" \ | ||
| --filters "Name=vpc-id,Values=$vpc_id" "Name=instance-state-name,Values=terminated" \ | ||
| --query 'Reservations[].Instances[].LaunchTime' \ | ||
| --output text 2>/dev/null | ||
| ) | ||
| if [ -n "$terminated_data" ]; then | ||
| skip_vpc=false | ||
| for launch_time in $terminated_data; do | ||
| age_seconds=$(get_age_seconds "$launch_time") | ||
| if [ "$age_seconds" -le "$AGE_LIMIT_SECONDS" ]; then | ||
| skip_vpc=true | ||
| break | ||
| fi | ||
| done | ||
| if $skip_vpc; then | ||
| continue | ||
| fi |
There was a problem hiding this comment.
1. None breaks orphan checks 🐞 Bug ≡ Correctness
In the orphan-VPC path, active_instances and terminated_data are treated as “present” based only on -n, so an AWS CLI --output text result of the literal string None will incorrectly skip VPC deletion or feed None into get_age_seconds() and trip the age guard. This can prevent orphaned VPC cleanup from ever running for some VPCs, re-creating the leak this PR is intended to fix.
Agent Prompt
### Issue description
The orphan-VPC cleanup treats any non-empty `--output text` result as real data. When AWS CLI returns the literal string `None`, the script either:
- skips deletion (`active_instances` check), or
- treats `None` as a timestamp, making `get_age_seconds` return `0` and causing the VPC to be skipped as “too new”.
### Issue Context
This script already anticipates `None` from AWS CLI text output in other locations, but the new orphan-VPC branch does not.
### Fix Focus Areas
- scripts/mapt/delete-mapt-clusters.sh[459-490]
- scripts/mapt/delete-mapt-clusters.sh[492-509]
### Proposed fix
- After each AWS CLI `--output text` capture used as a presence check, normalize `None` to empty (or explicitly check `!= "None"`).
- Example:
- `if [ -n "$active_instances" ] && [ "$active_instances" != "None" ]; then ...`
- `if [ -n "$terminated_data" ] && [ "$terminated_data" != "None" ]; then ... else ...`
- Similarly, when iterating timestamps, skip non-timestamps (e.g., `None`) before calling `get_age_seconds`.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
NLB-managed ENIs were fully skipped during orphan VPC cleanup,
leaving their EIPs allocated. This blocked IGW detachment with
DependencyViolation ("mapped public addresses"), preventing VPC
deletion. Release EIPs from ELB ENIs before skipping the ENI.
Assisted-by: Cursor
|
Code review by qodo was updated up to the latest commit 5e573c6 |
| eips_data=$(aws ec2 describe-addresses --region "$region" --query "Addresses[?NetworkInterfaceId=='$eni_id'].AllocationId" --output text 2>/dev/null) | ||
| for eip_alloc_id in $eips_data; do | ||
| delete_resource "EIP" "$eip_alloc_id" "$region" "ELB ENI EIP release" | ||
| regional_eips=$((regional_eips + 1)) | ||
| done |
There was a problem hiding this comment.
1. Eip release may fail 🐞 Bug ☼ Reliability
In the ELB-managed ENI branch, the script queries EIPs by NetworkInterfaceId (i.e., currently associated) and then calls the generic EIP deleter which only runs aws ec2 release-address without disassociating first, so EIP cleanup can fail and block later dependency deletion. The loop also doesn’t guard against AWS CLI text output returning the literal string None, which can trigger an invalid release-address --allocation-id None call.
Agent Prompt
### Issue description
The orphan-VPC cleanup’s ELB-ENI path attempts to release EIPs that are still associated to an ENI by calling `delete_resource "EIP"`, but `delete_resource` uses `aws ec2 release-address` only (no disassociation). This can fail and leave resources that later prevent IGW detach / VPC delete. The loop can also attempt to release a literal `None` allocation id.
### Issue Context
This happens only for ENIs with `InterfaceType` corresponding to ELB-managed ENIs (the branch that currently does `continue` after EIP handling).
### Fix Focus Areas
- scripts/mapt/delete-mapt-clusters.sh[556-560]
- scripts/mapt/delete-mapt-clusters.sh[101-131]
### Suggested fix approach
1. When looking up EIPs for an ENI, also retrieve `AssociationId` (or detect association via `describe-addresses`).
2. If an EIP is associated, disassociate it first (`aws ec2 disassociate-address --association-id ...`), then release by allocation id.
3. Treat `None`/empty results as “no EIPs” (skip the loop) so you don’t call `release-address` with `None`.
4. (Optional but safer) Only increment `regional_eips` after the disassociate/release succeeds (check exit codes).
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
Summary
(kind clusters use
NatGatewayModeNone— confirmed in mapt source code)origin=mapttag, then filtered by EC2 instancepresence: active instances → skip, no instances → safe to delete
conditions during cleanup
Problem
The cleanup script discovered VPCs exclusively via NAT gateways tagged
origin=mapt. Kind clusters never create NAT gateways, making theminvisible to the script. This caused ~287 orphaned VPCs to accumulate
(117 in us-west-2, 143 in us-east-1, 27 in us-east-2) along with
associated IGWs, subnets, security groups, route tables, and load balancers.
Test plan
CI test