EKS ebs-csi add-on defaultStorageClass fights the encrypted default StorageClass (two-defaults on reconcile)

## Summary

The workload `eks` step (and the `AWSEKSCluster` builder it shares) configures the EKS
**ebs-csi** managed add-on with `defaultStorageClass.enabled: true`. That add-on feature
creates a StorageClass named `ebs-csi-default-sc` (gp3, no `encrypted` parameter) and marks
it the cluster default. The step then creates `ebs-csi-default-sc-encrypted` (`encrypted:
true`, marked default) and **patches `ebs-csi-default-sc` to `is-default-class=false`**, so
the encrypted class is the real cluster default.

This works as long as the add-on is never reconciled. However, **any add-on update makes EKS
re-reconcile the add-on, and its `defaultStorageClass` feature re-asserts
`is-default-class=true` on `ebs-csi-default-sc`**, overriding the patch. The result is **two
default StorageClasses**, so which class an unpinned PVC gets depends on Kubernetes'
tie-breaking between defaults (recent versions pick the most recently created default class)
rather than on our configuration.

## Why it matters

- Two defaults → a PVC that omits `storageClassName` may bind to either class.
- The add-on's `ebs-csi-default-sc` sets no `encrypted` parameter. In accounts where
  **EBS encryption-by-default is disabled**, the SC-level `encrypted: true` on the encrypted
  class is the only thing encrypting volumes — so a PVC that lands on `ebs-csi-default-sc`
  there would be **unencrypted**. (We have confirmed a mix in the fleet: some accounts have
  account-level default encryption on, several production accounts have it **off**. So the
  encrypted default SC is genuinely load-bearing and cannot be dropped or replaced by relying
  on account-level encryption.)
- `ebs-csi-default-sc` itself is **load-bearing on a subset of clusters**: the in-cluster
  observability stack (Loki/Mimir) pins `storageClassName: ebs-csi-default-sc` explicitly, so
  the class must be preserved (cannot be blanket-deleted).

## How this surfaced

During the Python→Go migration of the `eks` step, the Go code re-serialized the add-on's
`configurationValues` (Go's compact JSON vs the prior spaced JSON — identical content,
different bytes). That byte difference alone triggered an add-on update → reconcile → the
flip above, producing two default StorageClasses on the clusters it was applied to. (An
earlier attempt to set the add-on's update conflict-resolution to `PRESERVE` let the update
*succeed* but did **not** prevent the flip — the add-on recreates/re-asserts its default SC,
so PRESERVE is not a fix.) The control-room `cluster` step is unaffected: it intentionally
uses `ebs-csi-default-sc` as its default and applies no demoting patch, so there is no
conflict there. **This is workload-only.**
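
The byte-versus-content distinction above can be sketched in a few lines of Go. This is a minimal illustration, not the migration code: the `SameJSON` helper is hypothetical, and the config value shown contains only the `defaultStorageClass.enabled` key mentioned in this issue (the real `configurationValues` may carry more).

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// SameJSON (hypothetical helper) reports whether a and b decode to the same
// JSON value, ignoring whitespace and object key order.
func SameJSON(a, b string) (bool, error) {
	var va, vb interface{}
	if err := json.Unmarshal([]byte(a), &va); err != nil {
		return false, fmt.Errorf("decode first document: %w", err)
	}
	if err := json.Unmarshal([]byte(b), &vb); err != nil {
		return false, fmt.Errorf("decode second document: %w", err)
	}
	return reflect.DeepEqual(va, vb), nil
}

func main() {
	// Spaced form, standing in for the previously stored value (illustrative).
	spaced := `{"defaultStorageClass": {"enabled": true}}`

	// Compact form, as produced by Go's json.Marshal.
	compact, err := json.Marshal(map[string]interface{}{
		"defaultStorageClass": map[string]interface{}{"enabled": true},
	})
	if err != nil {
		panic(err)
	}

	same, err := SameJSON(spaced, string(compact))
	if err != nil {
		panic(err)
	}
	fmt.Println("bytes equal:  ", spaced == string(compact))
	fmt.Println("content equal:", same)
}
```

A plain string comparison sees the two documents as different and would schedule an update, even though they decode to the same value.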

## Current mitigation (in place)

The ebs-csi (and secrets-store) add-ons are created with
`pulumi.IgnoreChanges([]string{"configurationValues"})`.
The migration adopts the live add-on config exactly as-is and never re-serializes/updates it,
so EKS never reconciles the add-on and never flips the default. The existing, correct state
(encrypted class = sole default; `ebs-csi-default-sc` present but demoted) is preserved.

### What to expect for a NEW workload while this mitigation is in place

- **Greenfield create:** the add-on is created with `defaultStorageClass.enabled: true` →
  creates `ebs-csi-default-sc` as default → the encrypted class + the demoting patch then make
  the encrypted class the sole default. `IgnoreChanges` only affects updates, not create, so
  provisioning is unaffected and the workload comes up correctly (one default = encrypted).
- **Steady state:** because the add-on's `configurationValues` is never re-applied, the add-on
  is never reconciled, so the default never flips. Stable.
- **Residual fragility (unchanged from before the migration):** an explicit add-on **version
  bump** is a real add-on update (the version field is not ignored) → EKS reconciles → the
  default flips back to `ebs-csi-default-sc` → two defaults, and the demoting patch (a separate
  resource) will not automatically re-apply. Remediation today is a one-time
  `kubectl annotate sc ebs-csi-default-sc storageclass.kubernetes.io/is-default-class=false
  --overwrite`. This is the same latent issue that existed pre-migration; the mitigation just
  stops the migration itself from triggering it.
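
If the flip does happen, the remediation can be scripted. The sketch below is a hedged illustration that works on an in-memory list rather than a live cluster: the `StorageClass` type, `ClassesToDemote`, and `DemotePatch` are hypothetical names. It finds every default class other than the intended one and prints a `kubectl patch` command carrying a merge patch equivalent to the annotate command above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// defaultAnnotation is the standard Kubernetes annotation marking a default StorageClass.
const defaultAnnotation = "storageclass.kubernetes.io/is-default-class"

// StorageClass is a simplified stand-in for the Kubernetes object (hypothetical type).
type StorageClass struct {
	Name        string
	Annotations map[string]string
}

// ClassesToDemote returns the names of all default classes except the intended one.
func ClassesToDemote(classes []StorageClass, intended string) []string {
	var names []string
	for _, sc := range classes {
		if sc.Name != intended && sc.Annotations[defaultAnnotation] == "true" {
			names = append(names, sc.Name)
		}
	}
	return names
}

// DemotePatch builds a JSON merge patch that sets is-default-class to "false".
func DemotePatch() ([]byte, error) {
	return json.Marshal(map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]string{defaultAnnotation: "false"},
		},
	})
}

func main() {
	// State after a version-bump reconcile: both classes claim to be default.
	classes := []StorageClass{
		{Name: "ebs-csi-default-sc", Annotations: map[string]string{defaultAnnotation: "true"}},
		{Name: "ebs-csi-default-sc-encrypted", Annotations: map[string]string{defaultAnnotation: "true"}},
	}

	patch, err := DemotePatch()
	if err != nil {
		panic(err)
	}
	for _, name := range ClassesToDemote(classes, "ebs-csi-default-sc-encrypted") {
		fmt.Printf("kubectl patch storageclass %s --type merge -p '%s'\n", name, patch)
	}
}
```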

## Proper long-term fix

Stop using the add-on's `defaultStorageClass` feature and have PTD own the StorageClasses
directly:

- Set the ebs-csi add-on `defaultStorageClass.enabled: false` so the add-on no longer
  creates/manages a default class.
- Define `ebs-csi-default-sc` ourselves as a plain **non-default** StorageClass (so the
  Loki/Mimir PVCs that pin it keep resolving).
- Keep `ebs-csi-default-sc-encrypted` as the sole default.

Then there is no add-on-managed default to fight over — no demoting patch, and no flip on
add-on version bumps. **Transition risk to validate first:** flipping `enabled: false` on an
existing add-on may **delete** the add-on-created `ebs-csi-default-sc` (load-bearing for the
observability stack on some clusters). Confirm the delete-vs-orphan behavior on a safe target
and define the ownership hand-off (create our SC and transfer field management) before rolling
this out.
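
When validating the transition on a safe target, the intended end state can be written down as an explicit check. The following is a sketch under stated assumptions: the `StorageClass` type and `CheckTargetState` function are hypothetical, the input is a simplified snapshot of the cluster's classes, and the `encrypted` parameter check assumes the class keeps the `encrypted: true` setting described in the summary.

```go
package main

import "fmt"

// StorageClass is a simplified snapshot of a cluster StorageClass (hypothetical type).
type StorageClass struct {
	Name       string
	Default    bool
	Parameters map[string]string
}

// CheckTargetState verifies the end state described above: ebs-csi-default-sc
// still exists but is not default, and ebs-csi-default-sc-encrypted is the
// sole default and sets encrypted=true.
func CheckTargetState(classes []StorageClass) error {
	byName := map[string]StorageClass{}
	defaults := 0
	for _, sc := range classes {
		byName[sc.Name] = sc
		if sc.Default {
			defaults++
		}
	}

	plain, ok := byName["ebs-csi-default-sc"]
	if !ok {
		return fmt.Errorf("ebs-csi-default-sc is missing; PVCs that pin it would stop resolving")
	}
	if plain.Default {
		return fmt.Errorf("ebs-csi-default-sc is still marked default")
	}

	enc, ok := byName["ebs-csi-default-sc-encrypted"]
	if !ok {
		return fmt.Errorf("ebs-csi-default-sc-encrypted is missing")
	}
	if !enc.Default || defaults != 1 {
		return fmt.Errorf("expected ebs-csi-default-sc-encrypted as sole default, found %d default(s)", defaults)
	}
	if enc.Parameters["encrypted"] != "true" {
		return fmt.Errorf("ebs-csi-default-sc-encrypted does not set encrypted=true")
	}
	return nil
}

func main() {
	// Simulated outcome where disabling the feature deleted the add-on's class.
	afterFlip := []StorageClass{
		{Name: "ebs-csi-default-sc-encrypted", Default: true, Parameters: map[string]string{"encrypted": "true"}},
	}
	if err := CheckTargetState(afterFlip); err != nil {
		fmt.Println("target state not reached:", err)
	}
}
```

Running a check like this before and after flipping `enabled: false` would surface the delete case immediately, rather than when a Loki/Mimir PVC fails to provision.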

Optional interim hardening: make the demoting StorageClassPatch `dependsOn` the add-on and
re-assert on every apply, so even a version-bump reconcile self-heals.

## References

- Mitigation lands with the eks/cluster Python→Go migration PR.

