Summary
The workload eks step (and the AWSEKSCluster builder it shares) configures the EKS
ebs-csi managed add-on with defaultStorageClass.enabled: true. That add-on feature
creates a StorageClass named ebs-csi-default-sc (gp3, no encrypted parameter) and marks
it the cluster default. The step then creates ebs-csi-default-sc-encrypted
(encrypted: true, marked default) and patches ebs-csi-default-sc to
is-default-class=false, so the encrypted class is the real cluster default.
This works as long as the add-on is never reconciled — but any add-on update makes EKS
re-reconcile the add-on, and its defaultStorageClass feature re-asserts
is-default-class=true on ebs-csi-default-sc, overriding the patch. The result is two
default StorageClasses, which the cluster resolves non-deterministically.
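The failure state described above can be checked mechanically. The following is a minimal sketch, assuming a trimmed-down StorageClass struct (hypothetical, not the client-go type) that carries only a name and annotations. It lists every class annotated as default, so more than one result means the cluster is in the two-defaults state.

```go
package main

import "fmt"

// Annotation Kubernetes uses to mark a StorageClass as the cluster default.
const defaultAnnotation = "storageclass.kubernetes.io/is-default-class"

// StorageClass is a hypothetical, trimmed-down stand-in for the real
// Kubernetes object; only the fields needed for this check are modelled.
type StorageClass struct {
	Name        string
	Annotations map[string]string
}

// DefaultClasses returns the names of all classes annotated as default,
// in input order.
func DefaultClasses(classes []StorageClass) []string {
	var names []string
	for _, sc := range classes {
		if sc.Annotations[defaultAnnotation] == "true" {
			names = append(names, sc.Name)
		}
	}
	return names
}

func main() {
	// State after an add-on reconcile re-asserts the default annotation.
	afterReconcile := []StorageClass{
		{Name: "ebs-csi-default-sc", Annotations: map[string]string{defaultAnnotation: "true"}},
		{Name: "ebs-csi-default-sc-encrypted", Annotations: map[string]string{defaultAnnotation: "true"}},
	}
	defaults := DefaultClasses(afterReconcile)
	if len(defaults) > 1 {
		fmt.Println("multiple default StorageClasses:", defaults)
	}
}
```

In practice the input would come from listing StorageClasses through the Kubernetes API; that wiring is left out here.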
Why it matters
- Two defaults → a PVC that omits storageClassName may bind to either class.
- The add-on's ebs-csi-default-sc sets no encrypted parameter. In accounts where
EBS encryption-by-default is disabled, the SC-level encrypted: true on the encrypted
class is the only thing encrypting volumes, so a PVC that lands on ebs-csi-default-sc
there would be unencrypted. (We have confirmed a mix in the fleet: some accounts have
account-level default encryption on, several production accounts have it off. So the
encrypted default SC is genuinely load-bearing and cannot be dropped or replaced by relying
on account-level encryption.)
- ebs-csi-default-sc itself is load-bearing on a subset of clusters: the in-cluster
observability stack (Loki/Mimir) pins storageClassName: ebs-csi-default-sc explicitly, so
the class must be preserved (cannot be blanket-deleted).
How this surfaced
During the Python→Go migration of the eks step, the Go code re-serialized the add-on's
configurationValues (Go's compact JSON vs the prior spaced JSON — identical content,
different bytes). That byte difference alone triggered an add-on update → reconcile → the
flip above, producing two default StorageClasses on the clusters it was applied to. (An
earlier attempt to set the add-on's update conflict-resolution to PRESERVE let the update
succeed but did not prevent the flip — the add-on recreates/re-asserts its default SC,
so PRESERVE is not a fix.) The control-room cluster step is unaffected: it intentionally
uses ebs-csi-default-sc as its default and applies no demoting patch, so there is no
conflict there. This is workload-only.
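The byte-level difference is easy to reproduce with the standard library. The sketch below assumes the prior serialization put a space after each colon (the exact previous bytes are not shown in this write-up, so the spaced string is illustrative). It shows that the compact form differs as a string while decoding to the same value.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"reflect"
)

// Compact strips insignificant whitespace from a JSON document,
// giving output in Go's compact style.
func Compact(doc string) (string, error) {
	var buf bytes.Buffer
	if err := json.Compact(&buf, []byte(doc)); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// SameJSON reports whether two JSON documents decode to the same value,
// ignoring whitespace and object key order.
func SameJSON(a, b string) (bool, error) {
	var va, vb interface{}
	if err := json.Unmarshal([]byte(a), &va); err != nil {
		return false, err
	}
	if err := json.Unmarshal([]byte(b), &vb); err != nil {
		return false, err
	}
	return reflect.DeepEqual(va, vb), nil
}

func main() {
	// Illustrative "spaced" form; assumed, not copied from a live add-on.
	spaced := `{"defaultStorageClass": {"enabled": true}}`
	compact, err := Compact(spaced)
	if err != nil {
		panic(err)
	}
	same, err := SameJSON(spaced, compact)
	if err != nil {
		panic(err)
	}
	fmt.Println("byte-identical:", spaced == compact)
	fmt.Println("semantically equal:", same)
}
```

A plain string comparison of the two forms reports a change, and that is the kind of diff that triggered the add-on update here.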
Current mitigation (in place)
The migrated step sets pulumi.IgnoreChanges([]string{"configurationValues"}) on the ebs-csi
(and secrets-store) add-ons. The migration adopts the live add-on config exactly as-is and
never re-serializes or updates it, so the migration no longer triggers an add-on reconcile
and never flips the default. The existing, correct state
(encrypted class = sole default; ebs-csi-default-sc present but demoted) is preserved.
What to expect for a NEW workload while this mitigation is in place
- Greenfield create: the add-on is created with defaultStorageClass.enabled: true →
creates ebs-csi-default-sc as default → the encrypted class + the demoting patch then make
the encrypted class the sole default. IgnoreChanges only affects updates, not create, so
provisioning is unaffected and the workload comes up correctly (one default = encrypted).
- Steady state: because the add-on's configurationValues is never re-applied, the add-on
is never reconciled, so the default never flips. Stable.
- Residual fragility (unchanged from before the migration): an explicit add-on version
bump is a real add-on update (the version field is not ignored) → EKS reconciles → the
default flips back to ebs-csi-default-sc → two defaults, and the demoting patch (a separate
resource) will not automatically re-apply. Remediation today is a one-time
kubectl annotate sc ebs-csi-default-sc storageclass.kubernetes.io/is-default-class=false --overwrite.
This is the same latent issue that existed pre-migration; the mitigation just
stops the migration itself from triggering it.
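The three cases above come down to which property differences count as an update. The following toy model is a sketch only and is not Pulumi's diff engine. The property name addonVersion and the version strings are assumptions chosen for illustration. It compares recorded and desired inputs while skipping an ignore list, so a re-serialized configurationValues produces no update and a version bump still does.

```go
package main

import (
	"fmt"
	"sort"
)

// ChangedProps returns the sorted names of properties whose desired value
// differs from the recorded one, skipping every name listed in ignore.
// Toy model for illustration; it does not reproduce Pulumi internals.
func ChangedProps(recorded, desired map[string]string, ignore []string) []string {
	skip := make(map[string]bool, len(ignore))
	for _, name := range ignore {
		skip[name] = true
	}
	var changed []string
	for name, want := range desired {
		if skip[name] {
			continue
		}
		if got, ok := recorded[name]; !ok || got != want {
			changed = append(changed, name)
		}
	}
	sort.Strings(changed)
	return changed
}

func main() {
	ignore := []string{"configurationValues"}
	recorded := map[string]string{
		"configurationValues": `{"defaultStorageClass": {"enabled": true}}`,
		"addonVersion":        "old-version", // placeholder, not a real release
	}
	// Same content, different bytes: ignored, so no update is planned.
	reserialized := map[string]string{
		"configurationValues": `{"defaultStorageClass":{"enabled":true}}`,
		"addonVersion":        "old-version",
	}
	// A version bump is not ignored, so it still counts as an update.
	bumped := map[string]string{
		"configurationValues": `{"defaultStorageClass":{"enabled":true}}`,
		"addonVersion":        "new-version",
	}
	fmt.Println("re-serialized:", ChangedProps(recorded, reserialized, ignore))
	fmt.Println("version bump:", ChangedProps(recorded, bumped, ignore))
}
```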
Proper long-term fix
Stop using the add-on's defaultStorageClass feature and have PTD own the StorageClasses
directly:
- Set the ebs-csi add-on's defaultStorageClass.enabled: false so the add-on no longer
creates/manages a default class.
- Define ebs-csi-default-sc ourselves as a plain non-default StorageClass (so the
Loki/Mimir PVCs that pin it keep resolving).
- Keep ebs-csi-default-sc-encrypted as the sole default.
Then there is no add-on-managed default to fight over — no demoting patch, and no flip on
add-on version bumps. Transition risk to validate first: flipping enabled: false on an
existing add-on may delete the add-on-created ebs-csi-default-sc (load-bearing for the
observability stack on some clusters). Confirm the delete-vs-orphan behavior on a safe target
and define the ownership hand-off (create our SC and transfer field management) before rolling
this out.
Optional interim hardening: give the demoting StorageClassPatch a dependsOn on the add-on and
re-assert it on every apply, so even a version-bump reconcile self-heals.
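The "re-assert on every apply" idea amounts to a small reconcile step: given the live classes and the one class that should be default, work out which others need demoting. A minimal sketch follows, using a hypothetical trimmed-down StorageClass struct and leaving the actual patch call out.

```go
package main

import "fmt"

// Annotation Kubernetes uses to mark a StorageClass as the cluster default.
const defaultAnnotation = "storageclass.kubernetes.io/is-default-class"

// StorageClass is a hypothetical stand-in for the real Kubernetes object.
type StorageClass struct {
	Name        string
	Annotations map[string]string
}

// ClassesToDemote returns, in input order, every class that is annotated
// as default but is not the intended sole default.
func ClassesToDemote(classes []StorageClass, soleDefault string) []string {
	var demote []string
	for _, sc := range classes {
		if sc.Name == soleDefault {
			continue
		}
		if sc.Annotations[defaultAnnotation] == "true" {
			demote = append(demote, sc.Name)
		}
	}
	return demote
}

func main() {
	// Live state after a version-bump reconcile flipped the add-on class back.
	live := []StorageClass{
		{Name: "ebs-csi-default-sc", Annotations: map[string]string{defaultAnnotation: "true"}},
		{Name: "ebs-csi-default-sc-encrypted", Annotations: map[string]string{defaultAnnotation: "true"}},
	}
	for _, name := range ClassesToDemote(live, "ebs-csi-default-sc-encrypted") {
		// A real step would patch the annotation to "false" here.
		fmt.Println("demote:", name)
	}
}
```

Running this check on every apply, rather than only when the patch resource changes, is what would make the demotion self-healing.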
References
- Mitigation lands with the eks/cluster Python→Go migration PR.