EKS ebs-csi add-on defaultStorageClass fights the encrypted default StorageClass (two-defaults on reconcile) #309

@stevenolen

Summary

The workload eks step (and the AWSEKSCluster builder it shares) configures the EKS
ebs-csi managed add-on with defaultStorageClass.enabled: true. That add-on feature
creates a StorageClass named ebs-csi-default-sc (gp3, no encrypted parameter) and marks
it the cluster default. The step then creates ebs-csi-default-sc-encrypted
(encrypted: true, marked default) and patches ebs-csi-default-sc to
is-default-class=false, so the encrypted class is the real cluster default.

This works as long as the add-on is never reconciled. Any add-on update, however, makes EKS
reconcile the add-on again, and its defaultStorageClass feature re-asserts
is-default-class=true on ebs-csi-default-sc, overriding the patch. The result is two
default StorageClasses, and the cluster no longer resolves the default by intent: recent
Kubernetes releases pick the most recently created default class, so which class wins
depends on creation order.

Why it matters

  • Two defaults → a PVC that omits storageClassName may bind to either class.
  • The add-on's ebs-csi-default-sc sets no encrypted parameter. In accounts where
    EBS encryption-by-default is disabled, the SC-level encrypted: true on the encrypted
    class is the only thing encrypting volumes — so a PVC that lands on ebs-csi-default-sc
    there would be unencrypted. (We have confirmed a mix in the fleet: some accounts have
    account-level default encryption on, several production accounts have it off. So the
    encrypted default SC is genuinely load-bearing and cannot be dropped or replaced by relying
    on account-level encryption.)
  • ebs-csi-default-sc itself is load-bearing on a subset of clusters: the in-cluster
    observability stack (Loki/Mimir) pins storageClassName: ebs-csi-default-sc explicitly, so
    the class must be preserved (cannot be blanket-deleted).

How this surfaced

During the Python→Go migration of the eks step, the Go code re-serialized the add-on's
configurationValues (Go's compact JSON vs the prior spaced JSON — identical content,
different bytes). That byte difference alone triggered an add-on update → reconcile → the
flip above, producing two default StorageClasses on the clusters it was applied to. (An
earlier attempt to set the add-on's update conflict-resolution to PRESERVE let the update
succeed but did not prevent the flip — the add-on recreates/re-asserts its default SC,
so PRESERVE is not a fix.) The control-room cluster step is unaffected: it intentionally
uses ebs-csi-default-sc as its default and applies no demoting patch, so there is no
conflict there. This is workload-only.

Current mitigation (in place)

pulumi.IgnoreChanges([]string{"configurationValues"}) on the ebs-csi (and secrets-store) add-ons.
The migration adopts the live add-on config exactly as-is and never re-serializes/updates it,
so EKS never reconciles the add-on and never flips the default. The existing, correct state
(encrypted class = sole default; ebs-csi-default-sc present but demoted) is preserved.

What to expect for a NEW workload while this mitigation is in place

  • Greenfield create: the add-on is created with defaultStorageClass.enabled: true →
    it creates ebs-csi-default-sc as default → the encrypted class + the demoting patch then make
    the encrypted class the sole default. IgnoreChanges only affects updates, not create, so
    provisioning is unaffected and the workload comes up correctly (one default = encrypted).
  • Steady state: because the add-on's configurationValues is never re-applied, the add-on
    is never reconciled, so the default never flips. Stable.
  • Residual fragility (unchanged from before the migration): an explicit add-on version
    bump is a real add-on update (the version field is not ignored) → EKS reconciles → the
    add-on re-marks ebs-csi-default-sc as default → two defaults, and the demoting patch (a
    separate resource) will not automatically re-apply. Remediation today is a one-time
    kubectl annotate sc ebs-csi-default-sc storageclass.kubernetes.io/is-default-class=false --overwrite.
    This is the same latent issue that existed pre-migration; the mitigation just
    stops the migration itself from triggering it.

Proper long-term fix

Stop using the add-on's defaultStorageClass feature and have PTD own the StorageClasses
directly:

  • Set the ebs-csi add-on defaultStorageClass.enabled: false so the add-on no longer
    creates/manages a default class.
  • Define ebs-csi-default-sc ourselves as a plain non-default StorageClass (so the
    Loki/Mimir PVCs that pin it keep resolving).
  • Keep ebs-csi-default-sc-encrypted as the sole default.

Then there is no add-on-managed default to fight over — no demoting patch, and no flip on
add-on version bumps. Transition risk to validate first: flipping enabled: false on an
existing add-on may delete the add-on-created ebs-csi-default-sc (load-bearing for the
observability stack on some clusters). Confirm the delete-vs-orphan behavior on a safe target
and define the ownership hand-off (create our SC and transfer field management) before rolling
this out.

Optional interim hardening: make the demoting StorageClassPatch dependsOn the add-on and
re-assert on every apply, so even a version-bump reconcile self-heals.

References

  • Mitigation lands with the eks/cluster Python→Go migration PR.
