AKS: system node pool can't be K8s-version-bumped via Pulumi due to agentPoolProfiles ignoreChanges #292

@amdove

Problem

When running a K8s version upgrade on an AKS workload cluster via ptd ensure --only-steps kubernetes --auto-apply, the system node pool (agentpool) is not upgraded. Pulumi rolls the control plane and the user node pool(s), but leaves the system pool on the previous version. The operator has to run a manual az aks nodepool upgrade --name agentpool --kubernetes-version <target> after every hop.

This was discovered while running the 1.32.6 → 1.35.4 multi-hop upgrade on internal-az-staging on 2026-05-14/15 (see posit-dev/ptd-workspace#15 for the skill update that documents the workaround, and rstudio/ptd-config#2899 for the upgrade itself).

Why it happens

The system pool is declared inline in agentPoolProfiles on the ManagedCluster resource (lib/steps/aks.go:129-166). User pools, by contrast, are managed as separate azure-native:containerservice:AgentPool Pulumi resources.

When user pools exist, lib/steps/aks.go:184-186 adds agentPoolProfiles to Pulumi's ignoreChanges on the ManagedCluster:

// Always ignore agentPoolProfiles when using separate AgentPool resources
// Azure reflects separate pools in this array, but we manage them as distinct resources
if len(userNodePools) > 0 {
    ignoreChanges = append(ignoreChanges, "agentPoolProfiles")
}

As the comment says, the ignore exists because Azure ARM returns all pools (system + user) in agentPoolProfiles even when user pools are managed as separate resources. Without the blanket ignore, Pulumi would see the user pool entries as drift and try to delete them from the array.

The unintended side effect: the blanket ignore also covers the system pool's orchestratorVersion. When we bump clusterConfig.KubernetesVersion, the change is applied to ManagedCluster.kubernetesVersion (control plane) but NOT to agentPoolProfiles[0].orchestratorVersion, the system pool entry, which is also driven from clusterConfig.KubernetesVersion (aks.go:148).
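
To make the masking concrete, here is a minimal, self-contained sketch of the behavior described above. It does not use the Pulumi SDK. buildIgnoreChanges mirrors the quoted conditional (starting from an empty list, which is an assumption; the real code may already hold other entries), and isIgnored is a hypothetical, simplified model of how a top-level ignoreChanges entry covers everything nested beneath it.

```go
package main

import (
	"fmt"
	"strings"
)

// buildIgnoreChanges mirrors the conditional quoted from aks.go: the blanket
// "agentPoolProfiles" entry is added whenever separate user pools exist.
// Assumption: the list starts empty here; the real code may have other entries.
func buildIgnoreChanges(userNodePoolCount int) []string {
	ignoreChanges := []string{}
	if userNodePoolCount > 0 {
		ignoreChanges = append(ignoreChanges, "agentPoolProfiles")
	}
	return ignoreChanges
}

// isIgnored is a hypothetical, simplified model of ignoreChanges matching:
// an entry covers the property itself and every path nested under it.
func isIgnored(ignoreChanges []string, path string) bool {
	for _, entry := range ignoreChanges {
		if path == entry ||
			strings.HasPrefix(path, entry+".") ||
			strings.HasPrefix(path, entry+"[") {
			return true
		}
	}
	return false
}

func main() {
	ignore := buildIgnoreChanges(2) // cluster with two user pools
	paths := []string{
		"kubernetesVersion",                        // control plane
		"agentPoolProfiles[0].orchestratorVersion", // system pool
	}
	for _, p := range paths {
		fmt.Printf("%-45s ignored=%v\n", p, isIgnored(ignore, p))
	}
}
```

Under this model the control-plane field stays managed while the system pool's version is swallowed by the blanket entry, which matches what we saw on each hop.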

Proposed fixes (pick one)

  1. Per-field ignoreChanges: replace the blanket agentPoolProfiles ignore with targeted paths such as agentPoolProfiles[*].count, agentPoolProfiles[*].powerState, etc., so Pulumi can still manage orchestratorVersion. Smallest change. Open question: we need to verify whether Pulumi's ignoreChanges syntax supports targeting all-but-first-element (system pool is index 0, user pools follow), or whether we should rely on per-field rather than per-index paths.
  2. Model the system pool as a separate AgentPool resource: bigger refactor, but uniform with how user pools are modeled. Eliminates the dual-source-of-truth issue entirely.
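
A rough sketch of what the path lists for option 1 could look like, in both the per-field and per-index flavors. The helper names perFieldIgnores and perIndexIgnores are hypothetical, the field list is just the two examples named above, and whether the provider honors these paths for this resource (and whether extra array elements would still show up as drift) is exactly the part that still needs verifying.

```go
package main

import "fmt"

// perFieldIgnores returns one wildcard path per volatile field, leaving every
// other field (including orchestratorVersion) under Pulumi's management.
func perFieldIgnores(volatileFields []string) []string {
	paths := make([]string, 0, len(volatileFields))
	for _, f := range volatileFields {
		paths = append(paths, fmt.Sprintf("agentPoolProfiles[*].%s", f))
	}
	return paths
}

// perIndexIgnores ignores whole entries 1..userPoolCount. It assumes the
// system pool stays at index 0 and user pools follow it; that ordering is
// an assumption, not something this sketch can guarantee.
func perIndexIgnores(userPoolCount int) []string {
	paths := make([]string, 0, userPoolCount)
	for i := 1; i <= userPoolCount; i++ {
		paths = append(paths, fmt.Sprintf("agentPoolProfiles[%d]", i))
	}
	return paths
}

func main() {
	fmt.Println(perFieldIgnores([]string{"count", "powerState"}))
	fmt.Println(perIndexIgnores(2))
}
```

Neither list contains agentPoolProfiles[0].orchestratorVersion, so in principle a bump to clusterConfig.KubernetesVersion would reach the system pool again.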

Operator workaround (current state)

The cluster-upgrade skill (posit-dev/ptd-workspace#15) now documents the manual step:

az aks nodepool upgrade \
  --cluster-name <aks-cluster-name> \
  --resource-group <rg> \
  --name agentpool \
  --kubernetes-version <target> \
  --yes --no-wait
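
If we want the workaround to be harder to forget until the real fix lands, one option is a small helper that builds the same command. This is only a sketch: nodepoolUpgradeArgs is a hypothetical name, the flags are exactly the ones shown above, and nothing is executed.

```go
package main

import (
	"fmt"
	"strings"
)

// nodepoolUpgradeArgs builds the argument list for the manual workaround.
// Every flag comes from the documented command; nothing is executed here.
func nodepoolUpgradeArgs(clusterName, resourceGroup, poolName, targetVersion string) []string {
	return []string{
		"aks", "nodepool", "upgrade",
		"--cluster-name", clusterName,
		"--resource-group", resourceGroup,
		"--name", poolName,
		"--kubernetes-version", targetVersion,
		"--yes", "--no-wait",
	}
}

func main() {
	// Placeholder values; substitute the real cluster, resource group and target.
	args := nodepoolUpgradeArgs("<aks-cluster-name>", "<rg>", "agentpool", "<target>")
	fmt.Println("az " + strings.Join(args, " "))
}
```

The returned slice can be handed to exec.Command("az", args...) from os/exec if the step should actually run the upgrade rather than print it.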

This adds ~15–25 min of wall-clock time per hop and is easy to forget. Worth fixing for the next operator who does an AKS upgrade.

Out of scope

max_surge is also hardcoded to "10%" in lib/steps/aks.go:160-162 and :347-348 — making it configurable would speed up rolls but is a nice-to-have, not a correctness issue.
