Problem
When running a K8s version upgrade on an AKS workload cluster via ptd ensure --only-steps kubernetes --auto-apply, the system node pool (agentpool) is not upgraded. Pulumi rolls the control plane and the user node pool(s), but leaves the system pool on the previous version. The operator has to run a manual az aks nodepool upgrade --name agentpool --kubernetes-version <target> after every hop.
This was discovered while running the 1.32.6 → 1.35.4 multi-hop upgrade on internal-az-staging on 2026-05-14/15 (see posit-dev/ptd-workspace#15 for the skill update that documents the workaround, and rstudio/ptd-config#2899 for the upgrade itself).
Why it happens
The system pool is declared inline in agentPoolProfiles on the ManagedCluster resource (lib/steps/aks.go:129-166). User pools, by contrast, are managed as separate azure-native:containerservice:AgentPool Pulumi resources.
When user pools exist, lib/steps/aks.go:184-186 adds agentPoolProfiles to Pulumi's ignoreChanges on the ManagedCluster:
// Always ignore agentPoolProfiles when using separate AgentPool resources
// Azure reflects separate pools in this array, but we manage them as distinct resources
if len(userNodePools) > 0 {
	ignoreChanges = append(ignoreChanges, "agentPoolProfiles")
}
The comment explains why the ignore exists: Azure ARM returns all pools (system + user) in agentPoolProfiles even when user pools are managed separately. Without the blanket ignore, Pulumi would see the user pool entries as drift and try to delete them from the array.
The unintended side effect: the blanket ignore also covers the system pool's orchestratorVersion, so when we bump clusterConfig.KubernetesVersion, the change is applied to ManagedCluster.kubernetesVersion (control plane) but NOT to agentPoolProfiles[0].orchestratorVersion (the system pool entry — also driven from clusterConfig.KubernetesVersion in aks.go:148).
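The masking effect can be illustrated with a toy diff. This is only a sketch of the idea, not Pulumi's actual diff engine: diffPaths and isIgnored are hypothetical helpers, and the flattened property paths are a simplification. The version strings are the endpoints quoted in this issue, used as placeholders.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// diffPaths returns the property paths whose desired value differs from the
// last-applied value, skipping every path that falls under an ignored prefix.
// Toy model for illustration only; it is not how Pulumi computes diffs.
func diffPaths(old, desired map[string]string, ignored []string) []string {
	var changed []string
	for path, want := range desired {
		if old[path] == want || isIgnored(path, ignored) {
			continue
		}
		changed = append(changed, path)
	}
	sort.Strings(changed)
	return changed
}

// isIgnored reports whether path equals an ignored entry or is nested under it.
func isIgnored(path string, ignored []string) bool {
	for _, prefix := range ignored {
		if path == prefix ||
			strings.HasPrefix(path, prefix+".") ||
			strings.HasPrefix(path, prefix+"[") {
			return true
		}
	}
	return false
}

func main() {
	// Both fields are driven from the same configured version.
	old := map[string]string{
		"kubernetesVersion":                        "1.32.6",
		"agentPoolProfiles[0].orchestratorVersion": "1.32.6",
	}
	desired := map[string]string{
		"kubernetesVersion":                        "1.35.4",
		"agentPoolProfiles[0].orchestratorVersion": "1.35.4",
	}

	// No ignore: both the control plane and the system pool show a change.
	fmt.Println(diffPaths(old, desired, nil))

	// Blanket ignore: the system pool change is dropped with everything else
	// under agentPoolProfiles, so only the control plane is updated.
	fmt.Println(diffPaths(old, desired, []string{"agentPoolProfiles"}))
}
```

With the blanket entry in place, only kubernetesVersion survives the filter, which matches the observed behavior: the control plane moves and the system pool stays behind.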
Proposed fixes (pick one)
- Per-field ignoreChanges: replace the blanket agentPoolProfiles ignore with something like agentPoolProfiles[*].count, agentPoolProfiles[*].powerState, etc., so Pulumi can still manage orchestratorVersion. Smallest change. Need to verify Pulumi's ignoreChanges syntax supports targeting all-but-first-element (system pool is index 0, user pools follow), or rely on per-field rather than per-index.
- Model system pool as a separate AgentPool resource: bigger refactor but uniform with how user pools are modeled. Eliminates the dual-source-of-truth issue entirely.
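A minimal sketch of the first option, assuming the ignore list is still assembled as a []string in aks.go. perFieldIgnorePaths is a hypothetical helper, and the field list contains only the two fields named above. Whether wildcard paths are enough to stop Pulumi from treating the Azure-reported user pool entries as array drift is not verified here and would need testing against a real stack.

```go
package main

import "fmt"

// perFieldIgnorePaths builds one ignoreChanges entry per field, using a
// wildcard index so the entry applies to every element of agentPoolProfiles.
// orchestratorVersion is deliberately left out of the caller's field list so
// that version bumps on the system pool are still diffed.
func perFieldIgnorePaths(fields []string) []string {
	paths := make([]string, 0, len(fields))
	for _, field := range fields {
		paths = append(paths, fmt.Sprintf("agentPoolProfiles[*].%s", field))
	}
	return paths
}

func main() {
	// count and powerState come from the proposal above; the full list of
	// fields that need ignoring is an open question.
	ignoreChanges := perFieldIgnorePaths([]string{"count", "powerState"})
	fmt.Println(ignoreChanges)

	// In the real step these entries would replace the blanket
	// "agentPoolProfiles" entry and be passed to the ManagedCluster via
	// pulumi.IgnoreChanges(ignoreChanges).
}
```

If the wildcard form turns out not to cover the extra array elements, the second option (a separate AgentPool resource for the system pool) avoids the question altogether.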
Operator workaround (current state)
The cluster-upgrade skill (posit-dev/ptd-workspace#15) now documents the manual step:
az aks nodepool upgrade \
--cluster-name <aks-cluster-name> \
--resource-group <rg> \
--name agentpool \
--kubernetes-version <target> \
--yes --no-wait
This adds ~15–25 min of wall-clock time per hop and is easy to forget. Worth fixing before the next operator does an AKS upgrade.
Out of scope
max_surge is also hardcoded to "10%" in lib/steps/aks.go:160-162 and :347-348 — making it configurable would speed up rolls but is a nice-to-have, not a correctness issue.