Describe the bug
The NodeClockSkewDetected alert from the node-exporter mixin fires repeatedly on Kubernetes nodes running under OpenStack/KVM, despite the node's clock being correctly synchronized to sub-millisecond accuracy via NTP. The cause is a transient overflow in the kernel's adjtimex(2) syscall: node_timex_offset_seconds briefly reports exactly 4.294967296 seconds (= 2^32 nanoseconds) during NTP clock adjustments, then self-corrects on the next poll. This bogus value satisfies the alert's expression long enough to clear the for: 10m window, so the alert fires, auto-resolves seconds later, and repeats hundreds of times per day per cluster. The alert's description tells operators to "ensure NTP is configured correctly," but NTP is already working correctly; the metric, not the clock, is the problem.
What's your helm version?
version.BuildInfo{Version:"v4.2.2", GitCommit:"b05881cf967a5a09e19866799d0edfd40675803a", GitTreeState:"clean", GoVersion:"go1.26.4", KubeClientVersion:"v1.36"}
What's your kubectl version?
Client Version: v1.34.1 Kustomize Version: v5.7.1
Which chart?
kube-prometheus-stack
What's the chart version?
78.4.0
What happened?
On a fleet of Kubernetes clusters running on OpenStack/KVM, the NodeClockSkewDetected alert from the node-exporter mixin fires hundreds of times per day across all nodes and auto-resolves within seconds each time.
Investigation showed the trigger is not real clock drift:
- node_timex_offset_seconds briefly reports exactly 4.294967296 seconds, i.e. 2^32 nanoseconds, an integer overflow in the kernel's adjtimex(2) syscall
- The value appears during NTP clock adjustments and self-corrects on the next NTP poll
- All 9 nodes across two clusters hit the identical bogus value
- Actual clock offset on the same nodes, measured via chronyc tracking, is <0.00001s (sub-millisecond); NTP is healthy throughout
- AWS clusters in the same fleet, which use hardware PTP and bypass adjtimex for this metric, are unaffected
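The arithmetic behind the bogus value can be checked in a few lines. This is only an illustrative sketch; the constant names are made up for the example, and the 0.05s threshold is the one quoted from the alert expression in this report.

```python
# Illustrative arithmetic only; constant names are hypothetical.
OVERFLOW_NS = 2 ** 32              # the suspected overflow value, in nanoseconds
NS_PER_SECOND = 1_000_000_000
ALERT_THRESHOLD_SECONDS = 0.05     # threshold quoted from the alert expression

overflow_seconds = OVERFLOW_NS / NS_PER_SECOND
exceeds_threshold = overflow_seconds > ALERT_THRESHOLD_SECONDS

print(overflow_seconds)    # 4.294967296
print(exceeds_threshold)   # True
```

The spike is far above the 0.05s threshold, so a single stuck value is enough to make the threshold part of the expression true.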
The fire / resolve cycle:
- Kernel returns overflow value 4.294967296 for node_timex_offset_seconds
- Alert expression node_timex_offset_seconds > 0.05 and deriv(...) >= 0 is satisfied
- for: 10m window elapses → alert fires
- NTP poll corrects the kernel offset → expression goes false → alert resolves
- Repeats
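The cycle above can be modelled with a small simulation. This is a simplified sketch, not Prometheus's actual rule evaluator: it only applies the > 0.05 threshold and a for: 10m hold, it omits the deriv(...) clause, and the one-minute sample series is hypothetical.

```python
from typing import List, Tuple

THRESHOLD_SECONDS = 0.05                     # from the alert expression
FOR_SECONDS = 10 * 60                        # the rule's `for: 10m`
OVERFLOW_SECONDS = 2 ** 32 / 1_000_000_000   # 4.294967296

def alert_states(samples: List[Tuple[int, float]]) -> List[Tuple[int, str]]:
    """Return (timestamp, state) per sample using a simplified `for:` model."""
    states = []
    pending_since = None
    for ts, offset in samples:
        if offset > THRESHOLD_SECONDS:
            if pending_since is None:
                pending_since = ts           # expression just became true
            held_for = ts - pending_since
            state = "firing" if held_for >= FOR_SECONDS else "pending"
        else:
            pending_since = None             # expression false: reset
            state = "inactive"
        states.append((ts, state))
    return states

# Hypothetical series: healthy, then the overflow value for 11 samples
# (one per minute), then healthy again after the next NTP poll.
samples = (
    [(0, 0.00001)]
    + [(60 * i, OVERFLOW_SECONDS) for i in range(1, 12)]
    + [(720, 0.00001)]
)

states = alert_states(samples)
for ts, state in states:
    print(ts, state)
```

In this model the alert is pending from t=60 to t=600, fires at t=660, and is inactive again at t=720, which matches the "fires, then resolves almost immediately" pattern described here.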
The alert's description tells operators to "ensure NTP is configured correctly," but NTP is already correct — the metric is reporting a kernel-internal overflow value, not the actual clock state.
What you expected to happen?
The NodeClockSkewDetected alert should fire only when a node's clock is actually drifting, i.e. when the underlying NTP-managed clock state is genuinely off by more than 0.05s in a way that's not recovering.
On the nodes in question, NTP is healthy and the real clock offset is sub-millisecond. The alert should stay silent.
Concretely:
- A transient overflow value returned by adjtimex(2) (the metric briefly showing exactly 2^32 ns and self-correcting within seconds) should not satisfy the alert expression
- Operators should not receive pages whose remediation ("ensure NTP is configured correctly") is a no-op because NTP is already correct
- The alert's signal-to-noise ratio should be high enough that it remains actionable. Currently it fires hundreds of times per day per cluster with zero true positives, which trains operators to ignore it and erodes trust in the rest of the rule set
The root cause is a kernel-level overflow, but the alert rule is the layer where users encounter the problem and the layer where it can be mitigated quickly: a one-line PromQL change ships in the next release, whereas kernel fixes take years to propagate through long-lived guest images.
How to reproduce it?
Environment
- Linux node running as a guest under OpenStack / KVM
- node_exporter running on that node, scraped by Prometheus
- The node-mixin NodeClockSkewDetected alert loaded (directly or via kube-prometheus-stack)
- A working NTP daemon (chrony or ntpd) on the node
Steps
- Deploy node_exporter + Prometheus with the mixin alerts on a fleet of OpenStack/KVM nodes
- Let the cluster run for a few hours
- NodeClockSkewDetected alerts begin firing periodically and auto-resolving within seconds
Confirming the cause
Run this PromQL over the last 24h on an affected cluster:
promql
node_timex_offset_seconds{job="node-exporter"} == 4.294967296
This returns hits across affected nodes at the exact moments matching the alert fire/resolve cycle. The value 4.294967296 is exactly 2^32 nanoseconds, the kernel adjtimex(2) overflow.
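To tally the hits per node offline, a range-query result for node_timex_offset_seconds can be post-processed. The sketch below assumes a payload in the shape of the Prometheus HTTP API's matrix result; the function name, instance names, and sample values are hypothetical.

```python
from collections import Counter

OVERFLOW_SECONDS = 2 ** 32 / 1_000_000_000   # 4.294967296

def count_overflow_hits(response: dict) -> Counter:
    """Count samples per instance that equal the 2^32 ns overflow value."""
    hits = Counter()
    for series in response["data"]["result"]:
        instance = series["metric"].get("instance", "unknown")
        for _timestamp, raw_value in series["values"]:
            # The API encodes sample values as strings.
            if float(raw_value) == OVERFLOW_SECONDS:
                hits[instance] += 1
    return hits

# Hypothetical payload for illustration; not data from the affected clusters.
example_response = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {
                "metric": {"instance": "node-a:9100"},
                "values": [
                    [1700000000, "0.00001"],
                    [1700000060, "4.294967296"],
                    [1700000120, "0.00001"],
                ],
            },
            {
                "metric": {"instance": "node-b:9100"},
                "values": [[1700000000, "0.00002"]],
            },
        ],
    },
}

hits = count_overflow_hits(example_response)
print(hits)
```

An exact float comparison is used on purpose, since the claim in this report is that the metric shows exactly 2^32 ns rather than a value near it.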
Compare against the real clock state on the same nodes:
$ chronyc tracking | grep -E "Last offset|System time"
Reports offset well under 1ms — NTP is healthy throughout.
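If the comparison needs to be scripted across many nodes, the two grepped lines can be parsed and checked against a 1ms bound. This is a rough sketch: the helper name is hypothetical, the sample text uses made-up values, and it assumes the usual "Label : value seconds ..." layout of chronyc tracking output.

```python
import re

def parse_chrony_offsets(text: str) -> dict:
    """Extract absolute 'System time' and 'Last offset' values in seconds."""
    offsets = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if not sep or key not in ("System time", "Last offset"):
            continue
        match = re.search(r"[-+]?\d+\.\d+", value)
        if match:
            offsets[key] = abs(float(match.group()))
    return offsets

# Made-up sample output for illustration only.
sample_output = """\
System time     : 0.000004321 seconds slow of NTP time
Last offset     : -0.000002150 seconds
"""

offsets = parse_chrony_offsets(sample_output)
all_sub_millisecond = all(v < 0.001 for v in offsets.values())
print(offsets, all_sub_millisecond)
```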
Observed behavior
- NodeClockSkewDetected fires on every node in the cluster on a rolling basis, multiple times per day
- Each occurrence auto-resolves within seconds of clearing the for: 10m window
- node_timex_offset_seconds graphed over time shows vertical spikes to 4.294967 from a baseline near 0
- chronyc tracking on the same nodes shows real offset sub-millisecond throughout
Enter the changed values of values.yaml?
No response
Enter the command that you execute and failing/misfunctioning.
N/A — this is an alert rule misfiring in Prometheus, not a CLI failure.
Anything else we need to know?
No response