pimd: fix BSR_PENDING timer being overwritten by BS liveness timer #22460
Greptile Summary
This PR fixes a set of interrelated bugs in the PIM Bootstrap Router (BSR) election state machine in
Confidence Score: 5/5
Safe to merge. All three state-machine bugs are correctly addressed, including the previously flagged assertion crash on priority changes and the timer gap when transitioning out of BSR_PENDING.
The three independent fixes are each self-contained and correctly handle their respective edge cases. The guard added to pim_cand_bsr_apply prevents the pim_cand_bsr_trigger assertions from being reachable in the BSR_PENDING and BSR_ELECTED states. The pim_bsm_update change correctly cancels the pending-expiry callback and starts the liveness timer when exiting BSR_PENDING, so no timer gap remains. No files require special attention.
Reviews (4): Last reviewed commit: "pimd: fix BSR_PENDING timer being overwr..."
d794f6d to 4f63cc4
@greptileai review
4f63cc4 to 25dea3a
This addresses the likely root cause of the flaky test pim_cand_rp_bsr.test_pim_bsr_priority_modify which had a 3.4 - 11% failure rate in the weekly topotest report.

The bs_timer is used for two different purposes:
1. During BSR_PENDING state: a ~5 second timer before becoming BSR_ELECTED (callback: pim_cand_bsr_pending_expire)
2. During other states: a 130 second BS liveness timer (callback: pim_on_bs_timer)

When a candidate BSR is in BSR_PENDING state and receives a BSM from another BSR, the pim_bs_timer_restart() call would overwrite the pending timer with the liveness timer. This prevented the BSR_PENDING timer from ever expiring, causing the candidate BSR to never become elected even when it had higher priority. Fix by skipping pim_bs_timer_restart() when in BSR_PENDING state.

Additionally, when pim_bsm_update() drops a router out of BSR_PENDING state (due to receiving a BSM from a different BSR), the bs_timer with the pim_cand_bsr_pending_expire callback was not cancelled. This could cause an assertion failure when the timer fired with state != BSR_PENDING. Fix by cancelling the bs_timer in pim_bsm_update() before leaving BSR_PENDING state, and start the BS liveness timer after transitioning to ACCEPT_PREFERRED to ensure BSR expiry detection continues to work.

This also fixes a related issue where pim_cand_bsr_apply() would return early if the address selection hadn't changed, preventing priority-only changes from triggering the BSR state machine re-evaluation. Fix by always calling pim_cand_bsr_trigger() regardless of address change, with a guard that handles the case where an operator lowers the BSR priority while in BSR_PENDING or BSR_ELECTED state. For BSR_PENDING, cancel the pending timer and transition to ACCEPT_PREFERRED to avoid assertion failures in pim_cand_bsr_pending_expire.

Signed-off-by: Enke Chen <enchen@paloaltonetworks.com>
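The first fix above (one timer slot shared by two callbacks, with the liveness restart skipped in BSR_PENDING) can be illustrated with a small standalone model. This is only a sketch and not the pimd code: `struct bsm_scope`, `timer_arm`, `timer_fire`, `bs_liveness_restart` and the two callbacks are hypothetical stand-ins for the real structures, the FRR event API, pim_bs_timer_restart(), pim_cand_bsr_pending_expire and pim_on_bs_timer. The 5 and 130 second values are taken from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for pimd types; every name here is illustrative. */
enum bsr_state { NO_INFO, ACCEPT_PREFERRED, BSR_PENDING, BSR_ELECTED };

struct bsm_scope;
typedef void (*timer_cb)(struct bsm_scope *);

struct bsm_scope {
	enum bsr_state state;
	timer_cb bs_timer_cb; /* single timer slot; NULL when nothing is armed */
	int bs_timer_secs;
};

/* Models the pending-expiry callback: only valid while in BSR_PENDING. */
static void on_pending_expire(struct bsm_scope *s)
{
	assert(s->state == BSR_PENDING);
	s->state = BSR_ELECTED;
}

/* Models the BS liveness callback: the known BSR has timed out. */
static void on_bs_liveness_expire(struct bsm_scope *s)
{
	s->state = NO_INFO;
}

/* Arming always replaces whatever was in the slot before. */
static void timer_arm(struct bsm_scope *s, timer_cb cb, int secs)
{
	s->bs_timer_cb = cb;
	s->bs_timer_secs = secs;
}

/* Liveness restart with the guard: leave the pending timer alone. */
static void bs_liveness_restart(struct bsm_scope *s)
{
	if (s->state == BSR_PENDING)
		return;
	timer_arm(s, on_bs_liveness_expire, 130);
}

/* Simulate expiry of whatever is currently armed. */
static void timer_fire(struct bsm_scope *s)
{
	timer_cb cb = s->bs_timer_cb;

	s->bs_timer_cb = NULL;
	if (cb)
		cb(s);
}
```

In this model, arming the pending timer, calling `bs_liveness_restart()` (as would happen on receipt of a BSM) and then firing the timer ends in BSR_ELECTED. Without the guard, the slot would hold the liveness callback instead and the election would never complete.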
25dea3a to 89b31a4
@greptileai review
@Mergifyio backport stable/10.7 stable/10.6
✅ Backports have been created
pimd: fix BSR_PENDING timer being overwritten by BS liveness timer (backport #22460)
This addresses the likely root cause of the flaky test pim_cand_rp_bsr.test_pim_bsr_priority_modify which had a 3.4 - 11% failure rate in the weekly topotest report.
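The bs_timer is used for two different purposes:
1. During BSR_PENDING state: a ~5 second timer before becoming BSR_ELECTED (callback: pim_cand_bsr_pending_expire)
2. During other states: a 130 second BS liveness timer (callback: pim_on_bs_timer)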
When a candidate BSR is in BSR_PENDING state and receives a BSM from another BSR, the pim_bs_timer_restart() call would overwrite the pending timer with the liveness timer. This prevented the BSR_PENDING timer from ever expiring, causing the candidate BSR to never become elected even when it had higher priority. Fix by skipping pim_bs_timer_restart() when in BSR_PENDING state.
Additionally, when pim_bsm_update() drops a router out of BSR_PENDING state (due to receiving a BSM from a different BSR), the bs_timer with the pim_cand_bsr_pending_expire callback was not cancelled. This could cause an assertion failure when the timer fired with state != BSR_PENDING. Fix by cancelling the bs_timer in pim_bsm_update() before leaving BSR_PENDING state.
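The invariant behind this second fix is that the pending-expiry callback may only be armed while the state is BSR_PENDING. A minimal model of that invariant follows; it is a sketch under assumptions, not the pimd implementation. All names (`struct bsm_scope`, `enum timer_kind`, both `leave_pending_*` helpers) are hypothetical, and re-arming the liveness timer after the transition follows the full commit message on the original PR.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins only; not the real pimd definitions. */
enum bsr_state { NO_INFO, ACCEPT_PREFERRED, BSR_PENDING, BSR_ELECTED };
enum timer_kind { TIMER_NONE, TIMER_PENDING_EXPIRE, TIMER_BS_LIVENESS };

struct bsm_scope {
	enum bsr_state state;
	enum timer_kind bs_timer; /* which callback the single slot holds */
};

/* True when it is safe for the armed timer to fire in the current state. */
static bool timer_invariant_holds(const struct bsm_scope *s)
{
	return s->bs_timer != TIMER_PENDING_EXPIRE || s->state == BSR_PENDING;
}

/* Pre-fix behaviour in this model: the state changes, the timer stays. */
static void leave_pending_without_cancel(struct bsm_scope *s)
{
	s->state = ACCEPT_PREFERRED;
}

/* Post-fix behaviour in this model: cancel, transition, arm liveness. */
static void leave_pending_with_cancel(struct bsm_scope *s)
{
	if (s->state == BSR_PENDING)
		s->bs_timer = TIMER_NONE; /* cancel the pending-expiry timer */
	s->state = ACCEPT_PREFERRED;
	s->bs_timer = TIMER_BS_LIVENESS; /* keep BSR expiry detection alive */
	assert(timer_invariant_holds(s));
}
```

Starting from BSR_PENDING with the pending timer armed, `leave_pending_without_cancel()` leaves the model in a state where a later expiry would hit the assertion, while `leave_pending_with_cancel()` keeps the invariant intact.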
This also fixes a related issue where pim_cand_bsr_apply() would return early if the address selection hadn't changed, preventing priority-only changes from triggering the BSR state machine re-evaluation. Fix by always calling pim_cand_bsr_trigger() regardless of address change.
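The effect of the early return can be shown with a toy model that counts how often the state machine is re-evaluated. This is a sketch only: `struct cand_bsr`, `trigger`, `apply_early_return` and `apply_always_trigger` are hypothetical names that loosely mirror pim_cand_bsr_trigger() and the old and new shapes of pim_cand_bsr_apply(); the real functions do considerably more.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for the candidate-BSR configuration state. */
struct cand_bsr {
	unsigned int addr;      /* selected candidate address */
	unsigned char priority; /* configured BSR priority */
	int reevaluations;      /* times the election logic was re-run */
};

/* Stand-in for re-running the BSR election state machine. */
static void trigger(struct cand_bsr *c)
{
	assert(c != NULL);
	c->reevaluations++;
}

/* Old shape: nothing happens unless the selected address changed. */
static void apply_early_return(struct cand_bsr *c, unsigned int addr,
			       unsigned char priority)
{
	bool addr_changed = (c->addr != addr);

	c->addr = addr;
	c->priority = priority;
	if (!addr_changed)
		return; /* a priority-only change is silently dropped here */
	trigger(c);
}

/* New shape: always re-evaluate, so priority-only changes take effect. */
static void apply_always_trigger(struct cand_bsr *c, unsigned int addr,
				 unsigned char priority)
{
	c->addr = addr;
	c->priority = priority;
	trigger(c);
}
```

With the address held constant and only the priority changed, `apply_early_return()` leaves `reevaluations` untouched, whereas `apply_always_trigger()` increments it.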