Skip to content

pimd: fix BSR_PENDING timer being overwritten by BS liveness timer#22460

Merged
Jafaral merged 1 commit into
FRRouting:masterfrom
enkechen-panw:pim-bsr-timer
Jun 24, 2026
Merged

pimd: fix BSR_PENDING timer being overwritten by BS liveness timer#22460
Jafaral merged 1 commit into
FRRouting:masterfrom
enkechen-panw:pim-bsr-timer

Conversation

@enkechen-panw

Copy link
Copy Markdown
Contributor

This addresses the likely root cause of the flaky test pim_cand_rp_bsr.test_pim_bsr_priority_modify which had a 3.4 - 11% failure rate in the weekly topotest report.

The bs_timer is used for two different purposes:

  1. During BSR_PENDING state: a ~5 second timer before becoming BSR_ELECTED (callback: pim_cand_bsr_pending_expire)
  2. During other states: a 130 second BS liveness timer (callback: pim_on_bs_timer)

When a candidate BSR is in BSR_PENDING state and receives a BSM from another BSR, the pim_bs_timer_restart() call would overwrite the pending timer with the liveness timer. This prevented the BSR_PENDING timer from ever expiring, causing the candidate BSR to never become elected even when it had higher priority. Fix by skipping pim_bs_timer_restart() when in BSR_PENDING state.

Additionally, when pim_bsm_update() drops a router out of BSR_PENDING state (due to receiving a BSM from a different BSR), the bs_timer with the pim_cand_bsr_pending_expire callback was not cancelled. This could cause an assertion failure when the timer fired with state != BSR_PENDING. Fix by cancelling the bs_timer in pim_bsm_update() before leaving BSR_PENDING state.

This also fixes a related issue where pim_cand_bsr_apply() would return early if the address selection hadn't changed, preventing priority-only changes from triggering the BSR state machine re-evaluation. Fix by always calling pim_cand_bsr_trigger() regardless of address change.

@greptile-apps

greptile-apps Bot commented Jun 23, 2026

Copy link
Copy Markdown

Greptile Summary

This PR fixes a set of interrelated bugs in the PIM Bootstrap Router (BSR) election state machine in pimd/pim_bsm.c that caused a flaky topology test. The core problem is that bs_timer serves three different roles depending on state — a pending-to-elected countdown, a periodic BSM generation trigger (elected), and a BS liveness timer (all other states) — and several code paths were mixing them up.

  • BSR_PENDING timer protection (pim_bsm_process): The BS liveness timer is no longer restarted when in BSR_PENDING state, preventing it from overwriting the pending-expiry timer and blocking the candidate from ever becoming elected.
  • Timer cleanup on state exit (pim_bsm_update): When a better BSR's message drops the router out of BSR_PENDING, the pending-expiry callback is now cancelled before the transition, and the liveness timer is explicitly started — closing both an assertion crash path and the timer gap flagged in earlier review threads.
  • Priority-change coverage (pim_cand_bsr_apply): The early return on cand_addrsel_update returning false is removed so that priority-only changes (no address change) now correctly trigger the BSR state machine re-evaluation, with a new guard that safely handles the cases where the current BSR is already stronger than our candidacy.

Confidence Score: 5/5

Safe to merge. All three state-machine bugs are correctly addressed, including the previously flagged assertion crash on priority changes and the timer gap when transitioning out of BSR_PENDING.

The three independent fixes are each self-contained and correctly handle their respective edge cases. The guard added to pim_cand_bsr_apply prevents pim_cand_bsr_trigger assertions from being reachable in BSR_PENDING and BSR_ELECTED states. The pim_bsm_update change correctly cancels the pending-expiry callback and starts the liveness timer when exiting BSR_PENDING, so no timer gap remains.

No files require special attention.

Important Files Changed

Filename Overview
pimd/pim_bsm.c Three coordinated fixes to the BSR state machine: timer guard in pim_bsm_process, timer cancel+restart in pim_bsm_update for BSR_PENDING exit, and assertion-crash guard in pim_cand_bsr_apply for priority/address changes in BSR_PENDING and BSR_ELECTED states. All previously flagged assertion-crash and timer-gap concerns appear resolved.

Reviews (4): Last reviewed commit: "pimd: fix BSR_PENDING timer being overwr..." | Re-trigger Greptile

Comment thread pimd/pim_bsm.c
@enkechen-panw enkechen-panw marked this pull request as draft June 23, 2026 21:42
@github-actions github-actions Bot added size/M and removed size/S labels Jun 23, 2026
@enkechen-panw

Copy link
Copy Markdown
Contributor Author

@greptileai review

@enkechen-panw enkechen-panw marked this pull request as ready for review June 23, 2026 22:17
This addresses the likely root cause of the flaky test
pim_cand_rp_bsr.test_pim_bsr_priority_modify which had a 3.4 - 11%
failure rate in the weekly topotest report.

The bs_timer is used for two different purposes:
1. During BSR_PENDING state: a ~5 second timer before becoming BSR_ELECTED
   (callback: pim_cand_bsr_pending_expire)
2. During other states: a 130 second BS liveness timer
   (callback: pim_on_bs_timer)

When a candidate BSR is in BSR_PENDING state and receives a BSM from
another BSR, the pim_bs_timer_restart() call would overwrite the pending
timer with the liveness timer. This prevented the BSR_PENDING timer from
ever expiring, causing the candidate BSR to never become elected even
when it had higher priority. Fix by skipping pim_bs_timer_restart() when
in BSR_PENDING state.

Additionally, when pim_bsm_update() drops a router out of BSR_PENDING
state (due to receiving a BSM from a different BSR), the bs_timer with
the pim_cand_bsr_pending_expire callback was not cancelled. This could
cause an assertion failure when the timer fired with state != BSR_PENDING.
Fix by cancelling the bs_timer in pim_bsm_update() before leaving
BSR_PENDING state, and start the BS liveness timer after transitioning
to ACCEPT_PREFERRED to ensure BSR expiry detection continues to work.

This also fixes a related issue where pim_cand_bsr_apply() would return
early if the address selection hadn't changed, preventing priority-only
changes from triggering the BSR state machine re-evaluation. Fix by
always calling pim_cand_bsr_trigger() regardless of address change,
with a guard that handles the case where an operator lowers the BSR
priority while in BSR_PENDING or BSR_ELECTED state. For BSR_PENDING,
cancel the pending timer and transition to ACCEPT_PREFERRED to avoid
assertion failures in pim_cand_bsr_pending_expire.

Signed-off-by: Enke Chen <enchen@paloaltonetworks.com>
@enkechen-panw

Copy link
Copy Markdown
Contributor Author

@greptileai review

@Jafaral

Jafaral commented Jun 24, 2026

Copy link
Copy Markdown
Member

@Mergifyio backport stable/10.7 stable/10.6

@mergify

mergify Bot commented Jun 24, 2026

Copy link
Copy Markdown

backport stable/10.7 stable/10.6

✅ Backports have been created

Details

@Jafaral Jafaral merged commit d50eefd into FRRouting:master Jun 24, 2026
23 checks passed
@enkechen-panw enkechen-panw deleted the pim-bsr-timer branch June 24, 2026 02:07
Jafaral added a commit that referenced this pull request Jun 24, 2026
pimd: fix BSR_PENDING timer being overwritten by BS liveness timer (backport #22460)
Jafaral added a commit that referenced this pull request Jun 24, 2026
pimd: fix BSR_PENDING timer being overwritten by BS liveness timer (backport #22460)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants