pimd: fix BSR_PENDING timer being overwritten by BS liveness timer (backport #22460) #22463
Merged
This addresses the likely root cause of the flaky test pim_cand_rp_bsr.test_pim_bsr_priority_modify which had a 3.4 - 11% failure rate in the weekly topotest report.

The bs_timer is used for two different purposes:
1. During BSR_PENDING state: a ~5 second timer before becoming BSR_ELECTED (callback: pim_cand_bsr_pending_expire)
2. During other states: a 130 second BS liveness timer (callback: pim_on_bs_timer)

When a candidate BSR is in BSR_PENDING state and receives a BSM from another BSR, the pim_bs_timer_restart() call would overwrite the pending timer with the liveness timer. This prevented the BSR_PENDING timer from ever expiring, causing the candidate BSR to never become elected even when it had higher priority. Fix by skipping pim_bs_timer_restart() when in BSR_PENDING state.

Additionally, when pim_bsm_update() drops a router out of BSR_PENDING state (due to receiving a BSM from a different BSR), the bs_timer with the pim_cand_bsr_pending_expire callback was not cancelled. This could cause an assertion failure when the timer fired with state != BSR_PENDING. Fix by cancelling the bs_timer in pim_bsm_update() before leaving BSR_PENDING state, and start the BS liveness timer after transitioning to ACCEPT_PREFERRED to ensure BSR expiry detection continues to work.

This also fixes a related issue where pim_cand_bsr_apply() would return early if the address selection hadn't changed, preventing priority-only changes from triggering the BSR state machine re-evaluation. Fix by always calling pim_cand_bsr_trigger() regardless of address change, with a guard that handles the case where an operator lowers the BSR priority while in BSR_PENDING or BSR_ELECTED state. For BSR_PENDING, cancel the pending timer and transition to ACCEPT_PREFERRED to avoid assertion failures in pim_cand_bsr_pending_expire.

Signed-off-by: Enke Chen <enchen@paloaltonetworks.com>
(cherry picked from commit 89b31a4)
This addresses the likely root cause of the flaky test pim_cand_rp_bsr.test_pim_bsr_priority_modify which had a 3.4 - 11% failure rate in the weekly topotest report.
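The bs_timer is used for two different purposes:
1. During BSR_PENDING state: a ~5 second timer before becoming BSR_ELECTED (callback: pim_cand_bsr_pending_expire)
2. During other states: a 130 second BS liveness timer (callback: pim_on_bs_timer)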
When a candidate BSR is in BSR_PENDING state and receives a BSM from another BSR, the pim_bs_timer_restart() call would overwrite the pending timer with the liveness timer. This prevented the BSR_PENDING timer from ever expiring, causing the candidate BSR to never become elected even when it had higher priority. Fix by skipping pim_bs_timer_restart() when in BSR_PENDING state.
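To illustrate the overwrite, here is a minimal toy model of a single timer slot shared by two callbacks. It is only a sketch: every `toy_*` name, the struct layout, and the `guarded` flag are hypothetical and do not come from pimd. The 5 and 130 second values mirror the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these types and names are hypothetical, not pimd's. */
enum toy_state { TOY_ACCEPT_PREFERRED, TOY_BSR_PENDING, TOY_BSR_ELECTED };

struct toy_scope;
typedef void (*toy_timer_cb)(struct toy_scope *scope);

struct toy_scope {
	enum toy_state state;
	toy_timer_cb bs_timer_cb; /* NULL when no timer is armed */
	int bs_timer_secs;
};

/* Stand-in for the pending-expiry callback: the candidate becomes elected. */
static void toy_pending_expire(struct toy_scope *scope)
{
	assert(scope->state == TOY_BSR_PENDING);
	scope->bs_timer_cb = NULL;
	scope->state = TOY_BSR_ELECTED;
}

/* Stand-in for the liveness callback: here it only disarms the slot. */
static void toy_liveness_expire(struct toy_scope *scope)
{
	scope->bs_timer_cb = NULL;
}

/* Arming the single slot silently replaces whatever was armed before. */
static void toy_arm(struct toy_scope *scope, toy_timer_cb cb, int secs)
{
	scope->bs_timer_cb = cb;
	scope->bs_timer_secs = secs;
}

/* BSM receive path; 'guarded' selects the behaviour with the fix applied. */
static void toy_on_bsm(struct toy_scope *scope, bool guarded)
{
	if (guarded && scope->state == TOY_BSR_PENDING)
		return; /* leave the pending timer alone */
	toy_arm(scope, toy_liveness_expire, 130);
}

/* Returns true if the pending timer is still armed after a BSM arrives. */
bool toy_pending_survives_bsm(bool guarded)
{
	struct toy_scope scope = { .state = TOY_BSR_PENDING };

	toy_arm(&scope, toy_pending_expire, 5);
	toy_on_bsm(&scope, guarded);
	return scope.bs_timer_cb == toy_pending_expire;
}
```

In this model, `toy_pending_survives_bsm(false)` reports that the pending callback was replaced, while `toy_pending_survives_bsm(true)` reports that it is still armed and can later move the state to elected.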
Additionally, when pim_bsm_update() drops a router out of BSR_PENDING state (due to receiving a BSM from a different BSR), the bs_timer with the pim_cand_bsr_pending_expire callback was not cancelled. This could cause an assertion failure when the timer fired with state != BSR_PENDING. Fix by cancelling the bs_timer in pim_bsm_update() before leaving BSR_PENDING state.
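The stale-callback hazard can be sketched the same way. The model below is an assumption-laden simplification, not the pimd code: the `mini_*` names and the `fixed` flag are hypothetical, and the unfixed path simply leaves the timer slot untouched. With the fix, the slot is cleared before the state changes and the liveness callback is armed afterwards, as the commit message describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these types and names are hypothetical, not pimd's. */
enum mini_state { MINI_ACCEPT_PREFERRED, MINI_BSR_PENDING, MINI_BSR_ELECTED };

struct mini_scope;
typedef void (*mini_timer_cb)(struct mini_scope *scope);

struct mini_scope {
	enum mini_state state;
	mini_timer_cb bs_timer_cb; /* NULL when no timer is armed */
};

/* Stand-in for the pending-expiry callback; it insists on BSR_PENDING. */
static void mini_pending_expire(struct mini_scope *scope)
{
	assert(scope->state == MINI_BSR_PENDING);
	scope->bs_timer_cb = NULL;
	scope->state = MINI_BSR_ELECTED;
}

/* Stand-in for the liveness callback: here it only disarms the slot. */
static void mini_liveness_expire(struct mini_scope *scope)
{
	scope->bs_timer_cb = NULL;
}

/* A BSM from a different BSR is accepted, so the router leaves BSR_PENDING. */
static void mini_bsm_update(struct mini_scope *scope, bool fixed)
{
	if (fixed && scope->state == MINI_BSR_PENDING)
		scope->bs_timer_cb = NULL; /* cancel before leaving BSR_PENDING */

	scope->state = MINI_ACCEPT_PREFERRED;

	if (fixed)
		scope->bs_timer_cb = mini_liveness_expire; /* keep expiry detection */
}

/* Returns true if a pending callback is left armed in a non-pending state,
 * which is the condition that would trip the assertion when it fires. */
bool mini_stale_pending_timer(bool fixed)
{
	struct mini_scope scope = {
		.state = MINI_BSR_PENDING,
		.bs_timer_cb = mini_pending_expire,
	};

	mini_bsm_update(&scope, fixed);
	return scope.bs_timer_cb == mini_pending_expire
	       && scope.state != MINI_BSR_PENDING;
}
```

Here `mini_stale_pending_timer(false)` returns true, meaning the armed callback would hit its assertion when fired, and `mini_stale_pending_timer(true)` returns false.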
This also fixes a related issue where pim_cand_bsr_apply() would return early if the address selection hadn't changed, preventing priority-only changes from triggering the BSR state machine re-evaluation. Fix by always calling pim_cand_bsr_trigger() regardless of address change.
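The effect of the early return can be shown with a small counter-based sketch. Everything here is hypothetical (the `sim_*` names, the struct fields, the `always_trigger` flag, and the sample address and priority values), and it assumes the old path stored the new values before returning. It only models whether the re-evaluation hook runs, not the state machine itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for candidate-BSR config; not pimd's structure. */
struct sim_cand_bsr {
	uint32_t addr;     /* selected candidate address */
	uint8_t priority;  /* configured BSR priority */
	int trigger_count; /* how often the state machine was re-evaluated */
};

/* Stand-in for the state machine re-evaluation hook. */
static void sim_trigger(struct sim_cand_bsr *cand)
{
	cand->trigger_count++;
}

/* always_trigger == false mimics the early return on an unchanged address. */
static void sim_apply(struct sim_cand_bsr *cand, uint32_t addr,
		      uint8_t priority, bool always_trigger)
{
	bool addr_changed = (cand->addr != addr);

	cand->addr = addr;
	cand->priority = priority;

	if (!addr_changed && !always_trigger)
		return; /* priority-only change is silently ignored */

	sim_trigger(cand);
}

/* Change only the priority and report how many re-evaluations happened. */
int sim_triggers_after_priority_change(bool always_trigger)
{
	struct sim_cand_bsr cand = { .addr = 0x0a000001, .priority = 64 };

	sim_apply(&cand, cand.addr, 200, always_trigger);
	return cand.trigger_count;
}

/* Usage: the early return misses the change, the unconditional call does not. */
void sim_self_check(void)
{
	assert(sim_triggers_after_priority_change(false) == 0);
	assert(sim_triggers_after_priority_change(true) == 1);
}
```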
This is an automatic backport of pull request #22460 done by Mergify.