fix(eventualreg): converge own names after a process restart #343

Open
wolfy-j wants to merge 1 commit into main from fix/eventual-registry-restart-convergence

wolfy-j commented Jun 14, 2026

Contributor

Problem

A node that restarts re-registers its names with localCounter reset to 0 (NewState). Meanwhile, on the node's NodeLeft, peers reap its bindings into tombstones at the prior incarnation's counter. The restarted node's fresh dot (counter 1) is therefore causally stale: State.Apply drops it via the in.Counter < cur.Counter branch (and the reap comment notes the tombstone is "terminal"). The name stays dead cluster-wide, so cluster-visible names such as a control-RPC endpoint or a per-node read service resolve as "name not registered" indefinitely after a control-node restart.
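The failure can be reproduced in miniature. The following is a simplified sketch, not the package's actual code: the Dot and Entry types, the Resolve helper, and the exact shape of Apply are assumptions made for illustration. Only the in.Counter < cur.Counter comparison is taken from the description above.

```go
package main

// Dot is a hypothetical causal tag: the originating node plus its counter.
type Dot struct {
	Node    string
	Counter uint64
}

// Entry is a hypothetical registry record. Tombstone marks a reaped binding.
type Entry struct {
	Name      string
	Dot       Dot
	Tombstone bool
}

// State is a simplified stand-in for a peer's view of the registry.
type State struct {
	entries map[string]Entry
}

// NewState returns an empty state.
func NewState() *State {
	return &State{entries: make(map[string]Entry)}
}

// Apply merges an incoming entry and reports whether it was accepted.
// A same-origin entry with a lower counter than the stored one is dropped.
// Cross-origin conflict handling is omitted in this sketch.
func (s *State) Apply(in Entry) bool {
	cur, ok := s.entries[in.Name]
	if ok && cur.Dot.Node == in.Dot.Node && in.Dot.Counter < cur.Dot.Counter {
		return false // causally stale: the stored dot or tombstone wins
	}
	s.entries[in.Name] = in
	return true
}

// Resolve reports whether a name currently has a live (non-tombstone) binding.
func (s *State) Resolve(name string) (Entry, bool) {
	e, ok := s.entries[name]
	return e, ok && !e.Tombstone
}
```

In this sketch, a peer that applies a registration from node-a at counter 7, then a reap tombstone at counter 7, rejects a later re-registration from node-a at counter 1, and Resolve keeps reporting the name as unregistered.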

Fix

Two layers, both off the merge hot path:

  1. state: Lamport-seed the mint. nextCounter() returns a counter strictly above both localCounter and cv[localNode]. A restarted node relearns its prior incarnation's highest counter via anti-entropy (bumpCV), so a re-registration now dominates the prior dot/tombstone instead of losing to it.
  2. service: owner re-assertion. The service tracks names it registered live; when an incoming same-origin dot (a stale reap of a prior incarnation) overrides one, it re-registers it (now with a dominating counter), healing convergence. The hot-path cost is a single lock-free e.Node == LocalNode() compare; the owned map/lock and re-register run only on the rare self-origin override.
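The two layers can be sketched together as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: the Service type, the onOverride hook and its signature, and the field layout are hypothetical. The names nextCounter, bumpCV, localCounter, and cv come from the description above, but their bodies here are guesses at the behavior described.

```go
package main

import "sync"

// State is a simplified stand-in. It is left unsynchronized in this sketch.
type State struct {
	localNode    string
	localCounter uint64
	cv           map[string]uint64 // highest counter observed per origin node
}

// NewState starts with localCounter at 0, as after a process restart.
func NewState(localNode string) *State {
	return &State{localNode: localNode, cv: make(map[string]uint64)}
}

// bumpCV records the highest counter observed for an origin node.
func (s *State) bumpCV(node string, counter uint64) {
	if counter > s.cv[node] {
		s.cv[node] = counter
	}
}

// nextCounter mints a counter strictly above both localCounter and
// cv[localNode] (a Lamport-style advance).
func (s *State) nextCounter() uint64 {
	next := s.localCounter
	if seen := s.cv[s.localNode]; seen > next {
		next = seen
	}
	next++
	s.localCounter = next
	return next
}

// Service is a hypothetical owner of locally registered names.
type Service struct {
	state *State
	mu    sync.Mutex
	owned map[string]struct{}
}

// NewService wraps a state with an empty owned-name set.
func NewService(st *State) *Service {
	return &Service{state: st, owned: make(map[string]struct{})}
}

// Register records the name as owned and returns the counter minted for it.
func (svc *Service) Register(name string) uint64 {
	svc.mu.Lock()
	defer svc.mu.Unlock()
	svc.owned[name] = struct{}{}
	return svc.state.nextCounter()
}

// onOverride is a hypothetical hook invoked when an incoming dot overrides a
// local entry. It returns a fresh dominating counter when the name must be
// re-asserted, and false otherwise.
func (svc *Service) onOverride(name, origin string, counter uint64) (uint64, bool) {
	if origin != svc.state.localNode {
		return 0, false // common case: a single compare, no lock taken
	}
	svc.mu.Lock()
	defer svc.mu.Unlock()
	if _, ok := svc.owned[name]; !ok {
		return 0, false
	}
	svc.state.bumpCV(origin, counter)  // relearn the prior incarnation's counter
	return svc.state.nextCounter(), true // mint above it
}
```

With this sketch, a restarted node-a that registers a name mints counter 1. When a stale self-origin dot at counter 7 overrides that name, onOverride returns counter 8, which a peer holding the counter-7 tombstone would accept. Overrides from other origins return immediately without touching the lock.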

Tests

  • TestRegister_SeedsCounterAboveObservedOrigin — the Lamport seed.
  • TestRestart_ReclaimsOwnNameAfterReapTombstone — full register → reap → restart → rejoin convergence (fails on main, passes here).
  • Full package go test -race + go vet clean.

A node that restarts re-registers its names with localCounter reset to 0, so a
fresh dot is causally stale behind the node-left reap tombstone peers minted at
the prior incarnation's counter. Apply drops it (in.Counter < cur.Counter) and the
name stays dead cluster-wide (control_rpc/node_query => 'name not registered').

- state: nextCounter() seeds a local mint above cv[localNode] (Lamport advance), so
  a re-registration dominates the prior incarnation's dot/tombstone.
- service: track owned names and re-assert one when an incoming same-origin dot
  (a stale reap) overrides it, off the merge hot path (guarded by a lock-free
  localNode compare).

Tests: Lamport-seed unit + full restart->reap->rejoin convergence. race + vet clean.