
feat: endpoint failover for unavailable data#61

Open
barnabasbusa wants to merge 5 commits into master from bbusa/endpoint-failover


@barnabasbusa barnabasbusa commented Jul 3, 2026


Endpoint failover for unavailable data

Problem

Dugtrio picks one endpoint per call (sticky per session) and streams back whatever it returns. For data whose availability differs per endpoint — most notably historical states, which only nodes with state archiving serve — this means requests keep failing even when another endpoint in the pool has the data:

  • eth/v1/beacon/states/{old-slot}/validators → 404 "State not found" (pruned) or 500 "Historical state regen is not enabled" (lodestar without --chain.serveHistoricalState), while e.g. nimbus endpoints serve it fine.
  • Worse, some endpoints hang on these queries (observed on glamsterdam-devnet-6: a prysm node blocks indefinitely on historical state queries, and a lodestar node's nginx hangs on error responses when the request has Accept-Encoding: gzip), so the client just burns the whole callTimeout. These nodes pass all pool health checks (version/syncing/head), so they never get evicted — the hang only manifests on the failing request path.

Change

When an upstream cannot serve the requested data, retry the call on other ready endpoints instead of returning the failure:

  • Trigger: calls matching failoverPaths (default: ^/eth/v[0-9]+/beacon/states/, ^/eth/v[0-9]+/debug/beacon/states/) that fail with 404/500/501/503 or a connection error.
  • Sweep order: remaining ready endpoints interleaved by client type, so every client implementation is covered as early as possible (e.g. the archiving client is found within the first few attempts rather than after trying all instances of a pruning client). Nothing is persisted between calls — each failover sweeps fresh, so load balancing is preserved.
  • Hedging instead of per-attempt timeouts: fast upstream failures advance to the next candidate immediately. An attempt that stays silent for failoverHedgeDelay (default 20s) is not cancelled — it keeps running while the next candidate is raced in parallel, bounded by failoverMaxParallel (default 2). The first acceptable response wins; the other attempts are cancelled. So slow-but-capable endpoints keep the full call timeout to load a state (mainnet/hoodi state queries regularly take a minute+), while hanging endpoints only cost latency, not time budget — no cascade.
  • Fallback: if no endpoint can serve the data, the last upstream failure response is passed through to the client instead of a generic 500.
  • Scope: GET/HEAD/POST only (POST bodies up to 1MB are buffered for replay). Calls pinned to an explicit endpoint via X-Dugtrio-Next-Endpoint are excluded; client-type filters (prefix proxies / type filter) are honored during the sweep. Failover does not change the session's sticky endpoint. Rate limiting still charges 1 call, however many attempts the sweep makes.
  • Served failover responses expose X-Dugtrio-Failover-Attempts.
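
The interleaved sweep order described above can be sketched as a round-robin over client types. This is a minimal illustration, not the code in this PR: the `endpoint` type, `interleaveByClient` and `names` are hypothetical, and the sketch is deterministic, whereas the real pool may order endpoints within a client type differently.

```go
package main

import (
	"fmt"
	"strings"
)

// endpoint is a hypothetical stand-in for a ready pool entry.
type endpoint struct {
	Name   string
	Client string
}

// interleaveByClient orders candidates round-robin across client types, so
// one endpoint of every client is tried before any client is tried twice.
// Client types keep their first-appearance order.
func interleaveByClient(ready []endpoint) []endpoint {
	groups := map[string][]endpoint{}
	var order []string
	for _, ep := range ready {
		if _, seen := groups[ep.Client]; !seen {
			order = append(order, ep.Client)
		}
		groups[ep.Client] = append(groups[ep.Client], ep)
	}
	out := make([]endpoint, 0, len(ready))
	for round := 0; len(out) < len(ready); round++ {
		for _, client := range order {
			if round < len(groups[client]) {
				out = append(out, groups[client][round])
			}
		}
	}
	return out
}

// names joins endpoint names for printing.
func names(eps []endpoint) string {
	parts := make([]string, len(eps))
	for i, ep := range eps {
		parts[i] = ep.Name
	}
	return strings.Join(parts, " ")
}

func main() {
	ready := []endpoint{
		{"prysm-1", "prysm"}, {"prysm-2", "prysm"}, {"prysm-3", "prysm"},
		{"lodestar-1", "lodestar"}, {"nimbus-1", "nimbus"},
	}
	// prints "prysm-1 lodestar-1 nimbus-1 prysm-2 prysm-3"
	fmt.Println(names(interleaveByClient(ready)))
}
```

With three pruning prysm nodes listed first, the single nimbus node is reached on the third attempt instead of the fifth.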

New config (all optional, feature is on by default for state paths; disable with failoverDisabled: true):

proxy:
  #failoverDisabled: false
  #failoverPaths:
  #  - ^/eth/v[0-9]+/beacon/states/
  #  - ^/eth/v[0-9]+/debug/beacon/states/
  #failoverMaxAttempts: 8
  #failoverHedgeDelay: 20s
  #failoverMaxParallel: 2
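
As a rough illustration of how these options and their defaults could be represented on the Go side, the sketch below uses the option names from the config above. The `FailoverConfig` struct, its field names, the yaml tags and both helpers are assumptions for illustration, not Dugtrio's actual types.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// FailoverConfig is a hypothetical struct mirroring the options above.
type FailoverConfig struct {
	Disabled    bool          `yaml:"failoverDisabled"`
	Paths       []string      `yaml:"failoverPaths"`
	MaxAttempts int           `yaml:"failoverMaxAttempts"`
	HedgeDelay  time.Duration `yaml:"failoverHedgeDelay"`
	MaxParallel int           `yaml:"failoverMaxParallel"`
}

// withDefaults fills unset options with the defaults shown in the config.
func (c FailoverConfig) withDefaults() FailoverConfig {
	if len(c.Paths) == 0 {
		c.Paths = []string{
			`^/eth/v[0-9]+/beacon/states/`,
			`^/eth/v[0-9]+/debug/beacon/states/`,
		}
	}
	if c.MaxAttempts <= 0 {
		c.MaxAttempts = 8
	}
	if c.HedgeDelay <= 0 {
		c.HedgeDelay = 20 * time.Second
	}
	if c.MaxParallel <= 0 {
		c.MaxParallel = 2
	}
	return c
}

// compilePaths validates the patterns up front, so a malformed regular
// expression is reported at startup rather than on the first request.
func compilePaths(paths []string) ([]*regexp.Regexp, error) {
	out := make([]*regexp.Regexp, 0, len(paths))
	for _, p := range paths {
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, fmt.Errorf("failoverPaths: %w", err)
		}
		out = append(out, re)
	}
	return out, nil
}

func main() {
	cfg := FailoverConfig{}.withDefaults()
	res, err := compilePaths(cfg.Paths)
	// prints "2 <nil> 8 20s 2"
	fmt.Println(len(res), err, cfg.MaxAttempts, cfg.HedgeDelay, cfg.MaxParallel)
}
```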

Review updates

  • Removed persistence of the retried endpoint (0c06f1f): the first version remembered the last endpoint that served a path class and tried it first, which would concentrate failover traffic from all sessions on the same node. Load balancing is prioritized instead (thanks @pk910).
  • Hedging replaces the hard per-attempt timeout (d6874a6): the first version cancelled attempts after failoverAttemptTimeout, which on bigger networks (where state loads take a minute+) would cascade — no endpoint got enough time to load the state (thanks @pk910). failoverAttemptTimeout is replaced by failoverHedgeDelay/failoverMaxParallel.

Testing

Unit tests (proxy/proxycall_test.go, run with -race in CI) cover the failover coordinator against mock upstreams: fast-failure sweep, hedging on silent endpoints, slow endpoint still winning after hedges started (the cascade regression), fallback passthrough when nobody has the data, the parallelism cap, POST body replay, and the non-failover passthrough path.

Manual testing of the original version against all 45 glamsterdam-devnet-6 nodes (6 client types; nimbus is the only one serving historical states there, and the pool includes endpoints that hang on these queries):

| Case | Result |
| --- | --- |
| GET /eth/v1/beacon/states/10000/validators | 200 from nimbus-erigon-1, X-Dugtrio-Failover-Attempts: 4, hanging endpoints skipped |
| GET /eth/v2/debug/beacon/states/10000 (SSZ) | 200, 3.7MB body (5 attempts) |
| POST .../validators with JSON body | 200, body replayed, 1 attempt, 0.2s |
| GET /eth/v1/beacon/headers/head | untouched: no failover header, normal sticky endpoint |
| nonexistent state (slot 99999999) | bounded sweep (8 attempts), upstream error returned, no hang |
| ?dugtrio-next-endpoint=<node> (pinned) | no failover, response passed through as before |
| ?dugtrio-next-endpoint=lodestar (type filter) | sweep stayed within lodestar endpoints, upstream error passed through |

go build clean; golangci-lint reports no new findings (2 pre-existing on master).

When an endpoint cannot serve the requested data (404/500/501/503,
connection error or attempt timeout), retry the call on other ready
endpoints instead of returning the failure to the client. Alternate
endpoints are swept interleaved by client type so every client
implementation is covered as early as possible, and the last endpoint
that successfully served a path class is remembered and tried first
on subsequent failovers.

Failover applies to GET/HEAD/POST calls matching configurable path
patterns (default: state queries, where availability differs per
endpoint depending on pruning/archiving). POST bodies up to 1MB are
buffered for replay. Calls pinned to an explicit endpoint via
X-Dugtrio-Next-Endpoint are excluded; client-type filters are
honored during the sweep.
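
The scope rules in this commit message can be sketched roughly as follows: an eligibility check, a status classifier, and a bounded body buffer that hands out a fresh reader per attempt. All helper names here are hypothetical. Returning an error for oversized bodies is an assumption, since the commit does not say how such requests are handled.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"net/http"
	"regexp"
	"strings"
)

const maxReplayBody = 1 << 20 // 1MB replay limit from the commit message

// Default path patterns as listed in the PR description.
var failoverPaths = []*regexp.Regexp{
	regexp.MustCompile(`^/eth/v[0-9]+/beacon/states/`),
	regexp.MustCompile(`^/eth/v[0-9]+/debug/beacon/states/`),
}

var errBodyTooLarge = errors.New("request body exceeds replay limit")

// eligible reports whether a call may fail over at all. pinned means the
// caller selected an explicit endpoint via X-Dugtrio-Next-Endpoint.
func eligible(method, path string, pinned bool) bool {
	if pinned {
		return false
	}
	switch method {
	case http.MethodGet, http.MethodHead, http.MethodPost:
	default:
		return false
	}
	for _, re := range failoverPaths {
		if re.MatchString(path) {
			return true
		}
	}
	return false
}

// retryableStatus reports whether an upstream status triggers the sweep.
func retryableStatus(status int) bool {
	switch status {
	case 404, 500, 501, 503:
		return true
	}
	return false
}

// bufferForReplay reads up to limit bytes and returns a factory that yields
// a fresh reader over the same bytes for every attempt.
func bufferForReplay(body io.Reader, limit int64) (func() io.Reader, error) {
	buf, err := io.ReadAll(io.LimitReader(body, limit+1))
	if err != nil {
		return nil, err
	}
	if int64(len(buf)) > limit {
		return nil, errBodyTooLarge
	}
	return func() io.Reader { return bytes.NewReader(buf) }, nil
}

// replayTwice is a demo helper that reads the buffered body two times.
func replayTwice(payload string, limit int64) (string, string, error) {
	newBody, err := bufferForReplay(strings.NewReader(payload), limit)
	if err != nil {
		return "", "", err
	}
	first, _ := io.ReadAll(newBody())
	second, _ := io.ReadAll(newBody())
	return string(first), string(second), nil
}

func main() {
	fmt.Println(eligible("GET", "/eth/v1/beacon/states/10000/validators", false)) // true
	fmt.Println(eligible("GET", "/eth/v1/beacon/headers/head", false))            // false
	fmt.Println(retryableStatus(404), retryableStatus(200))                       // true false
	a, b, err := replayTwice(`{"ids":["1"]}`, maxReplayBody)
	fmt.Println(a == b, err) // true <nil>
}
```

In this sketch an oversized body has already been partly consumed when the error is returned, so a real proxy would need to stitch the read prefix back in front of the remaining stream before forwarding the request without failover.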

While alternate candidates remain, each attempt is bounded by
failoverAttemptTimeout (default 20s) so a hanging endpoint cannot
consume the whole call timeout; the last candidate runs on the
remaining call budget. Served failover responses expose the
X-Dugtrio-Failover-Attempts header.
@barnabasbusa barnabasbusa requested a review from pk910 as a code owner July 3, 2026 07:58
@barnabasbusa barnabasbusa added the build-docker-image build docker image for this PR label Jul 3, 2026
barnabasbusa and others added 4 commits July 3, 2026 10:09
Remove the proxy-wide map that remembered the last endpoint serving a
failover path class and tried it first on subsequent failovers. It
concentrated all failover traffic from all sessions on the same node
instead of load balancing. Candidates are now always swept in the
interleaved per-client-type order.

State queries on bigger networks regularly take a minute or more, so
cancelling an attempt after failoverAttemptTimeout cascaded: no
endpoint ever got enough time to load the state.

Fast upstream failures (404/500/501/503, connection errors) still
advance to the next candidate immediately. But an attempt that stays
silent for failoverHedgeDelay (default 20s) now keeps running while
the next candidate is raced in parallel, bounded by
failoverMaxParallel (default 2). The first acceptable response wins
and the other attempts are cancelled. Slow endpoints keep their full
call timeout to respond; hanging endpoints only cost latency, not
time budget. If no endpoint can serve the data, the last upstream
failure response is passed through to the client.
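
The hedging behaviour described above can be sketched with goroutines and a shared result channel. This is an illustrative sketch under stated assumptions, not the coordinator from this PR: `attemptFn`, `hedgedSweep` and `runDemo` are hypothetical names, and a nil error stands in for "acceptable response".

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// attemptFn is a hypothetical upstream call; a nil error means the response
// is acceptable, a non-nil error means a retryable failure.
type attemptFn func(ctx context.Context) (string, error)

type result struct {
	body string
	err  error
}

// hedgedSweep tries candidates in order. A failure advances to the next
// candidate at once. An attempt silent for hedgeDelay keeps running while
// the next candidate is started, with at most maxParallel in flight.
// It returns the winning body, the number of attempts started, and an error.
func hedgedSweep(ctx context.Context, candidates []attemptFn, hedgeDelay time.Duration, maxParallel int) (string, int, error) {
	if len(candidates) == 0 {
		return "", 0, errors.New("no candidates")
	}
	if maxParallel < 1 {
		maxParallel = 1
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancels the losing attempts once a winner returns

	// Buffered so late finishers never block after we have returned.
	results := make(chan result, len(candidates))
	var hedge <-chan time.Time
	next, inFlight := 0, 0
	launch := func() {
		fn := candidates[next]
		next++
		inFlight++
		hedge = time.After(hedgeDelay) // restart the silence timer
		go func() {
			body, err := fn(ctx)
			results <- result{body, err}
		}()
	}

	launch()
	var lastErr error
	for {
		select {
		case r := <-results:
			inFlight--
			if r.err == nil {
				return r.body, next, nil
			}
			lastErr = r.err
			if next < len(candidates) {
				launch()
			} else if inFlight == 0 {
				return "", next, lastErr // nobody could serve the data
			}
		case <-hedge:
			if next < len(candidates) && inFlight < maxParallel {
				launch()
			} else {
				hedge = nil // at the cap: wait for a result or the call timeout
			}
		case <-ctx.Done():
			return "", next, ctx.Err()
		}
	}
}

// runDemo sweeps a fast-failing, a hanging and a slow-but-capable upstream.
func runDemo() (string, int, error) {
	fastFail := func(ctx context.Context) (string, error) {
		return "", errors.New("404 State not found")
	}
	hang := func(ctx context.Context) (string, error) {
		<-ctx.Done()
		return "", ctx.Err()
	}
	slowOK := func(ctx context.Context) (string, error) {
		select {
		case <-time.After(60 * time.Millisecond):
			return "state", nil
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	return hedgedSweep(ctx, []attemptFn{fastFail, hang, slowOK}, 30*time.Millisecond, 2)
}

func main() {
	fmt.Println(runDemo())
}
```

In the demo the first upstream fails immediately, the second hangs, and the third is started once the hedge delay expires. The slow third attempt is never cut short by a per-attempt timeout, which is the cascade the earlier version suffered from.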

Replaces the failoverAttemptTimeout config option with
failoverHedgeDelay and adds failoverMaxParallel. Also makes the call
context cancelled flag atomic, as it is now read across attempt
goroutines, and adds tests covering the sweep, hedging, slow-endpoint
wins, fallback passthrough, parallelism cap and body replay.

The gosec G118 rule only exists in newer golangci-lint versions; with
the v2.8.0 pinned in CI the suppression is unused and rejected by
nolintlint.