feat: endpoint failover for unavailable data #61
Open
barnabasbusa wants to merge 5 commits into
When an endpoint cannot serve the requested data (404/500/501/503, connection error or attempt timeout), retry the call on other ready endpoints instead of returning the failure to the client. Alternate endpoints are swept interleaved by client type so every client implementation is covered as early as possible, and the last endpoint that successfully served a path class is remembered and tried first on subsequent failovers. Failover applies to GET/HEAD/POST calls matching configurable path patterns (default: state queries, where availability differs per endpoint depending on pruning/archiving). POST bodies up to 1MB are buffered for replay. Calls pinned to an explicit endpoint via X-Dugtrio-Next-Endpoint are excluded; client-type filters are honored during the sweep. While alternate candidates remain, each attempt is bounded by failoverAttemptTimeout (default 20s) so a hanging endpoint cannot consume the whole call timeout; the last candidate runs on the remaining call budget. Served failover responses expose the X-Dugtrio-Failover-Attempts header.
Remove the proxy-wide map that remembered the last endpoint serving a failover path class and tried it first on subsequent failovers. It concentrated all failover traffic from all sessions on the same node instead of load balancing. Candidates are now always swept in the interleaved per-client-type order.
State queries on bigger networks regularly take a minute or more, so cancelling an attempt after failoverAttemptTimeout cascaded: no endpoint ever got enough time to load the state. Fast upstream failures (404/500/501/503, connection errors) still advance to the next candidate immediately. But an attempt that stays silent for failoverHedgeDelay (default 20s) now keeps running while the next candidate is raced in parallel, bounded by failoverMaxParallel (default 2). The first acceptable response wins and the other attempts are cancelled. Slow endpoints keep their full call timeout to respond; hanging endpoints only cost latency, not time budget. If no endpoint can serve the data, the last upstream failure response is passed through to the client. Replaces the failoverAttemptTimeout config option with failoverHedgeDelay and adds failoverMaxParallel. Also makes the call context cancelled flag atomic, as it is now read across attempt goroutines, and adds tests covering the sweep, hedging, slow-endpoint wins, fallback passthrough, parallelism cap and body replay.
The gosec G118 rule only exists in newer golangci-lint versions; with the v2.8.0 pinned in CI the suppression is unused and rejected by nolintlint.
Endpoint failover for unavailable data
Problem
Dugtrio picks one endpoint per call (sticky per session) and streams back whatever it returns. For data whose availability differs per endpoint — most notably historical states, which only nodes with state archiving serve — this means requests keep failing even when another endpoint in the pool has the data:
- eth/v1/beacon/states/{old-slot}/validators → 404 "State not found" (pruned) or 500 "Historical state regen is not enabled" (lodestar without --chain.serveHistoricalState), while e.g. nimbus endpoints serve it fine.
- […] Accept-Encoding: gzip), so the client just burns the whole callTimeout. These nodes pass all pool health checks (version/syncing/head), so they never get evicted; the hang only manifests on the failing request path.
Change
When an upstream cannot serve the requested data, retry the call on other ready endpoints instead of returning the failure:
- Applies to GET/HEAD/POST calls matching failoverPaths (default: ^/eth/v[0-9]+/beacon/states/, ^/eth/v[0-9]+/debug/beacon/states/) that fail with 404/500/501/503 or a connection error.
- An attempt that stays silent for failoverHedgeDelay (default 20s) is not cancelled: it keeps running while the next candidate is raced in parallel, bounded by failoverMaxParallel (default 2). The first acceptable response wins; the other attempts are cancelled. Slow-but-capable endpoints therefore keep the full call timeout to load a state (mainnet/hoodi state queries regularly take a minute or more), while hanging endpoints only cost latency, not time budget, so there is no cascade.
- Calls pinned to an explicit endpoint via X-Dugtrio-Next-Endpoint are excluded; client-type filters (prefix proxies / type filter) are honored during the sweep. Sticky session endpoints are not changed by failover. Rate limiting still charges 1 call.
- Served failover responses expose X-Dugtrio-Failover-Attempts.
New config (all optional; the feature is on by default for state paths; disable with failoverDisabled: true): […]
Review updates
- Attempts are no longer cancelled after failoverAttemptTimeout, which on bigger networks (where state loads take a minute or more) would cascade: no endpoint got enough time to load the state (thanks @pk910). failoverAttemptTimeout is replaced by failoverHedgeDelay / failoverMaxParallel.
Testing
Unit tests (proxy/proxycall_test.go, run with -race in CI) cover the failover coordinator against mock upstreams: fast-failure sweep, hedging on silent endpoints, slow endpoint still winning after hedges started (the cascade regression), fallback passthrough when nobody has the data, the parallelism cap, POST body replay, and the non-failover passthrough path.
Manual testing of the original version against all 45 glamsterdam-devnet-6 nodes (6 client types; nimbus is the only one serving historical states there, and the pool includes endpoints that hang on these queries):
- GET /eth/v1/beacon/states/10000/validators → nimbus-erigon-1, X-Dugtrio-Failover-Attempts: 4, hanging endpoints skipped
- GET /eth/v2/debug/beacon/states/10000 (SSZ)
- POST .../validators with JSON body
- GET /eth/v1/beacon/headers/head?dugtrio-next-endpoint=<node> (pinned)
- ?dugtrio-next-endpoint=lodestar (type filter)
go build clean; golangci-lint reports no new findings (2 pre-existing on master).