feat: endpoint failover for unavailable data by barnabasbusa · Pull Request #61 · ethpandaops/dugtrio

barnabasbusa · 2026-07-03T07:58:05Z

Endpoint failover for unavailable data

Problem

Dugtrio picks one endpoint per call (sticky per session) and streams back whatever it returns. For data whose availability differs per endpoint — most notably historical states, which only nodes with state archiving serve — this means requests keep failing even when another endpoint in the pool has the data:

eth/v1/beacon/states/{old-slot}/validators → 404 "State not found" (pruned) or 500 "Historical state regen is not enabled" (lodestar without --chain.serveHistoricalState), while e.g. nimbus endpoints serve it fine.
Worse, some endpoints hang on these queries (observed on glamsterdam-devnet-6: a prysm node blocks indefinitely on historical state queries, and a lodestar node's nginx hangs on error responses when the request has Accept-Encoding: gzip), so the client just burns the whole callTimeout. These nodes pass all pool health checks (version/syncing/head), so they never get evicted — the hang only manifests on the failing request path.

Change

When an upstream cannot serve the requested data, retry the call on other ready endpoints instead of returning the failure:

Trigger: calls matching failoverPaths (default: ^/eth/v[0-9]+/beacon/states/, ^/eth/v[0-9]+/debug/beacon/states/) that fail with 404/500/501/503 or a connection error.
Sweep order: remaining ready endpoints interleaved by client type, so every client implementation is covered as early as possible (e.g. the archiving client is found within the first few attempts rather than after trying all instances of a pruning client). Nothing is persisted between calls — each failover sweeps fresh, so load balancing is preserved.
Hedging instead of per-attempt timeouts: fast upstream failures advance to the next candidate immediately. An attempt that stays silent for failoverHedgeDelay (default 20s) is not cancelled — it keeps running while the next candidate is raced in parallel, bounded by failoverMaxParallel (default 2). The first acceptable response wins; the other attempts are cancelled. So slow-but-capable endpoints keep the full call timeout to load a state (mainnet/hoodi state queries regularly take a minute+), while hanging endpoints only cost latency, not time budget — no cascade.
Fallback: if no endpoint can serve the data, the last upstream failure response is passed through to the client instead of a generic 500.
Scope: GET/HEAD/POST only (POST bodies up to 1MB are buffered for replay). Calls pinned to an explicit endpoint via X-Dugtrio-Next-Endpoint are excluded; client-type filters (prefix proxies / type filter) are honored during the sweep. Failover does not change a session's sticky endpoint. Rate limiting still charges a single call, regardless of how many attempts the sweep makes.
Served failover responses expose X-Dugtrio-Failover-Attempts.
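
The sweep order described above can be illustrated with a minimal sketch. It assumes a hypothetical `Endpoint` type and an `interleaveByClientType` helper, neither of which is taken from the dugtrio source, and it keeps the input order within each client type, so it says nothing about how the real proxy balances load inside a type.

```go
package main

import "fmt"

// Endpoint is a hypothetical stand-in for a pool member; the real dugtrio type differs.
type Endpoint struct {
	Name       string
	ClientType string
}

// interleaveByClientType orders candidates round-robin across client types,
// so one endpoint of every type is tried before a second endpoint of any type.
func interleaveByClientType(candidates []Endpoint) []Endpoint {
	var typeOrder []string
	groups := make(map[string][]Endpoint)
	for _, ep := range candidates {
		if _, seen := groups[ep.ClientType]; !seen {
			typeOrder = append(typeOrder, ep.ClientType)
		}
		groups[ep.ClientType] = append(groups[ep.ClientType], ep)
	}

	ordered := make([]Endpoint, 0, len(candidates))
	for round := 0; len(ordered) < len(candidates); round++ {
		for _, clientType := range typeOrder {
			if group := groups[clientType]; round < len(group) {
				ordered = append(ordered, group[round])
			}
		}
	}
	return ordered
}

func main() {
	pool := []Endpoint{
		{"lodestar-1", "lodestar"}, {"lodestar-2", "lodestar"},
		{"prysm-1", "prysm"}, {"nimbus-1", "nimbus"},
	}
	for _, ep := range interleaveByClientType(pool) {
		fmt.Println(ep.Name)
	}
}
```

With this pool the single nimbus endpoint is reached on the third attempt instead of after both lodestar instances.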

New config (all optional, feature is on by default for state paths; disable with failoverDisabled: true):

proxy:
  #failoverDisabled: false
  #failoverPaths:
  #  - ^/eth/v[0-9]+/beacon/states/
  #  - ^/eth/v[0-9]+/debug/beacon/states/
  #failoverMaxAttempts: 8
  #failoverHedgeDelay: 20s
  #failoverMaxParallel: 2
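
As a rough illustration of how the default `failoverPaths` patterns and the trigger status codes might be evaluated, here is a small sketch. The function names `pathEligible` and `statusTriggersFailover` are hypothetical and not taken from the dugtrio source; only the patterns, methods and status codes come from the description above.

```go
package main

import (
	"fmt"
	"net/http"
	"regexp"
)

// Default patterns as listed in the PR description.
var failoverPaths = []*regexp.Regexp{
	regexp.MustCompile(`^/eth/v[0-9]+/beacon/states/`),
	regexp.MustCompile(`^/eth/v[0-9]+/debug/beacon/states/`),
}

// pathEligible reports whether a call is in scope for failover:
// GET/HEAD/POST on a path matching one of the configured patterns.
func pathEligible(method, path string) bool {
	switch method {
	case http.MethodGet, http.MethodHead, http.MethodPost:
	default:
		return false
	}
	for _, re := range failoverPaths {
		if re.MatchString(path) {
			return true
		}
	}
	return false
}

// statusTriggersFailover reports whether an upstream status code counts
// as "cannot serve the data" (404/500/501/503).
func statusTriggersFailover(status int) bool {
	switch status {
	case http.StatusNotFound, http.StatusInternalServerError,
		http.StatusNotImplemented, http.StatusServiceUnavailable:
		return true
	}
	return false
}

func main() {
	fmt.Println(pathEligible("GET", "/eth/v1/beacon/states/10000/validators"))
	fmt.Println(pathEligible("GET", "/eth/v1/beacon/headers/head"))
	fmt.Println(statusTriggersFailover(http.StatusNotFound))
}
```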

Review updates

Removed persistence of the retried endpoint (0c06f1f): the first version remembered the last endpoint that served a path class and tried it first, which would concentrate failover traffic from all sessions on the same node. Load balancing is prioritized instead (thanks @pk910).
Hedging replaces the hard per-attempt timeout (d6874a6): the first version cancelled attempts after failoverAttemptTimeout, which on bigger networks (where state loads take a minute+) would cascade — no endpoint got enough time to load the state (thanks @pk910). failoverAttemptTimeout is replaced by failoverHedgeDelay/failoverMaxParallel.

Testing

Unit tests (proxy/proxycall_test.go, run with -race in CI) cover the failover coordinator against mock upstreams: fast-failure sweep, hedging on silent endpoints, slow endpoint still winning after hedges started (the cascade regression), fallback passthrough when nobody has the data, the parallelism cap, POST body replay, and the non-failover passthrough path.

Manual testing of the original version against all 45 glamsterdam-devnet-6 nodes (6 client types; nimbus is the only one serving historical states there, and the pool includes endpoints that hang on these queries):

| Case | Result |
| --- | --- |
| `GET /eth/v1/beacon/states/10000/validators` | 200 from `nimbus-erigon-1`, `X-Dugtrio-Failover-Attempts: 4`, hanging endpoints skipped |
| `GET /eth/v2/debug/beacon/states/10000` (SSZ) | 200, 3.7MB body (5 attempts) |
| `POST .../validators` with JSON body | 200, body replayed, 1 attempt, 0.2s |
| `GET /eth/v1/beacon/headers/head` | untouched: no failover header, normal sticky endpoint |
| nonexistent state (slot 99999999) | bounded sweep (8 attempts), upstream error returned, no hang |
| `?dugtrio-next-endpoint=<node>` (pinned) | no failover, response passed through as before |
| `?dugtrio-next-endpoint=lodestar` (type filter) | sweep stayed within lodestar endpoints, upstream error passed through |

go build clean; golangci-lint reports no new findings (2 pre-existing on master).

When an endpoint cannot serve the requested data (404/500/501/503, connection error or attempt timeout), retry the call on other ready endpoints instead of returning the failure to the client. Alternate endpoints are swept interleaved by client type so every client implementation is covered as early as possible, and the last endpoint that successfully served a path class is remembered and tried first on subsequent failovers. Failover applies to GET/HEAD/POST calls matching configurable path patterns (default: state queries, where availability differs per endpoint depending on pruning/archiving). POST bodies up to 1MB are buffered for replay. Calls pinned to an explicit endpoint via X-Dugtrio-Next-Endpoint are excluded; client-type filters are honored during the sweep. While alternate candidates remain, each attempt is bounded by failoverAttemptTimeout (default 20s) so a hanging endpoint cannot consume the whole call timeout; the last candidate runs on the remaining call budget. Served failover responses expose the X-Dugtrio-Failover-Attempts header.

Remove the proxy-wide map that remembered the last endpoint serving a failover path class and tried it first on subsequent failovers. It concentrated all failover traffic from all sessions on the same node instead of load balancing. Candidates are now always swept in the interleaved per-client-type order.

State queries on bigger networks regularly take a minute or more, so cancelling an attempt after failoverAttemptTimeout cascaded: no endpoint ever got enough time to load the state. Fast upstream failures (404/500/501/503, connection errors) still advance to the next candidate immediately. But an attempt that stays silent for failoverHedgeDelay (default 20s) now keeps running while the next candidate is raced in parallel, bounded by failoverMaxParallel (default 2). The first acceptable response wins and the other attempts are cancelled. Slow endpoints keep their full call timeout to respond; hanging endpoints only cost latency, not time budget. If no endpoint can serve the data, the last upstream failure response is passed through to the client. Replaces the failoverAttemptTimeout config option with failoverHedgeDelay and adds failoverMaxParallel. Also makes the call context cancelled flag atomic, as it is now read across attempt goroutines, and adds tests covering the sweep, hedging, slow-endpoint wins, fallback passthrough, parallelism cap and body replay.
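
The reason for making the cancelled flag atomic can be shown with a small sketch: once several attempt goroutines read the flag while another goroutine sets it, a plain `bool` would be a data race under `-race`. The `callState` type here is a hypothetical stand-in, not the proxy's actual call context, and `atomic.Bool` requires Go 1.19 or newer.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// callState is a hypothetical stand-in for the proxy's per-call context.
type callState struct {
	cancelled atomic.Bool
}

// Cancel marks the call as cancelled; safe to call from any goroutine.
func (c *callState) Cancel() { c.cancelled.Store(true) }

// IsCancelled can be read concurrently by every attempt goroutine.
func (c *callState) IsCancelled() bool { return c.cancelled.Load() }

func main() {
	call := &callState{}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = call.IsCancelled() // concurrent read, no data race
		}()
	}
	call.Cancel()
	wg.Wait()
	fmt.Println(call.IsCancelled())
}
```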

barnabasbusa requested a review from pk910 as a code owner July 3, 2026 07:58

barnabasbusa added the build-docker-image label (build docker image for this PR) Jul 3, 2026

barnabasbusa and others added 4 commits July 3, 2026 10:09

Merge branch 'master' into bbusa/endpoint-failover

eb3ade2

ci: drop nolint directive unused by golangci-lint v2.8.0

43767ca

The gosec G118 rule only exists in newer golangci-lint versions; with the v2.8.0 pinned in CI the suppression is unused and rejected by nolintlint.

barnabasbusa wants to merge 5 commits into master from bbusa/endpoint-failover
