Adaptive averaging/matchmaking deadlines instead of fixed timeouts

Straggler handling right now is a set of fixed constants the user has to guess. `matchmaking_time`, `averaging_timeout`, `next_chunk_timeout`, `sender_timeout`. Slow peers past the deadline get dropped and averaging moves on. That part works fine. Picking the number is the hard part.

The docstrings basically admit it. `averaging_timeout` is "around 2x the actual all reduce time". `matchmaking_time` is "3 to 5s local, 10 to 15s over the internet". So the right value depends on the swarm, and the swarm keeps changing as peers join and leave.

One fixed number is wrong both ways. Too high and every round eats the worst case wait even when everyone present already finished. Too low and you drop peers that were about to show up.

The deadline should track the arrival times you already see. `progress_tracker` has most of the signal (per peer progress on the DHT, `estimated_next_update_time()`). So instead of a constant, set a per round deadline from the recent arrival offsets and proceed once a quorum is in. Opt in, defaults unchanged.

I built a simulator for exactly this. churn, a deterministic DiLoCo sim with peers joining and leaving, stragglers, and joiner stalls: https://github.com/pjdurden/churn. Its straggler policy swaps the fixed deadline for an adaptive one (`median + k·MAD` over a rolling arrival offset history) plus a partial participation quorum with sidelining.

On a persistent straggler scenario (4 workers, one 10x slow), adaptive vs fixed deadline: wall clock 92,050µs to 20,053µs (4.59x), utilization 0.62 to 0.89. It's a relative testbed, not an absolute throughput claim, but the win reproduces directionally on real code. I want to check it against real hivemind runs and prototype the opt in path if the direction makes sense to you.

Two questions before I build anything:

1. Is the fixed constant a deliberate keep it simple call, or are you open to an adaptive opt in?
2. You mentioned in #476 that some of this was "played with a few times but not merged". Any prior context or a shape you'd want before I start?

Can share the churn setup either way.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Adaptive averaging/matchmaking deadlines instead of fixed timeouts #676

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Adaptive averaging/matchmaking deadlines instead of fixed timeouts #676

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions