
Adaptive averaging/matchmaking deadlines instead of fixed timeouts #676

Description

@pjdurden

Straggler handling right now is a set of fixed constants the user has to guess: matchmaking_time, averaging_timeout, next_chunk_timeout, sender_timeout. Slow peers past the deadline get dropped and averaging moves on. That part works fine. Picking the number is the hard part.

The docstrings basically admit it. averaging_timeout is "around 2x the actual all reduce time". matchmaking_time is "3 to 5s local, 10 to 15s over the internet". So the right value depends on the swarm, and the swarm keeps changing as peers join and leave.

One fixed number is wrong both ways. Too high and every round eats the worst case wait even when everyone present already finished. Too low and you drop peers that were about to show up.

The deadline should track the arrival times you already see. progress_tracker has most of the signal (per peer progress on the DHT, estimated_next_update_time()). So instead of a constant, set a per round deadline from the recent arrival offsets and proceed once a quorum is in. Opt in, defaults unchanged.
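To make "proceed once a quorum is in" concrete, here is a minimal sketch of one possible round-closing rule. It is not hivemind code: `close_round`, `quorum`, and `hard_timeout` are hypothetical names, and the exact semantics (close early when everyone is in, close at the deadline when a quorum is in, otherwise wait for the quorum up to a hard cap) are my assumption about what the opt-in path could look like.

```python
from typing import List, Tuple


def close_round(
    arrivals: List[float],
    deadline: float,
    quorum: int,
    hard_timeout: float,
) -> Tuple[float, List[int]]:
    """Decide when a round closes and which peers are included.

    arrivals:     per-peer arrival offsets (seconds since round start)
    deadline:     soft per-round deadline (e.g. derived from recent offsets)
    quorum:       minimum number of peers needed to proceed at the deadline
    hard_timeout: upper bound on the wait, playing the role of today's fixed constant
    All names here are hypothetical, not existing hivemind parameters.
    """
    if not 1 <= quorum <= len(arrivals):
        raise ValueError("quorum must be between 1 and the number of peers")

    order = sorted(range(len(arrivals)), key=lambda i: arrivals[i])
    on_time = [i for i in order if arrivals[i] <= deadline]

    if len(on_time) == len(arrivals):
        # Everyone is in before the deadline: no reason to keep waiting.
        close = arrivals[order[-1]]
    elif len(on_time) >= quorum:
        # Quorum reached by the deadline: proceed without the stragglers.
        close = deadline
    else:
        # Quorum missed: wait for the quorum-th arrival, capped by the hard timeout.
        close = min(hard_timeout, arrivals[order[quorum - 1]])

    included = [i for i in order if arrivals[i] <= close]
    return close, included
```

With three peers arriving around 1s and one at 10s, a 2s deadline and a quorum of 3 closes the round at 2s with the three fast peers, instead of waiting out a fixed worst-case timeout.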

I built a simulator for exactly this: churn, a deterministic DiLoCo sim with peers joining and leaving, stragglers, and joiner stalls (https://github.com/pjdurden/churn). Its straggler policy swaps the fixed deadline for an adaptive one (median + k·MAD over a rolling arrival offset history) plus a partial participation quorum with sidelining.
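For reference, the median + k·MAD rule is small enough to sketch. This is an illustrative standalone version, not the code in churn or anything in hivemind; the class name and the `k`, `window`, `fallback`, `min_samples`, and `min_slack` parameters are hypothetical, and the default values are placeholders rather than tuned numbers.

```python
from collections import deque
from statistics import median


class AdaptiveDeadline:
    """Per-round deadline = median + k * MAD over a rolling window of arrival offsets."""

    def __init__(self, k=3.0, window=32, fallback=15.0, min_samples=4, min_slack=0.0):
        self.k = k                      # how many MADs of slack to allow
        self.fallback = fallback        # fixed deadline used until enough history exists
        self.min_samples = min_samples  # history size needed before adapting
        self.min_slack = min_slack      # floor on the slack, guards against MAD == 0
        self._history = deque(maxlen=window)  # rolling arrival offsets (seconds)

    def observe(self, offset: float) -> None:
        """Record how long after round start a peer's contribution arrived."""
        self._history.append(offset)

    def deadline(self) -> float:
        """Return the deadline to use for the next round."""
        if len(self._history) < self.min_samples:
            return self.fallback
        offsets = list(self._history)
        med = median(offsets)
        # Median absolute deviation: robust spread that one slow peer barely moves.
        mad = median([abs(x - med) for x in offsets])
        return med + max(self.k * mad, self.min_slack)
```

Usage would be to call `observe()` for each arrival in a round and `deadline()` before the next one. Median and MAD are used instead of mean and standard deviation so that a single 10x straggler in the window does not drag the deadline up to its own arrival time.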

On a persistent straggler scenario (4 workers, one 10x slow), going from the fixed to the adaptive deadline cuts wall clock from 92,050µs to 20,053µs (4.59x) and raises utilization from 0.62 to 0.89. It's a relative testbed, not an absolute throughput claim, but the win reproduces directionally on real code. I want to check it against real hivemind runs and prototype the opt in path if the direction makes sense to you.

Two questions before I build anything:

  1. Is the fixed constant a deliberate keep it simple call, or are you open to an adaptive opt in?
  2. You mentioned in [Feature Request] fp16/bf16 gpu params with fp32 offloading in hivemind.Optimizer #476 that some of this was "played with a few times but not merged". Any prior context or a shape you'd want before I start?

Can share the churn setup either way.
