pengzhang edited this page Sep 4, 2017 · 1 revision
title: Straggler Mitigation
Straggler Mitigation
Straggler Mitigation By Enabling Fault Tolerance
Our first solution is to detect slow workers (a worker is considered slow when its run time is >= 60s in each of 3 iterations) and then kill them. Killing a worker triggers fault tolerance, which restarts the slow worker on another node.
This solution works, but not well. In a busy shared Hadoop cluster, more workers get restarted, and each restart carries the extra cost of starting a process and loading data into memory.
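The detection rule described above can be sketched as follows. This is a minimal illustration, not Guagua's implementation: the `SlowWorkerDetector` class and its method names are hypothetical, and it assumes the 3 slow iterations must be consecutive, which the text does not state explicitly.

```python
from collections import defaultdict

SLOW_THRESHOLD_SECONDS = 60.0  # from the text: run time >= 60s counts as slow
SLOW_ITERATIONS = 3            # from the text: slow in 3 iterations


class SlowWorkerDetector:
    """Hypothetical helper for illustration; not part of Guagua's API."""

    def __init__(self, threshold=SLOW_THRESHOLD_SECONDS, iterations=SLOW_ITERATIONS):
        self.threshold = threshold
        self.iterations = iterations
        # Number of slow iterations in a row, per worker.
        self.slow_streak = defaultdict(int)

    def record(self, worker_id, run_time_seconds):
        """Record one iteration's run time.

        Returns True when the worker has been slow for the configured
        number of iterations in a row (assumption: consecutive) and
        should be killed so that fault tolerance restarts it.
        """
        if run_time_seconds >= self.threshold:
            self.slow_streak[worker_id] += 1
        else:
            # A fast iteration resets the streak.
            self.slow_streak[worker_id] = 0
        return self.slow_streak[worker_id] >= self.iterations
```

A master loop would call `record` once per worker per iteration and kill any worker for which it returns `True`. The restart cost mentioned above is paid every time that happens.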
An Effective Straggler Mitigation Method: Partial Complete
An effective straggler mitigation method is simply to skip the slow workers. This is controlled by the parameter guagua.min.workers.ratio, which defaults to 0.95, meaning that in each iteration the master waits for only 95% of the workers to finish. It is very important that only stragglers are skipped. The slide below illustrates this feature: the slower worker is skipped in iterations 2 and 3, but later, in iteration 4, it is back in good shape.
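To make the effect of the ratio concrete, here is a small simulation sketch. The function `iteration_wait_time` is hypothetical and is not Guagua code. It assumes the required worker count is rounded up, which may differ from how Guagua actually computes it.

```python
import math


def iteration_wait_time(finish_times, min_workers_ratio=0.95):
    """Simulate one iteration under a minimum-workers ratio.

    finish_times maps a worker id to the time (in seconds) at which that
    worker finishes the iteration. Returns (proceed_at, skipped): the time
    at which the master can move on, and the workers it did not wait for.
    """
    # Assumption: the number of workers to wait for is rounded up.
    needed = math.ceil(len(finish_times) * min_workers_ratio)
    # Fastest workers first.
    ordered = sorted(finish_times.items(), key=lambda item: item[1])
    waited_for = ordered[:needed]
    skipped = [worker for worker, _ in ordered[needed:]]
    # The master proceeds once the slowest awaited worker has finished.
    proceed_at = waited_for[-1][1] if waited_for else 0.0
    return proceed_at, skipped


# Example: 19 workers finish in 10s, one straggler needs 120s.
times = {f"w{i}": 10.0 for i in range(19)}
times["w19"] = 120.0
proceed_at, skipped = iteration_wait_time(times, min_workers_ratio=0.95)
```

With a ratio of 1.0 the master would have to wait for the straggler as well. With a lower ratio the iteration ends as soon as enough of the faster workers report, and the straggler can rejoin in a later iteration without being restarted.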