Don't Let a Few Network Failures Slow the Entire AllReduce
Abstract
Network failures are common in large GPU clusters and can force collective
communication libraries to route around degraded NICs, leaving the affected
server on the critical path of standard ring AllReduce. We derive an
information-theoretic lower bound for AllReduce completion time under
asymmetric network bandwidth, showing that a straggler with at least half of
its original bandwidth need impose only a small unavoidable overhead. We then
design OptCC, a four-stage pipelined AllReduce algorithm that approaches
this bound. In SimAI experiments, OptCC completes AllReduce within 2%–6% of
NCCL’s fault-free ring performance under practical failures with up to 50%
bandwidth loss, while prior fault-tolerant schemes can incur up to 57%
overhead.
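
To illustrate why a single degraded server lands on the critical path of standard ring AllReduce, the following minimal sketch uses the textbook bandwidth-only cost model, in which every server sends and receives 2(N-1)/N of the message and each synchronized step runs at the speed of the slowest link. The function name, the eight-server cluster, the message size, and the bandwidth figures are illustrative assumptions, not values from the paper, and the model ignores latency and pipelining. It does not implement OptCC or the paper's lower bound.

```python
def ring_allreduce_time(message_bytes, bandwidths):
    """Estimate ring AllReduce completion time in seconds.

    Bandwidth-only model (an assumption of this sketch): the ring runs
    2*(n-1) steps, each moving message_bytes/n per server, and every step
    is gated by the slowest server's bandwidth (bytes per second).
    """
    n = len(bandwidths)
    if n < 2:
        raise ValueError("ring AllReduce needs at least two servers")
    if min(bandwidths) <= 0:
        raise ValueError("bandwidths must be positive")
    slowest = min(bandwidths)
    return 2 * (n - 1) / n * message_bytes / slowest


# Hypothetical cluster: 8 servers, 1 GB message, 100 GB/s per server.
healthy = [100e9] * 8
# One server loses half of its bandwidth (e.g. a degraded NIC).
degraded = [100e9] * 7 + [50e9]

t_healthy = ring_allreduce_time(1e9, healthy)
t_degraded = ring_allreduce_time(1e9, degraded)
slowdown = t_degraded / t_healthy
print(f"healthy: {t_healthy:.4f} s, degraded: {t_degraded:.4f} s, "
      f"slowdown: {slowdown:.2f}x")
```

Under this model, halving one server's bandwidth roughly doubles the ring's completion time even though the other seven servers are unaffected, which is the gap that the abstract's lower bound and OptCC address.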
Publication
arXiv preprint arXiv:2606.01680