Don't Let a Few Network Failures Slow the Entire AllReduce

Peiqing Chen, Jiedong Jiang, Nengneng Yu, Yuefeng Wang, Sixian Xiong, Wei Wang, Zaoxing Liu
Abstract
Network failures are common in large GPU clusters and can force collective communication libraries to route around degraded NICs, leaving the affected server on the critical path of standard ring AllReduce. We derive an information-theoretic lower bound for AllReduce completion time under asymmetric network bandwidth, showing that a straggler with at least half of its original bandwidth need impose only a small unavoidable overhead. We then design OptCC, a four-stage pipelined AllReduce algorithm that approaches this bound. In SimAI experiments, OptCC completes AllReduce within 2%–6% of NCCL’s fault-free ring performance under practical failures with up to 50% bandwidth loss, while prior fault-tolerant schemes can incur up to 57% overhead.
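To illustrate why a single degraded server sits on the critical path of a standard ring AllReduce, the sketch below uses the textbook bandwidth-only cost model, in which each server sends 2(n-1)/n of the message and every step is paced by the slowest link. This is not the paper's lower bound and not OptCC. The function name, the 8-server cluster, the 50 GB/s link speed, and the 1 GB message are hypothetical values chosen for illustration, and latency is ignored.

```python
def ring_allreduce_time(message_bytes, bandwidths):
    """Bandwidth-only completion time of a standard ring AllReduce.

    Assumption (textbook model, not taken from the paper): each of the
    n servers sends 2 * (n - 1) / n of the message over its own link,
    and the ring advances in lockstep, so the slowest link paces every
    step. Per-step latency is ignored.
    """
    n = len(bandwidths)
    if n < 2:
        raise ValueError("a ring needs at least two servers")
    bottleneck = min(bandwidths)  # bytes per second of the slowest server
    return 2 * (n - 1) / n * message_bytes / bottleneck


# Hypothetical cluster: 8 servers at 50 GB/s, reducing a 1 GB message.
healthy = ring_allreduce_time(1e9, [50e9] * 8)

# One server loses half of its bandwidth; the other seven are unaffected.
degraded = ring_allreduce_time(1e9, [50e9] * 7 + [25e9])

slowdown = degraded / healthy
print(f"healthy:  {healthy * 1e3:.1f} ms")
print(f"degraded: {degraded * 1e3:.1f} ms")
print(f"slowdown: {slowdown:.2f}x")
```

Under this simplified model, one server at half bandwidth doubles the completion time of the whole ring, even though the cluster has lost only a small fraction of its aggregate bandwidth. That gap is what the abstract's lower bound and OptCC target.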
Type
Publication
arXiv preprint arXiv:2606.01680