Don't Let a Few Network Failures Slow the Entire AllReduce

Peiqing Chen

Jiedong Jiang

Nengneng Yu

Yuefeng Wang

Sixian Xiong

Wei Wang

Zaoxing Liu


Abstract

Network failures are common in large GPU clusters and can force collective communication libraries to route around degraded NICs, leaving the affected server on the critical path of standard ring AllReduce. We derive an information-theoretic lower bound for AllReduce completion time under asymmetric network bandwidth, showing that a straggler that retains at least half of its original bandwidth need impose only a small unavoidable overhead. We then design OptCC, a four-stage pipelined AllReduce algorithm that approaches this bound. In SimAI experiments, OptCC completes AllReduce within 2%–6% of NCCL’s fault-free ring performance under practical failures with up to 50% bandwidth loss, while prior fault-tolerant schemes can incur up to 57% overhead.
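To make the critical-path problem concrete, the sketch below uses the textbook bandwidth-only cost model for ring AllReduce, in which every step is gated by the slowest link. It is an illustration of the baseline behavior only: it is not OptCC, not the paper's lower bound, and not a model of NCCL internals. The function name, the node count, the message size, and the bandwidth figures are all assumptions chosen for the example.

```python
def ring_allreduce_time(size_bytes, bandwidths):
    """Estimate ring AllReduce time under a bandwidth-only model (assumption:
    latency and compute are ignored, and every step waits for the slowest link).

    size_bytes -- total message size per node, in bytes
    bandwidths -- per-node link bandwidth, in bytes per second
    """
    n = len(bandwidths)
    if n < 2:
        raise ValueError("ring AllReduce needs at least two nodes")
    slowest = min(bandwidths)
    # Reduce-scatter plus all-gather: 2 * (n - 1) steps, each moving size/n bytes.
    return 2 * (n - 1) / n * size_bytes / slowest


# Hypothetical cluster: 8 nodes, 1 GB message, 50 GB/s per node.
healthy = ring_allreduce_time(1e9, [50e9] * 8)

# Same cluster, but one node has lost half of its bandwidth.
degraded = ring_allreduce_time(1e9, [50e9] * 7 + [25e9])

slowdown = degraded / healthy
print(f"healthy: {healthy:.3f} s, degraded: {degraded:.3f} s, slowdown: {slowdown:.1f}x")
```

Under this simplified model, a single node at half bandwidth doubles the completion time of the whole ring, which is the gap the abstract's 2%–6% result should be read against.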

Type

Preprint

Publication

arXiv preprint arXiv:2606.01680

Last updated on Jun 1, 2026

Systems for ML, LLM, Distributed Systems
