<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Distributed Systems | Nengneng Yu</title><link>https://samfisheryu.github.io/tags/distributed-systems/</link><atom:link href="https://samfisheryu.github.io/tags/distributed-systems/index.xml" rel="self" type="application/rss+xml"/><description>Distributed Systems</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Mon, 01 Dec 2025 00:00:00 +0000</lastBuildDate><image><url>https://samfisheryu.github.io/media/icon_hu_da05098ef60dc2e7.png</url><title>Distributed Systems</title><link>https://samfisheryu.github.io/tags/distributed-systems/</link></image><item><title>Reliable and Resilient Collective Communication Library for LLM Training and Serving</title><link>https://samfisheryu.github.io/publications/r2cc-2025-arxiv/</link><pubDate>Mon, 01 Dec 2025 00:00:00 +0000</pubDate><guid>https://samfisheryu.github.io/publications/r2cc-2025-arxiv/</guid><description/></item><item><title>Collective Communication Under Failures for Distributed ML</title><link>https://samfisheryu.github.io/projects/cc-fault-tolerance/</link><pubDate>Thu, 01 May 2025 00:00:00 +0000</pubDate><guid>https://samfisheryu.github.io/projects/cc-fault-tolerance/</guid><description>&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;Modern ML training and inference span tens to tens of thousands of GPUs,
where network failures (NIC, cable, port faults) can waste &lt;strong&gt;10–15 %&lt;/strong&gt; of
GPU hours due to slow recovery. Today&amp;rsquo;s collective communication libraries
(NCCL, etc.) are &lt;em&gt;fail-stop&lt;/em&gt;: any network error crashes the communicator
and forces a job restart. This project closes the gap with two complementary
threads: failure-resilient collectives, and a tight lower bound (plus an
AllReduce design that approaches it) for the case where a server&amp;rsquo;s
bandwidth has degraded.&lt;/p&gt;
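&lt;p&gt;To make the fail-stop baseline concrete (a minimal sketch, assuming a PyTorch
job that uses the NCCL backend through &lt;code&gt;torch.distributed&lt;/code&gt;; this is
illustrative code, not part of this project): when a collective hits a network
fault, the only recourse today is to let the exception propagate and have the
launcher restart the job from a checkpoint.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Minimal sketch of the fail-stop baseline: a NCCL error surfaces as an
# exception, the communicator is unusable afterwards, and the usual recovery
# is a full job restart from the last checkpoint (e.g. via an elastic launcher).
import torch
import torch.distributed as dist

def sync_gradients(model):
    try:
        for p in model.parameters():
            if p.grad is not None:
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # NCCL collective
                p.grad.div_(dist.get_world_size())
    except RuntimeError:
        # A single NIC/link fault lands here; the process group cannot be
        # repaired in place, so the job dies and restarts, losing in-flight work.
        raise
&lt;/code&gt;&lt;/pre&gt;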
&lt;h3 id="r2cc--failure-resilient-collective-communication"&gt;R2CC — Failure-resilient collective communication&lt;/h3&gt;
&lt;p&gt;R2CC is a drop-in extension for NCCL that turns the CCL layer from fail-stop to
&lt;strong&gt;fail-operational&lt;/strong&gt; under inter-node NIC/link failures. It detects faults
mid-collective, migrates in-flight transfers losslessly to healthy
connections, and adapts collective schedules to the degraded topology, all
with no framework changes required.&lt;/p&gt;
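&lt;p&gt;The migration idea itself is simple to sketch (a toy model in plain Python,
not R2CC&amp;rsquo;s actual transport-level implementation): chunks that were queued on
a failed connection but not yet acknowledged are redistributed across the
surviving connections and resent, so the collective finishes instead of aborting.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Toy model of lossless in-flight migration. The data structures here are
# hypothetical and exist only to illustrate the idea.
def migrate_inflight(assignments, acked, failed_conn, healthy_conns):
    """assignments maps a connection id to the list of chunk ids queued on it;
    acked is the set of chunk ids already delivered. Returns a new assignment
    with the failed connection drained onto the healthy ones."""
    pending = [c for c in assignments[failed_conn] if c not in acked]
    new_assignments = {conn: list(chunks)
                       for conn, chunks in assignments.items()
                       if conn != failed_conn}
    for i, chunk in enumerate(pending):
        target = healthy_conns[i % len(healthy_conns)]
        new_assignments[target].append(chunk)   # resend on a healthy path
    return new_assignments

# Example: connection 1 fails after chunk 3 was delivered; chunks 4 and 5 move.
assignments = {0: [0, 1, 2], 1: [3, 4, 5], 2: [6, 7, 8]}
print(migrate_inflight(assignments, acked={3}, failed_conn=1, healthy_conns=[0, 2]))
&lt;/code&gt;&lt;/pre&gt;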
&lt;p&gt;Evaluated on a 4×8 H100 InfiniBand cluster and on large-scale ML simulators
modeling up to 1024 GPUs, R2CC incurs &lt;strong&gt;&amp;lt; 1.1 %&lt;/strong&gt; training overhead and
&lt;strong&gt;&amp;lt; 3 %&lt;/strong&gt; inference overhead under active failure scenarios, and reduces
failure-induced overhead by &lt;strong&gt;up to 92 % for training&lt;/strong&gt; and &lt;strong&gt;98 % for
inference&lt;/strong&gt; compared with existing fault-tolerant approaches.&lt;/p&gt;
&lt;p&gt;📄 Paper available as an arXiv preprint.&lt;/p&gt;
&lt;p&gt;💻 Code is open-sourced as a drop-in NCCL extension; contributions and bug
reports are welcome.&lt;/p&gt;
&lt;h3 id="bandwidth-optimal-allreduce-under-bandwidth-asymmetric-topologies"&gt;Bandwidth-optimal AllReduce under bandwidth-asymmetric topologies&lt;/h3&gt;
&lt;p&gt;Once a NIC fails and a server becomes a &lt;em&gt;straggler&lt;/em&gt;, what is the
unavoidable AllReduce overhead, and how do we approach it? This thread
gives the &lt;strong&gt;first information-theoretic lower bound&lt;/strong&gt; on AllReduce
completion time under asymmetric network bandwidth, and designs a
four-stage pipelined AllReduce that stays within &lt;strong&gt;2–6 %&lt;/strong&gt;
of NCCL&amp;rsquo;s fault-free ring under up to 50 % bandwidth loss, whereas
state-of-the-art approaches still incur up to 57 % overhead.&lt;/p&gt;
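&lt;p&gt;For orientation, the classical symmetric analysis (textbook ring arithmetic,
not the new asymmetric bound from the manuscript) already shows why a plain ring
suffers: with N nodes, M bytes per node, and per-node bandwidth B, a
bandwidth-optimal AllReduce moves 2M(N-1)/N bytes through every node, so the
fault-free time is roughly 2M(N-1)/(N·B); a ring runs at the pace of its slowest
member, so a node retaining only a fraction beta of its bandwidth inflates that
time by 1/beta (a 2× hit at 50 % loss), which is the gap the pipelined design
attacks. A quick numeric check with made-up figures:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Back-of-the-envelope check of the classical ring-AllReduce bandwidth term,
# using illustrative (made-up) numbers; this is the symmetric textbook baseline,
# not the asymmetric lower bound derived in the manuscript.
def ring_allreduce_time(n_nodes, msg_bytes, node_bw_bytes_per_s):
    # Each node sends and receives 2 * M * (N - 1) / N bytes in a
    # bandwidth-optimal AllReduce (reduce-scatter followed by all-gather).
    traffic = 2 * msg_bytes * (n_nodes - 1) / n_nodes
    return traffic / node_bw_bytes_per_s

N = 32                      # nodes
M = 4 * 1024**3             # 4 GiB of gradients per node
B = 50 * 1024**3            # 50 GiB/s healthy per-node bandwidth
beta = 0.5                  # the straggler keeps half its bandwidth

fault_free = ring_allreduce_time(N, M, B)
degraded_ring = ring_allreduce_time(N, M, beta * B)   # slowest node paces the ring
print(f"fault-free ring:      {fault_free:.3f} s")
print(f"ring with straggler:  {degraded_ring:.3f} s (about 1/beta = 2x slower)")
&lt;/code&gt;&lt;/pre&gt;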
&lt;p&gt;&lt;em&gt;Manuscript under review — details available on request.&lt;/em&gt;&lt;/p&gt;</description></item></channel></rss>