Reliable and Resilient Collective Communication Library for LLM Training and Serving

Wei Wang, Nengneng Yu, Sixian Xiong, Zaoxing Liu
Abstract
Modern ML training and inference now span tens to tens of thousands of GPUs, where network faults can waste 10–15% of GPU hours due to slow recovery. We present R2CC, a fault-tolerant communication library that provides lossless, low-overhead failover by exploiting multi-NIC hardware. R2CC performs rapid connection migration, bandwidth-aware load redistribution, and resilient collective algorithms to maintain progress under failures. Evaluated on a 4×8 H100 InfiniBand cluster and large-scale ML simulators modeling up to 1024 GPUs with diverse failure patterns, R2CC is highly robust to inter-node network failures, incurring less than 1.1% training and 3% inference overhead under active failure scenarios. Compared to existing fault-tolerant approaches, R2CC reduces failure-induced overhead by up to 92% for training and 98% for inference.
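As a rough illustration of what bandwidth-aware load redistribution across multiple NICs could look like, the sketch below splits a transfer over the surviving NICs in proportion to their link bandwidth. This is not R2CC's actual algorithm or API: the function name `redistribute`, the NIC names, and the bandwidth figures are all hypothetical and chosen only for the example.

```python
def redistribute(total_bytes, nic_bandwidth_gbps, failed):
    """Split total_bytes across healthy NICs in proportion to bandwidth.

    Hypothetical helper for illustration only; not part of R2CC.
    nic_bandwidth_gbps maps NIC name -> integer link bandwidth in Gb/s.
    failed is the set of NIC names currently considered down.
    """
    healthy = {
        name: bw
        for name, bw in nic_bandwidth_gbps.items()
        if name not in failed and bw > 0
    }
    if not healthy:
        raise RuntimeError("no healthy NIC left to carry traffic")

    total_bw = sum(healthy.values())
    names = sorted(healthy)

    # Proportional share per NIC, rounded down to whole bytes.
    shares = {n: total_bytes * healthy[n] // total_bw for n in names}

    # Hand any rounding remainder to the fastest NIC so no byte is lost.
    remainder = total_bytes - sum(shares.values())
    fastest = max(names, key=lambda n: healthy[n])
    shares[fastest] += remainder
    return shares


if __name__ == "__main__":
    nics = {"nic0": 400, "nic1": 400, "nic2": 200, "nic3": 200}
    print(redistribute(1000, nics, failed={"nic1"}))
```

In this toy setup, when `nic1` fails its traffic is absorbed by the remaining three NICs, with the faster NIC taking a proportionally larger share rather than an equal split.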
Publication
arXiv preprint arXiv:2512.25059