Reliable and Resilient Collective Communication Library for LLM Training and Serving

Wei Wang

Nengneng Yu

Sixian Xiong

Zaoxing Liu

Abstract

Modern ML training and inference now span tens to tens of thousands of GPUs, where network faults can waste 10–15% of GPU hours due to slow recovery. We present R2CC, a fault-tolerant communication library that provides lossless, low-overhead failover by exploiting multi-NIC hardware. R2CC combines rapid connection migration, bandwidth-aware load redistribution, and resilient collective algorithms to maintain progress under failures. Evaluated on a 4×8 H100 InfiniBand cluster and large-scale ML simulators modeling up to 1024 GPUs with diverse failure patterns, R2CC is highly robust to inter-node network failures, incurring less than 1.1% training and 3% inference overhead under active failure scenarios. Compared to existing fault-tolerant approaches, R2CC reduces failure-induced overhead by up to 92% for training and 98% for inference.

Type

Preprint

Publication

arXiv preprint arXiv:2512.25059

Last updated on Dec 1, 2025

Systems for ML, LLM, Distributed Systems
