Reliable and Resilient Collective Communication Library for LLM Training and Serving
A fault-tolerant, NCCL-compatible collective communication library that keeps LLM training and serving alive under NIC and link failures, with less than 1.1% training overhead and less than 3% inference overhead.
wei-wang
