Distributed Systems

Reliable and Resilient Collective Communication Library for LLM Training and Serving

A fault-tolerant, NCCL-compatible collective communication library that keeps LLM training and serving running through NIC and link failures, with under 1.1% overhead for training and under 3% for inference.

wei-wang