<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Distributed Systems | Nengneng Yu</title><link>https://samfisheryu.github.io/tags/distributed-systems/</link><atom:link href="https://samfisheryu.github.io/tags/distributed-systems/index.xml" rel="self" type="application/rss+xml"/><description>Distributed Systems</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Mon, 01 Dec 2025 00:00:00 +0000</lastBuildDate><image><url>https://samfisheryu.github.io/media/icon_hu_da05098ef60dc2e7.png</url><title>Distributed Systems</title><link>https://samfisheryu.github.io/tags/distributed-systems/</link></image><item><title>Reliable and Resilient Collective Communication Library for LLM Training and Serving</title><link>https://samfisheryu.github.io/publications/r2cc-2025-arxiv/</link><pubDate>Mon, 01 Dec 2025 00:00:00 +0000</pubDate><guid>https://samfisheryu.github.io/publications/r2cc-2025-arxiv/</guid><description/></item><item><title>Collective Communication Under Failures for Distributed ML</title><link>https://samfisheryu.github.io/projects/cc-fault-tolerance/</link><pubDate>Thu, 01 May 2025 00:00:00 +0000</pubDate><guid>https://samfisheryu.github.io/projects/cc-fault-tolerance/</guid><description>&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;Modern ML training and inference span tens to tens of thousands of GPUs,
where network failures (NIC, cable, port faults) can waste &lt;strong&gt;10–15 %&lt;/strong&gt; of
GPU hours due to slow recovery. Today&amp;rsquo;s collective communication libraries
(NCCL, etc.) are &lt;em&gt;fail-stop&lt;/em&gt;: any network error crashes the communicator
and forces a job restart. This project closes the gap with two complementary
threads: failure-resilient collectives, and a tight lower bound (plus an
AllReduce design that approaches it) for the case where a server&amp;rsquo;s
bandwidth has degraded.&lt;/p&gt;
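&lt;p&gt;To make the fail-stop baseline concrete (a minimal sketch, assuming a PyTorch
job that uses the NCCL backend through &lt;code&gt;torch.distributed&lt;/code&gt;; this is
illustrative code, not part of this project): when a collective hits a network
fault, the only recourse today is to let the exception propagate and have the
launcher restart the job from a checkpoint.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Minimal sketch of the fail-stop baseline: a NCCL error surfaces as an
# exception, the communicator is unusable afterwards, and the usual recovery
# is a full job restart from the last checkpoint (e.g. via an elastic launcher).
import torch
import torch.distributed as dist

def sync_gradients(model):
    try:
        for p in model.parameters():
            if p.grad is not None:
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # NCCL collective
                p.grad.div_(dist.get_world_size())
    except RuntimeError:
        # A single NIC/link fault lands here; the process group cannot be
        # repaired in place, so the job dies and restarts, losing in-flight work.
        raise
&lt;/code&gt;&lt;/pre&gt;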
&lt;h3 id="r2cc--failure-resilient-collective-communication"&gt;R2CC — Failure-resilient collective communication&lt;/h3&gt;
&lt;p&gt;R2CC is a drop-in extension for NCCL that turns the CCL layer from fail-stop to
&lt;strong&gt;fail-operational&lt;/strong&gt; under inter-node NIC/link failures. It detects faults
mid-collective, migrates in-flight transfers losslessly to healthy
connections, and adapts collective schedules to the degraded topology, all
with no framework changes required.&lt;/p&gt;
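&lt;p&gt;The migration idea itself is simple to sketch (a toy model in plain Python,
not R2CC&amp;rsquo;s actual transport-level implementation): chunks that were queued on
a failed connection but not yet acknowledged are redistributed across the
surviving connections and resent, so the collective finishes instead of aborting.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Toy model of lossless in-flight migration. The data structures here are
# hypothetical and exist only to illustrate the idea.
def migrate_inflight(assignments, acked, failed_conn, healthy_conns):
    """assignments maps a connection id to the list of chunk ids queued on it;
    acked is the set of chunk ids already delivered. Returns a new assignment
    with the failed connection drained onto the healthy ones."""
    pending = [c for c in assignments[failed_conn] if c not in acked]
    new_assignments = {conn: list(chunks)
                       for conn, chunks in assignments.items()
                       if conn != failed_conn}
    for i, chunk in enumerate(pending):
        target = healthy_conns[i % len(healthy_conns)]
        new_assignments[target].append(chunk)   # resend on a healthy path
    return new_assignments

# Example: connection 1 fails after chunk 3 was delivered; chunks 4 and 5 move.
assignments = {0: [0, 1, 2], 1: [3, 4, 5], 2: [6, 7, 8]}
print(migrate_inflight(assignments, acked={3}, failed_conn=1, healthy_conns=[0, 2]))
&lt;/code&gt;&lt;/pre&gt;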
&lt;p&gt;Evaluated on a 4×8 H100 InfiniBand cluster and on large-scale ML simulators
modeling up to 1024 GPUs, R2CC incurs &lt;strong&gt;&amp;lt; 1.1 %&lt;/strong&gt; training overhead and
&lt;strong&gt;&amp;lt; 3 %&lt;/strong&gt; inference overhead under active failure scenarios, and reduces
failure-induced overhead by &lt;strong&gt;up to 92 % for training&lt;/strong&gt; and &lt;strong&gt;98 % for
inference&lt;/strong&gt; compared with existing fault-tolerant approaches.&lt;/p&gt;
&lt;p&gt;📄 Paper available as an arXiv preprint.&lt;/p&gt;
&lt;p&gt;💻 Code is open-sourced as a drop-in NCCL extension; contributions and bug
reports are welcome.&lt;/p&gt;
&lt;h3 id="bandwidth-optimal-allreduce-under-bandwidth-asymmetric-topologies"&gt;Bandwidth-optimal AllReduce under bandwidth-asymmetric topologies&lt;/h3&gt;
&lt;p&gt;Once a NIC fails and a server becomes a &lt;em&gt;straggler&lt;/em&gt;, what is the
unavoidable AllReduce overhead, and how do we approach it? This thread
gives the &lt;strong&gt;first information-theoretic lower bound&lt;/strong&gt; on AllReduce
completion time under asymmetric network bandwidth, and designs a
four-stage pipelined AllReduce that stays within &lt;strong&gt;2–6 %&lt;/strong&gt;
of NCCL&amp;rsquo;s fault-free ring under up to 50 % bandwidth loss, whereas
state-of-the-art approaches still incur up to 57 % overhead.&lt;/p&gt;
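&lt;p&gt;For orientation, the classical symmetric analysis (textbook ring arithmetic,
not the new asymmetric bound from the manuscript) already shows why a plain ring
suffers: with N nodes, M bytes per node, and per-node bandwidth B, a
bandwidth-optimal AllReduce moves 2M(N-1)/N bytes through every node, so the
fault-free time is roughly 2M(N-1)/(N·B); a ring runs at the pace of its slowest
member, so a node retaining only a fraction beta of its bandwidth inflates that
time by 1/beta (a 2× hit at 50 % loss), which is the gap the pipelined design
attacks. A quick numeric check with made-up figures:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Back-of-the-envelope check of the classical ring-AllReduce bandwidth term,
# using illustrative (made-up) numbers; this is the symmetric textbook baseline,
# not the asymmetric lower bound derived in the manuscript.
def ring_allreduce_time(n_nodes, msg_bytes, node_bw_bytes_per_s):
    # Each node sends and receives 2 * M * (N - 1) / N bytes in a
    # bandwidth-optimal AllReduce (reduce-scatter followed by all-gather).
    traffic = 2 * msg_bytes * (n_nodes - 1) / n_nodes
    return traffic / node_bw_bytes_per_s

N = 32                      # nodes
M = 4 * 1024**3             # 4 GiB of gradients per node
B = 50 * 1024**3            # 50 GiB/s healthy per-node bandwidth
beta = 0.5                  # the straggler keeps half its bandwidth

fault_free = ring_allreduce_time(N, M, B)
degraded_ring = ring_allreduce_time(N, M, beta * B)   # slowest node paces the ring
print(f"fault-free ring:      {fault_free:.3f} s")
print(f"ring with straggler:  {degraded_ring:.3f} s (about 1/beta = 2x slower)")
&lt;/code&gt;&lt;/pre&gt;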
&lt;p&gt;&lt;em&gt;Manuscript under review — details available on request.&lt;/em&gt;&lt;/p&gt;</description></item></channel></rss>