<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Nengneng Yu</title><link>https://samfisheryu.github.io/</link><atom:link href="https://samfisheryu.github.io/index.xml" rel="self" type="application/rss+xml"/><description>Nengneng Yu</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Mon, 24 Oct 2022 00:00:00 +0000</lastBuildDate><image><url>https://samfisheryu.github.io/media/icon_hu_da05098ef60dc2e7.png</url><title>Nengneng Yu</title><link>https://samfisheryu.github.io/</link></image><item><title>Enabling Performant and Flexible Model-Internal Observability for LLM Inference</title><link>https://samfisheryu.github.io/publications/dmi-arxiv/</link><pubDate>Thu, 07 May 2026 00:00:00 +0000</pubDate><guid>https://samfisheryu.github.io/publications/dmi-arxiv/</guid><description/></item><item><title>A 15-Layer Multi-Omics Dissection of Gastric Cancer Ecotypes Reveals Therapeutic Opportunities</title><link>https://samfisheryu.github.io/publications/cell-reports-medicine-2026-gc/</link><pubDate>Thu, 01 Jan 2026 00:00:00 +0000</pubDate><guid>https://samfisheryu.github.io/publications/cell-reports-medicine-2026-gc/</guid><description/></item><item><title>Tidal: Tackling Concept Drift in Provenance-Based Advanced Persistent Threats Detection</title><link>https://samfisheryu.github.io/publications/nines-2026-tidal/</link><pubDate>Thu, 01 Jan 2026 00:00:00 +0000</pubDate><guid>https://samfisheryu.github.io/publications/nines-2026-tidal/</guid><description/></item><item><title>Reliable and Resilient Collective Communication Library for LLM Training and Serving</title><link>https://samfisheryu.github.io/publications/r2cc-2025-arxiv/</link><pubDate>Mon, 01 Dec 2025 00:00:00 +0000</pubDate><guid>https://samfisheryu.github.io/publications/r2cc-2025-arxiv/</guid><description/></item><item><title>Enable LLM Internal 
Observability</title><link>https://samfisheryu.github.io/projects/dmi/</link><pubDate>Mon, 01 Sep 2025 00:00:00 +0000</pubDate><guid>https://samfisheryu.github.io/projects/dmi/</guid><description>&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;Today&amp;rsquo;s inference-time workloads increasingly depend on &lt;strong&gt;timely access to a
model&amp;rsquo;s internal states&lt;/strong&gt; — for interpretability, test-time alignment,
speculative decoding, hallucination detection, and real-time safety
monitoring. But existing approaches retrofit observation either through
PyTorch hooks (limited and slow) or engine-specific APIs (not portable),
treating observability as an afterthought rather than a systems concern.
This project pursues internal observability as a &lt;em&gt;first-class systems
primitive&lt;/em&gt;, asking: what does it take for a serving stack to expose
arbitrary internal states without paying for it on the inference hot path?&lt;/p&gt;
&lt;h3 id="dmi--deep-model-inspector"&gt;DMI — Deep Model Inspector&lt;/h3&gt;
&lt;p&gt;DMI is the first system in this line of work. It decouples observability
from the inference hot path through three pieces:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;HookPoint&lt;/strong&gt; — a CUDA-graph-compatible collection primitive that can be
placed at arbitrary locations in PyTorch models, exposing diverse internal
states without engine-specific changes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ring²&lt;/strong&gt; — a GPU–CPU staging abstraction that keeps tensor capture inside
CUDA graphs by staging captured tensors in a dedicated GPU-side ring buffer.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;data exporter&lt;/strong&gt; that drains the staged tensors asynchronously on the
host, governed by complete or best-effort policies that adapt data rate
and fidelity to interconnect and memory budgets.&lt;/li&gt;
&lt;/ul&gt;
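&lt;p&gt;&lt;em&gt;The staging-and-export split can be sketched in plain Python. All names here (&lt;code&gt;RingStage&lt;/code&gt;, &lt;code&gt;drain&lt;/code&gt;, the policy strings) are illustrative, not DMI's API; the real system stages on the GPU inside CUDA graphs, which this thread-and-queue toy cannot show.&lt;/em&gt;&lt;/p&gt;

```python
import queue
import threading

class RingStage:
    """Stand-in for the GPU-side staging ring: a fixed number of
    pre-allocated slots between the capture hot path and the exporter.
    Policy 'complete' blocks the producer until a slot frees up;
    'best_effort' drops the capture instead of stalling the hot path."""

    def __init__(self, capacity, policy="best_effort"):
        self.slots = queue.Queue(maxsize=capacity)
        self.policy = policy
        self.dropped = 0

    def capture(self, name, state):
        record = (name, state)
        if self.policy == "complete":
            self.slots.put(record)         # full fidelity, back-pressure
            return True
        try:
            self.slots.put_nowait(record)  # never stall the hot path
            return True
        except queue.Full:
            self.dropped += 1              # trade fidelity for rate
            return False

def drain(stage, sink, stop):
    """Exporter thread: asynchronously moves staged records off the ring."""
    while not stop.is_set() or not stage.slots.empty():
        try:
            sink.append(stage.slots.get(timeout=0.01))
        except queue.Empty:
            pass
```

&lt;p&gt;&lt;em&gt;Usage: start &lt;code&gt;drain&lt;/code&gt; in a &lt;code&gt;threading.Thread&lt;/code&gt;, call &lt;code&gt;capture&lt;/code&gt; from the model loop, then set the stop event and join. With &lt;code&gt;complete&lt;/code&gt; every record reaches the sink; with &lt;code&gt;best_effort&lt;/code&gt;, sink size plus &lt;code&gt;dropped&lt;/code&gt; equals the number of captures.&lt;/em&gt;&lt;/p&gt;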
&lt;p&gt;DMI is a standalone system of ~11K lines of CUDA / C++ / Python, with
out-of-the-box support for Hugging Face and vLLM. In offline batch inference it
introduces only &lt;strong&gt;0.4 % – 6.8 %&lt;/strong&gt; overhead, and
~&lt;strong&gt;6 %&lt;/strong&gt; under moderate online serving, &lt;strong&gt;2× – 15×&lt;/strong&gt; lower
than baselines with comparable observability features.&lt;/p&gt;
&lt;p&gt;📄 (arXiv preprint, with NSDI'26).&lt;/p&gt;
&lt;p&gt;💻 Code is open-sourced at &lt;strong&gt;
&lt;/strong&gt; —
contributions, issues, and benchmark reports very welcome.&lt;/p&gt;</description></item><item><title>TabSyM: A Generative Pipeline for Small Multi-Cohort Omics Tabular Data</title><link>https://samfisheryu.github.io/publications/tabsym-2025-biorxiv/</link><pubDate>Mon, 14 Jul 2025 00:00:00 +0000</pubDate><guid>https://samfisheryu.github.io/publications/tabsym-2025-biorxiv/</guid><description/></item><item><title>Collective Communication Under Failures for Distributed ML</title><link>https://samfisheryu.github.io/projects/cc-fault-tolerance/</link><pubDate>Thu, 01 May 2025 00:00:00 +0000</pubDate><guid>https://samfisheryu.github.io/projects/cc-fault-tolerance/</guid><description>&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;Modern ML training and inference span tens to tens of thousands of GPUs,
where network failures (NIC, cable, port faults) can waste &lt;strong&gt;10–15 %&lt;/strong&gt; of
GPU hours due to slow recovery. Today&amp;rsquo;s collective communication libraries
(NCCL, etc.) are &lt;em&gt;fail-stop&lt;/em&gt;: any network error crashes the communicator
and forces a job restart. This project closes the gap with two complementary
threads — failure-resilient collectives, and a tighter performance lower
bound once a server has degraded.&lt;/p&gt;
&lt;h3 id="r2cc--failure-resilient-collective-communication"&gt;R2CC — Failure-resilient collective communication&lt;/h3&gt;
&lt;p&gt;R2CC is a drop-in extension for NCCL that turns the CCL layer from fail-stop to
&lt;strong&gt;fail-operational&lt;/strong&gt; under inter-node NIC/link failures. It detects faults
mid-collective, migrates in-flight transfers losslessly to healthy
connections, and adapts collective schedules to the degraded topology,
with no framework changes required.&lt;/p&gt;
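&lt;p&gt;&lt;em&gt;The lossless-migration idea can be sketched as follows (plain Python, all names hypothetical): advance a cursor only on acknowledged sends, so after a mid-transfer fault the remaining stream resumes on a healthy connection without loss or duplication. R2CC does this for RDMA transfers inside NCCL; this is only the control-flow skeleton.&lt;/em&gt;&lt;/p&gt;

```python
class Link:
    """Toy connection standing in for one NIC/queue pair; it can be
    configured to fault after delivering a given number of chunks."""

    def __init__(self, name, fail_after=None):
        self.name = name
        self.received = []
        self.fail_after = fail_after

    def send(self, chunk):
        if self.fail_after is not None and len(self.received) == self.fail_after:
            raise ConnectionError(self.name)   # fault strikes mid-collective
        self.received.append(chunk)

def resilient_send(chunks, links):
    """Fail-operational transfer: advance the cursor only once a chunk is
    acknowledged, so a fault migrates the in-flight stream to the next
    healthy link from exactly where it stopped."""
    acked = 0
    for link in links:
        while acked != len(chunks):
            try:
                link.send(chunks[acked])
                acked += 1                     # advance only on success
            except ConnectionError:
                break                          # migrate to the next link
        if acked == len(chunks):
            return acked
    raise RuntimeError("all links failed")
```

&lt;p&gt;&lt;em&gt;Example: with a primary that faults after two chunks and one backup, the backup carries the remaining chunks and the two received streams concatenate to the original payload.&lt;/em&gt;&lt;/p&gt;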
&lt;p&gt;Evaluated on a 4×8 H100 InfiniBand cluster and large-scale ML simulators
modeling up to 1024 GPUs: R2CC incurs &lt;strong&gt;&amp;lt; 1.1 %&lt;/strong&gt; training overhead and
&lt;strong&gt;&amp;lt; 3 %&lt;/strong&gt; inference overhead under active failure scenarios, and reduces
failure-induced overhead by &lt;strong&gt;up to 92 % for training&lt;/strong&gt; and &lt;strong&gt;98 % for
inference&lt;/strong&gt; versus existing fault-tolerant approaches.&lt;/p&gt;
&lt;p&gt;📄 (arXiv preprint).&lt;/p&gt;
&lt;p&gt;💻 Code is open-sourced at &lt;strong&gt;
&lt;/strong&gt; —
a drop-in NCCL extension; contributions and bug reports welcome.&lt;/p&gt;
&lt;h3 id="bandwidth-optimal-allreduce-under-bandwidth-asymmetric-topologies"&gt;Bandwidth-optimal AllReduce under bandwidth-asymmetric topologies&lt;/h3&gt;
&lt;p&gt;Once a NIC fails and a server becomes a &lt;em&gt;straggler&lt;/em&gt;, what is the
unavoidable AllReduce overhead, and how do we approach it? This thread
gives the &lt;strong&gt;first information-theoretic lower bound&lt;/strong&gt; on AllReduce
completion time under asymmetric network bandwidth, and designs a
four-stage pipelined AllReduce that closes the gap to within &lt;strong&gt;2 – 6 %&lt;/strong&gt;
of NCCL&amp;rsquo;s fault-free ring under up to 50 % bandwidth loss — whereas
state-of-the-art still incurs up to 57 % overhead.&lt;/p&gt;
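&lt;p&gt;&lt;em&gt;For orientation (this is the textbook cost model, not the manuscript's bound): the bandwidth term of ring AllReduce is gated by the slowest link, so a plain ring pays the full bandwidth loss of a single degraded NIC.&lt;/em&gt;&lt;/p&gt;

```python
def ring_allreduce_time(n_bytes, p, link_bw):
    """Bandwidth term of ring AllReduce on p ranks: 2*(p-1) steps, each
    moving n_bytes/p over every link, so the slowest link gates the ring.
    Bandwidths in bytes/s; latency terms ignored."""
    return 2 * (p - 1) * (n_bytes / p) / min(link_bw)

p = 8
healthy = [400e9] * p                    # e.g. 400 GB/s per link
degraded = [400e9] * (p - 1) + [200e9]   # one NIC at half bandwidth
t0 = ring_allreduce_time(1e9, p, healthy)
t1 = ring_allreduce_time(1e9, p, degraded)
# t1 == 2 * t0: a plain ring doubles in time when one link halves
```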
&lt;p&gt;&lt;em&gt;Manuscript under review — details available on request.&lt;/em&gt;&lt;/p&gt;</description></item><item><title>Towards Interactive Research Agents for Internet Incident Investigation</title><link>https://samfisheryu.github.io/publications/hotnets-2023-agents/</link><pubDate>Wed, 01 Nov 2023 00:00:00 +0000</pubDate><guid>https://samfisheryu.github.io/publications/hotnets-2023-agents/</guid><description/></item><item><title>Experience</title><link>https://samfisheryu.github.io/experience/</link><pubDate>Tue, 24 Oct 2023 00:00:00 +0000</pubDate><guid>https://samfisheryu.github.io/experience/</guid><description/></item></channel></rss>