<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Nengneng Yu</title><link>https://samfisheryu.github.io/</link><atom:link href="https://samfisheryu.github.io/index.xml" rel="self" type="application/rss+xml"/><description>Nengneng Yu</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Mon, 24 Oct 2022 00:00:00 +0000</lastBuildDate><image><url>https://samfisheryu.github.io/media/icon_hu_da05098ef60dc2e7.png</url><title>Nengneng Yu</title><link>https://samfisheryu.github.io/</link></image><item><title>Enabling Performant and Flexible Model-Internal Observability for LLM Inference</title><link>https://samfisheryu.github.io/publications/dmi-arxiv/</link><pubDate>Thu, 07 May 2026 00:00:00 +0000</pubDate><guid>https://samfisheryu.github.io/publications/dmi-arxiv/</guid><description/></item><item><title>A 15-Layer Multi-Omics Dissection of Gastric Cancer Ecotypes Reveals Therapeutic Opportunities</title><link>https://samfisheryu.github.io/publications/cell-reports-medicine-2026-gc/</link><pubDate>Thu, 01 Jan 2026 00:00:00 +0000</pubDate><guid>https://samfisheryu.github.io/publications/cell-reports-medicine-2026-gc/</guid><description/></item><item><title>Tidal: Tackling Concept Drift in Provenance-Based Advanced Persistent Threats Detection</title><link>https://samfisheryu.github.io/publications/nines-2026-tidal/</link><pubDate>Thu, 01 Jan 2026 00:00:00 +0000</pubDate><guid>https://samfisheryu.github.io/publications/nines-2026-tidal/</guid><description/></item><item><title>Reliable and Resilient Collective Communication Library for LLM Training and Serving</title><link>https://samfisheryu.github.io/publications/r2cc-2025-arxiv/</link><pubDate>Mon, 01 Dec 2025 00:00:00 +0000</pubDate><guid>https://samfisheryu.github.io/publications/r2cc-2025-arxiv/</guid><description/></item><item><title>Enable LLM Internal 
Observability</title><link>https://samfisheryu.github.io/projects/dmi/</link><pubDate>Mon, 01 Sep 2025 00:00:00 +0000</pubDate><guid>https://samfisheryu.github.io/projects/dmi/</guid><description>&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;Today&amp;rsquo;s inference-time workloads increasingly depend on &lt;strong&gt;timely access to a
model&amp;rsquo;s internal states&lt;/strong&gt; — for interpretability, test-time alignment,
speculative decoding, hallucination detection, and real-time safety
monitoring. But existing approaches retrofit observation either through
PyTorch hooks (limited and slow) or engine-specific APIs (not portable),
treating observability as an afterthought rather than a systems concern.
This project pursues internal observability as a &lt;em&gt;first-class systems
primitive&lt;/em&gt;, asking: what does it take for a serving stack to expose
arbitrary internal states without paying for it on the inference hot path?&lt;/p&gt;
&lt;h3 id="dmi--deep-model-inspector"&gt;DMI — Deep Model Inspector&lt;/h3&gt;
&lt;p&gt;DMI is the first system in this line of work. It decouples observability
from the inference hot path through three pieces:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;HookPoint&lt;/strong&gt; — a CUDA-graph-compatible collection primitive that can be
placed at arbitrary locations in PyTorch models, exposing diverse internal
states without engine-specific changes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ring²&lt;/strong&gt; — a GPU–CPU staging abstraction that keeps tensor capture inside
CUDA graphs by staging captured tensors in a dedicated GPU-side ring buffer.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;data exporter&lt;/strong&gt; that drains the staged tensors asynchronously on the
host, governed by complete or best-effort policies that adapt data rate
and fidelity to interconnect and memory budgets.&lt;/li&gt;
&lt;/ul&gt;
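&lt;p&gt;&lt;em&gt;The staging-and-export split can be sketched in plain Python. All names here (&lt;code&gt;RingStage&lt;/code&gt;, &lt;code&gt;drain&lt;/code&gt;, the policy strings) are illustrative, not DMI's API; the real system stages on the GPU inside CUDA graphs, which this thread-and-queue toy cannot show.&lt;/em&gt;&lt;/p&gt;

```python
import queue
import threading

class RingStage:
    """Stand-in for the GPU-side staging ring: a fixed number of
    pre-allocated slots between the capture hot path and the exporter.
    Policy 'complete' blocks the producer until a slot frees up;
    'best_effort' drops the capture instead of stalling the hot path."""

    def __init__(self, capacity, policy="best_effort"):
        self.slots = queue.Queue(maxsize=capacity)
        self.policy = policy
        self.dropped = 0

    def capture(self, name, state):
        record = (name, state)
        if self.policy == "complete":
            self.slots.put(record)         # full fidelity, back-pressure
            return True
        try:
            self.slots.put_nowait(record)  # never stall the hot path
            return True
        except queue.Full:
            self.dropped += 1              # trade fidelity for rate
            return False

def drain(stage, sink, stop):
    """Exporter thread: asynchronously moves staged records off the ring."""
    while not stop.is_set() or not stage.slots.empty():
        try:
            sink.append(stage.slots.get(timeout=0.01))
        except queue.Empty:
            pass
```

&lt;p&gt;&lt;em&gt;Usage: start &lt;code&gt;drain&lt;/code&gt; in a &lt;code&gt;threading.Thread&lt;/code&gt;, call &lt;code&gt;capture&lt;/code&gt; from the model loop, then set the stop event and join. With &lt;code&gt;complete&lt;/code&gt; every record reaches the sink; with &lt;code&gt;best_effort&lt;/code&gt;, sink size plus &lt;code&gt;dropped&lt;/code&gt; equals the number of captures.&lt;/em&gt;&lt;/p&gt;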
&lt;p&gt;DMI is a standalone system of ~11K lines of CUDA / C++ / Python, with
out-of-the-box support for Hugging Face and vLLM. In offline batch inference it
introduces only &lt;strong&gt;0.4 % – 6.8 %&lt;/strong&gt; overhead, and
~&lt;strong&gt;6 %&lt;/strong&gt; under moderate online serving, &lt;strong&gt;2× – 15×&lt;/strong&gt; lower
than baselines with comparable observability features.&lt;/p&gt;
&lt;p&gt;📄 (arXiv preprint, with NSDI'26).&lt;/p&gt;
&lt;p&gt;💻 Code is open-sourced at &lt;strong&gt;
&lt;/strong&gt; —
contributions, issues, and benchmark reports very welcome.&lt;/p&gt;</description></item><item><title>TabSyM: A Generative Pipeline for Small Multi-Cohort Omics Tabular Data</title><link>https://samfisheryu.github.io/publications/tabsym-2025-biorxiv/</link><pubDate>Mon, 14 Jul 2025 00:00:00 +0000</pubDate><guid>https://samfisheryu.github.io/publications/tabsym-2025-biorxiv/</guid><description/></item><item><title>Collective Communication Under Failures for Distributed ML</title><link>https://samfisheryu.github.io/projects/cc-fault-tolerance/</link><pubDate>Thu, 01 May 2025 00:00:00 +0000</pubDate><guid>https://samfisheryu.github.io/projects/cc-fault-tolerance/</guid><description>&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;Modern ML training and inference span tens to tens of thousands of GPUs,
where network failures (NIC, cable, port faults) can waste &lt;strong&gt;10–15 %&lt;/strong&gt; of
GPU hours due to slow recovery. Today&amp;rsquo;s collective communication libraries
(NCCL, etc.) are &lt;em&gt;fail-stop&lt;/em&gt;: any network error crashes the communicator
and forces a job restart. This project closes the gap with two complementary
threads — failure-resilient collectives, and a tighter performance lower
bound once a server has degraded.&lt;/p&gt;
&lt;h3 id="r2cc--failure-resilient-collective-communication"&gt;R2CC — Failure-resilient collective communication&lt;/h3&gt;
&lt;p&gt;R2CC is a drop-in extension for NCCL that turns the CCL layer from fail-stop to
&lt;strong&gt;fail-operational&lt;/strong&gt; under inter-node NIC/link failures. It detects faults
mid-collective, migrates in-flight transfers losslessly to healthy
connections, and adapts collective schedules to the degraded topology,
with no framework changes required.&lt;/p&gt;
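&lt;p&gt;&lt;em&gt;The lossless-migration idea can be sketched as follows (plain Python, all names hypothetical): advance a cursor only on acknowledged sends, so after a mid-transfer fault the remaining stream resumes on a healthy connection without loss or duplication. R2CC does this for RDMA transfers inside NCCL; this is only the control-flow skeleton.&lt;/em&gt;&lt;/p&gt;

```python
class Link:
    """Toy connection standing in for one NIC/queue pair; it can be
    configured to fault after delivering a given number of chunks."""

    def __init__(self, name, fail_after=None):
        self.name = name
        self.received = []
        self.fail_after = fail_after

    def send(self, chunk):
        if self.fail_after is not None and len(self.received) == self.fail_after:
            raise ConnectionError(self.name)   # fault strikes mid-collective
        self.received.append(chunk)

def resilient_send(chunks, links):
    """Fail-operational transfer: advance the cursor only once a chunk is
    acknowledged, so a fault migrates the in-flight stream to the next
    healthy link from exactly where it stopped."""
    acked = 0
    for link in links:
        while acked != len(chunks):
            try:
                link.send(chunks[acked])
                acked += 1                     # advance only on success
            except ConnectionError:
                break                          # migrate to the next link
        if acked == len(chunks):
            return acked
    raise RuntimeError("all links failed")
```

&lt;p&gt;&lt;em&gt;Example: with a primary that faults after two chunks and one backup, the backup carries the remaining chunks and the two received streams concatenate to the original payload.&lt;/em&gt;&lt;/p&gt;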
&lt;p&gt;Evaluated on a 4×8 H100 InfiniBand cluster and large-scale ML simulators
modeling up to 1024 GPUs: R2CC incurs &lt;strong&gt;&amp;lt; 1.1 %&lt;/strong&gt; training overhead and
&lt;strong&gt;&amp;lt; 3 %&lt;/strong&gt; inference overhead under active failure scenarios, and reduces
failure-induced overhead by &lt;strong&gt;up to 92 % for training&lt;/strong&gt; and &lt;strong&gt;98 % for
inference&lt;/strong&gt; versus existing fault-tolerant approaches.&lt;/p&gt;
&lt;p&gt;📄 (arXiv preprint).&lt;/p&gt;
&lt;p&gt;💻 Code is open-sourced at &lt;strong&gt;
&lt;/strong&gt; —
a drop-in NCCL extension; contributions and bug reports welcome.&lt;/p&gt;
&lt;h3 id="bandwidth-optimal-allreduce-under-bandwidth-asymmetric-topologies"&gt;Bandwidth-optimal AllReduce under bandwidth-asymmetric topologies&lt;/h3&gt;
&lt;p&gt;Once a NIC fails and a server becomes a &lt;em&gt;straggler&lt;/em&gt;, what is the
unavoidable AllReduce overhead, and how do we approach it? This thread
gives the &lt;strong&gt;first information-theoretic lower bound&lt;/strong&gt; on AllReduce
completion time under asymmetric network bandwidth, and designs a
four-stage pipelined AllReduce that closes the gap to within &lt;strong&gt;2 – 6 %&lt;/strong&gt;
of NCCL&amp;rsquo;s fault-free ring under up to 50 % bandwidth loss — whereas
state-of-the-art still incurs up to 57 % overhead.&lt;/p&gt;
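&lt;p&gt;&lt;em&gt;For orientation (this is the textbook cost model, not the manuscript's bound): the bandwidth term of ring AllReduce is gated by the slowest link, so a plain ring pays the full bandwidth loss of a single degraded NIC.&lt;/em&gt;&lt;/p&gt;

```python
def ring_allreduce_time(n_bytes, p, link_bw):
    """Bandwidth term of ring AllReduce on p ranks: 2*(p-1) steps, each
    moving n_bytes/p over every link, so the slowest link gates the ring.
    Bandwidths in bytes/s; latency terms ignored."""
    return 2 * (p - 1) * (n_bytes / p) / min(link_bw)

p = 8
healthy = [400e9] * p                    # e.g. 400 GB/s per link
degraded = [400e9] * (p - 1) + [200e9]   # one NIC at half bandwidth
t0 = ring_allreduce_time(1e9, p, healthy)
t1 = ring_allreduce_time(1e9, p, degraded)
# t1 == 2 * t0: a plain ring doubles in time when one link halves
```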
&lt;p&gt;&lt;em&gt;Manuscript under review — details available on request.&lt;/em&gt;&lt;/p&gt;</description></item><item><title>Towards Interactive Research Agents for Internet Incident Investigation</title><link>https://samfisheryu.github.io/publications/hotnets-2023-agents/</link><pubDate>Wed, 01 Nov 2023 00:00:00 +0000</pubDate><guid>https://samfisheryu.github.io/publications/hotnets-2023-agents/</guid><description/></item><item><title>Experience</title><link>https://samfisheryu.github.io/experience/</link><pubDate>Tue, 24 Oct 2023 00:00:00 +0000</pubDate><guid>https://samfisheryu.github.io/experience/</guid><description/></item></channel></rss>