Projects | Nengneng Yu

Projects

Systems for ML

Enable LLM Internal Observability

Treating model-internal observability as a first-class systems primitive for high-performance LLM inference.

Sep 1, 2025 • 1 min read

Systems for ML

Collective Communication Under Failures for Distributed ML

Closing the gap between fail-stop collective communication libraries (CCLs) and the realities of large-scale GPU training and serving, through resilient communication and bandwidth-optimal AllReduce under failures.

May 1, 2025 • 1 min read