Projects

Enable LLM Internal Observability

Treating model-internal observability as a first-class systems primitive for high-performance LLM inference.

Collective Communication Under Failures for Distributed ML

Closing the gap between fail-stop collective communication libraries (CCLs) and the realities of large-scale GPU training and serving, through resilient communication and bandwidth-optimal AllReduce under failures.