Enabling Performant and Flexible Model-Internal Observability for LLM Inference

Nengneng Yu

Sixian Xiong

Yibo Zhao

Wei Wang

Zaoxing Liu

Abstract

Today’s inference-time workloads increasingly depend on timely access to a model’s internal states. We present DMI, a high-speed deep model inspector that treats internal observability as a first-class systems primitive, decoupling it from the inference hot path via an asynchronous observability substrate built from Ring², a GPU–CPU memory abstraction for capturing and staging tensors, and a policy-controlled host backend that exports them. DMI enables the placement of observation points across a rich space of internal signals and diverse inference backends while preserving serving optimizations and adhering to tight GPU memory budgets. Our experiments demonstrate that DMI incurs only 0.4%–6.8% overhead in offline batch inference and an average of 6% in moderate online serving, reducing latency overhead by 2×–15× compared to existing baselines with similar observability features.
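To illustrate the decoupling idea in the abstract, the sketch below shows a generic pattern: the inference loop hands observations to a bounded staging buffer without blocking, and a background thread exports them. It is not DMI's implementation and does not model Ring² or GPU memory. The names `AsyncObserver`, `observe`, and `run_inference` are hypothetical, and dropping on overflow is just one assumed policy.

```python
import queue
import threading


class AsyncObserver:
    """Hypothetical stand-in for an asynchronous observability substrate."""

    def __init__(self, capacity=4):
        # Bounded staging buffer: caps memory used by pending observations.
        self._staging = queue.Queue(maxsize=capacity)
        self.exported = []  # stands in for the host backend's export sink
        self.dropped = 0    # overflow is counted instead of stalling inference
        self._worker = threading.Thread(target=self._export_loop, daemon=True)
        self._worker.start()

    def observe(self, name, value):
        """Called on the inference hot path; never blocks."""
        try:
            self._staging.put_nowait((name, value))
        except queue.Full:
            # Assumed policy: drop the observation rather than wait.
            self.dropped += 1

    def _export_loop(self):
        # Runs off the hot path and drains the staging buffer in FIFO order.
        while True:
            item = self._staging.get()
            if item is None:  # sentinel: shut down
                break
            self.exported.append(item)

    def close(self):
        # Blocking put, so the sentinel is delivered even if the buffer is full.
        self._staging.put(None)
        self._worker.join()


def run_inference(observer, num_layers=3):
    """Toy 'model': each layer doubles a scalar and reports it."""
    hidden = 1.0
    for layer in range(num_layers):
        hidden = hidden * 2.0  # placeholder for a layer's real computation
        observer.observe(f"layer{layer}.hidden", hidden)
    return hidden


if __name__ == "__main__":
    obs = AsyncObserver(capacity=8)
    result = run_inference(obs)
    obs.close()
    print(result, obs.exported, obs.dropped)
```

The point of the design is that `observe` returns immediately whether or not the exporter keeps up, so the cost of exporting is paid off the inference loop. A real system would also have to decide what happens to observations when the buffer is full, which is where an export policy comes in.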

Type

Preprint

Publication

arXiv preprint arXiv:2605.11093

Last updated on May 11, 2026

Systems for ML, LLM
