Enabling Performant and Flexible Model-Internal Observability for LLM Inference

Abstract
Today’s inference-time workloads increasingly depend on timely access to a
model’s internal states. We present DMI, a high-speed deep model inspector
that treats internal observability as a first-class systems primitive,
decoupling it from the inference hot path via an asynchronous observability
substrate built from Ring², a GPU–CPU memory abstraction for capturing and
staging tensors, and a policy-controlled host backend that exports them. DMI
enables the placement of observation points across a rich space of internal
signals and diverse inference backends while preserving serving optimizations
and adhering to tight GPU memory budgets. Our experiments demonstrate that
DMI incurs only 0.4%–6.8% overhead in offline batch inference and an average
of 6% in moderate online serving, reducing latency overhead by 2×–15×
compared to existing baselines with similar observability features.
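
As an illustration of the decoupling idea described above, the following is a minimal CPU-only sketch of the general pattern: the hot path stages records into a bounded buffer without ever blocking, and a background thread exports them under a policy. It is not DMI's implementation and does not model Ring² or GPU memory. The names `StagingRing`, `try_push`, and `run_exporter`, and the choice to drop new records when the buffer is full, are assumptions made purely for illustration.

```python
import threading
from collections import deque


class StagingRing:
    """Hypothetical bounded staging buffer (illustrative, not Ring² itself).

    try_push never waits for the consumer: if the buffer is full, the new
    record is dropped and counted, so the producer is never stalled.
    """

    def __init__(self, capacity):
        self._capacity = capacity
        self._slots = deque()
        self._ready = threading.Condition()
        self._closed = False
        self.dropped = 0

    def try_push(self, record):
        """Stage a record; return False (and count a drop) if the ring is full."""
        with self._ready:
            if len(self._slots) >= self._capacity:
                self.dropped += 1
                return False
            self._slots.append(record)
            self._ready.notify()
            return True

    def pop(self):
        """Block until a record is available; return None once closed and drained."""
        with self._ready:
            while not self._slots and not self._closed:
                self._ready.wait()
            if self._slots:
                return self._slots.popleft()
            return None

    def close(self):
        """Signal that no more records will be pushed."""
        with self._ready:
            self._closed = True
            self._ready.notify_all()


def run_exporter(ring, policy, sink):
    """Background loop: drain the ring and export only records the policy accepts."""
    while True:
        record = ring.pop()
        if record is None:
            return
        if policy(record):
            sink.append(record)


if __name__ == "__main__":
    ring = StagingRing(capacity=4)
    exported = []
    # Assumed example policy: export only even-numbered layers.
    worker = threading.Thread(
        target=run_exporter,
        args=(ring, lambda rec: rec["layer"] % 2 == 0, exported),
    )
    worker.start()
    for layer in range(8):  # stands in for the inference hot path
        ring.try_push({"layer": layer, "values": [0.0] * 4})
    ring.close()
    worker.join()
    print(len(exported), "exported;", ring.dropped, "dropped")
```

In this sketch the producer pays only for a short critical section per record, and a slow exporter shows up as a rising `dropped` count rather than as added inference latency. A real system would have to choose its own full-buffer behavior (drop, overwrite, or backpressure); drop-newest is used here only because it is the simplest to show.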
Publication: arXiv preprint