Observability

We provide node-group-level observability, covering nodes, metrics, statistics, and a timeline.

Nodes

For each node in a node group, you can view its health, metrics, and events. To view a node's details, click its name in the node list.

Node Health

Navigate to the Health tab to view the node’s health status, including GPU, general hardware, and system conditions.


Node Metrics

Under the Metrics tab, you can monitor key performance indicators such as CPU usage, memory consumption, GPU utilization, GPU temperature, and network activity.


Node Events

Select a node from the list and go to the Events tab to explore its event history. Each event shows its name, type, component, timestamp, and a detailed message.


Metrics

The Metrics section visualizes performance data for the node group as a whole.


Statistics

The Statistics tab shows node-group-level statistics, including GPU usage and job completion status.


Timeline

The timeline offers a dynamic chart illustrating node status changes over time, enabling you to track operational trends at a glance.
