Observability
We provide a node-group-level observability system that covers nodes, metrics, statistics, and a timeline.
Nodes
For each node in a node group, you can inspect its health, metrics, and events. To view a node's details, click its name in the node list.
Node Health
Navigate to the Health tab to view the node’s health status, including GPU, general hardware, and system conditions.

Node Metrics
Under the Metrics tab, you can monitor key performance indicators such as CPU usage, memory consumption, GPU utilization, GPU temperature, and network activity.

Node Events
Select a node from the list and open the Events tab to explore its event history. Each event shows its name, type, component, timestamp, and a detailed message.

Metrics
The Metrics section visualizes performance data for the node group as a whole.

Statistics
The Statistics tab shows node-group-level statistics, including GPU usage and job completion status.

Timeline
The Timeline is a dynamic chart of node status changes over time, so you can track operational trends at a glance.
