Job Failure Diagnosis
Lepton AI provides a built-in feature that automatically diagnoses job failures. It is designed to help users quickly identify and resolve issues in their batch jobs.
Let's walk through an example of how to use this feature to diagnose a job failure.
Prepare a Job Designed to Fail
First, we need to create a job that is guaranteed to fail. The following script creates a memory leak on a GPU by continuously allocating tensors with PyTorch and preventing them from being garbage collected, until it reaches a target memory usage (defaulting to 10 GB, configurable via the command line).
import torch
import time
import argparse


def create_gpu_memory_leak(target_gb=10):
    if not torch.cuda.is_available():
        print("No GPU available. Exiting.")
        return

    print(f"Starting GPU memory leak test (target: {target_gb}GB)...")
    stored_tensors = []  # Keep references so the tensors are never garbage collected

    try:
        while True:
            # Create a ~1GB tensor: float32 (4 bytes) * 250M elements ≈ 1GB
            tensor = torch.rand(250_000_000, device='cuda:0')
            stored_tensors.append(tensor)  # Prevent garbage collection

            current_memory = torch.cuda.memory_allocated('cuda:0') / (1024**3)  # Convert to GB
            print(f"Current GPU memory usage: {current_memory:.2f} GB")

            if current_memory > target_gb:  # Stop once the target is reached
                print(f"Reached {target_gb}GB memory usage target")
                break

            time.sleep(0.1)  # Small delay to keep the system responsive
    except KeyboardInterrupt:
        print("\nTest interrupted by user")
    finally:
        print("Test completed")


if __name__ == "__main__":
    # Example: python bad_job.py -t 100 for a 100GB target
    parser = argparse.ArgumentParser(description='GPU Memory Leak Test')
    parser.add_argument('-t', '--target', type=float, default=10,
                        help='Target memory usage in GB (default: 10)')
    args = parser.parse_args()
    create_gpu_memory_leak(args.target)
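If you have access to a machine with a CUDA GPU, you can optionally sanity-check the script with a small target before uploading it, for example:
python bad_job.py -t 2
The loop should stop once allocated memory exceeds 2 GB and print "Test completed".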
Save the script as bad_job.py
and upload it to the Lepton file system using the following command:
lep storage upload bad_job.py
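To confirm the upload, you can list the contents of the default file system. The exact subcommand may vary by CLI version; with recent versions of the lep CLI it is typically:
lep storage ls /
You should see bad_job.py in the listing.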
Create the Job
Navigate to the Batch Jobs page to create the job with the following configuration:
Resource
Since we only want to test the job failure diagnosis feature, any GPU type and its corresponding node group will work.
Container
In the container section, we will use the default image and paste the following command to run the job:
python /mnt/bad_job.py -t 100  # target 100 GB of GPU memory usage
A 100 GB target far exceeds the memory of any single GPU, so the job is guaranteed to hit an out-of-memory error.
File System Mount
Open the Advanced configuration and click Add file system mount.
In the "Mount from" part, choose the default path as we have uploaded the file to the default storage area, then specify /mnt
in the "Mount as" section.
Create the Job
Click Create to create the job. Once it is created, you can view it in the Lepton Dashboard.
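You can also check the job's status from the command line. Assuming your version of the lep CLI includes the job subcommands, the following lists your jobs and their current states:
lep job list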
Diagnose Job Failure
The job will fail once it tries to allocate more memory than the GPU has available, well before reaching the 100 GB target. You will then see a failure message on the job details page.
For this example, you will see an error tag indicating that the job failed due to ERR_GPU_OUT_OF_MEMORY.
Hover over the error tag to see the detailed error message.
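For reference, the underlying failure in this example is a standard PyTorch CUDA out-of-memory error. A minimal local sketch of the same failure mode, assuming a machine with a CUDA GPU and a recent PyTorch version (which exposes torch.cuda.OutOfMemoryError), looks like this:
import torch

if torch.cuda.is_available():
    try:
        # Request far more memory than any single GPU has (~1 TB of float32),
        # which triggers the same out-of-memory condition the job hits.
        torch.rand(250_000_000_000, device='cuda:0')
    except torch.cuda.OutOfMemoryError as e:
        print(f"GPU out of memory: {e}")
The diagnosis feature surfaces this condition as the ERR_GPU_OUT_OF_MEMORY tag shown above, so you do not have to dig through the job logs to find the root cause.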