Creating a job
A job corresponds to a one-off task that runs to completion and then stops. This page will go through the basics of creating a job in Lepton. We will then cover the various configurable options available to you when creating a job: environment variables, secrets, file system mounts, and more.
The Basics
You can create a job either from the CLI or from the Dashboard; this page will go through the steps to create a job from the CLI.
To create a job from a container image, we can use lep job create:
lep job create --name mypy \
--container-image default/lepton:photon-py3.11-runner-0.21.0 \
--command "echo 1; sleep 5" \
--resource-shape "cpu.small"
This will create a job called mypy from the default/lepton:photon-py3.11-runner-0.21.0 container image and run a simple bash command on a cpu.small instance, which has one core and 4GB of memory. The job will be created with the default configuration: a single worker, and no environment variables or secrets.
You can also specify the job using a JSON file and supply the configuration at job creation:
lep job create -n mypy \
-f ./lep_job_spec_mypy.json
with a JSON specification like this:
{
  "resource_shape": "cpu.small",
  "container": {
    "image": "default/lepton:photon-py3.11-runner-0.21.0",
    "command": [
      "/bin/bash",
      "-c",
      "echo 1; sleep 5"
    ]
  }
}
After the job is created, you can view its status with
lep job get -n mypy
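If you want to keep an eye on the status until the job finishes, you can combine this with a standard shell tool such as watch (a minimal sketch; adjust the interval as needed):
# Re-run the status check every 5 seconds; press Ctrl-C to stop.
watch -n 5 lep job get -n mypy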
Configurable Options
Number of workers
The --num-workers flag sets the number of workers to use for the job. To create a job with 4 workers:
lep job create --name mypy \
--container-image default/lepton:photon-py3.11-runner-0.21.0 \
--command "echo 1; sleep 5" \
--resource-shape "cpu.small" \
--num-workers 4
Resource shapes
Resource shapes are the instance types that the job will run on. The resource shape is specified with the --resource-shape flag:
lep job create --name mypy \
--container-image default/lepton:photon-py3.11-runner-0.21.0 \
--command "echo 1; sleep 5" \
--resource-shape cpu.small
The common resource shapes are:
cpu.small, cpu.medium, cpu.large, gpu.a10, gpu.h100-sxm, gpu.2xh100-sxm, gpu.4xh100-sxm, gpu.8xh100-sxm
This is not a complete list; enterprise users may have access to additional resource shapes. You can contact Lepton support for more information.
Environment variables and secrets
Environment variables are key-value pairs that are passed to the job. They will be automatically set as environment variables in the job container, so the runtime can refer to them as needed.
To pass environment variables to a job, you can use the --env flag with the lep job create command. For example, to pass the environment variable "MYKEY1" with value "MYVALUE1" and "MYKEY2" with value "MYVALUE2", you can use:
lep job create --name mypy \
--container-image default/lepton:photon-py3.11-runner-0.21.0 \
--env MYKEY1=MYVALUE1 \
--env MYKEY2=MYVALUE2 \
--command "echo 1; sleep 5" \
--resource-shape cpu.small
You can repeat the --env flag to pass multiple environment variables.
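Inside the job container these become regular environment variables that any process can read; a minimal sketch of commands the job could run (assuming the two variables above were passed):
# Both variables are visible to every process started in the container.
echo "MYKEY1=$MYKEY1"
python -c "import os; print(os.environ.get('MYKEY2'))"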
Secret values are similar to environment variables, but their values are pre-stored on the platform, so they do not appear in plain text in your job configuration. For example, you might want to keep your Hugging Face Hub token as a secret on the Lepton platform, which you can do via:
lep secret create --name HF_TOKEN --value <your-huggingface-hub-token>
After this, you can pass the secret value to the job using the --secret flag:
lep job create --name mypy \
--container-image default/lepton:photon-py3.11-runner-0.21.0 \
--secret HF_TOKEN \
--command "echo 1; sleep 5" \
--resource-shape cpu.small
You can also store multiple secret values and specify which one to use with the --secret flag, like the following:
lep secret create --name ALICE_HF_TOKEN --value <alice-s-huggingface-hub-token>
lep secret create --name BOB_HF_TOKEN --value <bob-s-huggingface-hub-token>
lep job create --name mypy \
--container-image default/lepton:photon-py3.11-runner-0.21.0 \
--command "echo 1; sleep 5" \
--resource-shape cpu.small \
--secret HF_TOKEN=ALICE_HF_TOKEN # use Alice's token
Inside the job, the secret value will be available as an environment variable with the same name as the secret. In both cases above, it is available as HF_TOKEN, with the value being the corresponding Hugging Face token you stored.
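Inside the job container the secret then behaves like a regular environment variable. As a minimal sketch (a hypothetical command that only checks the token is present without printing it):
# Verify the secret was injected without echoing its value.
if [ -n "$HF_TOKEN" ]; then echo "HF_TOKEN is set"; else echo "HF_TOKEN is missing"; fi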
Predefined environment variables: your own environment variables should not start with the prefix LEPTON_, as this prefix is reserved for predefined environment variables. The following environment variables are predefined and will be available in the job:
LEPTON_JOB_NAME: The name of the job
LEPTON_RESOURCE_ACCELERATOR_TYPE: The resource accelerator type of the job
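For example, a job command could use these to label its output (a small sketch):
# Print which job and accelerator type this worker is running on.
echo "job: $LEPTON_JOB_NAME, accelerator: $LEPTON_RESOURCE_ACCELERATOR_TYPE"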
File system mount
When you launch a job, you can mount a file system to it. Lepton provides a serverless file system that is mounted into the job like a local POSIX file system, behaving much like an NFS volume. The file system is useful for storing data files and models that are not included in the job image, or for persisting files across jobs. To read more about the file system specifics, check out the File System documentation.
To mount a file system to a job, you can use the --mount flag with the lep job create command:
lep job create --name mypy \
--container-image default/lepton:photon-py3.11-runner-0.21.0 \
--mount /:/leptonfs \
--command "echo 1; sleep 5" \
--resource-shape cpu.small
This will mount the root of the Lepton file system (/) to the job, making it accessible at /leptonfs in the job container. You can operate on the file system as if it were a local file system, and the files are persisted across jobs.
Make sure that you are not mounting the file system over system folders in the job, such as /etc, /usr, or /var. Also make sure that the mounted path does not already exist in the container image. Both cases may cause conflicts with the guest operating system. Lepton will make a best effort to prevent you from mounting the file system to these folders, and we recommend you double-check the mounted path.
As a general rule, similar to other distributed or network file systems, you should avoid concurrent writes and other operations that may lead to race conditions. Consider using UUIDs or other mechanisms to avoid conflicts.
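For example, a job command could write its results under a unique per-run file name instead of a shared path (a sketch; the /leptonfs/results directory is hypothetical):
# Write output to a unique file so concurrent workers do not overwrite each other.
OUT_DIR=/leptonfs/results
mkdir -p "$OUT_DIR"
RUN_ID=$(cat /proc/sys/kernel/random/uuid)
echo "result for $LEPTON_JOB_NAME" > "$OUT_DIR/result-$RUN_ID.txt"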
Other supported configurations
Detailed descriptions of these options can be viewed by running lep job create -h or by referring to the Lepton CLI documentation.
--container-port
--max-failure-retry
--max-job-failure-retry
--image-pull-secrets
--intra-job-communication
--ttl-seconds-after-finished
--log-collection
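As a sketch of how these compose with the flags shown earlier (value formats follow the corresponding job spec fields, e.g. ttl_seconds_after_finished is a number of seconds; check lep job create -h for the authoritative syntax):
# Hypothetical example: automatically clean up the job one hour after it finishes.
lep job create --name mypy \
--container-image default/lepton:photon-py3.11-runner-0.21.0 \
--command "echo 1; sleep 5" \
--resource-shape cpu.small \
--ttl-seconds-after-finished 3600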
Advanced Topics
Node groups
Enterprise users who have reserved resources on Lepton can specify the node group where the job will be launched. This can be done using the --node-group flag:
lep job create --name mypy \
--container-image default/lepton:photon-py3.11-runner-0.21.0 \
--node-group mynodegroup \
--command "echo 1; sleep 5" \
--resource-shape cpu.small
where mynodegroup is a node group on which your resources are reserved. The job will be launched on the resources of that node group.
Examples
Example 1: Distributed training with PyTorch
Here is an example of running a distributed PyTorch job with 2 workers, using a Python file train.py and a shell script train.sh:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.distributed as dist
from torchvision import datasets, transforms
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


class MNISTModel(nn.Module):
    def __init__(self):
        super(MNISTModel, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


def train(rank, world_size):
    print(f"Running on rank {rank}.")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    train_loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    model = MNISTModel().to(rank)
    model = DDP(model, device_ids=[rank])
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    model.train()
    for epoch in range(1, 11):
        sampler.set_epoch(epoch)
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(rank), target.to(rank)
            optimizer.zero_grad()
            output = model(data)
            loss = F.nll_loss(output, target)
            loss.backward()
            optimizer.step()
            if batch_idx % 10 == 0:
                print(f"Train Epoch: {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)} ({100. * batch_idx / len(train_loader):.0f}%)]\tLoss: {loss.item():.6f}")

    if rank == 0:
        torch.save(model.module.state_dict(), "mnist_model.pth")
        print("Model saved as mnist_model.pth")

    dist.destroy_process_group()


def main():
    world_size = torch.cuda.device_count()
    torch.multiprocessing.spawn(train, args=(world_size,), nprocs=world_size, join=True)


if __name__ == "__main__":
    main()
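The content of train.sh is not reproduced here; a minimal sketch, assuming the image already contains PyTorch and torchvision and that the file system is mounted at /mnt as in the job spec below, could be:
#!/bin/bash
# Hypothetical launcher for train.py; adjust paths and dependencies to your image.
set -e

# train.py calls dist.init_process_group with the default env:// rendezvous,
# so MASTER_ADDR / MASTER_PORT must be set. localhost is sufficient here because
# each worker only spawns processes for its own local GPUs.
export MASTER_ADDR=${MASTER_ADDR:-localhost}
export MASTER_PORT=${MASTER_PORT:-29500}

python /mnt/train.py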
Save the training script above as train.py and the launcher script as train.sh. Then upload them to the Lepton file system:
# upload files to the file system under root directory
lep storage upload train.py /train.py
lep storage upload train.sh /train.sh
Then create a job with the following job specs:
{
  "resource_shape": "gpu.a10.6xlarge",
  "container": {
    "command": [
      "/bin/bash",
      "-c",
      "chmod +x /mnt/train.sh; /mnt/train.sh"
    ]
  },
  "num_workers": 2,
  "envs": [],
  "mounts": [
    {
      "path": "/",
      "mount_path": "/mnt"
    }
  ],
  "ttl_seconds_after_finished": 259200,
  "intra_job_communication": true
}
Then create the job via the Lepton CLI with the job specs file:
lep job create -f job_spec.json -n pytorch-job -w 2
Once the job is created, you can view the job in the Lepton Dashboard. You can view the job details and logs in the job details page. You can also access the worker of the job by clicking on the "Terminal" button.
Example 2: Running jobs with a conda environment
Conda is known for its ability to create isolated environments for different projects. Here is an example of running a job with conda environment management:
Let's say we have a Pod running with a conda-installed image and the file system mounted at /mnt. We can create a conda environment with PyTorch and pack it to the file system.
# Create a conda environment with pytorch
conda create -n foo python=3.10.12
conda activate foo
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
# Pack the conda environment to file system
pip install conda-pack
conda pack -n foo -o /mnt/foo.tar.gz
The foo.tar.gz file contains the packed conda environment. You can load it in the job by adding the following commands to the Run Command field during job creation. Make sure conda is installed in the container image.
# Load the conda environment
mkdir -p foo
cp /mnt/foo.tar.gz ./
tar -xzf foo.tar.gz -C foo
# Activate the conda environment
source foo/bin/activate
# Verify that the environment was installed
conda list
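To confirm that the unpacked environment is the one actually in use, a quick sanity check (assuming PyTorch was packed into it) could be:
# Should print the python interpreter inside ./foo and the packed torch version.
python -c "import sys, torch; print(sys.executable, torch.__version__)"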
Once the job is created, you can view the job in the Web UI.