Creating a job
A job corresponds to a one-off task that runs to completion and then stops. This page will go through the basics of creating a job in Lepton. We will then cover the various configurable options available to you when creating a job: environment variables, secrets, file system mounts, and more.
The Basics
You can create a job either from the CLI or from the Dashboard; this page will go through the steps to create a job from the CLI.
To create a job from a container image, we can use lep job create:
lep job create --name mypy \
--container-image default/lepton:photon-py3.11-runner-0.21.0 \
--command "echo 1; sleep 5" \
--resource-shape "cpu.small"
This will create a job called mypy from the default/lepton:photon-py3.11-runner-0.21.0 container image and run a simple bash command on a cpu.small instance, which has one core and 4GB of memory. The job will be created with the default configuration: a single worker, and no environment variables or secrets.
You can also specify the job using a JSON file and supply the configuration at job creation:
lep job create -n mypy \
-f ./lep_job_spec_mypy.json
with a JSON specification like this:
{
  "resource_shape": "cpu.small",
  "container": {
    "image": "default/lepton:photon-py3.11-runner-0.21.0",
    "command": [
      "/bin/bash",
      "-c",
      "echo 1; sleep 5"
    ]
  }
}
After the job is created, you can view its status with
lep job get -n mypy
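If you want to keep an eye on the status until the job finishes, you can combine this with a standard shell tool such as watch (a minimal sketch; adjust the interval as needed):
# Re-run the status check every 5 seconds; press Ctrl-C to stop.
watch -n 5 lep job get -n mypy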
Configurable Options
Number of workers
The --num-workers flag sets the number of workers to use for the job. To create a job with 4 workers:
lep job create --name mypy \
--container-image default/lepton:photon-py3.11-runner-0.21.0 \
--command "echo 1; sleep 5" \
--resource-shape "cpu.small" \
--num-workers 4
Resource shapes
Resource shapes are the instance types that the job will run on. The resource shape is specified with the --resource-shape flag:
lep job create --name mypy \
--container-image default/lepton:photon-py3.11-runner-0.21.0 \
--command "echo 1; sleep 5" \
--resource-shape cpu.small
The common resource shapes are:
cpu.small, cpu.medium, cpu.large, gpu.a10, gpu.h100-sxm, gpu.2xh100-sxm, gpu.4xh100-sxm, gpu.8xh100-sxm
This is not a complete list; enterprise users may have access to additional resource shapes. You can contact Lepton support for more information.
Environment variables and secrets
Environment variables are key-value pairs that are passed to the job. They will be automatically set as environment variables in the job container, so the runtime can refer to them as needed.
To pass environment variables to a job, you can use the --env flag with the lep job create command. For example, to pass the environment variable "MYKEY1" with value "MYVALUE1" and "MYKEY2" with value "MYVALUE2", you can use:
lep job create --name mypy \
--container-image default/lepton:photon-py3.11-runner-0.21.0 \
--env MYKEY1=MYVALUE1 \
--env MYKEY2=MYVALUE2 \
--command "echo 1; sleep 5" \
--resource-shape cpu.small
You can repeat the --env flag to pass multiple environment variables.
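Inside the job container these become regular environment variables that any process can read; a minimal sketch of commands the job could run (assuming the two variables above were passed):
# Both variables are visible to every process started in the container.
echo "MYKEY1=$MYKEY1"
python -c "import os; print(os.environ.get('MYKEY2'))"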
Secret values are similar to environment variables, but their values are pre-stored on the platform, so they do not appear in plain text in your job configuration. For example, you might want to keep your Hugging Face Hub token as a secret on the Lepton platform, which you can do via:
lep secret create --name HF_TOKEN --value <your-huggingface-hub-token>
After this, you can pass the secret value to the job using the --secret flag:
lep job create --name mypy \
--container-image default/lepton:photon-py3.11-runner-0.21.0 \
--secret HF_TOKEN \
--command "echo 1; sleep 5" \
--resource-shape cpu.small
You can also store multiple secret values and specify which one to use with the --secret flag, like the following:
lep secret create --name ALICE_HF_TOKEN --value <alice-s-huggingface-hub-token>
lep secret create --name BOB_HF_TOKEN --value <bob-s-huggingface-hub-token>
lep job create --name mypy \
--container-image default/lepton:photon-py3.11-runner-0.21.0 \
--command "echo 1; sleep 5" \
--resource-shape cpu.small \
--secret HF_TOKEN=ALICE_HF_TOKEN # use Alice's token
Inside the job, the secret value will be available as an environment variable with the same name as the secret. In both cases above, it is available as HF_TOKEN, with the value being the corresponding Hugging Face token you stored.
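Inside the job container the secret then behaves like a regular environment variable. As a minimal sketch (a hypothetical command that only checks the token is present without printing it):
# Verify the secret was injected without echoing its value.
if [ -n "$HF_TOKEN" ]; then echo "HF_TOKEN is set"; else echo "HF_TOKEN is missing"; fi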
Predefined environment variables: your own environment variables should not start with the prefix LEPTON_, as this prefix is reserved for predefined environment variables. The following environment variables are predefined and will be available in the job:
LEPTON_JOB_NAME: The name of the job
LEPTON_RESOURCE_ACCELERATOR_TYPE: The resource accelerator type of the job
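For example, a job command could use these to label its output (a small sketch):
# Print which job and accelerator type this worker is running on.
echo "job: $LEPTON_JOB_NAME, accelerator: $LEPTON_RESOURCE_ACCELERATOR_TYPE"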
File system mount
When you launch a job, you can mount a file system to it. Lepton provides a serverless file system that is mounted into the job like a local POSIX file system, behaving much like an NFS volume. The file system is useful for storing data files and models that are not included in the job image, or for persisting files across jobs. To read more about the file system specifics, check out the File System documentation.
To mount a file system to a job, you can use the --mount flag with the lep job create command:
lep job create --name mypy \
--container-image default/lepton:photon-py3.11-runner-0.21.0 \
--mount /:/leptonfs \
--command "echo 1; sleep 5" \
--resource-shape cpu.small
This will mount the root of the Lepton file system (/) to the job, making it accessible at /leptonfs in the job container. You can operate on the file system as if it were a local file system, and the files are persisted across jobs.
Make sure that you are not mounting the file system over system folders in the job, such as /etc, /usr, or /var. Also make sure that the mounted path does not already exist in the container image. Both cases may cause conflicts with the guest operating system. Lepton will make a best effort to prevent you from mounting the file system to these folders, and we recommend you double-check the mounted path.
As a general rule, similar to other distributed or network file systems, you should avoid concurrent writes and other operations that may lead to race conditions. Consider using UUIDs or other mechanisms to avoid conflicts.
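For example, a job command could write its results under a unique per-run file name instead of a shared path (a sketch; the /leptonfs/results directory is hypothetical):
# Write output to a unique file so concurrent workers do not overwrite each other.
OUT_DIR=/leptonfs/results
mkdir -p "$OUT_DIR"
RUN_ID=$(cat /proc/sys/kernel/random/uuid)
echo "result for $LEPTON_JOB_NAME" > "$OUT_DIR/result-$RUN_ID.txt"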
Other supported configurations
Detailed descriptions of these options can be viewed by running lep job create -h or by referring to the Lepton CLI documentation.
--container-port
--max-failure-retry
--max-job-failure-retry
--image-pull-secrets
--intra-job-communication
--ttl-seconds-after-finished
--log-collection
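As a sketch of how these compose with the flags shown earlier (value formats follow the corresponding job spec fields, e.g. ttl_seconds_after_finished is a number of seconds; check lep job create -h for the authoritative syntax):
# Hypothetical example: automatically clean up the job one hour after it finishes.
lep job create --name mypy \
--container-image default/lepton:photon-py3.11-runner-0.21.0 \
--command "echo 1; sleep 5" \
--resource-shape cpu.small \
--ttl-seconds-after-finished 3600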
Advanced Topics
Node groups
Enterprise users who have reserved resources on Lepton can specify the node group where the job will be launched. This can be done using the --node-group flag:
lep job create --name mypy \
--container-image default/lepton:photon-py3.11-runner-0.21.0 \
--node-group mynodegroup \
--command "echo 1; sleep 5" \
--resource-shape cpu.small
where mynodegroup is a node group on which your resources are reserved. The job will be launched on the resources of that node group.
Examples
Example 1: Distributed training with PyTorch
Here is an example of running a distributed PyTorch job with 2 workers, using a Python file train.py and a shell script train.sh:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.distributed as dist
from torchvision import datasets, transforms
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


class MNISTModel(nn.Module):
    def __init__(self):
        super(MNISTModel, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


def train(rank, world_size):
    print(f"Running on rank {rank}.")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    train_loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    model = MNISTModel().to(rank)
    model = DDP(model, device_ids=[rank])
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    model.train()
    for epoch in range(1, 11):
        sampler.set_epoch(epoch)
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(rank), target.to(rank)
            optimizer.zero_grad()
            output = model(data)
            loss = F.nll_loss(output, target)
            loss.backward()
            optimizer.step()
            if batch_idx % 10 == 0:
                print(f"Train Epoch: {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)} ({100. * batch_idx / len(train_loader):.0f}%)]\tLoss: {loss.item():.6f}")

    if rank == 0:
        torch.save(model.module.state_dict(), "mnist_model.pth")
        print("Model saved as mnist_model.pth")

    dist.destroy_process_group()


def main():
    world_size = torch.cuda.device_count()
    torch.multiprocessing.spawn(train, args=(world_size,), nprocs=world_size, join=True)


if __name__ == "__main__":
    main()
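The content of train.sh is not reproduced here; a minimal sketch, assuming the image already contains PyTorch and torchvision and that the file system is mounted at /mnt as in the job spec below, could be:
#!/bin/bash
# Hypothetical launcher for train.py; adjust paths and dependencies to your image.
set -e

# train.py calls dist.init_process_group with the default env:// rendezvous,
# so MASTER_ADDR / MASTER_PORT must be set. localhost is sufficient here because
# each worker only spawns processes for its own local GPUs.
export MASTER_ADDR=${MASTER_ADDR:-localhost}
export MASTER_PORT=${MASTER_PORT:-29500}

python /mnt/train.py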
Save the training script above as train.py and the launcher script as train.sh. Then upload them to the Lepton file system:
# upload files to the file system under root directory
lep storage upload train.py /train.py
lep storage upload train.sh /train.sh
Then create a job with the following job specs:
{
  "resource_shape": "gpu.a10.6xlarge",
  "container": {
    "command": [
      "/bin/bash",
      "-c",
      "chmod +x /mnt/train.sh; /mnt/train.sh"
    ]
  },
  "num_workers": 2,
  "envs": [],
  "mounts": [
    {
      "path": "/",
      "mount_path": "/mnt"
    }
  ],
  "ttl_seconds_after_finished": 259200,
  "intra_job_communication": true
}
Then create the job via the Lepton CLI with the job specs file:
lep job create -f job_spec.json -n pytorch-job -w 2
Once the job is created, you can view the job in the Lepton Dashboard. You can view the job details and logs in the job details page. You can also access the worker of the job by clicking on the "Terminal" button.
Example 2: Running jobs with a conda environment
Conda is known for its ability to create isolated environments for different projects. Here is an example of running a job with conda environment management:
Let's say we have a Pod running with a conda-installed image and the file system mounted at /mnt. We can create a conda environment with PyTorch and pack it to the file system.
# Create a conda environment with pytorch
conda create -n foo python=3.10.12
conda activate foo
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
# Pack the conda environment to file system
pip install conda-pack
conda pack -n foo -o /mnt/foo.tar.gz
The foo.tar.gz file contains the packed conda environment. You can load it in the job by adding the following commands to the Run Command field during job creation. Make sure conda is installed in the container image.
# Load the conda environment
mkdir -p foo
cp /mnt/foo.tar.gz ./
tar -xzf foo.tar.gz -C foo
# Activate the conda environment
source foo/bin/activate
# Verify that the environment was installed
conda list
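To confirm that the unpacked environment is the one actually in use, a quick sanity check (assuming PyTorch was packed into it) could be:
# Should print the python interpreter inside ./foo and the packed torch version.
python -c "import sys, torch; print(sys.executable, torch.__version__)"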
Once the job is created, you can view the job in the Web UI.