Creating a job

A job corresponds to a one-off task that runs to completion and then stops.

This page covers the basics of creating a job in Lepton and the options you can configure when doing so: environment variables, secrets, file system mounts, and more.

Create Job in Dashboard

Navigate to the create job page, which is shown in the following image.

[Image: create job page]

Configure Options

Resource

  • Resource shape: The instance type that the job runs on. Select from a variety of CPU and GPU shapes.
  • Node group: The node group that the job is launched on. Defaults to the shared node group.
  • Number of workers: The number of workers used for the job. Defaults to 1.
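
To make the defaults above concrete, here is a minimal Python sketch that models the three Resource fields as a plain data structure. It is not part of any Lepton SDK: the class name `ResourceOptions`, the field names, and the shape name `example-gpu-shape` are all hypothetical and only mirror the labels on the create job page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceOptions:
    """Hypothetical model of the Resource fields on the create job page."""

    resource_shape: str          # placeholder; pick a real shape in the dashboard
    node_group: str = "shared"   # the page defaults to the shared node group
    num_workers: int = 1         # the page defaults to 1 worker

    def __post_init__(self) -> None:
        # Basic sanity checks for this sketch only.
        if not self.resource_shape:
            raise ValueError("resource_shape must be set")
        if self.num_workers < 1:
            raise ValueError("num_workers must be at least 1")


# A single-worker job that relies on both defaults.
single = ResourceOptions(resource_shape="example-gpu-shape")

# A multi-worker job, for example for distributed training.
distributed = ResourceOptions(resource_shape="example-gpu-shape", num_workers=4)
```

Only the resource shape has to be chosen explicitly in this sketch; the node group and the number of workers fall back to the defaults described above.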

Container

  • Image: The container image used to create the job. You can choose from the list of default images or use your own custom image.
  • Private image registry auth (optional): If you are using a private image, specify the image registry auth.
  • Run Command: The command to run when the container starts.
  • Container Ports: The ports that the container will listen on.
  • Log Collection: Whether to collect logs from the container. By default, this follows the workspace-level setting.
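
As an illustration of the Run Command field, the sketch below shows a small Python entry point that a command such as `python job_main.py 1 2 3` could invoke. The file name `job_main.py`, the `run_task` helper, and the argument format are hypothetical. The only assumption about the platform is the usual container convention that a zero exit status means success and a non-zero status means failure.

```python
import sys


def run_task(numbers):
    """Stand-in for the real work of the job (hypothetical)."""
    return sum(numbers)


def main(argv):
    """Run the task once and return a process exit status."""
    try:
        numbers = [int(arg) for arg in argv]
    except ValueError as exc:
        # Report the problem on stderr and signal failure to the caller.
        print(f"bad input: {exc}", file=sys.stderr)
        return 1
    print(f"result: {run_task(numbers)}")
    return 0


# When saved as a script, end the file with:
#     if __name__ == "__main__":
#         sys.exit(main(sys.argv[1:]))
```

Returning the status from `main` rather than calling `sys.exit` inside it keeps the function easy to test outside a container.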

Advanced

  • Environment Variables: Key-value pairs passed to the job. They are automatically set as environment variables in the job container, so the runtime can refer to them as needed.
  • File System Mounts: When you launch a job, you can mount a file system to it. Lepton provides a serverless file system that is mounted to the job like a local POSIX file system and behaves much like an NFS volume. The file system is useful for storing data files and models that are not included in the job image, or for persisting files across jobs. For details, see the File System documentation.
  • Shared Memory: The size of the shared memory allocated to the container.
  • Max replica failure retry: The maximum number of times to retry a failed replica. Defaults to zero.
  • Max job failure retry: The maximum number of times the entire job is restarted after a failure.
  • Visibility: Controls who can access the job. If set to private, only the creator can access the job. If set to public, all users in the workspace can access it.
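
Inside the job container, environment variables and a mounted file system can be used with ordinary Python calls. The following is a minimal sketch under stated assumptions: the variable names `EPOCHS` and `DATA_DIR`, the fallback path `/mnt/data`, and the checkpoint file layout are hypothetical and should be replaced with the values you configure for your job.

```python
import os
from pathlib import Path


def load_settings(environ=os.environ):
    """Read job settings from environment variables, with fallbacks."""
    return {
        # EPOCHS and DATA_DIR are hypothetical variable names.
        "epochs": int(environ.get("EPOCHS", "1")),
        # /mnt/data is a hypothetical mount path for the file system.
        "data_dir": Path(environ.get("DATA_DIR", "/mnt/data")),
    }


def save_checkpoint(directory: Path, step: int) -> Path:
    """Write a small checkpoint file so later jobs can pick it up."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"checkpoint-{step}.txt"
    path.write_text(f"step={step}\n")
    return path
```

Because the mount behaves like a local POSIX file system, the standard `pathlib` calls are enough here, and files written under the mounted directory can be read by a later job that mounts the same file system.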

Examples

For examples of job creation, job failure diagnosis, and more, refer to the following:

Distributed training with PyTorch
Job Failure Diagnose
Running jobs with conda environment

© 2025 Lepton AI