NanoGPT Training

This is a step-by-step guide on training a NanoGPT model with distributed training on Lepton.

Create Training Job

Navigate to the create job page, where you will see the job configuration form.

  • Job name: We can set it to nanogpt-training.
  • Resource: We need to use H100 GPUs: select H100 x8 and set the worker count to 1.
  • Image: We can use the default image.
  • File system mount: We will mount the whole file system at the /mnt path so that we can leverage the codebase on the shared file system.
  • Run command: Copy the following script into the run command field.
# Download the environment setup script from Lepton's GitHub repository, make it executable, and source it to initialize the environment variables.
wget -O init.sh https://raw.githubusercontent.com/leptonai/scripts/main/lepton_env_to_pytorch.sh
chmod +x init.sh
source init.sh
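# After sourcing, the script is expected to have exported the standard
# PyTorch distributed variables (MASTER_ADDR, NODE_RANK, WORLD_SIZE),
# which the torchrun invocation at the end of this script relies on.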

# Enable verbose NCCL logging to help debug multi-node communication.
export NCCL_DEBUG=INFO

# Print the distributed-training environment variables to verify the setup.
env | grep RANK

# Install the Python dependencies required by nanoGPT.
pip install torch numpy transformers datasets tiktoken wandb tqdm

# Clone the NanoGPT repository
cd /workspace
git clone https://github.com/karpathy/nanoGPT.git
cd nanoGPT

# Prepare the training data: download the Tiny Shakespeare text and encode it
# into train.bin and val.bin with a character-level tokenizer.
python data/shakespeare_char/prepare.py

ngpus=$(nvidia-smi -L | wc -l)
# If this is not a full 8-GPU node, let NCCL auto-detect the network interface.
if [ "${ngpus}" -ne 8 ]; then
    unset NCCL_SOCKET_IFNAME
fi

accum_steps=$((ngpus*WORLD_SIZE*4))
sed -i "s/gradient_accumulation_steps = 1/gradient_accumulation_steps = ${accum_steps}/g" config/train_shakespeare_char.py
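# Note: nanoGPT's train.py divides gradient_accumulation_steps by the total
# number of DDP processes (ngpus * WORLD_SIZE here), so this setting leaves
# each process with 4 accumulation micro-steps per optimizer step.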

torchrun \
    --master_addr ${MASTER_ADDR} \
    --nnodes ${WORLD_SIZE} \
    --node_rank ${NODE_RANK} \
    --nproc_per_node ${ngpus} \
    train.py config/train_shakespeare_char.py
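
To sanity-check the script on a single node before launching a multi-node job, you can use torchrun's --standalone mode in place of the rendezvous flags. A minimal sketch, assuming a machine with at least one GPU and the same nanoGPT checkout:

torchrun --standalone --nproc_per_node ${ngpus} train.py config/train_shakespeare_char.py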

Once the configuration is done, click Create to submit the job. You can then see the job status on the job detail page.

The training job takes about 5 minutes to finish. While it is running, you can check the real-time logs and metrics.
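
When training completes, the checkpoint is written to the out-shakespeare-char directory inside the nanoGPT checkout (the out_dir set in config/train_shakespeare_char.py). If that directory is still reachable afterwards, for example because you cloned the repository onto the mounted file system instead of /workspace, you can generate text from the trained model with nanoGPT's sample.py:

python sample.py --out_dir=out-shakespeare-char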
