NanoGPT Training
This is a step-by-step guide to training a NanoGPT model with distributed training on Lepton.
Create Training Job
Navigate to the create job page, where you can see the job configuration form.
- Job name: We can set it to `nanogpt-training`.
- Resource: We need to use H100 GPUs; select H100 x8 and set the worker count to 1.
- Image: We can use the default image.
- File system mount: We will mount the whole file system to the `/mnt` path, so we can leverage the codebase from the shared file system.
- Run command: Copy the following commands into the run command field.
```bash
# Download the environment setup script from Lepton's GitHub repository, make it executable, and source it to initialize the environment variables.
wget -O init.sh https://raw.githubusercontent.com/leptonai/scripts/main/lepton_env_to_pytorch.sh
chmod +x init.sh
source init.sh
# Enable verbose NCCL logging to make distributed communication easier to debug.
export NCCL_DEBUG=INFO
# Print the RANK-related environment variables to verify the distributed setup.
env | grep RANK
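# Optional sanity check: a minimal sketch assuming init.sh exports MASTER_ADDR,
# NODE_RANK, and WORLD_SIZE (adjust the names if your environment differs).
for required_var in MASTER_ADDR NODE_RANK WORLD_SIZE; do
  if [ -z "${!required_var}" ]; then
    echo "WARNING: ${required_var} is not set" >&2
  fi
done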
# Install NanoGPT's Python dependencies.
pip install torch numpy transformers datasets tiktoken wandb tqdm
# Clone the NanoGPT repository
cd /workspace
git clone https://github.com/karpathy/nanoGPT.git
cd nanoGPT
# Prepare the training data (character-level Shakespeare dataset).
python data/shakespeare_char/prepare.py
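# Optional check, assuming prepare.py writes the tokenized splits as train.bin
# and val.bin under data/shakespeare_char/; list them to confirm they exist.
ls -lh data/shakespeare_char/*.bin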
# Count the GPUs visible on this node.
ngpus=$(nvidia-smi -L | wc -l)
# On nodes without the full 8 GPUs, let NCCL auto-select the network interface.
if [ "${ngpus}" != "8" ]; then
  unset NCCL_SOCKET_IFNAME
fi
# Scale gradient accumulation with the total number of GPUs across all workers.
accum_steps=$((ngpus*WORLD_SIZE*4))
sed -i "s/gradient_accumulation_steps = 1/gradient_accumulation_steps = ${accum_steps}/g" config/train_shakespeare_char.py
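# Worked example: with ngpus=8 and WORLD_SIZE=1, accum_steps = 8 * 1 * 4 = 32,
# so the sed above changes gradient_accumulation_steps from 1 to 32 in the config.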
# Launch one training process per GPU, coordinated across all worker nodes.
torchrun \
  --master_addr ${MASTER_ADDR} \
  --nnodes ${WORLD_SIZE} \
  --node_rank ${NODE_RANK} \
  --nproc_per_node ${ngpus} \
  train.py config/train_shakespeare_char.py
```
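If you want to smoke-test the training command on a single worker before submitting the multi-node job, a minimal single-node sketch looks like the following; it assumes the same repository checkout and the `ngpus` count computed above, and uses torchrun's standalone mode, so no rendezvous variables are needed.

```bash
# Single-node sketch: --standalone lets torchrun handle rendezvous locally,
# so MASTER_ADDR and NODE_RANK are not required.
torchrun --standalone --nproc_per_node ${ngpus} train.py config/train_shakespeare_char.py
```

With one worker (WORLD_SIZE=1), this launches the same per-GPU processes as the multi-node command above.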
After the configuration is done, click Create to submit the job. You can then see the job status on the job detail page.

The training job takes about 5 minutes to finish. While the job is running, you can check the real-time logs and metrics.

