NanoGPT Training
This is a step-by-step guide to training a NanoGPT model with distributed training on Lepton.
Create Training Job
Navigate to the create job page, where you can see the job configuration form.
- Job name: We can set it to `nanogpt-training`.
- Resource: We need to use H100 GPUs; select H100 x8 and set the worker count to 1.
- Image: We can use the default image.
- File system mount: We will mount the whole file system to the `/mnt` path, so we can leverage the codebase from the shared file system.
- Run command: Copy the following commands into the run command field.
```bash
# Download the environment setup script from Lepton's GitHub repository, make it executable, and source it to initialize the environment variables.
wget -O init.sh https://raw.githubusercontent.com/leptonai/scripts/main/lepton_env_to_pytorch.sh
chmod +x init.sh
source init.sh
# Enable verbose NCCL logging to make distributed communication easier to debug.
export NCCL_DEBUG=INFO
# Print the RANK-related environment variables to verify the distributed setup.
env | grep RANK
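# Optional sanity check: a minimal sketch assuming init.sh exports MASTER_ADDR,
# NODE_RANK, and WORLD_SIZE (adjust the names if your environment differs).
for required_var in MASTER_ADDR NODE_RANK WORLD_SIZE; do
  if [ -z "${!required_var}" ]; then
    echo "WARNING: ${required_var} is not set" >&2
  fi
done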
# Install NanoGPT's Python dependencies.
pip install torch numpy transformers datasets tiktoken wandb tqdm
# Clone the NanoGPT repository
cd /workspace
git clone https://github.com/karpathy/nanoGPT.git
cd nanoGPT
# Prepare the training data (character-level Shakespeare dataset).
python data/shakespeare_char/prepare.py
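# Optional check, assuming prepare.py writes the tokenized splits as train.bin
# and val.bin under data/shakespeare_char/; list them to confirm they exist.
ls -lh data/shakespeare_char/*.bin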
# Count the GPUs visible on this node.
ngpus=$(nvidia-smi -L | wc -l)
# On nodes without the full 8 GPUs, let NCCL auto-select the network interface.
if [ "${ngpus}" != "8" ]; then
  unset NCCL_SOCKET_IFNAME
fi
# Scale gradient accumulation with the total number of GPUs across all workers.
accum_steps=$((ngpus*WORLD_SIZE*4))
sed -i "s/gradient_accumulation_steps = 1/gradient_accumulation_steps = ${accum_steps}/g" config/train_shakespeare_char.py
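# Worked example: with ngpus=8 and WORLD_SIZE=1, accum_steps = 8 * 1 * 4 = 32,
# so the sed above changes gradient_accumulation_steps from 1 to 32 in the config.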
# Launch one training process per GPU, coordinated across all worker nodes.
torchrun \
  --master_addr ${MASTER_ADDR} \
  --nnodes ${WORLD_SIZE} \
  --node_rank ${NODE_RANK} \
  --nproc_per_node ${ngpus} \
  train.py config/train_shakespeare_char.py
```
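If you want to smoke-test the training command on a single worker before submitting the multi-node job, a minimal single-node sketch looks like the following; it assumes the same repository checkout and the `ngpus` count computed above, and uses torchrun's standalone mode, so no rendezvous variables are needed.

```bash
# Single-node sketch: --standalone lets torchrun handle rendezvous locally,
# so MASTER_ADDR and NODE_RANK are not required.
torchrun --standalone --nproc_per_node ${ngpus} train.py config/train_shakespeare_char.py
```

With one worker (WORLD_SIZE=1), this launches the same per-GPU processes as the multi-node command above.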
After the configuration is done, click Create to submit the job. You can then see the job status on the job detail page.

The training job takes about 5 minutes to finish. While the job is running, you can check the real-time logs and metrics.

