NCCL Performance Benchmark Job
This example shows, step by step, how to run an NCCL performance benchmark job on Lepton.
Step 1: Create a New Job
First, create a job on our platform. Head over to the Create Job page.
There are many configurations you can fill in, such as name, resource, and image. You can find a more detailed guide in the documentation for creating a job.

In this example, we will use the following configurations:
- Name: nccl-benchmark, or any name you want.
- Resource: Choose 8xH100 so that the job takes over a whole node for the benchmark, and set the number of workers to 2 so that it runs across two nodes. You will need to select a node group that matches this resource shape; using your own dedicated node group is recommended.
- Image: Choose custom image and fill in nvcr.io/nvidia/pytorch:24.11-py3 as the image name. This is the 24.11 release of NVIDIA's PyTorch container.
- Run command: The command clones the NCCL tests from a remote GitHub repository, builds them, and runs the NCCL performance benchmark. Fill in the command as follows:
set -euox pipefail
trap -- 's=$?; echo >&2 "$0: Error on line "$LINENO": $BASH_COMMAND"; exit $s' ERR
export DEBIAN_FRONTEND=noninteractive
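# Prepend /usr/local/lib; the ":+" form avoids an unbound-variable error under "set -u"
export LD_LIBRARY_PATH="/usr/local/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"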
apt-get -y update
apt-get install -y libibverbs-dev infiniband-diags openmpi-bin openmpi-doc libopenmpi-dev net-tools openssh-server openssh-client
# custom env setup
git clone https://github.com/NVIDIA/nccl-tests.git /tmp/nccl-tests
cd /tmp/nccl-tests
NV_COMPUTE=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits|head -n 1 | tr -d ".")
make -j MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi/ NVCC_GENCODE="-gencode=arch=compute_${NV_COMPUTE},code=sm_${NV_COMPUTE}"
# SSH setup (replace with your own credentials)
mkdir -p /root/.ssh
echo 'YOUR_SSH_PUBLIC_KEY' >> /root/.ssh/authorized_keys
cat <<EOT > /root/.ssh/id_ed25519
YOUR_SSH_PRIVATE_KEY
EOT
chmod 700 /root/.ssh
chmod 600 /root/.ssh/*
if grep -q "^PermitRootLogin" /etc/ssh/sshd_config; then
  sed -i 's/^PermitRootLogin .*/PermitRootLogin yes/' /etc/ssh/sshd_config
else
  echo "PermitRootLogin yes" >> /etc/ssh/sshd_config
fi
if grep -q "^PubkeyAuthentication" /etc/ssh/sshd_config; then
  sed -i 's/^PubkeyAuthentication .*/PubkeyAuthentication yes/' /etc/ssh/sshd_config
else
  echo "PubkeyAuthentication yes" >> /etc/ssh/sshd_config
fi
sed -i 's/^#Port .*/Port 2222/' /etc/ssh/sshd_config
service ssh restart
COMPLETE_FILE="/tmp/lepton-mpi-complete"
if [ "$LEPTON_JOB_WORKER_INDEX" -eq 0 ]; then
  # Head worker: build the hostfile, run the benchmark, then release the peers.
  HOSTFILE="/tmp/hostfile.txt"
  rm -f "$HOSTFILE"
  for i in $(seq 0 $((LEPTON_JOB_TOTAL_WORKERS - 1))); do
    # Wait until the worker's hostname resolves to an IP address.
    IP_ADDRESS=""
    while [ -z "$IP_ADDRESS" ]; do
      IP_ADDRESS=$({ getent hosts -- "${LEPTON_JOB_WORKER_HOSTNAME_PREFIX}-$i-lan.${LEPTON_SUBDOMAIN}" || echo ""; } | cut -d' ' -f1)
      if [ -z "$IP_ADDRESS" ]; then
        sleep 5
      fi
    done
    # Wait until the worker's SSH server accepts connections (at most 60 attempts).
    WAIT_RETRY=60
    while ! ssh -o StrictHostKeyChecking=no -p 2222 "$IP_ADDRESS" -- echo ok 2>&1; do
      echo "waiting for server ping ..."
      WAIT_RETRY=$((WAIT_RETRY-1))
      if [ $WAIT_RETRY -eq 0 ]; then
        echo "timed out waiting host $IP_ADDRESS to be ready"
        exit 1
      fi
      sleep 5
      echo "retry ssh to $IP_ADDRESS"
    done
    echo "$IP_ADDRESS" >> "$HOSTFILE"
  done
  # Launch one MPI process per node; each process drives $LEPTON_RESOURCE_ACCELERATOR_NUM GPUs.
  mpirun -np "$LEPTON_JOB_TOTAL_WORKERS" \
    -x LOGLEVEL=INFO \
    -x NCCL_DEBUG=INFO \
    -x NCCL_IB_DISABLE=0 \
    -x NCCL_IB_HCA="mlx" \
    -pernode \
    --allow-run-as-root \
    --hostfile "$HOSTFILE" \
    -mca plm_rsh_args "-p 2222 -o StrictHostKeyChecking=no" \
    /tmp/nccl-tests/build/all_reduce_perf -b 8 -e 16G -f 2 -g "$LEPTON_RESOURCE_ACCELERATOR_NUM" -c 0
  # Tell every other worker that the benchmark is done.
  {
    read -r # ignore head node itself
    while read -r PEER; do
      ssh -n -o StrictHostKeyChecking=no -p 2222 "$PEER" -- touch "$COMPLETE_FILE"
    done
  } <"$HOSTFILE"
else
  # Other workers: stay alive until the head worker creates the completion file.
  while true; do
    [ ! -f "$COMPLETE_FILE" ] || break
    sleep 5
  done
fi
echo "MPI job completed!"
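Replace YOUR_SSH_PUBLIC_KEY and YOUR_SSH_PRIVATE_KEY with your own credentials. The two must belong to the same key pair, because the head worker uses the private key to log in to the other workers, which authorize the public key.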
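If you do not already have a key pair for this purpose, one way to create one is sketched below. This is an illustration, not a required procedure: the temporary directory and the nccl-benchmark comment are arbitrary choices, and it assumes the OpenSSH ssh-keygen tool is installed on your machine. The key is created without a passphrase because the job logs in non-interactively, so a dedicated throwaway key is a safer choice than a personal one.

```shell
# Hypothetical location for the throwaway key pair; any empty directory works.
KEY_DIR="$(mktemp -d)"

# -t ed25519 : key type matching the id_ed25519 file name used in the run command
# -N ""      : empty passphrase, since the job cannot answer a prompt
# -C         : free-form comment stored in the public key (arbitrary)
ssh-keygen -q -t ed25519 -N "" -C "nccl-benchmark" -f "$KEY_DIR/id_ed25519"

echo "Public key (use in place of YOUR_SSH_PUBLIC_KEY):"
cat "$KEY_DIR/id_ed25519.pub"
echo "Private key file (use its contents in place of YOUR_SSH_PRIVATE_KEY):"
echo "$KEY_DIR/id_ed25519"
```

Delete the directory once the benchmark is done if the key is not needed again.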
Step 2: Run the Job
Click the Create button. You can then follow the job status on the job's detail page. The job proceeds once both replicas are in the "Ready" state, which takes a few minutes.
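Inside the run command, the head worker does its own waiting as well: it polls each peer until it answers over SSH, and gives up after a fixed number of attempts. The pattern can be sketched locally as follows. The retry function and the marker file are hypothetical stand-ins introduced for illustration; the job script inlines the loop and probes with ssh instead.

```shell
# retry MAX_TRIES DELAY_SECONDS COMMAND...
# Runs COMMAND until it succeeds, sleeping DELAY_SECONDS between attempts.
# Returns 1 once MAX_TRIES attempts have failed.
retry() {
  tries=$1
  delay=$2
  shift 2
  while ! "$@"; do
    tries=$((tries - 1))
    if [ "$tries" -le 0 ]; then
      echo "gave up waiting for: $*" >&2
      return 1
    fi
    sleep "$delay"
  done
}

# Local stand-in for "is the peer reachable yet?": a marker file that
# appears after two seconds.
MARKER="$(mktemp -u)"
( sleep 2; touch "$MARKER" ) &
retry 10 1 test -f "$MARKER" && echo "peer ready"
rm -f "$MARKER"
```

Bounding the number of attempts matters here: without it, a worker that never comes up would keep the head worker, and therefore the whole job, waiting indefinitely.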

To check the test result, click the Logs button to view the job logs.
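The benchmark ends its output with a summary line reporting the average bus bandwidth. If you save the log text to a file, that figure can be pulled out with a short command, as sketched here. The file and its contents are stand-ins: the number is a made-up placeholder rather than a measurement, and the sketch assumes the summary line has the form "# Avg bus bandwidth : <number>" as printed by nccl-tests.

```shell
# Stand-in for a saved job log. The value below is a placeholder,
# not a real benchmark result.
LOG_FILE="$(mktemp)"
cat > "$LOG_FILE" <<'EOF'
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 123.456
EOF

# Print only the number that follows "Avg bus bandwidth :".
# Matching anywhere in the line tolerates log prefixes such as timestamps.
avg_busbw=$(sed -n 's/.*Avg bus bandwidth *: *\([0-9.]*\).*/\1/p' "$LOG_FILE")
echo "average bus bandwidth: $avg_busbw"

rm -f "$LOG_FILE"
```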

After the job finishes, the two replicas are terminated automatically and end in the Completed state.
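Both replicas can finish cleanly because of the completion file in the run command: the non-head worker idles until the file appears, and the head worker creates it over SSH once mpirun returns. The handshake can be tried locally with a background process playing the worker. This is a sketch under simplifying assumptions: the temporary path and the two-second sleep are placeholders for /tmp/lepton-mpi-complete and the actual benchmark run.

```shell
# Hypothetical marker path; the job script uses /tmp/lepton-mpi-complete.
COMPLETE_FILE="$(mktemp -u)"

# "Worker" side: block until the marker exists, then exit cleanly.
(
  while [ ! -f "$COMPLETE_FILE" ]; do
    sleep 1
  done
  echo "worker: marker seen, exiting"
) &
WORKER_PID=$!

# "Head" side: the benchmark would run here; afterwards, signal the worker.
sleep 2
touch "$COMPLETE_FILE"

# The worker exits with status 0 once it has seen the marker.
wait "$WORKER_PID"
WORKER_STATUS=$?
echo "worker exit status: $WORKER_STATUS"
rm -f "$COMPLETE_FILE"
```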