Dedicated endpoints

A dedicated endpoint corresponds to a running instance of an AI model, exposed as an HTTP server. Any service can be run as a dedicated endpoint; the most common use case is deploying an AI model exposed through an OpenAPI-described HTTP interface.

Create Dedicated Endpoints

A dedicated endpoint can be created in five different ways:

  1. Create Dedicated LLM Endpoint
  2. Create from Lepton Prebuilt
  3. Create from Custom Models
  4. Create from Container Image
  5. Create from NVIDIA NIM

Create Dedicated LLM Endpoint

Besides the prebuilt templates, you can also create a dedicated endpoint from your custom models by selecting the Create Dedicated LLM Endpoint option. The models can be either your uploaded fine-tuned models or models from the HuggingFace Hub and some other model providers.

HuggingFace

You can create a dedicated endpoint from the models hosted on the HuggingFace Hub.


File system

You can upload your own models to your workspace and create a dedicated endpoint from them.

For a more detailed guide to file system management, check out the File System guide.

Create from Lepton Prebuilt

Lepton provides some prebuilt templates for common AI workloads, such as LLM, Stable Diffusion, etc. You can create a dedicated endpoint from these templates with a few clicks in your workspace.


There are several dedicated endpoint templates that can help you get started quickly. These templates allow for simple configuration, including setting environment variables and selecting the expected resource types, so you can launch optimized deployments effortlessly.

Lepton LLM Engine

The LLM Engine by Lepton provides fast and easy deployment of popular open-source LLM models while keeping you in full control. It is built with compatibility for major LLM architectures and supports common optimization techniques such as dynamic batching, quantization, speculative execution, and more. We also provide an OpenAI-compatible API for deploying your own fine-tuned LLM models, so you can use it as a drop-in replacement for the OpenAI API.
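
Because the engine exposes an OpenAI-compatible API, you can point the standard OpenAI Python client at a dedicated LLM endpoint. The sketch below is illustrative only: the base URL, model name, and token are placeholders, and the exact base path for your endpoint is shown on its details page.

# Illustrative sketch: calling a Lepton LLM endpoint through its OpenAI-compatible API.
# The base_url, api_key, and model values are placeholders; copy the real values
# from your endpoint's details page in the dashboard.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-endpoint-url/api/v1",  # placeholder; confirm the exact base path
    api_key="your-api-token",                     # workspace token or endpoint token
)

response = client.chat.completions.create(
    model="your-model-name",  # placeholder model identifier
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)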

Stable Diffusion Farm

Stable Diffusion Farm is set up for running Stable Diffusion workloads with your choice of models. This template will allow you to:

  • Set up image generation APIs such as text-to-image generation, image-to-image generation, etc.
  • Use your own choice of models, including LoRA, ckpt, etc. This documentation will cover configurations including adding your choice of models and extensions, and accessing the service via API.

Stable Diffusion WebUI

Stable Diffusion WebUI is a browser-based interface for managing model checkpoints, LoRAs, and configurations for image generation tasks with Stable Diffusion, an AI model that generates images from text prompts or modifies existing images based on text prompts. This documentation will walk you through setting up your environment, adding your choice of models or LoRAs, and generating an image.

Create from Custom Models

You can create a dedicated endpoint from the photons you have built. For example, assume that you have a gpt2 photon pushed to your workspace, which is the HuggingFace implementation of the GPT-2 model and is suitable for running on a small CPU server. You can create a dedicated endpoint from the photon with the following command:

lep deployment create --name mygpt2 --photon gpt2

This will create a deployment, which is also a dedicated endpoint, called mygpt2 from the gpt2 photon. The deployment will be created with the default configuration, which is a single replica, no autoscaling, no environment variables or secrets, and on a default cpu.small instance that has one core and 4GB of memory.

Create from Container Image

You can also create a dedicated endpoint from arbitrary container images that expose a web service. For example, you can create one from a minimal Python 3.10 container image that runs a simple HTTP server with the command python -m http.server [port], or from more sophisticated services:

lep dep create -n pythonserver \
    --container-image python:3.10 \
    --container-command "python3 -m http.server 8080" \
    --public

This will start an HTTP server on port 8080. In this illustrative example, the service is a simple HTTP server that accepts GET requests and serves files from the current directory. The --public flag makes the deployment publicly accessible so you can easily check it. For more details on access control for deployments, see the Access tokens section.
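
Once the deployment is up, you can verify that the server is reachable with a plain HTTP request. Here is a minimal sketch using Python's standard library, with a placeholder URL that you should replace with the endpoint URL shown on the deployment's details page:

# Quick reachability check for the public deployment created above.
# The URL is a placeholder; use the endpoint URL from the dashboard.
import urllib.request

url = "https://your-endpoint-url/"
with urllib.request.urlopen(url) as resp:
    print(resp.status)         # expect 200 from python -m http.server
    print(resp.read()[:200])   # first bytes of the HTML directory listing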

Create from NVIDIA NIM

For enhanced performance and seamless compatibility, NVIDIA-optimized models from the NIM container registry are also available on Lepton AI.

In your workspace, select the Create from NVIDIA NIM option to see a list of available NVIDIA-optimized models.


Configuration Options

Lepton provides a number of configuration options to customize your dedicated endpoint.

Environment variables and secrets

Environment variables are key-value pairs that will be passed to the deployment. All the variables will be automatically injected into the deployment container, so the runtime can refer to them as needed.

Secret values are similar to environment variables, but their values are pre-stored in the platform so they are not exposed in the development environment. You can learn more about secrets here.

You can also store multiple secret values and specify which one to use with the --secret flag. Inside the deployment, the secret value will be available as an environment variable with the same name as the secret.


The following environment variables are predefined and will be available in the deployment:

  • LEPTON_DEPLOYMENT_NAME: The name of the deployment
  • LEPTON_PHOTON_NAME: The name of the photon used to create the deployment
  • LEPTON_PHOTON_ID: The ID of the photon used to create the deployment
  • LEPTON_WORKSPACE_ID: The ID of the workspace where the deployment is created
  • LEPTON_WORKSPACE_TOKEN: The workspace token of the deployment, if --include-workspace-token is passed
  • LEPTON_RESOURCE_ACCELERATOR_TYPE: The resource accelerator type of the deployment
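
Inside the running container, these predefined variables, along with any environment variables and secrets you configured, can be read like ordinary environment variables. Here is a minimal sketch in Python, where MY_API_KEY is a hypothetical secret name used only for illustration:

# Read Lepton-provided variables and a configured secret from the environment.
import os

deployment_name = os.environ.get("LEPTON_DEPLOYMENT_NAME")
workspace_id = os.environ.get("LEPTON_WORKSPACE_ID")
print(f"Running as {deployment_name} in workspace {workspace_id}")

# A secret is exposed under the same name as the secret itself.
# "MY_API_KEY" is a hypothetical secret name used for illustration.
my_api_key = os.environ.get("MY_API_KEY")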

Access tokens

By default, your dedicated endpoints are protected with your workspace token, meaning that only requests with the workspace token in the header are allowed to access the endpoint. If you want to allow public access to the endpoint, you can toggle the Enable public access option; this will create an endpoint whose HTTP service is publicly accessible.


Alternatively, you can specify a token that can be used in addition to the workspace token to access the endpoint. This will create an endpoint that can be accessed with either the workspace token or the token you specified.


Resource shapes

Resource shapes are the instance types that the endpoint will be run on. You can select the resource shape in the endpoint creation page.


You can select from a variety of CPU and GPU shapes, as well as the number of GPUs you want to use. You will also see the pricing of the selected resource shape, calculated by the number of minutes you are billed for.

Autoscaling

By default, Lepton creates your endpoints with a single replica that automatically scales down to zero after 1 hour of inactivity. You can override this behavior with the three other autoscaling options and their related flags.

  1. Scale replicas to zero based on no-traffic timeout: You can specify the initial number of replicas and the no-traffic timeout (in seconds).
  2. Autoscale replicas based on traffic QPM: You can specify the minimum and maximum number of replicas, and the target QPM. You can also specify the query methods and query paths to include in the traffic metrics.
  3. Autoscale replicas based on GPU utilization: You can specify the minimum and maximum number of replicas, and the target GPU utilization.

File system mount

Lepton provides a serverless file system that is mounted into the deployment like a local POSIX file system, behaving much like an NFS volume. The file system is useful for storing data files and models that are not included in the deployment image, or for persisting files across deployments. To read more about the file system specifics, check out the File System documentation.

To mount a file system to a deployment, you can click on the Add File System Mount button in the File System Mount section.


For example, mounting the root of the Lepton file system (/) with the mount path /leptonfs makes it accessible at /leptonfs in every deployment replica. You can operate on the file system as if it were a local file system, and the files are persisted across deployments.

Make sure that you are not mounting the file system onto system folders in the deployment, such as /etc, /usr, or /var. Also make sure that the mounted path does not already exist in the container image. Both cases may cause conflicts with the guest operating system. Lepton makes a best effort to prevent you from mounting the file system to these folders, but we recommend you double-check the mounted path.
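
Inside a replica, the mounted path behaves like a regular directory. Here is a minimal sketch in Python, assuming the file system is mounted at /leptonfs as in the example above:

# Write and read a file on the mounted file system; it persists across deployments.
from pathlib import Path

mount = Path("/leptonfs")                  # mount path from the example above
out_file = mount / "notes" / "hello.txt"
out_file.parent.mkdir(parents=True, exist_ok=True)
out_file.write_text("persisted via the Lepton file system\n")
print(out_file.read_text())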

Advanced Topics


Deployment visibility

By default, the endpoint is visible to all the team members in your workspace. If you want to restrict the visibility of the deployment, you can simply switch the Visibility option in the Advanced section.

This will make the endpoint only visible to the user who created it and the admin. Other users in the workspace will not be able to see it.

You can also update the visibility of the deployment later in the deployment details page.

Shared memory

You can specify the shared memory size for the deployment in the Advanced section.

Healthcheck initial delay seconds

By default, there are two types of probes configured:

  • Readiness Probe: Starts with an initial delay of 5 seconds and checks every 5 seconds. It requires 1 successful check to mark the container as ready, but will mark the container as not ready after 10 consecutive failures. This probe ensures the service is ready to accept traffic.
  • Liveness Probe: Has a longer initial delay of 600 seconds (10 minutes) and checks every 5 seconds. It requires 1 successful check to mark the container as healthy, but will only mark the container as unhealthy after 12 consecutive failures. This probe ensures the service remains healthy during operation.

As some endpoints might need a longer time to start the container and initialize the model, you can specify custom initial delay seconds to meet your requirements: simply select the Custom option and input the delay in seconds.

Require approval to make replicas ready

You can specify whether to require approval to make replicas ready in the Advanced section. By default, the replicas will be ready immediately.

Pulling metrics from replica

You can specify whether to pull metrics from the replicas in the Advanced section. By default, the metrics will be pulled from the replicas.

Header-based replica routing

You can specify the header-based replica routing in the Advanced section. By default, the requests will be load balanced across all the replicas.

Private image registry auth (optional)

If your container image is hosted in a private image registry, you can specify the registry auth in the Advanced section.

Log Collection

You can specify whether to collect logs from the replicas in the Advanced section. By default, the option is synced with the workspace setting.

Use Dedicated Endpoints

This document will go through the basics of using your dedicated endpoints in Lepton. If you do not have a dedicated endpoint yet, you can learn how to create one in the create dedicated endpoints guide.

Calling your endpoint

After your dedicated endpoint is created, you can start making API requests to it from anywhere.

Through client SDK

You can use our client SDK to make API requests to your endpoint easily.

  1. Install client SDK

First, you need to install the Python SDK in your own project:

pip install leptonai
  2. Get API token

You can find your workspace API token in the workspace settings page. Set this token as the LEPTON_API_TOKEN environment variable.

export LEPTON_API_TOKEN=<your-api-token>
  3. Make API Request
import os
from leptonai.client import Client

# Read the workspace token set in the previous step
api_token = os.environ.get('LEPTON_API_TOKEN')

# Connect to your dedicated endpoint by workspace ID and endpoint name
client = Client("your-workspace-id", "your-endpoint-name", token=api_token)

# Run inference against the endpoint (example inputs for a text-generation model)
result = client.run(
  inputs="I enjoy walking with my cute dog",
  max_new_tokens=50,
  do_sample=True,
  top_k=50,
  top_p=0.95
)

print(result)

Through endpoint URL

You can also call the endpoint URL directly with HTTP requests.

  1. Get your endpoint URL and API token

You can get the endpoint URL from the endpoint details page in the dashboard.

  2. Make API Request

Now you can call your endpoint directly with the URL and API token:

curl -X 'POST' \
  'your-endpoint-url' \
  -H 'Content-Type: application/json' \
  -H 'accept: application/json' \
  -H "Authorization: Bearer your-api-token" \
  -d '{
  "inputs": "I enjoy walking with my cute dog",
  "max_new_tokens": 50,
  "do_sample": true,
  "top_k": 50,
  "top_p": 0.95
}'

Logs

Logs are useful for debugging your endpoint. You can view the logs either in the dashboard or through the CLI. For example, you can get the logs through the CLI like this, and the logs will be streamed to your terminal:

lep deployment log --name mygpt2

Metrics

You can easily view metrics charts for your endpoint in the dashboard. Currently, QPS, latency, GPU memory usage, and GPU temperature metrics are supported.

Billing

Dedicated endpoints are billed by the resource usage of the endpoint, calculated by the minute.

// Basic billing formula:
Cost = Machine unit price * Machine hours * Replica count
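
For example, using hypothetical numbers, a machine priced at $2.00 per hour running 2 replicas for 90 minutes (1.5 machine hours) would cost 2.00 * 1.5 * 2 = $6.00. Actual prices depend on the machine type you select.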

You can find the details of each machine type on the Lepton AI pricing page.
