Serverless Endpoints

Built on top of the Lepton platform, we provide a variety of serverless endpoints for popular open source models. You can experiment with the models directly on our Built With Lepton page, or use the APIs to integrate them into your own application.

Sample Usage of Llama2-7b

Here is a simple example of chatting with our Llama2-7b model through OpenAI's Python SDK.

1 Install dependencies for using serverless endpoints

Our LLM serverless endpoints are fully compatible with OpenAI's API spec, so you can call them with the OpenAI Python SDK and a Lepton AI API token. To begin, install the OpenAI Python SDK.

pip install -U openai

2 Import dependencies and set up the ENV variables

Simply point the client at the service hosted by Lepton and supply your API token, and the setup is done.
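The client reads the token from the LEPTON_API_TOKEN environment variable. A quick way to set it in your shell (the value shown is a placeholder, not a real token):

```shell
# Export your Lepton AI API token; replace the placeholder with your own.
export LEPTON_API_TOKEN="your-token-here"
```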

import os
import openai

# Point the OpenAI client at the Lepton-hosted endpoint and authenticate
# with the token from the LEPTON_API_TOKEN environment variable.
client = openai.OpenAI(
    base_url="https://llama2-7b.lepton.run/api/v1/",
    api_key=os.environ.get("LEPTON_API_TOKEN")
)

3 Make chat completion requests

Now let's make a streaming chat completion request to the model.

completion = client.chat.completions.create(
    model="llama2-7b",
    messages=[
        {"role": "user", "content": "say hello"},
    ],
    max_tokens=128,
    stream=True,
)

for chunk in completion:
    if not chunk.choices:  # some chunks carry no choices; skip them
        continue
    content = chunk.choices[0].delta.content
    if content:  # a delta's content can be None (e.g. the final chunk)
        print(content, end="")
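If you also want the assembled reply as a single string, accumulate the deltas as you print them. A minimal sketch of that pattern, using a hard-coded list of delta contents in place of real API chunks:

```python
# Simulated delta contents standing in for the streamed chunks above;
# a delta's content may be None (e.g. role-only or final chunks).
deltas = ["Hel", "lo", None, "!"]

reply = ""
for content in deltas:
    if content:  # skip None/empty deltas, as in the streaming loop
        print(content, end="")
        reply += content

print()
print("full reply:", reply)  # → full reply: Hello!
```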

This is a simple example of making a completion request to the model. As mentioned above, other state-of-the-art models are also available; see Serverless endpoints for the full list.

Usage and billing

Serverless endpoint usage is shown under Dashboard - Settings - Billing, and is billed by the number of tokens processed.
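Token-based billing means the cost of a request is proportional to the total tokens processed, prompt plus completion. A sketch of the arithmetic; the per-token price below is a made-up placeholder, not an actual Lepton rate, so check the Pricing Page for real numbers:

```python
# Hypothetical price -- see the Pricing Page for each model's actual rate.
PRICE_PER_1M_TOKENS_USD = 0.10

def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Billing is proportional to total tokens processed."""
    total_tokens = prompt_tokens + completion_tokens
    return total_tokens / 1_000_000 * PRICE_PER_1M_TOKENS_USD

print(f"${estimate_cost(1200, 128):.6f}")  # → $0.000133
```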

For the pricing of each model, please refer to the Pricing Page.

Rate Limit

The rate limit for the serverless endpoints is 10 requests per minute across all models under the Basic Plan. If you need a higher rate limit, please contact us.
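If you occasionally exceed the limit, retrying with exponential backoff is a common pattern. A minimal sketch; `send_request` is a hypothetical stand-in for your actual API call, assumed here to raise an error mentioning HTTP 429 when rate-limited:

```python
import time

def call_with_backoff(send_request, max_retries=5, base_delay=1.0):
    """Retry `send_request` on rate-limit errors with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return send_request()
        except RuntimeError as exc:  # stand-in for an HTTP 429 error type
            if "429" not in str(exc) or attempt == max_retries - 1:
                raise  # not a rate-limit error, or retries exhausted
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```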

Lepton AI

© 2024