
☄️ RunGPT

RunGPT: An open-source cloud-native large-scale multimodal model serving framework

"A playful and whimsical vector art of a Stochastic Tigger, wearing a t-shirt with a "GPT" text printed logo, surrounded by colorful geometric shapes. –ar 1:1 –upbeta"

— Prompts and logo art were produced with PromptPerfect & Stable Diffusion X


RunGPT is an open-source, cloud-native serving framework for large language models (LLMs). It is designed to simplify the deployment and management of large language models on a distributed cluster of GPUs. We aim to make it a one-stop solution: a centralized, accessible place that gathers LLM optimization techniques and makes them easy to use for everyone.


Features

RunGPT provides the following features to make it easy to deploy and serve large language models (LLMs) at scale:

  • Scalable architecture for handling high traffic loads
  • Optimized for low-latency inference
  • Automatic model partitioning and distribution across multiple GPUs
  • Centralized model management and monitoring
  • REST API for easy integration with existing applications

Updates

  • 2023-08-22: OpenGPT has been renamed to RunGPT. We have also released the first version v0.1.0 of RunGPT. You can install it with pip install rungpt.
  • 2023-05-12: 🎉 We have released the first version v0.0.1 of OpenGPT. You can install it with pip install open_gpt_torch.

Get Started

Installation

Install the package with pip:

pip install rungpt

Quickstart

import run_gpt

model = run_gpt.create_model(
    'stabilityai/stablelm-tuned-alpha-3b', device='cuda', precision='fp16'
)

prompt = "The quick brown fox jumps over the lazy dog."

output = model.generate(
    prompt,
    max_length=100,
    temperature=0.9,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    do_sample=True,
    num_return_sequences=1,
)

We use stabilityai/stablelm-tuned-alpha-3b as the example model because it is relatively small and fast to download.

Warning In the above example, we use precision='fp16' to reduce memory usage and speed up inference, with some loss of accuracy on text generation tasks. You can also use precision='fp32' if you prefer better generation quality at the cost of more memory and slower inference.

Note It usually takes a while (several minutes) to download and load the model into memory the first time.
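Most of that time is spent downloading the weights from the Hugging Face Hub. If you want to separate the download from the first load, you can pre-fetch the weights into the local Hub cache with huggingface_hub (a general Hub technique, not a RunGPT API; this assumes RunGPT loads models through the standard Hub cache):

```python
from huggingface_hub import snapshot_download

# Pre-fetch the model weights into the local Hugging Face cache so that
# the first run_gpt.create_model call only needs to load them from disk.
snapshot_download(repo_id="stabilityai/stablelm-tuned-alpha-3b")
```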

In most cases of large model serving, the model cannot fit into a single GPU. To solve this problem, we also provide a device_map option (supported by the accelerate package) to automatically partition the model and distribute it across multiple GPUs:

model = run_gpt.create_model(
    'stabilityai/stablelm-tuned-alpha-3b', precision='fp16', device_map='balanced'
)

In the above example, device_map="balanced" evenly splits the model across all available GPUs, making it possible for you to serve large models.

Note The device_map option is supported by the accelerate package.
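If you are unsure how the model will fit, a quick way to see what you have to work with is to list the local GPUs and their memory before picking a device_map strategy (purely illustrative; this snippet is plain PyTorch, not part of the RunGPT API):

```python
import torch

# List every visible CUDA device and its total memory.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")
```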

See examples on how to use rungpt with different models. 🔥

Build a model server in one line

To do so, you can use the serve command:

rungpt serve stabilityai/stablelm-tuned-alpha-3b --precision fp16 --device_map balanced

💡 Tip: you can inspect the available options with rungpt serve --help.

This will start a gRPC server and an HTTP server listening on ports 51000 and 52000, respectively. Once the server is ready, you can send requests to it:

import requests

prompt = "Once upon a time,"

response = requests.post(
    "https://localhost:51000/generate",
    json={
        "prompt": prompt,
        "max_length": 100,
        "temperature": 0.9,
        "top_k": 50,
        "top_p": 0.95,
        "repetition_penalty": 1.2,
        "do_sample": True,
        "num_return_sequences": 1,
    },
)
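The endpoint responds with a JSON body in the format shown further below, so a quick way to inspect the result is to print it (standard requests usage):

```python
# Inspect the HTTP status and the returned completion.
print(response.status_code)
print(response.json())
```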

We also provide a Python client (inference-client) to make it easy to interact with the server:

from run_gpt import Client

client = Client()

# connect to the model server
model = client.get_model(endpoint='grpc://0.0.0.0:51000')

prompt = "Once upon a time,"

output = model.generate(
    prompt,
    max_length=100,
    temperature=0.9,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    do_sample=True,
    num_return_sequences=1,
)

The output has the same format as the one from OpenAI's Python API:

{ "id": "18d92585-7b66-4b7c-b818-71287c122c50", 
  "object": "text_completion", 
  "created": 1692610173, 
  "choices": [{"text": "Once upon a time, there was an old man who lived in the forest. He had no children", 
              "finish_reason": "length", 
              "index": 0.0}], 
  "prompt": "Once upon a time,", 
  "usage": {"completion_tokens": 21, "total_tokens": 27, "prompt_tokens": 6}}

For streaming output, first install sseclient-py:

pip install sseclient-py

Then send the request to http://localhost:51000/generate_stream with the same payload:

import sseclient
import requests

prompt = "Once upon a time,"

response = requests.post(
    "https://localhost:51000/generate_stream",
    json={
        "prompt": prompt,
        "max_length": 100,
        "temperature": 0.9,
        "top_k": 50,
        "top_p": 0.95,
        "repetition_penalty": 1.2,
        "do_sample": True,
        "num_return_sequences": 1,
    },
    stream=True,
)
client = sseclient.SSEClient(response)
for event in client.events():
    print(event.data)

The output will be streamed back to you (only the first three events are shown here):

{ "id": "18d92585-7b66-4b7c-b818-71287c122c51", 
  "object": "text_completion", 
  "created": 1692610173, 
  "choices": [{"text": " there", "finish_reason": None, "index": 0.0}], 
  "prompt": "Once upon a time,", 
  "usage": {"completion_tokens": 1, "total_tokens": 7, "prompt_tokens": 6}},
{ "id": "18d92585-7b66-4b7c-b818-71287c122c52", 
  "object": "text_completion", 
  "created": 1692610173, 
  "choices": [{"text": "was", "finish_reason": None, "index": 0.0}], 
  "prompt": None, 
  "usage": {"completion_tokens": 2, "total_tokens": 9, "prompt_tokens": 7}},
{ "id": "18d92585-7b66-4b7c-b818-71287c122c53", 
  "object": "text_completion", 
  "created": 1692610173, 
  "choices": [{"text": "an", "finish_reason": None, "index": 0.0}], 
  "prompt": None, 
  "usage": {"completion_tokens": 3, "total_tokens": 11, "prompt_tokens": 8}}

We also support chat mode, which is useful for interactive applications. The input for chat should be a list of dictionaries, each containing a role and content. For example:

import requests

messages = [
    {"role": "user", "content": "Hello!"},
]

response = requests.post(
    "https://localhost:51000/chat",
    json={
        "messages": messages,
        "max_length": 100,
        "temperature": 0.9,
        "top_k": 50,
        "top_p": 0.95,
        "repetition_penalty": 1.2,
        "do_sample": True,
        "num_return_sequences": 1,
    },
)

The response will be:

{"id": "18d92585-7b66-4b7c-b818-71287c122c57", 
  "object": "chat.completion", 
  "created": 1692610173, 
  "choices": [{"message": {
                            "role": "assistant",
                            "content": "\n\nHello there, how may I assist you today?",
                        }, 
              "finish_reason": "stop", "index": 0.0}], 
  "prompt": "Hello there!", 
  "usage": {"completion_tokens": 12, "total_tokens": 15, "prompt_tokens": 3}}

You can also replace chat with chat_stream to get streaming output.

Cloud-native deployment

You can also deploy the server to a cloud provider like Jina Cloud or AWS. To do so, you can use the deploy command:

Jina Cloud

Using a predefined executor:

rungpt deploy stabilityai/stablelm-tuned-alpha-3b --precision fp16 --device_map balanced --cloud jina --replicas 1

It will give you an HTTP URL and a gRPC URL by default:

https://{random-host-name}-http.wolf.jina.ai
grpcs://{random-host-name}-grpc.wolf.jina.ai
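Assuming the deployed service exposes the same routes as the local server (an assumption; check the deployment output for the exact endpoints), you can call it by swapping the localhost URL for the HTTP URL printed above:

```python
import requests

# Replace this placeholder with the HTTP URL printed by `rungpt deploy`.
endpoint = "https://{random-host-name}-http.wolf.jina.ai"

response = requests.post(
    endpoint + "/generate",
    json={"prompt": "Once upon a time,", "max_length": 100},
)
print(response.json())
```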

AWS

TBD

Benchmark

We have benchmarked different model architectures under different configurations (quantization, torch.compile, paged attention, etc.) with regard to latency, throughput (for both the prefill stage and the whole decoding process), and perplexity.

The benchmarking script is located at scripts/benchmark.py. You can run it to reproduce the results.

Environment Setting

We use a single RTX 3090 (CUDA 11.8) for all benchmarks except Llama-2-13b, which uses 2× RTX 3090. We use:

torch==2.0.1 (without torch.compile) / torch==2.1.0.dev20230803 (with torch.compile)
bitsandbytes==0.41.0
transformers==4.31.0
triton==2.0.0

Model Candidates

  • meta-llama/Llama-2-7b-hf
  • mosaicml/mpt-7b
  • stabilityai/stablelm-base-alpha-7b
  • EleutherAI/gpt-j-6B

Benchmarking Results

  • Latency/throughput for different models (precision: fp16)
| Model_Name | average_prefill_latency (ms/token) | average_prefill_throughput (token/s) | average_decode_latency (ms/token) | average_decode_throughput (token/s) |
| --- | --- | --- | --- | --- |
| meta-llama/Llama-2-7b-hf | 49 | 20.619 | 49.4 | 20.054 |
| meta-llama/Llama-2-13b-hf | 175 | 5.727 | 188.27 | 4.836 |
| mosaicml/mpt-7b | 27 | 37.527 | 28.04 | 35.312 |
| stabilityai/stablelm-base-alpha-7b | 50 | 20.09 | 45.73 | 21.878 |
| EleutherAI/gpt-j-6B | 75 | 13.301 | 76.15 | 11.181 |
  • Latency/throughput for different models using torch.compile (precision: fp16)

Warning torch.compile does not support Flash-Attention-based models like MPT. It also cannot be used in a multi-GPU environment.

| Model_Name | average_prefill_latency (ms/token) | average_prefill_throughput (token/s) | average_decode_latency (ms/token) | average_decode_throughput (token/s) |
| --- | --- | --- | --- | --- |
| meta-llama/Llama-2-7b-hf | 25 | 40.644 | 26.54 | 37.75 |
| meta-llama/Llama-2-13b-hf | - | - | - | - |
| mosaicml/mpt-7b | - | - | - | - |
| stabilityai/stablelm-base-alpha-7b | 44 | 22.522 | 42.97 | 21.413 |
| EleutherAI/gpt-j-6B | 32 | 31.488 | 33.89 | 25.105 |
  • Latency/throughput for different models using quantization (precision: fp16 / bit8 / bit4)
| Model_Name | prefill latency fp16 (ms/token) | prefill latency bit8 | prefill latency bit4 | prefill throughput fp16 (tokens/s) | prefill throughput bit8 | prefill throughput bit4 | decode latency fp16 (ms/token) | decode latency bit8 | decode latency bit4 | decode throughput fp16 (tokens/s) | decode throughput bit8 | decode throughput bit4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| meta-llama/Llama-2-7b-hf | 49 | 301 | 125 | 20.619 | 3.325 | 8.015 | 49.4 | 256.44 | 112.22 | 20.054 | 3.9 | 8.918 |
| meta-llama/Llama-2-13b-hf | 175 | 974 | 376 | 5.727 | 1.027 | 2.662 | 182.27 | 796.32 | 349.93 | 4.836 | 1.144 | 2.662 |
| mosaicml/mpt-7b | 27 | 139 | 86 | 37.527 | 7.222 | 11.6 | 28.04 | 141.04 | 94.22 | 35.312 | 7.021 | 10.507 |
| stabilityai/stablelm-base-alpha-7b | 50 | 164 | 156 | 20.09 | 6.134 | 6.408 | 45.73 | 148.53 | 147.56 | 21.878 | 6.947 | 6.994 |
| EleutherAI/gpt-j-6B | 75 | 368 | 162 | 13.301 | 2.724 | 6.195 | 76.15 | 365.51 | 138.44 | 11.181 | 2.327 | 5.642 |
  • Perplexity for different models using quantization (precision: fp16 / bit8 / bit4)

Notice From this benchmark we can see that quantization does not affect model perplexity much.

| Model_Name | wikitext2 fp16 | wikitext2 bit8 | wikitext2 bit4 | ptb fp16 | ptb bit8 | ptb bit4 | c4 fp16 | c4 bit8 | c4 bit4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| meta-llama/Llama-2-7b-hf | 5.4721 | 5.506 | 5.6437 | 22.9483 | 23.8797 | 25.0556 | 6.9727 | 7.0098 | 7.1623 |
| meta-llama/Llama-2-13b-hf | 4.8837 | 4.9229 | 4.9811 | 27.6802 | 27.9665 | 28.8417 | 6.4677 | 6.4884 | 6.566 |
| mosaicml/mpt-7b | 7.6829 | 7.7256 | 7.9869 | 10.6002 | 10.6743 | 10.9486 | 9.6001 | 9.6457 | 9.879 |
| stabilityai/stablelm-base-alpha-7b | 14.1886 | 14.268 | 15.9817 | 19.2968 | 19.4904 | 21.3513 | 48.222 | 48.3384 | 57.022 |
| EleutherAI/gpt-j-6B | 8.8563 | 8.8786 | 9.0301 | 13.5946 | 13.6137 | 13.784 | 11.7114 | 11.7293 | 11.8929 |
  • Latency/throughput for different models using vllm (precision: fp16)

Warning vllm brings a significant improvement in latency and throughput, but it is not compatible with streaming output, so we haven't released it yet.

| Model_Name | prefill latency vllm (ms/token) | prefill latency baseline | prefill throughput vllm (tokens/s) | prefill throughput baseline | decode latency vllm (ms/token) | decode latency baseline | decode throughput vllm (tokens/s) | decode throughput baseline |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| meta-llama/Llama-2-7b-hf | 29 | 49 | 34.939 | 20.619 | 20.34 | 49.40 | 48.67 | 20.054 |

Contributing

We welcome contributions from the community! To contribute, please submit a pull request following our contributing guidelines.

License

RunGPT is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.