"A playful and whimsical vector art of a Stochastic Tigger, wearing a t-shirt with a "GPT" text printed logo, surrounded by colorful geometric shapes. –ar 1:1 –upbeta"
Prompts and logo art were produced with PromptPerfect & Stable Diffusion X.
RunGPT is an open-source, cloud-native serving framework for large language models (LLMs). It is designed to simplify deploying and managing LLMs on a distributed cluster of GPUs. We aim to make it a one-stop solution: a centralized, accessible place that gathers techniques for optimizing LLMs and makes them easy for everyone to use.
RunGPT provides the following features to make it easy to deploy and serve large language models (LLMs) at scale:
- Scalable architecture for handling high traffic loads
- Optimized for low-latency inference
- Automatic model partitioning and distribution across multiple GPUs
- Centralized model management and monitoring
- REST API for easy integration with existing applications
- 2023-08-22: OpenGPT has been renamed to RunGPT. We have also released the first version, v0.1.0, of RunGPT. You can install it with pip install rungpt.
- 2023-05-12: 🎉 We have released the first version, v0.0.1, of OpenGPT. You can install it with pip install open_gpt_torch.
Install the package with pip:
pip install rungpt
import run_gpt

model = run_gpt.create_model(
    'stabilityai/stablelm-tuned-alpha-3b', device='cuda', precision='fp16'
)

prompt = "The quick brown fox jumps over the lazy dog."

output = model.generate(
    prompt,
    max_length=100,
    temperature=0.9,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    do_sample=True,
    num_return_sequences=1,
)
We use stabilityai/stablelm-tuned-alpha-3b as the example model because it is relatively small and fast to download.
Warning In the above example, we use precision='fp16' to reduce memory usage and speed up inference, at the cost of some accuracy on text generation tasks. You can use precision='fp32' instead if you prefer better accuracy over speed and memory savings.
Note The first run usually takes several minutes, because the model has to be downloaded and loaded into memory.
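To make the precision trade-off more concrete, here is a rough back-of-the-envelope sketch of how much memory the weights alone occupy. The helper name `estimate_weight_memory_gib` and the figure of about 3 billion parameters for a "3b" model are assumptions for illustration; activations, the KV cache, and framework overhead are ignored.

```python
# Rough weight-memory estimate: number of parameters x bytes per parameter.
# This ignores activations, the KV cache, and framework overhead.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def estimate_weight_memory_gib(num_params: int, precision: str) -> float:
    """Return the approximate size of the model weights in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / 1024**3


# Assumption: roughly 3 billion parameters for a "3b" model.
num_params = 3_000_000_000
for precision in ("fp32", "fp16"):
    size = estimate_weight_memory_gib(num_params, precision)
    print(f"{precision}: {size:.1f} GiB")
```

Under these assumptions, fp16 halves the weight footprint compared with fp32, which is often what decides whether a model fits on a single GPU.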
In most large-model serving scenarios, the model does not fit on a single GPU. To solve this problem, we provide a device_map option (supported by the accelerate package) that automatically partitions the model and distributes it across multiple GPUs:
model = run_gpt.create_model(
    'stabilityai/stablelm-tuned-alpha-3b', precision='fp16', device_map='balanced'
)
In the above example, device_map="balanced" splits the model evenly across all available GPUs, making it possible to serve large models.
Note The device_map option is supported by the accelerate package.
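As a toy illustration of what an even split means, the sketch below assigns consecutive layers to devices so that each device holds a near-equal share. The function `balanced_device_map` is hypothetical and is not how accelerate computes its map (accelerate balances by memory use rather than by layer count); it only conveys the idea.

```python
def balanced_device_map(num_layers: int, num_devices: int) -> dict[int, int]:
    """Map each layer index to a device index, in near-equal contiguous chunks."""
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    base, extra = divmod(num_layers, num_devices)
    device_map = {}
    layer = 0
    for device in range(num_devices):
        # The first `extra` devices take one additional layer.
        count = base + (1 if device < extra else 0)
        for _ in range(count):
            device_map[layer] = device
            layer += 1
    return device_map


# Toy example: 10 layers over 4 devices gives chunks of 3, 3, 2 and 2 layers.
print(balanced_device_map(10, 4))
```

Keeping layers contiguous on each device means activations cross a device boundary only once per chunk during a forward pass.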
See examples on how to use rungpt with different models. 🔥
To start a model server, use the serve command:
rungpt serve stabilityai/stablelm-tuned-alpha-3b --precision fp16 --device_map balanced
💡 Tip: you can inspect the available options with rungpt serve --help.
This will start a gRPC server and an HTTP server listening on ports 51000 and 52000, respectively.
Once the server is ready, you can send requests to it:
import requests

prompt = "Once upon a time,"

response = requests.post(
    "https://localhost:51000/generate",
    json={
        "prompt": prompt,
        "max_length": 100,
        "temperature": 0.9,
        "top_k": 50,
        "top_p": 0.95,
        "repetition_penalty": 1.2,
        "do_sample": True,
        "num_return_sequences": 1,
    },
)
We also provide a Python client (inference-client) for interacting with the server:
from run_gpt import Client

client = Client()

# connect to the model server
model = client.get_model(endpoint='grpc://0.0.0.0:51000')

prompt = "Once upon a time,"

output = model.generate(
    prompt,
    max_length=100,
    temperature=0.9,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    do_sample=True,
    num_return_sequences=1,
)
The output has the same format as the output of OpenAI's Python API:
{ "id": "18d92585-7b66-4b7c-b818-71287c122c50",
"object": "text_completion",
"created": 1692610173,
"choices": [{"text": "Once upon a time, there was an old man who lived in the forest. He had no children",
"finish_reason": "length",
"index": 0.0}],
"prompt": "Once upon a time,",
"usage": {"completion_tokens": 21, "total_tokens": 27, "prompt_tokens": 6}}
For streaming output, first install sseclient-py:
pip install sseclient-py
Then send the request to https://localhost:51000/generate_stream with the same payload:
import sseclient
import requests

prompt = "Once upon a time,"

response = requests.post(
    "https://localhost:51000/generate_stream",
    json={
        "prompt": prompt,
        "max_length": 100,
        "temperature": 0.9,
        "top_k": 50,
        "top_p": 0.95,
        "repetition_penalty": 1.2,
        "do_sample": True,
        "num_return_sequences": 1,
    },
    stream=True,
)

client = sseclient.SSEClient(response)
for event in client.events():
    print(event.data)
The output will be streamed back to you (only the first three events are shown here):
{ "id": "18d92585-7b66-4b7c-b818-71287c122c51",
"object": "text_completion",
"created": 1692610173,
"choices": [{"text": " there", "finish_reason": None, "index": 0.0}],
"prompt": "Once upon a time,",
"usage": {"completion_tokens": 1, "total_tokens": 7, "prompt_tokens": 6}},
{ "id": "18d92585-7b66-4b7c-b818-71287c122c52",
"object": "text_completion",
"created": 1692610173,
"choices": [{"text": "was", "finish_reason": None, "index": 0.0}],
"prompt": None,
"usage": {"completion_tokens": 2, "total_tokens": 9, "prompt_tokens": 7}},
{ "id": "18d92585-7b66-4b7c-b818-71287c122c53",
"object": "text_completion",
"created": 1692610173,
"choices": [{"text": "an", "finish_reason": None, "index": 0.0}],
"prompt": None,
"usage": {"completion_tokens": 3, "total_tokens": 11, "prompt_tokens": 8}}
We also support chat mode, which is useful for interactive applications. The input for chat should be a list of dictionaries, each containing a role and content. For example:
import requests

messages = [
    {"role": "user", "content": "Hello!"},
]

response = requests.post(
    "https://localhost:51000/chat",
    json={
        "messages": messages,
        "max_length": 100,
        "temperature": 0.9,
        "top_k": 50,
        "top_p": 0.95,
        "repetition_penalty": 1.2,
        "do_sample": True,
        "num_return_sequences": 1,
    },
)
The response will be:
{"id": "18d92585-7b66-4b7c-b818-71287c122c57",
"object": "chat.completion",
"created": 1692610173,
"choices": [{"message": {
"role": "assistant",
"content": "\n\nHello there, how may I assist you today?",
},
"finish_reason": "stop", "index": 0.0}],
"prompt": "Hello there!",
"usage": {"completion_tokens": 12, "total_tokens": 15, "prompt_tokens": 3}}
You can also replace chat with chat_stream to get streaming output.
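For a multi-turn conversation, the assistant's reply can be appended to the message list before the next request is sent. The sketch below assumes the chat response has been decoded into a dict with the shape shown above; `append_reply` is a hypothetical helper, not part of RunGPT.

```python
def append_reply(messages: list[dict], response: dict) -> list[dict]:
    """Return a new message list with the assistant's reply appended."""
    reply = response["choices"][0]["message"]
    return messages + [{"role": reply["role"], "content": reply["content"]}]


messages = [{"role": "user", "content": "Hello!"}]

# Response trimmed to the fields used here, following the example above.
response = {
    "object": "chat.completion",
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": "\n\nHello there, how may I assist you today?",
            },
            "finish_reason": "stop",
            "index": 0,
        }
    ],
}

history = append_reply(messages, response)
history.append({"role": "user", "content": "Tell me a story."})
print([m["role"] for m in history])
```

The resulting history can then be sent as the messages field of the next chat request.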
You can also deploy the server to a cloud provider like Jina Cloud or AWS.
To do so, use the deploy command with the predefined executor:
rungpt deploy stabilityai/stablelm-tuned-alpha-3b --precision fp16 --device_map balanced --cloud jina --replicas 1
It will give you an HTTP URL and a gRPC URL by default:
https://{random-host-name}-http.wolf.jina.ai
grpcs://{random-host-name}-grpc.wolf.jina.ai
TBD
We have benchmarked different model architectures and configurations (with or without quantization, torch.compile, paged attention, ...) with regard to latency, throughput (for the prefill stage and for the whole decoding process), and perplexity.
The benchmarking script is located at scripts/benchmark.py. You can run it to reproduce the benchmark results.
We use a single RTX3090 (CUDA 11.8) for all benchmarks except Llama-2-13b (2*RTX3090), with the following package versions:
- torch==2.0.1 (without torch.compile) / torch==2.1.0.dev20230803 (with torch.compile)
- bitsandbytes==0.41.0
- transformers==4.31.0
- triton==2.0.0
Model_Name |
---|
meta-llama/Llama-2-7b-hf |
mosaicml/mpt-7b |
stabilityai/stablelm-base-alpha-7b |
EleutherAI/gpt-j-6B |
- Latency/throughput for different models (precision: fp16)
Model_Name | average_prefill_latency(ms/token) | average_prefill_throughput(token/s) | average_decode_latency(ms/token) | average_decode_throughput(token/s) |
---|---|---|---|---|
meta-llama/Llama-2-7b-hf | 49 | 20.619 | 49.4 | 20.054 |
meta-llama/Llama-2-13b-hf | 175 | 5.727 | 188.27 | 4.836 |
mosaicml/mpt-7b | 27 | 37.527 | 28.04 | 35.312 |
stabilityai/stablelm-base-alpha-7b | 50 | 20.09 | 45.73 | 21.878 |
EleutherAI/gpt-j-6B | 75 | 13.301 | 76.15 | 11.181 |
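The latency and throughput columns are approximately reciprocal: a latency of L ms/token corresponds to roughly 1000 / L tokens/s. The sketch below checks this against the prefill columns of the table above; the helper name `tokens_per_second` is an assumption, and the small deviations presumably come from rounding or from how the averages were taken.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to tokens per second."""
    return 1000.0 / ms_per_token


# (prefill latency in ms/token, reported prefill throughput in token/s),
# copied from the fp16 table above.
prefill = {
    "meta-llama/Llama-2-7b-hf": (49, 20.619),
    "meta-llama/Llama-2-13b-hf": (175, 5.727),
    "mosaicml/mpt-7b": (27, 37.527),
    "stabilityai/stablelm-base-alpha-7b": (50, 20.09),
    "EleutherAI/gpt-j-6B": (75, 13.301),
}

for name, (latency_ms, reported) in prefill.items():
    derived = tokens_per_second(latency_ms)
    print(f"{name}: derived {derived:.2f} token/s, reported {reported} token/s")
```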
- Latency/throughput for different models using torch.compile (precision: fp16)
Warning torch.compile doesn't support Flash-Attention-based models such as MPT. It also cannot be used in a multi-GPU environment.
Model_Name | average_prefill_latency(ms/token) | average_prefill_throughput(token/s) | average_decode_latency(ms/token) | average_decode_throughput(token/s) |
---|---|---|---|---|
meta-llama/Llama-2-7b-hf | 25 | 40.644 | 26.54 | 37.75 |
meta-llama/Llama-2-13b-hf | - | - | - | - |
mosaicml/mpt-7b | - | - | - | - |
stabilityai/stablelm-base-alpha-7b | 44 | 22.522 | 42.97 | 21.413 |
EleutherAI/gpt-j-6B | 32 | 31.488 | 33.89 | 25.105 |
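To read the two tables side by side, the decode speedup from torch.compile can be computed as the ratio of the baseline latency to the compiled latency. The numbers below are copied from the fp16 tables above; the helper `speedup` is only an illustrative name.

```python
# Average decode latency in ms/token, copied from the two fp16 tables above.
baseline_ms = {
    "meta-llama/Llama-2-7b-hf": 49.4,
    "stabilityai/stablelm-base-alpha-7b": 45.73,
    "EleutherAI/gpt-j-6B": 76.15,
}
compiled_ms = {
    "meta-llama/Llama-2-7b-hf": 26.54,
    "stabilityai/stablelm-base-alpha-7b": 42.97,
    "EleutherAI/gpt-j-6B": 33.89,
}


def speedup(before_ms: float, after_ms: float) -> float:
    """Latency ratio: values above 1 mean the second configuration is faster."""
    return before_ms / after_ms


for name in baseline_ms:
    print(f"{name}: {speedup(baseline_ms[name], compiled_ms[name]):.2f}x")
```

On these figures, the gain is about 1.86x for Llama-2-7b and 2.25x for GPT-J, but only about 1.06x for StableLM.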
- Latency/throughput for different models using quantization (precision: fp16 / bit8 / bit4)
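Model_Name | prefill latency fp16 (ms/token) | prefill latency bit8 (ms/token) | prefill latency bit4 (ms/token) | prefill throughput fp16 (tokens/s) | prefill throughput bit8 (tokens/s) | prefill throughput bit4 (tokens/s) | decode latency fp16 (ms/token) | decode latency bit8 (ms/token) | decode latency bit4 (ms/token) | decode throughput fp16 (tokens/s) | decode throughput bit8 (tokens/s) | decode throughput bit4 (tokens/s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|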
meta-llama/Llama-2-7b-hf | 49 | 301 | 125 | 20.619 | 3.325 | 8.015 | 49.4 | 256.44 | 112.22 | 20.054 | 3.9 | 8.918 |
meta-llama/Llama-2-13b-hf | 175 | 974 | 376 | 5.727 | 1.027 | 2.662 | 182.27 | 796.32 | 349.93 | 4.836 | 1.144 | 2.662 |
mosaicml/mpt-7b | 27 | 139 | 86 | 37.527 | 7.222 | 11.6 | 28.04 | 141.04 | 94.22 | 35.312 | 7.021 | 10.507 |
stabilityai/stablelm-base-alpha-7b | 50 | 164 | 156 | 20.09 | 6.134 | 6.408 | 45.73 | 148.53 | 147.56 | 21.878 | 6.947 | 6.994 |
EleutherAI/gpt-j-6B | 75 | 368 | 162 | 13.301 | 2.724 | 6.195 | 76.15 | 365.51 | 138.44 | 11.181 | 2.327 | 5.642 |
- Perplexity for different models using quantization (precision: fp16 / bit8 / bit4)
Notice This benchmark shows that quantization raises perplexity only slightly for most models; the largest increase is for stabilityai/stablelm-base-alpha-7b at bit4.
Model_Name | wikitext2 fp16 | wikitext2 bit8 | wikitext2 bit4 | ptb fp16 | ptb bit8 | ptb bit4 | c4 fp16 | c4 bit8 | c4 bit4 |
---|---|---|---|---|---|---|---|---|---|
meta-llama/Llama-2-7b-hf | 5.4721 | 5.506 | 5.6437 | 22.9483 | 23.8797 | 25.0556 | 6.9727 | 7.0098 | 7.1623 |
meta-llama/Llama-2-13b-hf | 4.8837 | 4.9229 | 4.9811 | 27.6802 | 27.9665 | 28.8417 | 6.4677 | 6.4884 | 6.566 |
mosaicml/mpt-7b | 7.6829 | 7.7256 | 7.9869 | 10.6002 | 10.6743 | 10.9486 | 9.6001 | 9.6457 | 9.879 |
stabilityai/stablelm-base-alpha-7b | 14.1886 | 14.268 | 15.9817 | 19.2968 | 19.4904 | 21.3513 | 48.222 | 48.3384 | 57.022 |
EleutherAI/gpt-j-6B | 8.8563 | 8.8786 | 9.0301 | 13.5946 | 13.6137 | 13.784 | 11.7114 | 11.7293 | 11.8929 |
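One way to quantify the effect of quantization is the relative perplexity increase over fp16. The sketch below does this for the wikitext2 columns at bit4, with values copied from the table above; `relative_increase_pct` is a hypothetical helper name.

```python
# wikitext2 perplexity (fp16, bit4), copied from the table above.
wikitext2 = {
    "meta-llama/Llama-2-7b-hf": (5.4721, 5.6437),
    "meta-llama/Llama-2-13b-hf": (4.8837, 4.9811),
    "mosaicml/mpt-7b": (7.6829, 7.9869),
    "stabilityai/stablelm-base-alpha-7b": (14.1886, 15.9817),
    "EleutherAI/gpt-j-6B": (8.8563, 9.0301),
}


def relative_increase_pct(fp16: float, quantized: float) -> float:
    """Perplexity increase of the quantized model relative to fp16, in percent."""
    return (quantized - fp16) / fp16 * 100.0


for name, (fp16, bit4) in wikitext2.items():
    print(f"{name}: +{relative_increase_pct(fp16, bit4):.2f}%")
```

On these numbers, the bit4 increase is roughly 2 to 4 percent for four of the five models and about 12.6 percent for stablelm-base-alpha-7b.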
- Latency/throughput for different models using vllm (precision: fp16)
Warning vllm brings a significant improvement in latency and throughput, but it is not compatible with streaming output, so we have not released it yet.
Model_Name | prefill latency, vllm (ms/token) | prefill latency, baseline (ms/token) | prefill throughput, vllm (tokens/s) | prefill throughput, baseline (tokens/s) | decode latency, vllm (ms/token) | decode latency, baseline (ms/token) | decode throughput, vllm (tokens/s) | decode throughput, baseline (tokens/s) |
---|---|---|---|---|---|---|---|---|
meta-llama/Llama-2-7b-hf | 29 | 49 | 34.939 | 20.619 | 20.34 | 49.40 | 48.67 | 20.054 |
We welcome contributions from the community! To contribute, please submit a pull request following our contributing guidelines.
RunGPT is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.