👋 join us on Twitter, Discord and WeChat
- [2023/08] TurboMind supports Qwen-7B, dynamic NTK-RoPE scaling and dynamic logN scaling
- [2023/08] TurboMind supports Windows (tp=1)
- [2023/08] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation🚀. Check this guide for detailed info
- [2023/08] LMDeploy has launched on the HuggingFace Hub, providing ready-to-use 4-bit models.
- [2023/08] LMDeploy supports 4-bit quantization using the AWQ algorithm.
- [2023/07] TurboMind supports Llama-2 70B with GQA.
- [2023/07] TurboMind supports Llama-2 7B/13B.
- [2023/07] TurboMind supports tensor-parallel inference of InternLM.
LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams. It has the following core features:
- Efficient Inference Engine (TurboMind): Based on FasterTransformer, we have implemented an efficient inference engine, TurboMind, which supports the inference of LLaMA and its variant models on NVIDIA GPUs.
- Interactive Inference Mode: By caching the k/v of attention during multi-round dialogue, it remembers the dialogue history and thus avoids repetitive processing of historical sessions.
- Multi-GPU Model Deployment and Quantization: We provide comprehensive model deployment and quantization support, validated at different scales.
- Persistent Batch Inference: Further optimization of model execution efficiency.
LMDeploy has two inference backends: PyTorch and TurboMind.
Note
W4A16 inference requires an NVIDIA GPU with the Ampere architecture or above.
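To check whether a GPU meets this requirement, one option is to query its compute capability (a sketch; it assumes a reasonably recent NVIDIA driver that supports the compute_cap query field, and Ampere corresponds to compute capability 8.0 or higher):

nvidia-smi --query-gpu=name,compute_cap --format=csv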
TurboMind:

Models | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
---|---|---|---|---|---|
Llama | Yes | Yes | Yes | Yes | No |
Llama2 | Yes | Yes | Yes | Yes | No |
InternLM | Yes | Yes | Yes | Yes | No |
PyTorch:

Models | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
---|---|---|---|---|---|
Llama | Yes | Yes | No | No | No |
Llama2 | Yes | Yes | No | No | No |
InternLM | Yes | Yes | No | No | No |
Case I: output token throughput with fixed input token and output token number (1, 2048)
Case II: request throughput with real conversation data
Test Setting: LLaMA-7B, NVIDIA A100 (80G)
The output token throughput of TurboMind exceeds 2000 tokens/s, which is about 5% - 15% higher than DeepSpeed overall and up to 2.3x that of Hugging Face Transformers. The request throughput of TurboMind is 30% higher than vLLM's.
Install lmdeploy with pip (Python 3.8+) or from source
pip install lmdeploy
# 1. Download InternLM model
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-chat-7b /path/to/internlm-chat-7b
# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1
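# e.g., a sketch combining the env var with the clone command above:
# GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/internlm/internlm-chat-7b /path/to/internlm-chat-7b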
# 2. Convert InternLM model to turbomind's format, which will be in "./workspace" by default
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b
python3 -m lmdeploy.turbomind.chat ./workspace
Note
When inferring with FP16 precision, the InternLM-7B model requires at least 15.7G of GPU memory overhead on TurboMind.
It is recommended to use NVIDIA cards such as the 3090, V100, A100, etc. Disabling GPU ECC can free up 10% of memory; try sudo nvidia-smi --ecc-config=0 and reboot the system.
Note
Tensor parallelism is available to perform inference on multiple GPUs. Add --tp=<num_gpu> to the chat command to enable runtime TP.
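For example, a run of the chat command on 2 GPUs might look like the following sketch (the GPU count is illustrative):

python3 -m lmdeploy.turbomind.chat ./workspace --tp=2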
python3 -m lmdeploy.serve.gradio.app ./workspace
Launch inference server by:
bash workspace/service_docker_up.sh
Then, you can communicate with the inference server by command line,
python3 -m lmdeploy.serve.client {server_ip_address}:33337
or webui,
python3 -m lmdeploy.serve.gradio.app {server_ip_address}:33337
For the deployment of other supported models, such as LLaMA, LLaMA-2, Vicuna, and so on, you can find the guide here
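As an illustrative sketch only (the model name llama2 and the local path are assumptions; consult the linked guide for the exact supported model names), converting a Llama-2 chat model could look like:

python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-7b-chat-hf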
For detailed instructions on inference with PyTorch models, see here.
python3 -m lmdeploy.pytorch.chat $NAME_OR_PATH_TO_HF_MODEL \
--max_new_tokens 64 \
--temperature 0.8 \
--top_p 0.95 \
--seed 0
deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
$NAME_OR_PATH_TO_HF_MODEL \
--max_new_tokens 64 \
--temperature 0.8 \
--top_p 0.95 \
--seed 0
You need to install deepspeed first to use this feature.
pip install deepspeed
First, run the quantization script to obtain the quantization parameters. After execution, various parameters needed for quantization will be stored in $WORK_DIR; these will be used in the following steps.
# Calibration options:
#   --calib_dataset: calibration dataset; c4, ptb, wikitext2 and pileval are supported
#   --calib_samples: number of samples in the calibration set; reduce it if GPU memory is insufficient
#   --calib_seqlen:  length of a single text sample; reduce it if GPU memory is insufficient
#   --work_dir:      folder storing the PyTorch-format quantization statistics and the post-quantization weights
python3 -m lmdeploy.lite.apis.calibrate \
  --model $HF_MODEL \
  --calib_dataset 'c4' \
  --calib_samples 128 \
  --calib_seqlen 2048 \
  --work_dir $WORK_DIR
LMDeploy supports INT4 quantization of weights and INT8 quantization of the KV cache. Run the corresponding script according to your needs.
LMDeploy uses the AWQ algorithm for model weight quantization. It requires the quantization parameters stored in $WORK_DIR from step 1, and the quantized weights will also be stored in this folder.
# Quantization options:
#   --w_bits:       bit width for weight quantization
#   --w_group_size: group size for weight quantization statistics
#   --work_dir:     directory holding the quantization parameters from step 1
python3 -m lmdeploy.lite.apis.auto_awq \
  --model $HF_MODEL \
  --w_bits 4 \
  --w_group_size 128 \
  --work_dir $WORK_DIR
Click here to view the test results for weight int4 usage.
Click here to view the usage method, implementation formula, and test results for kv int8.
Warning
Runtime tensor parallelism for quantized models is not available. Please set --tp on deploy to enable static TP.
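For example, a sketch of converting a model with static TP across 2 GPUs (the GPU count and the model path are illustrative, and --tp on deploy is the option named in the warning above):

python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/model --tp 2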
We appreciate all contributions to LMDeploy. Please refer to CONTRIBUTING.md for the contributing guideline.
This project is released under the Apache 2.0 license.