Open Source LLM Inference Engines

An overview of popular open source large language model inference engines. An inference engine is the program that loads a model's weights and generates text responses from given inputs.
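
As a concrete illustration, here is a minimal sketch using vLLM's offline Python API (the model name below is only a placeholder); the other engines follow the same load-once, generate-many pattern through their own APIs:

```python
# Minimal sketch of the "load weights once, then generate" pattern,
# using vLLM's offline Python API as one example.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder model; loads weights onto the GPU
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["What is an inference engine?"], params)
for out in outputs:
    print(out.outputs[0].text)
```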

Feel free to create a PR or issue if you want to add a new engine column or feature row, or to update a status.

Compared Inference Engines

  • vLLM: Designed to provide SOTA throughput.
  • TensorRT-LLM: Nvidia's high-performance, extensible, PyTorch-like API, designed for use with the Nvidia Triton Inference Server.
  • llama.cpp: Pure C++ with no external dependencies, with Apple Silicon as a first-class target.
  • TGI: HuggingFace's fast and flexible engine, designed for high throughput.
  • LightLLM: A lightweight, fast, and flexible framework targeting performance, written purely in Python / Triton.
  • DeepSpeed-MII / DeepSpeed-FastGen: Microsoft's high-performance implementation, including the SOTA Dynamic SplitFuse scheduler.
  • ExLlamaV2: Efficiently runs language models on modern consumer GPUs; implements the SOTA EXL2 quantization format.

Comparison Table

✅ Included | 🟠 Inferior Alternative | 🌩️ Exists but has Issues | 🔨 PR | 🗓️ Planned | ❓ Unclear / Unofficial | ❌ Not Implemented

| | vLLM | TensorRT-LLM | llama.cpp | TGI | LightLLM | Fastgen | ExLlamaV2 |
|---|---|---|---|---|---|---|---|
| **Optimizations** | | | | | | | |
| FlashAttention2 | ✅ [1] | ✅ [2] | 🟠 [3] | ✅ [4] | ✅ | ✅ | ✅ |
| PagedAttention | ✅ [4] | ✅ [2] | ❌ [5] | ✅ | 🟠*** [6] | ✅ | ✅ [7] |
| Speculative Decoding | 🔨 [8] | 🗓️ [9] | ✅ [10] | ✅ [11] | ❌ | ❌ [12] | ✅ |
| Tensor Parallel | ✅ | ✅ [13] | 🟠** [14] | ✅ [15] | ✅ | ✅ [16] | ❌ |
| Pipeline Parallel | ✅ [17] | ✅ [18] | ❌ [19] | ❓ [15] | ❌ | ❌ [20] | ❌ |
| **Optim. / Scheduler** | | | | | | | |
| Dyn. SplitFuse (SOTA [21]) | 🗓️ [21] | 🗓️ [22] | ❌ | ❌ | ❌ | ✅ [21] | ❌ |
| Efficient Router (better) | ❌ | ❌ | ❌ | ❌ | ✅ [23] | ❌ | ❌ |
| Continuous Batching | ✅ [21] | ✅ [24] | ✅ | ✅ | ❌ | ✅ [16] | ❓ [25] |
| **Optim. / Quant** | | | | | | | |
| EXL2 (SOTA [26]) | 🔨 [27] | ❌ | ❌ | ✅ [28] | ❌ | ❌ | ✅ |
| AWQ | 🌩️ [29] | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ |
| Other Quants | (yes) [30] | GPTQ | GGUF [31] | (yes) [32] | ? | ? | ? |
| **Features** | | | | | | | |
| OpenAI-Style API | ✅ | ❌ [33] | ✅ [13] | ✅ [34] | ✅ [35] | ❌ | ❌ |
| **Feat. / Sampling** | | | | | | | |
| Beam Search | ✅ | ✅ [2] | ✅ [36] | 🟠**** [37] | ❌ | ❌ [38] | ❌ [39] |
| JSON / Grammars via Outlines | ✅ | 🗓️ | ✅ | ✅ | ? | ? | ✅ |
| **Models** | | | | | | | |
| Llama 2 / 3 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Mistral | ✅ | ✅ | ✅ | ✅ | ✅ [40] | ✅ | ✅ |
| Mixtral | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Implementation** | | | | | | | |
| Core Language | Python | C++ | C++ | Py / Rust | Python | Python | Python |
| GPU API | CUDA* | CUDA* | Metal / CUDA | CUDA* | Triton / CUDA | CUDA* | CUDA |
| **Repo** | | | | | | | |
| License | Apache 2 | Apache 2 | MIT | Apache 2 [41] | Apache 2 | Apache 2 | MIT |
| GitHub Stars | 17K | 6K | 54K | 8K | 2K | 2K | 3K |
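
For the engines marked with an OpenAI-Style API in the table above, clients can usually be pointed at the locally hosted server using the stock openai Python package. A rough sketch, assuming a compatible server (e.g. vLLM or TGI) is already running locally; the base URL, port, and model name depend entirely on how that server was launched:

```python
# Rough sketch: calling a locally hosted OpenAI-compatible endpoint.
# The base_url, port, and model name are assumptions tied to how the
# server was started, not fixed values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```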

Benchmarks

Notes

*Supports Triton for one-off kernels such as FlashAttention (FusedAttention) or quantization, or allows Triton plugins; however, the project doesn't use Triton otherwise.

**Tensor split that is processed sequentially rather than in parallel.

***"TokenAttention is the special case of PagedAttention when block size equals to 1, which we have tested before and find it under-utilizes GPU compute compared to larger block size. Unless LightLLM's Triton kernel implementation is surprisingly fast, this should not bring speedup."

****TGI maintainers suggest using best_of instead of beam search (best_of creates n generations and selects the one with the highest logprob). Anecdotally, beam search is much better at finding the best generation for "non-creative" tasks.
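
To make the best_of note concrete, here is an illustrative sketch of the two request styles expressed with vLLM's SamplingParams (parameter names as of the 0.3/0.4 releases; beam-search support has changed across versions, so treat this as a sketch rather than a reference):

```python
# Illustrative only: best_of sampling vs. beam search, expressed with
# vLLM's SamplingParams (availability of use_beam_search varies by version).
from vllm import SamplingParams

# best_of: sample 4 candidates independently, return the single one
# with the best cumulative logprob.
best_of_params = SamplingParams(n=1, best_of=4, temperature=0.8)

# Beam search: expand 4 beams deterministically instead of sampling.
beam_params = SamplingParams(n=1, best_of=4, use_beam_search=True, temperature=0.0)
```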

Footnotes

  1. https://github.com/vllm-project/vllm/issues/485#issuecomment-1693009046
  2. https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/gpt_attention.md
  3. https://github.com/ggerganov/llama.cpp/pull/5021 (FlashAttention, but not FlashAttention2)
  4. https://github.com/huggingface/text-generation-inference/issues/753#issuecomment-1663525606
  5. https://github.com/ggerganov/llama.cpp/issues/1955
  6. https://github.com/ModelTC/lightllm/blob/main/docs/TokenAttention.md
  7. https://github.com/turboderp/exllamav2/commit/affc3508c1d18e4294a5062f794f44112a8b07c5
  8. https://github.com/vllm-project/vllm/pull/1797
  9. https://github.com/NVIDIA/TensorRT-LLM/issues/169
  10. https://github.com/ggerganov/llama.cpp/blob/fe680e3d1080a765e5d3150ffd7bab189742898d/examples/speculative/README.md
  11. https://github.com/huggingface/text-generation-inference/pull/1308
  12. https://github.com/microsoft/DeepSpeed-MII/issues/254
  13. https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/pybind/bindings.cpp#L184
  14. https://github.com/ggerganov/llama.cpp/issues/4014#issuecomment-1804925896
  15. https://github.com/huggingface/text-generation-inference/issues/1031#issuecomment-1727976990
  16. https://github.com/microsoft/DeepSpeed-MII
  17. https://github.com/vllm-project/vllm/issues/387
  18. https://github.com/NVIDIA/TensorRT-LLM/blob/bf0a5afc92f4b2b3191e9e55073953c1f600cf2d/tensorrt_llm/auto_parallel/config.py#L35
  19. "without specific architecture tricks, you will only be using one GPU at a time, and your performance will suffer compared to a single GPU due to communication and synchronization overhead." https://github.com/ggerganov/llama.cpp/issues/4238#issuecomment-1832768597
  20. https://github.com/microsoft/DeepSpeed-MII/issues/329#issuecomment-1830317364
  21. https://blog.vllm.ai/2023/11/14/notes-vllm-vs-deepspeed.html, https://github.com/vllm-project/vllm/issues/1562
  22. https://github.com/NVIDIA/TensorRT-LLM/issues/317#issuecomment-1810841752
  23. https://github.com/ModelTC/lightllm/blob/a9cf0152ad84beb663cddaf93a784092a47d1515/docs/LightLLM.md#efficient-router
  24. https://github.com/NVIDIA/TensorRT-LLM/blob/main/README.md
  25. https://github.com/turboderp/exllamav2/discussions/19#discussioncomment-6989460
  26. https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/#pareto-frontiers
  27. https://github.com/vllm-project/vllm/issues/296
  28. https://github.com/huggingface/text-generation-inference/pull/1211
  29. https://github.com/vllm-project/vllm/blob/main/docs/source/quantization/auto_awq.rst
  30. https://github.com/vllm-project/vllm/blob/1f24755bf802a2061bd46f3dd1191b7898f13f45/vllm/model_executor/quantization_utils/squeezellm.py#L8
  31. https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/README.md
  32. https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/cli.py#L15-L21
  33. https://github.com/NVIDIA/TensorRT-LLM/issues/334
  34. https://huggingface.co/docs/text-generation-inference/messages_api
  35. https://github.com/ModelTC/lightllm/blob/main/lightllm/server/api_models.py#L9
  36. https://github.com/ggerganov/llama.cpp/tree/master/examples/beam-search
  37. https://github.com/huggingface/text-generation-inference/issues/722#issuecomment-1658823644
  38. https://github.com/microsoft/DeepSpeed-MII/issues/286#issuecomment-1808510043
  39. https://github.com/turboderp/exllamav2/issues/84
  40. https://github.com/ModelTC/lightllm/issues/224#issuecomment-1827365514
  41. https://raw.githubusercontent.com/huggingface/text-generation-inference/main/LICENSE, https://twitter.com/julien_c/status/1777328456709062848
