A fast, lightweight, parallel inference server for Llama LLMs.
Topics: server, inference, llama, llm-inference, exllama, llama2, flash-attention-2, paged-attention, llama3, exllamav2
Updated Jun 19, 2024 - Python
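For orientation, a minimal client sketch of how one might query a parallel Llama inference server like this. The host, port, route, and payload field names are assumptions for illustration (many Llama servers expose an OpenAI-style completions route, but this project's actual API may differ; check its README).

```python
import requests

# Hypothetical endpoint; adjust to the server's documented address and route.
SERVER_URL = "http://localhost:8000/v1/completions"

# Assumed payload shape; parameter names here are illustrative, not confirmed.
payload = {
    "prompt": "Explain paged attention in one sentence.",
    "max_tokens": 64,    # cap on generated tokens (assumed parameter name)
    "temperature": 0.7,  # sampling temperature (assumed parameter name)
}

# Send the request and print the server's JSON response.
response = requests.post(SERVER_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json())
```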