fast-llama is a high-performance inference engine for LLMs such as LLaMA, written in pure C++. It can run an 8-bit quantized LLaMA2-7B model on a 56-core CPU at roughly 25 tokens/s. On CPU it outperforms current open-source inference engines, most notably llama.cpp, with about 2.5x higher inference speed.
Feature Name | Current Support | Future Support |
---|---|---|
Model Types | ✅ LLaMA (only Chinese-LLaMA2 1.3B & 7B are verified currently) | LLM/Baichuan, StableDiffusion |
Quantization | ✅ INT16, ✅ INT8 | INT4 |
Model Formats | ✅ HuggingFace, ✅ gguf (by llama.cpp), ✅ flm | |
Systems | ✅ Linux | Windows, macOS, Android, iOS |
CPU/GPU | ✅ Intel CPUs with AVX-512 | All x86-64, ARM, and Apple M-series CPUs; GPU; CPU+GPU |
Architectures | ✅ UMA, ✅ NUMA | |
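The benchmarks below use the INT8 path. As a rough illustration of what symmetric per-row INT8 weight quantization typically looks like, here is a generic sketch under common conventions; it is not fast-llama's actual code, and the names are illustrative only:

```cpp
#include <cstdint>
#include <cmath>
#include <vector>
#include <algorithm>

// Generic sketch of symmetric per-row INT8 weight quantization.
// Not taken from fast-llama's sources; names and layout are illustrative.
struct QuantizedRow {
    std::vector<int8_t> q;  // quantized weights
    float scale;            // per-row scale so that w ≈ q * scale
};

QuantizedRow quantize_row_int8(const float* w, size_t n) {
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; ++i)
        max_abs = std::max(max_abs, std::fabs(w[i]));

    QuantizedRow out;
    out.scale = max_abs / 127.0f;            // map [-max_abs, max_abs] -> [-127, 127]
    out.q.resize(n);
    const float inv = out.scale > 0.0f ? 1.0f / out.scale : 0.0f;
    for (size_t i = 0; i < n; ++i)
        out.q[i] = static_cast<int8_t>(std::lround(w[i] * inv));
    return out;
}
```

At inference time, rows quantized this way are typically multiplied against activations and rescaled by `scale`, which is what cuts the weight memory footprint and bandwidth relative to FP16/FP32.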
Why should you use fast-llama?

Fast
- Extremely fast on CPU, faster than any other engine on GitHub, including llama.cpp.

Simple
- Less than 7k lines of C++ with a well-organized code structure and no dependencies except libnuma (only needed on multi-CPU NUMA systems).

"Easy To Use" (target ☺️)
Only Linux is supported currently. Support for other platforms, including Windows, macOS, and GPU, is coming soon.
- GCC 10.x or newer
- CPU with AVX-512
- libnuma-dev (for NUMA systems)
Libraries such as MPI, OpenBLAS, and MKL are NOT needed currently.
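Since the current kernels require AVX-512, it can be useful to check the host before building. A minimal, standalone check using GCC's built-in feature detection is sketched below; this helper is not part of fast-llama:

```cpp
#include <cstdio>

// Standalone check (not part of fast-llama) that the host CPU exposes
// the AVX-512 Foundation instructions required by the current kernels.
int main() {
    __builtin_cpu_init();  // initialize GCC's CPU feature detection
    if (__builtin_cpu_supports("avx512f"))
        std::puts("AVX-512F supported: this CPU can run fast-llama.");
    else
        std::puts("AVX-512F not supported on this CPU.");
    return 0;
}
```

Alternatively, `avx512f` should appear in the flags line of /proc/cpuinfo on a supported machine.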
Method 1. Using the provided build script:
bash ./build.sh
Method 2. Using Make:
make -j 4
Step 1: Download a model
See llama2.c.

Step 2: Run the model
./main -c ./models/stories110M.bin -z ./models/tokenizer.bin -j 14 -q int8 -n 200 -i 'That was a long long story happened in the ancient China.'
Step 1: Download a model

Step 2: Convert the model format
python3 ./tools/convert_flm.py -m /path/to/model-directory -o ./models/model-name-int8.flm -t int8

Step 3: Run the model
./main -c ./models/model-name-int8.flm -j 40 -n 200 -i 'That was a long long story happened in the ancient China.'
All supported command-line options are as follows:
- -c: Path to the model file
- -f: Model file format (e.g., gguf)
- -j: Number of threads to use (e.g., 56)
- -q: Quantization mode (e.g., int8)
- -n: Number of tokens to generate (e.g., 200)
- -i: Input text (e.g., 'That was a long long story happened in the ancient China.')
- -h: Show usage information
Below are some incomplete test results:

Model | Model Size | Output Speed (8 threads) | Output Speed (28 threads) | Output Speed (56 threads) |
---|---|---|---|---|
stories110M | 110M | 237 tps | 400 tps | 440 tps |
Chinese-LLaMA-1.3B | 1.3B | 38.9 tps | 127 tps | 155 tps |
Chinese-LLaMA-7B | 7B | 7.4 tps | 17.4 tps | 23.5 tps |
- Note: tps = tokens per second
- Testing prompt: "That was a long long story happened in the ancient Europe. It was about a brave boy name Oliver. Oliver lived in a small village among many big moutains. It was a beautiful village."
- Quantization: int8
- NUMA: 2 sockets (note: make sure NUMA is truly available if you expect to accelerate with it; see the sketch after this list)
- System (uname -a): Linux coderlsf 5.15.0-72-generic #79-Ubuntu SMP Wed Apr 19 08:22:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- CPU: 56 physical cores, AVX-512
  - Architecture: x86_64
  - Model name: Intel(R) Xeon(R) Platinum 8350C CPU @ 2.60GHz
  - CPU(s): 112 (56 physical cores)
  - Thread(s) per core: 2
  - Core(s) per socket: 28
  - Socket(s): 2
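On dual-socket machines like the one above, keeping each worker thread and the memory it reads on the same NUMA node avoids slow cross-socket traffic. The following is only a minimal illustration of that idea using libnuma, not fast-llama's actual scheduler:

```cpp
#include <numa.h>   // libnuma; compile with -lnuma
#include <cstdio>

// Illustrative sketch only (not fast-llama's code): pin the calling thread
// to a NUMA node and allocate its working buffer from that node's local
// memory, so data stays close to the cores that read it.
int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "libnuma reports no NUMA support on this system\n");
        return 1;
    }
    int nodes = numa_num_configured_nodes();    // e.g. 2 on a dual-socket machine
    std::printf("configured NUMA nodes: %d\n", nodes);

    const int node = 0;
    const size_t bytes = 1 << 20;
    numa_run_on_node(node);                     // restrict this thread to CPUs of node 0
    void* buf = numa_alloc_onnode(bytes, node); // memory physically placed on node 0
    if (buf) numa_free(buf, bytes);
    return 0;
}
```

You can also confirm that both sockets are visible as separate nodes with numactl --hardware.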
Latency of the first token will be optimized later.
Why is it so fast?
- Extreme memory efficiency
  - Zero memory allocations and frees during inference (see the sketch after this list).
  - Maximized memory locality.
- Well-designed thread scheduling algorithm
- Optimized operators
  - Fuses all operators that can be fused.
  - Optimized computation of several operators.
- Proper quantization
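To illustrate the zero-allocation point: all scratch memory an inference step needs can be sized once at model-load time and reused for every token, so the hot loop never touches the allocator. The sketch below is a generic illustration of that pattern with hypothetical names and sizes, not fast-llama's actual implementation:

```cpp
#include <cstddef>
#include <vector>

// Generic "allocate once, reuse every token" pattern.
// Names and sizes are hypothetical, not taken from fast-llama's sources.
class ScratchArena {
public:
    explicit ScratchArena(size_t bytes) : buffer_(bytes), offset_(0) {}

    // Hand out a chunk of the preallocated buffer; no malloc/free involved.
    void* alloc(size_t bytes, size_t align = 64) {
        offset_ = (offset_ + align - 1) & ~(align - 1);
        void* p = buffer_.data() + offset_;
        offset_ += bytes;
        return offset_ <= buffer_.size() ? p : nullptr;  // arena was sized up front
    }

    // Called at the end of each token step so the same memory is reused.
    void reset() { offset_ = 0; }

private:
    std::vector<unsigned char> buffer_;  // sized once at model-load time
    size_t offset_;
};

// Usage sketch: one arena per worker thread, reset between tokens.
void decode_tokens(ScratchArena& arena, int n_tokens) {
    for (int t = 0; t < n_tokens; ++t) {
        float* attn_scratch = static_cast<float*>(arena.alloc(4096 * sizeof(float)));
        float* mlp_scratch  = static_cast<float*>(arena.alloc(11008 * sizeof(float)));
        (void)attn_scratch; (void)mlp_scratch;  // ... run the transformer layers here ...
        arena.reset();
    }
}

int main() {
    ScratchArena arena(1 << 20);  // sized once, large enough for one token step
    decode_tokens(arena, 4);
    return 0;
}
```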
fast-llama is licensed under the MIT License.
We would like to express our gratitude to all contributors and users of FastLLaMA. Your support and feedback have been invaluable in making this project a success. If you encounter any issues or have any suggestions, please feel free to open an issue on the GitHub repository.
Email: 📩[email protected]
Contact me if you have any questions.