
Fast-LLaMA: A High-Performance Inference Engine


Descriptions

fast-llama is a high-performance inference engine for LLMs like LLaMA, written in pure C++. It can run an 8-bit-quantized LLaMA2-7B model at ~30 tokens/s on a 56-core CPU. It outperforms current open-source inference engines, delivering 2~3x the CPU inference speed of the well-known llama.cpp.

Features

| Feature | Current Support | Future Support |
| --- | --- | --- |
| Model Types | ✅ LLaMA (only Chinese-LLaMA2 1.3B & 7B are verified currently) | LLM/Baichuan, StableDiffusion |
| Quantization | ✅ INT16, ✅ INT8 | INT4 |
| Model Formats | ✅ HuggingFace, ✅ gguf (by llama.cpp), ✅ flm | |
| Systems | ✅ Linux | Windows, macOS, Android, iOS |
| CPU/GPU | ✅ Intel CPUs with AVX-512 | All x86-64, ARM, Apple M-series CPUs, GPU, CPU+GPU |
| Architectures | ✅ UMA, ✅ NUMA | |

Advantages

Why should you use Fast-LLaMA?

  • Fast
    • Extremely fast on CPU. Faster than other engines on GitHub, including llama.cpp (up to 3x faster than llama.cpp).
  • Simple
    • Fewer than 7k lines of C++ in total, with a well-organized code structure and no dependencies except libnuma (only needed for multi-CPU NUMA systems).
  • "Easy to use" (a target ☺️)

Quick Start

Compile

Only Linux is currently supported. Support for other platforms, including Windows, macOS, and GPU, is coming soon.

Requirements

  • GCC 10.x or newer
  • CPU with AVX-512
  • libnuma-dev

Libraries such as MPI, OpenBLAS, and MKL are NOT currently needed.
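
Since AVX-512 is required, it may be worth confirming that your CPU exposes it before building. One quick check on Linux (a generic system command, not part of this project) is:

grep -o 'avx512[^ ]*' /proc/cpuinfo | sort -u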

Compilation

Method 1. Using the provided build script:

bash ./build.sh

Method 2. Using Make:

make -j 4
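
Both methods build the ./main binary used in the examples below (assuming it is produced in the project root, as those examples imply). Once built, its options can be listed with:

./main -h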

Run

1. Run with llama2.c models:

Step 1: Download a model

See llama2.c

Step 2: Run the model

./main -c ./models/stories110M.bin -z ./models/tokenizer.bin -j 14 -q int8 -n 200 -i 'That was a long long story happened in the ancient China.'

2. Run with Hugging Face format models

Step 1: Download a model

See Chinese-LLaMA-Alpaca-2

Step 2: Convert model format

python3 ./tools/convert_flm.py -m /path/to/model-directory -o ./models/model-name-int8.flm -t int8

Step 3: Run the model

./main -c ./models/model-name-int8.flm -j 40 -n 200 -i 'That was a long long story happened in the ancient China.'


All supported command-line options are as follows:

  • -c: Path to the model file
  • -f: Model file format (e.g., gguf)
  • -j: Number of threads to use (e.g., 56)
  • -q: Quantization mode (e.g., int8)
  • -n: Number of tokens to generate (e.g., 200)
  • -i: Input text (e.g., 'That was a long long story happened in the ancient China.')
  • -h: Show usage information
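
As an illustration of combining these options (the model file name here is hypothetical), a gguf model could be run like this:

./main -c ./models/model-name.gguf -f gguf -j 56 -n 200 -i 'That was a long long story happened in the ancient China.'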

Performance

Below are some preliminary test results:

| Model | Model Size | Output Speed (8 threads) | Output Speed (28 threads) | Output Speed (56 threads) |
| --- | --- | --- | --- | --- |
| stories110M | 110M | 364 tps | 645 tps | 668 tps |
| Chinese-LLaMA-1.3B | 1.3B | 57 tps | 172 tps | 227 tps |
| Chinese-LLaMA-7B | 7B | 8 tps | 25 tps | 34 tps |
  • Note: tps = tokens per second

Testing Conditions

  • Testing Prompt: "That was a long long story happened in the ancient Europe. It was about a brave boy name Oliver. Oliver lived in a small village among many big moutains. It was a beautiful village."
  • Quantization: int8
  • NUMA: 2 sockets
    • Note: Make sure that NUMA is actually available if you expect NUMA to provide any speedup (one way to check is shown after this list).
  • System (uname -a): Linux coderlsf 5.15.0-72-generic #79-Ubuntu SMP Wed Apr 19 08:22:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • CPU: 56 physical cores, AVX-512
Architecture:            x86_64
Model name:              Intel(R) Xeon(R) Platinum 8350C CPU @ 2.60GHz
CPU(s):                  112 (56 physical cores)
Thread(s) per core:      2
Core(s) per socket:      28
Socket(s):               2
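
If you expect NUMA to help, one way to confirm that multiple NUMA nodes are visible to the system (using the standard numactl tool, not part of this project) is:

numactl --hardware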


Latency of the first token will be optimized later.

Why

Why is it so fast?

  • Ultimate memory efficiency
    • Zero memory allocations and frees during inference.
    • Maximized memory locality.
  • Well-designed thread scheduling algorithm
  • Optimized operators
    • Fuses all operators that can be fused.
    • Optimized computation of several operators.
  • Appropriate quantization

License

fast-llama is licensed under the MIT License.

Acknowledgements

We would like to express our gratitude to all contributors and users of FastLLaMA. Your support and feedback have been invaluable in making this project a success. If you encounter any issues or have any suggestions, please feel free to open an issue on the GitHub repository.

Contact

Email: 📩[email protected]

Contact me if you have any questions.
