This is a quick and dirty script that simultaneously runs LLaMa and a web server so that you can launch a local LLaMa API.
So far it supports running the 13B model on 2 GPUs, but it can be extended to serve bigger models as well.
To get it running, just edit the `launch.sh` script to set your `CUDA_VISIBLE_DEVICES`, `TARGET_FOLDER`, and `MODEL_SIZE`. Then run `./launch.sh` and you should be good to go!
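For reference, the variables in `launch.sh` might be filled in something like this (a minimal sketch; the torchrun invocation and the `server.py` entry-point name are assumptions, not necessarily what this repo's script contains):

```sh
#!/bin/bash
# Hypothetical launch.sh configuration -- adjust to your setup.
export CUDA_VISIBLE_DEVICES=0,1        # the two GPUs to run on
TARGET_FOLDER=/path/to/llama/weights   # folder containing 13B/ and tokenizer.model
MODEL_SIZE=13B                         # which checkpoint to load

# The 13B checkpoint ships as 2 shards, so model parallelism (MP) is 2.
# "server.py" is a placeholder; check launch.sh for the real entry point.
torchrun --nproc_per_node 2 server.py \
    --ckpt_dir "$TARGET_FOLDER/$MODEL_SIZE" \
    --tokenizer_path "$TARGET_FOLDER/tokenizer.model"
```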
Once the server is launched, you can test it by opening a separate terminal and running:

`curl -X POST -H "Content-Type: application/json" -d '{"prompt":"hello world"}' http://localhost:54983/flask-inference/`
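For a slightly more readable test, you can pipe the response through Python's built-in JSON formatter (the endpoint and port are the ones above; the shape of the response body depends on the server):

```sh
# Send a prompt and pretty-print whatever JSON comes back.
curl -s -X POST -H "Content-Type: application/json" \
     -d '{"prompt":"Tell me a short story about a robot."}' \
     http://localhost:54983/flask-inference/ | python3 -m json.tool
```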
Feel free to improve this script and submit a PR, fork your own repo, or whatever you want!
Original repository [and Readme] at https://github.com/facebookresearch/llama
- New example.py lets you use the console for interactive prompting. Supports multiple GPUs (tested with the 13B model on two RTX 3090s); see the sketch after this list.
- Modified llama/generate.py to support the above functionality
- Batch size set to 1 (effectively no batching)
- The rest of the code is left unchanged
- Added an option to split the 7B model across two GPUs (just use -MP=2, as for the 13B model); an example invocation is sketched below. Further parallelization is possible, but I don't plan to implement it.
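For the interactive console mode and the 7B split mentioned above, the run command presumably looks much like the upstream repo's torchrun invocation (a sketch under that assumption; check example.py for the exact arguments):

```sh
# Interactive prompting with the 7B model split across two GPUs (MP=2).
torchrun --nproc_per_node 2 example.py \
    --ckpt_dir "$TARGET_FOLDER/7B" \
    --tokenizer_path "$TARGET_FOLDER/tokenizer.model"
```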