Since my desktop is getting a bit aged, my options for running local LLMs are somewhat limited.
Luckily, running quantized Llama models with llama.cpp
on my CPU (with 8GB RAM) works alright.
Loading the model takes ~40 seconds and inference runs at 3.6 tokens/second (not great, not terrible).
One way or another, quantize a Llama model. This project uses `LlamaCpp` through LangChain. To be fully honest, I don't really know the exact requirements, but I have been using a Llama 2 model quantized to the `.gguf` format, and that works at the very least.
Place the path to the model file inside a file named `PATH_TO_MODEL_FILE` before building the image.
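For reference, loading such a model through LangChain's `LlamaCpp` wrapper looks roughly like this. This is only a minimal sketch, not the project's actual code: the generation parameters are placeholders, and it assumes `PATH_TO_MODEL_FILE` contains the model path on a single line.

```python
# Minimal sketch: load a quantized .gguf model via LangChain's LlamaCpp wrapper.
# Parameters are illustrative placeholders, not the project's actual settings.
from pathlib import Path

from langchain_community.llms import LlamaCpp

# PATH_TO_MODEL_FILE is assumed to hold the model path on a single line.
model_path = Path("PATH_TO_MODEL_FILE").read_text().strip()

llm = LlamaCpp(
    model_path=model_path,  # e.g. a Llama 2 model quantized to .gguf
    n_ctx=2048,             # context window size
    temperature=0.7,
    verbose=False,
)

print(llm.invoke("Say hello in one short sentence."))
```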
This project uses Docker. Build the image with `build.sh` and run a container with `run.sh`. The model file is baked into the image, since I found that it loads much faster that way than when the model is mounted.
When you run the playground with `python3 prompt.py`, you will be asked to write your prompt in `nano`. Prompts and outputs are saved to `/history`.
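To give an idea of the flow, a playground loop along these lines could open `nano` for the prompt, run it through the model, and append everything to `/history`. This is just an illustrative sketch, not the actual `prompt.py`:

```python
# Illustrative sketch of the nano + /history flow; not the actual prompt.py.
import subprocess
import tempfile
from datetime import datetime
from pathlib import Path

from langchain_community.llms import LlamaCpp

# Assumes PATH_TO_MODEL_FILE holds the model path (see the sketch above).
llm = LlamaCpp(model_path=Path("PATH_TO_MODEL_FILE").read_text().strip())

history_dir = Path("/history")
history_dir.mkdir(exist_ok=True)

# Collect the prompt by opening nano on a temporary file.
with tempfile.NamedTemporaryFile(suffix=".txt") as tmp:
    subprocess.run(["nano", tmp.name])
    prompt = Path(tmp.name).read_text().strip()

output = llm.invoke(prompt)
print(output)

# Save the prompt and the output under a timestamped file name.
stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
(history_dir / f"{stamp}.txt").write_text(f"PROMPT:\n{prompt}\n\nOUTPUT:\n{output}\n")
```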