one-click-llms

Tip

Post a new issue if you would like other templates.

Tip

Advanced inferencing scripts (incl. for function calling, data extraction, advanced RAG methods, and private data redaction) are available for purchase here.

These one click templates allow you to quickly boot up an API for a given language model.

Read through the README file on the templates!
Runpod is recommended (better user interface) if using larger GPUs like A6000, A100 or H100.
Vast.AI is recommended for lowest cost per hour with smaller GPUs like A600000 and A2000. However, the user experience is significantly worse with Vast.AI than runpod.

Runpod One-Click Templates

Tip

To support the Trelis Research YouTube channel, you can sign up for an account with this link. Trelis is supported by a commission when you use one-click templates.

GPU Choices and Tips

For best reliability around CUDA versions, I recommend:

A6000 (48 GB VRAM)
A100 SXM (more expensive than PCI but more reliably has up to date CUDA)
H100 PCI or SXM - best for fp8 models, but expensive.

Fine-tuning Notebook Setup

CUDA 12.1 one-click template here

vLLM (requires an A100 or H100 or A6000, i.e. ampere architecture):

Note: The vLLM image has compatibility issues with certain Runpod CUDA drivers, leading to issues on certain pods. A6000 Ada is typically an option that works.

Llama 3.1 Instruct 8B, Llama 3.1 Instruct 70B, Llama 3.1 Instruct FP8 405B, Llama 3.1 Instruct INT4 405B
Phi 3 Mini, Phi 3 Small, Phi 3 Medium
Mistral Nemo Instruct (fp8)
Llama 3 8B Instruct
Llama 3 70B Instruct
Mistral Instruct 7B AWQ
Mixtral Instruct 8x7B AWQ
Qwen1.5 Chat 72B AWQ. Needs to be run on an A100 or H100. The 48 GB of VRAM on an A6000 is insufficient.
CodeLlama 70B Instruct - 4bit AWQ. Requires an A6000 or A100 or H100.

Important

Note: vLLM runs into issues sometimes if the pod template does not have the correct CUDA drivers. Unfortunately there is no way to know when picking a GPU. An issue has been raised here. As an alternative, you can run TGI (and even query in openai style, guide here). TGI is faster than vLLM and recommended in general. Note however, that TGI does not automatically apply the chat template to the prompt when using the OpenAI style endpoint.

Text Generation Inference:

IDEFICS 2 8B multi-modal
Llama 3 - 8B Instruct
Llama 3 - 70B Instruct
OpenChat 3.5 7B AWQ API - RECOMMENDED, OpenChat 3.5 7B bf16 - TGI API - lowest perplexity
Mixtral Instruct API 4bit AWQ - RECOMMENDED, Mixtral Instruct API 8bit eetq, pod needs to be restarted multiple times to download all weights. Requires an A6000 or A100 or H100.
Zephyr 141B - a Mixtral 8x22B fine-tune
DRBX Instruct
Smaug 34B Chat (a Yi fine-tune) - fits in bf16 on an A100. BEWARE that guardrails are weaker on this model than Yi. As such, it may be best suited for structured generation
TowerInstruct 13B (multi-lingual Llama 2 fine-tune) - needs ~30 GB to run in bf16 (fits on an A6000). Add --quantize eetq to run with under 15 GB of VRAM (e.g. A600000).
Yi 34B Chat - fits in 16-bit on an A100
Gemma Chat 9B.
Notux 8x7B AWQ. Requires an A6000 or A100 or H100.
CodeLlama 70B Instruct - 4bit AWQ, CodeLlama 70B Instruct - 4bit bitsandbytes. Requires an A6000 or A100 or H100.
Mamba Instruct OpenHermes
Llama 70B API by TrelisResearch.
Deepseek Coder 33B Template.
Medusa Vicuna (high speed speculative decoding - mostly a glamour template because OpenChat with AWQ is better quality and faster)

llama.cpp One-click templates:

MoonDream Multi-modal API (openai-ish)

Moondream2 - a small but accurate model for querying images

Vast AI One-Click Templates

Tip

To support the Trelis Research YouTube channel, you can sign up for an account with this affiliate link. Trelis is supported by a commission when you use one-click templates.

Fine-tuning Notebook Setup

CUDA 12.1 one-click template here.

Text Generation Inference (fastest):

Mistral 7B api

vLLM (requires an A100 or H100 or A6000, i.e. ampere architecture):

Mistral 7B v0.2 AWQ
Post a new issue if you would like other templates

llama.cpp One-click templates:

Mistral 7B Instruct v0.2 8-bit

Function-calling One-Click Templates

One-click templates for function-calling are located on the HuggingFace model cards. Check out the collection here.

Tip

As of July 23rd 2024, function calling fine-tuned models are being deprecated in favour of a one-shot approach with stronger models. Find the "Tool Use" video on the Trelis YouTube Channel for more info.

Changelog

20Jul2023:

Update the ./llama-server.sh command in line with breaking changes to llama.cpp

Feb 16 2023:

Added a Mamba one click template.

Jan 21 2023:

Swapped Runpod to before Vast.AI as user experience is much better with Runpod.

Jan 9 2023:

Added Mixtral Instruct AWQ TGI

Dec 30 2023:

Support gated models by adding HUGGING_FACE_HUB_TOKEN env variable.
Speed up downloading using HuggingFace API.

Dec 29 2023:

Add in one-click llama.cpp server template.

Name		Name	Last commit message	Last commit date
Latest commit History 57 Commits
docker/llamacpp		docker/llamacpp
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

one-click-llms

Runpod One-Click Templates

GPU Choices and Tips

Fine-tuning Notebook Setup

vLLM (requires an A100 or H100 or A6000, i.e. ampere architecture):

Text Generation Inference:

llama.cpp One-click templates:

MoonDream Multi-modal API (openai-ish)

Vast AI One-Click Templates

Fine-tuning Notebook Setup

Text Generation Inference (fastest):

vLLM (requires an A100 or H100 or A6000, i.e. ampere architecture):

llama.cpp One-click templates:

Function-calling One-Click Templates

Changelog

About

Releases

Packages

Languages

TrelisResearch/one-click-llms

Folders and files

Latest commit

History

Repository files navigation

one-click-llms

Runpod One-Click Templates

GPU Choices and Tips

Fine-tuning Notebook Setup

vLLM (requires an A100 or H100 or A6000, i.e. ampere architecture):

Text Generation Inference:

llama.cpp One-click templates:

MoonDream Multi-modal API (openai-ish)

Vast AI One-Click Templates

Fine-tuning Notebook Setup

Text Generation Inference (fastest):

vLLM (requires an A100 or H100 or A6000, i.e. ampere architecture):

llama.cpp One-click templates:

Function-calling One-Click Templates

Changelog

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages