⭐️ https://api.NCSA.ai/ ⭐
Free & unbelievably easy LLaMA-2 inference for everyone at NCSA!
- It’s an API: I host it, you use it. Quick and easy for jobs big and small.
- Access it however you like: Python client, Curl/Postman, or a full web interface playground.
- It’s running on NCSA Center for AI Innovation GPUs, and is fully private & secure thanks to HTTPS connections via Cloudflare Zero Trust Tunnels.
- It works with LangChain 🦜🔗 (see the sketch after the curl example below)
Beautiful implementation detail: it’s a perfect clone of the OpenAI API, making my version a drop-in replacement for OpenAI calls (except embeddings). Say goodbye to huge OpenAI bills! 💰
📜 I wrote beautiful usage docs & examples here 👀 It literally couldn’t be simpler to use 😇
🐍 In Python, it’s literally this easy:
import openai # pip install openai
openai.api_key = "irrelevant" # must be non-empty
# 👉 ONLY CODE CHANGE: use our GPUs instead of OpenAI's 👈
openai.api_base = "https://api.kastan.ai/v1"
# exact same API as normal!
completion = openai.Completion.create(
    model="llama-2-7b",
    prompt="What's the capital of France?",
    max_tokens=200,
    temperature=0.7,
    stream=True)
# ⚡️⚡️ streaming
for token in completion:
    print(token.choices[0].text, end='')
🌐 Or from the command line:
curl https://api.kastan.ai/v1/completions \
-H 'Content-Type: application/json' \
-d '{ "prompt": "What is the capital of France?", "echo": true }'
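🦜🔗 And since it mirrors the OpenAI API, LangChain needs only the same one-line change. A minimal sketch, assuming the legacy langchain.llms.OpenAI wrapper (field names follow the 0.0.x releases):
from langchain.llms import OpenAI  # pip install langchain

llm = OpenAI(
    model_name="llama-2-7b",
    openai_api_key="irrelevant",                  # must be non-empty
    openai_api_base="https://api.kastan.ai/v1",   # 👉 same trick: point LangChain at our GPUs
    max_tokens=200,
    temperature=0.7)
print(llm("What is the capital of France?"))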
- 🧠⚡️ Flawless API support for the best LLM of the day. An exact clone of the OpenAI API, making it a drop-in replacement.
- 🤗 Support for 100% of the models on the HuggingFace Hub. Some will be easier to use than others.
⭐️ S-Tier: For the best text LLM of the day, currently LLaMA-2 or Mistral, we offer persistent, ultra-low-latency inference with customized, fused CUDA kernels. This is suitable for building other applications on top of. Any app can now easily and reliably benefit from intelligence.
🥇 A-Tier: If you want a particular LLM from the list of popular supported ones, that's fine too. They all have optimized CUDA inference kernels.
👍 B-Tier: Most models on the HuggingFace Hub, i.e. all those that support AutoModel() and/or pipeline(). The only downside here is cold starts: downloading the model and loading it onto a GPU.
✨ C-Tier: Models that require custom pre/post-processing code: just supply your own load() and run() functions, typically copy-pasted from the README of a HuggingFace model card (a hedged sketch follows after this list). Docs to come.
❌ F-Tier: The current status quo: every researcher doing this independently. It's slow, painful and usually extremely compute-wasteful.
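As promised, here's a hedged sketch of what a C-Tier load() / run() pair could look like. The hook names, signatures, and model id are illustrative placeholders until the docs land:
from transformers import AutoModelForCausalLM, AutoTokenizer  # pip install transformers

def load():
    # Runs once per cold start; whatever you return is handed to run().
    tokenizer = AutoTokenizer.from_pretrained("my-org/my-custom-model")  # placeholder model id
    model = AutoModelForCausalLM.from_pretrained("my-org/my-custom-model").to("cuda")
    return model, tokenizer

def run(state, prompt):
    # Custom pre/post-processing lives here, usually copy-pasted from the model card README.
    model, tokenizer = state
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    output_ids = model.generate(**inputs, max_new_tokens=200)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)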
Limitations with WIP solutions:
- CUDA OOM errors: If your model doesn't fit on our 4xA40 (48 GB each) server, we return an error. Coming soon, we should fall back to Accelerate's ZeRO stage-3 (CPU/disk offload), and/or allow a quantization flag, load_in_8bit=True or load_in_4bit=True (see the sketch at the end of this list).
- Multi-node support: Currently it's only designed to load-balance within a single node; soon we should use Ray Serve to support arbitrary heterogeneous nodes.
- Advanced batching: when the queue contains separate requests for the same model, batch them and run all jobs requesting that model before moving on to the next model (with a max of 15-20 minutes for any one model in memory if other jobs are waiting in the queue). This should balance efficiency, i.e. batching, with fairness, i.e. FIFO queuing.
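For the CUDA OOM fallback, the planned quantization flag would presumably map onto the standard transformers + bitsandbytes options. A sketch under that assumption (not yet wired into the API):
from transformers import AutoModelForCausalLM  # pip install transformers accelerate bitsandbytes

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",    # let accelerate shard layers across the 4xA40s
    load_in_8bit=True)    # or load_in_4bit=True to shrink memory further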