
api.NCSA.ai - LLMs for all

⭐️ https://api.NCSA.ai/

Free & unbelievably easy LLaMA-2 inference for everyone at NCSA!

  • It’s an API: I host it, you use it. Quick and easy for jobs big and small.
  • Access it however you like: Python client, Curl/Postman, or a full web interface playground.
  • It’s running on NCSA Center of AI Innovation GPUs, and is fully private & secure thanks to HTTPS connections via Cloudflare Zero Trust tunnels.
  • It works with LangChain 🦜🔗 (example below).
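
Because it clones the OpenAI API, LangChain’s stock OpenAI wrapper can point straight at our server. A minimal sketch, assuming the classic langchain (v0.0.x) package; the parameter names below come from that wrapper, not from this repo:

# Minimal LangChain sketch (assumes langchain v0.0.x and its OpenAI wrapper)
from langchain.llms import OpenAI  # pip install langchain

llm = OpenAI(
    openai_api_key="irrelevant",                 # must be non-empty
    openai_api_base="https://api.kastan.ai/v1",  # route calls to our GPUs
    model_name="llama-2-7b",
)
print(llm("What is the capital of France?"))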

Beautiful implementation detail: it’s a perfect clone of the OpenAI API, making my version a drop-in replacement for OpenAI calls (except embeddings). Say goodbye to huge OpenAI bills! 💰

Usage

📜 I wrote beautiful usage docs & examples here 👀 It literally couldn’t be simpler to use 😇

🐍 In Python, it’s literally this easy:

import openai # pip install openai
openai.api_key = "irrelevant" # must be non-empty

# 👉 ONLY CODE CHANGE: use our GPUs instead of OpenAI's 👈
openai.api_base = "https://api.kastan.ai/v1"

# exact same api as normal!
completion = openai.Completion.create(
    model="llama-2-7b",
    prompt="What's the capitol of France?",
    max_tokens=200,
    temperature=0.7,
    stream=True)

# ⚡️⚡️ streaming
for token in completion:
  print(token.choices[0].text, end='')
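
If you’d rather not stream, the same call works with stream=False; a minimal sketch, assuming the server mirrors OpenAI’s non-streaming response shape (as the drop-in claim implies):

# same client setup as above, just without streaming
completion = openai.Completion.create(
    model="llama-2-7b",
    prompt="What's the capital of France?",
    max_tokens=200,
    temperature=0.7,
    stream=False)

print(completion.choices[0].text)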

🌐 Or from the command line:

curl https://api.kastan.ai/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{ "prompt": "What is the capital of France?", "echo": true }'

UX Design Goals 🎯

  1. 🧠⚡️ Flawless API support for the best LLM of the day.

    An exact clone of the OpenAI API, making it a drop-in replacement.

  2. 🤗 Support for 100% of the models on HuggingFace Hub.

    Some will be easier to use than others.

Towards 100% Coverage of HuggingFace Hub

⭐️ S-Tier: For the best text LLM of the day, currently LLaMA-2 or Mistral, we offer persistent, ultra-low-latency inference with customized, fused CUDA kernels. This is suitable to build other applications on top of. Any app can now easily and reliably benefit from intelligence.

🥇 A-Tier: If you want a particular LLM from the list of popular supported models, that’s fine too. They all have optimized CUDA inference kernels.

👍 B-Tier: Most models on the HuggingFace Hub, i.e. everything that supports AutoModel() and/or pipeline(). The only downside here is cold starts: downloading the model and loading it onto a GPU.

✨ C-Tier: For models that require custom pre/post-processing code, just supply your own load() and run() functions, typically copy-pasted from the README of a HuggingFace model card. Docs to come; a hypothetical sketch follows this list.

❌ F-Tier: The current status quo: every researcher doing this independently. It's slow, painful and usually extremely compute-wasteful.
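
To make the C-Tier idea concrete before the docs land, here’s a hypothetical sketch of those hooks. Only the names load() and run() come from the tier description above; the example model, signatures, and wiring are assumptions:

# Hypothetical C-Tier hooks; shape copied from a typical HuggingFace model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def load():
    # Called once at cold start; return whatever run() needs.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.to("cuda" if torch.cuda.is_available() else "cpu")
    return tokenizer, model

def run(state, prompt):
    # Called once per request, with whatever load() returned.
    tokenizer, model = state
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=200)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)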

(Figure: LLM server priorities)

Technical Design

Limitations with WIP solutions:

  • CUDA OOM errors: if your model doesn’t fit on our 4xA40 (48 GB each) server, we return an error. Coming soon, we should fall back to Accelerate’s ZeRO stage-3 (CPU/disk offload), and/or allow a quantization flag, load_in_8bit=True or load_in_4bit=True (see the sketch after this list).
  • Multi-node support: currently the system only load-balances within a single node; soon we should use Ray Serve to support arbitrary heterogeneous nodes.
  • Advanced batching: when the queue contains separate requests for the same model, batch them and run all jobs requesting that model before moving on to the next model, capping any one model’s time in memory at 15-20 minutes whenever other jobs are waiting in the queue. This should balance efficiency (batching) with fairness (FIFO queuing); see the scheduling sketch below.
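
A hypothetical sketch of that quantization fallback (load_in_8bit is a real transformers from_pretrained flag of this era; the retry wiring, function name, and error handling are assumptions):

# Hypothetical CUDA-OOM fallback: retry with 8-bit quantization.
# Requires accelerate + bitsandbytes for load_in_8bit.
import torch
from transformers import AutoModelForCausalLM

def load_with_fallback(model_name):
    try:
        return AutoModelForCausalLM.from_pretrained(model_name).to("cuda")
    except torch.cuda.OutOfMemoryError:
        # Model doesn't fit at full precision; retry quantized to 8-bit.
        return AutoModelForCausalLM.from_pretrained(
            model_name, device_map="auto", load_in_8bit=True)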
(Figure: api.kastan.ai routing design)
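
And a hypothetical sketch of the batching-vs-fairness queue policy; every name here is an assumption, not the server’s actual code:

# Hypothetical sketch of the batching-vs-fairness scheduling policy.
import time
from collections import deque

MAX_RESIDENCY_SEC = 15 * 60  # cap any one model's time in memory (~15-20 min)

def drain(queue: deque, run_batch):
    # queue holds (model_name, request) pairs in FIFO order;
    # run_batch(model_name, requests) performs one batched inference call.
    while queue:
        model = queue[0][0]  # FIFO: the oldest request picks the next model
        deadline = time.monotonic() + MAX_RESIDENCY_SEC
        while any(m == model for m, _ in queue):
            others_waiting = any(m != model for m, _ in queue)
            if others_waiting and time.monotonic() > deadline:
                # Fairness: requeue this model's leftovers behind other work.
                leftovers = [item for item in queue if item[0] == model]
                for item in leftovers:
                    queue.remove(item)
                queue.extend(leftovers)
                break
            batch = [req for m, req in queue if m == model]
            run_batch(model, batch)  # efficiency: batch same-model jobs
            for item in [it for it in queue if it[0] == model]:
                queue.remove(item)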

About

Serve LLMs on NCSA hardware. Support the best FOSS models, and the long tail on HuggingFace Hub.
