⭐️ https://api.NCSA.ai/ ⭐
Free & unbelievably easy LLaMA-2 inference for everyone at NCSA!
- It’s an API: I host it, you use it. Quick and easy for jobs big and small.
- Access it however you like: Python client, Curl/Postman, or a full web interface playground.
- It’s running on NCSA Center for AI Innovation GPUs, and is fully private & secure thanks to HTTPS connections via Cloudflare Zero Trust Tunnels.
- It works with LangChain 🦜🔗 (see the sketch after the curl example below)
Beautiful implementation detail: it’s a perfect clone of the OpenAI API, making my version a drop-in replacement for OpenAI calls (except embeddings). Say goodbye to huge OpenAI bills! 💰
📜 I wrote beautiful usage docs & examples here 👀 It literally couldn’t be simpler to use 😇
🐍 In Python, it’s literally this easy:
import openai # pip install openai
openai.api_key = "irrelevant" # must be non-empty
# 👉 ONLY CODE CHANGE: use our GPUs instead of OpenAI's 👈
openai.api_base = "https://api.kastan.ai/v1"
# exact same API as normal!
completion = openai.Completion.create(
    model="llama-2-7b",
    prompt="What's the capital of France?",
    max_tokens=200,
    temperature=0.7,
    stream=True)
# ⚡️⚡️ streaming
for token in completion:
    print(token.choices[0].text, end='')
🌐 Or from the command line:
curl https://api.kastan.ai/v1/completions \
-H 'Content-Type: application/json' \
-d '{ "prompt": "What is the capital of France?", "echo": true }'
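🦜🔗 And since it mirrors the OpenAI API, LangChain needs only the same one-line change. A minimal sketch, assuming the legacy langchain.llms.OpenAI wrapper (field names follow the 0.0.x releases):
from langchain.llms import OpenAI  # pip install langchain

llm = OpenAI(
    model_name="llama-2-7b",
    openai_api_key="irrelevant",                  # must be non-empty
    openai_api_base="https://api.kastan.ai/v1",   # 👉 same trick: point LangChain at our GPUs
    max_tokens=200,
    temperature=0.7)
print(llm("What is the capital of France?"))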
- 🧠⚡️ Flawless API support for the best LLM of the day. An exact clone of the OpenAI API, making it a drop-in replacement.
- 🤗 Support for 100% of the models on the HuggingFace Hub. Some will be easier to use than others.
⭐️ S-Tier: For the best text LLM of the day, currently LLaMA-2 or Mistral, we offer persistent, ultra-low-latency inference with customized, fused CUDA kernels. This is suitable for building other applications on top of. Any app can now easily and reliably benefit from intelligence.
🥇 A-Tier: If you want a particular LLM from the list of popular supported ones, that's fine too. They all have optimized CUDA inference kernels.
👍 B-Tier: Most models on the HuggingFace Hub, i.e. all those that support AutoModel() and/or pipeline(). The only downside here is cold starts: downloading the model and loading it onto a GPU.
✨ C-Tier: Models that require custom pre/post-processing code: just supply your own load() and run() functions, typically copy-pasted from the README of a HuggingFace model card (a hedged sketch follows after this list). Docs to come.
❌ F-Tier: The current status quo: every researcher doing this independently. It's slow, painful and usually extremely compute-wasteful.
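As promised, here's a hedged sketch of what a C-Tier load() / run() pair could look like. The hook names, signatures, and model id are illustrative placeholders until the docs land:
from transformers import AutoModelForCausalLM, AutoTokenizer  # pip install transformers

def load():
    # Runs once per cold start; whatever you return is handed to run().
    tokenizer = AutoTokenizer.from_pretrained("my-org/my-custom-model")  # placeholder model id
    model = AutoModelForCausalLM.from_pretrained("my-org/my-custom-model").to("cuda")
    return model, tokenizer

def run(state, prompt):
    # Custom pre/post-processing lives here, usually copy-pasted from the model card README.
    model, tokenizer = state
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    output_ids = model.generate(**inputs, max_new_tokens=200)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)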
Limitations with WIP solutions:
- CUDA OOM errors: If your model doesn't fit on our 4xA40 (48 GB each) server, we return an error. Coming soon, we should fall back to Accelerate's ZeRO stage-3 (CPU/disk offload), and/or allow a quantization flag, load_in_8bit=True or load_in_4bit=True (see the sketch at the end of this list).
- Multi-node support: Currently it's only designed to load-balance within a single node; soon we should use Ray Serve to support arbitrary heterogeneous nodes.
- Advanced batching: when the queue contains separate requests for the same model, batch them and run all jobs requesting that model before moving on to the next model (with a max of 15-20 minutes for any one model in memory if other jobs are waiting in the queue). This should balance efficiency, i.e. batching, with fairness, i.e. FIFO queuing.
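For the CUDA OOM fallback, the planned quantization flag would presumably map onto the standard transformers + bitsandbytes options. A sketch under that assumption (not yet wired into the API):
from transformers import AutoModelForCausalLM  # pip install transformers accelerate bitsandbytes

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",    # let accelerate shard layers across the 4xA40s
    load_in_8bit=True)    # or load_in_4bit=True to shrink memory further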