HuggingFace Text Generation Inference deployments in K8s/OpenShift

rcarrat-AI/hftgi-llms

Deploying HuggingFace TGI for Open Source LLMs in K8s / OpenShift with GitOps

This repository aims to make it easy to deploy open-source LLMs on OpenShift or Kubernetes clusters using GitOps.

Overview

Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and T5.

This repo deploys HuggingFace Text Generation Inference servers on K8s/OpenShift with GitOps.

This makes it easy to deploy different open-source LLMs, such as Llama2, Falcon, Mistral or FlanT5-XL among others, in OpenShift / Kubernetes clusters, where they can be consumed like any other application.

Requirements

  • ROSA or OpenShift clusters (can also be deployed in K8s with some tweaks)
  • GPU available (24 GB VRAM recommended)
  • Node Feature Discovery Operator
  • NVIDIA GPU Operator
  • ArgoCD / OpenShift GitOps

Tested on an A10G GPU (g5.2xlarge, Spot Instances) in a ROSA cluster running version 4.13, with RHODS 2.14.0.
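The GPU requirement can be checked before deploying anything. The sketch below is one possible way to do it and is not part of this repository: it parses the output of `kubectl get nodes -o json` and lists the nodes that advertise the `nvidia.com/gpu` resource, which is the name the NVIDIA device plugin registers once the GPU Operator is running. The function name `gpu_nodes` is made up for illustration.

```python
import json
from typing import Dict

# Resource name registered by the NVIDIA device plugin.
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_nodes(nodes_json: str) -> Dict[str, int]:
    """Map node name -> allocatable GPU count, for nodes exposing at least one GPU.

    `nodes_json` is expected to be the text printed by `kubectl get nodes -o json`.
    """
    result: Dict[str, int] = {}
    for node in json.loads(nodes_json).get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes reports extended resources as strings, e.g. "1".
        count = int(allocatable.get(GPU_RESOURCE, "0"))
        if count > 0:
            result[node["metadata"]["name"]] = count
    return result
```

To run it against a live cluster, pass in the standard output of `subprocess.run(["kubectl", "get", "nodes", "-o", "json"], capture_output=True, text=True, check=True)`. An empty result means no node currently exposes a GPU to the scheduler.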

Models available to deploy using GitOps

kubectl apply -k gitops/mistral

kubectl apply -k gitops/flant5xxl

kubectl apply -k gitops/falcon

kubectl apply -k gitops/llama2

NOTE: to download this model, HUGGING_FACE_HUB_TOKEN_BASE64 must be set in a Secret.

export HUGGING_FACE_HUB_TOKEN_BASE64=$(echo -n 'your-token-value' | base64)
envsubst < hg-tgi/overlays/llama2-7b/hf-token-secret-template.yaml > /tmp/hf-token-secret.yaml
kubectl apply -f /tmp/hf-token-secret.yaml -n llms
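The token is base64-encoded because values under `data` in a Kubernetes Secret must be base64 strings, and `echo -n` matters because a trailing newline would otherwise become part of the token. The following Python sketch mirrors the two shell steps above so the transformation is easier to see. The template text, the Secret name and the `HUGGING_FACE_HUB_TOKEN` key are assumptions for illustration only; the real `hf-token-secret-template.yaml` in this repository may look different.

```python
import base64
from string import Template

# Hypothetical stand-in for hf-token-secret-template.yaml; the real file may differ.
SECRET_TEMPLATE = Template("""\
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret   # assumed name
type: Opaque
data:
  HUGGING_FACE_HUB_TOKEN: ${HUGGING_FACE_HUB_TOKEN_BASE64}
""")


def encode_token(token: str) -> str:
    """Equivalent of `echo -n TOKEN | base64`: encode without a trailing newline."""
    return base64.b64encode(token.encode("utf-8")).decode("ascii")


def render_secret(token: str) -> str:
    """Equivalent of the envsubst step: fill the placeholder in the template."""
    return SECRET_TEMPLATE.substitute(
        HUGGING_FACE_HUB_TOKEN_BASE64=encode_token(token)
    )
```

Calling `render_secret('your-token-value')` returns the manifest text that would then be applied with `kubectl apply -f`.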

kubectl apply -k gitops/codellama

kubectl apply -k gitops/starcoder

NOTE: to download this model, a Hugging Face token (HF_TOKEN) must be set in a Secret. The commands below pass it in as HUGGING_FACE_HUB_TOKEN_BASE64.

export HUGGING_FACE_HUB_TOKEN_BASE64=$(echo -n 'your-token-value' | base64)
envsubst < hg-tgi/overlays/llama2-7b/hf-token-secret-template.yaml > /tmp/hf-token-secret.yaml
kubectl apply -f /tmp/hf-token-secret.yaml -n llms
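Each model above is deployed the same way: `kubectl apply -k` is pointed at that model's kustomize directory. As a rough sketch, the mapping can be wrapped in a small helper so that a typo in the model name fails early instead of reaching the cluster. The directory names are copied from the commands above; the helper names `apply_command` and `deploy` are hypothetical and not part of this repository.

```python
import subprocess
from typing import List

# Kustomize directories taken from the commands in this README.
MODEL_DIRS = {
    "mistral": "gitops/mistral",
    "flant5xxl": "gitops/flant5xxl",
    "falcon": "gitops/falcon",
    "llama2": "gitops/llama2",
    "codellama": "gitops/codellama",
    "starcoder": "gitops/starcoder",
}


def apply_command(model: str) -> List[str]:
    """Build the `kubectl apply -k` command for a known model name."""
    try:
        directory = MODEL_DIRS[model]
    except KeyError:
        raise ValueError(
            f"unknown model {model!r}; choose one of {sorted(MODEL_DIRS)}"
        ) from None
    return ["kubectl", "apply", "-k", directory]


def deploy(model: str) -> None:
    """Run the command; raises CalledProcessError if kubectl exits non-zero."""
    subprocess.run(apply_command(model), check=True)
```

For example, `deploy("mistral")` runs the same command as the first entry in this section. It has to be run from the repository root so that the relative paths resolve.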

Inference to the LLMs

  • Check the Inference Guide to test your LLM deployed with Hugging Face Text Generation Inference
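As a quick smoke test independent of the guide, a deployed TGI server can be queried over HTTP. TGI exposes a `/generate` endpoint that accepts a JSON body with `inputs` and optional `parameters`, and returns a JSON object containing `generated_text`. The sketch below uses only the Python standard library. The base URL is an assumption, since it depends on how the Service or Route is exposed in your cluster.

```python
import json
import urllib.request


def build_generate_request(
    base_url: str, prompt: str, max_new_tokens: int = 64
) -> urllib.request.Request:
    """Build a POST request for TGI's /generate endpoint (no network access here)."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(
    base_url: str, prompt: str, max_new_tokens: int = 64, timeout: float = 60.0
) -> str:
    """Send the request and return the generated text."""
    request = build_generate_request(base_url, prompt, max_new_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["generated_text"]
```

For example, `generate("http://localhost:8080", "What is GitOps?")` would work after port-forwarding the TGI Service to local port 8080. Both the port and the forwarding step are assumptions about your setup.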

FrontEnd Gradio ChatBot powered by HF-TGI

Alongside HF-TGI, we deploy a Gradio ChatBot application with memory, powered by LangChain.

This FrontEnd uses the deployed HF-TGI as its backend, which powers the NLP chat capabilities of the chatbot app.
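To illustrate what "memory" means here, the sketch below keeps previous turns and replays them in the prompt sent to the backend on each new question. It is a dependency-free illustration of the idea, not the actual LangChain or Gradio code used by the chatbot. The class name `ChatMemory`, the `User:` / `Assistant:` prompt format and the `max_turns` window are all assumptions.

```python
from typing import Callable, List, Tuple


class ChatMemory:
    """Keeps (user, assistant) turns and renders them into a single prompt."""

    def __init__(self, max_turns: int = 5) -> None:
        self.max_turns = max_turns
        self.turns: List[Tuple[str, str]] = []

    def build_prompt(self, question: str) -> str:
        """Replay the most recent turns, then append the new question."""
        recent = self.turns[-self.max_turns:] if self.max_turns > 0 else []
        lines: List[str] = []
        for user, assistant in recent:
            lines.append(f"User: {user}")
            lines.append(f"Assistant: {assistant}")
        lines.append(f"User: {question}")
        lines.append("Assistant:")
        return "\n".join(lines)

    def ask(self, question: str, llm: Callable[[str], str]) -> str:
        """Query the backend with history included and remember the exchange."""
        answer = llm(self.build_prompt(question)).strip()
        self.turns.append((question, answer))
        return answer
```

Here `llm` can be any function that takes a prompt and returns text, for instance a thin wrapper around an HTTP call to the HF-TGI server. Capping the history at `max_turns` keeps the prompt from growing past the model's context window.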

Once deployed, the Gradio ChatBot connects directly to the HF-TGI server that serves the LLM of your choice (see the models listed above) and answers your questions.

NOTE: If you want to know more, check the original source in the rh-aiservices-bu repository.

Extra Notes
