vLLM-simulation

This is the accompanying repository to the blog Throughput is all you need. The code here simulates a chat application where a user engages with an LLM-powered bot in a multi-turn conversation.

Here, we use OpenHermes-2.5-Mistral-7B, a fine-tuned flavor of the Mistral 7B model developed by Teknium. You can find more information about the model here.
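The actual driver is simulation.py, but the sketch below shows roughly what one simulated multi-turn conversation against the OpenAI-compatible server from Step 1 could look like. It assumes vLLM's default port 8000; the prompts, the run_conversation helper, and the use of the requests library are illustrative and are not taken from this repository.

import requests

BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "teknium/OpenHermes-2.5-Mistral-7B"

def run_conversation(user_turns):
    # Send each user turn and append the bot's reply to the running history,
    # so every request carries the full multi-turn context.
    messages = []
    for turn in user_turns:
        messages.append({"role": "user", "content": turn})
        resp = requests.post(
            BASE_URL,
            json={"model": MODEL, "messages": messages},
            timeout=120,
        )
        resp.raise_for_status()
        reply = resp.json()["choices"][0]["message"]["content"]
        messages.append({"role": "assistant", "content": reply})
    return messages

if __name__ == "__main__":
    history = run_conversation([
        "Hi! Can you explain what continuous batching is?",
        "How does that help GPU throughput?",
    ])
    for m in history:
        print(f"{m['role']}: {m['content'][:80]}")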

Setup

Prerequisites

You need to have the following tools set up on your system:

  1. git
  2. Docker CLI and Docker Compose
  3. Python 3

Step 0: Setup the environment

git clone https://github.com/cmeraki/vllm-simulation.git
cd vllm-simulation
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Step 1: Run the simulation

To replicate the experiments from the blog, you need three running processes.

  1. vLLM serving an LLM on an OpenAI-compatible server. After running step 0, run this command in a new terminal window (a quick way to verify the server is up is sketched after this list):
python -m vllm.entrypoints.openai.api_server --model teknium/OpenHermes-2.5-Mistral-7B --max-model-len 8192 --disable-log-requests

  2. Monitoring setup: To set up the monitoring, follow the steps mentioned here. This is important for visualizing the metrics. To run the code from the link above, you can either download the code or clone that repository itself.

  3. Simulation: To finally run the simulation, open a new terminal window in this repository's directory and run the following command:

python simulation.py --model teknium/OpenHermes-2.5-Mistral-7B -n 50 -l 10 -u 11
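Before launching the simulation, you can optionally check that the vLLM server from step 1 is reachable by listing the served models. This snippet is not part of the repository and assumes vLLM's default port 8000:

import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
# The list should include teknium/OpenHermes-2.5-Mistral-7B
print([m["id"] for m in resp.json()["data"]])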

You can customize the following arguments to simulation.py.

usage: simulation.py [-h] [--model MODEL] [--uri URI] [--port PORT] [-r R] [-n N] [-l L] [-u U]

options:
  -h, --help     show this help message and exit
  --model MODEL  Model name that is called for inference
  --uri URI      URI where the model is available for inference
  --port PORT    Port where the model is available for inference
  -r R           Number of requests per second
  -n N           Number of requests to run
  -l L           Lower bound of conversations in a single chat
  -u U           Upper bound of conversations in a single chat
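For example, to explicitly point the simulation at the vLLM server's host and port and control the request rate, an invocation could look like the following (the --uri and --port values here are illustrative; the exact format --uri expects is defined by simulation.py):

python simulation.py --model teknium/OpenHermes-2.5-Mistral-7B --uri localhost --port 8000 -r 1 -n 50 -l 10 -u 11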

Once you have run all the steps successfully, you can visualize the metrics at http://localhost:3000.

Appendix

The simulation was run on an Nvidia RTX 4090 GPU.
