Hello OpenChat Team,
First and foremost, I would like to express my sincere appreciation for your work on OpenChat 3.5. It's been a go-to model for my projects, and I'm truly impressed by its functionality and performance.
I'm reaching out with a couple of queries related to model optimization for OpenChat 3.5, particularly in the context of vLLM and TensorRT. The README.md notes the use of vLLM for optimizing the API server, which sparked my interest in a deeper comparison.
My primary question is:
Has there been any detailed performance comparison between vLLM and TensorRT for the OpenChat 3.5 model? I'm keen to understand their relative throughput, latency, and memory use in practical serving scenarios.
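In case it helps frame the question, here is the kind of rough throughput measurement I had in mind on the vLLM side. This is only a sketch: `run_benchmark`, the prompt count, and the `max_tokens` value are my own choices, and the vLLM call itself requires a CUDA GPU.

```python
import time


def tokens_per_second(total_tokens: int, elapsed: float) -> float:
    """Generated-token throughput for a batch of completions."""
    return total_tokens / elapsed


def run_benchmark(model: str = "openchat/openchat_3.5", n_prompts: int = 32) -> float:
    """Time one batched generate() call with vLLM and report tokens/sec.

    Requires a CUDA GPU with vllm installed, so the import is kept
    local to this function.
    """
    from vllm import LLM, SamplingParams

    llm = LLM(model=model)
    params = SamplingParams(max_tokens=256)
    # Single-turn prompt in the OpenChat 3.5 "GPT4 Correct" template.
    prompts = [
        "GPT4 Correct User: Hello!<|end_of_turn|>GPT4 Correct Assistant:"
    ] * n_prompts

    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start

    total = sum(len(o.outputs[0].token_ids) for o in outputs)
    return tokens_per_second(total, elapsed)
```

Running the same measurement against a TensorRT-built engine of the model is the comparison I'm after, but I haven't found any published numbers.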
Additionally, I'm exploring the possibility of model quantization:
Is there a way to convert the OpenChat 3.5 weights to FP16 or BF16 and then serve them with vLLM? If so, has anyone gone through this process, or can you offer guidance on how to approach it?
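To make this question concrete, here is roughly what I am hoping is possible. My understanding (please correct me if I'm wrong) is that FP16/BF16 is less a separate quantization pass than a load-time precision choice, which vLLM appears to expose via the `dtype` argument to its `LLM` constructor. The helper names `build_openchat_prompt` and `load_half_precision` below are just my own illustrative choices:

```python
def build_openchat_prompt(user_message: str) -> str:
    """Format a single-turn prompt in the OpenChat 3.5 'GPT4 Correct' template."""
    return f"GPT4 Correct User: {user_message}<|end_of_turn|>GPT4 Correct Assistant:"


def load_half_precision(model: str = "openchat/openchat_3.5",
                        dtype: str = "bfloat16"):
    """Load the model with vLLM in reduced precision.

    dtype accepts values such as "float16" or "bfloat16". Loading
    requires a CUDA GPU with vllm installed, so the import is kept
    local to this function.
    """
    from vllm import LLM
    return LLM(model=model, dtype=dtype)
```

If this is all that's needed for FP16/BF16, then my remaining question is whether lower-bit quantization (e.g. INT8/INT4) has a recommended path for this model.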
Any pointers to relevant benchmarks, studies, or documentation would be immensely helpful. As someone still getting familiar with LLMs and their optimization techniques, this information matters a great deal for my ongoing projects.
Thank you for your time and the remarkable effort put into this project.
Best regards,
Sathya