Hello OpenChat Team,
First and foremost, I would like to express my sincere appreciation for your work on OpenChat 3.5. It's been a go-to model for my projects, and I'm truly impressed by its functionality and performance.
I'm reaching out with a couple of queries related to model optimization for OpenChat 3.5, particularly in the context of vLLM and TensorRT. The README.md notes the use of vLLM for optimizing the API server, which sparked my interest in a deeper comparison.
My primary question is:
Has there been any detailed performance comparison between vLLM and TensorRT for the OpenChat 3.5 model? I'm keen to understand their relative throughput, latency, and memory use in practical serving scenarios.
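In case it helps frame the question, here is the kind of rough throughput measurement I had in mind on the vLLM side. This is only a sketch: `run_benchmark`, the prompt count, and the `max_tokens` value are my own choices, and the vLLM call itself requires a CUDA GPU.

```python
import time


def tokens_per_second(total_tokens: int, elapsed: float) -> float:
    """Generated-token throughput for a batch of completions."""
    return total_tokens / elapsed


def run_benchmark(model: str = "openchat/openchat_3.5", n_prompts: int = 32) -> float:
    """Time one batched generate() call with vLLM and report tokens/sec.

    Requires a CUDA GPU with vllm installed, so the import is kept
    local to this function.
    """
    from vllm import LLM, SamplingParams

    llm = LLM(model=model)
    params = SamplingParams(max_tokens=256)
    # Single-turn prompt in the OpenChat 3.5 "GPT4 Correct" template.
    prompts = [
        "GPT4 Correct User: Hello!<|end_of_turn|>GPT4 Correct Assistant:"
    ] * n_prompts

    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start

    total = sum(len(o.outputs[0].token_ids) for o in outputs)
    return tokens_per_second(total, elapsed)
```

Running the same measurement against a TensorRT-built engine of the model is the comparison I'm after, but I haven't found any published numbers.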
Additionally, I'm exploring the possibility of model quantization:
Is there a way to convert the OpenChat 3.5 weights to FP16 or BF16 and then serve them with vLLM? If so, has anyone gone through this process, or can you offer guidance on how to approach it?
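To make this question concrete, here is roughly what I am hoping is possible. My understanding (please correct me if I'm wrong) is that FP16/BF16 is less a separate quantization pass than a load-time precision choice, which vLLM appears to expose via the `dtype` argument to its `LLM` constructor. The helper names `build_openchat_prompt` and `load_half_precision` below are just my own illustrative choices:

```python
def build_openchat_prompt(user_message: str) -> str:
    """Format a single-turn prompt in the OpenChat 3.5 'GPT4 Correct' template."""
    return f"GPT4 Correct User: {user_message}<|end_of_turn|>GPT4 Correct Assistant:"


def load_half_precision(model: str = "openchat/openchat_3.5",
                        dtype: str = "bfloat16"):
    """Load the model with vLLM in reduced precision.

    dtype accepts values such as "float16" or "bfloat16". Loading
    requires a CUDA GPU with vllm installed, so the import is kept
    local to this function.
    """
    from vllm import LLM
    return LLM(model=model, dtype=dtype)
```

If this is all that's needed for FP16/BF16, then my remaining question is whether lower-bit quantization (e.g. INT8/INT4) has a recommended path for this model.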
Any pointers to relevant benchmarks, studies, or documentation would be immensely helpful. As someone still getting familiar with LLMs and their optimization techniques, this information matters a great deal for my ongoing projects.
Thank you for your time and the remarkable effort put into this project.
Best regards,
Sathya