Add vLLM Invocation Layer #52

Merged
merged 3 commits into deepset-ai:main on Sep 12, 2023

Conversation

LLukas22
Contributor

This contribution introduces a vLLM invocation layer for the prompt model. A primary benefit of utilizing vLLM is its capacity to cache an extensive number of tokens, attributed to the implementation of PagedAttention. This feature offers significant throughput, positioning it as a viable alternative to Hugging Face's Text Generation Inference, especially in light of its recent licensing modifications.

The invocation layer wrapper primarily manages tokenization and ensures that the prompt length remains within the defined limits. Beyond these functions, it is fundamentally built upon the OpenAIInvocationLayer.
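
For illustration, a minimal usage sketch of the remote variant with a Haystack PromptNode. This assumes the layer is exposed from a `vllm_haystack` package and that a vLLM server is already running behind an OpenAI-compatible endpoint; the import path and the `api_base` kwarg are assumptions, not taken from this PR.

```python
# Minimal sketch (assumed package/kwargs), pointing a Haystack PromptNode at a
# running vLLM server through the new invocation layer.
from haystack.nodes import PromptModel, PromptNode

from vllm_haystack import vLLMInvocationLayer  # assumed import path

model = PromptModel(
    model_name_or_path="",  # left empty: the model is inferred from the server
    invocation_layer_class=vLLMInvocationLayer,
    api_key="EMPTY",  # vLLM's OpenAI-compatible server does not check the key
    max_length=256,
    model_kwargs={"api_base": "http://localhost:8000/v1"},  # your vLLM endpoint (assumed kwarg)
)

prompt_node = PromptNode(model, max_length=256)
print(prompt_node("Summarize PagedAttention in one sentence."))
```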

Member

@TuanaCelik left a comment

More of a question than a comment: what would we add to the model_name_or_path? Would it be one of the model names listed here, under 'vLLM seamlessly supports many Huggingface models, including the following architectures'?
https://github.com/vllm-project/vllm

And one nitpick: the vLLM requirement below could maybe be mentioned under Installation, since users have to install it separately from your package 🙏

Thanks for the contribution! This looks great!

@LLukas22
Contributor Author

More of a question than a comment: what would we add to the model_name_or_path? Would it be one of the model names listed here, under 'vLLM seamlessly supports many Huggingface models, including the following architectures'?
https://github.com/vllm-project/vllm

It depends on which invocation layer you are using. If you are using the vLLMInvocationLayer, you don't have to provide anything, as the model will be inferred from the vLLM server hosting it, meaning we don't need to know in advance which model is hosted on the server. If you use the vLLMLocalInvocationLayer, you have to provide a supported Hugging Face model, as the model will be downloaded and inference performed locally.
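
For illustration, a rough sketch of the local variant described above. The class name comes from this thread, but the `vllm_haystack` package path and the example model name are assumptions, not taken from this PR.

```python
# Rough sketch of the local variant (assumed package path); the model name is
# just an example of a Hugging Face architecture that vLLM supports.
from haystack.nodes import PromptModel, PromptNode

from vllm_haystack import vLLMLocalInvocationLayer  # assumed import path

local_model = PromptModel(
    model_name_or_path="mosaicml/mpt-7b",  # downloaded and run locally by vLLM
    invocation_layer_class=vLLMLocalInvocationLayer,
    max_length=256,
)

prompt_node = PromptNode(local_model)
print(prompt_node("What is PagedAttention?"))
```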

And one nitpick: the vLLM requirement below could maybe be mentioned under Installation, since users have to install it separately from your package 🙏

The vLLM dependency is only required if you want to use the vLLMLocalInvocationLayer. Since the main use case for vLLM is to host a server somewhere on your network on a GPU node, which is hit by many requests to take advantage of PagedAttention, I decided not to include it as a requirement. This also avoids pulling in transformers and PyTorch as dependencies when only the vLLMInvocationLayer is used, which saves roughly 2-3 GB of dependencies.
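
Purely for illustration (not taken from the PR diff): one common way to keep `vllm` optional is to defer the heavy import until the local layer is actually used, so remote-only users never pay the transformers/torch install cost. The helper name below is hypothetical.

```python
# Hypothetical helper showing the lazy-import pattern for an optional dependency.
def load_local_engine(model_name_or_path: str):
    try:
        from vllm import LLM  # heavy import deferred until actually needed
    except ImportError as err:
        raise ImportError(
            "The local invocation layer needs the optional `vllm` package; "
            "install it with `pip install vllm`."
        ) from err
    return LLM(model=model_name_or_path)
```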

Do these updates look ok to you @LLukas22?
@TuanaCelik
Member

Thanks for the context @LLukas22 - I tried to create a PR on your fork with some edit suggestions but it didn't work for some reason. Does the commit I made here look good to you? If yes, I will merge it 🙌

integrations/vllm.md — two outdated review threads, resolved
@TuanaCelik
Member

Good catch, comments fixed @LLukas22

@LLukas22
Contributor Author

Should be good to go 👍

@TuanaCelik merged commit 4f6a9da into deepset-ai:main on Sep 12, 2023