[serve, vllm] add vllm-example that we can reference to #36617
Conversation
Nice work so far! I left some comments.
@edoakes this doc uses a GPU. Is there anything special we need to do to give it access to one in the CI?
if __name__ == "__main__": | ||
deployment = VLLMPredictDeployment.bind(model="facebook/opt-125m") | ||
serve.run(deployment) | ||
send_request() |
Do you need to call print on send_request()?
Can you assert what the user should see when they run send_request()?
prompt = final_output.prompt
text_outputs = [
    prompt + output.text
    for output in final_output.outputs
To confirm, does final_output.outputs accumulate all the generated outputs? It doesn't just contain the latest output, correct?
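As a point of reference, in the vLLM async engine API this example uses, each yielded RequestOutput is a cumulative snapshot, so the last one carries the complete text. A minimal sketch, assuming an AsyncLLMEngine constructed as in the example:

from vllm import SamplingParams
from vllm.engine.async_llm_engine import AsyncLLMEngine

async def collect_final_output(engine: AsyncLLMEngine, prompt: str, request_id: str):
    final_output = None
    async for request_output in engine.generate(prompt, SamplingParams(), request_id):
        # Each snapshot's .outputs holds one CompletionOutput per sampled
        # sequence, and each .text is the full generation so far, not just
        # the newest tokens.
        final_output = request_output
    return final_output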
    prompt + output.text
    for output in request_output.outputs
]
ret = {"text": text_outputs}
Can we return only the generated output instead of prepending the prompt each time? That way, the user can see a single, coherent piece of text streamed back to them.
    for output in request_output.outputs
]
ret = {"text": text_outputs}
yield (json.dumps(ret) + "\0").encode("utf-8")
This null byte is not necessary, and we should remove it.
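A sketch combining both suggestions — stream only the newly generated suffix per sequence, and delimit chunks with a newline instead of the null byte. The function name and delimiter choice are illustrative, not the merged code:

import json

async def stream_results(results_generator):
    # Track how many characters have already been sent per sampled sequence,
    # so each chunk carries only the newly generated suffix.
    num_sent = []
    async for request_output in results_generator:
        if not num_sent:
            num_sent = [0] * len(request_output.outputs)
        deltas = []
        for i, output in enumerate(request_output.outputs):
            deltas.append(output.text[num_sent[i]:])
            num_sent[i] = len(output.text)
        # Newline-delimited JSON is easy to parse client-side; no null byte.
        yield (json.dumps({"text": deltas}) + "\n").encode("utf-8")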
# GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, etc.)
# see https://vllm.readthedocs.io/en/latest/getting_started/installation.html
# for more details.
deployment = VLLMPredictDeployment.bind(model="facebook/opt-125m")
Add a port arg?
Or maybe a README/doc update that gives a usage example?
We can revisit this when we write docs in #36650.
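Until then, a hedged usage sketch; the explicit http_options port and the client call are assumptions about one way to run the example, not documented behavior of the merged code:

import requests
from ray import serve

# Start Serve with an explicit HTTP port (8000 is also the default).
serve.start(http_options={"host": "0.0.0.0", "port": 8000})

# VLLMPredictDeployment is the deployment class defined in the example script.
serve.run(VLLMPredictDeployment.bind(model="facebook/opt-125m"))

# Query the deployment; serve.run returns without blocking.
response = requests.post(
    "http://localhost:8000/", json={"prompt": "San Francisco is a"}
)
print(response.json())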
Co-authored-by: shrekris-anyscale <[email protected]> Signed-off-by: Chen Shen <[email protected]>
Nice work, this change looks good to me!
This PR is time-sensitive and the CI is slow. I'll merge it and fix CI later.
…)" This reverts commit cc983fc.
Called ray.init twice when vllm tensor_parallel_size > 1
…36617) This adds a vllm example on serve that we can refer to. --------- Signed-off-by: Chen Shen <[email protected]> Co-authored-by: shrekris-anyscale <[email protected]>
Why are these changes needed?

This adds a vllm example on serve that we can reference.

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.