Blas-like Prompt Parallelization? (sequence processing mode) #55

Closed

paryska99 opened this issue May 10, 2023 · 1 comment · Fixed by #89

paryska99 commented May 10, 2023

Is it possible to speed up prompt processing with the help of a GPU, the way cuBLAS or CLBlast do for CPU-hosted LLaMA models and others?

Collaborator

saharNooby commented May 10, 2023

It is possible, but it would require implementing a sequence processing mode. Currently only RNN mode is implemented, meaning the prompt is processed token by token.
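
To illustrate the difference between the two modes, here is a minimal sketch on a toy linear recurrence. It is not the project's actual model or API: `rnn_mode`, `sequence_mode`, `matvec`, `matmul`, and the weights are all hypothetical names and values chosen for the example. The point is that in sequence mode the per-token projection for the whole prompt collapses into one matrix-matrix product, which is the kind of operation a BLAS or GPU backend can accelerate, while only the cheap elementwise state update stays sequential.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(X, W):
    """Project every row of X with W in one call.

    Written as a plain loop here; this is the step a BLAS/GPU
    backend would execute as a single batched matrix product.
    """
    return [matvec(W, x) for x in X]


def rnn_mode(W, decay, tokens):
    """Token-by-token: one matrix-vector product per token."""
    state = [0.0] * len(W)
    for x in tokens:
        proj = matvec(W, x)
        state = [d * s + p for d, s, p in zip(decay, state, proj)]
    return state


def sequence_mode(W, decay, tokens):
    """Whole prompt at once: one matrix-matrix product, then a cheap scan."""
    projected = matmul(tokens, W)  # all tokens projected together
    state = [0.0] * len(W)
    for proj in projected:  # only the elementwise update is sequential
        state = [d * s + p for d, s, p in zip(decay, state, proj)]
    return state


# Toy values (assumptions for illustration only).
W = [[1.0, 2.0], [0.5, -1.0]]
decay = [0.5, 0.25]
prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

final_rnn = rnn_mode(W, decay, prompt)
final_seq = sequence_mode(W, decay, prompt)
print(final_rnn == final_seq)
```

Both functions end in the same state, so sequence mode would only change how fast the prompt is consumed, not what the model computes from it.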

saharNooby changed the title ~~Blas-like Prompt Parallelization?~~ Blas-like Prompt Parallelization? (sequence processing mode)

saharNooby mentioned this issue

Is it possible to implement the seq mode for loading prompt? #60

Closed

saharNooby linked a pull request that will close this issue

Sequence mode prototype #89

Merged

saharNooby closed this as completed in #89
