Add parallel decoding in GPT2 example #572
Conversation
Very cool! We may want to preserve the old versions of the gpt-2 example, though. Initially, I added support for … So what I am trying to say is that I think it would be good to rename this to something else, maybe as a different …
I agree that the example gets a bit more convoluted every time we add a new feature to it. I will just move it to a new gpt2 example or a new .cpp in this folder, as you suggested, once I verify that it works.
There is an issue when the prompt is just one token, …
I need to test it further, but it seems to produce OK results, and it would be helpful to get a review.
I followed the batched.cpp example from the llama.cpp repository and applied the changes to the gpt-2 example.
I think this is useful as it demonstrates how to do batched generation.
I decided to add this because I had the exact use case of wanting to do batched generation in one of my projects, and I didn't find an isolated example showing it.
I added the example as a separate .cpp file with its own build target, so as not to overly complicate the gpt-2 example, since this batched example has all the latest features combined: ggml-alloc + ggml-backend + batched generation.
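For readers following along, here is a minimal sketch of the lockstep decoding loop that batched.cpp-style examples use. It assumes a hypothetical `eval_batch` callback that both evaluates the batch and samples the next tokens; `batch_token` and `decode_parallel` are illustrative names for this sketch, not the actual code added in this PR:

```cpp
// Hypothetical sketch of lockstep batched decoding, modeled on llama.cpp's
// batched.cpp. `batch_token`, `eval_batch`, and `decode_parallel` are
// illustrative stand-ins, not the real gpt-2 example API.
#include <cstdint>
#include <functional>
#include <vector>

struct batch_token {
    int32_t token;  // vocabulary id
    int32_t pos;    // position in the sequence (KV-cache slot)
    int32_t seq_id; // which parallel sequence the token belongs to
};

// Assumed callback: evaluates all tokens in the batch in a single forward
// pass and returns one sampled token per batch entry (the real example
// returns logits and samples separately).
using eval_fn = std::function<std::vector<int32_t>(const std::vector<batch_token> &)>;

std::vector<std::vector<int32_t>> decode_parallel(
        const std::vector<int32_t> & prompt, // assumed non-empty
        int n_parallel, int n_predict, int32_t eos_token, eval_fn eval_batch) {
    // 1) Evaluate the shared prompt once under seq_id 0; the real example
    //    then shares/copies the seq-0 KV cache across all sequences.
    std::vector<batch_token> batch;
    for (size_t i = 0; i < prompt.size(); ++i) {
        batch.push_back({prompt[i], (int32_t) i, 0});
    }
    const std::vector<int32_t> first = eval_batch(batch);

    std::vector<std::vector<int32_t>> streams(n_parallel);
    std::vector<int32_t> cur(n_parallel, first.back()); // all continue from the prompt
    std::vector<bool> done(n_parallel, false);

    // 2) Decode all sequences in lockstep: one token per live sequence per
    //    step, all evaluated together in a single batch.
    const int32_t n_prompt = (int32_t) prompt.size();
    for (int32_t pos = n_prompt; pos < n_prompt + n_predict; ++pos) {
        batch.clear();
        for (int32_t s = 0; s < n_parallel; ++s) {
            if (done[s]) continue;
            streams[s].push_back(cur[s]);
            batch.push_back({cur[s], pos, s});
        }
        if (batch.empty()) break; // every sequence hit eos

        const std::vector<int32_t> sampled = eval_batch(batch);
        for (size_t i = 0; i < batch.size(); ++i) {
            const int32_t s = batch[i].seq_id;
            if (sampled[i] == eos_token) done[s] = true;
            else                         cur[s] = sampled[i];
        }
    }
    return streams;
}
```

The point the example demonstrates is that each batch entry carries its own position and sequence id, so the prompt is evaluated only once and each decoding step costs a single forward pass regardless of how many sequences are being generated; with stochastic sampling the sequences then diverge even though they share the same prompt.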
Here is a sample output: