
Add parallel decoding in GPT2 example #572

Merged
12 commits merged into master on Oct 12, 2023

Conversation

@YavorGIvanov (Collaborator) commented on Oct 11, 2023

I need to test it further, but it seems to produce OK results, and it would be helpful to get a review.

I followed the batched.cpp example from the llama.cpp repository and applied the changes to the gpt-2 example.

I think this is useful, as it demonstrates how to do batched generation.
I decided to add it because I had this exact use case (wanting to do batched generation in one of my projects) and couldn't find an isolated example showing it.

I added the example in a separate .cpp file with its own build target, so as not to overly complicate the gpt-2 example, since this batched example combines all of the latest features: ggml-alloc + ggml-backend + batched generation.
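
Roughly, the shape of the decoding loop is the following. This is only a minimal, self-contained sketch with the model and sampler stubbed out; `sample_next` and the batch-entry fields in the comments are illustrative placeholders, not code from this PR:

```cpp
#include <cstdint>
#include <random>
#include <vector>

using token = int32_t;

// Placeholder sampler: the real example samples from the logits row produced
// for each sequence by the forward pass; here we just draw a random token id.
static token sample_next(std::mt19937 & rng) {
    return std::uniform_int_distribution<token>(0, 50256)(rng);
}

int main() {
    const int n_parallel = 5;   // -np
    const int n_predict  = 50;  // -n
    const std::vector<token> prompt = {15496, 616, 1438, 318}; // "Hello my name is"

    std::mt19937 rng(1697037431);

    // The prompt is evaluated once; its KV cache entries are shared by all
    // sequences, so the prompt cost is paid a single time.
    std::vector<std::vector<token>> seqs(n_parallel, prompt);

    // Decoding proceeds in lockstep: each step builds one batch holding the
    // last token of every sequence (tagged with its position and sequence id),
    // runs a single forward pass over that batch, and samples one new token
    // per sequence.
    for (int step = 0; step < n_predict; ++step) {
        // Batch entry i would carry roughly:
        //   { token = seqs[i].back(), pos = seqs[i].size() - 1, seq_id = i }
        // ... the single forward pass over the whole batch goes here ...
        for (int s = 0; s < n_parallel; ++s) {
            seqs[s].push_back(sample_next(rng));
        }
    }
    return 0;
}
```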

Here is a sample output:

$ gpt-2-batched -np 5 -m models/gpt-2-117M/ggml-model.bin -p "Hello my name is" -n 50
main: seed = 1697037431
gpt2_model_load: loading model from 'models/gpt-2-117M/ggml-model.bin'
gpt2_model_load: n_vocab = 50257
gpt2_model_load: n_ctx   = 1024
gpt2_model_load: n_embd  = 768
gpt2_model_load: n_head  = 12
gpt2_model_load: n_layer = 12
gpt2_model_load: ftype   = 1
gpt2_model_load: qntvr   = 0
gpt2_model_load: ggml tensor size    = 320 bytes
gpt2_model_load: backend buffer size = 312.72 MB
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1660, compute capability 7.5
gpt2_model_load: using CPU backend
gpt2_model_load: memory size =    72.00 MB, n_mem = 12288
gpt2_model_load: model size  =   239.08 MB
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
main: compute buffer size: 3.26 MB
main: generating 5 sequences ...
main: prompt: 'Hello my name is'
main: number of tokens in prompt = 4, first 8 tokens: 15496 616 1438 318
sequence 0:
Hello my name is John. You can call me any way you want, if you want, but for my very first date, I will be on the phone with you. We're both in our early 20s, but I feel like it's all
sequence 1:
Hello my name is Robert, and I want to say that we're proud to have your company here on the world's largest platform for sharing your stories with us. This is a huge opportunity for our community. We have hundreds of people on this team and
sequence 2:
Hello my name is Jack. I'm the one who created you.
Jack is a boy with a big smile and a big heart. He is a handsome guy. He loves the outdoors and loves the people he meets. He wants to be a
sequence 3:
Hello my name is John. I am a Canadian citizen with a large number of family in Quebec and I am interested in studying. My aim is to take up a post in the Journal of the International Academy of Sciences of Canada which I am currently finishing.
sequence 4:
Hello my name is Dan. I am an entrepreneur. I am a great father. I am a great husband. I am a great husband. I am a great dad. And I am a great husband.
I love my life. I love
main:     load time =   880.80 ms
main:   sample time =    91.43 ms
main:  predict time =  2518.29 ms
main:    total time =  3544.32 ms

@slaren (Collaborator) commented on Oct 11, 2023

Very cool!

We may want to preserve the old versions of the gpt-2 example, though. Initially, I added support for ggml-alloc to the gpt-2 example as a way to showcase how to use it, and then that got lost when I modified the gpt-2 example again to use ggml-backend. There will be another version in the future that will show how to use multiple backends simultaneously, with ggml-backend v2. I have restored the old versions in my work branch as main-ctx.cpp, main-alloc.cpp and main-backend.cpp.

So what I am trying to say is that I think it would be good to rename this to something else, maybe as a different main.cpp file in the gpt-2 example, or as a different example altogether, because as the example becomes more complex, it becomes less useful as a way to show how to use certain features.

@YavorGIvanov (Collaborator, Author) commented on Oct 11, 2023

> Very cool!
>
> We may want to preserve the old versions of the gpt-2 example, though. Initially, I added support for ggml-alloc to the gpt-2 example as a way to showcase how to use it, and then that got lost when I modified the gpt-2 example again to use ggml-backend. There will be another version in the future that will show how to use multiple backends simultaneously, with ggml-backend v2. I have restored the old versions in my work branch as main-ctx.cpp, main-alloc.cpp and main-backend.cpp.
>
> So what I am trying to say is that I think it would be good to rename this to something else, maybe as a different main.cpp file in the gpt-2 example, or as a different example altogether, because as the example becomes more complex, it becomes less useful as a way to show how to use certain features.

I agree that the example gets a bit more convoluted every time we add a new feature to it. Once I verify that it works, I will move it to a new gpt-2 example or a new .cpp file in this folder, as you suggested.

@YavorGIvanov marked this pull request as ready for review on October 11, 2023, 15:06
@YavorGIvanov requested review from ggerganov and slaren, and removed the review request for ggerganov, on October 11, 2023, 15:06
@slaren (Collaborator) commented on Oct 11, 2023

There is an issue when the prompt is just one token: n_kv is zero, which breaks some ops:

$ build/bin/gpt-2-batched -m examples/gpt-2/models/gpt-2-117M/ggml-model-f16.bin -n 50 -np 10 -p "Hello"
main: seed = 1697039825
gpt2_model_load: loading model from 'examples/gpt-2/models/gpt-2-117M/ggml-model-f16.bin'
gpt2_model_load: n_vocab = 50257
gpt2_model_load: n_ctx   = 1024
gpt2_model_load: n_embd  = 768
gpt2_model_load: n_head  = 12
gpt2_model_load: n_layer = 12
gpt2_model_load: ftype   = 1
gpt2_model_load: qntvr   = 0
gpt2_model_load: ggml tensor size    = 320 bytes
gpt2_model_load: backend buffer size = 312.72 MB
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6
gpt2_model_load: using CPU backend
gpt2_model_load: memory size =    72.00 MB, n_mem = 12288
gpt2_model_load: model size  =   239.08 MB
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
n_kv = 1024, n_tokens = 1
main: compute buffer size: 3.07 MB
n_kv = 0, n_tokens = 1
fish: Job 1, 'build/bin/gpt-2-batched -m exam…' terminated by signal SIGFPE (Floating point exception)
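
For illustration, here is a hypothetical sketch of how a one-token prompt can produce a zero n_kv and how a clamp would avoid it. This is only an assumption about the failure mode, not the fix that was actually applied:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

// Hypothetical: rounding the occupied KV size *down* to a block multiple
// yields 0 for a single-token prompt; any op that divides by n_kv (or builds
// a zero-sized view over the cache) then traps with SIGFPE.
static int32_t n_kv_rounded_down(int32_t used) {
    return used & ~31;                                // used = 1  ->  0
}

// Clamping to a minimum padded size keeps the dimension non-zero.
static int32_t n_kv_clamped(int32_t used) {
    return std::max<int32_t>(32, (used + 31) & ~31);  // used = 1  ->  32
}

int main() {
    const int32_t used = 1;                           // prompt of one token
    printf("rounded down: %d, clamped: %d\n",
           (int) n_kv_rounded_down(used), (int) n_kv_clamped(used));
    return 0;
}
```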

@ggerganov merged commit 8e82832 into master on Oct 12, 2023
4 checks passed