
Add parallel decoding in GPT2 example #572

Merged
12 commits merged into master on Oct 12, 2023

Conversation

@YavorGIvanov (Collaborator) commented on Oct 11, 2023

I need to test it further, but it seems to produce OK results, and it would be helpful to get a review.

I followed the batched.cpp example from the llama.cpp repository and applied the changes to the gpt-2 example.

I think this is useful, as it demonstrates how to do batched generation.
I decided to add it because I had this exact use case (wanting to do batched generation in one of my projects) and couldn't find an isolated example showing it.

I added the example in a separate .cpp file with its own build target, so as not to overly complicate the gpt-2 example, since this batched example combines all of the latest features: ggml-alloc + ggml-backend + batched generation.
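
Roughly, the shape of the decoding loop is the following. This is only a minimal, self-contained sketch with the model and sampler stubbed out; `sample_next` and the batch-entry fields in the comments are illustrative placeholders, not code from this PR:

```cpp
#include <cstdint>
#include <random>
#include <vector>

using token = int32_t;

// Placeholder sampler: the real example samples from the logits row produced
// for each sequence by the forward pass; here we just draw a random token id.
static token sample_next(std::mt19937 & rng) {
    return std::uniform_int_distribution<token>(0, 50256)(rng);
}

int main() {
    const int n_parallel = 5;   // -np
    const int n_predict  = 50;  // -n
    const std::vector<token> prompt = {15496, 616, 1438, 318}; // "Hello my name is"

    std::mt19937 rng(1697037431);

    // The prompt is evaluated once; its KV cache entries are shared by all
    // sequences, so the prompt cost is paid a single time.
    std::vector<std::vector<token>> seqs(n_parallel, prompt);

    // Decoding proceeds in lockstep: each step builds one batch holding the
    // last token of every sequence (tagged with its position and sequence id),
    // runs a single forward pass over that batch, and samples one new token
    // per sequence.
    for (int step = 0; step < n_predict; ++step) {
        // Batch entry i would carry roughly:
        //   { token = seqs[i].back(), pos = seqs[i].size() - 1, seq_id = i }
        // ... the single forward pass over the whole batch goes here ...
        for (int s = 0; s < n_parallel; ++s) {
            seqs[s].push_back(sample_next(rng));
        }
    }
    return 0;
}
```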

Here is a sample output:

$ gpt-2-batched -np 5 -m models/gpt-2-117M/ggml-model.bin -p "Hello my name is" -n 50
main: seed = 1697037431
gpt2_model_load: loading model from 'models/gpt-2-117M/ggml-model.bin'
gpt2_model_load: n_vocab = 50257
gpt2_model_load: n_ctx   = 1024
gpt2_model_load: n_embd  = 768
gpt2_model_load: n_head  = 12
gpt2_model_load: n_layer = 12
gpt2_model_load: ftype   = 1
gpt2_model_load: qntvr   = 0
gpt2_model_load: ggml tensor size    = 320 bytes
gpt2_model_load: backend buffer size = 312.72 MB
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1660, compute capability 7.5
gpt2_model_load: using CPU backend
gpt2_model_load: memory size =    72.00 MB, n_mem = 12288
gpt2_model_load: model size  =   239.08 MB
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
main: compute buffer size: 3.26 MB
main: generating 5 sequences ...
main: prompt: 'Hello my name is'
main: number of tokens in prompt = 4, first 8 tokens: 15496 616 1438 318
sequence 0:
Hello my name is John. You can call me any way you want, if you want, but for my very first date, I will be on the phone with you. We're both in our early 20s, but I feel like it's all
sequence 1:
Hello my name is Robert, and I want to say that we're proud to have your company here on the world's largest platform for sharing your stories with us. This is a huge opportunity for our community. We have hundreds of people on this team and
sequence 2:
Hello my name is Jack. I'm the one who created you.
Jack is a boy with a big smile and a big heart. He is a handsome guy. He loves the outdoors and loves the people he meets. He wants to be a
sequence 3:
Hello my name is John. I am a Canadian citizen with a large number of family in Quebec and I am interested in studying. My aim is to take up a post in the Journal of the International Academy of Sciences of Canada which I am currently finishing.
sequence 4:
Hello my name is Dan. I am an entrepreneur. I am a great father. I am a great husband. I am a great husband. I am a great dad. And I am a great husband.
I love my life. I love
main:     load time =   880.80 ms
main:   sample time =    91.43 ms
main:  predict time =  2518.29 ms
main:    total time =  3544.32 ms

@slaren (Collaborator) commented on Oct 11, 2023

Very cool!

We may want to preserve the old versions of the gpt-2 example, though. Initially, I added support for ggml-alloc to the gpt-2 example as a way to showcase how to use it, and then that got lost when I modified the gpt-2 example again to use ggml-backend. There will be another version in the future that will show how to use multiple backends simultaneously, with ggml-backend v2. I have restored the old versions in my work branch as main-ctx.cpp, main-alloc.cpp and main-backend.cpp.

So what I am trying to say is that I think it would be good to rename this to something else, maybe as a different main.cpp file in the gpt-2 example, or as a different example altogether, because as the example becomes more complex, it becomes less useful as a way to show how to use certain features.

@YavorGIvanov (Collaborator, Author) commented on Oct 11, 2023

> Very cool!
>
> We may want to preserve the old versions of the gpt-2 example, though. Initially, I added support for ggml-alloc to the gpt-2 example as a way to showcase how to use it, and then that got lost when I modified the gpt-2 example again to use ggml-backend. There will be another version in the future that will show how to use multiple backends simultaneously, with ggml-backend v2. I have restored the old versions in my work branch as main-ctx.cpp, main-alloc.cpp and main-backend.cpp.
>
> So what I am trying to say is that I think it would be good to rename this to something else, maybe as a different main.cpp file in the gpt-2 example, or as a different example altogether, because as the example becomes more complex, it becomes less useful as a way to show how to use certain features.

I agree that the example gets a bit more convoluted every time we add a new feature to it. Once I verify that it works, I will move it to a new gpt-2 example or a new .cpp file in this folder, as you suggested.

@YavorGIvanov marked this pull request as ready for review on October 11, 2023, 15:06
@YavorGIvanov requested review from ggerganov and slaren, and removed the review request for ggerganov, on October 11, 2023, 15:06
@slaren (Collaborator) commented on Oct 11, 2023

There is an issue when the prompt is just one token: n_kv is zero, which breaks some ops:

$ build/bin/gpt-2-batched -m examples/gpt-2/models/gpt-2-117M/ggml-model-f16.bin -n 50 -np 10 -p "Hello"
main: seed = 1697039825
gpt2_model_load: loading model from 'examples/gpt-2/models/gpt-2-117M/ggml-model-f16.bin'
gpt2_model_load: n_vocab = 50257
gpt2_model_load: n_ctx   = 1024
gpt2_model_load: n_embd  = 768
gpt2_model_load: n_head  = 12
gpt2_model_load: n_layer = 12
gpt2_model_load: ftype   = 1
gpt2_model_load: qntvr   = 0
gpt2_model_load: ggml tensor size    = 320 bytes
gpt2_model_load: backend buffer size = 312.72 MB
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6
gpt2_model_load: using CPU backend
gpt2_model_load: memory size =    72.00 MB, n_mem = 12288
gpt2_model_load: model size  =   239.08 MB
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
n_kv = 1024, n_tokens = 1
main: compute buffer size: 3.07 MB
n_kv = 0, n_tokens = 1
fish: Job 1, 'build/bin/gpt-2-batched -m exam…' terminated by signal SIGFPE (Floating point exception)
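
For illustration, here is a hypothetical sketch of how a one-token prompt can produce a zero n_kv and how a clamp would avoid it. This is only an assumption about the failure mode, not the fix that was actually applied:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

// Hypothetical: rounding the occupied KV size *down* to a block multiple
// yields 0 for a single-token prompt; any op that divides by n_kv (or builds
// a zero-sized view over the cache) then traps with SIGFPE.
static int32_t n_kv_rounded_down(int32_t used) {
    return used & ~31;                                // used = 1  ->  0
}

// Clamping to a minimum padded size keeps the dimension non-zero.
static int32_t n_kv_clamped(int32_t used) {
    return std::max<int32_t>(32, (used + 31) & ~31);  // used = 1  ->  32
}

int main() {
    const int32_t used = 1;                           // prompt of one token
    printf("rounded down: %d, clamped: %d\n",
           (int) n_kv_rounded_down(used), (int) n_kv_clamped(used));
    return 0;
}
```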

@ggerganov merged commit 8e82832 into master on Oct 12, 2023
4 checks passed