Replit + MPT #145

Merged (26 commits, May 17, 2023)

Conversation

@lukasmoellerch (Contributor) commented May 10, 2023

Implements #131 #136

Adds example code for mpt (https://huggingface.co/mosaicml/mpt-7b) and replit (https://huggingface.co/replit/replit-code-v1-3b). The code isn't too clean at the moment; I'll happily clean things up and implement suggestions, but I might only be able to spend more time on this over the weekend.

Some hyperparameters are hardcoded, such as the FFN/MLP ratio and the alibi max bias. Not all MPT-style models are supported yet: QKV clamping isn't implemented, and a couple of other options aren't considered either.

The unigram tokenizer is comparatively slow, but implementing a good one would add considerably more code to the example.
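For reference, a rough sketch (not the code in this PR) of the Viterbi-style dynamic program a unigram tokenizer runs per input. Piece scores come from a vocab-to-log-probability map; normalization and byte fallback for unknown characters are omitted.

#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

std::vector<std::string> unigram_tokenize(
        const std::string & text,
        const std::unordered_map<std::string, float> & vocab_logprob,
        size_t max_piece_len = 16) {
    const size_t n = text.size();
    // best[i] = highest total log-prob of any segmentation of text[0..i)
    std::vector<float>  best(n + 1, -std::numeric_limits<float>::infinity());
    std::vector<size_t> prev(n + 1, 0);
    best[0] = 0.0f;

    for (size_t i = 1; i <= n; ++i) {
        for (size_t len = 1; len <= max_piece_len && len <= i; ++len) {
            const auto it = vocab_logprob.find(text.substr(i - len, len));
            if (it == vocab_logprob.end()) {
                continue;
            }
            const float score = best[i - len] + it->second;
            if (score > best[i]) {
                best[i] = score;
                prev[i] = i - len;
            }
        }
    }

    // backtrack from the end to recover the chosen pieces
    std::vector<std::string> pieces;
    for (size_t i = n; i > 0; i = prev[i]) {
        pieces.push_back(text.substr(prev[i], i - prev[i]));
    }
    return std::vector<std::string>(pieces.rbegin(), pieces.rend());
}

The repeated substring lookups are a big part of why a naive implementation is slow; SentencePiece avoids them with a trie-based common-prefix search.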

Also: Thank you @Leoputera2407 for helping with debugging the alibi problem

@lukasmoellerch (Contributor Author)

I can also merge both models into one example if that's preferred.

@klosax (Contributor) commented May 10, 2023

Does this support the MPT-7B-StoryWriter model with 65k context length?

@lukasmoellerch (Contributor Author)

Does this support the MPT-7B-StoryWriter model with 65k context length?

I didn't try it yet, but architecturally there is no difference between the base model and the story-writer fine-tuned model, to my knowledge. I can potentially try it tomorrow.

@abhi-mosaic

Thank you so much @lukasmoellerch for building support for MPT models! All of us @ MosaicML are very excited :)

but architecturally there is no difference between the base model and the story-writer fine-tuned model

One thing I want to point out: the StoryWriter model does have two architecture changes vs. the other models. It uses alibi_bias_max=16 and clip_qkv=6, due to the long-context setting. You can see the exact config here: https://huggingface.co/mosaicml/mpt-7b-storywriter/blob/main/config.json
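A hedged sketch of how the clip_qkv part could look in the mpt example, assuming the ggml_clamp(ctx, tensor, min, max) operator added in this PR and a hypothetical clip_qkv field read from the model config (6.0 for StoryWriter). The Python code clamps the combined Wqkv output; clamping the split tensors element-wise is equivalent. Variable names follow the other ggml examples.

// clamp Q/K/V after the attention input projection (MPT's clip_qkv option)
if (hparams.clip_qkv > 0.0f) {
    Qcur = ggml_clamp(ctx0, Qcur, -hparams.clip_qkv, hparams.clip_qkv);
    Kcur = ggml_clamp(ctx0, Kcur, -hparams.clip_qkv, hparams.clip_qkv);
    Vcur = ggml_clamp(ctx0, Vcur, -hparams.clip_qkv, hparams.clip_qkv);
}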

@Leoputera2407 commented May 11, 2023

Adding support for max_alibi_bias seems quite doable, @lukasmoellerch.

I'm trying out qkv_clamp on my branch; let me see if I can get that done by tomorrow. Really looking forward to trying out StoryWriter on CPU, although I wonder how much RAM is required to support the 65k context length.

@Leoputera2407 commented May 11, 2023

I found an alternative implementation by the folks from nomic-ai. They didn't implement qkv clamping, didn't fix the alibi bug, and didn't implement alibi_max_bias either, but it's a cool reference, I think:

https://github.com/nomic-ai/gpt4all/blob/f8fdcccc5d253229808c0ceb9c5faae1ba42f68c/gpt4all-backend/mpt.cpp#L2

https://github.com/nomic-ai/gpt4all/blob/f8fdcccc5d253229808c0ceb9c5faae1ba42f68c/gpt4all-backend/scripts/convert_mpt_hf_to_ggml.py#L9

@lukasmoellerch (Contributor Author)

Sounds good. It seems like a lot of people are excited about the StoryWriter model, so let's get it integrated as well; both modifications sound rather straightforward. @Leoputera2407, can you share what you've done regarding qkv clipping so far?

@klosax (Contributor) commented May 11, 2023

Inference of the StoryWriter model fails:

./main -m mpt-7b-storywriter-ggml-f16.bin 

main: seed = 1683800454
mpt_model_load: loading model from 'mpt-7b-storywriter-ggml-f16.bin' - please wait ...
mpt_model_load: d_model       = 4096
mpt_model_load: max_seq_len   = 65536
mpt_model_load: n_heads       = 32
mpt_model_load: n_layers      = 32
mpt_model_load: n_vocab      = 50432
mpt_model_load: ftype   = 1
mpt_model_load: ggml ctx size = 12683.13 MB
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 30479021776, available 13299222016)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 30479021776, available 13299222016)
Segmentation fault (core dumped)

Quantized model:

./main -m mpt-7b-storywriter-ggml-q5_1.bin 

main: seed = 1683800655
mpt_model_load: loading model from 'mpt-7b-storywriter-ggml-q5_1.bin' - please wait ...
mpt_model_load: d_model       = 4096
mpt_model_load: max_seq_len   = 65536
mpt_model_load: n_heads       = 32
mpt_model_load: n_layers      = 32
mpt_model_load: n_vocab      = 50432
mpt_model_load: ftype   = 9
mpt_model_load: ggml ctx size = 4756.88 MB
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 22167746256, available 4987946496)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 22167746256, available 4987946496)
Segmentation fault (core dumped)

The base model works.

@Green-Sky (Contributor)

needed 30479021776

29 gigs anyone?

Are you guys using f16 for the kv?

@lukasmoellerch (Contributor Author)

needed 30479021776

29 gigs anyone?

Are you guys using f16 for the kv?

It seems like the context size calculation is actually wrong, but in the wrong direction, i.e. it calculates with f32 but later allocates f16... I'll investigate later.

@klosax (Contributor) commented May 11, 2023

It seems like the context size calculation is actually wrong, but in the wrong direction, i.e. it calculates with f32 but later allocates f16... I'll investigate later.

I found a solution: I changed the types in the ctx_size calculation to uint64_t and changed the calculation of memory_k and memory_v to type F16:

uint64_t ctx_size = 0;

{
    const auto & hparams = model.hparams;

    // 64-bit types prevent the products below from overflowing
    // 32-bit integers when max_seq_len = 65536
    const uint64_t n_embd  = hparams.d_model;
    const uint64_t n_layer = hparams.n_layers;
    const uint64_t n_ctx   = hparams.max_seq_len;
    const uint64_t n_vocab = hparams.n_vocab;

(...)

    // the KV cache is allocated as F16, so account for it as F16 here as well
    ctx_size += n_ctx * n_layer * n_embd * ggml_type_sizef(GGML_TYPE_F16); // memory_k
    ctx_size += n_ctx * n_layer * n_embd * ggml_type_sizef(GGML_TYPE_F16); // memory_v

Working output:

./main -m mpt-7b-storywriter-ggml-f16.bin 
main: seed = 1683806544
mpt_model_load: loading model from 'mpt-7b-storywriter-ggml-f16.bin' - please wait ...
mpt_model_load: d_model       = 4096
mpt_model_load: max_seq_len   = 65536
mpt_model_load: n_heads       = 32
mpt_model_load: n_layers      = 32
mpt_model_load: n_vocab      = 50432
mpt_model_load: ftype   = 1
mpt_model_load: ggml ctx size = 45451.13 MB
mpt_model_load: memory_size = 32768.00 MB, n_mem = 2097152
mpt_model_load: ........................ done
mpt_model_load: model size = 12683.02 MB / num tensors = 194
main: number of tokens in prompt = 1
main: token[0] =    510

The last few years, she'd been able to come and go, she didn't come.

@klosax (Contributor) commented May 11, 2023

The generation is not very good; it seems to keep repeating itself after a while:

Base model:

./main -m mpt-7b-ggml-f16.bin -p "Once upon"

main: seed = 1683807627
mpt_model_load: loading model from 'mpt-7b-ggml-f16.bin' - please wait ...
mpt_model_load: d_model       = 4096
mpt_model_load: max_seq_len   = 2048
mpt_model_load: n_heads       = 32
mpt_model_load: n_layers      = 32
mpt_model_load: n_vocab      = 50432
mpt_model_load: ftype   = 1
mpt_model_load: ggml ctx size = 13707.13 MB
mpt_model_load: memory_size =  1024.00 MB, n_mem = 65536
mpt_model_load: ........................ done
mpt_model_load: model size = 12683.02 MB / num tensors = 194
main: number of tokens in prompt = 2
main: token[0] =  10758
main: token[1] =   2220

Once upon a time, there was a boy and a girl. They went to a wedding. And while they were there, they met a woman.
They danced with her. They talked. The girl, and boy went away. And they walked. The girl with them. He, they danced.
But they had. They danced, went. Then they, he said, and talked, and went to go.
But, the night, they. And they, and walked. They were, they. They got And did. And they. And walked. And had,

StoryWriter model:

./main -m mpt-7b-storywriter-ggml-f16.bin -p "Once upon"

main: seed = 1683807952
mpt_model_load: loading model from 'mpt-7b-storywriter-ggml-f16.bin' - please wait ...
mpt_model_load: d_model       = 4096
mpt_model_load: max_seq_len   = 65536
mpt_model_load: n_heads       = 32
mpt_model_load: n_layers      = 32
mpt_model_load: n_vocab      = 50432
mpt_model_load: ftype   = 1
mpt_model_load: ggml ctx size = 45451.13 MB
mpt_model_load: memory_size = 32768.00 MB, n_mem = 2097152
mpt_model_load: ........................ done
mpt_model_load: model size = 12683.02 MB / num tensors = 194
main: number of tokens in prompt = 2
main: token[0] =  10758
main: token[1] =   2220

Once upon a time there was a boy named Nick who lived in Brooklyn and grew up to be a man in Manhattan.
His father was a businessman and his mother worked as a nurse. Nick had a job working working working for the dad.
Nick was working working working Nick Nick Nick Nick Nick Nick Nick Nick Nick Nick Nick Nick Nick Nick Nick Nick Nick

@Green-Sky (Contributor) commented May 11, 2023

Yeah, I can see the same:

$ bin/mpt -m ../examples/mpt/models/mpt-7b-storywriter/mpt-7b-storywriter-ggml_v0-q4_0.bin
main: seed = 1683813966
mpt_model_load: loading model from '../examples/mpt/models/mpt-7b-storywriter/mpt-7b-storywriter-ggml_v0-q4_0.bin' - please wait ...
mpt_model_load: d_model       = 4096
mpt_model_load: max_seq_len   = 65536
mpt_model_load: n_heads       = 32
mpt_model_load: n_layers      = 32
mpt_model_load: n_vocab      = 50432
mpt_model_load: ftype   = 2
mpt_model_load: ggml ctx size = 36732.25 MB
mpt_model_load: memory_size = 32768.00 MB, n_mem = 2097152
mpt_model_load: ........................ done
mpt_model_load: model size =  3964.14 MB / num tensors = 194
main: number of tokens in prompt = 1
main: token[0] =   2993

She was so fat that I was not sure how she could still stand up walk. And I was not sure that I was so fat. And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And^C

There seems to be no repeat penalty and other stuff; it needs some updates from the llama.cpp codebase.

Edit: I am using the q4_0 and it seems to die faster. We also need an option to set the ctx size; since we preallocate, I can't run the f16.
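For reference, a hedged sketch (not code from this PR) of the llama.cpp-style repetition penalty the example currently lacks: tokens seen in the last few generated tokens get their logits pushed towards lower probability before sampling. Names (last_n_tokens, repeat_penalty) are illustrative.

#include <vector>

static void apply_repeat_penalty(
        std::vector<float>     & logits,
        const std::vector<int> & last_n_tokens,
        float repeat_penalty) {          // e.g. 1.1
    for (const int tok : last_n_tokens) {
        // dividing a positive logit and multiplying a negative one both
        // lower the token's probability
        if (logits[tok] > 0.0f) {
            logits[tok] /= repeat_penalty;
        } else {
            logits[tok] *= repeat_penalty;
        }
    }
}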

@alextrott16

StoryWriter does love to repeat itself, but I've found some settings that you can use with HF generate that tend to work pretty well:

temperature: 0.8
top_p: 1.0
top_k: 0
repetition_penalty: 1.02
no_repeat_ngram_size: 6

These are the same settings you get by default in our demo space https://huggingface.co/spaces/mosaicml/mpt-7b-storywriter
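For reference (not from this repo): with top_p = 1.0 and top_k = 0 (which, in the Hugging Face convention, disable nucleus and top-k filtering), these settings mostly come down to temperature sampling plus the repetition constraints. A minimal temperature-sampling sketch; the ggml examples use their own gpt_sample_top_k_top_p helper.

#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

static int sample_temperature(const std::vector<float> & logits, float temp, std::mt19937 & rng) {
    // softmax with temperature; subtracting the max keeps exp() numerically stable
    const float max_l = *std::max_element(logits.begin(), logits.end());
    std::vector<double> w(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) {
        w[i] = std::exp((double) (logits[i] - max_l) / temp);
    }
    // discrete_distribution normalizes the weights into probabilities
    std::discrete_distribution<int> dist(w.begin(), w.end());
    return dist(rng);
}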

@klosax (Contributor) commented May 11, 2023

There seems to be no repeat penalty and other stuff; it needs some updates from the llama.cpp codebase.

I would love to see the common infrastructure of llama.cpp become something like "ggml-llm" and the code for the specific llm architectures (llama, gpt-2, gpt-j, mpt and others) become like add-ons at compile time.

@Green-Sky (Contributor) commented May 11, 2023

I uploaded some ggml files so we can test this more easily: https://huggingface.co/Green-Sky/ggml-mpt-7b-storywriter
Also, this is still using the old conversion script, which needs to load the full PyTorch model into memory for conversion.

@ggerganov (Owner)

Looks like great progress - will be taking a more detailed look soon

There seems to be no repeat penalty and other stuff; it needs some updates from the llama.cpp codebase.

I would love to see the common infrastructure of llama.cpp become something like "ggml-llm" and the code for the specific llm architectures (llama, gpt-2, gpt-j, mpt and others) become like add-ons at compile time.

Yes, this would be great. Now that we have various examples of LLM inference and I have a better understanding of the general API structure that is necessary, it will be easier to come up with a way to unify all these into a single interface.

@lukasmoellerch (Contributor Author)

@ggerganov I think we might want to separate the model max_seq_length (which is used e.g. in the alibi bias offset) from the number of k/v memory slots we allocate for inference. I temporarily hardcoded n_ctx in mpt to 4096 because otherwise my MacBook Air wasn't too happy, but this should probably be an inference parameter - should we at least just set it to params.n_predict + embd_inp.size() such that the user can set it using a command line flag?
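A hedged sketch of that workaround (fragment only; names follow the other ggml examples, and the exact integration point differs because the prompt is tokenized after the model is loaded):

// cap the KV cache at what this run can actually use
const int n_ctx = std::min((int) model.hparams.max_seq_len,
                           (int) embd_inp.size() + params.n_predict);

const size_t n_mem      = (size_t) model.hparams.n_layers * n_ctx;
const size_t n_elements = (size_t) model.hparams.d_model  * n_mem;

// allocate n_ctx slots per layer instead of max_seq_len
model.memory_k = ggml_new_tensor_1d(model.ctx, GGML_TYPE_F16, n_elements);
model.memory_v = ggml_new_tensor_1d(model.ctx, GGML_TYPE_F16, n_elements);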

@Green-Sky (Contributor)

@lukasmoellerch can't run the StoryWriter model anymore 😆, with or without quantization:

ggml_new_tensor_impl: not enough space in the context's memory pool (needed 9628746752, available 9628745984)
Segmentation fault (core dumped)

@klosax (Contributor) commented May 14, 2023

After much trial and error I found a formula for setting the memory buffer, as it needs more space with each evaluated token.

The formula only works if n_batch is equal to 1:

static size_t buf_size = 64 * 1024 * 1024 + 261 * 1024 * n_past;

or, even better, set it in main to:

buf_size = 64 * 1024 * 1024 + 261 * 1024 * params.n_predict;
buf = malloc(buf_size);

with buf declared globally.

n_predict should be forced to always be equal to or lower than n_ctx; both can be controlled from the command line.

@lukasmoellerch (Contributor Author) commented May 14, 2023

@lukasmoellerch can't run the StoryWriter model anymore 😆, with or without quantization:

ggml_new_tensor_impl: not enough space in the context's memory pool (needed 9628746752, available 9628745984)
Segmentation fault (core dumped)

Yes, I just merged master with the quantisation changes, which means that the object overhead adjustment wasn't correct. Still a bit broken, though.

@lukasmoellerch (Contributor Author)

Never mind, I just had to re-quantize the model.

@ggerganov (Owner)

should we at least just set it to params.n_predict + embd_inp.size() such that the user can set it using a command line flag?

Yes, I think this is a good workaround for now.

Regarding the memory usage during inference, there are 2 things that can help to reduce it:

  • use "inplace" calls for some of the operators. for example: 5839d9e
  • use scratch buffers

The second one should significantly reduce the memory usage, but it is quite tricky to get right since the process is manual and very easy to mess up.

We can do both of these optimizations in a later PR if you prefer to get this merged soon.
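A hedged illustration of the first point, assuming the *_inplace operator variants referenced in 5839d9e: for example, a residual addition like the ones in these examples can reuse an existing buffer instead of allocating a new tensor for the result.

// before: allocates a new tensor for the sum
inpL = ggml_add(ctx0, cur, inpL);

// after: writes the sum into cur's buffer (no extra allocation)
inpL = ggml_add_inplace(ctx0, cur, inpL);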

@lukasmoellerch (Contributor Author)

should we at least just set it to params.n_predict + embd_inp.size() such that the user can set it using a command line flag?

Yes, I think this is a good workaround for now.

I looked into this a bit but can't really get it to be clean. n_predict is the number of additional tokens, so the total number of tokens depends on the tokenizer being loaded. On the other hand, we need the prompt to know the number of tokens, which we don't have in load (and also shouldn't have). I think we either want to separate k/v tensor creation from model loading or pass the context size as a separate parameter.

But I can also do that in a follow-up PR.

Let me know what I can still do in this PR.


// a = self.ln_1(x)
{
cur = ggml_norm(ctx0, inpL);

Owner:

The Python implementation uses LayerNorm - double-check whether this corresponds to ggml_norm or ggml_rms_norm. I'm not sure where to look for the source code of LayerNorm.
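For reference, the standard definitions (not from this thread): torch.nn.LayerNorm subtracts the mean, which matches ggml_norm, while ggml_rms_norm implements the RMSNorm used by LLaMA, which does not:

\mathrm{LayerNorm}(x)_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad
\mathrm{RMSNorm}(x)_i = \frac{x_i}{\sqrt{\tfrac{1}{d}\sum_j x_j^2 + \epsilon}}

with \mu and \sigma^2 the mean and variance over the embedding dimension; the learned scale (and bias, for LayerNorm) is applied afterwards with separate ggml_mul / ggml_add calls in the examples.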

@@ -6219,6 +6226,36 @@ struct ggml_tensor * ggml_alibi(
return result;
}

// ggml_alibi

Owner:

Suggested change
// ggml_alibi
// ggml_clamp

@@ -10831,6 +10871,79 @@ static void ggml_compute_forward_alibi(
}
}


// ggml_compute_forward_alibi

Owner:

Suggested change
// ggml_compute_forward_alibi
// ggml_compute_forward_clamp

Comment on lines +6245 to +6249

struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, 3);
((float *) b->data)[0] = min;
((float *) b->data)[1] = max;


Owner:

This has to be surrounded with ggml_scratch_save() and ggml_scratch_load():

ggml/src/ggml.c, lines 3925 to 3939 in 010203f:

// IMPORTANT:
// when creating "opt" tensors, always save and load the scratch buffer
// this is an error prone process, but it is necessary to support inplace
// operators when using scratch buffers
// TODO: implement a better way
void ggml_scratch_save(struct ggml_context * ctx) {
    ctx->scratch_save = ctx->scratch;
    ctx->scratch.data = NULL;
}

void ggml_scratch_load(struct ggml_context * ctx) {
    ctx->scratch = ctx->scratch_save;
}

See how they are used in other operators that pass parameters like this.
This is needed to support scratch buffers later.

@ggerganov ggerganov merged commit 1d6a133 into ggerganov:master May 17, 2023
@ggerganov (Owner)

@lukasmoellerch and everyone else - thanks for this contribution

I'll probably play with this in the next days and will try to improve the memory allocation logic.

@lukasmoellerch (Contributor Author)

@lukasmoellerch and everyone else - thanks for this contribution

I'll probably play with this in the next days and will try to improve the memory allocation logic.

Thanks for your patience, I really like the project - let me know if any follow-up PRs are required; I'd be willing to work on them, I was just a bit busy with other stuff last week.

@x4080 commented May 22, 2023

I just tried it today and StoryWriter is still repeating the story over and over again; is there any trick to avoid it?
Thanks

@klosax (Contributor) commented May 22, 2023

I just tried it today and StoryWriter is still repeating the story over and over again; is there any trick to avoid it? Thanks

Repeat penalty is being implemented in PR #184.

@jploski (Contributor) commented May 22, 2023

Note that repetition_penalty from PR #184 (and also as implemented in llama.cpp) is not the same as no_repeat_ngram_size, which is used in the MPT-7B HuggingFace space (https://huggingface.co/spaces/mosaicml/mpt-7b-storywriter):

https://github.com/huggingface/transformers/blob/2f424d79797ea5344f1b3ac241be1a181cfc220d/src/transformers/generation/utils.py#LL860C30-L860C48
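A hedged sketch of what no_repeat_ngram_size = 6 does in HF generate, for contrast with a multiplicative repetition penalty: any token that would complete an n-gram already present in the generated sequence is banned outright. Illustrative only, not code from this repo.

#include <algorithm>
#include <limits>
#include <vector>

static void ban_repeated_ngrams(
        std::vector<float>     & logits,
        const std::vector<int> & tokens,  // tokens generated so far
        int n) {                          // no_repeat_ngram_size, e.g. 6
    if ((int) tokens.size() < n) {
        return;
    }
    // the candidate n-gram is (last n-1 tokens) + (next token)
    const std::vector<int> prefix(tokens.end() - (n - 1), tokens.end());
    for (size_t i = 0; i + n <= tokens.size(); ++i) {
        if (std::equal(prefix.begin(), prefix.end(), tokens.begin() + i)) {
            // ban the token that followed this earlier occurrence of the prefix
            logits[tokens[i + n - 1]] = -std::numeric_limits<float>::infinity();
        }
    }
}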
