Replit + MPT #145

Merged (26 commits, May 17, 2023)

Conversation

@lukasmoellerch (Contributor) commented May 10, 2023

Implements #131 #136

Adds example code for mpt (https://huggingface.co/mosaicml/mpt-7b) and replit (https://huggingface.co/replit/replit-code-v1-3b). The code isn't too clean at the moment; I'll happily clean things up and implement suggestions, but I might only be able to spend more time on this over the weekend.

Some hyperparameters are hardcoded, such as the FFN/MLP ratio and the alibi max bias. Not all MPT-style models are supported yet: QKV clamping isn't implemented, and a couple of other options aren't considered either.

The unigram tokenizer is comparatively slow, but implementing a good one would add considerably more code to the example.
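For reference, a rough sketch (not the code in this PR) of the Viterbi-style dynamic program a unigram tokenizer runs per input. Piece scores come from a vocab-to-log-probability map; normalization and byte fallback for unknown characters are omitted.

#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

std::vector<std::string> unigram_tokenize(
        const std::string & text,
        const std::unordered_map<std::string, float> & vocab_logprob,
        size_t max_piece_len = 16) {
    const size_t n = text.size();
    // best[i] = highest total log-prob of any segmentation of text[0..i)
    std::vector<float>  best(n + 1, -std::numeric_limits<float>::infinity());
    std::vector<size_t> prev(n + 1, 0);
    best[0] = 0.0f;

    for (size_t i = 1; i <= n; ++i) {
        for (size_t len = 1; len <= max_piece_len && len <= i; ++len) {
            const auto it = vocab_logprob.find(text.substr(i - len, len));
            if (it == vocab_logprob.end()) {
                continue;
            }
            const float score = best[i - len] + it->second;
            if (score > best[i]) {
                best[i] = score;
                prev[i] = i - len;
            }
        }
    }

    // backtrack from the end to recover the chosen pieces
    std::vector<std::string> pieces;
    for (size_t i = n; i > 0; i = prev[i]) {
        pieces.push_back(text.substr(prev[i], i - prev[i]));
    }
    return std::vector<std::string>(pieces.rbegin(), pieces.rend());
}

The repeated substring lookups are a big part of why a naive implementation is slow; SentencePiece avoids them with a trie-based common-prefix search.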

Also: Thank you @Leoputera2407 for helping with debugging the alibi problem

@lukasmoellerch (Contributor Author)

I can also merge both models into one example if that's preferred.

@klosax (Contributor) commented May 10, 2023

Does this support the MPT-7B-StoryWriter model with 65k context length?

@lukasmoellerch (Contributor Author)

Does this support the MPT-7B-StoryWriter model with 65k context length?

I didn't try it yet, but architecturally there is no difference between the base model and the story-writer fine-tuned model, to my knowledge. I can potentially try it tomorrow.

@abhi-mosaic

Thank you so much @lukasmoellerch for building support for MPT models! All of us @ MosaicML are very excited :)

but architecturally there is no difference between the base model and the story-writer fine-tuned model

One thing I want to point out: the StoryWriter model does have two architecture changes vs. the other models. It uses alibi_bias_max=16 and clip_qkv=6, due to the long-context setting. You can see the exact config here: https://huggingface.co/mosaicml/mpt-7b-storywriter/blob/main/config.json
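A hedged sketch of how the clip_qkv part could look in the mpt example, assuming the ggml_clamp(ctx, tensor, min, max) operator added in this PR and a hypothetical clip_qkv field read from the model config (6.0 for StoryWriter). The Python code clamps the combined Wqkv output; clamping the split tensors element-wise is equivalent. Variable names follow the other ggml examples.

// clamp Q/K/V after the attention input projection (MPT's clip_qkv option)
if (hparams.clip_qkv > 0.0f) {
    Qcur = ggml_clamp(ctx0, Qcur, -hparams.clip_qkv, hparams.clip_qkv);
    Kcur = ggml_clamp(ctx0, Kcur, -hparams.clip_qkv, hparams.clip_qkv);
    Vcur = ggml_clamp(ctx0, Vcur, -hparams.clip_qkv, hparams.clip_qkv);
}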

@Leoputera2407 commented May 11, 2023

Adding support for max_alibi_bias seems quite doable, @lukasmoellerch.

I'm trying out qkv_clamp on my branch; let me see if I can get that done by tomorrow. Really looking forward to trying out StoryWriter on CPU, although I wonder how much RAM is required to support the 65k context length.

@Leoputera2407 commented May 11, 2023

I found an alternative implementation by the folks from nomic-ai. They didn't implement qkv clamping, didn't fix the alibi bug, and didn't implement alibi_max_bias either, but it's a cool reference, I think:

https://github.com/nomic-ai/gpt4all/blob/f8fdcccc5d253229808c0ceb9c5faae1ba42f68c/gpt4all-backend/mpt.cpp#L2

https://github.com/nomic-ai/gpt4all/blob/f8fdcccc5d253229808c0ceb9c5faae1ba42f68c/gpt4all-backend/scripts/convert_mpt_hf_to_ggml.py#L9

@lukasmoellerch (Contributor Author)

Sounds good. It seems like a lot of people are excited about the StoryWriter model, so let's get it integrated as well; both modifications sound rather straightforward. @Leoputera2407, can you share what you've done regarding qkv clipping so far?

@klosax (Contributor) commented May 11, 2023

Inference of the StoryWriter model fails:

./main -m mpt-7b-storywriter-ggml-f16.bin 

main: seed = 1683800454
mpt_model_load: loading model from 'mpt-7b-storywriter-ggml-f16.bin' - please wait ...
mpt_model_load: d_model       = 4096
mpt_model_load: max_seq_len   = 65536
mpt_model_load: n_heads       = 32
mpt_model_load: n_layers      = 32
mpt_model_load: n_vocab      = 50432
mpt_model_load: ftype   = 1
mpt_model_load: ggml ctx size = 12683.13 MB
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 30479021776, available 13299222016)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 30479021776, available 13299222016)
Segmentation fault (core dumped)

Quantized model:

./main -m mpt-7b-storywriter-ggml-q5_1.bin 

main: seed = 1683800655
mpt_model_load: loading model from 'mpt-7b-storywriter-ggml-q5_1.bin' - please wait ...
mpt_model_load: d_model       = 4096
mpt_model_load: max_seq_len   = 65536
mpt_model_load: n_heads       = 32
mpt_model_load: n_layers      = 32
mpt_model_load: n_vocab      = 50432
mpt_model_load: ftype   = 9
mpt_model_load: ggml ctx size = 4756.88 MB
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 22167746256, available 4987946496)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 22167746256, available 4987946496)
Segmentation fault (core dumped)

The base model works.

@Green-Sky (Contributor)

needed 30479021776

29 gigs anyone?

Are you guys using f16 for the kv?

@lukasmoellerch (Contributor Author)

needed 30479021776

29 gigs anyone?

Are you guys using f16 for the kv?

It seems like the context size calculation is actually wrong, but in the wrong direction, i.e. it calculates with f32 but later allocates f16... I'll investigate later.

@klosax (Contributor) commented May 11, 2023

It seems like the context size calculation is actually wrong, but in the wrong direction, i.e. it calculates with f32 but later allocates f16... I'll investigate later.

I found a solution: I changed the types in the ctx_size calculation to uint64_t and changed the calculation of memory_k and memory_v to type F16:

uint64_t ctx_size = 0;

{
    const auto & hparams = model.hparams;

    // 64-bit types prevent the products below from overflowing
    // 32-bit integers when max_seq_len = 65536
    const uint64_t n_embd  = hparams.d_model;
    const uint64_t n_layer = hparams.n_layers;
    const uint64_t n_ctx   = hparams.max_seq_len;
    const uint64_t n_vocab = hparams.n_vocab;

(...)

    // the KV cache is allocated as F16, so account for it as F16 here as well
    ctx_size += n_ctx * n_layer * n_embd * ggml_type_sizef(GGML_TYPE_F16); // memory_k
    ctx_size += n_ctx * n_layer * n_embd * ggml_type_sizef(GGML_TYPE_F16); // memory_v

Working output:

./main -m mpt-7b-storywriter-ggml-f16.bin 
main: seed = 1683806544
mpt_model_load: loading model from 'mpt-7b-storywriter-ggml-f16.bin' - please wait ...
mpt_model_load: d_model       = 4096
mpt_model_load: max_seq_len   = 65536
mpt_model_load: n_heads       = 32
mpt_model_load: n_layers      = 32
mpt_model_load: n_vocab      = 50432
mpt_model_load: ftype   = 1
mpt_model_load: ggml ctx size = 45451.13 MB
mpt_model_load: memory_size = 32768.00 MB, n_mem = 2097152
mpt_model_load: ........................ done
mpt_model_load: model size = 12683.02 MB / num tensors = 194
main: number of tokens in prompt = 1
main: token[0] =    510

The last few years, she'd been able to come and go, she didn't come.

@klosax (Contributor) commented May 11, 2023

The generation is not very good; it seems to keep repeating itself after a while:

Base model:

./main -m mpt-7b-ggml-f16.bin -p "Once upon"

main: seed = 1683807627
mpt_model_load: loading model from 'mpt-7b-ggml-f16.bin' - please wait ...
mpt_model_load: d_model       = 4096
mpt_model_load: max_seq_len   = 2048
mpt_model_load: n_heads       = 32
mpt_model_load: n_layers      = 32
mpt_model_load: n_vocab      = 50432
mpt_model_load: ftype   = 1
mpt_model_load: ggml ctx size = 13707.13 MB
mpt_model_load: memory_size =  1024.00 MB, n_mem = 65536
mpt_model_load: ........................ done
mpt_model_load: model size = 12683.02 MB / num tensors = 194
main: number of tokens in prompt = 2
main: token[0] =  10758
main: token[1] =   2220

Once upon a time, there was a boy and a girl. They went to a wedding. And while they were there, they met a woman.
They danced with her. They talked. The girl, and boy went away. And they walked. The girl with them. He, they danced.
But they had. They danced, went. Then they, he said, and talked, and went to go.
But, the night, they. And they, and walked. They were, they. They got And did. And they. And walked. And had,

StoryWriter model:

./main -m mpt-7b-storywriter-ggml-f16.bin -p "Once upon"

main: seed = 1683807952
mpt_model_load: loading model from 'mpt-7b-storywriter-ggml-f16.bin' - please wait ...
mpt_model_load: d_model       = 4096
mpt_model_load: max_seq_len   = 65536
mpt_model_load: n_heads       = 32
mpt_model_load: n_layers      = 32
mpt_model_load: n_vocab      = 50432
mpt_model_load: ftype   = 1
mpt_model_load: ggml ctx size = 45451.13 MB
mpt_model_load: memory_size = 32768.00 MB, n_mem = 2097152
mpt_model_load: ........................ done
mpt_model_load: model size = 12683.02 MB / num tensors = 194
main: number of tokens in prompt = 2
main: token[0] =  10758
main: token[1] =   2220

Once upon a time there was a boy named Nick who lived in Brooklyn and grew up to be a man in Manhattan.
His father was a businessman and his mother worked as a nurse. Nick had a job working working working for the dad.
Nick was working working working Nick Nick Nick Nick Nick Nick Nick Nick Nick Nick Nick Nick Nick Nick Nick Nick Nick

@Green-Sky (Contributor) commented May 11, 2023

Yeah, I can see the same:

$ bin/mpt -m ../examples/mpt/models/mpt-7b-storywriter/mpt-7b-storywriter-ggml_v0-q4_0.bin
main: seed = 1683813966
mpt_model_load: loading model from '../examples/mpt/models/mpt-7b-storywriter/mpt-7b-storywriter-ggml_v0-q4_0.bin' - please wait ...
mpt_model_load: d_model       = 4096
mpt_model_load: max_seq_len   = 65536
mpt_model_load: n_heads       = 32
mpt_model_load: n_layers      = 32
mpt_model_load: n_vocab      = 50432
mpt_model_load: ftype   = 2
mpt_model_load: ggml ctx size = 36732.25 MB
mpt_model_load: memory_size = 32768.00 MB, n_mem = 2097152
mpt_model_load: ........................ done
mpt_model_load: model size =  3964.14 MB / num tensors = 194
main: number of tokens in prompt = 1
main: token[0] =   2993

She was so fat that I was not sure how she could still stand up walk. And I was not sure that I was so fat. And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And And^C

There seems to be no repeat penalty and other stuff; it needs some updates from the llama.cpp codebase.

Edit: I am using the q4_0 and it seems to die faster. We also need an option to set the ctx size; since we preallocate, I can't run the f16.
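For reference, a hedged sketch (not code from this PR) of the llama.cpp-style repetition penalty the example currently lacks: tokens seen in the last few generated tokens get their logits pushed towards lower probability before sampling. Names (last_n_tokens, repeat_penalty) are illustrative.

#include <vector>

static void apply_repeat_penalty(
        std::vector<float>     & logits,
        const std::vector<int> & last_n_tokens,
        float repeat_penalty) {          // e.g. 1.1
    for (const int tok : last_n_tokens) {
        // dividing a positive logit and multiplying a negative one both
        // lower the token's probability
        if (logits[tok] > 0.0f) {
            logits[tok] /= repeat_penalty;
        } else {
            logits[tok] *= repeat_penalty;
        }
    }
}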

@alextrott16

StoryWriter does love to repeat itself, but I've found some settings that you can use with HF generate that tend to work pretty well:

temperature: 0.8
top_p: 1.0
top_k: 0
repetition_penalty: 1.02
no_repeat_ngram_size: 6

These are the same settings you get by default in our demo space https://huggingface.co/spaces/mosaicml/mpt-7b-storywriter
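For reference (not from this repo): with top_p = 1.0 and top_k = 0 (which, in the Hugging Face convention, disable nucleus and top-k filtering), these settings mostly come down to temperature sampling plus the repetition constraints. A minimal temperature-sampling sketch; the ggml examples use their own gpt_sample_top_k_top_p helper.

#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

static int sample_temperature(const std::vector<float> & logits, float temp, std::mt19937 & rng) {
    // softmax with temperature; subtracting the max keeps exp() numerically stable
    const float max_l = *std::max_element(logits.begin(), logits.end());
    std::vector<double> w(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) {
        w[i] = std::exp((double) (logits[i] - max_l) / temp);
    }
    // discrete_distribution normalizes the weights into probabilities
    std::discrete_distribution<int> dist(w.begin(), w.end());
    return dist(rng);
}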

@klosax (Contributor) commented May 11, 2023

There seems to be no repeat penalty and other stuff; it needs some updates from the llama.cpp codebase.

I would love to see the common infrastructure of llama.cpp become something like "ggml-llm" and the code for the specific llm architectures (llama, gpt-2, gpt-j, mpt and others) become like add-ons at compile time.

@Green-Sky (Contributor) commented May 11, 2023

I uploaded some ggml files so we can test this more easily: https://huggingface.co/Green-Sky/ggml-mpt-7b-storywriter
Also, this is still using the old conversion script, which needs to load the full PyTorch model into memory for conversion.

@ggerganov (Owner)

Looks like great progress - will be taking a more detailed look soon

There seems to be no repeat penalty and other stuff; it needs some updates from the llama.cpp codebase.

I would love to see the common infrastructure of llama.cpp become something like "ggml-llm" and the code for the specific llm architectures (llama, gpt-2, gpt-j, mpt and others) become like add-ons at compile time.

Yes, this would be great. Now that we have various examples of LLM inference and I have a better understanding of the general API structure that is necessary, it will be easier to come up with a way to unify all these into a single interface.

@lukasmoellerch (Contributor Author)

@ggerganov I think we might want to separate the model max_seq_length (which is used e.g. in the alibi bias offset) from the number of k/v memory slots we allocate for inference. I temporarily hardcoded n_ctx in mpt to 4096 because otherwise my MacBook Air wasn't too happy, but this should probably be an inference parameter - should we at least just set it to params.n_predict + embd_inp.size() such that the user can set it using a command line flag?
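A hedged sketch of that workaround (fragment only; names follow the other ggml examples, and the exact integration point differs because the prompt is tokenized after the model is loaded):

// cap the KV cache at what this run can actually use
const int n_ctx = std::min((int) model.hparams.max_seq_len,
                           (int) embd_inp.size() + params.n_predict);

const size_t n_mem      = (size_t) model.hparams.n_layers * n_ctx;
const size_t n_elements = (size_t) model.hparams.d_model  * n_mem;

// allocate n_ctx slots per layer instead of max_seq_len
model.memory_k = ggml_new_tensor_1d(model.ctx, GGML_TYPE_F16, n_elements);
model.memory_v = ggml_new_tensor_1d(model.ctx, GGML_TYPE_F16, n_elements);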

@Green-Sky (Contributor)

@lukasmoellerch can't run the StoryWriter model anymore 😆, with or without quantization:

ggml_new_tensor_impl: not enough space in the context's memory pool (needed 9628746752, available 9628745984)
Segmentation fault (core dumped)

@klosax (Contributor) commented May 14, 2023

After much trial and error I found a formula for setting the memory buffer, as it needs more space with each evaluated token.

The formula only works if n_batch is equal to 1:

static size_t buf_size = 64 * 1024 * 1024 + 261 * 1024 * n_past;

or, even better, set it in main to:

buf_size = 64 * 1024 * 1024 + 261 * 1024 * params.n_predict;
buf = malloc(buf_size);

with buf declared globally.

n_predict should be forced to always be equal to or lower than n_ctx; both can be controlled from the command line.

@lukasmoellerch (Contributor Author) commented May 14, 2023

@lukasmoellerch can't run the StoryWriter model anymore 😆, with or without quantization:

ggml_new_tensor_impl: not enough space in the context's memory pool (needed 9628746752, available 9628745984)
Segmentation fault (core dumped)

Yes, I just merged master with the quantisation changes, which means that the object overhead adjustment wasn't correct. Still a bit broken, though.

@lukasmoellerch (Contributor Author)

Never mind, I just had to re-quantize the model.

@ggerganov (Owner)

should we at least just set it to params.n_predict + embd_inp.size() such that the user can set it using a command line flag?

Yes, I think this is a good workaround for now.

Regarding the memory usage during inference, there are 2 things that can help to reduce it:

  • use "inplace" calls for some of the operators. for example: 5839d9e
  • use scratch buffers

The second one should significantly reduce the memory usage, but it is quite tricky to get right since the process is manual and very easy to mess up.

We can do both of these optimizations in a later PR if you prefer to get this merged soon.
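A hedged illustration of the first point, assuming the *_inplace operator variants referenced in 5839d9e: for example, a residual addition like the ones in these examples can reuse an existing buffer instead of allocating a new tensor for the result.

// before: allocates a new tensor for the sum
inpL = ggml_add(ctx0, cur, inpL);

// after: writes the sum into cur's buffer (no extra allocation)
inpL = ggml_add_inplace(ctx0, cur, inpL);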

@lukasmoellerch (Contributor Author)

should we at least just set it to params.n_predict + embd_inp.size() such that the user can set it using a command line flag?

Yes, I think this is a good workaround for now.

I looked into this a bit but can't really get it to be clean. n_predict is the number of additional tokens, so the total number of tokens depends on the tokenizer being loaded. On the other hand, we need the prompt to know the number of tokens, which we don't have in load (and also shouldn't have). I think we either want to separate k/v tensor creation from model loading or pass the context size as a separate parameter.

But I can also do that in a follow-up PR.

Let me know what I can still do in this PR.


// a = self.ln_1(x)
{
cur = ggml_norm(ctx0, inpL);

Owner:

The Python implementation uses LayerNorm - double-check whether this corresponds to ggml_norm or ggml_rms_norm. I'm not sure where to look for the source code of LayerNorm.
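For reference, the standard definitions (not from this thread): torch.nn.LayerNorm subtracts the mean, which matches ggml_norm, while ggml_rms_norm implements the RMSNorm used by LLaMA, which does not:

\mathrm{LayerNorm}(x)_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad
\mathrm{RMSNorm}(x)_i = \frac{x_i}{\sqrt{\tfrac{1}{d}\sum_j x_j^2 + \epsilon}}

with \mu and \sigma^2 the mean and variance over the embedding dimension; the learned scale (and bias, for LayerNorm) is applied afterwards with separate ggml_mul / ggml_add calls in the examples.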

@@ -6219,6 +6226,36 @@ struct ggml_tensor * ggml_alibi(
return result;
}

// ggml_alibi

Owner:

Suggested change
// ggml_alibi
// ggml_clamp

@@ -10831,6 +10871,79 @@ static void ggml_compute_forward_alibi(
}
}


// ggml_compute_forward_alibi

Owner:

Suggested change
// ggml_compute_forward_alibi
// ggml_compute_forward_clamp

Comment on lines +6245 to +6249

struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, 3);
((float *) b->data)[0] = min;
((float *) b->data)[1] = max;


Owner:

This has to be surrounded with ggml_scratch_save() and ggml_scratch_load():

ggml/src/ggml.c, lines 3925 to 3939 in 010203f:

// IMPORTANT:
// when creating "opt" tensors, always save and load the scratch buffer
// this is an error prone process, but it is necessary to support inplace
// operators when using scratch buffers
// TODO: implement a better way
void ggml_scratch_save(struct ggml_context * ctx) {
    ctx->scratch_save = ctx->scratch;
    ctx->scratch.data = NULL;
}

void ggml_scratch_load(struct ggml_context * ctx) {
    ctx->scratch = ctx->scratch_save;
}

See how they are used in other operators that pass parameters like this.
This is needed to support scratch buffers later.

@ggerganov ggerganov merged commit 1d6a133 into ggerganov:master May 17, 2023
@ggerganov (Owner)

@lukasmoellerch and everyone else - thanks for this contribution

I'll probably play with this in the next days and will try to improve the memory allocation logic.

@lukasmoellerch (Contributor Author)

@lukasmoellerch and everyone else - thanks for this contribution

I'll probably play with this in the next days and will try to improve the memory allocation logic.

Thanks for your patience, I really like the project - let me know if any follow-up PRs are required; I'd be willing to work on them, I was just a bit busy with other stuff last week.

@x4080 commented May 22, 2023

I just tried it today and StoryWriter is still repeating the story over and over again; is there any trick to avoid it?
Thanks

@klosax (Contributor) commented May 22, 2023

I just tried it today and StoryWriter is still repeating the story over and over again; is there any trick to avoid it? Thanks

Repeat penalty is being implemented in PR #184.

@jploski (Contributor) commented May 22, 2023

Note that repetition_penalty from PR #184 (and also as implemented in llama.cpp) is not the same as no_repeat_ngram_size, which is used in the MPT-7B HuggingFace space (https://huggingface.co/spaces/mosaicml/mpt-7b-storywriter):

https://github.com/huggingface/transformers/blob/2f424d79797ea5344f1b3ac241be1a181cfc220d/src/transformers/generation/utils.py#LL860C30-L860C48
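A hedged sketch of what no_repeat_ngram_size = 6 does in HF generate, for contrast with a multiplicative repetition penalty: any token that would complete an n-gram already present in the generated sequence is banned outright. Illustrative only, not code from this repo.

#include <algorithm>
#include <limits>
#include <vector>

static void ban_repeated_ngrams(
        std::vector<float>     & logits,
        const std::vector<int> & tokens,  // tokens generated so far
        int n) {                          // no_repeat_ngram_size, e.g. 6
    if ((int) tokens.size() < n) {
        return;
    }
    // the candidate n-gram is (last n-1 tokens) + (next token)
    const std::vector<int> prefix(tokens.end() - (n - 1), tokens.end());
    for (size_t i = 0; i + n <= tokens.size(); ++i) {
        if (std::equal(prefix.begin(), prefix.end(), tokens.begin() + i)) {
            // ban the token that followed this earlier occurrence of the prefix
            logits[tokens[i + n - 1]] = -std::numeric_limits<float>::infinity();
        }
    }
}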
