add falcon7b example #231

apage43 · 2023-06-06T06:18:51Z

#217
adapted from gpt-neox example and work started in ggerganov/llama.cpp#1602

only supports 7b right now - 40b multiquery attention gets hairier, as its 128 query heads with 8 k and v heads, as opposed to 7B's 71 query heads with 1 k and v head

AndriyMulyar · 2023-06-06T14:35:53Z

Nice!

TheBloke · 2023-06-06T14:38:31Z

Well done guys! Really excited for this

ggerganov · 2023-06-07T16:21:09Z

Cool - will take a look soon
Is this using actual MQA or it's still doing the trick with the copies?

apage43 · 2023-06-07T16:36:10Z

Cool - will take a look soon Is this using actual MQA or it's still doing the trick with the copies?

does copies with ggml_repeat presently - also wound up fairly hackily creating a "dummy" tensor of the target shape since there wasn't one to use already handy as well as not transposing V before storing in the kv cache to deal with having to repeat the Vs

cmp-nct · 2023-06-07T22:25:38Z

Are you working on a 40B branch already ?

apage43 · 2023-06-07T22:34:56Z

Are you working on a 40B branch already ?

I'm not presently - being that it's big its a bit more inconvenient to hack on as I'd need to use a bigger machine than I'm usually on for dev stuff.

cmp-nct · 2023-06-07T23:10:10Z

Are you working on a 40B branch already ?

I'm not presently - being that it's big its a bit more inconvenient to hack on as I'd need to use a bigger machine than I'm usually on for dev stuff.

Did you see https://huggingface.co/jploski/falcon40b-mini-shakespeare ?

apage43 · 2023-06-07T23:19:06Z

Are you working on a 40B branch already ?

I'm not presently - being that it's big its a bit more inconvenient to hack on as I'd need to use a bigger machine than I'm usually on for dev stuff.

Did you see https://huggingface.co/jploski/falcon40b-mini-shakespeare ?

I have now :) I still probably won't get to it soon - but if someone figures out how to support that before this lands I'm happy to incorporate it

jploski · 2023-06-10T20:00:43Z

I did some work regarding 40B support today: 27cf1ad

After making my head nearly explode several times I reached a point where it generates okay sounding prose from the falcon40b-mini-shakespeare model, but it does not match the Python version output exactly as it should (and as it does for the 7B version).

The main obstacle seems to be that I am unable to make ggml_repeat broadcast multiple keys like the "k = torch.broadcast_to(k, q.shape)" in Python does (I get "1,2,1,2" instead of "1,1,2,2" so to say).

Another big problem is that the I only got the query matrix to look like the original Python one through some brute force offset calculations and copying of subvectors. It probably won't scale at all. I'm under impression that what needs to be done there can't be done using just reshaping or view operations. The memory format (as stored in Python and written by the conversion script) seems to be very difficult to work with in GGML.

Or maybe I'm just too inexperienced in this tensor wrestling... Once again giving up in hope that someone with more clue can pick it up.

jploski · 2023-06-11T23:26:17Z

I did some work regarding 40B support today: 27cf1ad

As a further explanation of the code and where the complexity comes from here's a visualization of the fused_kqv weights format (from falcon40b-mini-shakespeare config): https://docs.google.com/spreadsheets/d/1FoM6pIUj23GMW4zO_G1hjmEnUacBxBKN/edit?usp=sharing&ouid=111096390735143611797&rtpof=true&sd=true

KerfuffleV2 · 2023-06-11T23:39:30Z

Maybe just make your own repeat operation that works the way you need? Seems like the repeat op is only implemented for float32 so there's just one version of the function required.

You could create a new op and just cut-and-paste the existing _repeat functions:

ggml/src/ggml.c

Line 8773 in f52d2a0

static void ggml_compute_forward_repeat_f32(

The function looks relatively simple also.

jploski · 2023-06-13T20:35:07Z

Maybe just make your own repeat operation that works the way you need? Seems like the repeat op is only implemented for float32 so there's just one version of the function required.

I added a new ggml_repeat2 function as suggested (3352043) - although the original ggml_repeat also has a backwards pass and I'm not sure if it's the same for what I added.

With some more tweaks (commited in 3bc786b) I now have a version which works with all falcon-mini-shakespeare models I have unleashed upon this world (both 7B and 40B configs). At least in 32bit, haven't tested quantized yet. The (known) remaining problem is the for-loop-based splitting of query heads. I suspect it's gonna blow up with a real big model, either being slow or exceeding the max number of tensors (4096) allowed by GGML (or both).

(Also it's possible that the implementation does some unnecessary operations like permutes or 4d instead of 3d, but that's minor.)

bin/falcon -m /mnt/seagate/miniconda3/falcon40b/falcon40b-mini-shakespeare/ggml-model--f32.bin --top_p 1 --top_k 1 -s 42 -p "When we loop"

When we loop, and for his head,
And in his head's head's face,
And yet with his head's head is to him;
And now, in this land's face,
And with his head by his head he will die.

I tend to agree, tha'ts almost what happened to me.

maccam912 · 2023-06-13T22:41:46Z

With some more tweaks (commited in 3bc786b) I...

Ok I've been too afraid to ask, but how on earth are you doing these commits that aren't on any branch at all? I wanted to clone the repo and check out the commit but I have no idea how to.

jploski · 2023-06-13T22:52:26Z

With some more tweaks (commited in 3bc786b) I...

Ok I've been too afraid to ask, but how on earth are you doing these commits that aren't on any branch at all? I wanted to clone the repo and check out the commit but I have no idea how to.

Sorry for the confusion - these commits belong to branch falcon40b of my fork: https://github.com/jploski/ggml/tree/falcon40b - apparently GitHub not clever enough to indicate their source.

KerfuffleV2 · 2023-06-14T09:22:44Z

@jploski

I was able to convert the real 40B model with my change here to reduce memory during HF conversion (only loads a single part into RAM at a time): jploski#1

It required some work to get inference to actually run. I had to increase ctx_size:

ctx_size += ((size_t)3) * 1024 * 1024 * 1024;

Also, uhh... GGML_MAX_NODES at 4096 didn't quite cut it. Nor did 65535, I eventually just set it to 262144 and was able to run the model. Unfortunately, the output didn't make much sense:

main: seed = 1686733539
falcon_model_load: loading model from '/home/nope/personal/ai/models/falc40b.ggml' - please wait ...
falcon_model_load: n_vocab   = 65024
falcon_model_load: n_embd    = 8192
falcon_model_load: n_head    = 128
falcon_model_load: n_head_kv = 8
falcon_model_load: n_layer   = 60
falcon_model_load: ftype     = 2008
falcon_model_load: qntvr     = 2
falcon_model_load: ggml ctx size = 28175.96 MB
falcon_model_load: memory_size =   480.00 MB, n_mem = 122880
falcon_model_load: ............................................................ done
falcon_model_load: model size = 27436.06 MB / num tensors = 484
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
main: number of tokens in prompt = 10
main: token[0] =   7107, Once
main: token[1] =   2918,  upon
main: token[2] =    241,  a
main: token[3] =    601,  time
main: token[4] =     23, ,
main: token[5] =    629,  there
main: token[6] =    398,  was
main: token[7] =    241,  a
main: token[8] =   1278,  little
main: token[9] =  27224,  fox

Once upon a time, there was a little fox and, I’re
'  to’ ' .
 it that,. is
,, of . for.' '- you,. we the- the
1 of a
. the

Although it didn't work, even with the crazy number of nodes it wasn't really that slow. It was about the same as a 65B Q4_K_M LLaMA model with llama.cpp.

The mini-Shakespeare model seems fine:

main: seed = 1686733831
falcon_model_load: loading model from '/home/nope/personal/ai/models/falcsp.ggml' - please wait ...
falcon_model_load: n_vocab   = 65024
falcon_model_load: n_embd    = 256
falcon_model_load: n_head    = 4
falcon_model_load: n_head_kv = 2
falcon_model_load: n_layer   = 4
falcon_model_load: ftype     = 2009
falcon_model_load: qntvr     = 2
falcon_model_load: ggml ctx size = 3105.91 MB
falcon_model_load: memory_size =     8.00 MB, n_mem = 8192
falcon_model_load: .... done
falcon_model_load: model size =    25.89 MB / num tensors = 36
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
main: number of tokens in prompt = 1
main: token[0] =   4031, Now

Now, Clarence, my lord, I am a
the great men: I will to do this day you
In time they may live in men of tears are
Shall be not what we have fought in. What is this
and come to you? I have not made mine eyes,
Which now sent for, or I am so fast?
Your friends shall be revenged on thee, hoar!
And that you must, sirs, that you must do,
My friend to thee that news, with your love,
My father's wife and love for this day.
You are not hot, lords, and what I am not?
To take this, good sweet friend, I am not my life,
I warrant, as I, to have a little thing, my lord,
What you can stay with this good night do you all your tongue?
O, if not my fair soul to my brother, how well,
Where is

main: mem per token =   290292 bytes
main:     load time =   266.82 ms
main:   sample time =    64.16 ms
main:  predict time =   240.96 ms / 1.20 ms per token
main:    total time =   576.08 ms

Both models were quantized to Q5_0.

jploski · 2023-06-14T10:27:28Z

@jploski

I was able to convert the real 40B model with my change here to reduce memory during HF conversion (only loads a single part into RAM at a time): jploski#1

It required some work to get inference to actually run. I had to increase ctx_size:
ctx_size += ((size_t)3) * 1024 * 1024 * 1024;
Also, uhh... GGML_MAX_NODES at 4096 didn't quite cut it. Nor did 65535, I eventually just set it to 262144 and was able to run the model. Unfortunately, the output didn't make much sense:

Thanks for checking! I was able to reproduce wrong output using an unquantized mini version trained with n_embd = 1024, n_head = 128, n_head = 8. So there must still be a bug somewhere, which the previous three configs I used for testing did not catch.

KerfuffleV2 · 2023-06-14T10:49:16Z

If the problem is the complicated logic for dealing for the query heads, maybe the easiest way to deal with that is in the conversion tool from the Torch or numpy side. It should be relatively easy to shuffle things around at that point.

Reducing the complexity would make issues easier to debug too, I guess.

jploski · 2023-06-14T11:03:40Z

If the problem is the complicated logic for dealing for the query heads, maybe the easiest way to deal with that is in the conversion tool from the Torch or numpy side. It should be relatively easy to shuffle things around at that point.

Reducing the complexity would make issues easier to debug too, I guess.

Yes, I agree that reshuffling the weights during conversion will perhaps be the final and most elegant/efficient solution. I just haven't wrapped my head around it yet how changing the layout of the query_key_value tensor maps into fused_qkv from which the qkv vectors are extracted (fused_qkv = self.query_key_value(hidden_states)).

I'd also like to understand the current bug and have a working (if poorly implemented) version to improve on (even if the "improvement" will mean throwing away the overcomplicated code).

jploski · 2023-06-14T11:34:27Z

I'd also like to understand the current bug and have a working (if poorly implemented) version to improve on (even if the "improvement" will mean throwing away the overcomplicated code).

Understood and fixed in my falcon40b branch. Please recompile and try again.

KerfuffleV2 · 2023-06-14T11:47:16Z

It's alliiiiive!

main: seed = 1686742967
falcon_model_load: loading model from '/home/nope/personal/ai/models/falc40b.ggml' - please wait ...
falcon_model_load: n_vocab   = 65024
falcon_model_load: n_embd    = 8192
falcon_model_load: n_head    = 128
falcon_model_load: n_head_kv = 8
falcon_model_load: n_layer   = 60
falcon_model_load: ftype     = 2008
falcon_model_load: qntvr     = 2
falcon_model_load: ggml ctx size = 28175.96 MB
falcon_model_load: memory_size =   480.00 MB, n_mem = 122880
falcon_model_load: ............................................................ done
falcon_model_load: model size = 27436.06 MB / num tensors = 484
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
main: number of tokens in prompt = 10
main: token[0] =   7107, Once
main: token[1] =   2918,  upon
main: token[2] =    241,  a
main: token[3] =    601,  time
main: token[4] =     23, ,
main: token[5] =    629,  there
main: token[6] =    398,  was
main: token[7] =    241,  a
main: token[8] =   1278,  little
main: token[9] =  27224,  fox

Once upon a time, there was a little fox named ‘Pee-Poo’ who had an important mission to accomplish.
She had been assigned the task of finding the ‘Guru of all Gurus’ who was hiding deep in the jungle. And so one day, Pee-Poo set out for her journey. She walked and walked and walked and asked everybody in the jungle where the Guru lived, but nobody could tell her.
“But, how can that be?” she thought to herself, “There has

main: mem per token =  6467732 bytes
main:     load time = 10538.50 ms
main:   sample time =    34.28 ms
main:  predict time = 90867.40 ms / 833.65 ms per token
main:    total time = 104610.47 ms

Not a fan of the name it chose though.

For reference, these are the changes I need to actually run it:

diff --git a/examples/falcon/main.cpp b/examples/falcon/main.cpp
index beac293..c77c610 100644
--- a/examples/falcon/main.cpp
+++ b/examples/falcon/main.cpp
@@ -198,6 +198,7 @@ bool falcon_model_load(const std::string & fname, falcon_model & model, gpt_voca
                     ggml_type_sizef(GGML_TYPE_F32);  // memory_v
 
         ctx_size += (5 + 10 * n_layer) * 256;  // object overhead TODO:
+        ctx_size += ((size_t)3) * 1024 * 1024 * 1024;
         printf("%s: ggml ctx size = %6.2f MB\n", __func__, ctx_size/(1024.0*1024.0));
     }
 
diff --git a/include/ggml/ggml.h b/include/ggml/ggml.h
index e770603..83b0d84 100644
--- a/include/ggml/ggml.h
+++ b/include/ggml/ggml.h
@@ -194,7 +194,7 @@
 #define GGML_QNT_VERSION_FACTOR 1000 // do not change this
 
 #define GGML_MAX_DIMS          4
-#define GGML_MAX_NODES         4096
+#define GGML_MAX_NODES         262144
 #define GGML_MAX_PARAMS        256
 #define GGML_MAX_CONTEXTS      64
 #define GGML_MAX_OPT           4

TheBloke · 2023-06-14T12:37:55Z

Amazing work guys!

So is https://github.com/jploski/ggml/tree/falcon40b the branch I should use to try converting and running GGMLs?

cmp-nct · 2023-06-14T12:38:59Z

262144 nodes, wtf :-)

Awesome to see it works so well!

jploski · 2023-06-14T12:43:47Z

Amazing work guys!

So is https://github.com/jploski/ggml/tree/falcon40b the branch I should use to try converting and running GGMLs?

I would suggest not converting them just yet - because if/when the qkv reshuffling during conversion is implemented, the binary format of the tensors would change again... which would make all the already published files incompatible.

TheBloke · 2023-06-14T12:49:20Z

OK fair enough!

ochafik · 2023-06-15T01:23:24Z

Your GGML_MEM_ALIGN theory sounds promising - I checked the ctx_size calculations (particularly bytes per element for f16 and the per-tensor overhead of sizeof(struct ggml_tensor) + GGML_OBJECT_SIZE = 256 bytes per tensor), but after my attempted fixes it was coming out even shorter than before...

As it turns out, GGML's API does have a ggml_tensor_overhead() that returns... 272 (Although in practice I only measured it at 224), so that could also add up.

Now testing a fix (jploski#2)

I also see that there are suspicious hardcoded fixed-size memory "scratch" buffers in falcon_eval, and I find it probably wrong that the allocations there are independent from the initial prompt size (N = number-of-tokens-in-prompt during first invocation of falcon_eval, 1 in subsequent invocations). But the crashes you reported happened even before that while loading the model.

Happy to dive into this another day if there's other crashes!

apage43 · 2023-06-15T01:55:58Z

Although the README says the tokenizer will probably only work for English

I just copied that from the gpt-neox example - the common gpt-style tokenizer in /examples is broken, but actually in a way that is more problematic for text that the vocabulary can tokenize efficiently (i.e. English) - because it will generate suboptimal tokenizations that don't match the way the tokenizer the models are trained with would tokenize the same text

an example of why that causes problems is that if there is an"about" token in the vocabulary, the model will have only seen "about" represented with that token and will have never seen "about" represented as ["ab", "out"] during training even if those tokens exist in the vocabulary, and every time it has seen those tokens it will have been in other contexts. Which means that models often understand suboptimally tokenized text poorly if at all

fixing this unfortunately requires the file to also contain the tokenizer's "merges" list which is not currently captured in to know what order to combine tokens in to match the original behavior, see #220 (comment)

(well, some I've seen fail to decode non-English text but that's because the converter script didn't store the tokens as raw bytes but did a utf-8 decoding/encoding roundtrip on them first - token vocabulary items are bytes, not necessarily valid utf8 strings, and a unicode character may be split across multiple tokens that do not individually contain complete valid utf8 sequences - this is just a problem with those convert scripts though, not the decoding logic, which is pretty simple)

ochafik · 2023-06-15T01:58:33Z

No crash with any model / quantization after jploski#2, ~~but it looks like falcon-7b-instruct q8_0 produces garbage (right after prompt):~~ (edit: false alarm, looks all good)

~~Write a thank you poem.&&%2)>>ANSWER<<;122>>ABSTRACT<<,1+20>>INTRODUCTION<<5)2>>TITLE<<~~

maddes8cht · 2023-06-15T08:35:24Z

Seems i have successfully compiled falcon.exe for windows, try to run falcon 7b version https://huggingface.co/RachidAR/falcon-7B-ggml with it.
Get the error message:

main: seed = 1686817064
falcon_model_load: loading model from '.\models\falcon-7b-q4_0-ggml.bin' - please wait ...
falcon_model_load: invalid model file '.\models\falcon-7b-q4_0-ggml.bin' (bad Falcon version: 9)
main: failed to load model from '.\models\falcon-7b-q4_0-ggml.bin'.

So there seems to be no 7b model version available as @TheBloke so far only has the 40b versions (thanks @TheBloke for the great work though)

So, while downloading (very slow from current PC) the original files from
https://huggingface.co/tiiuae/falcon-7b
im asking how to do the quatization with falcon-quantize.exe (never done this before), as the commandline help says:

falcon-quantize.exe --help
usage: falcon-quantize.exe model-f32.bin model-quant.bin type`

but there seems to be two files,
pytorch_model-00001-of-00002.bin and pytorch_model-00002-of-00002.bin

jploski · 2023-06-15T10:23:37Z

Seems i have successfully compiled falcon.exe for windows, try to run falcon 7b version https://huggingface.co/RachidAR/falcon-7B-ggml with it. Get the error message:
falcon_model_load: invalid model file '.\models\falcon-7b-q4_0-ggml.bin' (bad Falcon version: 9)

This error indicates that the GGML file you downloaded is outdated and not compatible with the current implementation.

As for DIY quantize, you need to perform two steps:

Convert the multiple pytorch*.bin checkpoint files to one GGML file

python3 convert-hf-to-ggml.py 0 /path/to/directory/with/the/pytorch/bin/files /path/to/output/directory 1

Quantize the GGML file to your desired format (e.g. 9 = q5_1)

bin/falcon-quantize /tmp/ggml-model--f32.bin /tmp/ggml-model--f32-q5_1.bin 9

KerfuffleV2 · 2023-06-15T15:54:13Z

I think something might be a tiny bit off with the new context size calculations:

falcon_model_load: loading model from '/somepath/WizardLM-Uncensored-Falcon-40b.ggmlv3.q5_1.bin' - please wait ...
falcon_model_load: n_vocab   = 65025
falcon_model_load: n_embd    = 8192
falcon_model_load: n_head    = 128
falcon_model_load: n_head_kv = 8
falcon_model_load: n_layer   = 60
falcon_model_load: ftype     = 2009
falcon_model_load: qntvr     = 2
falcon_model_load: ggml ctx size = 7483257978003.64 MB
GGML_ASSERT: /somepath/ggml-falcon/src/ggml.c:3982: ctx->mem_buffer != NULL

I tried with other models, including the one I generated myself (which worked the other day). It would probably work on a machine with 6.8 exobytes of RAM but sadly I don't have quite that much.

(ignore, it was due to ne[0] being passed incorrectly, not this code)

edit: This is way, way off:

~~size_needed += GGML_TYPE_SIZE[type]*(ne[0]/GGML_BLCK_SIZE[type]);~~

~~With some debug prints, I can see it produces an absolutely absurd initial value like 4631926893623383552.~~

edit: ~~Fix here: jploski#3~~ (no longer needed)

jploski · 2023-06-15T16:33:28Z

I think something might be a tiny bit off with the new context size calculations:
edit: Fix here: jploski#3

It seems we both independently fixed it the same way (although I don't quite understand what was incorrect about the parameter passing).

KerfuffleV2 · 2023-06-15T16:36:52Z

although I don't quite understand what was incorrect about the parameter passing

Sorry, that wasn't a good explanation. I think the issue was auto choosing the wrong type, not necessarily the parameter passing itself. I'm pretty sure that auto was choosing a 32bit type, which got passed as a pointer to something expecting a 64bit value. So it would just start reading from the pointer address and of course half of the 64bit value would be basically random so you'd get a crazy result.

jploski · 2023-06-15T16:40:41Z

although I don't quite understand what was incorrect about the parameter passing

Sorry, that wasn't a good explanation. I think the issue was auto choosing the wrong type, not necessarily the parameter passing itself. I'm pretty sure that auto was choosing a 32bit type, which got passed as a pointer to something expecting a 64bit value. So it would just start reading from the pointer address and of course half of the 64bit value would be basically random so you'd get a crazy result.

Ah yes, good old C where you can read half a variable and be happy. ;)

khimaros · 2023-06-15T18:33:09Z

CPU inference is working well here with the latest commits to falcon40b branch and TheBloke's falcon40b-instruct (q4_0) on debian bookworm. thank you for all of the work on this!

ggerganov · 2023-06-16T09:42:29Z

Will be looking into merging this during the weekend.

Wondering if #224 would be enough to avoid the extra ggml_repeat()s so we can have true MQA and save some memory.

One other thing worth mentioning (and I don't know if it's inherent with the model) is that generation seems to slow down as tokens are generated. It's much, much slower after generating 1000 tokens which LLaMA seems to go at close to the same speed up to the context length.

Most likely this is due to extra transpose and copies in the attention.
We had a similar issue in llama.cpp and we fixed it by avoiding the copies:

ggerganov/llama.cpp#775

Tensor overhead should be computed with ggml_tensor_overhead() - it takes into account memory alignment effects.

We should probably also think about an elegant way to plug this inference in llama.cpp and allow usage with both Falcon and LLaMA models. I have some ideas, but not sure what is the best way yet. In any case, the falcon python converter would most likely need to add the necessary padding in the generated ggml files to be able to match the mmap format of llama.cpp. We might also need to do some refactoring first in llama.cpp to simplify both the python and C++ loading code because it has lately become too complex, mostly because it is trying to support old formats that nobody uses anymore

maddes8cht · 2023-06-16T09:49:18Z

@jploski
"As for DIY quantize, you need to perform two steps:

Convert the multiple pytorch*.bin checkpoint files to one GGML file

python3 convert-hf-to-ggml.py 0 /path/to/directory/with/the/pytorch/bin/files /path/to/output/directory 1

Okay, created conda environment, installed pytorch, installed transformers, run your line

getting:

python convert-hf-to-ggml.py 0 X:\falcon.ggml\pytorch X:\falcon.ggml\models\ 1
Traceback (most recent call last):
  File "c:\Users\WaWiAdm\Documents\Github\falcon-ggml\examples\falcon\convert-hf-to-ggml.py", line 68, in <module>
    tokenizer = AutoTokenizer.from_pretrained(model_name)
  File "C:\Users\WaWiAdm\anaconda3\envs\falcon\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 658, in from_pretrained
    config = AutoConfig.from_pretrained(
  File "C:\Users\WaWiAdm\anaconda3\envs\falcon\lib\site-packages\transformers\models\auto\configuration_auto.py", line 947, in from_pretrained
    trust_remote_code = resolve_trust_remote_code(
  File "C:\Users\WaWiAdm\anaconda3\envs\falcon\lib\site-packages\transformers\dynamic_module_utils.py", line 535, in resolve_trust_remote_code
    signal.signal(signal.SIGALRM, _raise_timeout_error)
AttributeError: module 'signal' has no attribute 'SIGALRM'. Did you mean: 'SIGABRT'?

what's missing?

KerfuffleV2 · 2023-06-16T09:56:39Z

@maddes8cht That's an issue with Transformers not the conversion script specifically. You could possibly just comment out that line in dynamic_module_utils.py — most likely that would just prevent timeouts from working (which is probably benign in this case).

TheBloke · 2023-06-16T10:20:25Z

We should probably also think about an elegant way to plug this inference in llama.cpp and allow usage with both Falcon and LLaMA models. I have some ideas, but not sure what is the best way yet. In any case, the falcon python converter would most likely need to add the necessary padding in the generated ggml files to be able to match the mmap format of llama.cpp. We might also need to do some refactoring first in llama.cpp to simplify both the python and C++ loading code because it has lately become too complex, mostly because it is trying to support old formats that nobody uses anymore

That would be amazing. I've wished for a long time that llama.cpp would be more like llm.cpp, and support other model types.

It feels to me like the gap between the capabilities of llama.cpp and non-Llama GGML is getting wider, just as more people are wishing to use non-Llama models, both because of licensing and because they want the special capabilities of other models, like StarCoder for coding.

Anything that could be done to support more model types would be really appreciated by the community I think. Llama.cpp has got amazingly powerful and I'd love if those features could become available for other models too.

jploski · 2023-06-16T11:00:02Z

Will be looking into merging this during the weekend.

Wondering if #224 would be enough to avoid the extra ggml_repeat()s so we can have true MQA and save some memory.

I just gave it a quick try, but no luck. Merged in #224 locally, commented out the application of ggml_repeat2 for K... No more assertion fail, but unfortunately the result from this multiplication differs from expected, so the implicit broadcast must have produced something different than ggml_repeat2. But I do not understand #224 at this point, so maybe it can be achieved with some extra trickery.

jploski · 2023-06-16T11:17:11Z

Tensor overhead should be computed with ggml_tensor_overhead() - it takes into account memory alignment effects.

@ggerganov Not sure if that is enough, though: the issue seems to be that ggml_tensor_overhead() + nelements * type_size is different than size_needed as calculated in ggml_new_tensor_impl.

Added falcon main and library based on llama.cpp CPU inference works (getting ~260ms/token on 7B 16 bit falcon) Tested with 7B 16 bit and the two shakespear models (both in 16 bit precisiononly) TODO/WIP: 1) quantization runs, creates a ggjt 3 file but something is wrong with the quantized model binary - even quantization from 16 -> 16 also fails, something is wrong in the tensors produced 2) mmap should work with quantized binaries once 1) is solved 3) CUDA support is mostly there, it's currently disabled (all CPU backend) 4) memory/context caluculations are off, GPU memory calculations are wrong either 5) the python conversion script is pre GGML 1 version (tokens without scores) 6) some stuff is still called "llama", some of it should be renamed to a generic name as it works for both 7) the GGML produced by the current python uses an old ftype method Makfiles: cmake on windows with build tools works the makefile for linux/msys was blind adjusted but not tested yet - possibly missed something Changes to the codebase: * repeat2 has been added to ggml (jploski - ggerganov/ggml#231) including the backward variant (untested, probably fails) * minor changes to work with falcon (name length) * libfalcon is the previous "llama.cpp" and falcon_main is the previous main.cpp

ggerganov · 2023-06-18T10:40:44Z

So I'm a bit confused - what is the difference between this branch and @jploski version?
Why @JPolski added repeat2 and here we haven't?

jploski · 2023-06-18T10:59:35Z

So I'm a bit confused - what is the difference between this branch and @jploski version? Why @JPolski added repeat2 and here we haven't?

Which branch are you referring to? I see ggml_repeat2 on all branches related to Falcon integration.

jploski · 2023-06-18T11:02:17Z

So I'm a bit confused - what is the difference between this branch and @jploski version? Why @JPolski added repeat2 and here we haven't?

Which branch are you referring to? I see ggml_repeat2 on all branches related to Falcon integration.

Oh, if you meant apage43:falcon, then it is out-of-date, as it only supports 7B (for which ggml_repeat2 was indeed not needed because there is only one KV to repeat, so the ordering does not matter). apache43:falcon is where I forked my jploski/ggml (falcon40b) to implement 40B. And later the development moved over to cmp-nct/ggllm.cpp

ggerganov · 2023-06-18T11:18:00Z

Ok, got it. I'll postpone merging this PR then. Want to focus on some ggml maintenance for while and it looks like there are already enough ongoing efforts for Falcon support.

Ideally, I think we want to avoid the ggml_repeat2() and figure out how to generalize the existing ops. I'm not sure it is possible as I haven't looked in understanding completely MQA, but I hope we figure it out eventually

jploski · 2023-06-18T11:26:36Z

Ok, got it. I'll postpone merging this PR then. Want to focus on some ggml maintenance for while and it looks like there are already enough ongoing efforts for Falcon support.

Ideally, I think we want to avoid the ggml_repeat2() and figure out how to generalize the existing ops. I'm not sure it is possible as I haven't looked in understanding completely MQA, but I hope we figure it out eventually

Yes, I agree that we should remove repeat2. If it is done on the cmp-nct/ggllm.cpp branch, I will update my https://github.com/jploski/ggml/tree/falcon40b accordingly. I think it would be helpful if you could check whether the mat_mul broadcast could somehow do the trick, as you are most familiar with the broadcast implementation (I suppose).

To understand how it needs to work, see:

https://docs.google.com/spreadsheets/d/1FoM6pIUj23GMW4zO_G1hjmEnUacBxBKN/edit#gid=2097305276

What we need to come out of repeat/broadcast (and what repeat2 produces) is: N[0].K[0], N[0].K[0], N[0].K[1], N[0].K[1]
But what we get from the standard repeat is: N[0].K[0], N[0].K[1], N[0].K[0], N[0].K[1]

jploski · 2023-06-18T11:42:14Z

Ok, got it. I'll postpone merging this PR then. Want to focus on some ggml maintenance for while and it looks like there are already enough ongoing efforts for Falcon support.

Ideally, I think we want to avoid the ggml_repeat2() and figure out how to generalize the existing ops. I'm not sure it is possible as I haven't looked in understanding completely MQA, but I hope we figure it out eventually

If by MQA you mean multi-query attention (sorry, you mentioned it earlier, but I did not manage to decipher it), in the original multi-query paper (which Falcon-7B adheres to), there is only one key vector and one value vector (n_head_kv=1), and this k/v vector is reused/shared by all queries (in the sense that each query vector is multiplied by the same key). Contrast this with the traditional GPT approach where there is the same number of queries as keys/values. The motivation from the paper was to save (KV) memory while retaining approximately the same quality of inference.

In the generalized n_head_kv>1 implementation, which Falcon-40B implements, and for which I found no paper, there are multiple "kv groups", each consisting of one KV pair and n queries that reuse/share that group's KV pair. This is somewhat of a compromise between having just one KV pair and one KV pair for each query.

So the ordering issue in ggml_repeat(2) is to make sure that the right queries are matched to the right keys for the multiplication (and the resulting weights to right values).

FSSRepo · 2023-08-22T23:12:29Z

llama.cpp will have the complete support for falcon #2717

falcon7b example

5f634ef

apage43 force-pushed the falcon branch from 67c11da to 5f634ef Compare June 6, 2023 06:33

apage43 mentioned this pull request Jun 6, 2023

llama : add Falcon LLM support ggerganov/llama.cpp#1602

Closed

don't overallocate falcon k/v cache

17da463

mudler mentioned this pull request Jun 6, 2023

feat: add experimental support for falcon-7b mudler/LocalAI#516

Merged

store decoded bytes in vocab (emoji fix)

c2eaac1

KerfuffleV2 mentioned this pull request Jun 15, 2023

Support Falcon rustformers/llm#293

Open

matthoffner mentioned this pull request Jun 16, 2023

Falcon support? marella/ctransformers#24

Closed

LLukas22 mentioned this pull request Jun 17, 2023

Add Falcon Support rustformers/llm#313

Merged

KerfuffleV2 mentioned this pull request Jun 18, 2023

Slowdown with tokens cmp-nct/ggllm.cpp#6

Open

apage43 closed this Oct 16, 2023

add falcon7b example #231

add falcon7b example #231

Conversation

apage43 commented Jun 6, 2023 • edited Loading

AndriyMulyar commented Jun 6, 2023

TheBloke commented Jun 6, 2023

ggerganov commented Jun 7, 2023

apage43 commented Jun 7, 2023

cmp-nct commented Jun 7, 2023

apage43 commented Jun 7, 2023 • edited Loading

cmp-nct commented Jun 7, 2023

apage43 commented Jun 7, 2023

jploski commented Jun 10, 2023

jploski commented Jun 11, 2023

KerfuffleV2 commented Jun 11, 2023

jploski commented Jun 13, 2023

maccam912 commented Jun 13, 2023 • edited Loading

jploski commented Jun 13, 2023 • edited Loading

KerfuffleV2 commented Jun 14, 2023

jploski commented Jun 14, 2023

KerfuffleV2 commented Jun 14, 2023

jploski commented Jun 14, 2023

jploski commented Jun 14, 2023

KerfuffleV2 commented Jun 14, 2023

TheBloke commented Jun 14, 2023

cmp-nct commented Jun 14, 2023

jploski commented Jun 14, 2023

TheBloke commented Jun 14, 2023

ochafik commented Jun 15, 2023

apage43 commented Jun 15, 2023 • edited Loading

ochafik commented Jun 15, 2023 • edited Loading

maddes8cht commented Jun 15, 2023

jploski commented Jun 15, 2023

KerfuffleV2 commented Jun 15, 2023 • edited Loading

jploski commented Jun 15, 2023

KerfuffleV2 commented Jun 15, 2023

jploski commented Jun 15, 2023

khimaros commented Jun 15, 2023

ggerganov commented Jun 16, 2023

maddes8cht commented Jun 16, 2023

KerfuffleV2 commented Jun 16, 2023

TheBloke commented Jun 16, 2023 • edited Loading

jploski commented Jun 16, 2023

jploski commented Jun 16, 2023

ggerganov commented Jun 18, 2023

jploski commented Jun 18, 2023

jploski commented Jun 18, 2023

ggerganov commented Jun 18, 2023

jploski commented Jun 18, 2023

jploski commented Jun 18, 2023

FSSRepo commented Aug 22, 2023

apage43 commented Jun 6, 2023 •

edited

Loading

apage43 commented Jun 7, 2023 •

edited

Loading

maccam912 commented Jun 13, 2023 •

edited

Loading

jploski commented Jun 13, 2023 •

edited

Loading

apage43 commented Jun 15, 2023 •

edited

Loading

ochafik commented Jun 15, 2023 •

edited

Loading

KerfuffleV2 commented Jun 15, 2023 •

edited

Loading

TheBloke commented Jun 16, 2023 •

edited

Loading