llama : grouped-query attention + LLaMAv2 70B support #2276

Merged: 6 commits, Jul 23, 2023

Conversation

ggerganov
Owner

@ggerganov ggerganov commented Jul 19, 2023

ref #2262

  • Added support for GQA. Currently, the grouping factor (8 for 70Bv2, 1 for everything else) has to be passed from the command line: -gqa 8. With GGUF it will be read from the model hparams
  • The ffn_dim_multiplier needed to determine the correct value for n_ff is hardcoded to 1.3 when an 80-layer model with GQA == 8 is loaded (i.e. this corresponds to 70Bv2). Otherwise, it defaults to 1.0. This also needs to be read from the model hparams in the future
  • CUDA support for GQA is provided by @JohannesGaessler
# usage
python3 convert.py --outfile models/70B-v2/ggml-model-f16.bin --outtype f16 ../llama2/llama/llama-2-70b/
./quantize ./models/70B-v2/ggml-model-f16.bin ./models/70B-v2/ggml-model-q4_0.bin q4_0
./main -m ./models/70B-v2/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap --ignore-eos -n 64 -t 8 -gqa 8
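To make the effect of -gqa and the hardcoded multiplier concrete, here is a minimal, self-contained sketch of the derived quantities (illustrative only: the variable names are hypothetical and this is not the actual llama.cpp code):

#include <cstdint>
#include <cstdio>

int main() {
    // Hyperparameters of LLaMA-2 70B as printed by llama_model_load_internal
    const uint32_t n_embd  = 8192;
    const uint32_t n_head  = 64;
    const uint32_t n_layer = 80;
    const uint32_t n_gqa   = 8;   // the value passed via -gqa

    // With grouped-query attention, K and V use n_gqa times fewer heads,
    // so the K/V projections shrink from n_embd x n_embd to n_embd x n_embd_gqa
    const uint32_t n_head_kv  = n_head / n_gqa;   // 8
    const uint32_t n_embd_gqa = n_embd / n_gqa;   // 1024

    // The ffn_dim_multiplier is currently hardcoded: 1.3 only for the 80-layer, GQA == 8 case
    const double ffn_mult = (n_layer == 80 && n_gqa == 8) ? 1.3 : 1.0;

    printf("n_head_kv = %u, n_embd_gqa = %u, ffn_dim_multiplier = %.1f\n",
           (unsigned) n_head_kv, (unsigned) n_embd_gqa, ffn_mult);
    return 0;
}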

Some notes while working on this:

I haven't done perplexity calcs yet, but I ran a few text generations using the small 7Bv2 model. My impression is that with this new model, the Q4_0 and Q5_0 quantizations do not work well. What I notice is that after the first sentence ends, the new sentence starts off in wild ways, often switching to German or some other language. This is something I have not observed with the original LLaMA 7B. It would also often start the second sentence without a capital letter, which again has never been the case before.

The QX_1, QX_K and Q8_0 quantizations do not seem to exhibit this behaviour (or if they do, it is to a much smaller extent)
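As background on why the *_0 formats might be more brittle here, this is a rough sketch of the two per-block dequantization schemes (simplified, block layout and exact ggml struct details omitted; it is an assumption about the relevant difference, not an analysis of this issue). Q4_0 stores only a scale per block, while Q4_1 additionally stores a per-block minimum, which handles blocks of weights that are not symmetric around zero; Q5_0/Q5_1 are analogous with 5-bit quants.

#include <cstdio>

// Simplified per-block dequantization (ggml uses blocks of 32 weights):
//   Q4_0: w ~ d * (q - 8)   scale only, symmetric around zero
//   Q4_1: w ~ d * q + m     scale plus per-block minimum
static float dequant_q4_0(float d, int q)          { return d * (float)(q - 8); }
static float dequant_q4_1(float d, float m, int q) { return d * (float)q + m;   }

int main() {
    // A block whose weights all sit near +0.5 wastes half of Q4_0's symmetric range
    // on negative values it never uses, while Q4_1 can shift its range with m.
    printf("Q4_0, q = 15: %.3f\n", dequant_q4_0(0.07f, 15));         // ~0.49, coarse steps of 0.07
    printf("Q4_1, q = 15: %.3f\n", dequant_q4_1(0.004f, 0.45f, 15)); // ~0.51, finer steps of 0.004
    return 0;
}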

Note, the examples below are not related to the changes in this PR. The same behaviour is observed on master using the LLaMAv2 7B model.

Here are a few examples:

# command
./main -m ${model} -p "I believe the meaning of life is" --ignore-eos -n 32 -t 8

# Q4_0
 I believe the meaning of life is to find your purpose. sierpamid2015
I've always been a very spiritual person and I think the purpose of life is that we

 I believe the meaning of life is to be happy.☉
I am happy to tell you what I think about you.☆
Do I have a right to feel hurt when someone doesn'

 I believe the meaning of life is to find your passion and to have the courage to pursue it. nobody should stop you from living that dream if they want it for themselves...

 I believe the meaning of life is to help others and make people laugh. Einzeln ist das Leben nicht viel wert...
„In the past, it was a joke in Hollywood that if

# Q4_1
 I believe the meaning of life is to find your purpose.
Honorable Mentions: Astonishing, Incredible, Excellent, Impressive, and Fant

 I believe the meaning of life is to be a good parent, spouse and person.
Kimberly is an Aussie and me a Kiwi but we’re married

 I believe the meaning of life is to find your passion and to have the courage to pursue it.
What was your first role as a kid? And who were you performing for?

 I believe the meaning of life is to learn and grow as a person. It's not about what you have in your bank account, how much money you make, or the size of your

# Q5_0
 I believe the meaning of life is to find your purpose. Hinweis auch, dass die Daten durch den Spieler aktuell gehalten werden müssen und somit regelmäßig gelösch

 I believe the meaning of life is to be a good man. Hinweis: Es ist keine göttliche oder universelle Antwort!
My 10-year old daughter recently asked me

 I believe the meaning of life is to find your passion and share it with others. nobody can tell you what's right for you in this life, if anything is wrong at all, it

 I believe the meaning of life is to learn and grow, but it's also about being happy. surely we don't need a million years to do that.
When you are young

# Q5_1
 I believe the meaning of life is to be a good example.
The following are just some short thoughts that have crossed my mind in relation to these scriptures:

 I believe the meaning of life is to find your passion and follow it.
The meaning of life is not happiness, but growth in a continuous journey

 I believe the meaning of life is to enjoy it. You’re only here for a short time, so do what you want!
The meaning of life in itself is not important. It

 I believe the meaning of life is to create and nurture an awareness that leads to self-love, authenticity and purpose.
My passion for this subject began with my

# Q4_K
 I believe the meaning of life is to create. You can make whatever you want in this world, whether it's a chair or a family. And if you think about it, everything is

 I believe the meaning of life is to enjoy it. Life has been a great teacher for me, and every day offers new lessons on how to live in peace and love.

 I believe the meaning of life is to find your passion and to love it. I think that we are here to learn, experience new things and grow so therefore if you aren’t doing what

 I believe the meaning of life is to be as happy, joyful and satisfied as you can throughout your time on earth. A life where, when you look back upon it at the end,

# Q8_0
 I believe the meaning of life is to be found in personal relationships. The only way for a relationship to grow and become meaningful, however, is that both parties are committed and devoted to it

 I believe the meaning of life is to find your purpose.
then fulfil it with a passion.
Finding my purpose was easy - but the journey that led me there, the less

 I believe the meaning of life is to learn and grow as a person. The meaning of life is to experience it, to see new things, meet new people, to discover new places, and

 I believe the meaning of life is to make it as beautiful and wonderful as possible.
This is a work in progress, but I’ll get there one day!

Old description below

This works for 70B LLaMA-v2:

python3 convert.py --outfile models/70B-v2/ggml-model-f16.bin --outtype f16 ../llama2/llama/llama-2-70b/
./quantize ./models/70B-v2/ggml-model-f16.bin ./models/70B-v2/ggml-model-q4_0.bin q4_0
./main -m ./models/70B-v2/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap --ignore-eos -n 64 -t 8
$ ▶ make -j && gdb --args ./bin/main -m ../models/70B-v2/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap --ignore-eos -n 64 -t 8
[  2%] Generating build details from Git
[  6%] Built target ggml
Consolidate compiler generated dependencies of target llama
-- Found Git: /usr/bin/git (found version "2.34.1") 
[  8%] Built target ggml_static
[ 10%] Building CXX object CMakeFiles/llama.dir/llama.cpp.o
[ 10%] Built target BUILD_INFO
[ 12%] Linking CXX static library libllama.a
[ 12%] Built target llama
[ 14%] Linking CXX executable ../bin/test-quantize-fns
[ 19%] Linking CXX executable ../bin/test-grad0
[ 19%] Linking CXX executable ../bin/test-sampling
[ 21%] Built target common
[ 27%] Linking CXX executable ../bin/test-tokenizer-0
[ 27%] Linking CXX executable ../../bin/quantize
[ 27%] Linking CXX executable ../../bin/quantize-stats
[ 29%] Linking CXX executable ../bin/test-quantize-perf
[ 53%] Linking CXX executable ../../bin/server
[ 53%] Linking CXX executable ../../bin/embedding
[ 53%] Linking CXX executable ../../bin/benchmark
[ 53%] Linking CXX executable ../../bin/train-text-from-scratch
[ 53%] Linking CXX executable ../../bin/main
[ 53%] Linking CXX executable ../../bin/save-load-state
[ 53%] Linking CXX executable ../../bin/vdot
[ 53%] Linking CXX executable ../../bin/simple
[ 53%] Linking CXX executable ../../bin/baby-llama
[ 53%] Built target embdinput
[ 57%] Linking CXX executable ../../bin/perplexity
[ 57%] Linking CXX executable ../../bin/q8dot
[ 59%] Linking CXX executable ../../bin/embd-input-test
[ 61%] Built target test-grad0
[ 63%] Built target test-quantize-fns
[ 65%] Built target test-quantize-perf
[ 68%] Built target quantize
[ 70%] Built target vdot
[ 72%] Built target test-tokenizer-0
[ 76%] Built target test-sampling
[ 76%] Built target q8dot
[ 78%] Built target benchmark
[ 82%] Built target embd-input-test
[ 82%] Built target baby-llama
[ 85%] Built target main
[ 87%] Built target save-load-state
[ 89%] Built target embedding
[ 91%] Built target simple
[ 93%] Built target perplexity
[ 95%] Built target train-text-from-scratch
[ 97%] Built target quantize-stats
[100%] Built target server
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <https://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./bin/main...
(gdb) r
Starting program: /home/ggerganov/development/github/llama.cpp/build-rwdi/bin/main -m ../models/70B-v2/ggml-model-q4_0.bin -p I\ believe\ the\ meaning\ of\ life\ is --no-mmap --ignore-eos -n 64 -t 8
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
main: build = 852 (294f424)
main: seed  = 1689771309
llama.cpp: loading model from ../models/70B-v2/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 28672
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size = 37070.93 MB
llama_model_load_internal: mem required  = 40208.93 MB (+ 5120.00 MB per state)
llama_new_context_with_model: kv self size  = 1280.00 MB

system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 64, n_keep = 0


 I believe the meaning of life is to give, not to get. I have found that the more we give our time, energy and love to others, the more fulfilled our lives become. Our relationships grow stronger and deeper.
I also believe that God has blessed each one of us with unique gifts and talents. It is up to
llama_print_timings:        load time = 14604.73 ms
llama_print_timings:      sample time =    25.78 ms /    64 runs   (    0.40 ms per token,  2483.03 tokens per second)
llama_print_timings: prompt eval time =  5393.69 ms /     8 tokens (  674.21 ms per token,     1.48 tokens per second)
llama_print_timings:        eval time = 57526.02 ms /    63 runs   (  913.11 ms per token,     1.10 tokens per second)
llama_print_timings:       total time = 62954.88 ms
[Inferior 1 (process 1698964) exited normally]
(gdb) q

Currently, only CPU.
This is very quick and dirty - just to see what changes are necessary.

Good news is the convert.py script does not require changes.
To add support for GPU and BLAS, fix the following TODOs:

llama.cpp/ggml.c

Lines 10729 to 10748 in 2d2bb6b

#if defined(GGML_USE_CLBLAST)
    if (ggml_cl_can_mul_mat(src0, src1, dst)) {
        // TODO: handle case when src0 is broadcast-able into src1 across 2nd,3rd dimension
        // ref: https://github.com/ggerganov/ggml/pull/224
        GGML_ASSERT(ne02 == ne12);
        GGML_ASSERT(ne03 == ne13);

        if (params->ith == 0 && params->type == GGML_TASK_COMPUTE) {
            ggml_cl_mul_mat(src0, src1, dst, params->wdata, params->wsize);
        }
        return;
    }
#endif

#if defined(GGML_USE_ACCELERATE) || defined(GGML_USE_OPENBLAS)
    if (ggml_compute_forward_mul_mat_use_blas(src0, src1, dst)) {
        // TODO: handle case when src0 is broadcast-able into src1 across 2nd,3rd dimension
        // ref: https://github.com/ggerganov/ggml/pull/224
        GGML_ASSERT(ne02 == ne12);
        GGML_ASSERT(ne03 == ne13);

I.e., implement the mul_mat broadcast logic from here:

llama.cpp/ggml.c

Lines 10832 to 10850 in 2d2bb6b

const int64_t i13 = (ir1/(ne12*ne11));
const int64_t i12 = (ir1 - i13*ne12*ne11)/ne11;
const int64_t i11 = (ir1 - i13*ne12*ne11 - i12*ne11);
const int64_t ir0 = (ir1/ne11)%(ne02*ne03);
const int64_t i03 = (ir0/(ne02));
// Hack for "Falcon multi-query-attention key stutter" / alternative to ggml_repeat2.
// See https://github.com/ggerganov/llama.cpp/issues/1602#issuecomment-1606087470:
// GG: this is likely the correct way to broadcast, though need some more thought
// therefore leaving the comments to remind us for now
const int64_t i02 = (i12 / (ne12 / ne02));
// Original from PR/224 (and also essential/correct for non-broadcast matmuls in Falcon)
// const int64_t i02 = (ir0 - i03*ne02);
const int64_t i1 = i11;
const int64_t i2 = i12;
const int64_t i3 = i13;
const char * src0_row = (const char *) src0->data + ( 0 + i02*nb02 + i03*nb03 );
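As a concrete illustration of the broadcast rule in the snippet above: for the 70B K*Q multiplication, src0 (K) has ne02 = 8 KV heads while src1 (Q) has ne12 = 64 query heads, so every group of 8 consecutive query heads reuses the same K head. A small standalone sketch with those assumed head counts (not the actual ggml code):

#include <cstdint>
#include <cstdio>

int main() {
    const int64_t ne02  = 8;           // heads in src0 (K), the broadcast side
    const int64_t ne12  = 64;          // heads in src1 (Q)
    const int64_t group = ne12 / ne02; // query heads per KV head

    for (int64_t i12 = 0; i12 < ne12; i12 += group) {
        const int64_t i02 = i12 / group;  // same rule as i02 = i12 / (ne12 / ne02) above
        printf("query heads %2lld..%2lld -> kv head %lld\n",
               (long long)i12, (long long)(i12 + group - 1), (long long)i02);
    }
    return 0;
}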

Looking for contributions to make this cleaner. If not, I will implement it sometime in the next few days.

@ggerganov ggerganov mentioned this pull request Jul 19, 2023
@SlyEcho
Sponsor Collaborator

SlyEcho commented Jul 19, 2023

So it seems another model parameter is needed: the group size, which is 1 for the older LLaMA models and 8 for the new ones?

@ggerganov
Owner Author

ggerganov commented Jul 19, 2023

Maybe try to deduce it for now, to avoid a file format mess. We'll add it in GGUF.

Or temporarily pass it from the command line.
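For what it's worth, one way it could be deduced from an existing GGML file is from the shape of the attention.wk tensor, since with GQA it shrinks from n_embd x n_embd to n_embd x (n_embd / gqa). A sketch under that assumption (hypothetical helper, not what this PR implements):

#include <cstdint>
#include <cstdio>

// Hypothetical: infer the grouping factor from the second dimension of
// 'layers.0.attention.wk.weight' as stored in the model file.
static uint32_t deduce_gqa(uint32_t n_embd, uint32_t wk_ne1) {
    return n_embd / wk_ne1;
}

int main() {
    printf("LLaMA 65B   (wk 8192 x 8192): gqa = %u\n", (unsigned) deduce_gqa(8192, 8192)); // 1
    printf("LLaMA-2 70B (wk 8192 x 1024): gqa = %u\n", (unsigned) deduce_gqa(8192, 1024)); // 8
    return 0;
}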

@SlyEcho
Sponsor Collaborator

SlyEcho commented Jul 19, 2023

Do I understand correctly that this mechanism will reduce the KV cache size as well?

@JohannesGaessler
Collaborator

In terms of CUDA support: for me the release has come at a rather inopportune time. I'll maybe have some time to look into it on the weekend but I don't want to make any promises. Other people are welcome to work on it in the meantime (but if you do please notify me to avoid duplicate implementations).

@byildiz

byildiz commented Jul 19, 2023

Do I understand correctly that this mechanism will reduce the KV cache size as well?

Yes, the mechanism (GQA) is essentially there to reduce the KV cache size, at the cost of some accuracy.
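To put rough numbers on the saving, here is a back-of-the-envelope sketch for the 70B model that is consistent with the "kv self size = 1280.00 MB" figures in the logs further down (illustrative arithmetic only):

#include <cstdio>

int main() {
    const double n_ctx   = 4096;
    const double n_layer = 80;
    const double n_embd  = 8192;
    const double n_gqa   = 8;
    const double bytes_per_elem = 2;  // f16 KV cache

    // K and V each hold n_ctx * n_layer * (n_embd / n_gqa) elements with GQA
    const double mb = 1024.0 * 1024.0;
    const double kv_with_gqa    = 2 * n_ctx * n_layer * (n_embd / n_gqa) * bytes_per_elem / mb;
    const double kv_without_gqa = 2 * n_ctx * n_layer *  n_embd          * bytes_per_elem / mb;

    printf("70B KV cache at n_ctx = 4096, with GQA:    %.0f MB\n", kv_with_gqa);    // 1280
    printf("70B KV cache at n_ctx = 4096, without GQA: %.0f MB\n", kv_without_gqa); // 10240
    return 0;
}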

@gabrielcustodio

How do I run ./quantize?

@wizzard0
Contributor

@gabrielcustodio

git clone https://huggingface.co/meta-llama/Llama-2-70b

python convert.py Llama-2-70b

./quantize Llama-2-70b/ggml-model-f32.bin Llama-2-70b/ggml-model-q5_1.bin q5_1

./main -m Llama-2-70b/ggml-model-q5_1.bin --interactive-first

@nrbontha

nrbontha commented Jul 20, 2023

@wizzard0

Thank you for clearing up the steps - however, I'm seeing this error:

$ python3 convert.py Llama-2-70b
Loading model file Llama-2-70b/consolidated.00.pth
Traceback (most recent call last):
  File "/Users/nrb/Projects/llama.cpp/convert.py", line 1264, in <module>
    main()
  File "/Users/nrb/Projects/llama.cpp/convert.py", line 1244, in main
    model_plus = load_some_model(args.model)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nrb/Projects/llama.cpp/convert.py", line 1165, in load_some_model
    models_plus.append(lazy_load_file(path))
                       ^^^^^^^^^^^^^^^^^^^^
  File "/Users/nrb/Projects/llama.cpp/convert.py", line 963, in lazy_load_file
    raise ValueError(f"unknown format: {path}")
ValueError: unknown format: Llama-2-70b/consolidated.00.pth

Looking at the convert.py script, it seems that .pth files are not supported - but I'm new here and may be missing a step.

Anything I can try to overcome this?

@SlyEcho
Sponsor Collaborator

SlyEcho commented Jul 20, 2023

@nrbontha, .pth files have been supported since the start of llama.cpp, so I think the problem is that your files are corrupted.

@schappim

schappim commented Jul 20, 2023

Is there an MD5 or checksum, @SlyEcho?

@SlyEcho
Sponsor Collaborator

SlyEcho commented Jul 20, 2023

Meta's script also downloads MD5 checksums, but first just check that the files are not 0 bytes; I had that problem when downloading.

@schappim

Thanks for trying to help @SlyEcho, but I'm afraid that isn't the issue (at least for me):

-rw-r--r--@  1 admin  staff  17246706245 14 Jul 09:12 consolidated.00.pth
-rw-r--r--@  1 admin  staff  17246706245 14 Jul 09:12 consolidated.01.pth
-rw-r--r--@  1 admin  staff  17246706245 14 Jul 09:12 consolidated.02.pth
-rw-r--r--@  1 admin  staff  17246706245 14 Jul 09:12 consolidated.03.pth
-rw-r--r--@  1 admin  staff  17246706245 14 Jul 09:12 consolidated.04.pth
-rw-r--r--@  1 admin  staff  17246706245 14 Jul 09:14 consolidated.06.pth
-rw-r--r--@  1 admin  staff  17246706245 14 Jul 09:15 consolidated.05.pth
-rw-r--r--@  1 admin  staff  17246706245 14 Jul 09:15 consolidated.07.pth

@SlyEcho
Sponsor Collaborator

SlyEcho commented Jul 20, 2023

Alright, I have to check now.

@l0d0v1c

l0d0v1c commented Jul 20, 2023

I got this error as output from main on Apple Silicon:
llama_model_load_internal: ggml ctx size = 37070,93 MB
error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected 8192 x 8192, got 8192 x 1024
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'llama-2-70B-chat.bin'
main: error: unable to load model

@klosax
Collaborator

klosax commented Jul 20, 2023

I got this error as an output from main on apple silicon
llama_model_load_internal: ggml ctx size = 37070,93 MB error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected 8192 x 8192, got 8192 x 1024

@l0d0v1c Are you compiling with this PR's changes applied?
https://github.com/ggerganov/llama.cpp/pull/2276/files

@l0d0v1c

l0d0v1c commented Jul 20, 2023

I got this error as an output from main on apple silicon
llama_model_load_internal: ggml ctx size = 37070,93 MB error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected 8192 x 8192, got 8192 x 1024

@l0d0v1c Are you compiling with this PR's changes applied? https://github.com/ggerganov/llama.cpp/pull/2276/files

Thanks, I had only applied it partially. Now it works fine.

@schappim

schappim commented Jul 20, 2023

-m ./models/70B-v2/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap --ignore-eos -n 64 -t 8

🎉 Success! I got it working by using cmake instead of make, and then using the outputs in ./build/bin/.

git checkout llama-v2-70b;
mkdir build;
cd build;
cmake ..;
cmake --build . --config Release;
python3 convert.py --outfile models/70B-v2/ggml-model-f16.bin --outtype f16 ../llama2/llama/llama-2-70b/ ;
./build/bin/quantize  ./models/70B-v2/ggml-model-f16.bin ./models/70B-v2/ggml-model-q4_0.bin q4_0;
./build/bin/main -m ./models/70B-v2/ggml-model-q4_0.bin -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt

Now remember this is running on the CPU, so be prepared for some chugging!


This is the kind of speed you can expect:

70-b-cpu-speed.mov

Y'know what I call this? I call this a good start!

@klosax
Collaborator

klosax commented Jul 20, 2023

It looks like changes are needed in convert.py for it to work with the 70B HF files, as the find_n_mult() function won't work with this model.
See issue #2286

This function converts the feedforward_length parameter back to the original pth n_mult for saving. When the file is loaded, it is converted back again... Luckily, hacks like this won't be needed with the new GGUF file format.

A simple temporary solution would be to change the following line to n_mult = 256;

n_mult = find_n_mult(n_ff, n_embd);

@SlyEcho
Sponsor Collaborator

SlyEcho commented Jul 20, 2023

I got it working by using cmake instead of make

Remember to run make clean first or use the -B flag.

@klosax
Collaborator

klosax commented Jul 20, 2023

Getting garbage output from q4_0 converted from the safetensors model https://huggingface.co/TheBloke/Llama-2-70B-fp16, but it works fine with q4_0 converted from the (original?) .pth model https://huggingface.co/anonymous4chan/llama-2-70b-original .

The safetensors model files were downloaded OK (checked sha256sums), so I'm guessing they are either corrupt or somehow incompatible with the convert.py script. The model was converted from pth using the latest llama 2 PR, according to the model card.

@klosax
Collaborator

klosax commented Jul 21, 2023

I tried converting another 70B safetensors model https://huggingface.co/NousResearch/Llama-2-70b-hf and quantizing it to q4_0, and it also outputs garbage. A LLaMA-2 7B safetensors model https://huggingface.co/NousResearch/Llama-2-7b-hf works fine.

Is the current convert.py incompatible with the llama2-70b safetensors files?
Is anyone else having this problem?

@auxon

auxon commented Jul 21, 2023

@schappim What hardware are you using? I tried the 13b-chat model on my M1 MacBook Air, but it is incredibly slow. I cannot run it on the GPU (even the 7b model) because I run out of memory (only 8GB total).

@TheBloke
Contributor

TheBloke commented Jul 21, 2023

I too am getting garbage output, both from fp16 and q4_0.

Steps taken:

  1. Compiled this PR. I tried both make -j and mkdir build && cd build && cmake .. && cmake --build . --config Release
  2. Edited convert.py as per @klosax's instructions, setting n_mult = 256
  3. Converted this model https://huggingface.co/TheBloke/llama-2-70b-Guanaco-QLoRA-fp16 (pytorch_model.bin) using convert.py, to fp16
  4. ./quantize /workspace/process/llama-2-70b-guanaco/ggml/llama-2-70b-guanaco.ggmlv3.fp16.bin /workspace/process/llama-2-70b-guanaco/ggml/llama-2-70b-guanaco.ggmlv3.q4_0.bin q4_0
  5. Inference with: ./main -m /workspace/process/llama-2-70b-guanaco/ggml/llama-2-70b-guanaco.ggmlv3.q4_0.bin -t 12 -p "### Human: write a story about llamas\n### Assistant:"

q4_0 result:

 ### Human: write a story about llamas\n### Assistant:, and have n a the end of n a of n are n a a and have a of the have and a a ns a,0th’ s, have,att1 oft is and a a and a and they4 ofs that he had been and a has a a are and has that had to be and hass it?ll�am of theira the samell, whichÃs and aand the of the of you canada will come a and a and a have.A’ and they7 of the in a have a have that is1 I, are have of

fp16 inference result is also not usable:

 ### Human: write a story about llamas\n### Assistant:1022402353l212100t9

Maybe the n_mult = 256 line is wrong?

@schappim what changes did you make to convert.py to make your fp16?

@klosax
Collaborator

klosax commented Jul 21, 2023

Maybe the n_mult = 256 line is wrong?

The value does not matter as the PR does not use it anyway.
It is used for calculating the n_ff (intermediate_size) parameter, but that is hardcoded by the PR:

llama.cpp/llama.cpp

Lines 1029 to 1030 in 2d2bb6b

//const uint32_t n_ff = ((2*(4*hparams.n_embd)/3 + hparams.n_mult - 1)/hparams.n_mult)*hparams.n_mult;
const uint32_t n_ff = 28672;
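For reference, 28672 is what Meta's feed-forward sizing yields for the 70B model. A sketch of the arithmetic, assuming the published 70B hyperparameters (dim = 8192, ffn_dim_multiplier = 1.3, multiple_of = 4096):

#include <cstdint>
#include <cstdio>

int main() {
    const int64_t dim         = 8192;
    const double  ffn_mult    = 1.3;     // ffn_dim_multiplier
    const int64_t multiple_of = 4096;

    int64_t hidden = 2 * (4 * dim) / 3;                                 // 21845
    hidden = (int64_t)(ffn_mult * (double)hidden);                      // 28398
    hidden = multiple_of * ((hidden + multiple_of - 1) / multiple_of);  // round up -> 28672

    printf("n_ff = %lld\n", (long long)hidden);
    return 0;
}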

@TheBloke
Contributor

Ah OK, thanks

Then all I can think is that maybe it worked for schappim because he was converting from PTH, and we're converting from HF? I don't know what practical difference that would make, though. And that's not much help for the model I'm trying to convert now, which is a fine-tune, so I only have it in HF/pytorch format.

@klosax
Collaborator

klosax commented Jul 21, 2023

It looks like Meta changed the structure of the 70b HF models somehow.
Maybe someone with good Python knowledge can take a look at the updated HF llama scripts?

huggingface/transformers#24891

@Kangmo

Kangmo commented Jul 23, 2023

@ggerganov, thank you for the code fix for running the 70B model!
I see the following error; I'm not sure if I did something wrong.
llama_model_load_internal: ggml ctx size = 37070.96 MB
error loading model: llama.cpp: tensor 'layers.0.feed_forward.w1.weight' has wrong shape; expected 8192 x 28416, got 8192 x 28672
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './models/llama-2-70b/ggml-model-q4_0.bin'
main: error: unable to load model

steps to reproduce:
python convert.py models/llama-2-70b/ --outtype f16
./quantize ./models/llama-2-70b/ggml-model-f16.bin ./models/llama-2-70b/ggml-model-q4_0.bin q4_0
./main -m ./models/llama-2-70b/ggml-model-q4_0.bin --no-mmap --ignore-eos -n 64 -t 8 -gqa 8 -p "The true meaning of life is"

Hardware & software:
Apple M1 Max (MacBook Pro 16), macOS 13.4.1 (c) (22F770820d), compiler: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
The model weights were downloaded by requesting access via the official Meta download page.
Again, thank you for the code fix!

@ggerganov
Owner Author

@Kangmo This error would happen if you ran the convert.py command before today's changes and quantized with today's changes. Try re-running the convert.py script if you haven't done so.

@Kangmo

Kangmo commented Jul 23, 2023

@Kangmo This error would happen if you ran the convert.py command before today's changes and quantized with today's changes. Try re-running the convert.py script if you haven't done so.

Re-running convert.py fixed the problem. Thank you!

@LostRuins
Collaborator

LostRuins commented Jul 24, 2023

Apologies if someone has already answered this, but why does 70B use smaller scratch buffers than 65B?

@ggerganov
Owner Author

The K and V tensors have 8 times fewer heads, so the KV cache uses 8 times less memory, and some of the intermediate tensors are also 8 times smaller.

@matrix303

Currently on commit master-41c6741, I'm still having the same issue when trying to convert the llama2-70b model. The same error occurs for the quantized q4_0 and q5_0 versions. I have built using CMake and have also verified the checksums of the downloaded models.

COMMAND: build/bin/main -m models/llama-2-70b-chat/l2-70b-chat-ggml-model-f16.bin -n 128

ERROR: error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected 8192 x 8192, got 8192 x 1024

@ggerganov
Owner Author

Add the -gqa 8 argument to the command, as shown in the description of the PR.

@matrix303

Add the -gqa 8 argument to the command, as shown in the description of the PR.

Perfect, thank you. I thought that when this was merged, it would automatically account for this.

@JohannesGaessler
Collaborator

When I try to run 70B with master, n_ff seems to be set to the wrong value:

error log
   ~/Pr/llama.cpp   #master-e76d630 *4 ?13    ./main --model models/opt/llama_2-70b-ggml-q4_k_m.bin_old -gqa 8 -eps 1e-5 --n-predict 2000 --ctx-size 2048 --batch-size 512 --threads 1 --mirostat 2 --file initial_llama_chat_karlsruhe.ai.txt -ngl 83 | tee chat.txt
main: build = 899 (41c6741)
main: seed  = 1690238748
ggml_init_cublas: found 3 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1
  Device 1: Tesla P40, compute capability 6.1
  Device 2: Tesla P40, compute capability 6.1
llama.cpp: loading model from models/opt/llama_2-70b-ggml-q4_k_m.bin_old
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_head_kv  = 8
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 8
llama_model_load_internal: rnorm_eps  = 1,0e-05
llama_model_load_internal: n_ff       = 28416
llama_model_load_internal: freq_base  = 10000,0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size =    0,21 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (Tesla P40) as main device
error loading model: llama.cpp: tensor 'layers.0.feed_forward.w1.weight' has wrong shape; expected  8192 x 28416, got  8192 x 28672
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'models/opt/llama_2-70b-ggml-q4_k_m.bin_old'
main: error: unable to load model

Uncommenting line 1062, which hard-codes n_ff to 28672, fixes the issue for me.

@halbtuerke

Should this be working with Metal? I'm currently on master-41c6741 and trying to run this with -ngl 1 is failing with the following message:

main: warning: base model only supports context sizes no greater than 2048 tokens (4096 specified)
main: build = 900 (15d02e6)
main: seed  = 1690231994
llama.cpp: loading model from /Users/xxx/TheBloke/Llama-2-70B-Chat-GGML/llama-2-70b-chat.ggmlv3.q4_K_M.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 4096
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 4096
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_head_kv  = 8
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 8
llama_model_load_internal: rnorm_eps  = 1.0e-06
llama_model_load_internal: n_ff       = 28672
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size =    0.21 MB
llama_model_load_internal: mem required  = 40540.46 MB (+ 1280.00 MB per state)
llama_new_context_with_model: kv self size  = 1280.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/xxx/code/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x13670a170
ggml_metal_init: loaded kernel_add_row                        0x13670a890
ggml_metal_init: loaded kernel_mul                            0x13670adb0
ggml_metal_init: loaded kernel_mul_row                        0x13670b3e0
ggml_metal_init: loaded kernel_scale                          0x13670b900
ggml_metal_init: loaded kernel_silu                           0x13670be20
ggml_metal_init: loaded kernel_relu                           0x13670c340
ggml_metal_init: loaded kernel_gelu                           0x13670c860
ggml_metal_init: loaded kernel_soft_max                       0x13670cf10
ggml_metal_init: loaded kernel_diag_mask_inf                  0x13670d570
ggml_metal_init: loaded kernel_get_rows_f16                   0x13670dbf0
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x13670e3e0
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x13670ea60
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x13670f0e0
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x13670f760
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x13670fde0
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x136710460
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x136710ae0
ggml_metal_init: loaded kernel_rms_norm                       0x1367111a0
ggml_metal_init: loaded kernel_norm                           0x1367119c0
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x136712220
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x1367128e0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x136712fa0
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x1367137e0
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x136713ea0
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x136714560
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x136714c00
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x136715700
ggml_metal_init: loaded kernel_rope                           0x136715c20
ggml_metal_init: loaded kernel_alibi_f32                      0x1367164e0
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x136716d70
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x136717600
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x136717d70
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
llama_new_context_with_model: max tensor size =   205.08 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size = 36864.00 MB, offs =            0
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  2804.80 MB, offs =  38439649280, (39669.25 / 49152.00)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =    24.00 MB, (39693.25 / 49152.00)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  1282.00 MB, (40975.25 / 49152.00)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   749.00 MB, (41724.25 / 49152.00)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   304.00 MB, (42028.25 / 49152.00)

system_info: n_threads = 10 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: interactive mode on.
Input prefix with BOS
Input prefix: ' [INST] '
Input suffix: ' [/INST]'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 0


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

GGML_ASSERT: ggml-metal.m:612: ne02 == ne12
GGML_ASSERT: ggml-metal.m:612: ne02 == ne12

Full command: ./main -m ~/TheBloke/Llama-2-70B-Chat-GGML/llama-2-70b-chat.ggmlv3.q4_K_M.bin -gqa 8 -c 4096 -ngl 1 -n -1 -t 10 --in-prefix-bos --in-prefix ' [INST] ' --in-suffix ' [/INST]' -i -f ~/Downloads/llama-prompt.txt

@ggerganov
Owner Author

@JohannesGaessler

n_mult == 256 is incorrect. It should be 4096. This might happen if you convert from an HF model, which we don't support yet. I've only tested conversion of the original Meta model.
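A plausible explanation for the n_ff = 28416 reported above (an assumption about where the number comes from, not a statement about the exact code on master): rounding the same 1.3-scaled hidden size up to a multiple of 256 gives 28416, while rounding up to a multiple of 4096 gives the correct 28672.

#include <cstdint>
#include <cstdio>

static int64_t round_up(int64_t x, int64_t m) { return m * ((x + m - 1) / m); }

int main() {
    const int64_t dim    = 8192;
    const int64_t hidden = (int64_t)(1.3 * (double)(2 * (4 * dim) / 3)); // 28398

    printf("n_mult = 256  -> n_ff = %lld\n", (long long)round_up(hidden, 256));  // 28416 (mismatch)
    printf("n_mult = 4096 -> n_ff = %lld\n", (long long)round_up(hidden, 4096)); // 28672 (matches the model)
    return 0;
}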

@halbtuerke

No GQA support in Metal yet - I don't have a machine with enough RAM to test and fix this. Waiting for my M2 Ultra to arrive - 6 weeks shipment already and counting .. :(

@halbtuerke

Thank you very much for the clarification.

@JohannesGaessler
Collaborator

No, I converted the model from the official Meta weights. The problem is perhaps that I used the WIP PR for the conversion?

@ggerganov
Owner Author

Yes, very likely. Re-run convert.py and it should work.

@JohannesGaessler
Collaborator

I re-ran convert.py and this fixed my issue.

@appleguy

@ggerganov - this may be of no help to you as I'd imagine you want a debugger at hand, but if I can be an interim bridge for you and run some tests on a 192GB M2 Ultra, let me know. Also just wanted to thank you for the level of effort you've put into guiding the architecture and principles of this project — a truly awesome accomplishment and appreciated by a large community!

@eugenepyvovarov

Probably not very relevant, but does the base 70B model come with only a 2K context? When I pass it a 4K context via the parameter, it says that only 2K is supported.

@Green-Sky
Collaborator

Green-Sky commented Aug 1, 2023

The 2K warning has not been updated since then. GGUF will come with [llm].context_length: uint32, and the warning will then be generated based on that.

@RDearnaley

RDearnaley commented Aug 18, 2023

@ggerganov - Thanks for getting this working: I'm able to run Llama2 70B models (i.e. state-of-the-art open-source models) in q6_K on an M2 Max MacBook Pro with 64GB.

However, it is somewhat slow, rather slower than reading speed, so it would be lovely to get the TODOs mentioned above fixed to enable Metal GPU acceleration.

@Green-Sky
Collaborator

@RDearnaley there were PRs merged in the last few days that should improve performance. When did you last pull?
#2615
#2627


