Numa #1556

Merged: 20 commits, Jun 26, 2023

Conversation

@zrm (Collaborator) commented May 21, 2023

ref: #1437

2S Xeon E5-2690v1

Before:
./main -n 1024 -m models/7B/ggml-model-q4_0.bin --ignore-eos -p "Building a website can be done in 10 simple steps:" -t 32
...
llama_print_timings: sample time = 796.65 ms / 1024 runs ( 0.78 ms per token)
llama_print_timings: prompt eval time = 44892.59 ms / 785 tokens ( 57.19 ms per token)
llama_print_timings: eval time = 178337.97 ms / 1020 runs ( 174.84 ms per token)

# echo 3 > /proc/sys/vm/drop_caches

After:
./main -n 1024 -m models/7B/ggml-model-q4_0.bin --ignore-eos -p "Building a website can be done in 10 simple steps:" -t 32
...
llama_print_timings: sample time = 795.63 ms / 1024 runs ( 0.78 ms per token)
llama_print_timings: prompt eval time = 46996.62 ms / 785 tokens ( 59.87 ms per token)
llama_print_timings: eval time = 101232.21 ms / 1020 runs ( 99.25 ms per token)


@zrm (Collaborator Author) commented May 21, 2023

Not sure if this one is alright (9d058c2). On machines with high thread-synchronization overhead, distributing GGML_TASK_FINALIZE to the thread pool adds about 10% to eval time, so I only dispatch it for ops whose finalize pass actually does something. But am I correct that currently none of them do?
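For illustration, a tiny standalone sketch of the gating idea (op_needs_finalize, run_compute_pass, etc. are made-up names for this example, not ggml's real scheduler API):

// Standalone sketch of the gating idea (hypothetical names, not the real ggml
// scheduler): only pay for the extra thread-pool barrier of the finalize pass
// when the op actually does work in it.
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

enum op { OP_MUL_MAT, OP_ADD, OP_SOFT_MAX };

// Which ops do real work in their finalize pass; for the ops here the answer
// is "none", which is exactly the question above.
static bool op_needs_finalize(enum op o) {
    (void) o;
    return false;
}

static void run_compute_pass (enum op o) { printf("COMPUTE  op %d on all worker threads\n", (int) o); }
static void run_finalize_pass(enum op o) { printf("FINALIZE op %d on all worker threads\n", (int) o); }

int main(void) {
    enum op graph[] = { OP_MUL_MAT, OP_ADD, OP_SOFT_MAX };
    for (size_t i = 0; i < sizeof(graph)/sizeof(graph[0]); ++i) {
        run_compute_pass(graph[i]);
        if (op_needs_finalize(graph[i])) {
            // this is the extra synchronization point that costs ~10% eval
            // time on machines with expensive cross-socket thread sync
            run_finalize_pass(graph[i]);
        }
    }
    return 0;
}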

@rankaiyx (Contributor)

The approach in this pull request should be the right solution. I am excited about your improvement. I'm assembling a NUMA machine with 2 CPUs and 8 channels of memory (E5-2698Bv3 x2 / DDR3 1866 32GB x8), and I'll test it later.

@ggerganov added the performance (Speed related topics) and threading (Parallel processing and thread management) labels on May 22, 2023

@ggerganov (Owner)

Here are some results on a dual-socket Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz. Each socket has 18 cores.

master

# 8 threads
make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap -n 32 -t 8 -s 6

llama_print_timings:        load time =  2649.90 ms
llama_print_timings:      sample time =    14.44 ms /    26 runs   (    0.56 ms per token)
llama_print_timings: prompt eval time =   488.38 ms /     8 tokens (   61.05 ms per token)
llama_print_timings:        eval time =  3450.39 ms /    25 runs   (  138.02 ms per token)
llama_print_timings:       total time =  6120.78 ms

# 12 threads
make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap -n 32 -t 12 -s 6

llama_print_timings:        load time =  2558.62 ms
llama_print_timings:      sample time =    13.90 ms /    26 runs   (    0.53 ms per token)
llama_print_timings: prompt eval time =   335.17 ms /     8 tokens (   41.90 ms per token)
llama_print_timings:        eval time =  2190.39 ms /    25 runs   (   87.62 ms per token)
llama_print_timings:       total time =  4768.89 ms

# 16 threads
make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap -n 32 -t 16 -s 6

llama_print_timings:        load time =  2422.32 ms
llama_print_timings:      sample time =    13.93 ms /    26 runs   (    0.54 ms per token)
llama_print_timings: prompt eval time =   264.24 ms /     8 tokens (   33.03 ms per token)
llama_print_timings:        eval time =  2008.39 ms /    25 runs   (   80.34 ms per token)
llama_print_timings:       total time =  4450.76 ms

# 18 threads
make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap -n 32 -t 18 -s 6

llama_print_timings:        load time =  2467.76 ms
llama_print_timings:      sample time =    14.18 ms /    26 runs   (    0.55 ms per token)
llama_print_timings: prompt eval time =   306.80 ms /     8 tokens (   38.35 ms per token)
llama_print_timings:        eval time =  1668.66 ms /    25 runs   (   66.75 ms per token)
llama_print_timings:       total time =  4156.82 ms

# 36 threads
make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap -n 32 -t 36 -s 6


llama_print_timings:        load time =  2440.28 ms
llama_print_timings:      sample time =    13.64 ms /    26 runs   (    0.52 ms per token)
llama_print_timings: prompt eval time =   283.70 ms /     8 tokens (   35.46 ms per token)
llama_print_timings:        eval time =  2977.92 ms /    25 runs   (  119.12 ms per token)
llama_print_timings:       total time =  5438.13 ms

zrm:numa

# 8 threads
make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap -n 32 -t 8 -s 6

llama_print_timings:        load time =  2784.40 ms
llama_print_timings:      sample time =    12.51 ms /    26 runs   (    0.48 ms per token)
llama_print_timings: prompt eval time =   626.98 ms /     8 tokens (   78.37 ms per token)
llama_print_timings:        eval time =  3395.67 ms /    25 runs   (  135.83 ms per token)
llama_print_timings:       total time =  6198.56 ms

# 12 threads
make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap -n 32 -t 12 -s 6

llama_print_timings:        load time =  2611.76 ms
llama_print_timings:      sample time =    12.17 ms /    26 runs   (    0.47 ms per token)
llama_print_timings: prompt eval time =   407.14 ms /     8 tokens (   50.89 ms per token)
llama_print_timings:        eval time =  2819.72 ms /    25 runs   (  112.79 ms per token)
llama_print_timings:       total time =  5449.82 ms

# 16 threads
make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap -n 32 -t 16 -s 6

llama_print_timings:        load time =  2542.99 ms
llama_print_timings:      sample time =    11.76 ms /    26 runs   (    0.45 ms per token)
llama_print_timings: prompt eval time =   335.91 ms /     8 tokens (   41.99 ms per token)
llama_print_timings:        eval time =  2551.17 ms /    25 runs   (  102.05 ms per token)
llama_print_timings:       total time =  5111.75 ms

# 18 threads
make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap -n 32 -t 18 -s 6

llama_print_timings:        load time =  2500.30 ms
llama_print_timings:      sample time =    11.47 ms /    26 runs   (    0.44 ms per token)
llama_print_timings: prompt eval time =   300.49 ms /     8 tokens (   37.56 ms per token)
llama_print_timings:        eval time =  2538.30 ms /    25 runs   (  101.53 ms per token)
llama_print_timings:       total time =  5055.94 ms

# 36 threads
make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap -n 32 -t 36 -s 6

llama_print_timings:        load time =  2448.93 ms
llama_print_timings:      sample time =    11.44 ms /    26 runs   (    0.44 ms per token)
llama_print_timings: prompt eval time =   253.50 ms /     8 tokens (   31.69 ms per token)
llama_print_timings:        eval time =  2497.20 ms /    25 runs   (   99.89 ms per token)
llama_print_timings:       total time =  4963.27 ms

I don't observe the speed-up from the zrm:numa branch.
I suspect that this has something to do with mmap.

@zrm has done their tests without the --no-mmap option, so this means by default mmap is used.
On my system, if I use mmap, there is a significant slowdown - here is master using 18 threads:

make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 18 -s 6

llama_print_timings:        load time =   434.70 ms
llama_print_timings:      sample time =    14.18 ms /    26 runs   (    0.55 ms per token)
llama_print_timings: prompt eval time =   259.91 ms /     8 tokens (   32.49 ms per token)
llama_print_timings:        eval time =  3285.42 ms /    25 runs   (  131.42 ms per token)
llama_print_timings:       total time =  3740.55 ms

I recommend re-doing the analysis always using --no-mmap, to see if there is a benefit from NUMA in that case.

@zrm (Collaborator Author) commented May 24, 2023

I recommend re-doing the analysis always using --no-mmap, to see if there is a benefit from NUMA in that case.

master:

./main -n 512 -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --ignore-eos -t 32 --no-mmap
...
llama_print_timings: sample time = 399.22 ms / 512 runs ( 0.78 ms per token)
llama_print_timings: prompt eval time = 15385.30 ms / 265 tokens ( 58.06 ms per token)
llama_print_timings: eval time = 86911.83 ms / 510 runs ( 170.42 ms per token)

./main -n 512 -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --ignore-eos -t 32
...
llama_print_timings: sample time = 397.81 ms / 512 runs ( 0.78 ms per token)
llama_print_timings: prompt eval time = 15456.71 ms / 265 tokens ( 58.33 ms per token)
llama_print_timings: eval time = 97122.96 ms / 510 runs ( 190.44 ms per token)

The reason for this is that when master uses mmap it passes MAP_POPULATE to prefault the pages, and the kernel then loads the model into memory on an arbitrary NUMA node, commonly putting the whole thing on one node. This is basically the worst-case scenario on NUMA, so any change makes it better, including --no-mmap.

The numa branch disables MAP_POPULATE and MADV_WILLNEED on NUMA systems in favor of MADV_RANDOM, allowing the pages to be faulted in based on the usage pattern, and pins threads to nodes so the same thread number always runs on the same node. That puts pages on the nodes that actually access them. But if you run it after you've run master, the pages are already in the page cache and it barely does anything:

./main -n 512 -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --ignore-eos -t 32 --no-mmap
...
llama_print_timings: sample time = 396.27 ms / 512 runs ( 0.77 ms per token)
llama_print_timings: prompt eval time = 16307.36 ms / 265 tokens ( 61.54 ms per token)
llama_print_timings: eval time = 84633.24 ms / 510 runs ( 165.95 ms per token)

./main -n 512 -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --ignore-eos -t 32
...
llama_print_timings: sample time = 396.57 ms / 512 runs ( 0.77 ms per token)
llama_print_timings: prompt eval time = 18226.49 ms / 265 tokens ( 68.78 ms per token)
llama_print_timings: eval time = 84807.91 ms / 510 runs ( 166.29 ms per token)

So you have to drop the page cache once first:

# echo 3 > /proc/sys/vm/drop_caches

There is also a kernel parameter called numa_balancing which tries to migrate pages to the node it thinks will use them, but for pages that are already on the appropriate node it only makes things worse and should be disabled:

# echo 0 > /proc/sys/kernel/numa_balancing

The numa branch warns if this is enabled (which it is by default on many distributions).
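For reference, a minimal sketch of what such a check can look like (an illustration of the idea, not the exact code in the branch):

// Warn if the kernel's automatic NUMA balancing is enabled, since it migrates
// pages that first-touch placement has already put on the right node.
#include <stdio.h>

static void warn_if_numa_balancing(void) {
    FILE *f = fopen("/proc/sys/kernel/numa_balancing", "r");
    if (!f) { return; }  // file absent: old kernel or non-Linux system
    int c = fgetc(f);
    fclose(f);
    if (c == '1') {
        fprintf(stderr,
                "warning: /proc/sys/kernel/numa_balancing is enabled, which "
                "degrades NUMA-local page placement "
                "(disable with: echo 0 > /proc/sys/kernel/numa_balancing)\n");
    }
}

int main(void) {
    warn_if_numa_balancing();
    return 0;
}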

Then you get this:

./main -n 512 -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --ignore-eos -t 32 --no-mmap
...
llama_print_timings: sample time = 396.31 ms / 512 runs ( 0.77 ms per token)
llama_print_timings: prompt eval time = 15202.07 ms / 265 tokens ( 57.37 ms per token)
llama_print_timings: eval time = 81976.09 ms / 510 runs ( 160.74 ms per token)

./main -n 512 -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --ignore-eos -t 32
...
llama_print_timings: sample time = 397.33 ms / 512 runs ( 0.78 ms per token)
llama_print_timings: prompt eval time = 15527.65 ms / 265 tokens ( 58.59 ms per token)
llama_print_timings: eval time = 43868.49 ms / 510 runs ( 86.02 ms per token)

(note: don't actually run --no-mmap first, that loads the model into the page cache again)
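To make the mapping strategy above concrete, here is a simplified sketch (not the branch's actual loader): on a NUMA system, map the file without prefaulting and advise random access, so pages are pulled in by whichever thread actually reads them and land on that thread's node.

// Simplified sketch of a NUMA-friendly model mapping: skip MAP_POPULATE /
// MADV_WILLNEED on NUMA systems so first-touch page faults place each page
// on the node of the thread that uses it.
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static void *map_model(const char *path, size_t *size, bool numa) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return NULL; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return NULL; }
    *size = (size_t) st.st_size;

    int flags = MAP_SHARED;
    if (!numa) {
        flags |= MAP_POPULATE;  // single node: prefaulting is harmless and faster
    }
    void *addr = mmap(NULL, *size, PROT_READ, flags, fd, 0);
    close(fd);
    if (addr == MAP_FAILED) { perror("mmap"); return NULL; }

    // On NUMA, discourage readahead so pages are faulted in by their readers.
    madvise(addr, *size, numa ? MADV_RANDOM : MADV_WILLNEED);
    return addr;
}

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <model-file>\n", argv[0]); return 1; }
    size_t size = 0;
    void *p = map_model(argv[1], &size, /*numa=*/true);
    if (!p) { return 1; }
    printf("mapped %zu bytes at %p\n", size, p);
    munmap(p, size);
    return 0;
}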

@zrm (Collaborator Author) commented May 24, 2023

Let's see if I can better explain how this is working. ggml_compute_forward_mul_mat_f16_f32():

llama.cpp/ggml.c

Lines 9802 to 9807 in ac7876a

// rows per thread
const int dr = (nr + nth - 1)/nth;
// row range for this thread
const int ir0 = dr*ith;
const int ir1 = MIN(ir0 + dr, nr);

Thread ith always gets the same rows of the model, and they're several memory pages in size. If we use mmap() but don't preload, on the first eval each thread goes to read from that memory and hits page faults because the pages haven't been loaded off the filesystem yet. The kernel then loads the page into the memory of the node of the thread that hit the page fault. Since we pinned the threads to nodes, thread ith always runs on that node and when it goes to access the same page of the model for the next token, that page is in the memory of the local node.

This is much faster than all the pages being in the memory of one node and it's even faster than node interleave.

If the model was already in the system page cache when the program starts, there are no page faults. The kernel already has the model cached in RAM. If it was loaded by a previous run that accessed the pages like this, that's great, because we don't have to load it from the filesystem again and each thread is still going to access the same part of the model. But if the model was initially read into the page cache in some other way, e.g. using --no-mmap or mmap() with MAP_POPULATE, then the pages aren't where we want them and to get them there we have to drop the page cache (or reboot the machine, or call madvise() with MADV_DONTNEED) so the next run will reload the model from the filesystem.
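As a diagnostic, you can check where pages actually landed with the Linux move_pages() syscall used purely as a query (a verification idea only, not part of the branch; link with -lnuma):

// Diagnostic sketch: report which NUMA node holds a few sample pages of a
// buffer. move_pages() with nodes == NULL only queries placement, it moves
// nothing. Build with: cc check_nodes.c -lnuma
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void report_page_nodes(void *base, size_t size, size_t samples) {
    long page = sysconf(_SC_PAGESIZE);
    void **pages  = calloc(samples, sizeof(*pages));
    int   *status = calloc(samples, sizeof(*status));
    if (!pages || !status) { free(pages); free(status); return; }

    for (size_t i = 0; i < samples; ++i) {
        size_t off = (size / samples) * i;
        pages[i] = (char *) base + (off / (size_t) page) * (size_t) page;
    }
    if (move_pages(0, samples, pages, NULL, status, 0) == 0) {
        for (size_t i = 0; i < samples; ++i) {
            printf("sample %zu: node %d\n", i, status[i]);  // negative = not resident
        }
    } else {
        perror("move_pages");
    }
    free(pages);
    free(status);
}

int main(void) {
    // First-touch a buffer from this thread, then see which node got the pages.
    size_t size = 16u * 1024 * 1024;
    char *buf = malloc(size);
    if (!buf) { return 1; }
    memset(buf, 1, size);  // first touch happens here
    report_page_nodes(buf, size, 8);
    free(buf);
    return 0;
}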

@ggerganov (Owner)

@zrm

Thanks for the detailed explanation. It makes a lot of sense and it is clear now.

The strange thing is that I cannot make my system go faster with 36 threads.
I tried following your instructions, but the speed is still worse compared to master with 18 threads:

Here is a full log of a few runs on zrm/numa and then back on master:

gg:/llama.cpp$ sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
gg:/llama.cpp$ sudo sh -c "echo 0 > /proc/sys/kernel/numa_balancing"
gg:/llama.cpp$ git checkout zrm/numa 
Note: switching to 'zrm/numa'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 8502d51 fix args
gg:/llama.cpp$ make clean && make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 36 -s 7
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:  
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state build-info.h
removed 'common.o'
removed 'ggml.o'
removed 'llama.o'
removed 'main'
removed 'quantize'
removed 'quantize-stats'
removed 'perplexity'
removed 'embedding'
removed 'build-info.h'
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:  
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

cc  -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native   -c ggml.c -o ggml.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -c llama.cpp -o llama.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -c examples/common.cpp -o common.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/main/main.cpp ggml.o llama.o common.o -o main 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/quantize/quantize.cpp ggml.o llama.o -o quantize 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/quantize-stats/quantize-stats.cpp ggml.o llama.o -o quantize-stats 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/embedding/embedding.cpp ggml.o llama.o common.o -o embedding 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native pocs/vdot/vdot.cpp ggml.o -o vdot 

====  Run ./main -h for help.  ====

main: build = 587 (8502d51)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 36 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time = 51042.35 ms
llama_print_timings:      sample time =    15.32 ms /    32 runs   (    0.48 ms per token)
llama_print_timings: prompt eval time = 50070.67 ms /     8 tokens ( 6258.83 ms per token)
llama_print_timings:        eval time =  3589.51 ms /    31 runs   (  115.79 ms per token)
llama_print_timings:       total time = 54655.16 ms
gg:/llama.cpp$ ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 36 -s 7
main: build = 587 (8502d51)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 36 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time =   366.90 ms
llama_print_timings:      sample time =    15.52 ms /    32 runs   (    0.49 ms per token)
llama_print_timings: prompt eval time =   352.65 ms /     8 tokens (   44.08 ms per token)
llama_print_timings:        eval time =  4386.11 ms /    31 runs   (  141.49 ms per token)
llama_print_timings:       total time =  4776.22 ms
gg:/llama.cpp$ sudo sh -c "echo 0 > /proc/sys/kernel/numa_balancing"
gg:/llama.cpp$ sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
gg:/llama.cpp$ ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 36 -s 7
main: build = 587 (8502d51)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 36 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time = 51688.09 ms
llama_print_timings:      sample time =    15.15 ms /    32 runs   (    0.47 ms per token)
llama_print_timings: prompt eval time = 50843.78 ms /     8 tokens ( 6355.47 ms per token)
llama_print_timings:        eval time =  2967.22 ms /    31 runs   (   95.72 ms per token)
llama_print_timings:       total time = 54678.23 ms
gg:/llama.cpp$ ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 36 -s 7
main: build = 587 (8502d51)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 36 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time =   371.08 ms
llama_print_timings:      sample time =    14.89 ms /    32 runs   (    0.47 ms per token)
llama_print_timings: prompt eval time =   357.25 ms /     8 tokens (   44.66 ms per token)
llama_print_timings:        eval time =  4162.02 ms /    31 runs   (  134.26 ms per token)
llama_print_timings:       total time =  4555.69 ms
gg:/llama.cpp$ git checkout master
Previous HEAD position was 8502d51 fix args
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
gg:/llama.cpp$ make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 18 -s 7
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:  
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

cc  -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native   -c ggml.c -o ggml.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -c llama.cpp -o llama.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/main/main.cpp ggml.o llama.o common.o -o main 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/quantize/quantize.cpp ggml.o llama.o -o quantize 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/quantize-stats/quantize-stats.cpp ggml.o llama.o -o quantize-stats 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/embedding/embedding.cpp ggml.o llama.o common.o -o embedding 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native pocs/vdot/vdot.cpp ggml.o -o vdot 

====  Run ./main -h for help.  ====

main: build = 587 (c31bbe9)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 18 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time =   770.77 ms
llama_print_timings:      sample time =    16.58 ms /    32 runs   (    0.52 ms per token)
llama_print_timings: prompt eval time =   335.86 ms /     8 tokens (   41.98 ms per token)
llama_print_timings:        eval time =  2636.63 ms /    31 runs   (   85.05 ms per token)
llama_print_timings:       total time =  3431.53 ms
gg:/llama.cpp$ sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
gg:/llama.cpp$ make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 18 -s 7
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:  
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

make: Nothing to be done for 'default'.
main: build = 587 (c31bbe9)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 18 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time = 10943.07 ms
llama_print_timings:      sample time =    17.51 ms /    32 runs   (    0.55 ms per token)
llama_print_timings: prompt eval time =   250.01 ms /     8 tokens (   31.25 ms per token)
llama_print_timings:        eval time =  2212.98 ms /    31 runs   (   71.39 ms per token)
llama_print_timings:       total time = 13181.17 ms
gg:/llama.cpp$ make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 18 -s 7
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:  
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

make: Nothing to be done for 'default'.
main: build = 587 (c31bbe9)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 18 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time =   427.84 ms
llama_print_timings:      sample time =    17.69 ms /    32 runs   (    0.55 ms per token)
llama_print_timings: prompt eval time =   243.10 ms /     8 tokens (   30.39 ms per token)
llama_print_timings:        eval time =  2141.87 ms /    31 runs   (   69.09 ms per token)
llama_print_timings:       total time =  2594.99 ms

Is it possible that I have some BIOS setting wrong?

@rankaiyx (Contributor) commented May 26, 2023

@ggerganov
Have you tried "numactl --interleave=all"?
Does it improve things?

@rankaiyx (Contributor) commented May 26, 2023

E5-2698Bv3 x2, each with 16 cores / 32 threads
DDR3-1866 32GB x8, 100 GB/s

master:

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"

$numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 0 size: 128792 MB
node 0 free: 126491 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 1 size: 129010 MB
node 1 free: 126895 MB
node distances:
node 0 1
0: 10 21
1: 21 10

$./main -m ../ggmlv3/guanaco-7B.ggmlv3.q4_0.bin -n 32 -t 16 -p "I believe the meaning of life is"
llama_print_timings: sample time = 17.66 ms / 32 runs ( 0.55 ms per token)
llama_print_timings: prompt eval time = 387.29 ms / 8 tokens ( 48.41 ms per token)
llama_print_timings: eval time = 3655.15 ms / 31 runs ( 117.91 ms per token)
$numactl -H
node 0 size: 128792 MB
node 0 free: 122891 MB
node 1 size: 129010 MB
node 1 free: 126834 MB

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$./main -m ../ggmlv3/guanaco-7B.ggmlv3.q4_0.bin -n 32 -t 32 -p "I believe the meaning of life is"
llama_print_timings: sample time = 14.38 ms / 26 runs ( 0.55 ms per token)
llama_print_timings: prompt eval time = 287.04 ms / 8 tokens ( 35.88 ms per token)
llama_print_timings: eval time = 3420.18 ms / 25 runs ( 136.81 ms per token)
$numactl -H
node 0 size: 128792 MB
node 0 free: 122893 MB
node 1 size: 129010 MB
node 1 free: 126815 MB

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$./main -m ../ggmlv3/guanaco-7B.ggmlv3.q4_0.bin -n 32 -t 32 -p "I believe the meaning of life is" --no-mmap
llama_print_timings: sample time = 9.78 ms / 17 runs ( 0.58 ms per token)
llama_print_timings: prompt eval time = 351.71 ms / 8 tokens ( 43.96 ms per token)
llama_print_timings: eval time = 2204.62 ms / 16 runs ( 137.79 ms per token)
$numactl -H
node 0 size: 128792 MB
node 0 free: 126579 MB
node 1 size: 129010 MB
node 1 free: 123042 MB

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$./main -m ../ggmlv3/guanaco-7B.ggmlv3.q4_0.bin -n 32 -t 60 -p "I believe the meaning of life is" --no-mmap
llama_print_timings: sample time = 17.75 ms / 32 runs ( 0.55 ms per token)
llama_print_timings: prompt eval time = 383.18 ms / 8 tokens ( 47.90 ms per token)
llama_print_timings: eval time = 4577.02 ms / 31 runs ( 147.65 ms per token)
$numactl -H
node 0 size: 128792 MB
node 0 free: 126554 MB
node 1 size: 129010 MB
node 1 free: 123107 MB

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$./main -m ../ggmlv3/guanaco-7B.ggmlv3.q4_0.bin -n 32 -t 16 -p "I believe the meaning of life is" --no-mmap
llama_print_timings: sample time = 17.82 ms / 32 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 434.39 ms / 8 tokens ( 54.30 ms per token)
llama_print_timings: eval time = 3614.41 ms / 31 runs ( 116.59 ms per token)
$numactl -H
node 0 size: 128792 MB
node 0 free: 122971 MB
node 1 size: 129010 MB
node 1 free: 126673 MB

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$numactl --interleave=0-1 ./main -m ../ggmlv3/guanaco-7B.ggmlv3.q4_0.bin -n 32 -t 32 -p "I believe the meaning of life is"
llama_print_timings: sample time = 18.05 ms / 32 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 279.41 ms / 8 tokens ( 34.93 ms per token)
llama_print_timings: eval time = 2837.69 ms / 31 runs ( 91.54 ms per token)
again:
llama_print_timings: sample time = 18.02 ms / 32 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 303.77 ms / 8 tokens ( 37.97 ms per token)
llama_print_timings: eval time = 3170.54 ms / 31 runs ( 102.28 ms per token)
again:
llama_print_timings: sample time = 18.32 ms / 32 runs ( 0.57 ms per token)
llama_print_timings: prompt eval time = 311.67 ms / 8 tokens ( 38.96 ms per token)
llama_print_timings: eval time = 2847.70 ms / 31 runs ( 91.86 ms per token)
$numactl -H
node 0 size: 128792 MB
node 0 free: 124720 MB
node 1 size: 129010 MB
node 1 free: 124749 MB

zrm:

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$./main -m ../../ggmlv3/guanaco-7B.ggmlv3.q4_0.bin -n 128 -t 16 -p "I believe the meaning of life is"
llama_print_timings: sample time = 71.39 ms / 128 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 363.49 ms / 8 tokens ( 45.44 ms per token)
llama_print_timings: eval time = 12101.28 ms / 127 runs ( 95.29 ms per token)
numactl -H
node 0 size: 128792 MB
node 0 free: 125005 MB
node 1 size: 129010 MB
node 1 free: 124674 MB

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$./main -m ../../ggmlv3/guanaco-7B.ggmlv3.q4_0.bin -n 128 -t 32 -p "I believe the meaning of life is"
llama_print_timings: sample time = 53.07 ms / 94 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 353.08 ms / 8 tokens ( 44.13 ms per token)
llama_print_timings: eval time = 7880.36 ms / 93 runs ( 84.74 ms per token)
numactl -H
node 0 size: 128792 MB
node 0 free: 124720 MB
node 1 size: 129010 MB
node 1 free: 124951 MB

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$./main -m ../../ggmlv3/guanaco-7B.ggmlv3.q4_0.bin -n 128 -t 60 -p "I believe the meaning of life is"
llama_print_timings: sample time = 49.05 ms / 88 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 368.89 ms / 8 tokens ( 46.11 ms per token)
llama_print_timings: eval time = 8753.94 ms / 87 runs ( 100.62 ms per token)
numactl -H
node 0 size: 128792 MB
node 0 free: 124774 MB
node 1 size: 129010 MB
node 1 free: 124942 MB

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$./main -m ../../ggmlv3/guanaco-7B.ggmlv3.q4_0.bin -n 128 -t 32 -p "I believe the meaning of life is" --no-mmap
llama_print_timings: sample time = 44.54 ms / 79 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 394.15 ms / 8 tokens ( 49.27 ms per token)
llama_print_timings: eval time = 9881.16 ms / 78 runs ( 126.68 ms per token)
numactl -H
node 0 size: 128792 MB
node 0 free: 122947 MB
node 1 size: 129010 MB
node 1 free: 126708 MB

@rankaiyx (Contributor) commented May 26, 2023

65b

master:

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$./main -m ../ggmlv3/guanaco-65B.ggmlv3.q5_1.bin -n 16 -t 16 -p "I believe the meaning of life is"
llama_print_timings: sample time = 8.79 ms / 16 runs ( 0.55 ms per token)
llama_print_timings: prompt eval time = 3194.02 ms / 8 tokens ( 399.25 ms per token)
llama_print_timings: eval time = 16712.37 ms / 15 runs ( 1114.16 ms per token)
node 0 size: 128792 MB
node 0 free: 126487 MB
node 1 size: 129010 MB
node 1 free: 79816 MB

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$./main -m ../ggmlv3/guanaco-65B.ggmlv3.q5_1.bin -n 16 -t 32 -p "I believe the meaning of life is"
llama_print_timings: sample time = 9.00 ms / 16 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 2286.92 ms / 8 tokens ( 285.86 ms per token)
llama_print_timings: eval time = 19443.28 ms / 15 runs ( 1296.22 ms per token)
node 0 size: 128792 MB
node 0 free: 79686 MB
node 1 size: 129010 MB
node 1 free: 126610 MB

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$numactl --interleave=0-1 ./main -m ../ggmlv3/guanaco-65B.ggmlv3.q5_1.bin -n 16 -t 32 -p "I believe the meaning of life is"
llama_print_timings: sample time = 18.26 ms / 32 runs ( 0.57 ms per token)
llama_print_timings: prompt eval time = 2232.64 ms / 8 tokens ( 279.08 ms per token)
llama_print_timings: eval time = 23214.56 ms / 31 runs ( 748.86 ms per token)
node 0 size: 128792 MB
node 0 free: 102975 MB
node 1 size: 129010 MB
node 1 free: 103229 MB

zrm:
$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$./main -m ../../ggmlv3/guanaco-65B.ggmlv3.q5_1.bin -n 16 -t 16 -p "I believe the meaning of life is"
llama_print_timings: sample time = 8.85 ms / 16 runs ( 0.55 ms per token)
llama_print_timings: prompt eval time = 3446.24 ms / 8 tokens ( 430.78 ms per token)
llama_print_timings: eval time = 10474.60 ms / 15 runs ( 698.31 ms per token)
node 0 size: 128792 MB
node 0 free: 103510 MB
node 1 size: 129010 MB
node 1 free: 103068 MB

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$./main -m ../../ggmlv3/guanaco-65B.ggmlv3.q5_1.bin -n 16 -t 32 -p "I believe the meaning of life is"
llama_print_timings: sample time = 8.87 ms / 16 runs ( 0.55 ms per token)
llama_print_timings: prompt eval time = 2340.34 ms / 8 tokens ( 292.54 ms per token)
llama_print_timings: eval time = 8888.35 ms / 15 runs ( 592.56 ms per token)

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$./main -m ../../ggmlv3/guanaco-65B.ggmlv3.q5_1.bin -n 16 -t 60 -p "I believe the meaning of life is"
llama_print_timings: sample time = 9.02 ms / 16 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 2353.47 ms / 8 tokens ( 294.18 ms per token)
llama_print_timings: eval time = 8453.18 ms / 15 runs ( 563.55 ms per token)

@rankaiyx (Contributor)

--no-mmap

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
./main -m ../ggmlv3/guanaco-65B.ggmlv3.q5_1.bin -n 16 -t 32 -p "I believe the meaning of life is" --no-mmap
llama_print_timings: sample time = 8.85 ms / 16 runs ( 0.55 ms per token)
llama_print_timings: prompt eval time = 3461.39 ms / 8 tokens ( 432.67 ms per token)
llama_print_timings: eval time = 11593.47 ms / 15 runs ( 772.90 ms per token)

again:
llama_print_timings: sample time = 8.93 ms / 16 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 3142.02 ms / 8 tokens ( 392.75 ms per token)
llama_print_timings: eval time = 20544.88 ms / 15 runs ( 1369.66 ms per token)

again:
llama_print_timings: sample time = 8.96 ms / 16 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 6109.34 ms / 8 tokens ( 763.67 ms per token)
llama_print_timings: eval time = 37335.85 ms / 15 runs ( 2489.06 ms per token)

again:
llama_print_timings: sample time = 8.83 ms / 16 runs ( 0.55 ms per token)
llama_print_timings: prompt eval time = 7148.55 ms / 8 tokens ( 893.57 ms per token)
llama_print_timings: eval time = 49204.25 ms / 15 runs ( 3280.28 ms per token)

numactl -H
node 0 size: 128792 MB
node 0 free: 126433 MB
node 1 size: 129010 MB
node 1 free: 79861 MB

@rankaiyx (Contributor) commented May 26, 2023

If one CPU core is occupied by another program, it has a large impact on speed.
The reason is that my CPU frequency drops after a period of full-load work.

I'm downloading guanaco-65b_Q4_0 and will test it later. Now I'm going to bed.

@rankaiyx (Contributor) commented May 27, 2023

zrm:
65B q4_0

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
./main -m ../../ggmlv3/guanaco-65B.ggmlv3.q4_0.bin -n 16 -t 32 -p "I believe the meaning of life is"
llama_print_timings: sample time = 9.01 ms / 16 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 2885.66 ms / 8 tokens ( 360.71 ms per token)
llama_print_timings: eval time = 7546.16 ms / 15 runs ( 503.08 ms per token)

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
./main -m ../../ggmlv3/guanaco-65B.ggmlv3.q4_0.bin -n 16 -t 60 -p "I believe the meaning of life is"
llama_print_timings: sample time = 9.07 ms / 16 runs ( 0.57 ms per token)
llama_print_timings: prompt eval time = 1964.91 ms / 8 tokens ( 245.61 ms per token)
llama_print_timings: eval time = 6788.23 ms / 15 runs ( 452.55 ms per token)

NUMA disabled in BIOS
master:
q5_1
./main -m ../ggmlv3/guanaco-65B.ggmlv3.q5_1.bin -n 16 -t 32 -p "I believe the meaning of life is"
llama_print_timings: sample time = 8.92 ms / 16 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 2177.03 ms / 8 tokens ( 272.13 ms per token)
llama_print_timings: eval time = 9704.86 ms / 15 runs ( 646.99 ms per token)

q4
./main -m ../ggmlv3/guanaco-65B.ggmlv3.q4_0.bin -n 16 -t 32 -p "I believe the meaning of life is"
llama_print_timings: sample time = 9.20 ms / 16 runs ( 0.57 ms per token)
llama_print_timings: prompt eval time = 1860.88 ms / 8 tokens ( 232.61 ms per token)
llama_print_timings: eval time = 7922.27 ms / 15 runs ( 528.15 ms per token)

./main -m ../ggmlv3/guanaco-65B.ggmlv3.q4_0.bin -n 16 -t 62 -p "I believe the meaning of life is"
llama_print_timings: sample time = 9.06 ms / 16 runs ( 0.57 ms per token)
llama_print_timings: prompt eval time = 1863.77 ms / 8 tokens ( 232.97 ms per token)
llama_print_timings: eval time = 8602.44 ms / 15 runs ( 573.50 ms per token)

@rankaiyx (Contributor)

It doesn't seem to work on the Windows platform.
Win10 2022H2

zrm:
D:\zrm\llama.cpp-win> .\main.exe -m ....\ggmlv3\guanaco-65B.ggmlv3.q4_0.bin -n 16 -t 16 -p "I believe the meaning of life is"
llama_print_timings: sample time = 10.23 ms / 16 runs ( 0.64 ms per token)
llama_print_timings: prompt eval time = 6351.19 ms / 8 tokens ( 793.90 ms per token)
llama_print_timings: eval time = 55692.97 ms / 15 runs ( 3712.86 ms per token)

reboot
D:\zrm\llama.cpp-win> .\main.exe -m ....\ggmlv3\guanaco-65B.ggmlv3.q4_0.bin -n 16 -t 32 -p "I believe the meaning of life is"
llama_print_timings: sample time = 17.52 ms / 16 runs ( 1.09 ms per token)
llama_print_timings: prompt eval time = 7219.65 ms / 8 tokens ( 902.46 ms per token)
llama_print_timings: eval time = 67883.02 ms / 15 runs ( 4525.53 ms per token)

master:
D:\llama-master-66874d4-bin-win-avx2-x64> .\main.exe -m ..\ggmlv3\guanaco-65B.ggmlv3.q4_0.bin -n 16 -t 16 -p "I believe the meaning of life is"
llama_print_timings: sample time = 4.62 ms / 16 runs ( 0.29 ms per token)
llama_print_timings: prompt eval time = 4916.49 ms / 8 tokens ( 614.56 ms per token)
llama_print_timings: eval time = 33004.08 ms / 15 runs ( 2200.27 ms per token)

D:\llama-master-66874d4-bin-win-avx2-x64> .\main.exe -m ..\ggmlv3\guanaco-65B.ggmlv3.q4_0.bin -n 16 -t 32 -p "I believe the meaning of life is"
llama_print_timings: sample time = 4.24 ms / 16 runs ( 0.27 ms per token)
llama_print_timings: prompt eval time = 7452.86 ms / 8 tokens ( 931.61 ms per token)
llama_print_timings: eval time = 23078.24 ms / 15 runs ( 1538.55 ms per token)

@rankaiyx (Contributor) commented May 28, 2023

Windows does not seem to have a tool like numactl, and the NUMA-disable setting in the BIOS has no effect on Windows.

Under Linux, my CPU reduces its frequency after working at full load for dozens of seconds. I have changed a lot of BIOS settings and tried multiple distributions, and still couldn't fix it.

I don't have a good idea right now.
The performance degradation turned out to be due to the memory voltage, which was solved by setting it to 1.5 V.

@zrm (Collaborator Author) commented May 29, 2023

It doesn't seem to work on the Windows platform.

I didn't write a Windows implementation because the APIs are different, I'm not especially familiar with them, and I haven't got a Windows NUMA system to test with anyway. Someone so inclined could plausibly get it to work there, assuming the Windows APIs to do it exist.

@Piezoid (Contributor) commented May 29, 2023

Someone so inclined could plausibly get it to work there assuming the Windows APIs to do it exist.

There is a support article about these APIs: https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support.
I am in the same situation with no access to a Windows NUMA machine.

@zrm (Collaborator Author) commented May 30, 2023

The strange thing is that I cannot make my system go faster with 36 threads.

I'm trying to guess what causes this, but it's hard without access to the same type of system to test.

You have more cores and the frequent thread synchronization is significantly more expensive across sockets. That could be contributing but I doubt it's the main problem. What happens on this system when the numa branch is run with 18 threads without --no-mmap?

What happens on this system if you drop the page cache and then run master with 36 threads under numactl --interleave=all? In this case the performance shouldn't be different with and without --no-mmap (both the mmap'd file and the copy in memory should be interleaved), so if there was we have something to investigate.

This code is an attempt to get the system to load pages into the right node without explicitly specifying it with libnuma. It's possible that some hardware difference is making that less effective on this system, e.g. your system has AVX-512 which could change the memory access pattern. Calling madvise() with MADV_RANDOM had no major effect on my 2-node system but significantly improved performance on the 4-node system (and the 4N system has AVX2 where the 2N system only has AVX, which may or may not be related). I might try writing the code that actually uses libnuma just to see if it works any better than this.
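For discussion, a rough sketch of what explicit placement with libnuma might look like (hypothetical and untested against this branch; link with -lnuma): instead of relying on first touch, bind each node-sized slice of a buffer to its node up front, mirroring how each thread always works on the same rows.

// Rough sketch of explicit placement with libnuma instead of first touch:
// each contiguous slice of the buffer is bound to one node before it is
// touched. Build with: cc explicit_numa.c -lnuma
#include <numa.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    int    n_nodes = numa_num_configured_nodes();
    size_t size    = 64u * 1024 * 1024;
    char  *buf     = numa_alloc_interleaved(size);  // placeholder allocation
    if (!buf) { return 1; }

    // One page-aligned slice per node, analogous to the fixed row ranges that
    // each ggml thread gets in the mul_mat code quoted above.
    size_t page  = (size_t) sysconf(_SC_PAGESIZE);
    size_t slice = (size / (size_t) n_nodes) & ~(page - 1);
    for (int node = 0; node < n_nodes; ++node) {
        numa_tonode_memory(buf + (size_t) node * slice, slice, node);
    }
    memset(buf, 0, size);  // pages are committed on their assigned nodes here

    printf("bound %d slice(s) of %zu bytes each\n", n_nodes, slice);
    numa_free(buf, size);
    return 0;
}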

@rankaiyx (Contributor)

Since you only implemented the Linux part, why does it degrade performance on Windows? Is it because compiling with w64devkit makes the compiler think it is targeting the Linux platform?

@rankaiyx (Contributor)

Is it possible that I have some BIOS setting wrong?

If NUMA is disabled in the BIOS, memory interleaving may happen automatically, which could lead to this behavior.

@rankaiyx (Contributor) commented May 30, 2023

NUMA is obscure knowledge for most people. Could we rely on ChatGPT's assistance to complete the Windows part? I am willing to test it.
I used the following prompt to ask ChatGPT to complete the code; see whether it is useful as a reference.

I:
Here is a C program that only implements the Linux part, as the developers are not familiar with the Windows API. Can you help complete the Windows part? By the way, can you explain all of this code?
ChatGPT:
Of course! I'll be happy to help you complete the Windows part of your C language program and explain the code to you. Please provide the code you have so far, and let me know which specific parts you need assistance with.
I: [the code]
ChatGPT:

#ifdef __linux__
void set_numa_thread_affinity(int thread_n, int n_threads)
{
    if (!ggml_is_numa()) { return; }
    // run thread on node_num thread_n / (threads per node)
    int node_num = thread_n / (n_threads / ggml_numa.n_nodes);
    struct ggml_numa_node *node = &ggml_numa.nodes[node_num];
    size_t setsize = CPU_ALLOC_SIZE(ggml_numa.total_cpus);
    cpu_set_t *cpus = CPU_ALLOC(ggml_numa.total_cpus);
    CPU_ZERO_S(setsize, cpus);
    for (size_t i = 0; i < node->n_cpus; ++i) {
        CPU_SET_S(node->cpus[i], setsize, cpus);
    }
    int rv = pthread_setaffinity_np(pthread_self(), setsize, cpus);
    if (rv) {
        fprintf(stderr, "warning: pthread_setaffinity_np() failed: %s\n", strerror(rv));
    }
    CPU_FREE(cpus);
}

void clear_numa_thread_affinity(void)
{
    if (!ggml_is_numa()) { return; }
    size_t setsize = CPU_ALLOC_SIZE(ggml_numa.total_cpus);
    cpu_set_t *cpus = CPU_ALLOC(ggml_numa.total_cpus);
    CPU_ZERO_S(setsize, cpus);
    for (unsigned i = 0; i < ggml_numa.total_cpus; ++i) {
        CPU_SET_S(i, setsize, cpus);
    }
    int rv = pthread_setaffinity_np(pthread_self(), setsize, cpus);
    if (rv) {
        fprintf(stderr, "warning: pthread_setaffinity_np() failed: %s\n", strerror(rv));
    }
    CPU_FREE(cpus);
}

#elif defined(_WIN32)

#include <windows.h>

void set_numa_thread_affinity(int thread_n, int n_threads)
{
    if (!ggml_is_numa()) { return; }

    int node_num = thread_n / (n_threads / ggml_numa.n_nodes);
    struct ggml_numa_node *node = &ggml_numa.nodes[node_num];

    DWORD_PTR affinityMask = 0;
    for (size_t i = 0; i < node->n_cpus; ++i) {
        affinityMask |= (1ULL << node->cpus[i]);
    }

    DWORD_PTR previousAffinityMask = SetThreadAffinityMask(GetCurrentThread(), affinityMask);
    if (previousAffinityMask == 0) {
        fprintf(stderr, "warning: SetThreadAffinityMask failed: %lu\n", GetLastError());
    }
}

void clear_numa_thread_affinity(void)
{
    if (!ggml_is_numa()) { return; }

    DWORD_PTR processAffinityMask;
    DWORD_PTR systemAffinityMask;
    if (!GetProcessAffinityMask(GetCurrentProcess(), &processAffinityMask, &systemAffinityMask)) {
        fprintf(stderr, "warning: GetProcessAffinityMask failed: %lu\n", GetLastError());
        return;
    }

    DWORD_PTR affinityMask = 0;
    for (unsigned i = 0; i < ggml_numa.total_cpus; ++i) {
        affinityMask |= (1ULL << i);
    }

    DWORD_PTR previousAffinityMask = SetThreadAffinityMask(GetCurrentThread(), affinityMask);
    if (previousAffinityMask == 0) {
        fprintf(stderr, "warning: SetThreadAffinityMask failed: %lu\n", GetLastError());
    }
}

#else
// Other platforms
void set_numa_thread_affinity(int thread_n, int n_threads) {}
void clear_numa_thread_affinity(void) {}
#endif

In the Windows implementation:

set_numa_thread_affinity:

It calculates the node number and retrieves the corresponding NUMA node information as in the Linux implementation.
The SetThreadAffinityMask Windows API function is used to set the thread affinity. It takes the current thread handle (obtained using GetCurrentThread()) and an affinity mask. The affinity mask is created by bitwise OR'ing the individual processor masks associated with the CPUs of the NUMA node.
If SetThreadAffinityMask fails, an error message is printed.

clear_numa_thread_affinity:

It retrieves the process affinity mask and system affinity mask using the Windows API function GetProcessAffinityMask.
It creates an affinity mask with all the available CPUs set.
It calls SetThreadAffinityMask with the created affinity mask to clear the thread affinity.
If SetThreadAffinityMask fails, an error message is printed.
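One caveat about the generated Windows code: SetThreadAffinityMask takes a single DWORD_PTR, so it can only address the (at most 64) logical processors of the thread's current processor group. On larger machines a group-aware variant would be needed. Below is a rough, untested sketch using the documented Win32 NUMA APIs; it is not part of this PR, and pin_thread_to_node is just an illustrative name.

#if defined(_WIN32)
#include <windows.h>
#include <stdio.h>

// Pin the calling thread to all processors of one NUMA node, using the
// group-aware APIs so systems with more than 64 logical CPUs also work.
static void pin_thread_to_node(USHORT node) {
    GROUP_AFFINITY ga;
    ZeroMemory(&ga, sizeof(ga));
    if (!GetNumaNodeProcessorMaskEx(node, &ga)) {
        fprintf(stderr, "warning: GetNumaNodeProcessorMaskEx failed: %lu\n", GetLastError());
        return;
    }
    if (!SetThreadGroupAffinity(GetCurrentThread(), &ga, NULL)) {
        fprintf(stderr, "warning: SetThreadGroupAffinity failed: %lu\n", GetLastError());
    }
}
#endif

GetNumaHighestNodeNumber() would give the node count on the Windows side, mirroring what the Linux code reads from sysfs.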

@rankaiyx
Copy link
Contributor

rankaiyx commented Jun 19, 2023

The power supply of my computer is broken. I can't test the program until the power supply is replaced.

@ggerganov ggerganov linked an issue Jun 25, 2023 that may be closed by this pull request
@rankaiyx
Copy link
Contributor

My machine has been repaired, and the latest code with the "--numa" switch works well on it.
There is no longer any problem with the node calculation.
Looking forward to the merge.

@ggerganov
Copy link
Owner

Can you also confirm that without --numa the inference is faster compared to master?

@rankaiyx
Copy link
Contributor

zrm latest
sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
./main -m ../../../ggmlv3/wizardlm-30b.ggmlv3.q4_K_S.bin -n 50 -t 30 -p "I believe the meaning of life is" -s 5
...
I believe the meaning of life is to live it."

"But to live it for what?"

"For experience, growth, and joy. Life has its challenges, but it also has its rewards. It's up to us to make the most of
llama_print_timings: load time = 13126.72 ms
llama_print_timings: sample time = 26.49 ms / 50 runs ( 0.53 ms per token, 1887.36 tokens per second)
llama_print_timings: prompt eval time = 721.33 ms / 8 tokens ( 90.17 ms per token, 11.09 tokens per second)
llama_print_timings: eval time = 24223.18 ms / 49 runs ( 494.35 ms per token, 2.02 tokens per second)
llama_print_timings: total time = 24985.15 ms

sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
./main -m ../../../ggmlv3/wizardlm-30b.ggmlv3.q4_K_S.bin -n 50 -t 30 -p "I believe the meaning of life is" -s 5 --numa
...
I believe the meaning of life is to live it."

"But to live it for what?"

"For experience, growth, and joy. Life has its challenges, but it also has its rewards. It's up to us to make the most of
llama_print_timings: load time = 10600.34 ms
llama_print_timings: sample time = 27.32 ms / 50 runs ( 0.55 ms per token, 1830.23 tokens per second)
llama_print_timings: prompt eval time = 1201.06 ms / 8 tokens ( 150.13 ms per token, 6.66 tokens per second)
llama_print_timings: eval time = 11087.82 ms / 49 runs ( 226.28 ms per token, 4.42 tokens per second)
llama_print_timings: total time = 12330.62 ms

git clone https://github.com/ggerganov/llama.cpp.git
make

sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
./main -m ../../../ggmlv3/wizardlm-30b.ggmlv3.q4_K_S.bin -n 50 -t 30 -p "I believe the meaning of life is" -s 5
...
I believe the meaning of life is to live it."

"But is that all there is to it?" asked the boy, looking at her with a wistful expression in his eyes.

"Not necessarily," replied Babs, "but one has to make the
llama_print_timings: load time = 12658.29 ms
llama_print_timings: sample time = 26.67 ms / 50 runs ( 0.53 ms per token, 1874.84 tokens per second)
llama_print_timings: prompt eval time = 761.32 ms / 8 tokens ( 95.17 ms per token, 10.51 tokens per second)
llama_print_timings: eval time = 27040.05 ms / 49 runs ( 551.84 ms per token, 1.81 tokens per second)
llama_print_timings: total time = 27842.50 ms

@rankaiyx
Copy link
Contributor

The generated content seems to be different, even if the seeds are the same.

@rankaiyx
Copy link
Contributor

Therefore, the performance differences cannot be compared accurately.

@rankaiyx
Copy link
Contributor

zrm (Merge branch 'master' into HEAD 8f98035)

sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
./main -m ../../../ggmlv3/wizardlm-30b.ggmlv3.q4_K_S.bin -n 50 -t 30 -p "I believe the meaning of life is" -s 5
...
I believe the meaning of life is to live it."

"But is that all there is to it?" asked the boy, looking at her with a wistful expression in his eyes.

"Not necessarily," replied Babs, "but one has to make the
llama_print_timings: load time = 12956.89 ms
llama_print_timings: sample time = 26.60 ms / 50 runs ( 0.53 ms per token, 1879.84 tokens per second)
llama_print_timings: prompt eval time = 710.49 ms / 8 tokens ( 88.81 ms per token, 11.26 tokens per second)
llama_print_timings: eval time = 24138.92 ms / 49 runs ( 492.63 ms per token, 2.03 tokens per second)
llama_print_timings: total time = 24890.51 ms

sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
./main -m ../../../ggmlv3/wizardlm-30b.ggmlv3.q4_K_S.bin -n 50 -t 30 -p "I believe the meaning of life is" -s 5 --numa
...
I believe the meaning of life is to live it."

"But is that all there is to it?" asked the boy, looking at her with a wistful expression in his eyes.

"Not necessarily," replied Babs, "but one has to make the
llama_print_timings: load time = 10534.27 ms
llama_print_timings: sample time = 27.49 ms / 50 runs ( 0.55 ms per token, 1818.58 tokens per second)
llama_print_timings: prompt eval time = 785.80 ms / 8 tokens ( 98.22 ms per token, 10.18 tokens per second)
llama_print_timings: eval time = 10917.93 ms / 49 runs ( 222.81 ms per token, 4.49 tokens per second)
llama_print_timings: total time = 11745.07 ms

git clone https://github.com/ggerganov/llama.cpp.git
make

sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
./main -m ../../../ggmlv3/wizardlm-30b.ggmlv3.q4_K_S.bin -n 50 -t 30 -p "I believe the meaning of life is" -s 5
...
I believe the meaning of life is to live it."

"But is that all there is to it?" asked the boy, looking at her with a wistful expression in his eyes.

"Not necessarily," replied Babs, "but one has to make the
llama_print_timings: load time = 13865.09 ms
llama_print_timings: sample time = 26.75 ms / 50 runs ( 0.53 ms per token, 1869.37 tokens per second)
llama_print_timings: prompt eval time = 770.73 ms / 8 tokens ( 96.34 ms per token, 10.38 tokens per second)
llama_print_timings: eval time = 26737.24 ms / 49 runs ( 545.66 ms per token, 1.83 tokens per second)
llama_print_timings: total time = 27549.38 ms

ggml.c Outdated

state->node = NULL;
if(state->ith < node->n_tasks) {
    ggml_compute_forward(&params, node);
} else {
    break;
Copy link
Owner


This break is incorrect now
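For context, the problem with the break: now that GGML_TASK_FINALIZE is no longer broadcast, a worker whose ith is at or beyond n_tasks must still stay in the loop and reach the shared synchronization point; if it breaks out, the remaining threads wait on it forever. The toy below illustrates the shape of the issue, with a pthread barrier standing in for ggml's atomic counters; it is not the actual ggml worker loop.

// build: cc -pthread toy.c
#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4   // size of the thread pool
#define N_TASKS   2   // tasks for the current op (n_tasks < n_threads)

static pthread_barrier_t barrier;

static void *worker(void *arg) {
    int ith = (int)(long)arg;
    if (ith < N_TASKS) {
        printf("thread %d: computing its slice\n", ith);
    }
    // every worker, busy or idle, must reach the barrier; breaking out of the
    // loop here instead would leave the other threads blocked forever
    pthread_barrier_wait(&barrier);
    return NULL;
}

int main(void) {
    pthread_t t[N_THREADS];
    pthread_barrier_init(&barrier, NULL, N_THREADS);
    for (long i = 0; i < N_THREADS; ++i) {
        pthread_create(&t[i], NULL, worker, (void *) i);
    }
    for (int i = 0; i < N_THREADS; ++i) {
        pthread_join(t[i], NULL);
    }
    pthread_barrier_destroy(&barrier);
    printf("all %d workers reached the barrier\n", N_THREADS);
    return 0;
}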

Copy link
Owner

@ggerganov ggerganov left a comment


Alright, I think this is ready for merge

Thank you, @zrm !
Very valuable contribution and analysis.

@ggerganov ggerganov merged commit b853d45 into ggerganov:master Jun 26, 2023
21 checks passed
byroneverson added a commit to byroneverson/llm.cpp that referenced this pull request Jun 30, 2023
commit b8c8dda75fdf5fdea49c80af36818e7c30fe0ddf
Author: Howard Su <[email protected]>
Date:   Thu Jun 29 21:15:15 2023 +0800

    Use unsigned for random seed (#2006)

    * Use unsigned for random seed. Keep -1 as the value to use a time based seed.

    Co-authored-by: Georgi Gerganov <[email protected]>

commit 96a712ca1b7f427e3bd7ffc0c70b2105cfc7fbf1
Author: LostRuins <[email protected]>
Date:   Thu Jun 29 11:56:43 2023 +0800

    Porting the improved K-Quant CUDA kernels to OpenCL (#1966)

    * Added broken new q4k quant

    * xx + ib0

    * Fix q2_k fast kernel

    * Use preprocessor for QK_K

    * Add q6_k fast matmul kernel

    * ported q3k speedup successfully

    * ported q2k and q5k speedups

    * remove old dot kernels and template

    * fixed global const struct types

    * fixing address spaces

    * fixed string too long CI issue

    ---------

    Co-authored-by: 0cc4m <[email protected]>

commit d3494bb86bf7ad5b0b60aae0220ea576f273b5c0
Author: m3ndax <[email protected]>
Date:   Wed Jun 28 20:39:08 2023 +0200

    llama : replacing auto &kv with const auto &kv (#2041)

    * Replacing auto &kv with const auto &kv

    * Create codacy.yml

    * Delete codacy.yml

commit 5b351e94d041742cd50ffcf2d44718d63bab398a
Author: Salvador E. Tropea <[email protected]>
Date:   Wed Jun 28 14:27:31 2023 -0300

    cuda : remove nchannels_x argument from mul_mat_vec_nc_f16_f32 (#2028)

    - Not used

commit 6432aabb6dc887436e4d57414b63116189c3b13b
Author: Salvador E. Tropea <[email protected]>
Date:   Wed Jun 28 14:26:26 2023 -0300

    cuda : fix missing const qualifier in casts (#2027)

commit b922bc351b69770cec2d35d2aa50fa052b95ca93
Author: Howard Su <[email protected]>
Date:   Wed Jun 28 10:13:02 2023 -0700

    llama : remove shards weight file support (#2000)

    * Remove multiple shards

    * Remove multiple file loaders

    * Remove llama_load_tensor_shard class

    * Simplify load logic

    * Remove dead code guess_n_parts function

    * Remove vocab_only from constructor of llama_model_loader

    * Remove alignment_prevents_mmap which is no longer needed.

    * Remove useless check

commit 7f9753fa1263c4eded9a3de19778562f0e1093d7
Author: Johannes Gäßler <[email protected]>
Date:   Wed Jun 28 18:35:54 2023 +0200

    CUDA GPU acceleration for LoRAs + f16 models (#1970)

commit cfa0750bc9dbc2d957a91b8ed09ab0035d8f3d4e
Author: ningshanwutuobang <[email protected]>
Date:   Wed Jun 28 23:53:37 2023 +0800

    llama : support input embeddings directly  (#1910)

    * add interface for float input

    * fixed inpL shape and type

    * add examples of input floats

    * add test example for embd input

    * fixed sampling

    * add free for context

    * fixed add end condition for generating

    * add examples for llava.py

    * add READMD for llava.py

    * add READMD for llava.py

    * add example of PandaGPT

    * refactor the interface and fixed the styles

    * add cmake build for embd-input

    * add cmake build for embd-input

    * Add MiniGPT-4 example

    * change the order of the args of llama_eval_internal

    * fix ci error

commit 9d23589d638dc74577d5ff880e6d4248b795f12e
Author: Erik Scholz <[email protected]>
Date:   Tue Jun 27 19:06:33 2023 +0200

    fix pthreads setaffinity usage on android (#2020)

commit 0be54f75a6c3e9a09ea71bdfcdabf9a996a0549b
Author: Howard Su <[email protected]>
Date:   Tue Jun 27 13:07:13 2023 +0800

    baby-llama : fix build after ggml_rope change (#2016)

commit 181e8d975528a4e27eabb8ae6e9865f9ceae4b37
Author: Georgi Gerganov <[email protected]>
Date:   Tue Jun 27 00:37:13 2023 +0300

    llama : fix rope usage after ChatGLM change

commit d9779021bd59ed96daae75e820a5ac5da47ca8ff
Author: Georgi Gerganov <[email protected]>
Date:   Tue Jun 27 00:06:51 2023 +0300

    ggml : add support for ChatGLM RoPE

commit d38e45157862b58a1824387e64860d68ca3533a7
Author: Roman Parykin <[email protected]>
Date:   Mon Jun 26 22:47:59 2023 +0300

    readme : add Scala 3 bindings repo (#2010)

commit eaa6ca5a61b8c9501df9ebe3d264f45b75a5f8aa
Author: David Yang <[email protected]>
Date:   Tue Jun 27 03:45:32 2023 +0800

    ggml : increase max tensor name + clean up compiler warnings in train-text (#1988)

    * Clean up compiler warnings in train-text

    Some brackets to disambiguate order of operations

    * Increase GGML_MAX_NAME

    Avoiding strncpy danger in train-text-from-scratch and reducing potential future name length issues

commit aa777abbb73655c4e1e9237b7c0ad66745e8e48c
Author: Gustavo Rocha Dias <[email protected]>
Date:   Mon Jun 26 16:34:45 2023 -0300

    readme : LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux (#2007)

    * docs - Alternative way to build at Android, with CLBlast.

    * doc - LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux.

    * doc- fix typo

commit c824d2e368d193d9f564ff29880a51cda9f90527
Author: Georgi Gerganov <[email protected]>
Date:   Mon Jun 26 21:03:59 2023 +0300

    ggml : avoid conv 2d kernel round up

commit b853d456018b10820686362af41b2f2f75f1eec6
Author: zrm <[email protected]>
Date:   Mon Jun 26 13:57:59 2023 -0400

    ggml : add NUMA support (#1556)

    * detect NUMA systems and pin work threads to nodes (linux)

    * disable mmap prefetch/readahead for NUMA systems

    * avoid sending finalize op to thread pool if it does nothing

    * silence robot

    * fix args

    * make --numa a param

    * recommendation that n_nodes evenly divide n_threads did not warrant such aggressive enforcement

    * lower synchronization overhead

    * statically allocate

    * move numa state to g_state

    * add description for --numa

    * ggml : minor style changes

    * ggml : minor style + try fix sanitizer build

    * llama : allow to initialize backend with NUMA support

    * llama : avoid ggml include in llama-util.h

    * ggml : style / formatting

    * ggml : fix handling of ops with n_threads > n_tasks > 1

    * server : utilize numa parameter

    ---------

    Co-authored-by: Georgi Gerganov <[email protected]>

commit 9225baef71407d799a6f7f563b77fd7f82791416
Author: Georgi Gerganov <[email protected]>
Date:   Mon Jun 26 20:10:52 2023 +0300

    k-quants : fix indentation

commit a84ab1da8dc6a59a5b67420ae1322f09503ffc72
Author: katsu560 <[email protected]>
Date:   Tue Jun 27 01:47:02 2023 +0900

    tests : fix quantize perf (#1990)

    * fix test quantize perf

    * avoid the global state

commit 5743ca80928d8410754ec64a5673d5c2dd6cfbb7
Author: katsu560 <[email protected]>
Date:   Tue Jun 27 01:46:07 2023 +0900

    k-quants : add AVX support to dot functions (#1916)

    * k_quants : add AVX support

    * k_quants : apply review comments

commit 412c60e4739367144e51e59add5dc7749d084115
Author: Georgi Gerganov <[email protected]>
Date:   Mon Jun 26 19:45:09 2023 +0300

    readme : add link to new k-quants for visibility

commit 6769e944c727c63612dcafbef52009d21ae00fff
Author: Kawrakow <[email protected]>
Date:   Mon Jun 26 19:43:07 2023 +0300

    k-quants : support for super-block size of 64 (#2001)

    * k_quants: WIP super-blocks with 64 weights

    * k_quants: WIP super-blocks with 64 weights

    Q6_K scalar and AVX2 works

    * k_quants: WIP super-blocks with 64 weights

    Q4_K scalar and AVX2 works

    * k_quants: WIP super-blocks with 64 weights

    Q2_K scalar and AVX2 works. Q2_K is way too slow (it is actually slower
    than the scalar implementation)

    * k_quants: WIP super-blocks with 64 weights

    Q3_K scalar and AVX2 works.

    * k_quants: WIP super-blocks with 64 weights

    Q5_K scalar and AVX2 works, and with that all
    k_quants are done on AVX2 and scalar

    * k_quants: WIP super-blocks with 64 weights

    Q6_K working on CUDA. Cannot make it run quite as fast as
    with super-blocks with 256 weights: 8% slower on 4080,
    20% slower on the 1660 (but there we fit 1 less layer on the
    GPU because of the larger model size), so some fraction of
    these 20% is due to that.

    * k_quants: WIP super-blocks with 64 weights

    Q4_K working on CUDA. ~10% slower on GTX-1660,
    16% slower on 4080.

    * k_quants: WIP super-blocks with 64 weights

    Q2_K working on CUDA. ~3% slower on GTX-1660,
    10% slower on 4080.

    * k_quants: WIP super-blocks with 64 weights

    Q3_K working on CUDA.

    * k_quants: WIP super-blocks with 64 weights

    Q5_K working on CUDA, and with this CUDA is done.

    * k_quants: WIP super-blocks with 64 weights

    Q6_K working on ARM_NEON

    * k_quants: WIP super-blocks with 64 weights

    Q4_K working on ARM_NEON, but quite a bit slower than 256 weights

    * k_quants: WIP super-blocks with 64 weights

    Q2_K working on ARM_NEON, but quite a bit slower than 256 weights

    * k_quants: WIP super-blocks with 64 weights

    Q3_K working on ARM_NEON, but quite a bit slower than 256 weights.

    * k_quants: WIP super-blocks with 64 weights

    Q5_K working on ARM_NEON, but quite a bit slower than 256 weights.

    With that, we have full support for ARM_NEON, although
    performance is not quite there.

    * k_quants: WIP super-blocks with 64 weights

    Slightly more efficient Q3_K and Q5_K

    * k_quants: WIP super-blocks with 64 weights

    Another small improvement for Q3_K and Q5_K on ARM_NEON

    * k_quants: WIP super-blocks with 64 weights

    Yet another speedup for Q5_K on ARM_NEON.
    We are now within 10% of the QK_K = 256 version.

    * k_quants: WIP super-blocks with 64 weights

    * We are able to pass preprocessor macros to the Metal
      compiler
    * Q6_K works and is actually slightly more efficient than
      the QK_K = 256 version (25.2 ms vs 25.8 ms)

    * k_quants: WIP super-blocks with 64 weights

    Q4_K works on Metal and is actually slightly faster
    than QK_K = 256 (21.95 ms vs 24.0 ms).

    * k_quants: WIP super-blocks with 64 weights

    Q2_K works on Metal and is very slightly faster
    than QK_K = 256 (23.8 ms vs 24.2 ms).

    * k_quants: WIP super-blocks with 64 weights

    Q3_K works on Metal and is slightly faster
    than QK_K = 256 (26.6 ms vs 28.3 ms).

    * k_quants: WIP super-blocks with 64 weights

    Q5_K works on Metal and is slightly faster
    than QK_K = 256 (23.7 ms vs 26.3 ms).

    * k_quants: call them _K, not _k, also on Metal

    * k_quants: correctly define QK_K in llama.cpp

    * Fixed bug in q4_K quantization added with the 64-block addition

    * Simplify via lambda

    * k_quants: switch Q3_K to 4-bit scales when QK_K = 64

    Otherwise there isn't much benefit from this
    quantization type. There is some very slight loss
    in accuracy, but we reduce size by ~7%.
    E.g., for OpenLLaMA-3B, Q3_K_S perplexity is
    8.6131 with 8-bit scales and 8.6352 with 4-bit,
    while file size decreases from 1.53G to 1.44G.

    * k_quants: switch Q4_K to 4-bit scales when QK_K = 64

     Here the loss in accuracy is greater than for Q3_K,
     but the Q4_K points still move further to the left on
     the perplexity vs size curve.

    * k_quants: forgot to add the Metal changes in last commit

    * k_quants: change Q5_K to be type 0 when QK_K = 64

    Still needs AVX2 implementation

    * k_quants: AVX2 implementation for new 64-weight Q5_K

    * k_quants: 10% faster ARM_NEON Q5_K dot product

    * k_quants: fixed issue caused by merging with master

    ---------

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit cbebf61ca7584e9709265395f0127ae7fc0f1882
Author: Howard Su <[email protected]>
Date:   Mon Jun 26 23:15:47 2023 +0800

    Fix assert when free invalid cuda pointer (#2005)

    Fix assert via initializing extra structure always.
    CUDA error 1 at C:\GPT\llama.cpp\ggml-cuda.cu:2536: invalid argument

commit 447ccbe8c39332fcdd0d98a041b6e2ff6f06219d
Author: Georgi Gerganov <[email protected]>
Date:   Sun Jun 25 16:08:12 2023 +0300

    readme : add new roadmap + manifesto

commit bd34cdde38f8fd661890ddd5f57ca30bf279877b
Author: Georgi Gerganov <[email protected]>
Date:   Sun Jun 25 14:25:08 2023 +0300

    ggml : sync latest ggml (custom operators)

commit c2a08f87b8d180115d04b8688f383d1b2761b16d
Author: anon998 <[email protected]>
Date:   Sun Jun 25 08:48:36 2023 +0000

    fix server sampling: top k sampler first (#1977)

    Co-authored-by: anon <[email protected]>

commit 66a2555ba6cab954c56d653b29c27bfbbacfbfb1
Author: Georgi Gerganov <[email protected]>
Date:   Sun Jun 25 09:07:03 2023 +0300

    readme : add Azure CI discussion link

commit e65ca7e14ac76c4046091da39d41a9017abaa9b3
Author: sjinzh <[email protected]>
Date:   Sun Jun 25 13:45:44 2023 +0800

    zig : upgrade build system support (#1981)

    * upgrade zig build system support

    * zig : add new line at the end of the file

    ---------

    Co-authored-by: Georgi Gerganov <[email protected]>

commit 5ec8dd5a3c6a9a109351d2257bb9d53869bd0a94
Author: Robyn <[email protected]>
Date:   Sun Jun 25 04:10:29 2023 +1000

    #1869 Fix null reference errors when training from scratch with CUDA (#1907)

    * #1869 Fix null reference errors when training from scratch with CUDA build

    Calling ggml_compute_forward when node->src0 was null was causing train-text-from-scratch.exe to terminate unexpectedly.

    * ggml : do not dereference src0 if NULL

    ---------

    Co-authored-by: Georgi Gerganov <[email protected]>

commit 65bdd52a867539691007f85c5508146d507f72c1
Author: Georgi Gerganov <[email protected]>
Date:   Sat Jun 24 19:40:18 2023 +0300

    tests : sync test-grad0 from ggml

commit fdd18609113862dc6eb34dfc44a093d54c59ff1f
Author: Rowan Hart <[email protected]>
Date:   Sat Jun 24 04:07:08 2023 -0700

    flake : fix ggml-metal.metal path and run nixfmt (#1974)

commit c943d823c14cef33092205ca3944de6fdf7abf99
Author: AN Long <[email protected]>
Date:   Sat Jun 24 19:02:06 2023 +0800

    convert : fix invalid params in write_vocab_only (#1975)

commit f2c754e1c38936fdde74e4848ac468a696eb73c6
Author: slaren <[email protected]>
Date:   Sat Jun 24 12:57:18 2023 +0200

    ggml : improve ggml_graph_dump_dot, add ggml_format_name (#1978)

    * Improve ggml_graph_dump_dot, add ggml_format_name

    * add more automatic names to view ops

    * fix name of copies

commit 11da1a85cd69af84b5861134738c7e9e20907470
Author: Georgi Gerganov <[email protected]>
Date:   Sat Jun 24 13:38:18 2023 +0300

    readme : fix whitespaces

commit 235b610d650cbfed6dbd5d671f750d35fc18cd7d
Author: Alberto <[email protected]>
Date:   Sat Jun 24 12:32:13 2023 +0200

    readme : fixed termux instructions (#1973)

commit b061ba9e2a7a2c335a200df8c11aed5e31e4ccbb
Author: Alex Renda <[email protected]>
Date:   Sat Jun 24 03:15:01 2023 -0700

    llama : fix top-p sampling to match the canonical definition (#1953)

    * Fix top-p sampling to match the standard definition (smallest set that has probability mass at least p, not largest set with probability mass less than p)

    * top-p: correct gt to gte

    * add test for correct top-p behavior

commit 527b6fba1d237befb324fd846bda7418c0fa394d
Author: Didzis Gosko <[email protected]>
Date:   Sat Jun 24 11:47:58 2023 +0300

    llama : make model stateless and context stateful (llama_state) (#1797)

    * llama : make model stateless and context stateful

    * llama : minor cleanup

    * llama : update internal API declaration

    * Apply suggestions from code review

    fix style

    Co-authored-by: Georgi Gerganov <[email protected]>

    * Missing model memory release

    * Fix style

    * Add deprecated warning for public API function llama_init_from_file

    * Update public API use cases: move away from deprecated llama_init_from_file

    * Deprecate public API function llama_apply_lora_from_file

    ---------

    Co-authored-by: Georgi Gerganov <[email protected]>

commit d7b7484f74d486f77feb4c0b7af7e1718ed91651
Author: eiery <[email protected]>
Date:   Fri Jun 23 04:38:01 2023 -0400

    Add OpenLLaMA instructions to the README (#1954)

    * add openllama to readme

commit 7487137227eb32ed9b12156338b865cb29b2dfd1
Author: Erik Scholz <[email protected]>
Date:   Thu Jun 22 14:20:47 2023 +0200

    rework convert.py to read hyper-parameters from config.json (#1958)

    * Read hyper-parameters from HuggingFace-transformer config.json, if they exist, and fall back to guessing, like before otherwise.
      This allows converting open_llama 3B and other non-standard model designs.

commit bbca06e26949686d61a5126332680ba3cccf235c
Author: Johannes Gäßler <[email protected]>
Date:   Wed Jun 21 23:49:25 2023 +0200

    cmake: revert CUDA arch default to 52, 61 if f16 (#1959)

commit fb98254f99d769fcbbf20966ef386abdb48ef601
Author: Rahul Vivek Nair <[email protected]>
Date:   Thu Jun 22 03:18:43 2023 +0530

    Fix typo in README.md (#1961)

commit 049aa16b8c5c6d086246e4e6b9feb18de4fbd663
Author: Georgi Gerganov <[email protected]>
Date:   Tue Jun 20 19:05:54 2023 +0300

    readme : add link to p1

commit 2322ec223a21625dfe9bd73ee677444a98a24ac9
Author: Xiake Sun <[email protected]>
Date:   Tue Jun 20 05:42:40 2023 -0700

    Fix typo (#1949)

commit aacdbd40562684665b6f7b8ba6695b7a2088bbb0
Author: Ettore Di Giacinto <[email protected]>
Date:   Tue Jun 20 03:24:39 2023 +0200

    llama : fix params struct slignment (#1936)

    * Workaround struct misalignment during value-copy

    Signed-off-by: mudler <[email protected]>

    * Move booleans at the bottom of the structure

    Signed-off-by: mudler <[email protected]>

    * Add comment

    Signed-off-by: mudler <[email protected]>

    ---------

    Signed-off-by: mudler <[email protected]>

commit 20568fe60f00155fa25e92eb3a7f6b911d557967
Author: Henri Vasserman <[email protected]>
Date:   Tue Jun 20 01:12:39 2023 +0300

    [Fix] Reenable server embedding endpoint (#1937)

    * Add back embedding feature

    * Update README

commit 18b35625c3c19c64b7818a12460ba5ddb006dfdc
Author: Georgi Gerganov <[email protected]>
Date:   Mon Jun 19 20:43:30 2023 +0300

    ggml : fix bug in LBFGS optimizer (found by ggml tests)

commit ba4e85a8339b9dd7cdffad31838235f2fe45a8ea
Author: l3utterfly <[email protected]>
Date:   Mon Jun 19 23:20:06 2023 +0800

    llama : use aligned memory during ggml_init call from loading saved sessions (#1934)

    * fixed issue: memory is not guaranteed to be aligned properly during ggml_init call from loading saved sessions

    * - removed commented out old code from fix
    - updated another instance of same issue below original

commit 23fc5c219a9aebd57c8af3fac454062cc4622980
Author: Georgi Gerganov <[email protected]>
Date:   Mon Jun 19 18:18:34 2023 +0300

    cmake : fix trailing whitespaces

commit cb40dfca694b5cb849837548fd69932117c78362
Author: Kawrakow <[email protected]>
Date:   Mon Jun 19 18:17:03 2023 +0300

    llama : only use Q6_K for output weights if tensor size is multiple of 256 (#1932)

    * Only use Q6_K for output weights if tensor size is multiple of 256

    * Fixed copy/paste mistake

    ---------

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit ca7c3f4da5d144d4cd1dd44903552e6ba49b8ec8
Author: Kawrakow <[email protected]>
Date:   Mon Jun 19 18:14:09 2023 +0300

    cuda : faster k-quants on older GPUs (#1930)

    * k_quants: hopefully much faster Q4_K on older GPUs

    On the GTX-1660 that I have available to represent
    "old GPUs", token prediction drops from 65.5 ms/tok
    to 41.5 ms/tok!

    * k_quants: hopefully much faster Q3_K on older GPUs

    On the GTX-1660 that I have available to represent
    "old GPUs", token prediction drops from 60.3 ms/tok
    to 41.0 ms/tok!

    * k_quants: faster Q2_K on older GPUs

    It looks like I didn't need to change anything
    compared to what we already had, so this is just
    adding clarifying comments. But I now measure
    36.3 ms/tok on the GTX-1660, instead of the
    47.2 ms/tok that I have written in the faster
    k-quants PR.

    * k_quants: faster Q5_K on older GPUs

    68.5 ms/tok -> 62.0 ms/tok on GTX-1660.
    For some reason the same access pattern that leads
    to such resounding success for Q2_K to Q4_K did not
    work at all for Q5_K.

    It is also more difficult to measure because for Q5_K_S
    we only have 32 layers on the GTX-1660, so output, tok embeddings
    and kv cache are done on the CPU.

    ---------

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit b97ca431db35ec96a339a721acb1219c1dd78bed
Author: Georgi Gerganov <[email protected]>
Date:   Mon Jun 19 18:12:33 2023 +0300

    ggml : sync latest ggml repo (#1924)

    * ggml : sync latest ggml repo

    * ggml : remove unused comments

    * ggml : asserts

commit 1e3abfcef073e73c2b31e8570cb06c5cb2fd1f55
Author: Howard Su <[email protected]>
Date:   Mon Jun 19 23:10:37 2023 +0800

    cmake : fix build shared ggml when CUDA is enabled (#1929)

    Co-authored-by: Georgi Gerganov <[email protected]>

commit 16b9cd193965769089881bb8ec012fccca7b37b6
Author: Johannes Gäßler <[email protected]>
Date:   Mon Jun 19 10:23:56 2023 +0200

    Convert vector to f16 for dequantize mul mat vec (#1913)

    * Convert vector to f16 for dmmv

    * compile option

    * Added compilation option description to README

    * Changed cmake CUDA_ARCHITECTURES from "OFF" to "native"

commit b24c3049d96557c24782e4d32feaae65f47277af
Author: Johannes Gäßler <[email protected]>
Date:   Sun Jun 18 17:41:26 2023 +0200

    Added tokens per second to info prints (#1928)

commit 0ede372a51fd8160688e01b587582666c14e94e5
Author: Johannes Gäßler <[email protected]>
Date:   Sun Jun 18 16:07:09 2023 +0200

    Fixed incorrectly applying RMS norm twice (#1925)

commit 8596af427722775f0df4a7c90b9af067ba90d4ef
Author: l3utterfly <[email protected]>
Date:   Sun Jun 18 19:19:16 2023 +0800

    ggml : fix bug in ggml_compute_forward_add_q_f32 (#1918)

commit e1886cf4fe0d0f31661dda52a4a9f34bd9b9009a
Author: Mike <[email protected]>
Date:   Sun Jun 18 16:28:26 2023 +0800

    readme : update Android build instructions (#1922)

    Add steps for using termux on android devices to prevent common errors.

commit 8ab8ba62eb27cc340be2edf3418e051b1d967416
Author: Kawrakow <[email protected]>
Date:   Sun Jun 18 11:13:43 2023 +0300

    llama : prevent usage of k-quants when tensor size is not a multiple of 256 (#1921)

    * Fix examples/metal

    * k-quants: prevent usage when tensor size is not divisible by 256

    ---------

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit 90cc59d6ab1363a5c69c60c4b94db647d3a54a18
Author: Kawrakow <[email protected]>
Date:   Sun Jun 18 10:52:10 2023 +0300

    examples : fix examples/metal (#1920)

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit ce2c7d72e2d06988b5ddec6811ab923254542077
Author: Georgi Gerganov <[email protected]>
Date:   Sun Jun 18 09:09:47 2023 +0300

    metal : handle buffers larger than device's maxBufferLength (#1826)

    * metal : handle buffers larger than device's maxBufferLength

    * metal : print more verbose device info + handle errors

    * metal : fix prints for overlapping views

    * metal : minimize view overlap to try to utilize device memory better

commit 57cd69460f736031a3fc54af1e97c03f80128478
Author: Howard Su <[email protected]>
Date:   Sun Jun 18 12:29:47 2023 +0800

    cmake : add CUDA_ARCHITECTURES to new target ggml_static (#1917)

commit b2416493ab3ab21686d47c96669da6d6c6af08a4
Author: Georgi Gerganov <[email protected]>
Date:   Sat Jun 17 20:55:03 2023 +0300

    make : do not print help for simple example

commit 4f9c43e3bd488b7561119785485e1155dba338d7
Author: Georgi Gerganov <[email protected]>
Date:   Sat Jun 17 20:24:11 2023 +0300

    minor : warning fixes

commit 2c9380dd2f77e41149340f3ecb09764d793b16db
Author: Johannes Gäßler <[email protected]>
Date:   Sat Jun 17 19:15:02 2023 +0200

    Only one CUDA stream per device for async compute (#1898)

commit 051e1b0e6a6e3aee7d989b47760980e6fda5861c
Author: Georgi Gerganov <[email protected]>
Date:   Sat Jun 17 19:30:22 2023 +0300

    llama : fix kv_cache `n` init (close #1903)

commit 86c7571864ff331f8cdb9e092f3abeb123729a56
Author: DaniAndTheWeb <[email protected]>
Date:   Sat Jun 17 18:17:22 2023 +0200

    make : update for latest Arch (#1701)

    With the upcoming change to the openblas package in arch the Makefile workaround is no longer needed.

commit 3d59ec5935ea1d33e9d51060a8dd737169b9b89b
Author: Howard Su <[email protected]>
Date:   Sat Jun 17 23:46:15 2023 +0800

    ggml : fix warnings under MSVC (#1908)

commit 0711a5f6dce7f04c2a791b14bc47f7d4cb545408
Author: Aaron Miller <[email protected]>
Date:   Sat Jun 17 07:37:49 2023 -0700

    metal : add norm, cpy f16->f16, alibi kernels (#1823)

commit fc45a81bc642b9ef33d9004f2b363d558438a6c9
Author: Faez Shakil <[email protected]>
Date:   Sat Jun 17 17:13:05 2023 +0500

    exposed modules so that they can be invoked by nix run github:ggerganov/llama.cpp#server etc (#1863)

commit 794db3e7b982fee37e3995db9c3a216a57ff65e3
Author: Randall Fitzgerald <[email protected]>
Date:   Sat Jun 17 07:53:04 2023 -0400

    Server Example Refactor and Improvements (#1570)

    A major rewrite for the server example.

    Note that if you have built something on the previous server API, it will probably be incompatible.
    Check out the examples for how a typical chat app could work.

    This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.

    Summary of the changes:

    - adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
    - applies missing top k sampler
    - removes interactive mode/terminal-like behavior, removes exclude parameter
    - moves threads and batch size to server command-line parameters
    - adds LoRA loading and matches command line parameters with main example
    - fixes stopping on EOS token and with the specified token amount with n_predict
    - adds server timeouts, host, and port settings
    - adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
    - sets defaults for unspecified parameters between requests
    - removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
    - adds CORS headers to responses
    - adds request logging, exception printing and optional verbose logging
    - adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
    - adds printing an error when it can't bind to the host/port specified
    - fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
    - prints timing and build info on startup
    - adds logit bias to request parameters
    - removes embedding mode
    - updates documentation; adds streaming Node.js and Bash examples
    - fixes code formatting
    - sets server threads to 1 since the current global state doesn't work well with simultaneous requests
    - adds truncation of the input prompt and better context reset
    - removes token limit from the input prompt
    - significantly simplified the logic and removed a lot of variables

    ---------

    Co-authored-by: anon998 <[email protected]>
    Co-authored-by: Henri Vasserman <[email protected]>
    Co-authored-by: Felix Hellmann <[email protected]>
    Co-authored-by: Johannes Gäßler <[email protected]>
    Co-authored-by: Lesaun Harvey <[email protected]>

commit 5ddf7ea1fb42bac21026de2f77e0f9c069b92234
Author: Jiří Podivín <[email protected]>
Date:   Sat Jun 17 12:32:48 2023 +0200

    hooks : setting up flake8 and pre-commit hooks (#1681)

    Small, non-functional changes were made to non-compliant files.
    These include breaking up long lines, whitespace sanitation and
    unused import removal.

    Maximum line length in python files was set to a generous 125 chars,
    in order to minimize number of changes needed in scripts and general
    annoyance. The "txt" prompts directory is excluded from the checks
    as it may contain oddly formatted files and strings for a good reason.

    Signed-off-by: Jiri Podivin <[email protected]>

commit bac19927c302737465a1deb14ac0943a221863e8
Author: Gustavo Rocha Dias <[email protected]>
Date:   Sat Jun 17 06:01:06 2023 -0300

    readme :  alternative way to build for Android with CLBlast. (#1828)

commit b4c6f46f17b6e02f1cd55a81339e7e64f3aaa688
Author: Kerfuffle <[email protected]>
Date:   Sat Jun 17 01:49:42 2023 -0600

    Allow cmake to build ggml as a library (#1896)

    * Allow cmake to build ggml as a library

    * A ggml_static library will be created

    * When BUILD_SHARED_LIBS is enabled, ggml_shared will also be built

commit 92f20d9942c86daeb78637bdad7296a572f4da28
Author: David Yang <[email protected]>
Date:   Sat Jun 17 14:51:54 2023 +0800

    train : get raw text instead of page with html (#1905)

    We probably want to train using just the text of Shakespeare instead of the html of the page displaying his work.

commit d411968e990c37f51328849c96a743dd78f3c3dd
Author: 0cc4m <[email protected]>
Date:   Fri Jun 16 20:59:49 2023 +0200

    opencl : support k-quants (#1836)

    * Porting q2_k kernel to OpenCL

    * Set global and local sizes for kernel calls for dequantizing k-quants

    * Added q6_k kernel

    * Fix q4_k opencl struct order

    * Replace uchar with uint8_t

    * Finish dequant kernels

    * Added OpenCL DMMV kernels

    * Fix q2_k, improve code

    * Fix q3_k

    * Shorten switch statements

    * Improve code formatting

    ---------

    Co-authored-by: Concedo <[email protected]>

commit b41b4cad6f956b5f501db0711dd7007c32b5eee5
Author: SuperUserNameMan <[email protected]>
Date:   Fri Jun 16 20:58:09 2023 +0200

    examples : add "simple" (#1840)

    * Create `simple.cpp`

    * minimalist example `CMakeLists.txt`

    * Update Makefile for minimalist example

    * remove 273: Trailing whitespace

    * removed trailing white spaces simple.cpp

    * typo and comments simple.cpp

    ---------

    Co-authored-by: Georgi Gerganov <[email protected]>

commit 13fe9d2d84f30cab613c960bf66ac83916006694
Author: Zenix <[email protected]>
Date:   Sat Jun 17 03:53:04 2023 +0900

    cmake : add auto detection of BLAS_INCLUDE_DIRS (#1886)

commit ac3b8869538c7fbdb48ff141d78c4dea091789f0
Author: Johannes Gäßler <[email protected]>
Date:   Fri Jun 16 20:25:51 2023 +0200

    llama : fix embd when offloading non-repeating layers (#1891)

commit 5b9ccaf104cc1054d4f8f17bc8a4b8dc949e5527
Author: FrankHB <[email protected]>
Date:   Sat Jun 17 02:25:01 2023 +0800

    Fixed possible macro redefinition (#1892)

    MinGW libstdc++ may define `NOMINMAX` unconditionally. This fixes the case when it is already defined.

commit 9cbf50c041a525d781c7764f493a5443924e4e38
Author: Borislav Stanimirov <[email protected]>
Date:   Fri Jun 16 21:23:53 2023 +0300

    build : fix and ignore MSVC warnings (#1889)

commit 3d0112261042b356621e93db3fa4c6798a5d098f
Author: Kawrakow <[email protected]>
Date:   Fri Jun 16 20:08:44 2023 +0300

    CUDA : faster k-quant dot kernels (#1862)

    * cuda : faster k-quant dot kernels

    * Improve Q2_K dot kernel on older GPUs

    We now have a K_QUANTS_PER_ITERATION macro, which should be
    set to 1 on older and to 2 on newer GPUs.
    With this, we preserve the performance of the original
    PR on RTX-4080, and are faster compared to master on
    GTX-1660.

    * Improve Q6_K dot kernel on older GPUs

    Using the same K_QUANTS_PER_ITERATION macro as last commit,
    we preserve performance on RTX-4080 and speed up
    Q6_K on a GTX-1660.

    * Add LLAMA_CUDA_KQUANTS_ITER to CMakeLists.txt and Makefile

    Allowed values are 1 or 2. 2 gives the best performance on
    modern GPUs and is set as default. On older GPUs 1 may work
    better.

    * PR comments

    ---------

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit 602c748863e15270d80d74aa2c3bf86ab8139e07
Author: Borislav Stanimirov <[email protected]>
Date:   Fri Jun 16 09:58:11 2023 +0300

    gitignore : add several entries specific to Visual Studio (#1888)

commit a09f9195be39afb4b023b646c0a6ec8a86915174
Author: Johannes Gäßler <[email protected]>
Date:   Thu Jun 15 21:49:08 2023 +0200

    Fixed CUDA runtime version check (#1879)

commit bed92756172d4514b23aaf9744cf8e2dc892fc7b
Author: Georgi Gerganov <[email protected]>
Date:   Thu Jun 15 21:56:50 2023 +0300

    cmake : remove whitespaces

commit c36e81da62ebfe09a768201cc44fa8d712dd00ed
Author: yangli2 <[email protected]>
Date:   Thu Jun 15 11:05:53 2023 -0700

    examples : add chat-vicuna.sh (#1854)

    Co-authored-by: Yang Li <[email protected]>

commit 3559433fecedf365e7aba2fe3d5f89d9abb817c1
Author: Igor Okulist <[email protected]>
Date:   Thu Jun 15 12:51:26 2023 -0500

    cmake : set include path for OpenBlas (#1830)

commit 69b34a0e80300bfb3e996983ac3ea075f5526675
Author: Frederik Vogel <[email protected]>
Date:   Fri Jun 16 02:47:04 2023 +0900

    swift : Package compile breaks due to ggml-metal.metal (#1831)

    * Ignore metal file in spm

    * Add ggml.h to spm public Headers

    ---------

    Co-authored-by: Vogel Frederik <[email protected]>

commit cf267d1c71a781700698f8518e903239c3bcc929
Author: daboe01 <[email protected]>
Date:   Thu Jun 15 19:42:48 2023 +0200

    make : add train-text-from-scratch (#1850)

    * make finetuning example accessible

    * fixed: targed was in wrong line

    * fixed: name of executable was wrong

    * fixed: naming of binary

    * fixed: model path was wrong

    * fixed clean target

    * Update examples/train-text-from-scratch/README.md

    ---------

    Co-authored-by: Georgi Gerganov <[email protected]>

commit 9dda13e5e1f70bdfc25fbc0f0378f27c8b67e983
Author: Srinivas Billa <[email protected]>
Date:   Thu Jun 15 18:36:38 2023 +0100

    readme : server compile flag (#1874)

    Explicitly include the server make instructions for C++ noobsl like me ;)

commit 37e257c48e350cf03c353c10d31e777f8d00123d
Author: sandyiscool <[email protected]>
Date:   Thu Jun 15 23:06:06 2023 +0530

    make : clean *.so files (#1857)

commit 64cc19b4fe3df03bc20e520aa111c30cff3a655e
Author: Howard Su <[email protected]>
Date:   Fri Jun 16 01:29:59 2023 +0800

    Fix the validation of main device (#1872)

commit 4bfcc855abdb2c9fcc3c5a84747974521909fa41
Author: Georgi Gerganov <[email protected]>
Date:   Thu Jun 15 20:29:48 2023 +0300

    metal : parallel command buffer encoding (#1860)

    * metal : parallel command buffer encoding

    * metal : determine number of command buffers based on gf->n_threads

commit 6b8312e7979b852f6b6ac9d29cd51fda16c17948
Author: Johannes Gäßler <[email protected]>
Date:   Thu Jun 15 19:06:46 2023 +0200

    Better error when using both LoRA + GPU layers (#1861)

commit 254a7a7a5ff4c874ff8488f1f5cbdd7e9c89d682
Author: Johannes Gäßler <[email protected]>
Date:   Wed Jun 14 19:47:19 2023 +0200

    CUDA full GPU acceleration, KV cache in VRAM (#1827)

    * Fixed CUDA RoPE

    * ggml_cuda_mul_mat_vec_p021

    * ggml_cuda_scale

    * ggml_cuda_diag_mask_inf

    * ggml_is_permuted

    * ggml_cuda_cpy

    * flatten rows for ggml_cuda_op

    * Added a --low-vram option

    * Fixed Windows performance

    * Fixed LLAMA_CUDA_DMMV_Y > 1 for WizardLM

commit 92549202659fc23ba9fec5e688227d0da9b06b40
Author: 0xspringtime <[email protected]>
Date:   Tue Jun 13 15:37:54 2023 -0400

    baby-llama : fix operator!= (#1821)

    * Update baby-llama.cpp

    Seems to be an error in the implementation of the operator!= function. It attempts to compare the this pointer (a llama_hparams_lora object) with the other pointer (a llama_hparams object) using memcmp. This can lead to incorrect results because the sizes of the objects being compared (sizeof(llama_hparams) and sizeof(llama_hparams_lora)) are different, should now be able to compare two llama_hparams_lora objects for inequality.

    * Update baby-llama.cpp

    * Update baby-llama.cpp

commit e32089b2c20b1b87b22912f4a8b93fe01647d5b9
Author: xaedes <[email protected]>
Date:   Tue Jun 13 21:04:40 2023 +0200

    train : improved training-from-scratch example (#1652)

    * add python wrapper

    https://gist.github.com/abetlen/2b90e5f153f6efd00931d098de5c73ce

    * fix decoding error. adds errors=ignore parameter

    * add python bindings for functions to get and set the whole llama state
    (rng, logits, embedding and kv_cache)

    * update python bindings

    * add text generating baby-llama from scratch example

    * fix race condition bug in ggml_compute_forward_diag_mask_f32

    * implement ggml_soft_max_back for more performant backward pass of soft_max

    avoids creating big intermediate matrices of size n_embd x n_embd for llama layers and n_vocab x n_vocab for cross entropy loss

    * improve softmax backward pass

    go from quadratic runtime to linear runtime by simplifying the formulas

    * fix race condition bug in non-inplace ggml_compute_forward_diag_mask_f32

    memcpy needs to be synchronized across threads to avoid race conditions.
    => do it in INIT phase

    * fix bug in ggml_compute_forward_soft_max_back_f32 on DEBUG build

    * improve performance of mul_mat backward pass

    avoid transpose by using mul_mat with swapped arguments

    * avoid printing too much newlines in baby-llama-text

    * activate threading in baby-llama-text

    * add ggml_out_prod and use it for mul_mat backward pass for improved performance

    performance stats report improvement from 37 seconds to 16 seconds runtime during my training tests

    * better weight initialization improves training convergence at start

    * better weight initialization improves training convergence at start

    * improve ggml_out_prod performance

    - change iteration order (>15s -> 10s runtime)
    - parallelize over one more dimension: over dst matrix rows (10s -> <5s runtime)

    * add llama sampler, shuffle samples and constrain sampling to tokens occurring in train data

    * fix get_samples call, add model tensor names, increase model size, start training samples after newline

    * save train trained model to checkpoint and load model to be trained from checkpoint

    * use inplace functions where possible

    * initialize rng with srand

    * use different arguments for input and output checkpoint

    * ggml fixes to support backward pass on inplace operations

    * remove duplicate include

    * fix cross entropy loss

    - add target probabilities for each sample which is then used in cross entropy loss

    * print used memory before and after optimization

    * sample with non-greedy sampling parameters at the end of training

    * add cmake target for baby-llama-text

    * add ggml_add1_inplace to header

    * enable gradient propagation for inplace add1 and scale operations

    those functions backward passes don't need the original src0, so they also work when forward is inplace

    * implement AdamW in ggml_opt_adam by adding weight decay parameter (default 0.001f)

    also add a schedule parameter (default 1.0f) that can be used to scale alpha and decay according to learning schedule.
    setting the decay parameter to zero disables AdamW resulting in normal Adam optimizer.

    since the difference between Adam and AdamW is minimal it is not implemented as another optimizer, but integrated into the existing Adam optimizer.

    * use inplace operations in cross_entropy_loss

    * fix random weight initialization scale

    * add missing default parameters for adam optimizer

    * add ggml_opt_context, so that we can properly resume training

    otherwise the optimizer states, tracking statistics about the error function and its derivates,
    will reset to zero each time ggml_opt is called, hindering convergence on resumed training.

    now the optimizer context and all its memory is stored in a separate struct.

    * fix bug in llama_sample_token_mirostat_v2

    when all candidates are filtered out through mu threshold, the following soft_max operation will fail.
    so keep at least one.

    * add forward function without using cache, for more performant training

    during training on whole samples no cache is required.
    removing the cache and simplifying the remaining code results in performance and memory usage improvement.

    * print suppressed newline tokens as string "\n"

    printing too much actual newlines is suppressed to avoid flooding the console.

    * store optimizer state in training checkpoint and add learning schedule

    persistent optimizer state allows to resume training without resetting the optimizer
    learning schedule consists of linear warmup ramp followed by cosine decay with restarts

    * remove unused functions

    * fix bug in get_samples which corrupted training targets

    * save checkpoint only when it was trained

    * simplify code

    * remove trailing whitespace

    * simplify backward pass for SQRT

    * replace inefficient repeat backward pass with dedicated repeat_back operation

    * add ggml_cross_entropy_loss with backward pass for faster training

    cross entropy loss can also be implemented using softmax and log, but as dedicated operation it is faster and especially avoids unnecessary memory overhead.

    * add tests for cross_entropy_loss backward pass

    finite differences regularly results in estimated gradient of zero, despite the backward pass giving non zero gradient.
    _probably_ the finite differences fails due to numerical issues

    * use ggml_cross_entropy_loss in text training example

    * remove trailing whitespace

    * slightly improve how cross entropy loss is computed

    btw: directly implemented cross entropy loss seems to have way lower magnitudes than when implemented with softmax and log.
    probably the input to log gets closer to zero due to float numerics.
    maybe the multiplication by (1.0-eps)/sum is more accurate..

    * add llama_get_vocab to get the vocabulary as output parameters

    * set default model.type for unknown models with few layers

    * add export of training checkpoint to llama compatible model file

    * get vocabulary for exporting training checkpoint to llama compatible model file

    * implement backward pass of flash attention

    * bugfixes for backward pass of flash attention

    * test flash attention backward pass

    need to set loose error bounds to pass.
    the finite differences are close to numeric limits and often return quite different values than the backward pass.
    reducing eps further lets the gradients vanish completely.
    likewise setting eps too big results in worse values.
    the softmax in the middle of the function is probably the most responsible for the numeric issues using finite differences.

    * add option to train with flash attention and move options to the top of the main function

    training from scratch also works with flash attention
    training convergence and generation results after fix number of iterations are worse than when not using flash attention.
    maybe there still lingers a bug in the flash attention backward pass?
    but training works, just with slower convergence.

    flash attention is still worth to use, because it requires way less memory and is faster with high n_ctx

    * add train_params and command line option parser

    * remove unnecessary comments

    * add train params to specify memory size

    * remove python bindings

    * rename baby-llama-text to train-text-from-scratch

    * replace auto parameters in lambda function

    * add #include <climits>

    * add explicit cast to fix compile error

    "error: non-constant-expression cannot be narrowed from type 'int64_t' (aka 'long long') to 'uint32_t' (aka 'unsigned int') in initializer list [-Wc++11-narrowing]"

    * remove trailing whitespace

    * add ggml_opt_resume_g which accepts forward and backward cgraphs

    * fix formulas in comments

    * bug fix for ggml_compute_forward_get_rows_back_f32

    the result should be set to zero, not to whatever data is in opt0

    * improve training memory usage with scratch buffers

    instead of relying on the automatic backward pass, we manually create the graph for the backward pass.
    it turns out that all backward pass operations need only temporary memory which can be reused after each layer.

    will compute backward pass for ALL model parameters

    * add option to use scratch buffers in training or not

    make it configurable because currently training with scratch buffers implies flash attention and optimization over all parameters.

    * ci : disable temporary

    * store view offset and permute axes in opt[0] instead of storing it in padding

    use memcpy to store offset, because offset is of type size_t.
    when storing it as int32_t offset would have to be smaller than 2^31 which is not necessarily true.

    * minor : fix compile warnings + minor style changes

    * fix bug in threaded indices calculation of ggml_compute_forward_flash_attn_back_f32

    * store view offset like in master branch

    * bug fix in forward_batch_wo_cache_flash_attn_train

    * scratch buffer bug fixes in forward_batch_wo_cache_flash_attn_train

    data of permute and reshape is the same as their input.
    if we want to preserve the output of permute/reshape, we also need to preserve their inputs.

    replace reshape(src0, src1) with reshape_nd calls so that we don't need src1.

    replace (temporary) t03 with ggml_repeat(ctx0, layer.attention_norm, t02).
    in the future we could also use the new broadcasting ggml_mul to avoid these repeat calls.
    for this we need backward pass of broadcasting ggml_mul.

    * remove unnecessary scratch buffer 0

    buf 0 is persistent memory, so we can just disable scratch for this by using buf -1

    * avoid creating unnecessary grad tensors

    previously we need to create grads for model parameters, so that expand(..) correctly populates cgraph->leafs & cgraph->grads
    this wasted memory, because unnecessary grad for each op were automatically created:
    the automatically generated grad was unnecessary because we later manually set the grad (e.g. t35->grad = expand(gb, ...) ).
    this discarded the automatically generated grad resulting in wasted memory.

    improved this by changing expand(..) to not use ggml_build_forward_expand.
    expand set cgraph->nodes but not the leafs.
    cgraph->leafs & cgraph->grads are set in another pass after the last expand call.

    * print used training seed

    * zero initialize gfbuf and gbbuf

    * ci : re-enable workflows + add README for training

    ---------

    Co-authored-by: Georgi Gerganov <[email protected]>

commit 2347e45e7bdb09c9a7d74b2c0bc86c2b65f0c343
Author: Georgi Gerganov <[email protected]>
Date:   Tue Jun 13 20:20:07 2023 +0300

    llama : do a warm-up eval at start for better timings (#1824)

commit 74d4cfa3438cb58bd177eed30014e6588694aaa8
Author: Kerfuffle <[email protected]>
Date:   Tue Jun 13 04:23:23 2023 -0600

    Allow "quantizing" to f16 and f32 (#1787)

    * Allow "quantizing" to f16 and f32

    Fix an issue where quantizing didn't respect LLAMA_NO_K_QUANTS

    Add brief help to the list of quantization types in the quantize tool

    Ignore case for quantization type arguments in the quantize tool

commit 74a6d922f12ccfe16b0c265f43be8978c6f25e98
Author: Kawrakow <[email protected]>
Date:   Mon Jun 12 22:39:21 2023 +0300

    Metal implementation for all k_quants (#1807)

    * metal : improve q4_K

    28.3 -> 26.0 ms/token by avoiding a branch in the
    calculation of the scales.

    * metal : small improvement for Q4_K

    * metal : still optimizing Q4_K

    This commit pushes it down to 25.3 ms / token.

    The crazy idea of using 6 bits for the scales is really costly on
    Metal: if I remove the bit fiddling necessary to make the block
    scales, time goes almost to the Q4_0 23 ms/token.

    Before pushing the k-quants upstream I had a Q4_K variant that
    had used 8-bit scales. It wasn't more accurate, used 0.125 bits more per weight,
    was running slightly slower on the CPU (due to the larger model size
    and being memory bound there), and the difference was entirely
    negligible under CUDA. So, I decided to publish the version with 6-bit
    scales. Perhaps I should re-consider and change to 8-bit scales?

    * metal : some more optimizations

    Q2_K: 25.4 ms/token
    Q6_K: 27.3 ms/token
    Q4_0: 22.8 ms/token
    Q4_1: 23.1 ms/token

    * metal : Q3_K support

    Something is not quite right yet.

    * metal : Q5_K support

    Initial version achieves 31.2 ms/token, 210 GB/s

    * metal : still not able to figure out why q3_K does not work

    * Minor

    * metal : yet another failed attempt to make q3_K work

    * metal : optimize Q5_K

    31.2 ms -> 27.8 ms.
    250 GB/s.

    * metal : q3_K still not working

    Adding a heavily commented q3_K metal kernel to explain
    my obviously faulty logic. Perhaps someone could spot the issue?

    * metal : q3_K finally working

    Not optimized at all.

    What was the issue? The scales are not 4-byte aligned,
    and I was accessing them with a uint32_t pointer.
    When I tried that on CUDA, I got an error (illegal memory access)
    and added a memcpy to a local array of 3 uint32_t's.
    But on Metal it told me there is no memcpy, so I tried
    accessing directly. There is no error, just garbage results.
    At some point I did try accessing the scales with an uint16_t
    pointer (the scales are for sure 2-byte aligned), but was
    still getting garbage. I guess, there must have been another bug.

    Now access to the scales is via a uint16_t pointer and, after starting
    over from the C dequantize function, it finally works (see the illustrative sketch below).

    * metal : Q3_K 1st optimization pass

    * metal : Q3_K second optimization pass - 29.6 ms/token

    * metal : Q3_K cleanup

    * metal : fixed accidentally broken Q2_K

    ---------

    Co-authored-by: Iwan Kawrakow <[email protected]>
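
    For context, a minimal sketch of the memcpy workaround mentioned above for the
    unaligned scales (the CUDA-style fix): the names and sizes here are illustrative,
    not the actual kernel code.

        #include <stdint.h>
        #include <string.h>

        // scales: 12 bytes inside the quantized block, 2-byte but not 4-byte aligned
        static void read_scales(const uint8_t * scales, uint32_t aux[3]) {
            // Dereferencing (const uint32_t *) scales directly is undefined behavior
            // on misaligned data; copying into a local, properly aligned array is safe.
            memcpy(aux, scales, 3 * sizeof(uint32_t));
        }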

commit e4caa8da59c1c97dc23fa336f4d726984a20560f
Author: slaren <[email protected]>
Date:   Mon Jun 12 19:12:47 2023 +0200

    ci : run when changing only the CUDA sources (#1800)

commit 58970a4c39124a647ac2a640d9e178ea6c961e65
Author: Howard Su <[email protected]>
Date:   Mon Jun 12 20:44:16 2023 +0800

    Leverage mmap for offloading tensors to GPU (#1597)

    * Rebase to latest

    * Show progress

    * Add assert to make sure we only allocate temp buffer for non-CPU backend tensor

    Co-authored-by: Johannes Gäßler <[email protected]>

    ---------

    Co-authored-by: Johannes Gäßler <[email protected]>

commit 8c0a10e64dbf60fd9946c0cd5e6f59690800b123
Author: Kawrakow <[email protected]>
Date:   Mon Jun 12 14:31:36 2023 +0300

    metal : fix failure to load model (#1817)

    The number of buffers in the ggml context was left uninitialized.
    This led to sporadic failures to load the model on
    startup. It is actually strange that the failure occurred so
    infrequently.

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit fa84c4b3e80199a5683438f062009c031a06c4fa
Author: Kerfuffle <[email protected]>
Date:   Sun Jun 11 08:19:17 2023 -0600

    Fix issue where interactive mode crashes when input exceeds ctx size (#1789)

    * Fix issue where interactive mode in the main example crashes when input exceeds ctx size

    * Ensure the context size is at least 8 tokens in the main example.

    Closes #1768

commit 12b063f0ecf280e98028e444fc492ee6222cdcdc
Author: Kyle Liang <[email protected]>
Date:   Sun Jun 11 21:20:52 2023 +0800

    Fixed WSL cuda's OOM error (#1594)

    * In the function , add the cuda error bypass.

    * remove excessive codes and prints

    ---------

    Co-authored-by: liang <[email protected]>

commit 31d2b5f4a4bae081e59b36ab37c6ff6f5b5940ad
Author: Ryan Landay <[email protected]>
Date:   Sun Jun 11 17:38:53 2023 +0800

    Update SHA256SUMS with current hashes for models quantized using q4_0 (#1798)

commit 4de0334f5cabf4696eced2e5d6e279fdfaa6c0f2
Author: Georgi Gerganov <[email protected]>
Date:   Sat Jun 10 22:56:53 2023 +0300

    cmake : fix Metal build (close #1791)

commit 3f1223155a462477ac933474ebc4eab0ce3ca264
Author: Artyom Lebedev <[email protected]>
Date:   Sat Jun 10 22:51:36 2023 +0300

    k-quants : GCC12 compilation fix (#1792)

commit 303f5809f1b4ec49823dbe70cacd2124ec1d0df0
Author: Andrei <[email protected]>
Date:   Sat Jun 10 10:47:34 2023 -0400

    metal : fix issue with ggml-metal.metal path. Closes #1769 (#1782)

    * Fix issue with ggml-metal.metal path

    * Add ggml-metal.metal as a resource for llama target

    * Update flake.nix metal kernel substitution

commit 059e99066d95d73d1ca26c3375d47c0e35596229
Author: Aisuko <[email protected]>
Date:   Sun Jun 11 00:08:11 2023 +1000

    doc : fix wrong address of BLIS.md (#1772)

    Signed-off-by: Aisuko <[email protected]>

commit 17c10acfb44ecb7af25e37fb67b9501cbc0034d2
Author: Georgi Gerganov <[email protected]>
Date:   Sat Jun 10 12:06:45 2023 +0300

    ggml : force no_alloc == false when creating opt tensors (close #1699)

    This is needed to make operators like ggml_view() be able to store their
    parameters in the ggml context's memory and not get discarded when
    no_alloc is true

commit e9b66ee9829039d4ab54550d6222e42a0b31e52a
Author: Kawrakow <[email protected]>
Date:   Sat Jun 10 11:28:11 2023 +0300

    metal : add Q4_1 implementation (#1785)

    23.3 ms / token, so just ~1% slower than q4_0.
    Achieves 290 GB/s memory throughput.

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit 4f0154b0bad775ac4651bf73b5c216eb43c45cdc
Author: Kerfuffle <[email protected]>
Date:   Sat Jun 10 01:59:17 2023 -0600

    llama : support requantizing models instead of only allowing quantization from 16/32bit (#1691)

    * Add support for quantizing already quantized models

    * Threaded dequantizing and f16 to f32 conversion

    * Clean up thread blocks with spares calculation a bit

    * Use std::runtime_error exceptions.

commit ef3171d16241c18581d4d08374f0b9e396ade6b7
Author: Xingchen Song(宋星辰) <[email protected]>
Date:   Sat Jun 10 15:49:40 2023 +0800

    ggml : workaround for missing _mm256_setr_m128i in GCC < 8 (#1638)

commit 555275a693843273759230547001f9ae07fb537e
Author: rankaiyx <[email protected]>
Date:   Sat Jun 10 14:41:59 2023 +0800

    make : add SSSE3 compilation use case (#1659)

commit 98ed16557432d7a5179c57eddcc3a08a7ae6d54d
Author: Robert Sung-wook Shin <[email protected]>
Date:   Sat Jun 10 01:24:40 2023 +0900

    OpenCL: Add release memory (#1741)

    * Add opencl release memory

    * Rename function name

commit ae9663f1887513e152839e91f61c513075a19422
Author: Johannes Gäßler <[email protected]>
Date:   Fri Jun 9 13:58:15 2023 +0200

    Windows nvcc workaround (#1753)

    Fix gibberish output on Windows when using CUDA

commit b33dee282f5d8032b5f780152732dc45cbf2d349
Author: Georgi Gerganov <[email protected]>
Date:   Fri Jun 9 11:11:04 2023 +0300

    metal : fix build "tanhf" -> "tanh"

commit 92f44ff7f778ef1b94028b2ba6d39943b5ca0ada
Author: AT <[email protected]>
Date:   Fri Jun 9 04:00:51 2023 -0400

    metal : add GELU implementation (#1770)

    Co-authored-by: Adam Treat <[email protected]>

commit 245fc3c37da5ac5963f9f11a9f4f2ac08d96afc6
Author: Kawrakow <[email protected]>
Date:   Fri Jun 9 10:39:59 2023 +0300

    metal : faster q4_0 (#1775)

    * metal : 8% faster q4_0

    Avoid copying into local uchar4 and float4.

    * metal : 17% faster Q4_0

    Use 64 threads in a thread group.

    ---------

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit 72ff5282bf0388c60821f504c4c8cc2b1f491aa6
Author: Kawrakow <[email protected]>
Date:   Thu Jun 8 22:28:21 2023 +0300

    metal : add Q2_K implementation (#1762)

    * metal : add Q2_K implementation

    27.1 ms / token on M2 Max 30-core GPU, so about the
    same speed as Q4_0. Memory throughput is ~156 GB/s.

    The access pattern used in the Q2_K
    CUDA implementation resulted in significantly lower
    performance (~31 ms/token).

    * Fixing merge conflicts

    ---------

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit 0bf7cf1b296fc9fca05411b37afdf08a531487d2
Author: Georgi Gerganov <[email protected]>
Date:   Thu Jun 8 20:48:14 2023 +0300

    Revert "ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)"

    This reverts commit 8432d4d9f716b25133e3ed671d91e21f6f3be867.

commit 8432d4d9f716b25133e3ed671d91e21f6f3be867
Author: le.chang <[email protected]>
Date:   Fri Jun 9 00:47:56 2023 +0800

    ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)

commit 0f291e1f65c1d68201e71ce99c89562a36686b6d
Author: Kawrakow <[email protected]>
Date:   Thu Jun 8 19:46:22 2023 +0300

    metal : Q6_K implementation (#1752)

    * Metal implementation for Q4_K

    Very slow for now:
    42 ms / token, Q4_0 runs in 28 ms/token on my
    30-core M2 Max GPU.

    * Optimizing Q4_K on metal

    The first token always takes longer, I guess because
    the metal kernel is being jit-compiled.
    So, using n = 128 to measure time.

    At this point Q4_K takes 29.5 ms / token
    compared to 27.2 ms / token for Q4_0.
    Quite a bit better than the initial attempt,
    but still not good enough.

    * Optimizing q4_K metal dot some more

    For n = 256 it is now 28.1 ms/token compared to
    27 ms/token for q4_0.

    * Fix after merge with master

    * Metal implementation for Q6_K

    Similar to the CUDA implementation.
    No idea if this is the optimum for Metal, but the few
    alternative variants I tried all had a lower performance.

    We get 36.5 ms / token on M2 Max with 30 GPU cores.
    This corresponds to ~200 GB/second throughput.

    * clang-tidy : add config back

    * Much better Q6_K implementation for metal

    28.3 ms / token for 7B. Subtracting ~9 ms that is spent in
    other compute graph operations, we are left with ~19 ms
    for the matrix multiplications. The model is ~5.5 GB,
    so we are getting 1000 / 19 * 5.5 = 290 GB/s!

    ---------

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit 8fc8179919a11738910db07a800f2b176f8adf09
Author: qingfengfenga <[email protected]>
Date:   Thu Jun 8 15:58:53 2023 +0800

    Add llama.cpp docker support for non-latin languages (#1673)

    * Modify Dockerfile default character set to improve compatibility (#1673)

commit b50b570ed9d699d3d126d72fc02de92926bcd937
Author: Steven Roussey <[email protected]>
Date:   Thu Jun 8 00:12:28 2023 -0700

    ggml : fix fprintf warnings (#1720)

commit 53aba3f393f2e02a78ddaba2e934893a8bbf3246
Author: Georgi Gerganov <[email protected]>
Date:   Thu Jun 8 10:09:08 2023 +0300

    clang-tidy : restore dot file from accidental deletion

commit 4161bdc04debb70bf5f275492b4d89fd9330087c
Author: Kawrakow <[email protected]>
Date:   Thu Jun 8 10:08:23 2023 +0300

    metal : add Q4_K implementation (#1733)

    * Metal implementation for Q4_K

    Very slow for now:
    42 ms / token, Q4_0 runs in 28 ms/token on my
    30-core M2 Max GPU.

    * Optimizing Q4_K on metal

    The first token always takes longer, I guess because
    the metal kernel is being jit-compiled.
    So, using n = 128 to measure time.

    At this point Q4_K takes 29.5 ms / token
    compared to 27.2 ms / token for Q4_0.
    Quite a bit better than the initial attempt,
    but still not good enough.

    * Optimizing q4_K metal dot some more

    For n = 256 it is now 28.1 ms/token compared to
    27 ms/token for q4_0.

    * Fix after merge with master

    ---------

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit 0035858273ebe0694926bf4414d279f3e1cd109d
Author: johnson442 <[email protected]>
Date:   Thu Jun 8 08:02:48 2023 +0100

    k-quants : add missing compile definition to CMakeLists (#1748)

commit 5c64a0952ee58b2d742ee84e8e3d43cce5d366db
Author: Georgi Gerganov <[email protected]>
Date:   Wed Jun 7 10:59:52 2023 +0300

    k-quants : allow to optionally disable at compile time (#1734)

    * k-quants : put behind optional compile flag LLAMA_K_QUANTS

    * build : enable k-quants by default

commit 5b57a5b72676540b6a45a3f527126299969ad241
Author: jacobi petrucciani <[email protected]>
Date:   Wed Jun 7 00:15:31 2023 -0400

    flake : update to support metal on m1/m2 (#1724)

commit 4dc62c545df0af60635d579e9e4dd91bc5afff51
Author: Georgi Gerganov <[email protected]>
Date:   Wed Jun 7 07:15:08 2023 +0300

    readme : add June roadmap

commit 35a84916fb029905c44746127026079268216e7a
Author: Willy Tarreau <[email protected]>
Date:   Wed Jun 7 04:10:17 2023 +0200

    main: add the possibility to open the prompt cache read-only (#1640)

    The prompt cache constitutes a nice speed up when using the same prompt
    prefix across multiple evaluations, but when using it, it will also be
    updated, which is not always desirable. One use case is to have a large
    prompt containing some context and usage rules, and a second part
    containing variable data of the problem being studied. In this case it's
    desirable to be able to save the first part once, and to always reuse it
    as-is without updating it with the second part.

    The new argument --prompt-cache-ro enables this read-only mode on the
    prompt cache. The prompt's contents that match the cache are loaded
    from the cache but the rest is not modified. This allowed reducing a
    total analysis time from 112s to 49.7s here, without having to back up
    and restore a copy of the prompt, which takes significant time at 500
    MB.

    Signed-off-by: Willy Tarreau <[email protected]>
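
    As a usage illustration (file names are placeholders; --prompt-cache and -f are
    existing options of the main example), the read-only mode would typically be
    combined with a previously built cache roughly like this:

        # first run: build the cache for the fixed prefix
        ./main -m models/7B/ggml-model-q4_0.bin --prompt-cache prefix.bin -f prefix.txt
        # later runs: reuse the cached prefix without ever rewriting it
        ./main -m models/7B/ggml-model-q4_0.bin --prompt-cache prefix.bin --prompt-cache-ro -f full_prompt.txt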

commit 2d7bf110edd8c49209401a16132052cba706ffd0
Author: Georgi Gerganov <[email protected]>
Date:   Tue Jun 6 22:54:39 2023 +0300

    llama : fix vram_scratch var

commit 2a4e41a086ce80da68c402457c75c77e52dcc698
Author: Georgi Gerganov <[email protected]>
Date:   Tue Jun 6 22:41:53 2023 +0300

    llama : fix compile warnings

commit 17366df842e358768c0df7024484fffecfc7865b
Author: Johannes Gäßler <[email protected]>
Date:   Tue Jun 6 21:33:23 2023 +0200

    Multi GPU support, CUDA refactor, CUDA scratch buffer (#1703)

    * CUDA multi GPU + scratch

    ggml_cuda_compute_forward

    Tensor parallelism

    ggml_cuda_add

    ggml_cuda_rms_norm

    ggml_cuda_silu

    CUDA scratch buffer

    --main-gpu CLI option

commit 44f906e8537fcec965e312d621c80556d6aa9bec
Author: Georgi Gerganov <[email protected]>
Date:   Tue Jun 6 20:16:57 2023 +0300

    metal : add f16 support

commit d5b111f53d14972669eb52055f9df2567663ad8b
Author: LostRuins <[email protected]>
Date:   Wed Jun 7 01:00:01 2023 +0800

    Clblast fixes + enhancements to save VRAM and offload more layers (#1675)

    * Use events instead of clFinish, where possible

    * OpenCL: Don't load gpu layers into RAM, add mul_f32 kernel

    * Reduce queueing overhead for contiguous tensors by using single mul kernel call

    * Adapt to #1612 cl_mem malloc changes

    * Reduce code duplication between cuda and opencl branches

    * Improve implementation

    * Clblast fixes + enhancements to save VRAM:

    1. Change all Clblast buffers to CL_MEM_READ_WRITE, as the pool malloc currently doesn't properly handle them.
    2. When recycling buffers in pool malloc, always assign the SMALLEST available buffer that fits, instead of the FIRST available buffer
    3. When failing to recycle a buffer in pool malloc (all too small), instead recycle the largest available free buffer by resizing it (a minimal sketch of this recycling policy is shown below).

    * change max value size_t to use limits

    * removed flags from the CL pool malloc, apply code tidying suggestions.
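
    A minimal sketch of the recycling policy from points 2 and 3 above, with
    hypothetical names rather than the actual OpenCL pool code: pick the smallest
    free buffer that fits, and fall back to resizing the largest one if none fits.

        #include <cstddef>
        #include <vector>

        struct PoolBuffer { void * ptr; size_t size; };

        // Returns the index of the buffer to reuse for a request of `size` bytes,
        // or -1 if the free list is empty and a fresh allocation is needed.
        static int pick_buffer(const std::vector<PoolBuffer> & free_bufs, size_t size) {
            int best    = -1; // smallest buffer that is still large enough
            int largest = -1; // fallback: largest buffer, to be resized by the caller
            for (int i = 0; i < (int) free_bufs.size(); ++i) {
                const size_t s = free_bufs[i].size;
                if (s >= size && (best == -1 || s < free_bufs[best].size)) {
                    best = i;
                }
                if (largest == -1 || s > free_bufs[largest].size) {
                    largest = i;
                }
            }
            // caller resizes free_bufs[largest] when best == -1 and largest != -1
            return best != -1 ? best : largest;
        }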

commit 2d43387dafe9c60f15f57aa23ee0b37864b98b32
Author: Georgi Gerganov <ggerga…
byroneverson added a commit to byroneverson/llm.cpp that referenced this pull request Jun 30, 2023
commit b8c8dda75fdf5fdea49c80af36818e7c30fe0ddf
Author: Howard Su <[email protected]>
Date:   Thu Jun 29 21:15:15 2023 +0800

    Use unsigned for random seed (#2006)

    * Use unsigned for random seed. Keep -1 as the value to use a time-based seed.

    Co-authored-by: Georgi Gerganov <[email protected]>

commit 96a712ca1b7f427e3bd7ffc0c70b2105cfc7fbf1
Author: LostRuins <[email protected]>
Date:   Thu Jun 29 11:56:43 2023 +0800

    Porting the improved K-Quant CUDA kernels to OpenCL (#1966)

    * Added broken new q4k quant

    * xx + ib0

    * Fix q2_k fast kernel

    * Use preprocessor for QK_K

    * Add q6_k fast matmul kernel

    * ported q3k speedup successfully

    * ported q2k and q5k speedups

    * remove old dot kernels and template

    * fixed global const struct types

    * fixing address spaces

    * fixed string too long CI issue

    ---------

    Co-authored-by: 0cc4m <[email protected]>

commit d3494bb86bf7ad5b0b60aae0220ea576f273b5c0
Author: m3ndax <[email protected]>
Date:   Wed Jun 28 20:39:08 2023 +0200

    llama : replacing auto &kv with const auto &kv (#2041)

    * Replacing auto &kv with const auto &kv

    * Create codacy.yml

    * Delete codacy.yml

commit 5b351e94d041742cd50ffcf2d44718d63bab398a
Author: Salvador E. Tropea <[email protected]>
Date:   Wed Jun 28 14:27:31 2023 -0300

    cuda : remove nchannels_x argument from mul_mat_vec_nc_f16_f32 (#2028)

    - Not used

commit 6432aabb6dc887436e4d57414b63116189c3b13b
Author: Salvador E. Tropea <[email protected]>
Date:   Wed Jun 28 14:26:26 2023 -0300

    cuda : fix missing const qualifier in casts (#2027)

commit b922bc351b69770cec2d35d2aa50fa052b95ca93
Author: Howard Su <[email protected]>
Date:   Wed Jun 28 10:13:02 2023 -0700

    llama : remove shards weight file support (#2000)

    * Remove multiple shards

    * Remove multiple file loaders

    * Remove llama_load_tensor_shard class

    * Simplify load logic

    * Remove dead code guess_n_parts function

    * Remove vocab_only from constructor of llama_model_loader

    * Remove alignment_prevents_mmap, which is no longer needed.

    * Remove useless check

commit 7f9753fa1263c4eded9a3de19778562f0e1093d7
Author: Johannes Gäßler <[email protected]>
Date:   Wed Jun 28 18:35:54 2023 +0200

    CUDA GPU acceleration for LoRAs + f16 models (#1970)

commit cfa0750bc9dbc2d957a91b8ed09ab0035d8f3d4e
Author: ningshanwutuobang <[email protected]>
Date:   Wed Jun 28 23:53:37 2023 +0800

    llama : support input embeddings directly  (#1910)

    * add interface for float input

    * fixed inpL shape and type

    * add examples of input floats

    * add test example for embd input

    * fixed sampling

    * add free for context

    * fixed add end condition for generating

    * add examples for llava.py

    * add READMD for llava.py

    * add READMD for llava.py

    * add example of PandaGPT

    * refactor the interface and fixed the styles

    * add cmake build for embd-input

    * add cmake build for embd-input

    * Add MiniGPT-4 example

    * change the order of the args of llama_eval_internal

    * fix ci error

commit 9d23589d638dc74577d5ff880e6d4248b795f12e
Author: Erik Scholz <[email protected]>
Date:   Tue Jun 27 19:06:33 2023 +0200

    fix pthreads setaffinity usage on android (#2020)

commit 0be54f75a6c3e9a09ea71bdfcdabf9a996a0549b
Author: Howard Su <[email protected]>
Date:   Tue Jun 27 13:07:13 2023 +0800

    baby-llama : fix build after ggml_rope change (#2016)

commit 181e8d975528a4e27eabb8ae6e9865f9ceae4b37
Author: Georgi Gerganov <[email protected]>
Date:   Tue Jun 27 00:37:13 2023 +0300

    llama : fix rope usage after ChatGLM change

commit d9779021bd59ed96daae75e820a5ac5da47ca8ff
Author: Georgi Gerganov <[email protected]>
Date:   Tue Jun 27 00:06:51 2023 +0300

    ggml : add support for ChatGLM RoPE

commit d38e45157862b58a1824387e64860d68ca3533a7
Author: Roman Parykin <[email protected]>
Date:   Mon Jun 26 22:47:59 2023 +0300

    readme : add Scala 3 bindings repo (#2010)

commit eaa6ca5a61b8c9501df9ebe3d264f45b75a5f8aa
Author: David Yang <[email protected]>
Date:   Tue Jun 27 03:45:32 2023 +0800

    ggml : increase max tensor name + clean up compiler warnings in train-text (#1988)

    * Clean up compiler warnings in train-text

    Some brackets to disambiguate order of operations

    * Increase GGML_MAX_NAME

    Avoiding strncpy danger in train-text-from-scratch and reducing potential future name length issues

commit aa777abbb73655c4e1e9237b7c0ad66745e8e48c
Author: Gustavo Rocha Dias <[email protected]>
Date:   Mon Jun 26 16:34:45 2023 -0300

    readme : LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux (#2007)

    * docs - Alternative way to build at Android, with CLBlast.

    * doc - LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux.

    * doc- fix typo

commit c824d2e368d193d9f564ff29880a51cda9f90527
Author: Georgi Gerganov <[email protected]>
Date:   Mon Jun 26 21:03:59 2023 +0300

    ggml : avoid conv 2d kernel round up

commit b853d456018b10820686362af41b2f2f75f1eec6
Author: zrm <[email protected]>
Date:   Mon Jun 26 13:57:59 2023 -0400

    ggml : add NUMA support (#1556)

    * detect NUMA systems and pin work threads to nodes (linux); a rough illustrative sketch follows after this list

    * disable mmap prefetch/readahead for NUMA systems

    * avoid sending finalize op to thread pool if it does nothing

    * silence robot

    * fix args

    * make --numa a param

    * recommendation that n_nodes evenly divide n_threads did not warrant such aggressive enforcement

    * lower synchronization overhead

    * statically allocate

    * move numa state to g_state

    * add description for --numa

    * ggml : minor style changes

    * ggml : minor style + try fix sanitizer build

    * llama : allow to initialize backend with NUMA support

    * llama : avoid ggml include in llama-util.h

    * ggml : style / formatting

    * ggml : fix handling of ops with n_threads > n_tasks > 1

    * server : utilize numa parameter

    ---------

    Co-authored-by: Georgi Gerganov <[email protected]>
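
    For illustration only, a rough Linux sketch of pinning a worker thread to the
    CPUs of one NUMA node, as referenced in the first bullet above. This is not the
    PR's actual code; in practice the CPU ids per node would come from somewhere
    like /sys/devices/system/node/nodeN/cpulist.

        #define _GNU_SOURCE
        #include <pthread.h>
        #include <sched.h>
        #include <stddef.h>

        // Pin the calling thread to the given CPU ids (assumed to belong to one NUMA node).
        static int pin_to_node_cpus(const int * cpus, size_t n_cpus) {
            cpu_set_t set;
            CPU_ZERO(&set);
            for (size_t i = 0; i < n_cpus; ++i) {
                CPU_SET(cpus[i], &set);
            }
            // returns 0 on success, an error number on failure
            return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        }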

commit 9225baef71407d799a6f7f563b77fd7f82791416
Author: Georgi Gerganov <[email protected]>
Date:   Mon Jun 26 20:10:52 2023 +0300

    k-quants : fix indentation

commit a84ab1da8dc6a59a5b67420ae1322f09503ffc72
Author: katsu560 <[email protected]>
Date:   Tue Jun 27 01:47:02 2023 +0900

    tests : fix quantize perf (#1990)

    * fix test quantize perf

    * avoid the global state

commit 5743ca80928d8410754ec64a5673d5c2dd6cfbb7
Author: katsu560 <[email protected]>
Date:   Tue Jun 27 01:46:07 2023 +0900

    k-quants : add AVX support to dot functions (#1916)

    * k_quants : add AVX support

    * k_quants : apply review comments

commit 412c60e4739367144e51e59add5dc7749d084115
Author: Georgi Gerganov <[email protected]>
Date:   Mon Jun 26 19:45:09 2023 +0300

    readme : add link to new k-quants for visibility

commit 6769e944c727c63612dcafbef52009d21ae00fff
Author: Kawrakow <[email protected]>
Date:   Mon Jun 26 19:43:07 2023 +0300

    k-quants : support for super-block size of 64 (#2001)

    * k_quants: WIP super-blocks with 64 weights

    * k_quants: WIP super-blocks with 64 weights

    Q6_K scalar and AVX2 works

    * k_quants: WIP super-blocks with 64 weights

    Q4_K scalar and AVX2 works

    * k_quants: WIP super-blocks with 64 weights

    Q2_K scalar and AVX2 works. Q2_K is way too slow (it is actually slower
    than the scalar implementation)

    * k_quants: WIP super-blocks with 64 weights

    Q3_K scalar and AVX2 works.

    * k_quants: WIP super-blocks with 64 weights

    Q5_K scalar and AVX2 works, and with that all
    k_quants are done on AVX2 and scalar

    * k_quants: WIP super-blocks with 64 weights

    Q6_K working on CUDA. Cannot make it run quite as fast as
    with super-blocks with 256 weights: 8% slower on 4080,
    20% slower on the 1660 (but there we fit one layer less on the
    GPU because of the larger model size), so some fraction of
    that 20% is due to that.

    * k_quants: WIP super-blocks with 64 weights

    Q4_K working on CUDA. ~10% slower on GTX-1660,
    16% slower on 4080.

    * k_quants: WIP super-blocks with 64 weights

    Q2_K working on CUDA. ~3% slower on GTX-1660,
    10% slower on 4080.

    * k_quants: WIP super-blocks with 64 weights

    Q3_K working on CUDA.

    * k_quants: WIP super-blocks with 64 weights

    Q5_K working on CUDA, and with this CUDA is done.

    * k_quants: WIP super-blocks with 64 weights

    Q6_K working on ARM_NEON

    * k_quants: WIP super-blocks with 64 weights

    Q4_K working on ARM_NEON, but quite a bit slower than 256 weights

    * k_quants: WIP super-blocks with 64 weights

    Q2_K working on ARM_NEON, but quite a bit slower than 256 weights

    * k_quants: WIP super-blocks with 64 weights

    Q3_K working on ARM_NEON, but quite a bit slower than 256 weights.

    * k_quants: WIP super-blocks with 64 weights

    Q5_K working on ARM_NEON, but quite a bit slower than 256 weights.

    With that, we have full support for ARM_NEON, although
    performance is not quite there.

    * k_quants: WIP super-blocks with 64 weights

    Slightly more efficient Q3_K and Q5_K

    * k_quants: WIP super-blocks with 64 weights

    Another small improvement for Q3_K and Q5_K on ARM_NEON

    * k_quants: WIP super-blocks with 64 weights

    Yet another speedup for Q5_K on ARM_NEON.
    We are now within 10% of the QK_K = 256 version.

    * k_quants: WIP super-blocks with 64 weights

    * We are able to pass preprocessor macros to the Metal
      compiler
    * Q6_K works and is actually slightly more efficient than
      the QK_K = 256 version (25.2 ms vs 25.8 ms)

    * k_quants: WIP super-blocks with 64 weights

    Q4_K works on Metal and is actually slightly faster
    than QK_K = 256 (21.95 ms vs 24.0 ms).

    * k_quants: WIP super-blocks with 64 weights

    Q2_K works on Metal and is very slightly faster
    than QK_K = 256 (23.8 ms vs 24.2 ms).

    * k_quants: WIP super-blocks with 64 weights

    Q3_K works on Metal and is slightly faster
    than QK_K = 256 (26.6 ms vs 28.3 ms).

    * k_quants: WIP super-blocks with 64 weights

    Q5_K works on Metal and is slightly faster
    than QK_K = 256 (23.7 ms vs 26.3 ms).

    * k_quants: call them _K, not _k, also on Metal

    * k_quants: correctly define QK_K in llama.cpp

    * Fixed bug in q4_K quantization added with the 64-block addition

    * Simplify via lambda

    * k_quants: switch Q3_K to 4-bit scales when QK_K = 64

    Otherwise there isn't much benefit from this
    quantization type. There is some very slight loss
    in accuracy, but we reduce size by ~7%.
    E.g., for OpenLLaMA-3B, Q3_K_S perplexity is
    8.6131 with 8-bit scales and 8.6352 with 4-bit,
    while file size decreases from 1.53G to 1.44G.

    * k_quants: switch Q4_K to 4-bit scales when QK_K = 64

     Here the loss in accuracy is greater than for Q3_K,
     but the Q4_K points still move further to the left on
     the perplexity vs size curve.

    * k_quants: forgot to add the Metal changes in last commit

    * k_quants: change Q5_K to be type 0 when QK_K = 64

    Still needs AVX2 implementation

    * k_quants: AVX2 implementation for new 64-weight Q5_K

    * k_quants: 10% faster ARM_NEON Q5_K dot product

    * k_quants: fixed issue caused by merging with master

    ---------

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit cbebf61ca7584e9709265395f0127ae7fc0f1882
Author: Howard Su <[email protected]>
Date:   Mon Jun 26 23:15:47 2023 +0800

    Fix assert when free invalid cuda pointer (#2005)

    Fix the assert by always initializing the extra structure.
    CUDA error 1 at C:\GPT\llama.cpp\ggml-cuda.cu:2536: invalid argument

commit 447ccbe8c39332fcdd0d98a041b6e2ff6f06219d
Author: Georgi Gerganov <[email protected]>
Date:   Sun Jun 25 16:08:12 2023 +0300

    readme : add new roadmap + manifesto

commit bd34cdde38f8fd661890ddd5f57ca30bf279877b
Author: Georgi Gerganov <[email protected]>
Date:   Sun Jun 25 14:25:08 2023 +0300

    ggml : sync latest ggml (custom operators)

commit c2a08f87b8d180115d04b8688f383d1b2761b16d
Author: anon998 <[email protected]>
Date:   Sun Jun 25 08:48:36 2023 +0000

    fix server sampling: top k sampler first (#1977)

    Co-authored-by: anon <[email protected]>

commit 66a2555ba6cab954c56d653b29c27bfbbacfbfb1
Author: Georgi Gerganov <[email protected]>
Date:   Sun Jun 25 09:07:03 2023 +0300

    readme : add Azure CI discussion link

commit e65ca7e14ac76c4046091da39d41a9017abaa9b3
Author: sjinzh <[email protected]>
Date:   Sun Jun 25 13:45:44 2023 +0800

    zig : upgrade build system support (#1981)

    * upgrade zig build system support

    * zig : add new line at the end of the file

    ---------

    Co-authored-by: Georgi Gerganov <[email protected]>

commit 5ec8dd5a3c6a9a109351d2257bb9d53869bd0a94
Author: Robyn <[email protected]>
Date:   Sun Jun 25 04:10:29 2023 +1000

    #1869 Fix null reference errors when training from scratch with CUDA (#1907)

    * #1869 Fix null reference errors when training from scratch with CUDA build

    Calling ggml_compute_forward when node->src0 was null was causing train-text-from-scratch.exe to terminate unexpectedly.

    * ggml : do not dereference src0 if NULL

    ---------

    Co-authored-by: Georgi Gerganov <[email protected]>

commit 65bdd52a867539691007f85c5508146d507f72c1
Author: Georgi Gerganov <[email protected]>
Date:   Sat Jun 24 19:40:18 2023 +0300

    tests : sync test-grad0 from ggml

commit fdd18609113862dc6eb34dfc44a093d54c59ff1f
Author: Rowan Hart <[email protected]>
Date:   Sat Jun 24 04:07:08 2023 -0700

    flake : fix ggml-metal.metal path and run nixfmt (#1974)

commit c943d823c14cef33092205ca3944de6fdf7abf99
Author: AN Long <[email protected]>
Date:   Sat Jun 24 19:02:06 2023 +0800

    convert : fix invalid params in write_vocab_only (#1975)

commit f2c754e1c38936fdde74e4848ac468a696eb73c6
Author: slaren <[email protected]>
Date:   Sat Jun 24 12:57:18 2023 +0200

    ggml : improve ggml_graph_dump_dot, add ggml_format_name (#1978)

    * Improve ggml_graph_dump_dot, add ggml_format_name

    * add more automatic names to view ops

    * fix name of copies

commit 11da1a85cd69af84b5861134738c7e9e20907470
Author: Georgi Gerganov <[email protected]>
Date:   Sat Jun 24 13:38:18 2023 +0300

    readme : fix whitespaces

commit 235b610d650cbfed6dbd5d671f750d35fc18cd7d
Author: Alberto <[email protected]>
Date:   Sat Jun 24 12:32:13 2023 +0200

    readme : fixed termux instructions (#1973)

commit b061ba9e2a7a2c335a200df8c11aed5e31e4ccbb
Author: Alex Renda <[email protected]>
Date:   Sat Jun 24 03:15:01 2023 -0700

    llama : fix top-p sampling to match the canonical definition (#1953)

    * Fix top-p sampling to match the standard definition (smallest set that has probability mass at least p, not largest set with probability mass less than p); see the sketch after this list

    * top-p: correct gt to gte

    * add test for correct top-p behavior
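
    A small standalone sketch of the corrected definition (smallest prefix of the
    sorted distribution whose cumulative probability is >= p); hypothetical code for
    illustration, not the llama.cpp sampler itself.

        #include <algorithm>
        #include <functional>
        #include <vector>

        // Given probabilities (not necessarily sorted), keep the smallest set of
        // candidates whose total probability mass is at least top_p.
        static std::vector<float> top_p_filter(std::vector<float> probs, float top_p) {
            std::sort(probs.begin(), probs.end(), std::greater<float>());
            float  cum  = 0.0f;
            size_t last = probs.size();
            for (size_t i = 0; i < probs.size(); ++i) {
                cum += probs[i];
                if (cum >= top_p) { // ">=", not ">": this is the "gt to gte" fix
                    last = i + 1;   // keep this candidate too
                    break;
                }
            }
            probs.resize(last);
            return probs;
        }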

commit 527b6fba1d237befb324fd846bda7418c0fa394d
Author: Didzis Gosko <[email protected]>
Date:   Sat Jun 24 11:47:58 2023 +0300

    llama : make model stateless and context stateful (llama_state) (#1797)

    * llama : make model stateless and context stateful

    * llama : minor cleanup

    * llama : update internal API declaration

    * Apply suggestions from code review

    fix style

    Co-authored-by: Georgi Gerganov <[email protected]>

    * Missing model memory release

    * Fix style

    * Add deprecated warning for public API function llama_init_from_file

    * Update public API use cases: move away from deprecated llama_init_from_file

    * Deprecate public API function llama_apply_lora_from_file

    ---------

    Co-authored-by: Georgi Gerganov <[email protected]>

commit d7b7484f74d486f77feb4c0b7af7e1718ed91651
Author: eiery <[email protected]>
Date:   Fri Jun 23 04:38:01 2023 -0400

    Add OpenLLaMA instructions to the README (#1954)

    * add openllama to readme

commit 7487137227eb32ed9b12156338b865cb29b2dfd1
Author: Erik Scholz <[email protected]>
Date:   Thu Jun 22 14:20:47 2023 +0200

    rework convert.py to read hyper-parameters from config.json (#1958)

    * Read hyper-parameters from the HuggingFace transformers config.json if it exists, and otherwise fall back to guessing, like before.
      This allows converting open_llama 3B and other non-standard model designs.

commit bbca06e26949686d61a5126332680ba3cccf235c
Author: Johannes Gäßler <[email protected]>
Date:   Wed Jun 21 23:49:25 2023 +0200

    cmake: revert CUDA arch default to 52, 61 if f16 (#1959)

commit fb98254f99d769fcbbf20966ef386abdb48ef601
Author: Rahul Vivek Nair <[email protected]>
Date:   Thu Jun 22 03:18:43 2023 +0530

    Fix typo in README.md (#1961)

commit 049aa16b8c5c6d086246e4e6b9feb18de4fbd663
Author: Georgi Gerganov <[email protected]>
Date:   Tue Jun 20 19:05:54 2023 +0300

    readme : add link to p1

commit 2322ec223a21625dfe9bd73ee677444a98a24ac9
Author: Xiake Sun <[email protected]>
Date:   Tue Jun 20 05:42:40 2023 -0700

    Fix typo (#1949)

commit aacdbd40562684665b6f7b8ba6695b7a2088bbb0
Author: Ettore Di Giacinto <[email protected]>
Date:   Tue Jun 20 03:24:39 2023 +0200

    llama : fix params struct alignment (#1936)

    * Workaround struct misalignment during value-copy

    Signed-off-by: mudler <[email protected]>

    * Move booleans at the bottom of the structure

    Signed-off-by: mudler <[email protected]>

    * Add comment

    Signed-off-by: mudler <[email protected]>

    ---------

    Signed-off-by: mudler <[email protected]>

commit 20568fe60f00155fa25e92eb3a7f6b911d557967
Author: Henri Vasserman <[email protected]>
Date:   Tue Jun 20 01:12:39 2023 +0300

    [Fix] Reenable server embedding endpoint (#1937)

    * Add back embedding feature

    * Update README

commit 18b35625c3c19c64b7818a12460ba5ddb006dfdc
Author: Georgi Gerganov <[email protected]>
Date:   Mon Jun 19 20:43:30 2023 +0300

    ggml : fix bug in LBFGS optimizer (found by ggml tests)

commit ba4e85a8339b9dd7cdffad31838235f2fe45a8ea
Author: l3utterfly <[email protected]>
Date:   Mon Jun 19 23:20:06 2023 +0800

    llama : use aligned memory during ggml_init call from loading saved sessions (#1934)

    * fixed issue: memory is not guaranteed to be aligned properly during ggml_init call from loading saved sessions

    * - removed commented out old code from fix
    - updated another instance of same issue below original

commit 23fc5c219a9aebd57c8af3fac454062cc4622980
Author: Georgi Gerganov <[email protected]>
Date:   Mon Jun 19 18:18:34 2023 +0300

    cmake : fix trailing whitespaces

commit cb40dfca694b5cb849837548fd69932117c78362
Author: Kawrakow <[email protected]>
Date:   Mon Jun 19 18:17:03 2023 +0300

    llama : only use Q6_K for output weights if tensor size is multiple of 256 (#1932)

    * Only use Q6_K for output weights if tensor size is multiple of 256

    * Fixed copy/paste mistake

    ---------

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit ca7c3f4da5d144d4cd1dd44903552e6ba49b8ec8
Author: Kawrakow <[email protected]>
Date:   Mon Jun 19 18:14:09 2023 +0300

    cuda : faster k-quants on older GPUs (#1930)

    * k_quants: hopefully much faster Q4_K on older GPUs

    On the GTX-1660 that I have available to represent
    "old GPUs", token prediction drops from 65.5 ms/tok
    to 41.5 ms/tok!

    * k_quants: hopefully much faster Q3_K on older GPUs

    On the GTX-1660 that I have available to represent
    "old GPUs", token prediction drops from 60.3 ms/tok
    to 41.0 ms/tok!

    * k_quants: faster Q2_K on older GPUs

    It looks like I didn't need to change anything
    compared to what we already had, so this is just
    adding clarifying comments. But I now measure
    36.3 ms/tok on the GTX-1660, instead of the
    47.2 ms/tok that I have written in the faster
    k-quants PR.

    * k_quants: faster Q5_K on older GPUs

    68.5 ms/tok -> 62.0 ms/tok on GTX-1660.
    For some reason the same access pattern that leads
    to such resounding success for Q2_K to Q4_K did not
    work at all for Q5_K.

    It is also more difficult to measure because for Q5_K_S
    we only have 32 layers on the GTX-1660, so output, tok embeddings
    and kv cache are done on the CPU.

    ---------

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit b97ca431db35ec96a339a721acb1219c1dd78bed
Author: Georgi Gerganov <[email protected]>
Date:   Mon Jun 19 18:12:33 2023 +0300

    ggml : sync latest ggml repo (#1924)

    * ggml : sync latest ggml repo

    * ggml : remove unused comments

    * ggml : asserts

commit 1e3abfcef073e73c2b31e8570cb06c5cb2fd1f55
Author: Howard Su <[email protected]>
Date:   Mon Jun 19 23:10:37 2023 +0800

    cmake : fix build shared ggml when CUDA is enabled (#1929)

    Co-authored-by: Georgi Gerganov <[email protected]>

commit 16b9cd193965769089881bb8ec012fccca7b37b6
Author: Johannes Gäßler <[email protected]>
Date:   Mon Jun 19 10:23:56 2023 +0200

    Convert vector to f16 for dequantize mul mat vec (#1913)

    * Convert vector to f16 for dmmv

    * compile option

    * Added compilation option description to README

    * Changed cmake CUDA_ARCHITECTURES from "OFF" to "native"

commit b24c3049d96557c24782e4d32feaae65f47277af
Author: Johannes Gäßler <[email protected]>
Date:   Sun Jun 18 17:41:26 2023 +0200

    Added tokens per second to info prints (#1928)

commit 0ede372a51fd8160688e01b587582666c14e94e5
Author: Johannes Gäßler <[email protected]>
Date:   Sun Jun 18 16:07:09 2023 +0200

    Fixed incorrectly applying RMS norm twice (#1925)

commit 8596af427722775f0df4a7c90b9af067ba90d4ef
Author: l3utterfly <[email protected]>
Date:   Sun Jun 18 19:19:16 2023 +0800

    ggml : fix bug in ggml_compute_forward_add_q_f32 (#1918)

commit e1886cf4fe0d0f31661dda52a4a9f34bd9b9009a
Author: Mike <[email protected]>
Date:   Sun Jun 18 16:28:26 2023 +0800

    readme : update Android build instructions (#1922)

    Add steps for using termux on android devices to prevent common errors.

commit 8ab8ba62eb27cc340be2edf3418e051b1d967416
Author: Kawrakow <[email protected]>
Date:   Sun Jun 18 11:13:43 2023 +0300

    llama : prevent usage of k-quants when tensor size is not a multiple of 256 (#1921)

    * Fix examples/metal

    * k-quants: prevent usage when tensor size is not divisible by 256

    ---------

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit 90cc59d6ab1363a5c69c60c4b94db647d3a54a18
Author: Kawrakow <[email protected]>
Date:   Sun Jun 18 10:52:10 2023 +0300

    examples : fix examples/metal (#1920)

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit ce2c7d72e2d06988b5ddec6811ab923254542077
Author: Georgi Gerganov <[email protected]>
Date:   Sun Jun 18 09:09:47 2023 +0300

    metal : handle buffers larger than device's maxBufferLength (#1826)

    * metal : handle buffers larger than device's maxBufferLength

    * metal : print more verbose device info + handle errors

    * metal : fix prints for overlapping views

    * metal : minimize view overlap to try to utilize device memory better

commit 57cd69460f736031a3fc54af1e97c03f80128478
Author: Howard Su <[email protected]>
Date:   Sun Jun 18 12:29:47 2023 +0800

    cmake : add CUDA_ARCHITECTURES to new target ggml_static (#1917)

commit b2416493ab3ab21686d47c96669da6d6c6af08a4
Author: Georgi Gerganov <[email protected]>
Date:   Sat Jun 17 20:55:03 2023 +0300

    make : do not print help for simple example

commit 4f9c43e3bd488b7561119785485e1155dba338d7
Author: Georgi Gerganov <[email protected]>
Date:   Sat Jun 17 20:24:11 2023 +0300

    minor : warning fixes

commit 2c9380dd2f77e41149340f3ecb09764d793b16db
Author: Johannes Gäßler <[email protected]>
Date:   Sat Jun 17 19:15:02 2023 +0200

    Only one CUDA stream per device for async compute (#1898)

commit 051e1b0e6a6e3aee7d989b47760980e6fda5861c
Author: Georgi Gerganov <[email protected]>
Date:   Sat Jun 17 19:30:22 2023 +0300

    llama : fix kv_cache `n` init (close #1903)

commit 86c7571864ff331f8cdb9e092f3abeb123729a56
Author: DaniAndTheWeb <[email protected]>
Date:   Sat Jun 17 18:17:22 2023 +0200

    make : update for latest Arch (#1701)

    With the upcoming change to the openblas package in arch the Makefile workaround is no longer needed.

commit 3d59ec5935ea1d33e9d51060a8dd737169b9b89b
Author: Howard Su <[email protected]>
Date:   Sat Jun 17 23:46:15 2023 +0800

    ggml : fix warnings under MSVC (#1908)

commit 0711a5f6dce7f04c2a791b14bc47f7d4cb545408
Author: Aaron Miller <[email protected]>
Date:   Sat Jun 17 07:37:49 2023 -0700

    metal : add norm, cpy f16->f16, alibi kernels (#1823)

commit fc45a81bc642b9ef33d9004f2b363d558438a6c9
Author: Faez Shakil <[email protected]>
Date:   Sat Jun 17 17:13:05 2023 +0500

    exposed modules so that they can be invoked by nix run github:ggerganov/llama.cpp#server etc (#1863)

commit 794db3e7b982fee37e3995db9c3a216a57ff65e3
Author: Randall Fitzgerald <[email protected]>
Date:   Sat Jun 17 07:53:04 2023 -0400

    Server Example Refactor and Improvements (#1570)

    A major rewrite for the server example.

    Note that if you have built something on the previous server API, it will probably be incompatible.
    Check out the examples for how a typical chat app could work.

    This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.

    Summary of the changes:

    - adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
    - applies missing top k sampler
    - removes interactive mode/terminal-like behavior, removes exclude parameter
    - moves threads and batch size to server command-line parameters
    - adds LoRA loading and matches command line parameters with main example
    - fixes stopping on EOS token and with the specified token amount with n_predict
    - adds server timeouts, host, and port settings
    - adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
    - sets defaults for unspecified parameters between requests
    - removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
    - adds CORS headers to responses
    - adds request logging, exception printing and optional verbose logging
    - adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
    - adds printing an error when it can't bind to the host/port specified
    - fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
    - prints timing and build info on startup
    - adds logit bias to request parameters
    - removes embedding mode
    - updates documentation; adds streaming Node.js and Bash examples
    - fixes code formatting
    - sets server threads to 1 since the current global state doesn't work well with simultaneous requests
    - adds truncation of the input prompt and better context reset
    - removes token limit from the input prompt
    - significantly simplified the logic and removed a lot of variables

    ---------

    Co-authored-by: anon998 <[email protected]>
    Co-authored-by: Henri Vasserman <[email protected]>
    Co-authored-by: Felix Hellmann <[email protected]>
    Co-authored-by: Johannes Gäßler <[email protected]>
    Co-authored-by: Lesaun Harvey <[email protected]>

commit 5ddf7ea1fb42bac21026de2f77e0f9c069b92234
Author: Jiří Podivín <[email protected]>
Date:   Sat Jun 17 12:32:48 2023 +0200

    hooks : setting up flake8 and pre-commit hooks (#1681)

    Small, non-functional changes were made to non-compliant files.
    These include breaking up long lines, whitespace sanitation and
    unused import removal.

    Maximum line length in python files was set to a generous 125 chars,
    in order to minimize number of changes needed in scripts and general
    annoyance. The "txt" prompts directory is excluded from the checks
    as it may contain oddly formatted files and strings for a good reason.

    Signed-off-by: Jiri Podivin <[email protected]>

commit bac19927c302737465a1deb14ac0943a221863e8
Author: Gustavo Rocha Dias <[email protected]>
Date:   Sat Jun 17 06:01:06 2023 -0300

    readme :  alternative way to build for Android with CLBlast. (#1828)

commit b4c6f46f17b6e02f1cd55a81339e7e64f3aaa688
Author: Kerfuffle <[email protected]>
Date:   Sat Jun 17 01:49:42 2023 -0600

    Allow cmake to build ggml as a library (#1896)

    * Allow cmake to build ggml as a library

    * A ggml_static library will be created

    * When BUILD_SHARED_LIBS is enabled, ggml_shared will also be built

commit 92f20d9942c86daeb78637bdad7296a572f4da28
Author: David Yang <[email protected]>
Date:   Sat Jun 17 14:51:54 2023 +0800

    train : get raw text instead of page with html (#1905)

    We probably want to train using just the text of Shakespeare instead of the html of the page displaying his work.

commit d411968e990c37f51328849c96a743dd78f3c3dd
Author: 0cc4m <[email protected]>
Date:   Fri Jun 16 20:59:49 2023 +0200

    opencl : support k-quants (#1836)

    * Porting q2_k kernel to OpenCL

    * Set global and local sizes for kernel calls for dequantizing k-quants

    * Added q6_k kernel

    * Fix q4_k opencl struct order

    * Replace uchar with uint8_t

    * Finish dequant kernels

    * Added OpenCL DMMV kernels

    * Fix q2_k, improve code

    * Fix q3_k

    * Shorten switch statements

    * Improve code formatting

    ---------

    Co-authored-by: Concedo <[email protected]>

commit b41b4cad6f956b5f501db0711dd7007c32b5eee5
Author: SuperUserNameMan <[email protected]>
Date:   Fri Jun 16 20:58:09 2023 +0200

    examples : add "simple" (#1840)

    * Create `simple.cpp`

    * minimalist example `CMakeLists.txt`

    * Update Makefile for minimalist example

    * remove 273: Trailing whitespace

    * removed trailing white spaces simple.cpp

    * typo and comments simple.cpp

    ---------

    Co-authored-by: Georgi Gerganov <[email protected]>

commit 13fe9d2d84f30cab613c960bf66ac83916006694
Author: Zenix <[email protected]>
Date:   Sat Jun 17 03:53:04 2023 +0900

    cmake : add auto detection of BLAS_INCLUDE_DIRS (#1886)

commit ac3b8869538c7fbdb48ff141d78c4dea091789f0
Author: Johannes Gäßler <[email protected]>
Date:   Fri Jun 16 20:25:51 2023 +0200

    llama : fix embd when offloading non-repeating layers (#1891)

commit 5b9ccaf104cc1054d4f8f17bc8a4b8dc949e5527
Author: FrankHB <[email protected]>
Date:   Sat Jun 17 02:25:01 2023 +0800

    Fixed possible macro redefinition (#1892)

    MinGW libstdc++ may define `NOMINMAX` unconditionally. This fixes the case when it is already defined.
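
    The usual guard for this situation looks like the following sketch (not
    necessarily the exact lines of the commit):

        #ifndef NOMINMAX
        #define NOMINMAX
        #endif
        #include <windows.h>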

commit 9cbf50c041a525d781c7764f493a5443924e4e38
Author: Borislav Stanimirov <[email protected]>
Date:   Fri Jun 16 21:23:53 2023 +0300

    build : fix and ignore MSVC warnings (#1889)

commit 3d0112261042b356621e93db3fa4c6798a5d098f
Author: Kawrakow <[email protected]>
Date:   Fri Jun 16 20:08:44 2023 +0300

    CUDA : faster k-quant dot kernels (#1862)

    * cuda : faster k-quant dot kernels

    * Improve Q2_K dot kernel on older GPUs

    We now have a K_QUANTS_PER_ITERATION macro, which should be
    set to 1 on older and to 2 on newer GPUs.
    With this, we preserve the performance of the original
    PR on RTX-4080, and are faster compared to master on
    GTX-1660.

    * Improve Q6_K dot kernel on older GPUs

    Using the same K_QUANTS_PER_ITERATION macro as last commit,
    we preserve performance on RTX-4080 and speed up
    Q6_K on a GTX-1660.

    * Add LLAMA_CUDA_KQUANTS_ITER to CMakeLists.txt and Makefile

    Allowed values are 1 or 2. 2 gives the best performance on
    modern GPUs and is set as default. On older GPUs 1 may work
    better.

    * PR comments

    ---------

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit 602c748863e15270d80d74aa2c3bf86ab8139e07
Author: Borislav Stanimirov <[email protected]>
Date:   Fri Jun 16 09:58:11 2023 +0300

    gitignore : add several entries specific to Visual Studio (#1888)

commit a09f9195be39afb4b023b646c0a6ec8a86915174
Author: Johannes Gäßler <[email protected]>
Date:   Thu Jun 15 21:49:08 2023 +0200

    Fixed CUDA runtime version check (#1879)

commit bed92756172d4514b23aaf9744cf8e2dc892fc7b
Author: Georgi Gerganov <[email protected]>
Date:   Thu Jun 15 21:56:50 2023 +0300

    cmake : remove whitespaces

commit c36e81da62ebfe09a768201cc44fa8d712dd00ed
Author: yangli2 <[email protected]>
Date:   Thu Jun 15 11:05:53 2023 -0700

    examples : add chat-vicuna.sh (#1854)

    Co-authored-by: Yang Li <[email protected]>

commit 3559433fecedf365e7aba2fe3d5f89d9abb817c1
Author: Igor Okulist <[email protected]>
Date:   Thu Jun 15 12:51:26 2023 -0500

    cmake : set include path for OpenBlas (#1830)

commit 69b34a0e80300bfb3e996983ac3ea075f5526675
Author: Frederik Vogel <[email protected]>
Date:   Fri Jun 16 02:47:04 2023 +0900

    swift : Package compile breaks due to ggml-metal.metal (#1831)

    * Ignore metal file in spm

    * Add ggml.h to spm public Headers

    ---------

    Co-authored-by: Vogel Frederik <[email protected]>

commit cf267d1c71a781700698f8518e903239c3bcc929
Author: daboe01 <[email protected]>
Date:   Thu Jun 15 19:42:48 2023 +0200

    make : add train-text-from-scratch (#1850)

    * make finetuning example accessible

    * fixed: target was in wrong line

    * fixed: name of executable was wrong

    * fixed: naming of binary

    * fixed: model path was wrong

    * fixed clean target

    * Update examples/train-text-from-scratch/README.md

    ---------

    Co-authored-by: Georgi Gerganov <[email protected]>

commit 9dda13e5e1f70bdfc25fbc0f0378f27c8b67e983
Author: Srinivas Billa <[email protected]>
Date:   Thu Jun 15 18:36:38 2023 +0100

    readme : server compile flag (#1874)

    Explicitly include the server make instructions for C++ noobs like me ;)

commit 37e257c48e350cf03c353c10d31e777f8d00123d
Author: sandyiscool <[email protected]>
Date:   Thu Jun 15 23:06:06 2023 +0530

    make : clean *.so files (#1857)

commit 64cc19b4fe3df03bc20e520aa111c30cff3a655e
Author: Howard Su <[email protected]>
Date:   Fri Jun 16 01:29:59 2023 +0800

    Fix the validation of main device (#1872)

commit 4bfcc855abdb2c9fcc3c5a84747974521909fa41
Author: Georgi Gerganov <[email protected]>
Date:   Thu Jun 15 20:29:48 2023 +0300

    metal : parallel command buffer encoding (#1860)

    * metal : parallel command buffer encoding

    * metal : determine number of command buffers based on gf->n_threads

commit 6b8312e7979b852f6b6ac9d29cd51fda16c17948
Author: Johannes Gäßler <[email protected]>
Date:   Thu Jun 15 19:06:46 2023 +0200

    Better error when using both LoRA + GPU layers (#1861)

commit 254a7a7a5ff4c874ff8488f1f5cbdd7e9c89d682
Author: Johannes Gäßler <[email protected]>
Date:   Wed Jun 14 19:47:19 2023 +0200

    CUDA full GPU acceleration, KV cache in VRAM (#1827)

    * Fixed CUDA RoPE

    * ggml_cuda_mul_mat_vec_p021

    * ggml_cuda_scale

    * ggml_cuda_diag_mask_inf

    * ggml_is_permuted

    * ggml_cuda_cpy

    * flatten rows for ggml_cuda_op

    * Added a --low-vram option

    * Fixed Windows performance

    * Fixed LLAMA_CUDA_DMMV_Y > 1 for WizardLM

commit 92549202659fc23ba9fec5e688227d0da9b06b40
Author: 0xspringtime <[email protected]>
Date:   Tue Jun 13 15:37:54 2023 -0400

    baby-llama : fix operator!= (#1821)

    * Update baby-llama.cpp

    Seems to be an error in the implementation of the operator!= function. It attempts to compare the this pointer (a llama_hparams_lora object) with the other pointer (a llama_hparams object) using memcmp. This can lead to incorrect results because the sizes of the objects being compared (sizeof(llama_hparams) and sizeof(llama_hparams_lora)) are different. With this change it should now be possible to compare two llama_hparams_lora objects for inequality (a reduced illustration follows after this list).

    * Update baby-llama.cpp

    * Update baby-llama.cpp
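
    A reduced illustration of the bug described above, with hypothetical struct
    names: memcmp over two differently sized types reads past the end of the smaller
    object, whereas comparing two objects of the same type is well defined.

        #include <cstring>

        struct hparams      { int n_embd; int n_layer; };
        struct hparams_lora { int n_embd; int n_layer; int n_rank; };

        // Buggy: compares sizeof(hparams_lora) bytes against an hparams object,
        // reading past its end.
        static bool neq_buggy(const hparams_lora & a, const hparams & b) {
            return memcmp(&a, &b, sizeof(hparams_lora)) != 0;
        }

        // Fixed: compare two objects of the same type over their actual size.
        static bool neq_fixed(const hparams_lora & a, const hparams_lora & b) {
            return memcmp(&a, &b, sizeof(hparams_lora)) != 0;
        }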

commit e32089b2c20b1b87b22912f4a8b93fe01647d5b9
Author: xaedes <[email protected]>
Date:   Tue Jun 13 21:04:40 2023 +0200

    train : improved training-from-scratch example (#1652)

    * add python wrapper

    https://gist.github.com/abetlen/2b90e5f153f6efd00931d098de5c73ce

    * fix decoding error. adds errors=ignore parameter

    * add python bindings for functions to get and set the whole llama state
    (rng, logits, embedding and kv_cache)

    * update python bindings

    * add text generating baby-llama from scratch example

    * fix race condition bug in ggml_compute_forward_diag_mask_f32

    * implement ggml_soft_max_back for more performant backward pass of soft_max

    avoids creating big intermediate matrices of size n_embd x n_embd for llama layers and n_vocab x n_vocab for cross entropy loss

    * improve softmax backward pass

    go from quadratic runtime to linear runtime by simplifying the formulas

    * fix race condition bug in non-inplace ggml_compute_forward_diag_mask_f32

    memcpy needs to be synchronized across threads to avoid race conditions.
    => do it in INIT phase

    * fix bug in ggml_compute_forward_soft_max_back_f32 on DEBUG build

    * improve performance of mul_mat backward pass

    avoid transpose by using mul_mat with swapped arguments

    * avoid printing too much newlines in baby-llama-text

    * activate threading in baby-llama-text

    * add ggml_out_prod and use it for mul_mat backward pass for improved performance

    performance stats report improvement from 37 seconds to 16 seconds runtime during my training tests

    * better weight initialization improves training convergence at start

    * better weight initialization improves training convergence at start

    * improve ggml_out_prod performance

    - change iteration order (>15s -> 10s runtime)
    - parallelize over one more dimension: over dst matrix rows (10s -> <5s runtime)

    * add llama sampler, shuffle samples and constrain sampling to tokens occurring in train data

    * fix get_samples call, add model tensor names, increase model size, start training samples after newline

    * save train trained model to checkpoint and load model to be trained from checkpoint

    * use inplace functions where possible

    * initialize rng with srand

    * use different arguments for input and output checkpoint

    * ggml fixes to support backward pass on inplace operations

    * remove duplicate include

    * fix cross entropy loss

    - add target probabilities for each sample which is then used in cross entropy loss

    * print used memory before and after optimization

    * sample with non-greedy sampling parameters at the end of training

    * add cmake target for baby-llama-text

    * add ggml_add1_inplace to header

    * enable gradient propagation for inplace add1 and scale operations

    those functions' backward passes don't need the original src0, so they also work when the forward pass is inplace

    * implement AdamW in ggml_opt_adam by adding weight decay parameter (default 0.001f)

    also add a schedule parameter (default 1.0f) that can be used to scale alpha and decay according to learning schedule.
    setting the decay parameter to zero disables AdamW resulting in normal Adam optimizer.

    since the difference between Adam and AdamW is minimal it is not implemented as another optimizer, but integrated into the existing Adam optimizer.

    * use inplace operations in cross_entropy_loss

    * fix random weight initialization scale

    * add missing default parameters for adam optimizer

    * add ggml_opt_context, so that we can properly resume training

    otherwise the optimizer states, tracking statistics about the error function and its derivatives,
    will reset to zero each time ggml_opt is called, hindering convergence on resumed training.

    now the optimizer context and all its memory is stored in a separate struct.

    * fix bug in llama_sample_token_mirostat_v2

    when all candidates are filtered out through mu threshold, the following soft_max operation will fail.
    so keep at least one.

    * add forward function without using cache, for more performant training

    during training on whole samples no cache is required.
    removing the cache and simplifying the remaining code results in performance and memory usage improvement.

    * print suppressed newline tokens as string "\n"

    printing too much actual newlines is suppressed to avoid flooding the console.

    * store optimizer state in training checkpoint and add learning schedule

    persistent optimizer state allows to resume training without resetting the optimizer
    learning schedule consists of linear warmup ramp followed by cosine decay with restarts

    * remove unused functions

    * fix bug in get_samples which corrupted training targets

    * save checkpoint only when it was trained

    * simplify code

    * remove trailing whitespace

    * simplify backward pass for SQRT

    * replace inefficient repeat backward pass with dedicated repeat_back operation

    * add ggml_cross_entropy_loss with backward pass for faster training

    cross entropy loss can also be implemented using softmax and log, but as dedicated operation it is faster and especially avoids unnecessary memory overhead.

    * add tests for cross_entropy_loss backward pass

    finite differences regularly result in an estimated gradient of zero, despite the backward pass giving a non-zero gradient.
    _probably_ the finite differences fail due to numerical issues

    * use ggml_cross_entropy_loss in text training example

    * remove trailing whitespace

    * slightly improve how cross entropy loss is computed

    btw: directly implemented cross entropy loss seems to have way lower magnitudes than when implemented with softmax and log.
    probably the input to log gets closer to zero due to float numerics.
    maybe the multiplication by (1.0-eps)/sum is more accurate..
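
    For reference, a direct implementation typically folds softmax and log together via the log-sum-exp trick; a minimal sketch (illustrative only, not the ggml kernel):

        #include <math.h>

        // cross entropy of `n` logits against target probabilities `p`:
        // sum_i p_i * (logsumexp(logits) - logits_i), avoiding an explicit softmax
        float cross_entropy(const float * logits, const float * p, int n) {
            float max = logits[0];
            for (int i = 1; i < n; ++i) { if (logits[i] > max) max = logits[i]; }
            float sum = 0.0f;
            for (int i = 0; i < n; ++i) { sum += expf(logits[i] - max); }
            const float lse = max + logf(sum);   // numerically stable log(sum_i exp(logits_i))
            float loss = 0.0f;
            for (int i = 0; i < n; ++i) { loss += p[i] * (lse - logits[i]); }
            return loss;
        }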

    * add llama_get_vocab to get the vocabulary as output parameters

    * set default model.type for unknown models with few layers

    * add export of training checkpoint to llama compatible model file

    * get vocabulary for exporting training checkpoint to llama compatible model file

    * implement backward pass of flash attention

    * bugfixes for backward pass of flash attention

    * test flash attention backward pass

    need to set loose error bounds to pass.
    the finite differences are close to numeric limits and often return quite different values than the backward pass.
    reducing eps further lets the gradients vanish completely.
    likewise, setting eps too big results in less accurate values.
    the softmax in the middle of the function is probably most responsible for the numerical issues with finite differences.

    * add option to train with flash attention and move options to the top of the main function

    training from scratch also works with flash attention.
    training convergence and generation results after a fixed number of iterations are worse than when not using flash attention.
    maybe a bug still lingers in the flash attention backward pass?
    but training works, just with slower convergence.

    flash attention is still worth using, because it requires way less memory and is faster with high n_ctx

    * add train_params and command line option parser

    * remove unnecessary comments

    * add train params to specify memory size

    * remove python bindings

    * rename baby-llama-text to train-text-from-scratch

    * replace auto parameters in lambda function

    * add #include <climits>

    * add explicit cast to fix compile error

    "error: non-constant-expression cannot be narrowed from type 'int64_t' (aka 'long long') to 'uint32_t' (aka 'unsigned int') in initializer list [-Wc++11-narrowing]"

    * remove trailing whitespace

    * add ggml_opt_resume_g which accepts forward and backward cgraphs

    * fix formulas in comments

    * bug fix for ggml_compute_forward_get_rows_back_f32

    the result should be set to zero, not to whatever data is in opt0

    * improve training memory usage with scratch buffers

    instead of relying on the automatic backward pass, we manually create the graph for the backward pass.
    it turns out that all backward pass operations need only temporary memory which can be reused after each layer.

    will compute backward pass for ALL model parameters

    * add option to use scratch buffers in training or not

    make it configurable because currently training with scratch buffers implies flash attention and optimization over all parameters.

    * ci : disable temporary

    * store view offset and permute axes in opt[0] instead of storing it in padding

    use memcpy to store the offset, because the offset is of type size_t.
    when storing it as int32_t, the offset would have to be smaller than 2^31, which is not necessarily true.
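
    The memcpy pattern referred to here looks roughly like this (a sketch with assumed names, not the actual ggml code):

        #include <stdint.h>
        #include <string.h>

        // store/load a size_t offset in an int32_t parameter array bit-for-bit,
        // so offsets >= 2^31 survive instead of being truncated by an int32_t cast
        static void store_offset(int32_t * params, size_t offset) {
            memcpy(params, &offset, sizeof(offset));   // occupies sizeof(size_t)/sizeof(int32_t) slots
        }

        static size_t load_offset(const int32_t * params) {
            size_t offset;
            memcpy(&offset, params, sizeof(offset));
            return offset;
        }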

    * minor : fix compile warnings + minor style changes

    * fix bug in threaded indices calculation of ggml_compute_forward_flash_attn_back_f32

    * store view offset like in master branch

    * bug fix in forward_batch_wo_cache_flash_attn_train

    * scratch buffer bug fixes in forward_batch_wo_cache_flash_attn_train

    data of permute and reshape is the same as their input.
    if we want to preserve the output of permute/reshape, we also need to preserve their inputs.

    replace reshape(src0, src1) with reshape_nd calls so that we don't need src1.

    replace (temporary) t03 with ggml_repeat(ctx0, layer.attention_norm, t02).
    in the future we could also use the new broadcasting ggml_mul to avoid these repeat calls.
    for this we need backward pass of broadcasting ggml_mul.

    * remove unnecessary scratch buffer 0

    buf 0 is persistent memory, so we can just disable scratch for this by using buf -1

    * avoid creating unnecessary grad tensors

    previously we needed to create grads for model parameters, so that expand(..) correctly populates cgraph->leafs & cgraph->grads.
    this wasted memory, because an unnecessary grad for each op was automatically created:
    the automatically generated grad was unnecessary because we later manually set the grad (e.g. t35->grad = expand(gb, ...)).
    this discarded the automatically generated grad, resulting in wasted memory.

    improved this by changing expand(..) to not use ggml_build_forward_expand.
    expand sets cgraph->nodes but not the leafs.
    cgraph->leafs & cgraph->grads are set in another pass after the last expand call.

    * print used training seed

    * zero initialize gfbuf and gbbuf

    * ci : re-enable workflows + add README for training

    ---------

    Co-authored-by: Georgi Gerganov <[email protected]>

commit 2347e45e7bdb09c9a7d74b2c0bc86c2b65f0c343
Author: Georgi Gerganov <[email protected]>
Date:   Tue Jun 13 20:20:07 2023 +0300

    llama : do a warm-up eval at start for better timings (#1824)

commit 74d4cfa3438cb58bd177eed30014e6588694aaa8
Author: Kerfuffle <[email protected]>
Date:   Tue Jun 13 04:23:23 2023 -0600

    Allow "quantizing" to f16 and f32 (#1787)

    * Allow "quantizing" to f16 and f32

    Fix an issue where quantizing didn't respect LLAMA_NO_K_QUANTS

    Add brief help to the list of quantization types in the quantize tool

    Ignore case for quantization type arguments in the quantize tool

commit 74a6d922f12ccfe16b0c265f43be8978c6f25e98
Author: Kawrakow <[email protected]>
Date:   Mon Jun 12 22:39:21 2023 +0300

    Metal implementation for all k_quants (#1807)

    * metal : improve q4_K

    28.3 -> 26.0 ms/token by avoiding a branch in the
    calculation of the scales.

    * metal : small improvement for Q4_K

    * metal : still optimizing Q4_K

    This commit pushes it down to 25.3 ms / token.

    The crazy idea of using 6 bits for the scales is really costly on
    Metal: if I remove the bit fiddling necessary to make the block
    scales, time goes almost to the Q4_0 23 ms/token.

    Before pushing the k-quants upstream I had a Q4_K variant that
    had used 8-bit scales. It wasn't more accurate, used 0.125 bits more per weight,
    was running slightly slower on the CPU (due to the larger model size
    and being memory bound there), and the difference was entirely
    negligible under CUDA. So, I decided to publish the version with 6-bit
    scales. Perhaps I should re-consider and change to 8-bit scales?

    * metal : some more optimizations

    Q2_K: 25.4 ms/token
    Q6_K: 27.3 ms/token
    Q4_0: 22.8 ms/token
    Q4_1: 23.1 ms/token

    * metal : Q3_K support

    Something is not quite right yet.

    * metal : Q5_K support

    Initial version achieves 31.2 ms/token, 210 GB/s

    * metal : still not able to figure out why q3_K does not work

    * Minor

    * metal : yet another failed attempt to make q3_K work

    * metal : optimize Q5_K

    31.2 ms -> 27.8 ms.
    250 GB/s.

    * metal : q3_K still not working

    Adding a heavily commented q3_K metal kernel to explain
    my obviously faulty logic. Perhaps someone could spot the issue?

    * metal : q3_K finally working

    Not optimized at all.

    What was the issue? The scales are not 4-byte aligned,
    and I was accessing them with a uint32_t pointer.
    When I tried that on CUDA, I got an error (illegal memory access)
    and added a memcpy to a local array of 3 uint32_t's.
    But on Metal it told me there is no memcpy, so I tried
    accessing directly. There is no error, just garbage results.
    At some point I did try accessing the scales with a uint16_t
    pointer (the scales are for sure 2-byte aligned), but was
    still getting garbage. I guess there must have been another bug.

    Now access to the scales is via a uint16_t pointer and, after starting
    from scratch from the C dequantize function, it finally works.
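
    The underlying issue is the classic unaligned-load problem; a sketch of the memcpy workaround used on CUDA (assumed names, not the actual kernel code):

        #include <stdint.h>
        #include <string.h>

        // `scales` is a 12-byte field that is only guaranteed to be 2-byte aligned.
        // Dereferencing it through a uint32_t* is undefined behaviour on strict-alignment
        // targets; copying the bytes out first is always safe.
        static void read_scales(const uint8_t * scales, uint32_t aux[3]) {
            // const uint32_t * p = (const uint32_t *) scales;   // unsafe: unaligned access
            memcpy(aux, scales, 12);                             // safe regardless of alignment
        }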

    * metal : Q3_K 1st optimization pass

    * metal : Q3_K second optimization pass - 29.6 ms/token

    * metal : Q3_K cleanup

    * metal : fixed accidentally broken Q2_K

    ---------

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit e4caa8da59c1c97dc23fa336f4d726984a20560f
Author: slaren <[email protected]>
Date:   Mon Jun 12 19:12:47 2023 +0200

    ci : run when changing only the CUDA sources (#1800)

commit 58970a4c39124a647ac2a640d9e178ea6c961e65
Author: Howard Su <[email protected]>
Date:   Mon Jun 12 20:44:16 2023 +0800

    Leverage mmap for offloading tensors to GPU (#1597)

    * Rebase to latest

    * Show progress

    * Add assert to make sure we only allocate temp buffer for non-CPU backend tensor

    Co-authored-by: Johannes Gäßler <[email protected]>

    ---------

    Co-authored-by: Johannes Gäßler <[email protected]>

commit 8c0a10e64dbf60fd9946c0cd5e6f59690800b123
Author: Kawrakow <[email protected]>
Date:   Mon Jun 12 14:31:36 2023 +0300

    metal : fix failure to load model (#1817)

    The number of buffers in the ggml context was left uninitialized.
    This leads to sporadic failures to load the model on
    startup. It is actually strange that the failure occurred so
    infrequently.

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit fa84c4b3e80199a5683438f062009c031a06c4fa
Author: Kerfuffle <[email protected]>
Date:   Sun Jun 11 08:19:17 2023 -0600

    Fix issue where interactive mode crashes when input exceeds ctx size (#1789)

    * Fix issue where interactive mode in the main example crashes when input exceeds ctx size

    * Ensure the context size is at least 8 tokens in the main example.

    Closes #1768

commit 12b063f0ecf280e98028e444fc492ee6222cdcdc
Author: Kyle Liang <[email protected]>
Date:   Sun Jun 11 21:20:52 2023 +0800

    Fixed WSL cuda's OOM error (#1594)

    * In the function, add the CUDA error bypass.

    * remove excessive codes and prints

    ---------

    Co-authored-by: liang <[email protected]>

commit 31d2b5f4a4bae081e59b36ab37c6ff6f5b5940ad
Author: Ryan Landay <[email protected]>
Date:   Sun Jun 11 17:38:53 2023 +0800

    Update SHA256SUMS with current hashes for models quantized using q4_0 (#1798)

commit 4de0334f5cabf4696eced2e5d6e279fdfaa6c0f2
Author: Georgi Gerganov <[email protected]>
Date:   Sat Jun 10 22:56:53 2023 +0300

    cmake : fix Metal build (close #1791)

commit 3f1223155a462477ac933474ebc4eab0ce3ca264
Author: Artyom Lebedev <[email protected]>
Date:   Sat Jun 10 22:51:36 2023 +0300

    k-quants : GCC12 compilation fix (#1792)

commit 303f5809f1b4ec49823dbe70cacd2124ec1d0df0
Author: Andrei <[email protected]>
Date:   Sat Jun 10 10:47:34 2023 -0400

    metal : fix issue with ggml-metal.metal path. Closes #1769 (#1782)

    * Fix issue with ggml-metal.metal path

    * Add ggml-metal.metal as a resource for llama target

    * Update flake.nix metal kernel substitution

commit 059e99066d95d73d1ca26c3375d47c0e35596229
Author: Aisuko <[email protected]>
Date:   Sun Jun 11 00:08:11 2023 +1000

    doc : fix wrong address of BLIS.md (#1772)

    Signed-off-by: Aisuko <[email protected]>

commit 17c10acfb44ecb7af25e37fb67b9501cbc0034d2
Author: Georgi Gerganov <[email protected]>
Date:   Sat Jun 10 12:06:45 2023 +0300

    ggml : force no_alloc == false when creating opt tensors (close #1699)

    This is needed to make operators like ggml_view() be able to store their
    parameters in the ggml context's memory and not get discarded when
    no_alloc is true

commit e9b66ee9829039d4ab54550d6222e42a0b31e52a
Author: Kawrakow <[email protected]>
Date:   Sat Jun 10 11:28:11 2023 +0300

    metal : add Q4_1 implementation (#1785)

    23.3 ms / token, so just ~1% slower than q4_0.
    Achieves 290 GB/s memory throughput.

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit 4f0154b0bad775ac4651bf73b5c216eb43c45cdc
Author: Kerfuffle <[email protected]>
Date:   Sat Jun 10 01:59:17 2023 -0600

    llama : support requantizing models instead of only allowing quantization from 16/32bit (#1691)

    * Add support for quantizing already quantized models

    * Threaded dequantizing and f16 to f32 conversion

    * Clean up thread blocks with spares calculation a bit

    * Use std::runtime_error exceptions.

commit ef3171d16241c18581d4d08374f0b9e396ade6b7
Author: Xingchen Song(宋星辰) <[email protected]>
Date:   Sat Jun 10 15:49:40 2023 +0800

    ggml : workaround for missing _mm256_setr_m128i in GCC < 8 (#1638)

commit 555275a693843273759230547001f9ae07fb537e
Author: rankaiyx <[email protected]>
Date:   Sat Jun 10 14:41:59 2023 +0800

    make : add SSSE3 compilation use case (#1659)

commit 98ed16557432d7a5179c57eddcc3a08a7ae6d54d
Author: Robert Sung-wook Shin <[email protected]>
Date:   Sat Jun 10 01:24:40 2023 +0900

    OpenCL: Add release memory (#1741)

    * Add opencl release memory

    * Rename function name

commit ae9663f1887513e152839e91f61c513075a19422
Author: Johannes Gäßler <[email protected]>
Date:   Fri Jun 9 13:58:15 2023 +0200

    Windows nvcc workaround (#1753)

    Fix gibberish output on Windows when using CUDA

commit b33dee282f5d8032b5f780152732dc45cbf2d349
Author: Georgi Gerganov <[email protected]>
Date:   Fri Jun 9 11:11:04 2023 +0300

    metal : fix build "tanhf" -> "tanh"

commit 92f44ff7f778ef1b94028b2ba6d39943b5ca0ada
Author: AT <[email protected]>
Date:   Fri Jun 9 04:00:51 2023 -0400

    metal : add GELU implementation (#1770)

    Co-authored-by: Adam Treat <[email protected]>

commit 245fc3c37da5ac5963f9f11a9f4f2ac08d96afc6
Author: Kawrakow <[email protected]>
Date:   Fri Jun 9 10:39:59 2023 +0300

    metal : faster q4_0 (#1775)

    * metal : 8% faster q4_0

    Avoid copying into local uchar4 and float4.

    * metal : 17% faster Q4_0

    Use 64 threads in a thread group.

    ---------

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit 72ff5282bf0388c60821f504c4c8cc2b1f491aa6
Author: Kawrakow <[email protected]>
Date:   Thu Jun 8 22:28:21 2023 +0300

    metal : add Q2_K implementation (#1762)

    * metal : add Q2_K implementation

    27.1 ms / token on M2 Max 30-core GPU, so about the
    same speed as Q4_0. Memory throughput is ~156 GB/s.

    The access pattern used in the Q2_K
    CUDA implementation resulted in significantly lower
    performance (~31 ms/token).

    * Fixing merge conflicts

    ---------

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit 0bf7cf1b296fc9fca05411b37afdf08a531487d2
Author: Georgi Gerganov <[email protected]>
Date:   Thu Jun 8 20:48:14 2023 +0300

    Revert "ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)"

    This reverts commit 8432d4d9f716b25133e3ed671d91e21f6f3be867.

commit 8432d4d9f716b25133e3ed671d91e21f6f3be867
Author: le.chang <[email protected]>
Date:   Fri Jun 9 00:47:56 2023 +0800

    ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)

commit 0f291e1f65c1d68201e71ce99c89562a36686b6d
Author: Kawrakow <[email protected]>
Date:   Thu Jun 8 19:46:22 2023 +0300

    metal : Q6_K implementation (#1752)

    * Metal implementation for Q4_K

    Very slow for now:
    42 ms / token, Q4_0 runs in 28 ms/token on my
    30-core M2 Max GPU.

    * Optimizing Q4_K on metal

    The first token always takes longer, I guess because
    the metal kernel is being jit-compiled.
    So, using n = 128 to measure time.

    At this point Q4_K takes 29.5 ms / token
    compared to 27.2 ms / token for Q4_0.
    Quite a bit better than the initial attempt,
    but still not good enough.

    * Optimizing q4_K metal dot some more

    For n = 256 it is now 28.1 ms/token compared to
    27 ms/token for q4_0.

    * Fix after merge with master

    * Metal implementation for Q6_K

    Similar to the CUDA implementation.
    No idea if this is the optimum for Metal, but the few
    alternative variants I tried all had a lower performance.

    We get 36.5 ms / token on M2 Max with 30 GPU cores.
    This corresponds to ~200 GB/second throughput.

    * clang-tidy : add config back

    * Much better Q6_K implementation for metal

    28.3 ms / token for 7B. Subtracting ~9 ms that is spent in
    other compute graph operations, we are left with ~19 ms
    for the matrix multiplications. The model is ~5.5 GB,
    so we are getting 1000 / 19 * 5.5 = 290 GB/s!

    ---------

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit 8fc8179919a11738910db07a800f2b176f8adf09
Author: qingfengfenga <[email protected]>
Date:   Thu Jun 8 15:58:53 2023 +0800

    Add llama.cpp docker support for non-latin languages (#1673)

    * Modify Dockerfile default character set to improve compatibility (#1673)

commit b50b570ed9d699d3d126d72fc02de92926bcd937
Author: Steven Roussey <[email protected]>
Date:   Thu Jun 8 00:12:28 2023 -0700

    ggml : fix fprintf warnings (#1720)

commit 53aba3f393f2e02a78ddaba2e934893a8bbf3246
Author: Georgi Gerganov <[email protected]>
Date:   Thu Jun 8 10:09:08 2023 +0300

    clang-tidy : restore dot file from accidental deletion

commit 4161bdc04debb70bf5f275492b4d89fd9330087c
Author: Kawrakow <[email protected]>
Date:   Thu Jun 8 10:08:23 2023 +0300

    metal : add Q4_K implementation (#1733)

    * Metal implementation for Q4_K

    Very slow for now:
    42 ms / token, Q4_0 runs in 28 ms/token on my
    30-core M2 Max GPU.

    * Optimizing Q4_K on metal

    The first token always takes longer, I guess because
    the metal kernel is being jit-compiled.
    So, using n = 128 to measure time.

    At this point Q4_K takes 29.5 ms / token
    compared to 27.2 ms / token for Q4_0.
    Quite a bit better than the initial attempt,
    but still not good enough.

    * Optimizing q4_K metal dot some more

    For n = 256 it is now 28.1 ms/token compared to
    27 ms/token for q4_0.

    * Fix after merge with master

    ---------

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit 0035858273ebe0694926bf4414d279f3e1cd109d
Author: johnson442 <[email protected]>
Date:   Thu Jun 8 08:02:48 2023 +0100

    k-quants : add missing compile definition to CMakeLists (#1748)

commit 5c64a0952ee58b2d742ee84e8e3d43cce5d366db
Author: Georgi Gerganov <[email protected]>
Date:   Wed Jun 7 10:59:52 2023 +0300

    k-quants : allow to optionally disable at compile time (#1734)

    * k-quants : put behind optional compile flag LLAMA_K_QUANTS

    * build : enable k-quants by default

commit 5b57a5b72676540b6a45a3f527126299969ad241
Author: jacobi petrucciani <[email protected]>
Date:   Wed Jun 7 00:15:31 2023 -0400

    flake : update to support metal on m1/m2 (#1724)

commit 4dc62c545df0af60635d579e9e4dd91bc5afff51
Author: Georgi Gerganov <[email protected]>
Date:   Wed Jun 7 07:15:08 2023 +0300

    readme : add June roadmap

commit 35a84916fb029905c44746127026079268216e7a
Author: Willy Tarreau <[email protected]>
Date:   Wed Jun 7 04:10:17 2023 +0200

    main: add the possibility to open the prompt cache read-only (#1640)

    The prompt cache constitutes a nice speed up when using the same prompt
    prefix across multiple evaluations, but when using it, it will also be
    updated, which is not always desirable. One use case is to have a large
    prompt containing some context and usage rules, and a second part
    containing variable data of the problem being studied. In this case it's
    desirable to be able to save the first part once, and to always reuse it
    as-is without updating it with the second part.

    The new argument --prompt-cache-ro enables this read-only mode on the
    prompt cache. The prompt's contents that match the cache are loaded
    from the cache but the rest is not modified. This allowed reducing a
    total analysis time from 112s to 49.7s here, without having to back up
    and restore a copy of the prompt, which takes significant time at 500
    MB.

    Signed-off-by: Willy Tarreau <[email protected]>

commit 2d7bf110edd8c49209401a16132052cba706ffd0
Author: Georgi Gerganov <[email protected]>
Date:   Tue Jun 6 22:54:39 2023 +0300

    llama : fix vram_scratch var

commit 2a4e41a086ce80da68c402457c75c77e52dcc698
Author: Georgi Gerganov <[email protected]>
Date:   Tue Jun 6 22:41:53 2023 +0300

    llama : fix compile warnings

commit 17366df842e358768c0df7024484fffecfc7865b
Author: Johannes Gäßler <[email protected]>
Date:   Tue Jun 6 21:33:23 2023 +0200

    Multi GPU support, CUDA refactor, CUDA scratch buffer (#1703)

    * CUDA multi GPU + scratch

    ggml_cuda_compute_forward

    Tensor parallelism

    ggml_cuda_add

    ggml_cuda_rms_norm

    ggml_cuda_silu

    CUDA scratch buffer

    --main-gpu CLI option

commit 44f906e8537fcec965e312d621c80556d6aa9bec
Author: Georgi Gerganov <[email protected]>
Date:   Tue Jun 6 20:16:57 2023 +0300

    metal : add f16 support

commit d5b111f53d14972669eb52055f9df2567663ad8b
Author: LostRuins <[email protected]>
Date:   Wed Jun 7 01:00:01 2023 +0800

    Clblast fixes + enhancements to save VRAM and offload more layers (#1675)

    * Use events instead of clFinish, where possible

    * OpenCL: Don't load gpu layers into RAM, add mul_f32 kernel

    * Reduce queueing overhead for contiguous tensors by using single mul kernel call

    * Adapt to #1612 cl_mem malloc changes

    * Reduce code duplication between cuda and opencl branches

    * Improve implementation

    * Clblast fixes + enhancements to save VRAM:

    1. Change all Clblast buffers to CL_MEM_READ_WRITE, as the pool malloc currently doesn't properly handle them.
    2. When recycling buffers in pool malloc, always assign the SMALLEST available buffer that fits, instead of the FIRST available buffer (a best-fit rule; a sketch follows this commit message).
    3. When failing to recycle a buffer in pool malloc (all are too small), instead recycle the largest available free buffer by resizing it.

    * change max value size_t to use limits

    * removed flags from the CL pool malloc, apply code tidying suggestions.
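
    A minimal best-fit sketch of the recycling rule from point 2 (assumed field names, not the actual OpenCL pool code):

        #include <stddef.h>

        struct cl_buf { size_t size; int in_use; };   // assumed pool entry layout

        // return the SMALLEST free buffer that fits `need`, or -1 if none fits
        // (the caller then resizes the largest free buffer instead, as in point 3)
        static int pick_buffer(const struct cl_buf * pool, int n, size_t need) {
            int best = -1;
            for (int i = 0; i < n; ++i) {
                if (pool[i].in_use || pool[i].size < need) continue;
                if (best == -1 || pool[i].size < pool[best].size) best = i;
            }
            return best;
        }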

commit 2d43387dafe9c60f15f57aa23ee0b37864b98b32
Author: Georgi Gerganov <ggerga…
@@ -262,6 +262,10 @@ These options help improve the performance and memory usage of the LLaMA models.

- `--no-mmap`: Do not memory-map the model. By default, models are mapped into memory, which allows the system to load only the necessary parts of the model as needed. However, if the model is larger than your total amount of RAM or if your system is low on available memory, using mmap might increase the risk of pageouts, negatively impacting performance. Disabling mmap results in slower load times but may reduce pageouts if you're not using `--mlock`. Note that if the model is larger than the total amount of RAM, turning off mmap would prevent the model from loading at all.

### NUMA support

- `--numa`: Attempt optimizations that help on some systems with non-uniform memory access. This currently consists of pinning an equal proportion of the threads to the cores on each NUMA node, and disabling prefetch and readahead for mmap. The latter causes mapped pages to be faulted in on first access instead of all at once, and in combination with pinning threads to NUMA nodes, more of the pages end up on the NUMA node where they are used. Note that if the model is already in the system page cache, for example because of a previous run without this option, this will have little effect unless you drop the page cache first. This can be done by rebooting the system or on Linux by writing '3' to '/proc/sys/vm/drop\_caches' as root.
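
As a rough illustration of the two mechanisms behind `--numa` (a Linux-specific sketch with assumed names, not the code in this PR):

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <pthread.h>
    #include <sched.h>
    #include <sys/mman.h>

    // 1) map the model without prefetch/readahead, so pages fault in on first touch
    static void * map_model(int fd, size_t size) {
        void * addr = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0); // no MAP_POPULATE
        madvise(addr, size, MADV_RANDOM);                             // disable kernel readahead
        return addr;
    }

    // 2) pin the calling thread to the cores of one NUMA node, so its first touches
    //    allocate pages in that node's local memory
    static void pin_to_cpus(const int * cpus, int n_cpus) {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int i = 0; i < n_cpus; ++i) {
            CPU_SET(cpus[i], &set);
        }
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

Each worker thread would call something like `pin_to_cpus()` with its node's core list before touching its share of the weights, so the faulted-in pages end up on the node where they are read.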

Hi, sorry for this comment. I have limited experience with NUMA nodes and with process placement optimizations that exploit a NUMA node's memory access locality, but I am sure such locality helps to improve end-to-end latencies.

I want to understand exactly what optimization this PR makes.

This currently consists of pinning an equal proportion of the threads to the cores on each NUMA node, and disabling prefetch and readahead for mmap.

  • " pinning an equal proportion of the threads to the cores on each NUMA node"

which part does this job ?

  • "disabling prefetch and readahead for mmap"

When I/O and compute overlap, disabling prefetch surely hurts latency and throughput. But why is this related to NUMA nodes?

Do we use mmap to implement pinned memory allocation for the stream/weight inputs?

I am a little confused when you talk about pages (paged memory) and mmap (which maps memory from a device into DRAM, ready to be read by DMA).

Thank you if you can help clarify my concerns. Happy to learn about your solutions.

if (fptr != NULL) {
    char buf[42];
    if (fgets(buf, sizeof(buf), fptr) && strncmp(buf, "0\n", sizeof(buf)) != 0) {
        GGML_PRINT("WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to impair performance\n");

May I know the details of why this hurts performance?

    node->perf_cycles  += cycles_cur;
    node->perf_time_us += time_us_cur;
}

static thread_ret_t ggml_graph_compute_thread(void * data) {

I am new to llama.cpp. It seems that this function schedules a forward computation on a thread (I assume the task dependencies are resolved outside). Is there a big-picture description of how this function works?

How should I understand it, and how does it relate to the NUMA-related updates?

size = file->size;
int fd = fileno(file->fp);
int flags = MAP_SHARED;
// prefetch/readahead impairs performance on NUMA systems
if (numa) { prefetch = 0; }

Again: I want to understand why prefetch hurts performance with NUMA nodes (with paged memory or pinned memory)?

@yiakwy-xpu-ml-framework-team

Hi @zrm and @ggerganov, sorry for the comments. I am really interested in and curious about the improvements, and I am new to llama.cpp. Are there any procedures I can follow to reproduce the before/after of this PR on an ARM Cortex machine?
