llama : quantize up to 31% faster on Linux with mmap #3206

Merged
merged 3 commits into ggerganov:master on Sep 29, 2023

Conversation

cebtenzzre
Collaborator

This is a follow-up to #3115. It enables mmap for quantize on Linux, since no one seems to have reported a performance decrease on that platform. Windows has not been tested, and macOS has seen both a speed-up and a slow-down.

@bobqianic
Contributor

How does incorporating mmap lead to faster quantization? Can anyone explain?

@cebtenzzre
Collaborator Author

When the quantize tool reads from disk, it normally has to load a whole tensor into memory before it can start converting it to f32 and quantizing it. This change allows the input tensor to be paged in on demand in 4096-byte chunks, so it can be read and converted simultaneously.
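For intuition, here is a minimal standalone sketch (not the actual llama.cpp code) contrasting the two approaches on POSIX: the read() path must copy the whole file into a buffer before any work can start, while the mmap() path lets the kernel fault pages in as the processing loop touches them, so I/O and conversion overlap. The byte-summing process() is an illustrative stand-in for the real f32 conversion and quantization.

// sketch.cpp - illustrative only; names and structure are not from llama.cpp.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstdint>
#include <cstdio>
#include <vector>

// Stand-in for "convert to f32 and quantize": just sum the bytes.
static uint64_t process(const uint8_t * data, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) sum += data[i];
    return sum;
}

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
    const size_t size = (size_t) st.st_size;

    // read() path: the whole "tensor" must land in a buffer before any
    // processing can begin.
    std::vector<uint8_t> buf(size);
    size_t off = 0;
    while (off < size) {
        ssize_t n = read(fd, buf.data() + off, size - off);
        if (n <= 0) { perror("read"); return 1; }
        off += (size_t) n;
    }
    printf("read(): %llu\n", (unsigned long long) process(buf.data(), size));

    // mmap() path: no up-front copy; the kernel pages data in (typically in
    // 4096-byte chunks) as process() touches it, so I/O and conversion overlap.
    void * p = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    printf("mmap(): %llu\n", (unsigned long long) process((const uint8_t *) p, size));

    munmap(p, size);
    close(fd);
    return 0;
}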

@slaren
Collaborator

slaren commented Sep 23, 2023

I tested this with 7B f16 to q4_0 on Windows and got ~15% faster times with mmap when the model is cached, and no difference when it is not cached. Under WSL2, mmap is consistently ~35% faster, cached or uncached. So I think mmap can be enabled on Windows too.

llama.cpp Outdated
std::unique_ptr<llama_model_loader> ml(new llama_model_loader(fname_inp, /*use_mmap*/ false));
// mmap consistently increases speed on Linux, is inconsistent on macOS
// (possibly related to free memory), and has not been tested on Windows.
#ifdef __linux__
Collaborator


Suggested change
#ifdef __linux__
#if defined(__linux__) || defined(_WIN32)
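
For clarity, this is how the guarded loader call above would plausibly read with the suggestion applied (a sketch assuming the guard simply selects the use_mmap flag; the actual merged change may differ):

// Assumed final shape of the code, based on the diff above.
// mmap consistently increases speed on Linux (and, per the benchmarks above,
// on Windows); it is inconsistent on macOS, possibly related to free memory.
#if defined(__linux__) || defined(_WIN32)
    constexpr bool use_mmap = true;
#else
    constexpr bool use_mmap = false;
#endif
    std::unique_ptr<llama_model_loader> ml(new llama_model_loader(fname_inp, use_mmap));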

@ggerganov
Owner

Let me run a few tests this week and we can merge.

@ggerganov
Owner

On M1 Pro with 32GB, quantizing 13B with mmap enabled is ~2x slower, so let's leave mmap off on Mac until we figure out something that always improves performance, regardless of model size.

@ggerganov ggerganov merged commit 2777a84 into ggerganov:master Sep 29, 2023
31 of 32 checks passed
joelkuiper added a commit to vortext/llama.cpp that referenced this pull request Oct 2, 2023
…example

* 'master' of github.com:ggerganov/llama.cpp:
  ggml-cuda : perform cublas mat mul of quantized types as f16 (ggerganov#3412)
  llama.cpp : add documentation about rope_freq_base and scale values (ggerganov#3401)
  train : fix KQ_pos allocation (ggerganov#3392)
  llama : quantize up to 31% faster on Linux and Windows with mmap (ggerganov#3206)
  readme : update hot topics + model links (ggerganov#3399)
  readme : add link to grammars app (ggerganov#3388)
  swift : fix build on xcode 15 (ggerganov#3387)
  build : enable more non-default compiler warnings (ggerganov#3200)
  ggml_tensor: update the structure comments. (ggerganov#3283)
  ggml : release the requested thread pool resource (ggerganov#3292)
  llama.cpp : split llama_context_params into model and context params (ggerganov#3301)
  ci : multithreaded builds (ggerganov#3311)
  train : finetune LORA (ggerganov#2632)
  gguf : basic type checking in gguf_get_* (ggerganov#3346)
  gguf : make token scores and types optional (ggerganov#3347)
  ci : disable freeBSD builds due to lack of VMs (ggerganov#3381)
  llama : custom attention mask + parallel decoding + no context swaps (ggerganov#3228)
  docs : mark code as Bash (ggerganov#3375)
  readme : add Mistral AI release 0.1 (ggerganov#3362)
  ggml-cuda : perform cublas fp16 matrix multiplication as fp16 (ggerganov#3370)
yusiwen pushed a commit to yusiwen/llama.cpp that referenced this pull request Oct 7, 2023
…rganov#3206)

* llama : enable mmap in quantize on Linux -> 31% faster

* also enable mmap on Windows

---------

Co-authored-by: Georgi Gerganov <[email protected]>