llama : quantize up to 31% faster on Linux with mmap #3206

Merged
merged 3 commits into ggerganov:master on Sep 29, 2023

Conversation

cebtenzzre
Collaborator

This is a follow-up to #3115. It enables mmap for quantize on Linux, since no one seems to have reported a performance decrease on that platform. Windows has not been tested, and macOS has seen both a speed-up and a slow-down.

@bobqianic
Contributor

How does incorporating mmap lead to faster quantization? Can anyone explain?

@cebtenzzre
Collaborator Author

When the quantize tool reads from disk, it normally has to load a whole tensor into memory before it can start converting it to f32 and quantizing it. This change allows the input tensor to be paged in on demand in 4096-byte chunks, so it can be read and converted simultaneously.
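For intuition, here is a minimal standalone sketch (not the actual llama.cpp code) contrasting the two approaches on POSIX: the read() path must copy the whole file into a buffer before any work can start, while the mmap() path lets the kernel fault pages in as the processing loop touches them, so I/O and conversion overlap. The byte-summing process() is an illustrative stand-in for the real f32 conversion and quantization.

// sketch.cpp - illustrative only; names and structure are not from llama.cpp.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstdint>
#include <cstdio>
#include <vector>

// Stand-in for "convert to f32 and quantize": just sum the bytes.
static uint64_t process(const uint8_t * data, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) sum += data[i];
    return sum;
}

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
    const size_t size = (size_t) st.st_size;

    // read() path: the whole "tensor" must land in a buffer before any
    // processing can begin.
    std::vector<uint8_t> buf(size);
    size_t off = 0;
    while (off < size) {
        ssize_t n = read(fd, buf.data() + off, size - off);
        if (n <= 0) { perror("read"); return 1; }
        off += (size_t) n;
    }
    printf("read(): %llu\n", (unsigned long long) process(buf.data(), size));

    // mmap() path: no up-front copy; the kernel pages data in (typically in
    // 4096-byte chunks) as process() touches it, so I/O and conversion overlap.
    void * p = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    printf("mmap(): %llu\n", (unsigned long long) process((const uint8_t *) p, size));

    munmap(p, size);
    close(fd);
    return 0;
}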

@slaren
Collaborator

slaren commented Sep 23, 2023

I tested this with 7B f16 to q4_0 on Windows and got ~15% faster times with mmap when the model is cached, and no difference when it is not cached. Under WSL2, mmap is consistently ~35% faster, cached or uncached. So I think mmap can be enabled on Windows too.

llama.cpp Outdated
std::unique_ptr<llama_model_loader> ml(new llama_model_loader(fname_inp, /*use_mmap*/ false));
// mmap consistently increases speed on Linux, is inconsistent on macOS
// (possibly related to free memory), and has not been tested on Windows.
#ifdef __linux__
Collaborator


Suggested change
#ifdef __linux__
#if defined(__linux__) || defined(_WIN32)
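
For clarity, this is how the guarded loader call above would plausibly read with the suggestion applied (a sketch assuming the guard simply selects the use_mmap flag; the actual merged change may differ):

// Assumed final shape of the code, based on the diff above.
// mmap consistently increases speed on Linux (and, per the benchmarks above,
// on Windows); it is inconsistent on macOS, possibly related to free memory.
#if defined(__linux__) || defined(_WIN32)
    constexpr bool use_mmap = true;
#else
    constexpr bool use_mmap = false;
#endif
    std::unique_ptr<llama_model_loader> ml(new llama_model_loader(fname_inp, use_mmap));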

@ggerganov
Owner

Let me run a few tests this week and we can merge.

@ggerganov
Owner

On M1 Pro with 32GB, quantizing 13B with mmap enabled is ~2x slower, so let's leave mmap off on Mac until we figure out something that always improves performance, regardless of model size.

@ggerganov ggerganov merged commit 2777a84 into ggerganov:master Sep 29, 2023
31 of 32 checks passed
joelkuiper added a commit to vortext/llama.cpp that referenced this pull request Oct 2, 2023
…example

* 'master' of github.com:ggerganov/llama.cpp:
  ggml-cuda : perform cublas mat mul of quantized types as f16 (ggerganov#3412)
  llama.cpp : add documentation about rope_freq_base and scale values (ggerganov#3401)
  train : fix KQ_pos allocation (ggerganov#3392)
  llama : quantize up to 31% faster on Linux and Windows with mmap (ggerganov#3206)
  readme : update hot topics + model links (ggerganov#3399)
  readme : add link to grammars app (ggerganov#3388)
  swift : fix build on xcode 15 (ggerganov#3387)
  build : enable more non-default compiler warnings (ggerganov#3200)
  ggml_tensor: update the structure comments. (ggerganov#3283)
  ggml : release the requested thread pool resource (ggerganov#3292)
  llama.cpp : split llama_context_params into model and context params (ggerganov#3301)
  ci : multithreaded builds (ggerganov#3311)
  train : finetune LORA (ggerganov#2632)
  gguf : basic type checking in gguf_get_* (ggerganov#3346)
  gguf : make token scores and types optional (ggerganov#3347)
  ci : disable freeBSD builds due to lack of VMs (ggerganov#3381)
  llama : custom attention mask + parallel decoding + no context swaps (ggerganov#3228)
  docs : mark code as Bash (ggerganov#3375)
  readme : add Mistral AI release 0.1 (ggerganov#3362)
  ggml-cuda : perform cublas fp16 matrix multiplication as fp16 (ggerganov#3370)
yusiwen pushed a commit to yusiwen/llama.cpp that referenced this pull request Oct 7, 2023
…rganov#3206)

* llama : enable mmap in quantize on Linux -> 31% faster

* also enable mmap on Windows

---------

Co-authored-by: Georgi Gerganov <[email protected]>