RAM Misuse on CPU-Only setup / Very Slow #85

Closed

jacmkno opened this issue Aug 11, 2023 · 6 comments


jacmkno commented Aug 11, 2023

Running on Colab with CPU only works, but I happened to notice that RAM is not being used. It seems the library is doing a lot of disk I/O instead of keeping the whole model in RAM, as happens with the standard GPU setup with Hugging Face transformers. As a consequence, inference is extremely slow.

I'm trying to pin down whether the issue lies in llama.cpp, GGML, or ctransformers.

Here is the Colab: https://colab.research.google.com/drive/1iGifBXEaXI2JDbJG1Il7BAS8gqVWm8kR?usp=sharing
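
(For context, the notebook loads the model roughly like this; a minimal sketch where the repo id and file name are placeholders, not taken from the notebook:)

from ctransformers import AutoModelForCausalLM

# Any 4-bit quantized GGML LLaMA file shows the same behaviour on a CPU-only runtime.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GGML",               # placeholder repo id
    model_file="llama-2-7b.ggmlv3.q4_0.bin",  # placeholder file name
    model_type="llama",
)
print(llm("AI is going to"))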


willb0 commented Aug 13, 2023

Also seeing this when working with a GGML LLaMA model quantized to 4-bit, loaded with ctransformers in an ipynb. When running llama.cpp there is a --mlock parameter to keep the whole model loaded in memory; it would be nice to know if that is possible with ctransformers.


willb0 commented Aug 13, 2023

From what I can tell, we would have to add mlock as a parameter when we call self._llm = self._lib.ctransformers_llm_create, and add it to the load_library loading function as well.
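
(An illustrative sketch only of where such an option would live; the actual Config fields and the ctransformers_llm_create signature may differ:)

from dataclasses import dataclass

# Hypothetical: a public config field mirroring llama.cpp's --mlock, which
# would then need to be forwarded to the native ctransformers_llm_create call.
@dataclass
class Config:
    mlock: bool = False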

marella (Owner) commented Aug 15, 2023

I will add mmap and mlock options in the next release. I tested different combinations of mmap and mlock but didn't notice any performance improvement on my machine or on Google Colab.
Note that the Colab CPU runtime only has 2 cores, so inference will be slow. It runs ~3x faster on my machine, which has 6 cores, with the same settings. On Colab, it is better to use the GPU.
Also, llm(prompt) only returns output after the entire text is generated. Try stream=True to see output as each token is generated:

for text in llm(prompt, stream=True):
    print(text, end='', flush=True)
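
(On the core-count point, a minimal sketch, assuming a local GGML file path, of pinning the thread count via the threads config parameter:)

from ctransformers import AutoModelForCausalLM

# threads controls the number of CPU threads used for inference;
# matching it to the number of physical cores usually helps.
llm = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-7b.ggmlv3.q4_0.bin",  # placeholder local path
    model_type="llama",
    threads=6,
)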


willb0 commented Aug 17, 2023

Thank you @marella!!
I noticed improved inference on a Mac M1 (8 GB) using a GGML quantized LLaMA 7B, but that wasn't with your release, just on my fork where I changed mlock.

marella (Owner) commented Aug 20, 2023

Added mmap and mlock parameters for LLaMA and Falcon models in the latest version 0.2.23
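
(A minimal usage sketch, assuming the new options are passed the same way as the other config parameters; the model path is a placeholder:)

from ctransformers import AutoModelForCausalLM

# mlock keeps the model pages resident in RAM; mmap controls whether the file
# is memory-mapped at all. Both mirror the corresponding llama.cpp options.
llm = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-7b.ggmlv3.q4_0.bin",  # placeholder local path
    model_type="llama",
    mmap=True,
    mlock=True,
)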

marella closed this as completed Aug 20, 2023

baylitoo commented Aug 21, 2023

How can I use mmap and mlock when working with LangChain and creating the model using CTransformers(path, config)?

No need to reply, I found it.
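
(For anyone else landing here, a minimal sketch of what this likely looks like with the LangChain CTransformers wrapper, assuming its config dict maps onto the ctransformers Config fields added in 0.2.23; the model path is a placeholder:)

from langchain.llms import CTransformers

# config keys are forwarded to ctransformers; mmap/mlock were added in 0.2.23.
llm = CTransformers(
    model="path/to/llama-7b.ggmlv3.q4_0.bin",  # placeholder local path
    model_type="llama",
    config={"mmap": True, "mlock": True},
)
print(llm("AI is going to"))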
