
Keeping the model loaded on RAM #30

Open
regstuff opened this issue Mar 1, 2023 · 7 comments

regstuff commented Mar 1, 2023

Is there a way to keep the model loaded in RAM between successive runs? I have an API-like setup, and every time a prompt comes in, the model has to be loaded into RAM again, which takes a while for GPT-J.
I'm using Python and basically just running the ./bin/gpt-j command via subprocess.Popen.

jishnudg commented Mar 1, 2023

To run it like a local service, you mean? Yeah, I'd love to be able to do this too...

umhau commented Mar 2, 2023

"All" you have to do is modify the for loop here: https://github.com/ggerganov/ggml/blob/master/examples/gpt-j/main.cpp#L649. Instead of breaking when you hit the end of text token or the max token count, wait for user input.

biemster commented Mar 2, 2023

@umhau That will just add extra input, right? I think the idea is (at least in my use case) to be able to start a new prompt without reloading the whole model.

umhau commented Mar 3, 2023

@biemster In that case, this section gives the arguments for the gptj_eval function. It looks like you should be able to reset the n_past, embd_inp, and embd_w variables to the starting values when you're done with each prompt. Then it's a matter of modifying the for loop so it clears the vars when it's done, and waits for your next round of input.
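Something like the following could wrap the existing generation code. This is a sketch under the assumption that gptj_eval and gpt_tokenize keep the signatures used in the example, so treat the parameter names as placeholders:

```cpp
// hypothetical outer loop around the existing generation code in main.cpp;
// the model is loaded once, and only the per-prompt state is reset each round
std::string prompt;
printf("prompt> ");
while (std::getline(std::cin, prompt) && !prompt.empty()) {
    int n_past = 0;                                       // reset the context position
    size_t mem_per_token = 0;
    std::vector<gpt_vocab::id> embd_inp = ::gpt_tokenize(vocab, prompt);
    std::vector<gpt_vocab::id> embd;                      // tokens fed to the model this step
    std::vector<float> embd_w;                            // logits returned by gptj_eval

    // ... run the existing for-loop here unchanged: feed `embd` through
    //     gptj_eval(model, params.n_threads, n_past, embd, embd_w, mem_per_token),
    //     sample the next token, and stop on end-of-text or n_predict ...

    printf("\nprompt> ");                                 // model weights stay in RAM
}
```

The key point is that the model load happens once before this loop, so each new prompt only costs tokenization and evaluation.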

@mallorbc

Did anyone do this? Running this as a service would be great.

@apaz-cli

The mmap()/mlock() changes in llama.cpp should be applicable here.
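For reference, a minimal standalone sketch of the mmap()+mlock() idea on POSIX systems. This is not the llama.cpp implementation, just an illustration of mapping a model file once and pinning it so the pages stay resident between requests:

```cpp
// Sketch: map a model file read-only and pin it in RAM so repeated runs
// reuse the already-resident pages instead of re-reading from disk.
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.bin\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    // map the file read-only; pages are backed by the kernel page cache
    void * addr = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }

    // pin the mapping so the kernel keeps the weights resident
    // (may require raising RLIMIT_MEMLOCK)
    if (mlock(addr, st.st_size) != 0) { perror("mlock"); }

    printf("mapped %lld bytes; pages stay resident until munmap/exit\n",
           (long long) st.st_size);

    // ... hand `addr` to the model loader / run inference here ...

    munmap(addr, st.st_size);
    close(fd);
    return 0;
}
```

Because the mapping is backed by the page cache, even separate processes that open the same model file can reuse pages that are already in memory, which is what helps the "reload on every run" case described above.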

jacmkno commented Aug 11, 2023

Inference is painfully slow on a CPU-only setup, and it seems to be because of this issue.

I'm using ctransformers, which I believe uses this library to run the models, and I found that the model is not fully loaded into memory during CPU-only inference. I then noticed this item under GGML's features in the README: "Zero memory allocations during runtime".

Same issue with another independent library using llama.cpp: https://github.com/abetlen/llama-cpp-python

It seems to me that this library is probably to blame.

Questions:

  1. Does this mean GGML indeed never loads the whole model into RAM? That seems like a waste for users with plenty of RAM.

  2. Is the library just doing lots of reads to disk during inference?

  3. Is there an easy option to let the library load the whole model into RAM and keep it there?

This thread makes it sound like the whole model is loaded into RAM and then unloaded after inference, but that does not seem to be the case, since I have a 3 GB model and RAM usage never goes above 1.2 GB.

Here is the Colab to test it (keep an eye on resources during inference):
https://colab.research.google.com/drive/1iGifBXEaXI2JDbJG1Il7BAS8gqVWm8kR
