
Keeping the model loaded on RAM #30

Open
regstuff opened this issue Mar 1, 2023 · 7 comments

regstuff commented Mar 1, 2023

Is there a way to keep the model loaded in RAM between successive runs? I have an API-like setup, and every time a prompt comes in, the model has to be loaded into RAM again, which takes a while for GPT-J.
I'm using Python and basically just running the ./bin/gpt-j command via subprocess.Popen.

jishnudg commented Mar 1, 2023

To run it like a local service, you mean? Yeah, I'd love to be able to do this too...

umhau commented Mar 2, 2023

"All" you have to do is modify the for loop here: https://github.com/ggerganov/ggml/blob/master/examples/gpt-j/main.cpp#L649. Instead of breaking when you hit the end of text token or the max token count, wait for user input.

biemster commented Mar 2, 2023

@umhau That will just add extra input, right? I think the idea is (at least in my use case) to be able to start a new prompt without reloading the whole model.

umhau commented Mar 3, 2023

@biemster In that case, this section gives the arguments for the gptj_eval function. It looks like you should be able to reset the n_past, embd_inp, and embd_w variables to the starting values when you're done with each prompt. Then it's a matter of modifying the for loop so it clears the vars when it's done, and waits for your next round of input.
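Something like the following could wrap the existing generation code. This is a sketch under the assumption that gptj_eval and gpt_tokenize keep the signatures used in the example, so treat the parameter names as placeholders:

```cpp
// hypothetical outer loop around the existing generation code in main.cpp;
// the model is loaded once, and only the per-prompt state is reset each round
std::string prompt;
printf("prompt> ");
while (std::getline(std::cin, prompt) && !prompt.empty()) {
    int n_past = 0;                                       // reset the context position
    size_t mem_per_token = 0;
    std::vector<gpt_vocab::id> embd_inp = ::gpt_tokenize(vocab, prompt);
    std::vector<gpt_vocab::id> embd;                      // tokens fed to the model this step
    std::vector<float> embd_w;                            // logits returned by gptj_eval

    // ... run the existing for-loop here unchanged: feed `embd` through
    //     gptj_eval(model, params.n_threads, n_past, embd, embd_w, mem_per_token),
    //     sample the next token, and stop on end-of-text or n_predict ...

    printf("\nprompt> ");                                 // model weights stay in RAM
}
```

The key point is that the model load happens once before this loop, so each new prompt only costs tokenization and evaluation.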

@mallorbc

Did anyone do this? Running this as a service would be great.

@apaz-cli

The mmap()/mlock() changes in llama.cpp should be applicable here.
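For reference, a minimal standalone sketch of the mmap()+mlock() idea on POSIX systems. This is not the llama.cpp implementation, just an illustration of mapping a model file once and pinning it so the pages stay resident between requests:

```cpp
// Sketch: map a model file read-only and pin it in RAM so repeated runs
// reuse the already-resident pages instead of re-reading from disk.
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.bin\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    // map the file read-only; pages are backed by the kernel page cache
    void * addr = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }

    // pin the mapping so the kernel keeps the weights resident
    // (may require raising RLIMIT_MEMLOCK)
    if (mlock(addr, st.st_size) != 0) { perror("mlock"); }

    printf("mapped %lld bytes; pages stay resident until munmap/exit\n",
           (long long) st.st_size);

    // ... hand `addr` to the model loader / run inference here ...

    munmap(addr, st.st_size);
    close(fd);
    return 0;
}
```

Because the mapping is backed by the page cache, even separate processes that open the same model file can reuse pages that are already in memory, which is what helps the "reload on every run" case described above.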

jacmkno commented Aug 11, 2023

Inference is painfully slow on a CPU-only setup, and it seems to be because of this issue.

I'm using ctransformers, which I believe uses this library to run the models, and I found that the model is not fully loaded into memory during CPU-only inference. I then noticed this item under GGML's features in the README: "Zero memory allocations during runtime".

Same issue with another independent library using llama.cpp: https://github.com/abetlen/llama-cpp-python

It seems to me that this library is probably to blame.

Questions:

  1. Does this mean GGML indeed never loads the whole model into RAM? That seems like a waste for users with plenty of RAM.

  2. Is the library just doing lots of reads to disk during inference?

  3. Is there an easy option to let the library load the whole model into RAM and keep it there?

This thread makes it sound like the whole model is loaded into RAM and then unloaded after inference, but that does not seem to be the case, since I have a 3 GB model and RAM usage never goes above 1.2 GB.

Here is the Colab to test it (keep an eye on resources during inference):
https://colab.research.google.com/drive/1iGifBXEaXI2JDbJG1Il7BAS8gqVWm8kR
