Keeping the model loaded on RAM #30
to run it like a local service, you mean? yeah I'd love to be able to do this too...
"All" you have to do is modify the for loop here: https://github.com/ggerganov/ggml/blob/master/examples/gpt-j/main.cpp#L649. Instead of breaking when you hit the end-of-text token or the max token count, wait for user input.
@umhau that will just add extra input, right? I think the idea is (at least in my use case) to be able to start a new prompt without reloading the whole model.
@biemster In that case, this section gives the arguments for the gptj_eval function. It looks like you should be able to reset the n_past, embd_inp, and embd_w variables to their starting values when you're done with each prompt. Then it's a matter of modifying the for loop so it clears those vars when it's done and waits for your next round of input.
Did anyone do this? Running this as a service would be great. |
Inference is painfully slow on a CPU-only setup, and it seems to be because of this issue. I'm using ctransformers, which I believe uses this library to run the models, and I found that the model is not fully loaded into memory during CPU-only inference. Then I found this note among GGML's features in the README: "Zero memory allocations during runtime". Same issue with another independent library using llama.cpp: https://github.com/abetlen/llama-cpp-python. It seems to me this library is probably to blame. Questions:
This thread makes it sound like the whole model is being loaded into RAM and then unloaded after inference, but that does not seem to be the case, since I have a 3 GB model and RAM usage never goes above 1.2 GB. Here is the Colab to test it (keep an eye on resources during inference):
Is there a way to keep the model loaded in RAM between successive runs? I have an API-like setup, and every time a prompt comes in, the model has to be loaded into RAM again, which takes a while for GPT-J.
I'm using Python and basically just running the ./bin/gpt-j command via subprocess.Popen.