Concurrent requests using the CUDA backend #456
Comments
By concurrent inference do you mean calling …? I believe you can disable the KV cache offload to the GPU by just using …
That is correct: first the repeating layers, then the non-repeating layers, and then the V and K components of the KV cache get offloaded.
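For context, here is a rough sketch of that offload order. The thresholds and the `plan_offload` helper are assumptions for illustration only; they are not taken from the ggml source.

```cpp
#include <algorithm>
#include <cstdio>

// Illustrative only: offload the repeating layers first, then the
// non-repeating layers, then the V and finally the K half of the KV cache,
// as the offload budget grows past the number of repeating layers.
static void plan_offload(int n_gpu_layers, int n_layer) {
    int  repeating_on_gpu    = std::min(n_gpu_layers, n_layer);
    bool nonrepeating_on_gpu = n_gpu_layers > n_layer;      // norm / output tensors
    bool v_cache_on_gpu      = n_gpu_layers > n_layer + 1;
    bool k_cache_on_gpu      = n_gpu_layers > n_layer + 2;

    printf("repeating layers on GPU: %d/%d\n", repeating_on_gpu, n_layer);
    printf("non-repeating layers:    %s\n", nonrepeating_on_gpu ? "GPU" : "CPU");
    printf("V cache:                 %s\n", v_cache_on_gpu ? "GPU" : "CPU");
    printf("K cache:                 %s\n", k_cache_on_gpu ? "GPU" : "CPU");
}

int main() {
    plan_offload(35, 32);  // e.g. a 32-layer model with a budget of 35
    plan_offload(20, 32);  // partial offload: KV cache stays on the CPU
    return 0;
}
```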
If I set …
Setting …
Update: I am seeing the same error trying to execute 2 inference operations in the same process, but not when in different processes. Is there a per-process global memory space used by the CUDA backend? Thanks!
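If the backend does keep its device and scratch allocations in file-scope globals, that would explain exactly this split: two sessions in one process share one copy of the state, while two processes each get their own. Below is a minimal, CPU-only sketch of that failure mode; `g_scratch_buffer`, `get_scratch`, and the session names are illustrative assumptions, not the actual ggml code.

```cpp
#include <cstdio>
#include <cstdlib>

static void  *g_scratch_buffer = nullptr;  // one instance per process
static size_t g_scratch_size   = 0;

// Grows a single shared buffer on demand. Fine while one session uses it,
// dangerous when a second session resizes it while the first still holds
// pointers into the old allocation.
static void *get_scratch(size_t size) {
    if (size > g_scratch_size) {
        free(g_scratch_buffer);
        g_scratch_buffer = malloc(size);
        g_scratch_size   = size;
    }
    return g_scratch_buffer;
}

int main() {
    void *session_a = get_scratch(1u << 20);  // session A plans around this pointer
    void *session_b = get_scratch(8u << 20);  // session B grows the shared buffer...
    // ...and session A's pointer is now stale: same process, same globals.
    // Two separate processes would each have their own g_scratch_buffer.
    printf("session A scratch: %p\nsession B scratch: %p\n", session_a, session_b);
    free(g_scratch_buffer);
    return 0;
}
```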
First, thanks for the awesome work that went into making this project happen!
I am using https://github.com/rustformers/llm and attempting to run two inferencing sessions concurrently on the same hardware acceleration device.
Compiled for the Metal backend, this works correctly. However, compiled and run on a CUDA device, it errors out in the sampler logic:
@LLukas22 suggested:
The function in question is:
ggml/src/ggml-cuda.cu, line 5779 (at commit 95b559d), and ggml/src/ggml-cuda.cu, line 5707 (at commit 95b559d).
Has anyone been running concurrent inferencing sessions on the same CUDA device?
Is looking into the implementation of ggml_cuda_assign_buffers_no_scratch helpful for trying to figure out the issue here? Thanks!
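Not an answer to the root cause, but if the backend's state does turn out to be per-process, one stopgap is to serialize all GPU evaluation within the process. A minimal sketch follows, assuming a hypothetical evaluate_serialized wrapper around whatever call actually drives an inference step; none of these names are real ggml or rustformers/llm APIs.

```cpp
#include <cstdio>
#include <mutex>
#include <thread>

// Process-wide lock guarding the CUDA backend's shared (per-process) state.
static std::mutex g_cuda_eval_mutex;

// Hypothetical wrapper: every session funnels its GPU work through here, so
// two sessions in the same process never drive the backend concurrently.
template <typename GraphEval>
void evaluate_serialized(GraphEval &&evaluate_on_gpu) {
    std::lock_guard<std::mutex> lock(g_cuda_eval_mutex);
    evaluate_on_gpu();
}

int main() {
    // Stand-ins for two concurrent inference sessions in one process.
    std::thread a([] { evaluate_serialized([] { puts("session A step"); }); });
    std::thread b([] { evaluate_serialized([] { puts("session B step"); }); });
    a.join();
    b.join();
    return 0;
}
```

This trades away concurrency (the sessions take turns on the GPU), so it only helps if the goal is correctness rather than overlapping GPU work.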