
Concurrent requests using the CUDA backend #456

Open
radu-matei opened this issue Aug 16, 2023 · 4 comments

@radu-matei

First, thanks for the awesome work that went into making this project happen!

I am using https://github.com/rustformers/llm and attempting to run two inference sessions concurrently on the same hardware acceleration device.

When compiled for the Metal backend, this works correctly. However, when compiled and run on a CUDA device, it errors out in the sampler logic:

InvalidWeight', /root/.cargo/git/checkouts/rustformers-llm-a7ebb8f50571cb3b/a513077/crates/llm-base/src/samplers.rs:157:47
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
CUDA error 1 at llama-cpp/ggml-cuda.cu:3502: invalid argument
CUDA error 4 at llama-cpp/ggml-cuda.cu:3882: driver shutting down
Segmentation fault (core dumped)

@LLukas22 suggested:

> This could be caused by the k/v cache, which is also offloaded onto the GPU but is part of the session.
> There is a function ggml_cuda_assign_buffers_no_scratch which is used to move the cache onto the GPU. Maybe it assigns the same memory region twice?

The functions in question are:

void ggml_cuda_assign_buffers_no_scratch(struct ggml_tensor * tensor);

and

void ggml_cuda_assign_buffers_impl(struct ggml_tensor * tensor, bool scratch, bool force_inplace);
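For context, the no-scratch variant presumably just forwards to the shared implementation with scratch disabled, so the tensor gets its own dedicated VRAM allocation rather than a slot in the scratch buffer. A minimal sketch, assuming that structure (the actual body in ggml-cuda.cu may differ):

```c
// Sketch only -- assumed structure, not copied from ggml-cuda.cu.
// scratch == false would mean "allocate a dedicated buffer for this tensor"
// instead of placing it in the shared scratch buffer.
void ggml_cuda_assign_buffers_no_scratch(struct ggml_tensor * tensor) {
    ggml_cuda_assign_buffers_impl(tensor, /*scratch=*/ false, /*force_inplace=*/ false);
}
```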

Has anyone been running concurrent inference sessions on the same CUDA device?
Is looking into the implementation of ggml_cuda_assign_buffers_no_scratch a useful starting point for figuring out the issue here?

Thanks!

@ggerganov
Owner

By concurrent inference, do you mean calling `eval` sequentially for session 0, then session 1, then 0, 1, 0, 1, etc.?

I believe you can disable the KV cache offload to the GPU by just using --ngl equal to the exact number of layers in the model (e.g. 32 for 7B LLaMA). Pinging @JohannesGaessler in case I'm missing something.

@JohannesGaessler
Contributor

ggml_cuda_assign_buffers_no_scratch simply allocates new memory for the buffer instead of placing it in the scratch buffer. Calling it multiple times allocates disjoint memory regions. If I had to guess, the problem is that the VRAM scratch buffer is global, so concurrent inference sessions will overwrite each other's data.
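To illustrate that hypothesis (a hypothetical, simplified sketch, not the actual ggml-cuda.cu code): with a single process-wide scratch buffer, intermediate results written by one session can be clobbered by another session before they are read back, which would be consistent with the sampler upstream rejecting invalid values.

```c
// Hypothetical illustration only -- not the actual ggml-cuda.cu code.
// It shows why a single process-wide scratch buffer breaks when two
// inference sessions use it at the same time (or interleaved mid-graph).
#include <stdio.h>

#define SCRATCH_SIZE 1024

static unsigned char g_scratch[SCRATCH_SIZE]; // one buffer for the whole process

// Each session writes its intermediate results into the shared scratch buffer.
static void session_eval(int session_id) {
    for (int i = 0; i < SCRATCH_SIZE; ++i) {
        g_scratch[i] = (unsigned char) session_id;
    }
}

// Later reads assume the scratch buffer still holds this session's data.
static int session_data_intact(int session_id) {
    return g_scratch[0] == (unsigned char) session_id;
}

int main(void) {
    session_eval(1);  // session 1 writes its intermediates
    session_eval(2);  // session 2 reuses the same region and clobbers them
    printf("session 1 data intact: %s\n", session_data_intact(1) ? "yes" : "no");
    return 0;
}
```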

> I believe you can disable the KV cache offload to the GPU by just using --ngl equal to the exact number of layers in the model (e.g. 32 for 7B LLaMA).

That is correct: first the repeating layers, then the non-repeating layers, and then the V and K components of the KV cache get offloaded.
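To make that ordering concrete, here is a small sketch of the decision as described above; the exact thresholds and tensor names are assumptions and may differ from what llama.cpp actually does:

```c
// Hypothetical sketch of the offload order described above -- thresholds are
// assumptions, not copied from llama.cpp. For 7B LLaMA, n_layer == 32, so
// --ngl 32 offloads only the repeating layers and leaves the non-repeating
// tensors and the KV cache on the host.
#include <stdio.h>

static void plan_offload(int n_gpu_layers, int n_layer) {
    int repeating     = n_gpu_layers < n_layer ? n_gpu_layers : n_layer;
    int non_repeating = n_gpu_layers > n_layer;      // e.g. output norm, output
    int kv_v          = n_gpu_layers > n_layer + 1;  // V component of the KV cache
    int kv_k          = n_gpu_layers > n_layer + 2;  // K component of the KV cache

    printf("--ngl %d (n_layer = %d):\n", n_gpu_layers, n_layer);
    printf("  repeating layers on GPU: %d/%d\n", repeating, n_layer);
    printf("  non-repeating on GPU:    %s\n", non_repeating ? "yes" : "no");
    printf("  KV cache V on GPU:       %s\n", kv_v ? "yes" : "no");
    printf("  KV cache K on GPU:       %s\n", kv_k ? "yes" : "no");
}

int main(void) {
    plan_offload(32, 32); // --ngl 32 on 7B: KV cache stays on the host
    plan_offload(35, 32); // a higher --ngl would offload the KV cache as well
    return 0;
}
```

Under this scheme, --ngl 32 keeps the KV cache on the host for 7B (32 layers); note that 13B LLaMA has 40 layers, so --ngl 32 there would also leave some repeating layers on the CPU.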

@radu-matei
Author

radu-matei commented Aug 16, 2023

If I set --ngl 32 on the 7B model, I get:

CUBLAS error 15 at llama-cpp/ggml-cuda.cu:4768: the requested functionality is not supported
CUDA error 4 at llama-cpp/ggml-cuda.cu:4202: driver shutting down
Segmentation fault (core dumped)

Setting --ngl 32 on the 13B model does allow concurrent sessions to run, but about an order of magnitude slower (even when running a single inference).

@radu-matei
Author

Update: I am seeing the same error when trying to execute two inference operations in the same process, but not when they run in different processes.

Is there a per-process global memory space used by the CUDA backend?

Thanks!
