
Concurrent requests using the CUDA backend #456

Open
radu-matei opened this issue Aug 16, 2023 · 4 comments

@radu-matei

First, thanks for the awesome work that went into making this project happen!

I am using https://github.com/rustformers/llm and attempting to run two inference sessions concurrently on the same hardware acceleration device.

When compiled for the Metal backend, this works correctly. However, when compiled and run on a CUDA device, it errors out in the sampler logic:

InvalidWeight', /root/.cargo/git/checkouts/rustformers-llm-a7ebb8f50571cb3b/a513077/crates/llm-base/src/samplers.rs:157:47
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
CUDA error 1 at llama-cpp/ggml-cuda.cu:3502: invalid argument
CUDA error 4 at llama-cpp/ggml-cuda.cu:3882: driver shutting down
Segmentation fault (core dumped)

@LLukas22 suggested:

> This could be caused by the k/v cache, which is also offloaded onto the GPU but is part of the session.
> There is a function ggml_cuda_assign_buffers_no_scratch which is used to move the cache onto the GPU. Maybe it assigns the same memory region twice?

The functions in question are:

void ggml_cuda_assign_buffers_no_scratch(struct ggml_tensor * tensor);

and

void ggml_cuda_assign_buffers_impl(struct ggml_tensor * tensor, bool scratch, bool force_inplace);
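For context, the no-scratch variant presumably just forwards to the shared implementation with scratch disabled, so the tensor gets its own dedicated VRAM allocation rather than a slot in the scratch buffer. A minimal sketch, assuming that structure (the actual body in ggml-cuda.cu may differ):

```c
// Sketch only -- assumed structure, not copied from ggml-cuda.cu.
// scratch == false would mean "allocate a dedicated buffer for this tensor"
// instead of placing it in the shared scratch buffer.
void ggml_cuda_assign_buffers_no_scratch(struct ggml_tensor * tensor) {
    ggml_cuda_assign_buffers_impl(tensor, /*scratch=*/ false, /*force_inplace=*/ false);
}
```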

Has anyone been running concurrent inference sessions on the same CUDA device?
Is looking into the implementation of ggml_cuda_assign_buffers_no_scratch a useful starting point for figuring out the issue here?

Thanks!

@ggerganov
Owner

By concurrent inference, do you mean calling `eval` sequentially for session 0, then session 1, then 0, 1, 0, 1, etc.?

I believe you can disable the KV cache offload to the GPU by just using --ngl equal to the exact number of layers in the model (e.g. 32 for 7B LLaMA). Pinging @JohannesGaessler in case I'm missing something.

@JohannesGaessler
Contributor

ggml_cuda_assign_buffers_no_scratch simply allocates new memory for the buffer instead of placing it in the scratch buffer. Calling it multiple times allocates disjoint memory regions. If I had to guess, the problem is that the VRAM scratch buffer is global, so concurrent inference sessions will overwrite each other's data.
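To illustrate that hypothesis (a hypothetical, simplified sketch, not the actual ggml-cuda.cu code): with a single process-wide scratch buffer, intermediate results written by one session can be clobbered by another session before they are read back, which would be consistent with the sampler upstream rejecting invalid values.

```c
// Hypothetical illustration only -- not the actual ggml-cuda.cu code.
// It shows why a single process-wide scratch buffer breaks when two
// inference sessions use it at the same time (or interleaved mid-graph).
#include <stdio.h>

#define SCRATCH_SIZE 1024

static unsigned char g_scratch[SCRATCH_SIZE]; // one buffer for the whole process

// Each session writes its intermediate results into the shared scratch buffer.
static void session_eval(int session_id) {
    for (int i = 0; i < SCRATCH_SIZE; ++i) {
        g_scratch[i] = (unsigned char) session_id;
    }
}

// Later reads assume the scratch buffer still holds this session's data.
static int session_data_intact(int session_id) {
    return g_scratch[0] == (unsigned char) session_id;
}

int main(void) {
    session_eval(1);  // session 1 writes its intermediates
    session_eval(2);  // session 2 reuses the same region and clobbers them
    printf("session 1 data intact: %s\n", session_data_intact(1) ? "yes" : "no");
    return 0;
}
```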

> I believe you can disable the KV cache offload to the GPU by just using --ngl equal to the exact number of layers in the model (e.g. 32 for 7B LLaMA).

That is correct: first the repeating layers, then the non-repeating layers, and then the V and K components of the KV cache get offloaded.
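To make that ordering concrete, here is a small sketch of the decision as described above; the exact thresholds and tensor names are assumptions and may differ from what llama.cpp actually does:

```c
// Hypothetical sketch of the offload order described above -- thresholds are
// assumptions, not copied from llama.cpp. For 7B LLaMA, n_layer == 32, so
// --ngl 32 offloads only the repeating layers and leaves the non-repeating
// tensors and the KV cache on the host.
#include <stdio.h>

static void plan_offload(int n_gpu_layers, int n_layer) {
    int repeating     = n_gpu_layers < n_layer ? n_gpu_layers : n_layer;
    int non_repeating = n_gpu_layers > n_layer;      // e.g. output norm, output
    int kv_v          = n_gpu_layers > n_layer + 1;  // V component of the KV cache
    int kv_k          = n_gpu_layers > n_layer + 2;  // K component of the KV cache

    printf("--ngl %d (n_layer = %d):\n", n_gpu_layers, n_layer);
    printf("  repeating layers on GPU: %d/%d\n", repeating, n_layer);
    printf("  non-repeating on GPU:    %s\n", non_repeating ? "yes" : "no");
    printf("  KV cache V on GPU:       %s\n", kv_v ? "yes" : "no");
    printf("  KV cache K on GPU:       %s\n", kv_k ? "yes" : "no");
}

int main(void) {
    plan_offload(32, 32); // --ngl 32 on 7B: KV cache stays on the host
    plan_offload(35, 32); // a higher --ngl would offload the KV cache as well
    return 0;
}
```

Under this scheme, --ngl 32 keeps the KV cache on the host for 7B (32 layers); note that 13B LLaMA has 40 layers, so --ngl 32 there would also leave some repeating layers on the CPU.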

@radu-matei
Author

radu-matei commented Aug 16, 2023

If I set --ngl 32 on the 7B model, I get:

CUBLAS error 15 at llama-cpp/ggml-cuda.cu:4768: the requested functionality is not supported
CUDA error 4 at llama-cpp/ggml-cuda.cu:4202: driver shutting down
Segmentation fault (core dumped)

Setting --ngl 32 on the 13B model does allow concurrent sessions to run, but about an order of magnitude slower (even when running a single inference).

@radu-matei
Author

Update: I am seeing the same error when trying to execute two inference operations in the same process, but not when they run in different processes.

Is there a per-process global memory space used by the CUDA backend?

Thanks!
