ggml : add option for controlling work distribution across threads #291

Closed
ggerganov opened this issue Jun 25, 2023 · 1 comment · Fixed by ggerganov/llama.cpp#4761
Labels
performance Speed related topics refactoring Refactoring

Comments

@ggerganov
Owner

ggerganov commented Jun 25, 2023

See ggerganov/llama.cpp#1507

And comment: ggerganov/llama.cpp#1507 (comment)

I guess we can extend ggml so that the work chunk distribution method can be chosen, either at compile time or via a context parameter. We can factor the range selection out of the ggml forward implementations to make them more concise and easier to extend in the future.
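To make the idea concrete, here is a minimal sketch of what a factored-out range selection could look like. All names (`work_split`, `work_range`, `work_range_for`, `work_range_count`) are hypothetical and do not exist in ggml; the contiguous case is only meant to resemble the rounded-up per-thread row split that forward functions typically compute inline from `ith` and `nth`, and the interleaved case is one possible alternative, not a proposal from the linked discussion.

```c
#include <stdint.h>

// Hypothetical names for illustration only; none of these exist in ggml.
enum work_split {
    WORK_SPLIT_CONTIGUOUS,  // each thread gets one block of adjacent rows
    WORK_SPLIT_INTERLEAVED, // thread ith gets rows ith, ith + nth, ith + 2*nth, ...
};

struct work_range {
    int64_t first; // first row index handled by this thread
    int64_t last;  // one past the last row index (exclusive)
    int64_t step;  // stride between consecutive rows of this thread
};

// Compute the rows that thread `ith` out of `nth` should process, given `nr` rows.
static struct work_range work_range_for(enum work_split split, int64_t nr, int ith, int nth) {
    struct work_range r;
    if (split == WORK_SPLIT_INTERLEAVED) {
        r.first = ith;
        r.last  = nr;
        r.step  = nth;
    } else {
        const int64_t dr = (nr + nth - 1) / nth; // rows per thread, rounded up
        r.first = dr * ith;
        if (r.first > nr) {
            r.first = nr; // more threads than rows: this thread gets nothing
        }
        r.last = (r.first + dr < nr) ? r.first + dr : nr;
        r.step = 1;
    }
    return r;
}

// Number of rows covered by a range (0 if the range is empty).
static int64_t work_range_count(struct work_range r) {
    if (r.first >= r.last) {
        return 0;
    }
    return (r.last - r.first + r.step - 1) / r.step;
}
```

A forward function would then loop with `for (int64_t ir = r.first; ir < r.last; ir += r.step)`, regardless of which split was selected, so the selection could come from a compile-time constant or a context field without touching the kernels.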


Another thing to be investigated is the usage of sched_yield() and potentially making it user configurable:

ggerganov/whisper.cpp@09a6325

@IsaacDynamo

Making this configurable would also be nice for the cuBLAS backend. When the whole model fits on the GPU, increasing the number of threads doesn't improve eval speed (tokens/sec).

But it does increase the CPU load on the system due to the busy loop. Even with n_thread = 1, I suspect that a lot of CPU cycles are wasted in the busy loop.

So a yield flag would be a great addition to give the user control.

A busy-loop with a fallback to a yield might also be a good 'automatic' solution that could be used as the default.
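A rough sketch of the spin-then-yield idea, assuming C11 atomics and POSIX `sched_yield()`. The names `backoff`, `backoff_step`, and `wait_for_flag` are hypothetical and not part of ggml, and the spin limit is an arbitrary knob rather than a tuned value.

```c
#include <sched.h>
#include <stdatomic.h>

// Hypothetical helper types, not part of ggml.
struct backoff {
    int spins;      // busy-loop iterations performed so far
    int spin_limit; // after this many spins, start yielding instead
};

// One wait step: returns 0 if it only counted a spin,
// 1 if it gave up the CPU with sched_yield().
static int backoff_step(struct backoff *b) {
    if (b->spins < b->spin_limit) {
        b->spins++;
        return 0;
    }
    sched_yield();
    return 1;
}

// Block until *flag becomes nonzero. Spins first, then yields between checks.
// Returns how many times sched_yield() was called.
static int wait_for_flag(atomic_int *flag, int spin_limit) {
    struct backoff b = { 0, spin_limit };
    int yields = 0;
    while (atomic_load_explicit(flag, memory_order_acquire) == 0) {
        yields += backoff_step(&b);
    }
    return yields;
}
```

With `spin_limit = 0` this behaves like an always-yield mode, and a very large limit approximates the pure busy loop, so a single user-facing parameter could cover both the explicit flag and the 'automatic' default.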
