Tutorial manually specifies number of threads #10

Closed
wants to merge 2 commits into from

ArnoStrouwen
Contributor

In the tutorial, the number of threads is always specified manually. Some users may find it difficult to choose a suitable number for their GPU. Could an example be added to the documentation on how to automate this?

@maleadt
Member

maleadt commented Apr 13, 2020

Thanks!

Could an example be added to the documentation on how to automate this?

Sure, but I think it needs some more explanation about why (the number of threads is bounded and also affects performance) and how (pass a closure instead, use the occupancy API). You can also just pass config=configurator; see e.g. the uses in CuArrays.
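For readers wondering what automating this could look like, here is a minimal sketch using the occupancy API through CUDA.jl's `launch_configuration`. It assumes a recent CUDA.jl release (where `@cuda launch=false` compiles a kernel without launching it) and a working GPU. The kernel name `gpu_add!` and the array size `N` are illustrative choices, not taken from this thread.

```julia
using CUDA

# Illustrative kernel: element-wise add with a grid-stride loop, so that any
# combination of threads and blocks still covers the whole array.
function gpu_add!(y, x)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    for i in index:stride:length(y)
        @inbounds y[i] += x[i]
    end
    return nothing
end

N = 2^20                      # assumed problem size
x = CUDA.fill(1.0f0, N)
y = CUDA.fill(2.0f0, N)

# Compile the kernel without launching it, then ask the occupancy API
# for a suggested launch configuration.
kernel = @cuda launch=false gpu_add!(y, x)
config = launch_configuration(kernel.fun)

# Never use more threads than there are elements, and launch enough
# blocks to cover every element.
threads = min(N, config.threads)
blocks = cld(N, threads)

CUDA.@sync kernel(y, x; threads, blocks)
```

The thread limit comes from the compiled kernel itself, which is why the query happens after compilation rather than being hard-coded.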

@ArnoStrouwen
Contributor Author

Why do you not directly use the number of blocks suggested by launch_configuration? For the example in the tutorial, it gives me blocks = 16, which is about 10% faster for me than the blocks = 1024 I get from:

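    # `total_threads` is captured from the enclosing scope
    # (the number of work items to cover).
    function configurator(kernel)
        # Ask the occupancy API for a suggested launch configuration.
        config = launch_configuration(kernel.fun)

        # Cap the threads per block at the amount of work,
        # then launch enough blocks to cover all of it.
        threads = min(total_threads, config.threads)
        blocks = cld(total_threads, threads)

        return (threads=threads, blocks=blocks)
    end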

@maleadt maleadt added the documentation Improvements or additions to documentation label May 25, 2020
@maleadt maleadt force-pushed the master branch 11 times, most recently from d3147a2 to cf97309 Compare June 12, 2020 15:49
@maleadt
Member

maleadt commented Oct 4, 2021

Sorry, forgot about this. I've included a reworked version of your suggestion in 179498a.

And to answer your question: the block count returned by the occupancy API is the number of blocks required to fully saturate the GPU. Whether you should launch that many depends on the input size. It can be useful to adapt the launch configuration (e.g. to decide whether to use a loop inside your kernel) based on this value, but it is generally not possible to use it directly as the number of blocks to launch (unless your kernel is very generic).
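One possible reading of that advice, sketched in plain Julia with no GPU required: treat the occupancy block count as a cap, and only switch to a looping kernel when the input needs more blocks than that. The helper `choose_launch` and its `strided` flag are hypothetical names for illustration. The suggested values 1024 threads and 16 blocks are assumed inputs that stand in for what `launch_configuration` might return.

```julia
# Hypothetical helper: decide on a launch size given the input length `n` and
# the threads/blocks suggested by the occupancy API (passed in as plain integers).
function choose_launch(n::Integer, suggested_threads::Integer, suggested_blocks::Integer)
    # Never use more threads per block than there are elements.
    threads = min(n, suggested_threads)
    # Blocks needed if every thread handles exactly one element.
    needed = cld(n, threads)
    if needed <= suggested_blocks
        # Small input: one element per thread, no loop needed in the kernel.
        return (threads=threads, blocks=needed, strided=false)
    else
        # Large input: cap at the saturating block count; the kernel must then
        # loop over several elements per thread (e.g. a grid-stride loop).
        return (threads=threads, blocks=suggested_blocks, strided=true)
    end
end

small = choose_launch(1000, 1024, 16)
large = choose_launch(2^20, 1024, 16)
```

Launching only `suggested_blocks` blocks is correct only if the kernel actually contains such a loop, which is why the suggested value cannot be used blindly.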
