Find out common bottlenecks #1273
Comments (excerpts from the discussion):

> I think it could make sense to split GPU and CPU dispatches if you wanted to take the time to write out the adjoints and add …

> For data parallelism, we might want to just smack it dead center and have a tutorial in the docs titled "Multi-GPU on Clusters" that shows vmap, tmap, pmap, and then setting up multiple GPUs + pmap, all inside of gradients, with a link to ClusterManagers.jl. It should make it extremely obvious that Flux works with huge compute. Not necessarily a "bottleneck", but it's a common enough question that anyone who searches for it should easily find that page.

> We should look at doing 5-argument …

> It would be good to time …
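A minimal sketch of the pmap-based multi-GPU pattern such a tutorial could demonstrate, assuming CUDA.jl and Flux are installed and one GPU is available per worker; `shard_grad`, the model, and the data shards are hypothetical placeholders, not an existing Flux API:

```julia
# Hypothetical sketch of data parallelism with one worker process per GPU.
# On a cluster, ClusterManagers.jl could launch the workers instead of addprocs.
using Distributed
addprocs(2)                      # one local worker per GPU in this example

@everywhere using Flux, CUDA

# Pin each worker (ids 2, 3, ...) to a different GPU.
@everywhere workers() CUDA.device!((myid() - 2) % length(CUDA.devices()))

# Gradient of an assumed mean-squared-error loss for one shard of the batch.
@everywhere function shard_grad(model, x, y)
    m, xd, yd = gpu(model), gpu(x), gpu(y)
    return gradient(p -> Flux.mse(p(xd), yd), m)[1]
end

# Usage (placeholder model and data): compute one gradient per GPU with pmap,
# then combine the gradients on the main process before updating the model.
# model  = Dense(10, 1)
# shards = [(rand(Float32, 10, 32), rand(Float32, 1, 32)) for _ in 1:nworkers()]
# grads  = pmap(s -> shard_grad(model, s...), shards)
```

`pmap` spreads the shards across the worker processes, so each GPU computes its gradient concurrently; the tutorial would also need to cover averaging or otherwise synchronizing those gradients across workers.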
This issue (motivated by https://discourse.julialang.org/t/flux-vs-pytorch-cpu-performance/42667/25) is intended to be a high-level overview of the bottlenecks that show up in common models. This is a non-exhaustive list and will be expanded as more suggestions and use cases come along.
- `Base.tanh` is slower than some SIMD'd versions (for example, from SLEEFPirates) (replace Base.tanh with faster tanh #1272); see the timing sketch after this list.
- Add `@avx` to our activation functions to help with SIMD'ing.
- `softmax` via LoopVectorization.jl (use LoopVectorization to vectorize activation functions and softmax NNlib.jl#199).

cc @CarloLucibello @ChrisRackauckas @ViralBShah
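As a rough illustration of how the first two items could be timed (a sketch assuming LoopVectorization.jl and BenchmarkTools.jl are available; `base_tanh!` and `avx_tanh!` are just names invented for this comparison, and the numbers will vary by CPU):

```julia
using BenchmarkTools, LoopVectorization

# Plain broadcast of Base.tanh: one scalar call per element.
function base_tanh!(y, x)
    y .= tanh.(x)
    return y
end

# Same broadcast, but @avx lets LoopVectorization substitute its SIMD'd tanh.
function avx_tanh!(y, x)
    @avx y .= tanh.(x)
    return y
end

x = randn(Float32, 1024, 1024)
y = similar(x)

@btime base_tanh!($y, $x)
@btime avx_tanh!($y, $x)
```

The NNlib.jl issue linked above proposes applying LoopVectorization in a similar way to the other activation functions and to `softmax`.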