Find out common bottlenecks #1273
Comments (excerpts from the discussion):

> I think it could make sense to split GPU and CPU dispatches if you wanted to take the time to write out the adjoints and add …

> For data parallelism, we might want to just smack it dead center and have a tutorial in the docs titled "Multi-GPU on Clusters" that shows vmap, tmap, pmap, and then setting up multiple GPUs + pmap, all inside of gradients, with a link to ClusterManagers.jl. It should make it extremely obvious that Flux works with huge compute. Not necessarily a "bottleneck", but it's a common enough question that anyone who searches for it should easily find that page.

> We should look at doing 5-argument …

> It would be good to time …
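A minimal sketch of the pmap-based multi-GPU pattern such a tutorial could demonstrate, assuming CUDA.jl and Flux are installed and one GPU is available per worker; `shard_grad`, the model, and the data shards are hypothetical placeholders, not an existing Flux API:

```julia
# Hypothetical sketch of data parallelism with one worker process per GPU.
# On a cluster, ClusterManagers.jl could launch the workers instead of addprocs.
using Distributed
addprocs(2)                      # one local worker per GPU in this example

@everywhere using Flux, CUDA

# Pin each worker (ids 2, 3, ...) to a different GPU.
@everywhere workers() CUDA.device!((myid() - 2) % length(CUDA.devices()))

# Gradient of an assumed mean-squared-error loss for one shard of the batch.
@everywhere function shard_grad(model, x, y)
    m, xd, yd = gpu(model), gpu(x), gpu(y)
    return gradient(p -> Flux.mse(p(xd), yd), m)[1]
end

# Usage (placeholder model and data): compute one gradient per GPU with pmap,
# then combine the gradients on the main process before updating the model.
# model  = Dense(10, 1)
# shards = [(rand(Float32, 10, 32), rand(Float32, 1, 32)) for _ in 1:nworkers()]
# grads  = pmap(s -> shard_grad(model, s...), shards)
```

`pmap` spreads the shards across the worker processes, so each GPU computes its gradient concurrently; the tutorial would also need to cover averaging or otherwise synchronizing those gradients across workers.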
This issue (motivated by https://discourse.julialang.org/t/flux-vs-pytorch-cpu-performance/42667/25) is intended to be a high-level overview of the bottlenecks that show up in common models. This is a non-exhaustive list and will be expanded as more suggestions and use cases come along.
- `Base.tanh` is slower than some SIMD'd versions (for example, from SLEEFPirates) (replace Base.tanh with faster tanh #1272); see the timing sketch after this list.
- Add `@avx` to our activation functions to help with SIMD'ing.
- `softmax` via LoopVectorization.jl (use LoopVectorization to vectorize activation functions and softmax NNlib.jl#199).

cc @CarloLucibello @ChrisRackauckas @ViralBShah
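As a rough illustration of how the first two items could be timed (a sketch assuming LoopVectorization.jl and BenchmarkTools.jl are available; `base_tanh!` and `avx_tanh!` are just names invented for this comparison, and the numbers will vary by CPU):

```julia
using BenchmarkTools, LoopVectorization

# Plain broadcast of Base.tanh: one scalar call per element.
function base_tanh!(y, x)
    y .= tanh.(x)
    return y
end

# Same broadcast, but @avx lets LoopVectorization substitute its SIMD'd tanh.
function avx_tanh!(y, x)
    @avx y .= tanh.(x)
    return y
end

x = randn(Float32, 1024, 1024)
y = similar(x)

@btime base_tanh!($y, $x)
@btime avx_tanh!($y, $x)
```

The NNlib.jl issue linked above proposes applying LoopVectorization in a similar way to the other activation functions and to `softmax`.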