
Matrix-Matrix multiply is quite slow #187

Open · vchuravy opened this issue Oct 5, 2018 · 17 comments
@vchuravy (Member) commented Oct 5, 2018

While looking with @YingboMa into solving a PDE just by using DArray, we found that matrix-matrix multiply is quite slow for DArray.

From discussion with @andreasnoack:

  1. Distributed currently has no fetch!, i.e. an in-place fetch into a local array, so it is hard to avoid temporaries when working across processes. This causes many copies and GC work, which creates communication bottlenecks. (A sketch of such a fetch! follows this list.)

  2. Our communication layer doesn't support RDMA, so there are copies happening in the network stack, and we use sockets instead of shared memory for communication on the same node.

  3. There are some communication bottlenecks due to how we use the event loop: it is possible to get into a situation where forward progress is hard to make because a machine is busy with computation and does not communicate in a timely fashion.
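
A minimal sketch of the fetch! mentioned in point 1. The name and signature are hypothetical (no such function exists in Distributed today), and this naive version still allocates on the calling side, where a real implementation would deserialize directly into the destination buffer:

using Distributed
addprocs(1)

# Hypothetical in-place fetch: copy a remote result into a preallocated
# buffer instead of returning a fresh array each time.
function fetch!(dest::AbstractArray, r)
    copyto!(dest, fetch(r))  # fetch still allocates a temporary here
    return dest
end

# Usage: reuse buf across iterations instead of allocating on every fetch.
buf = Matrix{Float64}(undef, 128, 128)
r = remotecall(rand, workers()[1], 128, 128)
fetch!(buf, r)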

using Distributed
addprocs(4)

using LinearAlgebra

# Set worker BLAS to one thread only
@sync for p in workers()
    @async remotecall_wait(LinearAlgebra.BLAS.set_num_threads, p, 1)
end


using BenchmarkTools
using DistributedArrays

const suite = BenchmarkGroup()
suite["Array"] = BenchmarkGroup()
suite["distribute"] = BenchmarkGroup()

function benchmark(T=Array, N=10)
    @benchmarkable A * B setup=(A = $T(rand($N, $N)); B = $T(rand($N, $N)))
end

for N in (2^i for i = 5:13)
    suite["Array"][N] = benchmark(Array, N)
    suite["distribute"][N] = benchmark(distribute, N)
end

tune!(suite)
results = run(suite)

I would be interested in gathering numbers from different systems here. My first set of results is from my local laptop with 2 cores / 4 threads, using 4 Julia processes.

There is a lot of overhead for smallish problems, but the results aren't that bad once we get to interesting problem sizes...

julia> for (name, trial) in sort(collect(results["Array"]), by=x->time(x[2]))
           t = time(trial) / 1e6
           println(rpad(name, 25, "."), lpad(string(round(t, digits=2), " ms"), 20, "."))
       end
32.....................................0.0 ms
64....................................0.02 ms
128...................................0.06 ms
256...................................0.41 ms
512...................................3.51 ms
1024.................................24.51 ms
2048................................249.21 ms
4096...............................2226.76 ms
8192..............................18990.07 ms

julia> for (name, trial) in sort(collect(results["distribute"]), by=x->time(x[2]))
           t = time(trial) / 1e6
           println(rpad(name, 25, "."), lpad(string(round(t, digits=2), " ms"), 20, "."))
       end
32....................................2.01 ms
64....................................2.06 ms
128...................................2.32 ms
256...................................2.97 ms
512...................................6.63 ms
1024..................................34.2 ms
2048................................261.15 ms
4096...............................2295.89 ms
8192..............................17112.45 ms
@montyvesselinov

julia> versioninfo()
Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

julia> for (name, trial) in sort(collect(results["Array"]), by=x->time(x[2]))
                  t = time(trial) / 1e6
                  println(rpad(name, 25, "."), lpad(string(round(t, digits=2), " ms"), 20, "."))
              end
32.....................................0.0 ms
64....................................0.02 ms
128...................................0.04 ms
256...................................0.19 ms
512....................................1.4 ms
1024...................................9.3 ms
2048.................................71.92 ms
4096................................600.13 ms
8192...............................5114.26 ms

julia> for (name, trial) in sort(collect(results["distribute"]), by=x->time(x[2]))
                  t = time(trial) / 1e6
                  println(rpad(name, 25, "."), lpad(string(round(t, digits=2), " ms"), 20, "."))
              end
32....................................1.46 ms
64....................................1.48 ms
128...................................1.73 ms
256...................................2.36 ms
512...................................5.35 ms
1024.................................24.76 ms
2048................................134.15 ms
4096...............................1304.26 ms
8192...............................9542.33 ms

@ViralBShah (Member) commented May 19, 2020

Does Elemental provide a matmul? Does someone know if we use it in Elemental.jl? Should we remove the slow matmul we have in here, and let other packages provide it for now?

@ViralBShah (Member)

This package by @haampie might come in handy:

https://github.com/haampie/COSMA.jl
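
For illustration, based on @haampie's description later in this thread (COSMA.jl overrides mul! for DArrays with BLAS-float elements), usage would look roughly like this; the worker setup is an assumption, since COSMA needs MPI-capable workers:

using Distributed, DistributedArrays
addprocs(4)                # assumption: COSMA may need MPI-managed workers
@everywhere using COSMA    # loading it adds specialized mul! methods for DArray

A = drandn(2048, 2048)     # distributed random matrices
B = drandn(2048, 2048)
C = A * B                  # dispatches to COSMA's distributed GEMM via mul!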

@andreasnoack (Member)

> Should we remove the slow matmul we have in here, and let other packages provide it for now?

It comes at a price though since it effectively makes the package dependent on MPI. It will also restrict the matmul to a few element types.
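
To make that trade-off concrete: the generic matmul works for arbitrary element types, which a BLAS-backed matmul would not. A small sketch, assuming the generic DArray matmul accepts any numeric element type:

using Distributed
addprocs(4)
@everywhere using DistributedArrays

# Exact rational arithmetic: works with the generic matmul, but a
# BLAS-backed path would only cover Float32/Float64 and their complex
# counterparts.
A = distribute(rand(1:10, 64, 64) .// rand(1:10, 64, 64))
B = distribute(rand(1:10, 64, 64) .// rand(1:10, 64, 64))
C = A * B   # exact Rational{Int} result, no BLAS involved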

@ViralBShah (Member)

Can we use Elemental's matmul today (as in, is it already wrapped), or do we still need to do the wrapping?

The MPI dependency is practically unavoidable. I'd say matmul already falls into the linear algebra camp, and at that point avoiding MPI is not straightforward. Also, we now have MPI and all the MPI libraries in BinaryBuilder, and it is all much easier to get working out of the box.

@vchuravy (Member, Author)

I don't think we should add an MPI dependency to DistributedArrays. It would make the package much harder to untangle, and it is not necessary from a technical point of view. From a practical standpoint, Elemental gives a better user experience right now, but I don't want us to be pegged to MPI long-term.

@ViralBShah (Member) commented May 20, 2020

Not suggesting adding MPI to DistributedArrays. I'm suggesting that if you want fast matmul, you should just use one of the MPI packages. Maybe we should document this, since there are also some benefits to the generic matmul we have.

@andreasnoack (Member)

It's wrapped here. I think it also ends up being used in https://github.com/JuliaParallel/Elemental.jl#truncated-svd, although most of the work there is GEMV, not GEMM.

@ViralBShah (Member)

Can it override * for DArrays? Will people be up in arms about type piracy?

@andreasnoack (Member)

I think it would be okay. We do similar things already in https://github.com/JuliaParallel/Elemental.jl/src/julia/darray.jl. However, the code for copying data to and from the Elemental arrays is not robust and needs some work.

@haampie (Contributor) commented May 20, 2020

I'm overriding mul! for T<:BlasFloat in COSMA.jl. I wouldn't say it's type piracy when it only adds a specialized method.
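
The shape of such a method, as a naive sketch (not COSMA.jl's actual implementation, which dispatches into the COSMA library; the gather-and-scatter body here is only a placeholder):

using Distributed, LinearAlgebra
@everywhere using DistributedArrays

# Both DArray and mul! are owned by other packages, which is why the thread
# debates whether adding this method counts as type piracy.
function LinearAlgebra.mul!(C::DArray{T,2}, A::DArray{T,2},
                            B::DArray{T,2}) where {T<:LinearAlgebra.BlasFloat}
    R = Array(A) * Array(B)          # placeholder: gather and use local BLAS
    @sync for p in procs(C)
        @async remotecall_wait(p, C, R) do c, r
            # each worker writes its own chunk of the result
            localpart(c) .= view(r, localindices(c)...)
        end
    end
    return C
end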

@andreasnoack (Member)

> I wouldn't say it's type piracy when it only adds a specialized method

I think most people would consider that type piracy anyway, but also that it's one of the cases where type piracy makes sense.

@ViralBShah (Member)

I agree. Let's do what it takes to make the user experience pleasant. This may be the first time in a long while that all the parallel packages are coming together.

@haampie (Contributor) commented May 21, 2020

Yes. The last thing to sort out is how to swap out libmpi for jll packages. Neither Elemental nor COSMA currently works with a system MPI implementation. I haven't had time yet to see if the MPI.jl trick can be repeated in those packages. It probably has to happen in Elemental_jll.jl and COSMA_jll.jl, but the situation might be slightly different: in MPI.jl, Julia searches for a libmpi, whereas in those jll packages the shared libs themselves depend on libmpi, and I don't know (yet) how the search paths can be changed in that case.

Ok, 30 seconds of research later: I would think we would need to change the search paths in these generated files: https://github.com/JuliaBinaryWrappers/Elemental_jll.jl/blob/master/src/wrappers/x86_64-apple-darwin14-cxx11.jl
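
For reference, the MPI.jl trick mentioned above: MPI.jl of this era can be rebuilt against the system library via an environment variable (name taken from MPI.jl's configuration docs; double-check it for your version):

# Rebuild MPI.jl against the system libmpi instead of the bundled binary.
ENV["JULIA_MPI_BINARY"] = "system"
using Pkg
Pkg.build("MPI"; verbose=true)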

@ViralBShah (Member)

The libmpi builds are not interchangeable, so if you move away from what we do in BB, you have to do a source build. Search paths may not be sufficient.

@haampie (Contributor) commented May 21, 2020

Yeah, I'm aware of that. But on Cray clusters MPICH is interchangeable with the system library, and that would be my use case. So even when Julia provides a virtual MPI package where the user can pick their favorite implementation, and BB supports all this, there still has to be a way to use a system MPI build to get optimal performance.
