
Matrix-Matrix multiply is quite slow #187

Open · vchuravy opened this issue Oct 5, 2018 · 17 comments
@vchuravy (Member) commented Oct 5, 2018

While looking with @YingboMa into solving a PDE just by using DArray, we found that matrix-matrix multiply is quite slow for DArray.

From discussion with @andreasnoack:

  1. Distributed currently has no fetch!, i.e. an in-place fetch into a local array, so it is hard to avoid temporaries when working across processes. This causes many copies and GC work, which creates communication bottlenecks. (A sketch of such a fetch! follows this list.)

  2. Our communication layer doesn't support RDMA, so there are copies happening in the network stack, and we use sockets instead of shared memory for communication on the same node.

  3. There are some communication bottlenecks due to how we use the event loop: it is possible to get into a situation where forward progress is hard to make because a machine is busy with computation and does not communicate in a timely fashion.
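
A minimal sketch of the fetch! mentioned in point 1. The name and signature are hypothetical (no such function exists in Distributed today), and this naive version still allocates on the calling side, where a real implementation would deserialize directly into the destination buffer:

using Distributed
addprocs(1)

# Hypothetical in-place fetch: copy a remote result into a preallocated
# buffer instead of returning a fresh array each time.
function fetch!(dest::AbstractArray, r)
    copyto!(dest, fetch(r))  # fetch still allocates a temporary here
    return dest
end

# Usage: reuse buf across iterations instead of allocating on every fetch.
buf = Matrix{Float64}(undef, 128, 128)
r = remotecall(rand, workers()[1], 128, 128)
fetch!(buf, r)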

using Distributed
addprocs(4)

using LinearAlgebra

# Set worker BLAS to one thread only
@sync for p in workers()
    @async remotecall_wait(LinearAlgebra.BLAS.set_num_threads, p, 1)
end


using BenchmarkTools
using DistributedArrays

const suite = BenchmarkGroup()
suite["Array"] = BenchmarkGroup()
suite["distribute"] = BenchmarkGroup()

function benchmark(T=Array, N=10)
    @benchmarkable A * B setup=(A = $T(rand($N, $N)); B = $T(rand($N, $N)))
end

for N in (2^i for i = 5:13)
    suite["Array"][N] = benchmark(Array, N)
    suite["distribute"][N] = benchmark(distribute, N)
end

tune!(suite)
results = run(suite)

I would be interested in gathering numbers from different systems here. My first set of results is from my local laptop with 2 cores / 4 threads, using 4 Julia processes.

There is a lot of overhead for smallish problems, but the results aren't that bad once we get to interesting problem sizes...

julia> for (name, trial) in sort(collect(results["Array"]), by=x->time(x[2]))
           t = time(trial) / 1e6
           println(rpad(name, 25, "."), lpad(string(round(t, digits=2), " ms"), 20, "."))
       end
32.....................................0.0 ms
64....................................0.02 ms
128...................................0.06 ms
256...................................0.41 ms
512...................................3.51 ms
1024.................................24.51 ms
2048................................249.21 ms
4096...............................2226.76 ms
8192..............................18990.07 ms

julia> for (name, trial) in sort(collect(results["distribute"]), by=x->time(x[2]))
           t = time(trial) / 1e6
           println(rpad(name, 25, "."), lpad(string(round(t, digits=2), " ms"), 20, "."))
       end
32....................................2.01 ms
64....................................2.06 ms
128...................................2.32 ms
256...................................2.97 ms
512...................................6.63 ms
1024..................................34.2 ms
2048................................261.15 ms
4096...............................2295.89 ms
8192..............................17112.45 ms
@montyvesselinov

julia> versioninfo()
Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

julia> for (name, trial) in sort(collect(results["Array"]), by=x->time(x[2]))
                  t = time(trial) / 1e6
                  println(rpad(name, 25, "."), lpad(string(round(t, digits=2), " ms"), 20, "."))
              end
32.....................................0.0 ms
64....................................0.02 ms
128...................................0.04 ms
256...................................0.19 ms
512....................................1.4 ms
1024...................................9.3 ms
2048.................................71.92 ms
4096................................600.13 ms
8192...............................5114.26 ms

julia> for (name, trial) in sort(collect(results["distribute"]), by=x->time(x[2]))
                  t = time(trial) / 1e6
                  println(rpad(name, 25, "."), lpad(string(round(t, digits=2), " ms"), 20, "."))
              end
32....................................1.46 ms
64....................................1.48 ms
128...................................1.73 ms
256...................................2.36 ms
512...................................5.35 ms
1024.................................24.76 ms
2048................................134.15 ms
4096...............................1304.26 ms
8192...............................9542.33 ms

@ViralBShah (Member) commented May 19, 2020

Does Elemental provide a matmul? Does someone know if we use it in Elemental.jl? Should we remove the slow matmul we have in here, and let other packages provide it for now?

@ViralBShah (Member)

This package by @haampie might come in handy:

https://github.com/haampie/COSMA.jl
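
For illustration, based on @haampie's description later in this thread (COSMA.jl overrides mul! for DArrays with BLAS-float elements), usage would look roughly like this; the worker setup is an assumption, since COSMA needs MPI-capable workers:

using Distributed, DistributedArrays
addprocs(4)                # assumption: COSMA may need MPI-managed workers
@everywhere using COSMA    # loading it adds specialized mul! methods for DArray

A = drandn(2048, 2048)     # distributed random matrices
B = drandn(2048, 2048)
C = A * B                  # dispatches to COSMA's distributed GEMM via mul!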

@andreasnoack (Member)

> Should we remove the slow matmul we have in here, and let other packages provide it for now?

It comes at a price though since it effectively makes the package dependent on MPI. It will also restrict the matmul to a few element types.
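
To make that trade-off concrete: the generic matmul works for arbitrary element types, which a BLAS-backed matmul would not. A small sketch, assuming the generic DArray matmul accepts any numeric element type:

using Distributed
addprocs(4)
@everywhere using DistributedArrays

# Exact rational arithmetic: works with the generic matmul, but a
# BLAS-backed path would only cover Float32/Float64 and their complex
# counterparts.
A = distribute(rand(1:10, 64, 64) .// rand(1:10, 64, 64))
B = distribute(rand(1:10, 64, 64) .// rand(1:10, 64, 64))
C = A * B   # exact Rational{Int} result, no BLAS involved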

@ViralBShah (Member)

Can we use Elemental's matmul today (as in, is it already wrapped), or do we still need to do the wrapping?

The MPI dependency is practically unavoidable. I'd say matmul already falls into the linear algebra camp, and at that point avoiding MPI is not straightforward. Also, we now have MPI and all the MPI libraries in BinaryBuilder, and it is all much easier to get working out of the box.

@vchuravy (Member, Author)

I don't think we should add an MPI dependency to DistributedArrays. It would make the package much harder to untangle, and it is not necessary from a technical point of view. From a practical standpoint, Elemental gives a better user experience right now, but I don't want us to be pegged to MPI long-term.

@ViralBShah (Member) commented May 20, 2020

Not suggesting adding MPI to DistributedArrays. I'm suggesting that if you want fast matmul, you should just use one of the MPI packages. Maybe we should document this, since there are also some benefits to the generic matmul we have.

@andreasnoack (Member)

It's wrapped here. I think it also ends up being used in https://github.com/JuliaParallel/Elemental.jl#truncated-svd, although most of the work there is GEMV, not GEMM.

@ViralBShah (Member)

Can it override * for DArrays? Will people be up in arms about type piracy?

@andreasnoack (Member)

I think it would be okay. We do similar things already in https://github.com/JuliaParallel/Elemental.jl/src/julia/darray.jl. However, the code for copying data to and from the Elemental arrays is not robust and needs some work.

@haampie (Contributor) commented May 20, 2020

I'm overriding mul! for T<:BlasFloat in COSMA.jl. I wouldn't say it's type piracy when it only adds a specialized method.
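
The shape of such a method, as a naive sketch (not COSMA.jl's actual implementation, which dispatches into the COSMA library; the gather-and-scatter body here is only a placeholder):

using Distributed, LinearAlgebra
@everywhere using DistributedArrays

# Both DArray and mul! are owned by other packages, which is why the thread
# debates whether adding this method counts as type piracy.
function LinearAlgebra.mul!(C::DArray{T,2}, A::DArray{T,2},
                            B::DArray{T,2}) where {T<:LinearAlgebra.BlasFloat}
    R = Array(A) * Array(B)          # placeholder: gather and use local BLAS
    @sync for p in procs(C)
        @async remotecall_wait(p, C, R) do c, r
            # each worker writes its own chunk of the result
            localpart(c) .= view(r, localindices(c)...)
        end
    end
    return C
end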

@andreasnoack (Member)

> I wouldn't say it's type piracy when it only adds a specialized method

I think most people would consider that type piracy anyway, but also that it's one of the cases where type piracy makes sense.

@ViralBShah (Member)

I agree. Let's do what it takes to make the user experience pleasant. This may be the first time in a long while that all the parallel packages are coming together.

@haampie (Contributor) commented May 21, 2020

Yes. The last thing to sort out is how to swap out libmpi for jll packages. Neither Elemental nor COSMA currently works with a system MPI implementation. I haven't had time yet to see if the MPI.jl trick can be repeated in those packages. It probably has to happen in Elemental_jll.jl and COSMA_jll.jl, but the situation might be slightly different: in MPI.jl, Julia searches for a libmpi, whereas in those jll packages the shared libs themselves depend on libmpi, and I don't know (yet) how the search paths can be changed in that case.

Ok, 30 seconds of research later: I would think we would need to change the search paths in these generated files: https://github.com/JuliaBinaryWrappers/Elemental_jll.jl/blob/master/src/wrappers/x86_64-apple-darwin14-cxx11.jl
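
For reference, the MPI.jl trick mentioned above: MPI.jl of this era can be rebuilt against the system library via an environment variable (name taken from MPI.jl's configuration docs; double-check it for your version):

# Rebuild MPI.jl against the system libmpi instead of the bundled binary.
ENV["JULIA_MPI_BINARY"] = "system"
using Pkg
Pkg.build("MPI"; verbose=true)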

@ViralBShah (Member)

The libmpi builds are not interchangeable, so if you move away from what we do in BB, you have to do a source build. Search paths may not be sufficient.

@haampie (Contributor) commented May 21, 2020

Yeah, I'm aware of that. But on Cray clusters MPICH is interchangeable with the system library, and that would be my use case. So even when Julia provides a virtual MPI package where the user can pick their favorite implementation, and BB supports all this, there still has to be a way to use a system MPI build to get optimal performance.
