
Allocations when multi-threading #240
Open · Lightup1 opened this issue May 15, 2022 · 5 comments

@Lightup1 commented May 15, 2022

When multithreading is activated via FFTW.set_num_threads(), the transform becomes slower and allocates more.
The provider here is the default "fftw".
But when using "mkl", it uses multithreading automatically depending on the problem size, and the allocation count is 0.
Here are some benchmarks:

julia> data1 = rand(ComplexF64, 2^12);

julia> p1 = plan_fft!(data1)
FFTW in-place forward plan for 4096-element array of ComplexF64
(dft-ct-dit/32
  (dftw-direct-32/8 "t3fv_32_avx2_128")
  (dft-directbuf/130-128-x32 "n1fv_128_avx2"))

julia> @benchmark $p1*$data1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  21.800 μs … 68.700 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     23.100 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.189 μs ±  1.335 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▁  ▁█
  ▂▂▂▄▄▅▅█▆███▆▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂ ▃
  21.8 μs         Histogram: frequency by time        29.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> FFTW.set_num_threads(6)

julia> p1=plan_fft!(data1)
FFTW in-place forward plan for 4096-element array of ComplexF64
(dft-thr-ct-dit-x6/32
  (dftw-direct-32/8 "t3fv_32_avx2_128")
  (dftw-direct-32/8 "t3fv_32_avx2_128")
  (dftw-direct-32/8 "t3fv_32_avx2_128")
  (dftw-direct-32/8 "t3fv_32_avx2_128")
  (dftw-direct-32/8 "t3fv_32_avx2_128")
  (dftw-direct-32/8 "t3fv_32_avx2_128")
  (dft-directbuf/130-128-x32 "n1fv_128_avx2"))

julia> @benchmark $p1*$data1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  23.900 μs … 300.000 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     48.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   50.904 μs ±  12.291 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                  ▁▁▂▄▆█▆▄▂ ▁ ▂▁▁
  ▂▁▂▂▂▂▂▂▂▃▃▃▃▅▇▇███████████████▇▆▅▅▅▅▅▅▅▄▄▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▂▂▂ ▄
  23.9 μs         Histogram: frequency by time         83.5 μs <

 Memory estimate: 4.41 KiB, allocs estimate: 53.

And the version info:

julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: AMD Ryzen 5 2600X Six-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, znver1)
Environment:
  JULIA_NUM_THREADS = 6
  JULIA_PKG_SERVER = https://mirrors.tuna.tsinghua.edu.cn/julia
  JULIA_EDITOR = code
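
For reference, FFTW.jl selects its backend through a package preference. A minimal sketch of switching providers (FFTW.set_provider! is available in recent FFTW.jl releases; the change is persisted via Preferences.jl and takes effect only after restarting Julia):

julia> using FFTW

julia> FFTW.set_provider!("mkl")  # persisted preference; restart Julia for it to apply

julia> FFTW.fftw_provider         # after restart, reports the active backend
"mkl"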
@Lightup1 Lightup1 changed the title from "Allocs in thread" to "Allocations when multi-threading" on May 15, 2022
@N5N3 (Contributor) commented May 18, 2022

  1. The allocation comes from Julia's threading system. (The FFTW backend uses Julia's threading system, while MKL does not.)
  2. IIRC, for a single in-place 1-D FFT, FFTW's documentation says multi-threading might not help.
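
To see point 1 directly, here is a minimal sketch (not from the thread): every Threads.@spawn creates and schedules a Task object, which allocates, so a spawn-per-chunk loop like FFTW.jl's spawnloop can never be allocation-free:

julia> using BenchmarkTools

julia> spawn_n(n) = @sync for _ in 1:n   # hypothetical helper, mirrors spawnloop's structure
           Threads.@spawn nothing
       end

julia> @btime spawn_n(6);                # expect several allocations: one Task (plus scheduling state) per @spawn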

@Lightup1 (Author) commented

Thanks for the information. Is there a way to change the threading system (not the provider) to make it allocation-free?

@N5N3 (Contributor) commented May 21, 2022

Julia has https://github.com/JuliaSIMD/Polyester.jl, which should be able to replace

FFTW.jl/src/providers.jl, lines 56 to 60 at 17bc81a:

function spawnloop(f::Ptr{Cvoid}, fdata::Ptr{Cvoid}, elsize::Csize_t, num::Cint, callback_data::Ptr{Cvoid})
    @sync for i = 0:num-1
        Threads.@spawn ccall(f, Ptr{Cvoid}, (Ptr{Cvoid},), fdata + elsize*i)
    end
end

with

using Polyester  # provides @batch

function spawnloop(f::Ptr{Cvoid}, fdata::Ptr{Cvoid}, elsize::Csize_t, num::Cint, callback_data::Ptr{Cvoid})
    # @batch reuses Polyester's preallocated task pool instead of spawning a new Task per chunk
    @batch per=core for i = 0:num-1
        ccall(f, Ptr{Cvoid}, (Ptr{Cvoid},), fdata + elsize*i)
    end
end

(I didn't test it, so don't expect this to work!)

But I'm afraid this won't help performance much; the FFT itself is much heavier than the allocation.
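
For context on why @batch could help with the allocations: Polyester keeps a pool of tasks alive and reuses them instead of creating a new Task per iteration chunk. A minimal sketch (not from the thread; exact counts depend on Polyester and Julia versions):

julia> using Polyester, BenchmarkTools

julia> double!(x) = @batch for i in eachindex(x)   # hypothetical example loop
           x[i] *= 2
       end

julia> x = rand(1024);

julia> @btime double!($x);                         # typically 0 allocations: the task pool is reused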

@Lightup1 (Author) commented

Thank you very much! I'll test it later.

@roflmaostc commented

I can confirm this issue; I hadn't noticed it in the past.

julia> using FFTW, BenchmarkTools

julia> x = randn((1024, 1024));

julia> FFTW.set_num_threads(8)

julia> @btime ifft($x);
  7.646 ms (17810 allocations: 33.20 MiB)

julia> FFTW.set_num_threads(4)

julia> @btime ifft($x);
  6.362 ms (115 allocations: 32.01 MiB)

julia> FFTW.set_num_threads(6)

julia> @btime ifft($x);
  10.943 ms (38485 allocations: 34.27 MiB)
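
Note that most of the ~32 MiB above is ifft itself allocating: it converts the real input to complex and allocates a fresh output array on every call, independent of the threading overhead. A minimal sketch (not from the thread) that factors this out by planning once and writing into a preallocated output, using the mul! method AbstractFFTs defines for plans:

julia> using FFTW, BenchmarkTools, LinearAlgebra

julia> xc = ComplexF64.(randn(1024, 1024));   # convert to complex once, up front

julia> p = plan_ifft(xc);                     # plan once, reuse

julia> y = similar(xc);                       # preallocated output

julia> @btime mul!($y, $p, $xc);              # remaining allocations come from the task-spawning loop discussed above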
