Polyester.jl provides low-overhead multithreading in Julia. The primary API is @batch
, which can be used to parallelize for-loops (similar to @threads
).
Polyester implements static scheduling (c.f. @threads :static
) and has minimal overhead because it manages and re-uses a dedicated set of Julia tasks. This can lead to (great) speedups compared to other multithreading variants (see Benchmark below).
using Polyester
function axpy_polyester!(y, a, x)
@batch for i in eachindex(y,x)
y[i] = a * x[i] + y[i]
end
end
a = 3.141
x = rand(10_000)
y = rand(10_000)
axpy_polyester!(y, a, x)
-
Polyester.@batch
moves arrays to threads by turning them into StrideArraysCore.PtrArrays. This means that under an@batch
slices will createview
s by default(!). You may want to start Julia with--check-bounds=yes
while debugging. -
Polyester uses the regular Julia threads. The total number of threads is still governed by
--threads
orJULIA_NUM_THREADS
(check withThreads.nthreads()
). -
Polyester does not pin Julia threads to CPU-cores/threads. You can control how many "Polyester tasks" you want to use (see below). But to ensure that these tasks are running on specific CPU-cores/threads, you need to use a tool like ThreadPinning.jl.
Let's consider a basic axpy kernel.
using Polyester: @batch
using Base.Threads: @threads
using LinearAlgebra
using BenchmarkTools
# pinning threads for good measure
using ThreadPinning
pinthreads(:cores)
# Single threaded.
function axpy_serial!(y, a, x)
for i in eachindex(y,x)
@inbounds y[i] = a * x[i] + y[i]
end
end
# Multithreaded with @batch
function axpy_batch!(y, a, x)
@batch for i in eachindex(y,x)
@inbounds y[i] = a * x[i] + y[i]
end
end
# Multithreaded with @threads (default scheduling)
function axpy_atthreads!(y, a, x)
@threads for i in eachindex(y,x)
@inbounds y[i] = a * x[i] + y[i]
end
end
# Multithreaded with @threads :static
function axpy_atthreads_static!(y, a, x)
@threads :static for i in eachindex(y,x)
@inbounds y[i] = a * x[i] + y[i]
end
end
y = rand(10_000);
x = rand(10_000);
@benchmark axpy_serial!($y, eps(), $x)
@benchmark axpy_batch!($y, eps(), $x)
@benchmark axpy_atthreads!($y, eps(), $x)
@benchmark axpy_atthreads_static!($y, eps(), $x)
@benchmark axpy!(eps(), $x, $y) # BLAS built-in axpy
VERSION
With 8 Julia threads (pinned to different CPU-cores) I find the following results.
julia> @benchmark axpy_serial!($y, eps(), $x)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min β¦ max): 1.430 ΞΌs β¦ 2.226 ΞΌs β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 1.434 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 1.438 ΞΌs Β± 23.775 ns β GC (mean Β± Ο): 0.00% Β± 0.00%
βββ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
1.43 ΞΌs Histogram: frequency by time 1.55 ΞΌs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark axpy_batch!($y, eps(), $x)
BenchmarkTools.Trial: 10000 samples with 69 evaluations.
Range (min β¦ max): 853.623 ns β¦ 2.361 ΞΌs β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 885.507 ns β GC (median): 0.00%
Time (mean Β± Ο): 889.184 ns Β± 25.306 ns β GC (mean Β± Ο): 0.00% Β± 0.00%
ββ
βββββββ
βββββββββββββββββββββββ
βββββββββββββββββββββββββββββββββββββ β
854 ns Histogram: frequency by time 968 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark axpy_atthreads!($y, eps(), $x)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
Range (min β¦ max): 4.437 ΞΌs β¦ 388.400 ΞΌs β GC (min β¦ max): 0.00% β¦ 97.02%
Time (median): 5.077 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 5.560 ΞΌs Β± 9.340 ΞΌs β GC (mean Β± Ο): 4.03% Β± 2.37%
βββ
ββββββ
ββββββ
ββββββββββββ
βββββββββββββββββββββββββββββββββββββββββ β
4.44 ΞΌs Histogram: frequency by time 7.44 ΞΌs <
Memory estimate: 4.54 KiB, allocs estimate: 48.
julia> @benchmark axpy_atthreads_static!($y, eps(), $x)
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
Range (min β¦ max): 3.078 ΞΌs β¦ 357.969 ΞΌs β GC (min β¦ max): 0.00% β¦ 96.65%
Time (median): 3.618 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 4.102 ΞΌs Β± 9.118 ΞΌs β GC (mean Β± Ο): 5.75% Β± 2.57%
βββββ
β
βββββββββββββββ
ββββββββββββββββββββββββββββββββββββββββββββ β
3.08 ΞΌs Histogram: frequency by time 6.12 ΞΌs <
Memory estimate: 4.56 KiB, allocs estimate: 48.
julia> @benchmark axpy!(eps(), $x, $y) # BLAS built-in axpy
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min β¦ max): 1.438 ΞΌs β¦ 9.397 ΞΌs β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 1.441 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 1.445 ΞΌs Β± 83.630 ns β GC (mean Β± Ο): 0.00% Β± 0.00%
β
ββββ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
1.44 ΞΌs Histogram: frequency by time 1.55 ΞΌs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> VERSION
v"1.9.3"
With only 10_000
elements, this simple AXPY computation can't afford the overhead of multithreading via @threads
(for either scheduling scheme). In fact, the latter just slows the computation down. Similarly, the built-in BLAS axpy!
doesn't provide any multithreading speedup (it likely falls back to a serial variant). Only with Polyester's @batch
, which has minimal overhead, do we get a decent(!) speedup.
The per
keyword argument can be used to limit the number of Julia threads to be used by a @batch
block. Specifically, per=core
will use only max(num_cores, nthreads())
many of the Julia threads.
Note that @batch
defaults to per=cores
. This is because LoopVectorization.jl currently only uses up to 1 thread per physical core, and switching the number of
threads incurs some overhead.
The minbatch
argument lets us choose a minimum number of iterations per thread. That is, minbatch=n
means it'll use at most number_loop_iterations Γ· n
threads.
For our benchmark example above with 10000 iterations, setting minbatch=2500
will lead to @batch
using only 4 (of 8) threads. This is still faster than the serial version but slower than plain @batch
, which uses all 8 available threads.
function axpy_minbatch!(y, a, x)
@batch minbatch=2500 for i in eachindex(y,x)
@inbounds y[i] = a * x[i] + y[i]
end
end
@benchmark axpy_minbatch!($y, $eps(), $x)
julia> @benchmark axpy_minbatch!($y, $eps(), $x)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min β¦ max): 1.072 ΞΌs β¦ 5.085 ΞΌs β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 1.114 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 1.126 ΞΌs Β± 72.510 ns β GC (mean Β± Ο): 0.00% Β± 0.00%
ββ
βββ
β
βββββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββ β
1.07 ΞΌs Histogram: frequency by time 1.36 ΞΌs <
Memory estimate: 0 bytes, allocs estimate: 0.
Warning: this feature is likely broken!
You also can define local storage for each thread, providing a vector containing each of the local storages at the end.
julia> let
@batch threadlocal=rand(10:99) for i in 0:9
println("index $i, thread $(Threads.threadid()), local storage $threadlocal")
threadlocal += 1
end
println(threadlocal)
end
index 8, thread 1, local storage 30
index 3, thread 3, local storage 49
index 9, thread 1, local storage 31
index 6, thread 4, local storage 33
index 0, thread 2, local storage 14
index 4, thread 3, local storage 50
index 7, thread 4, local storage 34
index 1, thread 2, local storage 15
index 5, thread 3, local storage 51
index 2, thread 2, local storage 16
Any[32, 17, 52, 35]
Optionally, a type can be specified for the thread-local storage:
julia> let
@batch threadlocal=rand(10:99)::Float16 for i in 0:9
end
println(threadlocal)
end
Float16[83.0, 90.0, 27.0, 65.0]
When running many repetitions of a Polyester-multithreaded function (e.g. in an embarrassingly parallel problem that repeatedly executes a small already Polyester-multithreaded function), it can be beneficial to disable Polyester (the inner multithreaded loop) and multithread only at the outer level (e.g. with Base.Threads
). This can be done with the disable_polyester_threads
context manager. In the expandable section below you can see examples with benchmarks.
It is best to call disable_polyester_threads
only once, before any @thread
uses happen, to avoid overhead. E.g. best to do it as:
disable_polyester_threads() do
@threads for i in 1:n
f()
end
end
instead of doing it in the following unnecessarily slow manner:
@threads for i in 1:n # DO NOT DO THIS
disable_polyester_threads() do # IT HAS UNNECESSARY OVERHEAD
f()
end
end
Benchmarks of nested multi-threading with Polyester
# Big inner problem, repeated only a few times
y = rand(10000000,4);
x = rand(size(y)...);
@btime inner($x,$y,1) # 57.456 ms (0 allocations: 0 bytes)
@btime inner_polyester($x,$y,1) # 7.456 ms (0 allocations: 0 bytes)
@btime inner_thread($x,$y,1) # 7.286 ms (49 allocations: 4.56 KiB)
@btime sequential_sequential($x,$y) # 229.513 ms (0 allocations: 0 bytes)
@btime sequential_polyester($x,$y) # 29.921 ms (0 allocations: 0 bytes)
@btime sequential_thread($x,$y) # 29.343 ms (196 allocations: 18.25 KiB)
@btime threads_of_polyester($x,$y) # 29.961 ms (42 allocations: 4.34 KiB)
# the following is a purposefully suboptimal way to disable threads
@btime threads_of_polyester_inner_disable($x,$y) # 55.397 ms (51 allocations: 4.62 KiB)
# the following is a good way to disable threads (the disable call happening once in the outer scope)
@btime Polyester.disable_polyester_threads() do; threads_of_polyester($x,$y) end; # 55.396 ms (47 allocations: 4.50 KiB)
@btime threads_of_sequential($x,$y) # 55.404 ms (48 allocations: 4.53 KiB)
@btime threads_of_thread($x,$y) # 29.187 ms (220 allocations: 22.03 KiB)
# Small inner problem, repeated many times
y = rand(1000,1000);
x = rand(size(y)...);
@btime inner($x,$y,1) # 3.390 ΞΌs (0 allocations: 0 bytes)
@btime inner_polyester($x,$y,1) # 785.714 ns (0 allocations: 0 bytes)
@btime inner_thread($x,$y,1) # 4.043 ΞΌs (48 allocations: 4.54 KiB)
@btime sequential_sequential($x,$y) # 5.720 ms (0 allocations: 0 bytes)
@btime sequential_polyester($x,$y) # 1.143 ms (0 allocations: 0 bytes)
@btime sequential_thread($x,$y) # 4.796 ms (50307 allocations: 4.50 MiB)
@btime threads_of_polyester($x,$y) # 1.165 ms (42 allocations: 4.34 KiB)
# the following is a purposefully suboptimal way to disable threads
@btime threads_of_polyester_inner_disable($x,$y) # 779.713 ΞΌs (1042 allocations: 35.59 KiB)
# the following is a good way to disable threads (the disable call happening once in the outer scope)
@btime Polyester.disable_polyester_threads() do; threads_of_polyester($x,$y) end; # 743.813 ΞΌs (48 allocations: 4.53 KiB)
@btime threads_of_sequential($x,$y) # 694.463 ΞΌs (45 allocations: 4.44 KiB)
@btime threads_of_thread($x,$y) # 2.288 ms (42058 allocations: 4.25 MiB)
# The tested functions
# All of these would be better implemented by just using @tturbo,
# but these suboptimal implementations serve as good test case for
# Polyster-vs-Base thread scheduling.
function inner(x,y,j)
for i β axes(x,1)
y[i,j] = sin(x[i,j])
end
end
function inner_polyester(x,y,j)
@batch for i β axes(x,1)
y[i,j] = sin(x[i,j])
end
end
function inner_thread(x,y,j)
@threads for i β axes(x,1)
y[i,j] = sin(x[i,j])
end
end
function sequential_sequential(x,y)
for j β axes(x,2)
inner(x,y,j)
end
end
function sequential_polyester(x,y)
for j β axes(x,2)
inner_polyester(x,y,j)
end
end
function sequential_thread(x,y)
for j β axes(x,2)
inner_thread(x,y,j)
end
end
function threads_of_polyester(x,y)
@threads for j β axes(x,2)
inner_polyester(x,y,j)
end
end
function threads_of_polyester_inner_disable(x,y)
# XXX This is a bad way to disable Polyester threads as
# it causes unnecessary overhead for each @threads thread.
# See the benchmarks above for a better way.
@threads for j β axes(x,2)
Polyester.disable_polyester_threads() do
inner_polyester(x,y,j)
end
end
end
function threads_of_thread(x,y)
@threads for j β axes(x,2)
inner_thread(x,y,j)
end
end
function threads_of_thread(x,y)
@threads for j β axes(x,2)
inner_thread(x,y,j)
end
end
function threads_of_sequential(x,y)
@threads for j β axes(x,2)
inner(x,y,j)
end
end
Benchmarks executed on:
Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 128 Γ AMD EPYC 7V13 64-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
Threads: 8 on 128 virtual cores
Environment:
JULIA_NUM_THREADS = 8