
BLIS: Append 64_ Suffix to All F77 Exported Routines #4463

Merged Feb 19, 2022 (1 commit)

Conversation

@xrq-phys (Contributor) commented Feb 19, 2022

Resolves JuliaLinearAlgebra/libblastrampoline#36

Currently only the level-1/2/3 S/D/C/Z routines are exported with the 64_ suffix, while libblastrampoline identifies the BLAS suffix by probing isamax. This causes the issue linked above. The patch here should resolve it and kick off lbt + libblis.

Append 64_ suffix to all F77 exported routines.
@giordano (Member) commented Feb 19, 2022

Nice:

julia> using LinearAlgebra

julia> peakflops(5000)
1.606979535336095e11

julia> BLAS.lbt_forward("./libblis.so", clear=true)
155

julia> peakflops(5000)
3.5732036026861916e10

Edit: I realised after posting that BLIS is actually slower here; I hadn't noticed the different order of magnitude 😬

OpenBLAS is also faster for other operations:

julia> using BenchmarkTools

julia> LinearAlgebra.__init__()

julia> @benchmark BLAS.axpy!(a, x, y) setup=(T=Float32; N=Int(1e6); a=randn(T); x=randn(T, N); y=randn(T, N)) evals=1
BenchmarkTools.Trial: 208 samples with 1 evaluation.
 Range (min … max):  123.510 μs …   4.651 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     176.543 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   262.716 μs ± 386.684 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁█▃▁                                                           
  ████▆▄▃▃▄▃▂▄▃▄▃▃▃▄▂▂▃▁▁▂▁▁▁▂▃▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▃
  124 μs           Histogram: frequency by time         1.15 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> BLAS.lbt_forward("./libblis.so", clear=true)
155

julia> @benchmark BLAS.axpy!(a, x, y) setup=(T=Float32; N=Int(1e6); a=randn(T); x=randn(T, N); y=randn(T, N)) evals=1
BenchmarkTools.Trial: 412 samples with 1 evaluation.
 Range (min … max):  330.814 μs … 761.048 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     484.763 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   476.162 μs ±  86.159 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █ ▂ ▁                ▇▆▅▆    ▄  ▁▁ ▁▁                         
  ████▇█▅▆▄▃▃▁▄▃▃▃▁▃▄▅▆▅█████▆▆██████████▅▅▆▅▅▄▃▁▁▃▃▃▃▁▁▃▁▁▃▁▁▃ ▄
  331 μs           Histogram: frequency by time          700 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Does BLIS do runtime detection of CPU features on all architectures? In particular I'm interested in SVE for A64FX; I saw you worked on that.

@giordano giordano merged commit 3f85ffc into JuliaPackaging:master Feb 19, 2022
@xrq-phys (Contributor, Author)

@giordano Thanks for the info.

BLIS currently has no specialized optimizations for level-2 BLAS operations, so I guess that's where the slowdown comes from.

SVE is not compiled for now because GNU as doesn't assemble SVE instructions without -march=, while BinaryBuilder.jl disables that flag.

@giordano (Member)

Ok. We do have support for multiple microarchitectures, but we still need to flesh out some details, and I need to fix some compiler flags for aarch64. With JuliaLang/julia#44194 we'll eventually be able to target A64FX, too.

While we're here: do you happen to know whether A64FX requires AES? 🙂

@giordano (Member)

> SVE is not compiled for now because GNU as doesn't assemble SVE instructions without -march=, while BinaryBuilder.jl disables that flag.

Wait, would BLIS build a "fat" library for all the targets into a single file, like OpenBLAS does? Because in that case it's ok to disable the check for -march, as long as the library can be used on all microarchitectures.

@xrq-phys (Contributor, Author)

I'm afraid I do not know about this.

> Wait, would BLIS build a "fat" library for all the targets into a single file, like OpenBLAS does? Because in that case it's ok to disable the check for -march, as long as the library can be used on all microarchitectures.

Exactly. Assembly compiled with -march=...+sve would never actually be executed on a machine without SVE. I suppose this makes it easier to support SVE in BLIS than in Julia as a whole.

@giordano (Member)

Ok, then you can add lock_microarchitecture=false, see

preferred_gcc_version=v"6", lock_microarchitecture=false, julia_compat="1.7")

@xrq-phys (Contributor, Author)

Excellent!

Do I need to somehow stick to GCC 8 for max compatibility? Or can I push GCC to 10 for arm_sve.h?

Both approaches would work for SVE processors though.

@giordano (Member)

> Do I need to somehow stick to GCC 8 for max compatibility? Or can I push GCC to 10 for arm_sve.h?

The main compatibility concern we usually have is with C++ code, which can end up requiring too new a libstdc++ at runtime. However, I don't see any symbols tagged with GLIBCXX in libblis:

% nm libblis.so|grep GLIBCXX
%

so I think it should be ok to use GCC 10 for this. We also have GCC 11.
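As a sanity check of the reasoning above, the same kind of nm probe shows that a pure-C shared library carries no GLIBCXX-versioned references at all, since those come from linking against libstdc++. The toy library below (pure_c.c and its scale function) is fabricated for illustration:

```shell
# Hypothetical sketch: a pure-C shared library should reference no
# GLIBCXX-versioned symbols, unlike C++ code linked against libstdc++.
cat > pure_c.c <<'EOF'
/* trivial C-only code, for illustration */
double scale(double x) { return 2.0 * x; }
EOF
cc -shared -fPIC -o libpure_c.so pure_c.c
nm -D libpure_c.so | grep GLIBCXX || echo "no GLIBCXX symbols"
```

This mirrors the empty `nm libblis.so | grep GLIBCXX` result above: with no GLIBCXX tags, the library imposes no minimum libstdc++ version at runtime.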

@xrq-phys (Contributor, Author)

BLIS uses C only, so upgrading to GCC 10 would save me some source-screening work. Thanks.

> We also have GCC 11.

Seen when compiling for aarch64-apple 😉. Nice work!

simeonschaub pushed a commit to simeonschaub/Yggdrasil that referenced this pull request Feb 23, 2022
Append 64_ suffix to all F77 exported routines.
Successfully merging this pull request may close these issues: Issues with blis_jll.