Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use max(1, Sys.CPU_THREADS) BLAS threads for aarch64. #46085

Merged
merged 2 commits into from
Jul 20, 2022

Conversation

chriselrod
Copy link
Contributor

None of the common ARM CPUs that come to mind have hyperthreading.

Fixes #46071

None of the common ARM CPUs that come to mind have hyperthreading.
@gbaraldi
Copy link
Member

Looks good to me. Probably should be backported to 1.8 too.

@KristofferC KristofferC added the backport 1.8 Change should be backported to release-1.8 label Jul 18, 2022
@@ -582,7 +582,11 @@ function __init__()
Base.at_disable_library_threading(() -> BLAS.set_num_threads(1))

if !haskey(ENV, "OPENBLAS_NUM_THREADS")
BLAS.set_num_threads(max(1, Sys.CPU_THREADS ÷ 2))
if Sys.ARCH === :aarch64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is right. It's only on CPUs where we know Sys.CPU_THREADS is already the "right" number of threads for BLAS that we should use Sys.CPU_THREADS instead of Sys.CPU_THREADS ÷ 2. Unfortunately we don't have any way at the moment to query this, and the only system where we know for sure we have to do it is the M1 (because we have specific code for setting the value Sys.CPU_THREADS), which incidentally coincides with all known aarch64-apple-darwin platforms. Also, this should use @static and I think a better test is Sys.isapple() && Base.BinaryPlatforms.arch(Base.BinaryPlatforms.HostPlatform()) == "aarch64" (HostPlatform() does a normalisation of the arch, which in some cases may be arm64)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of the common ARM CPUs that come to mind have hyperthreading.

Probably it's only/mainly Intel which does hyperthreading no? You're excluding 32-bit ARM, which I also doubt does hyperthreading

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AMD Cpus also have it (it's called SMT for them). Power I think does it too. For someone running asahi for example this is probably wrong, same for someone running the new Intel ones. But it was wrong before and i don't think we figured out a way to query performance cores on linux.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Power I think does it too.

Power does 4x or 8x, depending on the chip, meaning we may want to divide by 4 or 8 for Power? Is that detectable?

For someone running asahi for example this is probably wrong

On asahi, would the 4big4L read as 8 threads, and 8big2L read as 10 threads?
Just by chance, 4 would've been correct for the former, but 5 for the latter isn't exactly ideal either. I'd be curious whether 5 or 10 are faster.

It's only on CPUs where we know Sys.CPU_THREADS is already the "right" number of threads for BLAS that we should use Sys.CPU_THREADS instead of Sys.CPU_THREADS ÷ 2

I think this is a rather strong statement, in that 5 doesn't seem obviously better than 10 for 8 big core + 2 little core systems.

Also, this should use @static

Seems unnecessary when using Sys.ARCH, given the syntax is valid and the compiler can optimize it away just fine.

julia> foo() = Sys.ARCH === :aarch64
foo (generic function with 1 method)

julia> @code_typed foo()
CodeInfo(
1return false
) => Bool

but I can use @static when switching to Base.BinaryPlatforms.arch(Base.BinaryPlatforms.HostPlatform()) == "aarch64"

I think a better test is Sys.isapple()

Do you think Asahi linux is more common than Raspberry Pi 4, Graviton, Ampere, A64FX?

Not sure about Nvidea's products.
If we start seeing more ARM laptops, they'll probably have performance + efficiency cores for Windows and Linux.

So it's possible, but I do not know the market shares here.

There are Intel CPUs without hyperthreading, and you can also turn it off in the bios for CPUs with it (this is reasonably common in HPC environments).
So combined with the P+E cores from Alder Lake and beyond, this all does get pretty messy. Would be interesting to see benchmarks there.
Intel advertises their E cores as being efficient in terms of performance/area, not performance/power, so they're marketing their E cores as improving multithreading performance. Their next generation of CPUs will add a lot more E cores without adding extra P cores. Maybe we really do want to use 1 BLAS thread per E core, hoping that the BLAS library does reasonable load balancing (which isn't a given; MKL seems to, but I wouldn't bet on OpenBLAS or BLIS). We'd still probably only want 1 thread per P core, and these unfortunately do have hyperthreading...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alder lake is really annoying because the E cores don't have SMT but the P cores do. Ideally we should always find the number of P cores and set NUM_THREADS to that.

A quick test shows that, at least for BLAS, adding the E cores makes it quite a bit slower.

versioninfo()
Julia Version 1.9.0-DEV.688
Commit 3eaed8b54c (2022-05-31 15:31 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × 12th Gen Intel(R) Core(TM) i5-12600K
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, alderlake)
  Threads: 6 on 16 virtual cores
Environment:
  JULIA_NUM_THREADS = 6
  JULIA_NUM_PRECOMPILE_TASKS = 6

julia> using LinearAlgebra
A1=rand(8192,8192); A2=copy(A1);

julia> @btime lu!($A1);
  1.022 s (2 allocations: 64.05 KiB)

julia> BLAS.set_num_threads(10)

julia> @btime lu!($A2);
  1.847 s (2 allocations: 64.05 KiB)

Copy link
Contributor Author

@chriselrod chriselrod Jul 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about with using MKL? I'm curious if MKL does better.

I get the argument for mixing cores optimized for absolute performance with cores optimized for performance/area, but with it hurting benchmarks, so far that sounds more like marketing than reality.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you also set JULIA_NUM_PRECOMPILE_TASKS = 6, do you think it even hurts this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I set that because in macos there were some issues with the E cores and precompiling (it made the sound crackle). With MKL and 6 cores it didn't change much, but with 10 it didn't get worse interestingly. So mkl might do something not to use them

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So mkl might do something not to use them

That seems likely.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, when I said "Intel" above I was referring to x86-64 in general, not chips made by Intel specifically.

Anyway, since it was said that most ARM CPUs (and not just Aaarch64, and not just on macOS) will typically not have multithreading, then the condition should change to

Base.BinaryPlatforms.arch(Base.BinaryPlatforms.HostPlatform())  ("aarch64", "armv6l", "armv7l")

? It remains the already mentioned problem of Asahi Linux, where we don't set correctly the total number of CPUs, but that's a more general problem of detecting the efficiency/power cores, and in any case this is only a default value of threads, which can be always changed (and people strongly caring about the precise number of threads to be used on their machine should set it manually to be safe against changes of defaults in any case).

Use binarybuilder to normalize arch, and for now only use 1 BLAS thread per CPU thread for `Sys.isapple()` and `aarch64`
@KristofferC KristofferC merged commit 97df6db into JuliaLang:master Jul 20, 2022
KristofferC pushed a commit that referenced this pull request Jul 20, 2022
@chriselrod chriselrod deleted the patch-2 branch July 20, 2022 14:08
@@ -582,7 +582,11 @@ function __init__()
Base.at_disable_library_threading(() -> BLAS.set_num_threads(1))

if !haskey(ENV, "OPENBLAS_NUM_THREADS")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GOTO_NUM_THREADS comes from the original GOTOBLAS library. Nobody should be using it!

For OMP, I thought that only applies if you build with OpenMP for threading - which we don't. I think we should only look at OPENBLAS_NUM_THREADS.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a matter of fact OpenBLAS respects those variables. If you set them you can see with BLAS.get_num_threads that OpenBLAS changes the setting accordingly

Copy link
Member

@ViralBShah ViralBShah Jul 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup it does. I was just saying that we may only want to use one of them and only document that. However, I don't have a strong opinion if we want to use the other ones - but what should we do if more than one of those environment variables are present?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If any of those variables is set, we shouldn't override it. See #46118 🙂

@KristofferC KristofferC removed the backport 1.8 Change should be backported to release-1.8 label Aug 7, 2022
ffucci pushed a commit to ffucci/julia that referenced this pull request Aug 11, 2022
pcjentsch pushed a commit to pcjentsch/julia that referenced this pull request Aug 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Non-ideal default number of BLAS threads on aarch64-apple-darwin
5 participants