Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using sycl::atomic_ref<...>{}.store() defaults to __ATOMIC_SEQ_CST leading to a compiler error. #1466

Open
GagaLP opened this issue May 27, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@GagaLP
Copy link

GagaLP commented May 27, 2024

Bug summary
Calling store() on sycl::atomic_ref<int, sycl::memory_order::relaxed, sycl::memory_scope::device> unexpectedly defaults to __ATOMIC_SEQ_CST, causing the following error:

fatal error: error in backend: Cannot select: 0x7ee560: ch = AtomicStore<(store seq_cst (s32) into %ir.30)> 0x7eec48:1, 0x7eeaa8, 0x7eec48
  0x7eeaa8: i64,ch = CopyFromReg 0x6252198, Register:i64 %0
    0x7ee220: i64 = Register %0
  0x7eec48: i32,ch = load<(dereferenceable load (s32) from %ir.11)> 0x6252198, FrameIndex:i64<6>, undef:i64
    0x8aa1f0: i64 = FrameIndex<6>
    0x8a9b70: i64 = undef
In function: _ZNK7hipsycl4sycl10atomic_refIiLNS0_12memory_orderE0ELNS0_12memory_scopeE3ELNS0_6access13address_spaceE4EE5storeEiS2_S3_

To Reproduce
Compile the following code:

sycl::queue q{sycl::gpu_selector{}};
auto g = sycl::malloc_device<int>(1, q);
q.submit([&](sycl::handler& cgh) {
	sycl::local_accessor<size_t> l(1, cgh);
	cgh.parallel_for(sycl::nd_range<1>(1, 1), [&](sycl::nd_item<1>) {
		sycl::atomic_ref<int, sycl::memory_order::relaxed, sycl::memory_scope::device>{g[0]}.store(1, sycl::memory_order::relaxed);
	});
});

With the following CMake configuration:

cmake_minimum_required(VERSION 2.25)
cmake_policy(SET CMP0058 NEW)

project(sycl_test)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_C_STANDARD_REQUIRED TRUE)
set(CMAKE_CXX_STANDARD_REQUIRED TRUE)

set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
set(CMAKE_INSTALL_PREFIX ${CMAKE_CURRENT_SOURCE_DIR}/install)

add_compile_options(-fdiagnostics-color=always)

find_package(adaptivecpp CONFIG REQUIRED)

add_executable(sycl_test sycl_test.cpp)

add_sycl_to_target(TARGET sycl_test SOURCES sycl_test.cpp)

And this command:
cmake -DCMAKE_PREFIX_PATH="/path/to/AdaptiveCpp" -DACPP_TARGETS=cuda:sm_75 -DCMAKE_C_COMPILER=/usr/lib/llvm-14/bin/clang -DCMAKE_CXX_COMPILER=/usr/lib/llvm-14/bin/clang++ -G Ninja ..

The same error occurs when compiling the following CUDA code:

__global__ void kernel() {
    __shared__ int s[1];
    __atomic_store_n(s, 1, __ATOMIC_SEQ_CST);
}

However, compiling the CUDA code with __ATOMIC_RELAXED works:

__global__ void kernel() {
    __shared__ int s[1];
    __atomic_store_n(s, 1, __ATOMIC_RELAXED);
}

Describe your setup

  • Current AdaptiveCpp version: branch = develop, commit
  • Clang version:
Ubuntu clang version 14.0.0-1ubuntu1.1
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/lib/llvm-14/bin

Additional context
Using LLVM 17 (clang 17.0.6) does not resolve the issue.

@GagaLP GagaLP added the bug Something isn't working label May 27, 2024
@illuhad
Copy link
Collaborator

illuhad commented May 27, 2024

It does not default to seq_cst. However, because the memory order is a runtime argument, it needs to generate code paths for all potential memory orders, including for the case where the function argument happens to be seq_cst.

Our implementation of the builtin looks like so:

inline constexpr int builtin_memory_order(memory_order o) noexcept {
  switch(o){
    case memory_order::relaxed:
      return __ATOMIC_RELAXED;
    case memory_order::acquire:
      return __ATOMIC_ACQUIRE;
    case memory_order::release:
      return __ATOMIC_RELEASE;
    case memory_order::acq_rel:
      return __ATOMIC_ACQ_REL;
    case memory_order::seq_cst:
      return __ATOMIC_SEQ_CST;
  }
  return __ATOMIC_RELAXED;
}

template <access::address_space S, class T>
HIPSYCL_HIPLIKE_BUILTIN void
__hipsycl_atomic_store(T *addr, T x, memory_order order,
                       memory_scope scope) noexcept {
  __atomic_store_n(addr, x, builtin_memory_order(order));
}

which then compiles to:

define void @_ZNK7hipsycl4sycl10atomic_refIiLNS0_12memory_orderE0ELNS0_12memory_scopeE3ELNS0_6access13address_spaceE4EE5storeEiS2_S3_(%"class.hipsycl::sycl::atomic_ref"* nocapture noundef nonnull readonly align 8 dereferenceable(8) %0, i32 noundef %1, i32 noundef %2, i32 noundef %3) local_unnamed_addr #0 comdat align 2 {
  %5 = getelementptr inbounds %"class.hipsycl::sycl::atomic_ref", %"class.hipsycl::sycl::atomic_ref"* %0, i64 0, i32 0
  %6 = load i32*, i32** %5, align 8
  switch i32 %2, label %11 [
    i32 4, label %10
    i32 1, label %7
    i32 2, label %8
    i32 3, label %9
  ]

7:                                                ; preds = %4
  br label %11

8:                                                ; preds = %4
  br label %11

9:                                                ; preds = %4
  br label %11

10:                                               ; preds = %4
  br label %11

11:                                               ; preds = %4, %10, %9, %8, %7
  %.0 = phi i32 [ 5, %10 ], [ 4, %9 ], [ 3, %8 ], [ 2, %7 ], [ 0, %4 ]
  switch i32 %.0, label %12 [
    i32 3, label %13
    i32 5, label %14
  ]

12:                                               ; preds = %11
  store atomic i32 %1, i32* %6 monotonic, align 4
  br label %15

13:                                               ; preds = %11
  store atomic i32 %1, i32* %6 release, align 4
  br label %15

14:                                               ; preds = %11
  store atomic i32 %1, i32* %6 seq_cst, align 4
  br label %15

15:                                               ; preds = %14, %13, %12
  ret void
}

The code path with the seq_cst is never taken in your application, and indeed if I compile with -O3 , the optimizer propagates the relaxed memory order argument far enough so that the unneeded code paths are eliminated, and the code compiles correctly.

We could just silently ignore the seq_cst case for --acpp-targets=cuda, but that might have negative implications on error diagnostics if a user actually tries to use seq_cst. Ultimately, this is a clang/LLVM bug because the LLVM NVPTX backend does not handle seq_cst correctly.

As a general pointer, I would also suggest to use --acpp-targets=generic instead of --acpp-targets=cuda. The generic JIT compiler does not have this problem, and is the better compiler anyway - it generates faster code, compiles faster and is more portable.

@fknorr
Copy link
Contributor

fknorr commented May 28, 2024

Could we instead emit a trap for seq_cst loads / stores until this is fixed in LLVM?

@illuhad
Copy link
Collaborator

illuhad commented May 28, 2024

@fknorr Good idea, I think this could work. The LLVM nvptx backend supports the __builtin_trap builtin, which maps to PTX trap instruction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants