Using `sycl::atomic_ref<...>{}.store()` defaults to `__ATOMIC_SEQ_CST` leading to a compiler error. #1466

GagaLP · 2024-05-27T12:20:14Z

Bug summary
Calling store() on sycl::atomic_ref<int, sycl::memory_order::relaxed, sycl::memory_scope::device> unexpectedly defaults to __ATOMIC_SEQ_CST, causing the following error:

fatal error: error in backend: Cannot select: 0x7ee560: ch = AtomicStore<(store seq_cst (s32) into %ir.30)> 0x7eec48:1, 0x7eeaa8, 0x7eec48
  0x7eeaa8: i64,ch = CopyFromReg 0x6252198, Register:i64 %0
    0x7ee220: i64 = Register %0
  0x7eec48: i32,ch = load<(dereferenceable load (s32) from %ir.11)> 0x6252198, FrameIndex:i64<6>, undef:i64
    0x8aa1f0: i64 = FrameIndex<6>
    0x8a9b70: i64 = undef
In function: _ZNK7hipsycl4sycl10atomic_refIiLNS0_12memory_orderE0ELNS0_12memory_scopeE3ELNS0_6access13address_spaceE4EE5storeEiS2_S3_

To Reproduce
Compile the following code:

sycl::queue q{sycl::gpu_selector{}};
auto g = sycl::malloc_device<int>(1, q);
q.submit([&](sycl::handler& cgh) {
	sycl::local_accessor<size_t> l(1, cgh);
	cgh.parallel_for(sycl::nd_range<1>(1, 1), [&](sycl::nd_item<1>) {
		sycl::atomic_ref<int, sycl::memory_order::relaxed, sycl::memory_scope::device>{g[0]}.store(1, sycl::memory_order::relaxed);
	});
});

With the following CMake configuration:

cmake_minimum_required(VERSION 2.25)
cmake_policy(SET CMP0058 NEW)

project(sycl_test)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_C_STANDARD_REQUIRED TRUE)
set(CMAKE_CXX_STANDARD_REQUIRED TRUE)

set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
set(CMAKE_INSTALL_PREFIX ${CMAKE_CURRENT_SOURCE_DIR}/install)

add_compile_options(-fdiagnostics-color=always)

find_package(adaptivecpp CONFIG REQUIRED)

add_executable(sycl_test sycl_test.cpp)

add_sycl_to_target(TARGET sycl_test SOURCES sycl_test.cpp)

And this command:
cmake -DCMAKE_PREFIX_PATH="/path/to/AdaptiveCpp" -DACPP_TARGETS=cuda:sm_75 -DCMAKE_C_COMPILER=/usr/lib/llvm-14/bin/clang -DCMAKE_CXX_COMPILER=/usr/lib/llvm-14/bin/clang++ -G Ninja ..

The same error occurs when compiling the following CUDA code:

__global__ void kernel() {
    __shared__ int s[1];
    __atomic_store_n(s, 1, __ATOMIC_SEQ_CST);
}

However, compiling the CUDA code with __ATOMIC_RELAXED works:

__global__ void kernel() {
    __shared__ int s[1];
    __atomic_store_n(s, 1, __ATOMIC_RELAXED);
}

Describe your setup

Current AdaptiveCpp version: branch = develop, commit
Clang version:

Ubuntu clang version 14.0.0-1ubuntu1.1
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/lib/llvm-14/bin

Additional context
Using LLVM 17 (clang 17.0.6) does not resolve the issue.

The text was updated successfully, but these errors were encountered:

illuhad · 2024-05-27T17:45:32Z

It does not default to seq_cst. However, because the memory order is a runtime argument, it needs to generate code paths for all potential memory orders, including for the case where the function argument happens to be seq_cst.

Our implementation of the builtin looks like so:

inline constexpr int builtin_memory_order(memory_order o) noexcept {
  switch(o){
    case memory_order::relaxed:
      return __ATOMIC_RELAXED;
    case memory_order::acquire:
      return __ATOMIC_ACQUIRE;
    case memory_order::release:
      return __ATOMIC_RELEASE;
    case memory_order::acq_rel:
      return __ATOMIC_ACQ_REL;
    case memory_order::seq_cst:
      return __ATOMIC_SEQ_CST;
  }
  return __ATOMIC_RELAXED;
}

template <access::address_space S, class T>
HIPSYCL_HIPLIKE_BUILTIN void
__hipsycl_atomic_store(T *addr, T x, memory_order order,
                       memory_scope scope) noexcept {
  __atomic_store_n(addr, x, builtin_memory_order(order));
}

which then compiles to:

define void @_ZNK7hipsycl4sycl10atomic_refIiLNS0_12memory_orderE0ELNS0_12memory_scopeE3ELNS0_6access13address_spaceE4EE5storeEiS2_S3_(%"class.hipsycl::sycl::atomic_ref"* nocapture noundef nonnull readonly align 8 dereferenceable(8) %0, i32 noundef %1, i32 noundef %2, i32 noundef %3) local_unnamed_addr #0 comdat align 2 {
  %5 = getelementptr inbounds %"class.hipsycl::sycl::atomic_ref", %"class.hipsycl::sycl::atomic_ref"* %0, i64 0, i32 0
  %6 = load i32*, i32** %5, align 8
  switch i32 %2, label %11 [
    i32 4, label %10
    i32 1, label %7
    i32 2, label %8
    i32 3, label %9
  ]

7:                                                ; preds = %4
  br label %11

8:                                                ; preds = %4
  br label %11

9:                                                ; preds = %4
  br label %11

10:                                               ; preds = %4
  br label %11

11:                                               ; preds = %4, %10, %9, %8, %7
  %.0 = phi i32 [ 5, %10 ], [ 4, %9 ], [ 3, %8 ], [ 2, %7 ], [ 0, %4 ]
  switch i32 %.0, label %12 [
    i32 3, label %13
    i32 5, label %14
  ]

12:                                               ; preds = %11
  store atomic i32 %1, i32* %6 monotonic, align 4
  br label %15

13:                                               ; preds = %11
  store atomic i32 %1, i32* %6 release, align 4
  br label %15

14:                                               ; preds = %11
  store atomic i32 %1, i32* %6 seq_cst, align 4
  br label %15

15:                                               ; preds = %14, %13, %12
  ret void
}

The code path with the seq_cst is never taken in your application, and indeed if I compile with -O3 , the optimizer propagates the relaxed memory order argument far enough so that the unneeded code paths are eliminated, and the code compiles correctly.

We could just silently ignore the seq_cst case for --acpp-targets=cuda, but that might have negative implications on error diagnostics if a user actually tries to use seq_cst. Ultimately, this is a clang/LLVM bug because the LLVM NVPTX backend does not handle seq_cst correctly.

As a general pointer, I would also suggest to use --acpp-targets=generic instead of --acpp-targets=cuda. The generic JIT compiler does not have this problem, and is the better compiler anyway - it generates faster code, compiles faster and is more portable.

fknorr · 2024-05-28T08:03:48Z

Could we instead emit a trap for seq_cst loads / stores until this is fixed in LLVM?

illuhad · 2024-05-28T15:55:06Z

@fknorr Good idea, I think this could work. The LLVM nvptx backend supports the __builtin_trap builtin, which maps to PTX trap instruction.

GagaLP added the bug Something isn't working label May 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using `sycl::atomic_ref<...>{}.store()` defaults to `__ATOMIC_SEQ_CST` leading to a compiler error. #1466

Using `sycl::atomic_ref<...>{}.store()` defaults to `__ATOMIC_SEQ_CST` leading to a compiler error. #1466

GagaLP commented May 27, 2024

illuhad commented May 27, 2024 •

edited

Loading

fknorr commented May 28, 2024

illuhad commented May 28, 2024

Using sycl::atomic_ref<...>{}.store() defaults to __ATOMIC_SEQ_CST leading to a compiler error. #1466

Using sycl::atomic_ref<...>{}.store() defaults to __ATOMIC_SEQ_CST leading to a compiler error. #1466

Comments

GagaLP commented May 27, 2024

illuhad commented May 27, 2024 • edited Loading

fknorr commented May 28, 2024

illuhad commented May 28, 2024

Using `sycl::atomic_ref<...>{}.store()` defaults to `__ATOMIC_SEQ_CST` leading to a compiler error. #1466

Using `sycl::atomic_ref<...>{}.store()` defaults to `__ATOMIC_SEQ_CST` leading to a compiler error. #1466

illuhad commented May 27, 2024 •

edited

Loading