This repository has been archived by the owner on May 27, 2021. It is now read-only.

Implement wrappers for WMMA LLVM intrinsics #494

Merged · 91 commits · Feb 3, 2020
Commits (91)
7561da6
Implement wrappers for WMMA LLVM intrinsics
thomasfaingnaert Nov 9, 2019
8f4f2d1
Implement basic CUDA-style API
thomasfaingnaert Nov 10, 2019
23d9552
Generalise load for matrix
thomasfaingnaert Nov 10, 2019
faae545
Implement wrappers for WMMA LLVM intrinsics
thomasfaingnaert Nov 9, 2019
0efeaaa
Merge branch 'wmma-wrapper' into wmma-cuda
thomasfaingnaert Nov 10, 2019
844f28e
Move high-level API to same file
thomasfaingnaert Nov 10, 2019
91d6ee7
Finish load
thomasfaingnaert Nov 10, 2019
db17cc6
Wrapper for store
thomasfaingnaert Nov 10, 2019
53657fe
Generalise MMA
thomasfaingnaert Nov 10, 2019
7a0b1dc
Generalise high level test
thomasfaingnaert Nov 10, 2019
d0e490c
Move d type to config
thomasfaingnaert Nov 10, 2019
7ec4877
Add fill_fragment function
thomasfaingnaert Nov 10, 2019
cf3ba19
Add tests for multiply
thomasfaingnaert Nov 10, 2019
740559a
Add configuration variable to fill
thomasfaingnaert Nov 10, 2019
44898c6
Add documentation
thomasfaingnaert Nov 10, 2019
cb753da
Add documentation
thomasfaingnaert Nov 10, 2019
c43a50c
Merge branch 'wmma-wrapper' into wmma-cuda
thomasfaingnaert Nov 10, 2019
381e25a
Add initial documentation CUDA style API
thomasfaingnaert Nov 10, 2019
3ed3a17
Finalise documentation
thomasfaingnaert Nov 11, 2019
722cb5a
Merge remote-tracking branch 'upstream/master' into wmma-wrapper
thomasfaingnaert Nov 14, 2019
18394c1
Implement tests for shared address space
thomasfaingnaert Nov 15, 2019
0b8fff8
Change default shared memory alignment
thomasfaingnaert Nov 23, 2019
3d22453
Merge branch 'wmma-wrapper' into wmma-cuda
thomasfaingnaert Nov 23, 2019
410a12b
Change equality test
thomasfaingnaert Nov 25, 2019
67fbbf7
Merge branch 'wmma-wrapper' into wmma-cuda
thomasfaingnaert Nov 25, 2019
ab54af0
Change equality test
thomasfaingnaert Nov 25, 2019
f44388a
Change load to ccall
thomasfaingnaert Nov 26, 2019
b1865dd
Use ccall for store
thomasfaingnaert Nov 26, 2019
865aac5
Cleanup store
thomasfaingnaert Nov 27, 2019
4e2cb3c
Fix wmma
thomasfaingnaert Nov 27, 2019
c85e4d5
Fix shared tests
thomasfaingnaert Nov 27, 2019
84f43f0
Clean up wrappers
thomasfaingnaert Nov 27, 2019
6b6e65b
Fix indenting
thomasfaingnaert Nov 27, 2019
2455e3b
Fix typo
thomasfaingnaert Nov 27, 2019
4d39b9b
Fix typo
thomasfaingnaert Nov 27, 2019
579a060
Cleanup addrspacecast
thomasfaingnaert Nov 27, 2019
5606368
Re-enable tests
thomasfaingnaert Nov 27, 2019
fc108a6
Fix intrinsics for LLVM 8
thomasfaingnaert Nov 28, 2019
3c16a20
Fix shared tests
thomasfaingnaert Nov 28, 2019
2d2b592
Clean up tests
thomasfaingnaert Nov 28, 2019
b8de94b
Merge branch 'wmma-cuda' into wmma-wrapper
thomasfaingnaert Nov 28, 2019
2d0c7cf
Fixes
thomasfaingnaert Nov 28, 2019
9373c55
Reenable tests
thomasfaingnaert Nov 28, 2019
ba6ff5b
Add whitespace
thomasfaingnaert Nov 29, 2019
06f1f1e
Use separate frag size variable for high-level API
thomasfaingnaert Nov 29, 2019
f5ebc8e
Implement flattening
thomasfaingnaert Nov 29, 2019
08a6e6c
Test elementwise op
thomasfaingnaert Nov 29, 2019
05ec4a5
Change comment
thomasfaingnaert Nov 29, 2019
a83bbfa
Only run WMMA test for recent Julia
thomasfaingnaert Nov 30, 2019
062038d
Reenable other tests
thomasfaingnaert Nov 30, 2019
62607ef
Add minimum Julia version to documentation
thomasfaingnaert Nov 30, 2019
df1ca7d
Refactor load to use @generated
thomasfaingnaert Dec 1, 2019
9fd6b74
Refactor store to use @generated
thomasfaingnaert Dec 1, 2019
c7eef61
Refactor fill to use @generated
thomasfaingnaert Dec 1, 2019
0d6777c
Refactor wmma to use @generated
thomasfaingnaert Dec 1, 2019
702d372
Cleanup
thomasfaingnaert Dec 1, 2019
9a3036c
Reenable tests
thomasfaingnaert Dec 1, 2019
e4fe143
Implement broadcasting
thomasfaingnaert Dec 2, 2019
556bbdf
Use correct type for alpha and beta
thomasfaingnaert Dec 3, 2019
40ff035
Add tests for flattening and unflattening
thomasfaingnaert Dec 4, 2019
3911871
Add tests for broadcasting
thomasfaingnaert Dec 4, 2019
70f35df
Add CUDAnative prefix
thomasfaingnaert Dec 4, 2019
c99d9f6
Adhere to Julia naming convention for types
thomasfaingnaert Dec 6, 2019
765a7c2
Capitalise WMMA
thomasfaingnaert Dec 6, 2019
10828dc
Merge remote-tracking branch 'upstream/master' into wmma-wrapper
thomasfaingnaert Dec 7, 2019
e5e6965
Move examples to separate folder
thomasfaingnaert Dec 7, 2019
76ed2bb
Only run WMMA examples for recent Julia
thomasfaingnaert Dec 7, 2019
eff97ae
Merge remote-tracking branch 'upstream/master' into wmma-wrapper
thomasfaingnaert Dec 10, 2019
e181320
Undo changes to Project and Manifest file
thomasfaingnaert Dec 10, 2019
841199e
Document flattening and broadcast
thomasfaingnaert Dec 10, 2019
86af328
Bump minimum Julia version
thomasfaingnaert Dec 10, 2019
6d376f3
Use exit() in examples
thomasfaingnaert Dec 10, 2019
c50beed
Temporarily disable test
thomasfaingnaert Dec 10, 2019
1b70c90
Bump min SM for Julia nightly tests
thomasfaingnaert Dec 12, 2019
0563679
Check capability in WMMA tests
thomasfaingnaert Dec 12, 2019
3b8391d
Reenable unflatten test
thomasfaingnaert Dec 15, 2019
a64194d
Merge remote-tracking branch 'upstream/master' into wmma-wrapper
thomasfaingnaert Dec 26, 2019
fdc8dde
Update version check
thomasfaingnaert Dec 26, 2019
b869380
Merge remote-tracking branch 'upstream/master' into wmma-wrapper
thomasfaingnaert Jan 31, 2020
3f9faac
Set CI_THOROUGH
thomasfaingnaert Jan 31, 2020
a1b64f2
Mark constants as 'const'
thomasfaingnaert Jan 31, 2020
d215a56
Remove join_nonempty
thomasfaingnaert Jan 31, 2020
d66f738
Implement indexing for WMMAFragment
thomasfaingnaert Jan 31, 2020
dad2a66
Refactor conversion of AS to Int
thomasfaingnaert Jan 31, 2020
28f44b5
Move LLVM intrinsics doc to docstrings
thomasfaingnaert Jan 31, 2020
9d9d5e2
Fix path to examples in docs
thomasfaingnaert Jan 31, 2020
ae19e03
Move everything in WMMA submodule
thomasfaingnaert Feb 1, 2020
3c95dcf
Fix docs
thomasfaingnaert Feb 1, 2020
94ab442
Fix indenting
thomasfaingnaert Feb 1, 2020
3579373
Add broadcasting to example
thomasfaingnaert Feb 1, 2020
93c77bc
Small doc fix
thomasfaingnaert Feb 1, 2020
Changes from all commits
3 changes: 3 additions & 0 deletions .gitlab-ci.yml
@@ -26,6 +26,9 @@ julia:nightly:
- .test
tags:
- nvidia
- sm_75
variables:
CI_THOROUGH: 'true'
allow_failure: true


1 change: 1 addition & 0 deletions docs/make.jl
@@ -25,6 +25,7 @@ function main()
],
"Device" => [
"device/cuda.md",
"device/wmma.md",
"device/array.md"
]
]
178 changes: 178 additions & 0 deletions docs/src/device/wmma.md
@@ -0,0 +1,178 @@
# WMMA

This section details CUDAnative's interface to CUDA's warp matrix multiply-accumulate (WMMA) operations.
This interface enables programmatic access to Tensor Cores, a new hardware feature in Volta that performs mixed-precision matrix MAC operations.

Access to WMMA using CUDAnative is available at two levels: low-level wrappers around the LLVM intrinsics, and a higher-level API similar to that of CUDA C.

Note that to use the WMMA intrinsics, you need a sufficiently recent version of Julia: `v1.4.0-DEV.666` or later.
You can check this by running the following in the REPL:
```julia
VERSION >= v"1.4.0-DEV.666"
```

!!! note

If you're running into any of the following errors while using the WMMA interfaces:
```
LLVM error: Do not know how to split the result of this operator!
```
or
```
CUDA error: a PTX JIT compilation failed (code 218, ERROR_INVALID_PTX)
ptxas application ptx input, line <line>; error : .aligned modifier required for instruction '<instr>'
```
then make sure you are running Julia v1.4.0-DEV.666 or later!

## Terminology

The WMMA operations perform a matrix multiply-accumulate.
More concretely, they calculate ``D = A \cdot B + C``, where ``A`` is an ``M \times K`` matrix, ``B`` is a ``K \times N`` matrix, and ``C`` and ``D`` are ``M \times N`` matrices.

Note that not all values of ``M``, ``N`` and ``K`` are allowed.
The tuple ``(M, N, K)`` is often called the "shape" of the multiply-accumulate operation.

The multiply-accumulate consists of the following steps:
- Load the matrices ``A``, ``B`` and ``C`` from memory to registers using a WMMA load operation.
- Perform the matrix multiply-accumulate of ``A``, ``B`` and ``C`` to obtain ``D`` using a WMMA MMA operation. ``D`` is stored in hardware registers after this step.
- Store the result ``D`` back to memory using a WMMA store operation.

Note that WMMA is a warp-wide operation, which means that all threads in a warp must cooperate, and execute the WMMA operations in lockstep.
Failure to do so will result in undefined behaviour.

Each thread in a warp will hold a part of the matrix in its registers.
In WMMA parlance, this part is referred to as a "fragment".
Note that the exact mapping between matrix elements and fragments is unspecified, and subject to change in future versions.

Finally, it is important to note that the resultant ``D`` matrix can be used as a ``C`` matrix for a subsequent multiply-accumulate.
This is useful if one needs to calculate a sum of the form ``\sum_{i=0}^{n} A_i B_i``, where ``A_i`` and ``B_i`` are matrices of the correct dimension.
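
As an illustration, here is a minimal device-side sketch of such an accumulation using the high-level API documented further down. The kernel name, the contiguous column-major storage of the ``A_i`` and ``B_i`` matrices behind `a_ptr`/`b_ptr`, and the byte-offset pointer arithmetic are assumptions made for the sake of the example; `fill_c` is assumed to take the fill value followed by the configuration.
```julia
# Sketch: D = Σᵢ Aᵢ·Bᵢ for n 16×16 Float16 matrices stored contiguously
# (column-major) behind a_ptr and b_ptr, accumulating into Float32.
function sum_mma_kernel(a_ptr, b_ptr, d_ptr, n)
    conf = WMMA.Config{16, 16, 16, Float32}

    acc = WMMA.fill_c(0.0f0, conf)                    # start from an all-zero C fragment
    for i in 0:n-1
        offset = i * 16 * 16 * sizeof(Float16)        # byte offset of the i-th matrix
        a_frag = WMMA.load_a(a_ptr + offset, 16, WMMA.ColMajor, conf)
        b_frag = WMMA.load_b(b_ptr + offset, 16, WMMA.ColMajor, conf)
        acc    = WMMA.mma(a_frag, b_frag, acc, conf)  # result feeds the next iteration's C
    end

    WMMA.store_d(d_ptr, acc, 16, WMMA.ColMajor, conf)
    return
end

# launched from the host with one warp, e.g.:
# @cuda threads=32 sum_mma_kernel(pointer(a_dev), pointer(b_dev), pointer(d_dev), n)
```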

## LLVM Intrinsics
Review comment (member): From a user perspective, I would first want to read about the C-like/high-level API, and then, if I am interested, about the intrinsics.


The LLVM intrinsics are accessible by using the one-to-one Julia wrappers.
The return type of each wrapper is the Julia type that corresponds most closely to the return type of the LLVM intrinsic.
For example, LLVM's `[8 x <2 x half>]` becomes `NTuple{8, NTuple{2, VecElement{Float16}}}` in Julia.
In essence, these wrappers return the SSA values returned by the LLVM intrinsic.
Currently, all intrinsics that are available in LLVM 6, PTX 6.0 and SM 70 are implemented.

These LLVM intrinsics are then lowered to the correct PTX instructions by the LLVM NVPTX backend.
For more information about the PTX instructions, please refer to the [PTX Instruction Set Architecture Manual](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions).

The LLVM intrinsics are subdivided into three categories: load, store and multiply-accumulate.
In what follows, each of these will be discussed.
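
As a quick orientation, the wrapper names encode the operation, the matrix, the layout, the shape, the addressing mode and the element types. The reading below is an informal sketch based on the names used in the example further down; `src_ptr`, `dst_ptr`, `stride` and the `*_frag` values are placeholders.
```julia
# llvm_wmma_<op>_<matrix>_<layout(s)>_<shape>_<addressing>_<type(s)>
WMMA.llvm_wmma_load_a_col_m16n16k16_stride_f16(src_ptr, stride)           # load A, col-major, 16×16×16, strided, Float16
WMMA.llvm_wmma_mma_col_col_m16n16k16_f32_f32(a_frag, b_frag, c_frag)      # MMA with col-major A and B, Float32 D and C
WMMA.llvm_wmma_store_d_col_m16n16k16_stride_f32(dst_ptr, d_frag, stride)  # store D, col-major, strided, Float32
```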

### Load matrix
```@docs
CUDAnative.WMMA.llvm_wmma_load
```

### Perform multiply-accumulate
```@docs
CUDAnative.WMMA.llvm_wmma_mma
```

### Store matrix
```@docs
CUDAnative.WMMA.llvm_wmma_store
```

### Example

````@eval
lines = readlines("../../../examples/wmma/low-level.jl")
start = findfirst(x -> x == "### START", lines) + 1
stop = findfirst(x -> x == "### END", lines) - 1
example = join(lines[start:stop], '\n')

using Markdown
Markdown.parse("""
```julia
$(example)
```
""")
````

## CUDA C-like API

The main difference between the CUDA C-like API and the lower-level wrappers is that the former enforces several constraints when working with WMMA.
For example, it ensures that the ``A`` fragment argument to the MMA instruction was obtained by a `load_a` call, and not by a `load_b` or `load_c`.
Additionally, it makes sure that the data type and storage layout of the load/store operations and the MMA operation match.

The CUDA C-like API heavily uses Julia's dispatch mechanism.
As such, the method names are much shorter than those of the LLVM intrinsic wrappers, since most information is encoded in the argument types rather than in the method name.


Note that, in CUDA C++, the fragment is responsible for both the storage of intermediate results and the WMMA configuration.
All CUDA C++ WMMA calls are function templates that take the resultant fragment as a by-reference argument.
As a result, the type of this argument can be used during overload resolution to select the correct WMMA instruction to call.

In contrast, the API in Julia separates the WMMA storage ([`WMMA.Fragment`](@ref)) and configuration ([`WMMA.Config`](@ref)).
Instead of taking the resultant fragment by reference, the Julia functions just return it.
This makes the dataflow clearer, but it also means that the type of that fragment cannot be used to select the correct WMMA instruction.
Thus, there remains a limited amount of information that cannot be inferred from the argument types, but must nonetheless match across all WMMA operations, such as the overall shape of the MMA.
This information is captured by a separate "WMMA configuration" (see [`WMMA.Config`](@ref)) that you create once, and then pass as an argument to all WMMA operations.
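
For instance, the 16×16×16 configuration with a `Float32` result used in the example below is created once and threaded through every call. This is a sketch mirroring that example; `a_dev`, `b_dev`, `c_dev` and `d_dev` are device arrays set up as shown there.
```julia
conf = WMMA.Config{16, 16, 16, Float32}    # shape (M, N, K) and the element type of D

a_frag = WMMA.load_a(pointer(a_dev), 16, WMMA.ColMajor, conf)
b_frag = WMMA.load_b(pointer(b_dev), 16, WMMA.ColMajor, conf)
c_frag = WMMA.load_c(pointer(c_dev), 16, WMMA.ColMajor, conf)
d_frag = WMMA.mma(a_frag, b_frag, c_frag, conf)
WMMA.store_d(pointer(d_dev), d_frag, 16, WMMA.ColMajor, conf)
```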

### Fragment
```@docs
CUDAnative.WMMA.FragmentLayout
CUDAnative.WMMA.RowMajor
CUDAnative.WMMA.ColMajor
CUDAnative.WMMA.Unspecified
CUDAnative.WMMA.Fragment
```

### WMMA configuration
```@docs
CUDAnative.WMMA.Config
```

### Load matrix
```@docs
CUDAnative.WMMA.load_a
CUDAnative.WMMA.load_b
CUDAnative.WMMA.load_c
```

### Perform multiply-accumulate
```@docs
CUDAnative.WMMA.mma
```

### Store matrix
```@docs
CUDAnative.WMMA.store_d
```

### Fill fragment
```@docs
CUDAnative.WMMA.fill_c
```
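
A common use is constructing an all-zero accumulator when there is no ``C`` matrix to load. A minimal sketch, assuming (as in the high-level example) a 16×16×16 `Float32` configuration, that `fill_c` takes the fill value followed by the configuration, and that `a_frag`/`b_frag` have already been loaded:
```julia
conf = WMMA.Config{16, 16, 16, Float32}

c_frag = WMMA.fill_c(0.0f0, conf)               # accumulator fragment initialised to zero
d_frag = WMMA.mma(a_frag, b_frag, c_frag, conf)
```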

### Element access and broadcasting

Similar to the CUDA C++ WMMA API, [`WMMA.Fragment`](@ref)s have an `x` member that can be used to access individual elements.
Note that, in contrast to the values returned by the LLVM intrinsics, the `x` member is flattened.
For example, while the `Float16` variants of the `load_a` intrinsics return `NTuple{8, NTuple{2, VecElement{Float16}}}`, the `x` member has type `NTuple{16, Float16}`.
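
For illustration, a minimal device-side sketch of working with the flattened elements directly; `conf` and `c_dev` are assumed to be set up as in the examples.
```julia
c_frag = WMMA.load_c(pointer(c_dev), 16, WMMA.ColMajor, conf)

# c_frag.x is a plain NTuple, so it can be indexed and iterated like any tuple
total = 0.0f0
for i in 1:length(c_frag.x)
    total += c_frag.x[i]
end
```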

Typically, you will only need to access the `x` member to perform elementwise operations.
This can be more succinctly expressed using Julia's broadcast mechanism.
For example, to double each element in a fragment, you can simply use:
```julia
frag = 2.0f0 .* frag
```

### Example

````@eval
lines = readlines("../../../examples/wmma/high-level.jl")
start = findfirst(x -> x == "### START", lines) + 1
stop = findfirst(x -> x == "### END", lines) - 1
example = join(lines[start:stop], '\n')

using Markdown
Markdown.parse("""
```julia
$(example)
```
""")
````
46 changes: 46 additions & 0 deletions examples/wmma/high-level.jl
@@ -0,0 +1,46 @@
# Need https://github.com/JuliaLang/julia/pull/33970
# and https://github.com/JuliaLang/julia/pull/34043
if VERSION < v"1.4.0-DEV.666"
exit()
end

using CUDAnative
if CUDAnative.current_capability() < v"7.0"
exit()
end

### START
using CUDAnative
using CuArrays
using Test

a = rand(Float16, (16, 16))
b = rand(Float16, (16, 16))
c = rand(Float32, (16, 16))

a_dev = CuArray(a)
b_dev = CuArray(b)
c_dev = CuArray(c)
d_dev = similar(c_dev)

function kernel(a_dev, b_dev, c_dev, d_dev)
    conf = WMMA.Config{16, 16, 16, Float32}

    a_frag = WMMA.load_a(pointer(a_dev), 16, WMMA.ColMajor, conf)
    b_frag = WMMA.load_b(pointer(b_dev), 16, WMMA.ColMajor, conf)
    c_frag = WMMA.load_c(pointer(c_dev), 16, WMMA.ColMajor, conf)

    c_frag = 0.5f0 .* c_frag

    d_frag = WMMA.mma(a_frag, b_frag, c_frag, conf)

    WMMA.store_d(pointer(d_dev), d_frag, 16, WMMA.ColMajor, conf)

    return
end

@cuda threads=32 kernel(a_dev, b_dev, c_dev, d_dev)
d = Array(d_dev)

@test all(isapprox.(a * b + 0.5 * c, d; rtol=0.01))
### END
42 changes: 42 additions & 0 deletions examples/wmma/low-level.jl
@@ -0,0 +1,42 @@
# Need https://github.com/JuliaLang/julia/pull/33970
# and https://github.com/JuliaLang/julia/pull/34043
if VERSION < v"1.4.0-DEV.666"
exit()
end

using CUDAnative
if CUDAnative.current_capability() < v"7.0"
exit()
end

### START
using CUDAnative
using CuArrays
using Test

# Generate input matrices
a = rand(Float16, (16, 16))
a_dev = CuArray(a)
b = rand(Float16, (16, 16))
b_dev = CuArray(b)
c = rand(Float32, (16, 16))
c_dev = CuArray(c)

# Allocate space for result
d_dev = similar(c_dev)

# Matrix multiply-accumulate kernel (D = A * B + C)
function kernel(a_dev, b_dev, c_dev, d_dev)
    a_frag = WMMA.llvm_wmma_load_a_col_m16n16k16_stride_f16(pointer(a_dev), 16)
    b_frag = WMMA.llvm_wmma_load_b_col_m16n16k16_stride_f16(pointer(b_dev), 16)
    c_frag = WMMA.llvm_wmma_load_c_col_m16n16k16_stride_f32(pointer(c_dev), 16)

    d_frag = WMMA.llvm_wmma_mma_col_col_m16n16k16_f32_f32(a_frag, b_frag, c_frag)

    WMMA.llvm_wmma_store_d_col_m16n16k16_stride_f32(pointer(d_dev), d_frag, 16)
    return
end

@cuda threads=32 kernel(a_dev, b_dev, c_dev, d_dev)
@test all(isapprox.(a * b + c, Array(d_dev); rtol=0.01))
### END
1 change: 1 addition & 0 deletions src/device/cuda.jl
@@ -11,6 +11,7 @@ include("cuda/assertion.jl")
include("cuda/memory_dynamic.jl")
include("cuda/atomics.jl")
include("cuda/misc.jl")
include("cuda/wmma.jl")

# functionality from libdevice
#
5 changes: 3 additions & 2 deletions src/device/cuda/memory_shared.jl
@@ -83,8 +83,9 @@ end
 initializer!(gv, null(gv_typ))
 end
 # by requesting a larger-than-datatype alignment, we might be able to vectorize.
-# we pick 16 bytes since this is the largest transaction size as supported by PTX.
-alignment!(gv, Base.max(16, datatype_align(T)))
+# we pick 32 bytes here, since WMMA instructions require 32-byte alignment.
+# TODO: Make the alignment configurable
+alignment!(gv, Base.max(32, datatype_align(T)))

 # generate IR
 Builder(JuliaContext()) do builder