This repository has been archived by the owner on May 27, 2021. It is now read-only.

Implement wrappers for WMMA LLVM intrinsics #494

Merged · 91 commits · Feb 3, 2020
Commits (91)
7561da6
Implement wrappers for WMMA LLVM intrinsics
thomasfaingnaert Nov 9, 2019
8f4f2d1
Implement basic CUDA-style API
thomasfaingnaert Nov 10, 2019
23d9552
Generalise load for matrix
thomasfaingnaert Nov 10, 2019
faae545
Implement wrappers for WMMA LLVM intrinsics
thomasfaingnaert Nov 9, 2019
0efeaaa
Merge branch 'wmma-wrapper' into wmma-cuda
thomasfaingnaert Nov 10, 2019
844f28e
Move high-level API to same file
thomasfaingnaert Nov 10, 2019
91d6ee7
Finish load
thomasfaingnaert Nov 10, 2019
db17cc6
Wrapper for store
thomasfaingnaert Nov 10, 2019
53657fe
Generalise MMA
thomasfaingnaert Nov 10, 2019
7a0b1dc
Generalise high level test
thomasfaingnaert Nov 10, 2019
d0e490c
Move d type to config
thomasfaingnaert Nov 10, 2019
7ec4877
Add fill_fragment function
thomasfaingnaert Nov 10, 2019
cf3ba19
Add tests for multiply
thomasfaingnaert Nov 10, 2019
740559a
Add configuration variable to fill
thomasfaingnaert Nov 10, 2019
44898c6
Add documentation
thomasfaingnaert Nov 10, 2019
cb753da
Add documentation
thomasfaingnaert Nov 10, 2019
c43a50c
Merge branch 'wmma-wrapper' into wmma-cuda
thomasfaingnaert Nov 10, 2019
381e25a
Add initial documentation CUDA style API
thomasfaingnaert Nov 10, 2019
3ed3a17
Finalise documentation
thomasfaingnaert Nov 11, 2019
722cb5a
Merge remote-tracking branch 'upstream/master' into wmma-wrapper
thomasfaingnaert Nov 14, 2019
18394c1
Implement tests for shared address space
thomasfaingnaert Nov 15, 2019
0b8fff8
Change default shared memory alignment
thomasfaingnaert Nov 23, 2019
3d22453
Merge branch 'wmma-wrapper' into wmma-cuda
thomasfaingnaert Nov 23, 2019
410a12b
Change equality test
thomasfaingnaert Nov 25, 2019
67fbbf7
Merge branch 'wmma-wrapper' into wmma-cuda
thomasfaingnaert Nov 25, 2019
ab54af0
Change equality test
thomasfaingnaert Nov 25, 2019
f44388a
Change load to ccall
thomasfaingnaert Nov 26, 2019
b1865dd
Use ccall for store
thomasfaingnaert Nov 26, 2019
865aac5
Cleanup store
thomasfaingnaert Nov 27, 2019
4e2cb3c
Fix wmma
thomasfaingnaert Nov 27, 2019
c85e4d5
Fix shared tests
thomasfaingnaert Nov 27, 2019
84f43f0
Clean up wrappers
thomasfaingnaert Nov 27, 2019
6b6e65b
Fix indenting
thomasfaingnaert Nov 27, 2019
2455e3b
Fix typo
thomasfaingnaert Nov 27, 2019
4d39b9b
Fix typo
thomasfaingnaert Nov 27, 2019
579a060
Cleanup addrspacecast
thomasfaingnaert Nov 27, 2019
5606368
Re-enable tests
thomasfaingnaert Nov 27, 2019
fc108a6
Fix intrinsics for LLVM 8
thomasfaingnaert Nov 28, 2019
3c16a20
Fix shared tests
thomasfaingnaert Nov 28, 2019
2d2b592
Clean up tests
thomasfaingnaert Nov 28, 2019
b8de94b
Merge branch 'wmma-cuda' into wmma-wrapper
thomasfaingnaert Nov 28, 2019
2d0c7cf
Fixes
thomasfaingnaert Nov 28, 2019
9373c55
Reenable tests
thomasfaingnaert Nov 28, 2019
ba6ff5b
Add whitespace
thomasfaingnaert Nov 29, 2019
06f1f1e
Use separate frag size variable for high-level API
thomasfaingnaert Nov 29, 2019
f5ebc8e
Implement flattening
thomasfaingnaert Nov 29, 2019
08a6e6c
Test elementwise op
thomasfaingnaert Nov 29, 2019
05ec4a5
Change comment
thomasfaingnaert Nov 29, 2019
a83bbfa
Only run WMMA test for recent Julia
thomasfaingnaert Nov 30, 2019
062038d
Reenable other tests
thomasfaingnaert Nov 30, 2019
62607ef
Add minimum Julia version to documentation
thomasfaingnaert Nov 30, 2019
df1ca7d
Refactor load to use @generated
thomasfaingnaert Dec 1, 2019
9fd6b74
Refactor store to use @generated
thomasfaingnaert Dec 1, 2019
c7eef61
Refactor fill to use @generated
thomasfaingnaert Dec 1, 2019
0d6777c
Refactor wmma to use @generated
thomasfaingnaert Dec 1, 2019
702d372
Cleanup
thomasfaingnaert Dec 1, 2019
9a3036c
Reenable tests
thomasfaingnaert Dec 1, 2019
e4fe143
Implement broadcasting
thomasfaingnaert Dec 2, 2019
556bbdf
Use correct type for alpha and beta
thomasfaingnaert Dec 3, 2019
40ff035
Add tests for flattening and unflattening
thomasfaingnaert Dec 4, 2019
3911871
Add tests for broadcasting
thomasfaingnaert Dec 4, 2019
70f35df
Add CUDAnative prefix
thomasfaingnaert Dec 4, 2019
c99d9f6
Adhere to Julia naming convention for types
thomasfaingnaert Dec 6, 2019
765a7c2
Capitalise WMMA
thomasfaingnaert Dec 6, 2019
10828dc
Merge remote-tracking branch 'upstream/master' into wmma-wrapper
thomasfaingnaert Dec 7, 2019
e5e6965
Move examples to separate folder
thomasfaingnaert Dec 7, 2019
76ed2bb
Only run WMMA examples for recent Julia
thomasfaingnaert Dec 7, 2019
eff97ae
Merge remote-tracking branch 'upstream/master' into wmma-wrapper
thomasfaingnaert Dec 10, 2019
e181320
Undo changes to Project and Manifest file
thomasfaingnaert Dec 10, 2019
841199e
Document flattening and broadcast
thomasfaingnaert Dec 10, 2019
86af328
Bump minimum Julia version
thomasfaingnaert Dec 10, 2019
6d376f3
Use exit() in examples
thomasfaingnaert Dec 10, 2019
c50beed
Temporarily disable test
thomasfaingnaert Dec 10, 2019
1b70c90
Bump min SM for Julia nightly tests
thomasfaingnaert Dec 12, 2019
0563679
Check capability in WMMA tests
thomasfaingnaert Dec 12, 2019
3b8391d
Reenable unflatten test
thomasfaingnaert Dec 15, 2019
a64194d
Merge remote-tracking branch 'upstream/master' into wmma-wrapper
thomasfaingnaert Dec 26, 2019
fdc8dde
Update version check
thomasfaingnaert Dec 26, 2019
b869380
Merge remote-tracking branch 'upstream/master' into wmma-wrapper
thomasfaingnaert Jan 31, 2020
3f9faac
Set CI_THOROUGH
thomasfaingnaert Jan 31, 2020
a1b64f2
Mark constants as 'const'
thomasfaingnaert Jan 31, 2020
d215a56
Remove join_nonempty
thomasfaingnaert Jan 31, 2020
d66f738
Implement indexing for WMMAFragment
thomasfaingnaert Jan 31, 2020
dad2a66
Refactor conversion of AS to Int
thomasfaingnaert Jan 31, 2020
28f44b5
Move LLVM intrinsics doc to docstrings
thomasfaingnaert Jan 31, 2020
9d9d5e2
Fix path to examples in docs
thomasfaingnaert Jan 31, 2020
ae19e03
Move everything in WMMA submodule
thomasfaingnaert Feb 1, 2020
3c95dcf
Fix docs
thomasfaingnaert Feb 1, 2020
94ab442
Fix indenting
thomasfaingnaert Feb 1, 2020
3579373
Add broadcasting to example
thomasfaingnaert Feb 1, 2020
93c77bc
Small doc fix
thomasfaingnaert Feb 1, 2020
Changes from all commits
3 changes: 3 additions & 0 deletions .gitlab-ci.yml
@@ -26,6 +26,9 @@ julia:nightly:
- .test
tags:
- nvidia
- sm_75
variables:
CI_THOROUGH: 'true'
allow_failure: true


1 change: 1 addition & 0 deletions docs/make.jl
@@ -25,6 +25,7 @@ function main()
],
"Device" => [
"device/cuda.md",
"device/wmma.md",
"device/array.md"
]
]
178 changes: 178 additions & 0 deletions docs/src/device/wmma.md
@@ -0,0 +1,178 @@
# WMMA

This section details CUDAnative's interface to CUDA's warp matrix multiply-accumulate (WMMA) operations.
This interface enables programmatic access to Tensor Cores, a new hardware feature in Volta that performs mixed-precision matrix MAC operations.

Access to WMMA using CUDAnative is available at two levels: low-level wrappers around the LLVM intrinsics, and a higher-level API similar to that of CUDA C.

Note that to use the WMMA intrinsics, you need a sufficiently recent version of Julia: `v1.4.0-DEV.666` or later.
You can check this by running the following in the REPL:
```julia
VERSION >= v"1.4.0-DEV.666"
```

!!! note

If you're running into any of the following errors while using the WMMA interfaces:
```
LLVM error: Do not know how to split the result of this operator!
```
or
```
CUDA error: a PTX JIT compilation failed (code 218, ERROR_INVALID_PTX)
ptxas application ptx input, line <line>; error : .aligned modifier required for instruction '<instr>'
```
then make sure you are running Julia v1.4.0-DEV.666 or later!

## Terminology

The WMMA operations perform a matrix multiply-accumulate.
More concretely, they calculate ``D = A \cdot B + C``, where ``A`` is an ``M \times K`` matrix, ``B`` is a ``K \times N`` matrix, and ``C`` and ``D`` are ``M \times N`` matrices.

Note that not all values of ``M``, ``N`` and ``K`` are allowed.
The tuple ``(M, N, K)`` is often called the "shape" of the multiply-accumulate operation.

The multiply-accumulate consists of the following steps:
- Load the matrices ``A``, ``B`` and ``C`` from memory to registers using a WMMA load operation.
- Perform the matrix multiply-accumulate of ``A``, ``B`` and ``C`` to obtain ``D`` using a WMMA MMA operation. ``D`` is stored in hardware registers after this step.
- Store the result ``D`` back to memory using a WMMA store operation.

Note that WMMA is a warp-wide operation, which means that all threads in a warp must cooperate, and execute the WMMA operations in lockstep.
Failure to do so will result in undefined behaviour.

Each thread in a warp will hold a part of the matrix in its registers.
In WMMA parlance, this part is referred to as a "fragment".
Note that the exact mapping between matrix elements and fragments is unspecified, and subject to change in future versions.

Finally, it is important to note that the resultant ``D`` matrix can be used as a ``C`` matrix for a subsequent multiply-accumulate.
This is useful if one needs to calculate a sum of the form ``\sum_{i=0}^{n} A_i B_i``, where ``A_i`` and ``B_i`` are matrices of the correct dimension.
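
As an illustration, here is a minimal device-side sketch of such an accumulation using the high-level API documented further down. The kernel name, the contiguous column-major storage of the ``A_i`` and ``B_i`` matrices behind `a_ptr`/`b_ptr`, and the byte-offset pointer arithmetic are assumptions made for the sake of the example; `fill_c` is assumed to take the fill value followed by the configuration.
```julia
# Sketch: D = Σᵢ Aᵢ·Bᵢ for n 16×16 Float16 matrices stored contiguously
# (column-major) behind a_ptr and b_ptr, accumulating into Float32.
function sum_mma_kernel(a_ptr, b_ptr, d_ptr, n)
    conf = WMMA.Config{16, 16, 16, Float32}

    acc = WMMA.fill_c(0.0f0, conf)                    # start from an all-zero C fragment
    for i in 0:n-1
        offset = i * 16 * 16 * sizeof(Float16)        # byte offset of the i-th matrix
        a_frag = WMMA.load_a(a_ptr + offset, 16, WMMA.ColMajor, conf)
        b_frag = WMMA.load_b(b_ptr + offset, 16, WMMA.ColMajor, conf)
        acc    = WMMA.mma(a_frag, b_frag, acc, conf)  # result feeds the next iteration's C
    end

    WMMA.store_d(d_ptr, acc, 16, WMMA.ColMajor, conf)
    return
end

# launched from the host with one warp, e.g.:
# @cuda threads=32 sum_mma_kernel(pointer(a_dev), pointer(b_dev), pointer(d_dev), n)
```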

## LLVM Intrinsics
Review comment (member): From a user perspective, I would first want to read about the C-like/high-level API, and then, if I am interested, about the intrinsics.


The LLVM intrinsics are accessible by using the one-to-one Julia wrappers.
The return type of each wrapper is the Julia type that corresponds most closely to the return type of the LLVM intrinsic.
For example, LLVM's `[8 x <2 x half>]` becomes `NTuple{8, NTuple{2, VecElement{Float16}}}` in Julia.
In essence, these wrappers return the SSA values returned by the LLVM intrinsic.
Currently, all intrinsics that are available in LLVM 6, PTX 6.0 and SM 70 are implemented.

These LLVM intrinsics are then lowered to the correct PTX instructions by the LLVM NVPTX backend.
For more information about the PTX instructions, please refer to the [PTX Instruction Set Architecture Manual](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions).

The LLVM intrinsics are subdivided into three categories: load, store and multiply-accumulate.
In what follows, each of these will be discussed.
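
As a quick orientation, the wrapper names encode the operation, the matrix, the layout, the shape, the addressing mode and the element types. The reading below is an informal sketch based on the names used in the example further down; `src_ptr`, `dst_ptr`, `stride` and the `*_frag` values are placeholders.
```julia
# llvm_wmma_<op>_<matrix>_<layout(s)>_<shape>_<addressing>_<type(s)>
WMMA.llvm_wmma_load_a_col_m16n16k16_stride_f16(src_ptr, stride)           # load A, col-major, 16×16×16, strided, Float16
WMMA.llvm_wmma_mma_col_col_m16n16k16_f32_f32(a_frag, b_frag, c_frag)      # MMA with col-major A and B, Float32 D and C
WMMA.llvm_wmma_store_d_col_m16n16k16_stride_f32(dst_ptr, d_frag, stride)  # store D, col-major, strided, Float32
```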

### Load matrix
```@docs
CUDAnative.WMMA.llvm_wmma_load
```

### Perform multiply-accumulate
```@docs
CUDAnative.WMMA.llvm_wmma_mma
```

### Store matrix
```@docs
CUDAnative.WMMA.llvm_wmma_store
```

### Example

````@eval
lines = readlines("../../../examples/wmma/low-level.jl")
start = findfirst(x -> x == "### START", lines) + 1
stop = findfirst(x -> x == "### END", lines) - 1
example = join(lines[start:stop], '\n')

using Markdown
Markdown.parse("""
```julia
$(example)
```
""")
````

## CUDA C-like API

The main difference between the CUDA C-like API and the lower-level wrappers is that the former enforces several constraints when working with WMMA.
For example, it ensures that the ``A`` fragment argument to the MMA instruction was obtained by a `load_a` call, and not by a `load_b` or `load_c`.
Additionally, it makes sure that the data type and storage layout of the load/store operations and the MMA operation match.

The CUDA C-like API heavily uses Julia's dispatch mechanism.
As such, the method names are much shorter than those of the LLVM intrinsic wrappers, since most information is encoded in the argument types rather than in the method name.


Note that, in CUDA C++, the fragment is responsible for both the storage of intermediate results and the WMMA configuration.
All CUDA C++ WMMA calls are function templates that take the resultant fragment as a by-reference argument.
As a result, the type of this argument can be used during overload resolution to select the correct WMMA instruction to call.

In contrast, the API in Julia separates the WMMA storage ([`WMMA.Fragment`](@ref)) and configuration ([`WMMA.Config`](@ref)).
Instead of taking the resultant fragment by reference, the Julia functions just return it.
This makes the dataflow clearer, but it also means that the type of that fragment cannot be used to select the correct WMMA instruction.
Thus, there remains a limited amount of information that cannot be inferred from the argument types, but must nonetheless match across all WMMA operations, such as the overall shape of the MMA.
This information is captured by a separate "WMMA configuration" (see [`WMMA.Config`](@ref)) that you create once, and then pass as an argument to all WMMA operations.
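
For instance, the 16×16×16 configuration with a `Float32` result used in the example below is created once and threaded through every call. This is a sketch mirroring that example; `a_dev`, `b_dev`, `c_dev` and `d_dev` are device arrays set up as shown there.
```julia
conf = WMMA.Config{16, 16, 16, Float32}    # shape (M, N, K) and the element type of D

a_frag = WMMA.load_a(pointer(a_dev), 16, WMMA.ColMajor, conf)
b_frag = WMMA.load_b(pointer(b_dev), 16, WMMA.ColMajor, conf)
c_frag = WMMA.load_c(pointer(c_dev), 16, WMMA.ColMajor, conf)
d_frag = WMMA.mma(a_frag, b_frag, c_frag, conf)
WMMA.store_d(pointer(d_dev), d_frag, 16, WMMA.ColMajor, conf)
```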

### Fragment
```@docs
CUDAnative.WMMA.FragmentLayout
CUDAnative.WMMA.RowMajor
CUDAnative.WMMA.ColMajor
CUDAnative.WMMA.Unspecified
CUDAnative.WMMA.Fragment
```

### WMMA configuration
```@docs
CUDAnative.WMMA.Config
```

### Load matrix
```@docs
CUDAnative.WMMA.load_a
CUDAnative.WMMA.load_b
CUDAnative.WMMA.load_c
```

### Perform multiply-accumulate
```@docs
CUDAnative.WMMA.mma
```

### Store matrix
```@docs
CUDAnative.WMMA.store_d
```

### Fill fragment
```@docs
CUDAnative.WMMA.fill_c
```
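
A common use is constructing an all-zero accumulator when there is no ``C`` matrix to load. A minimal sketch, assuming (as in the high-level example) a 16×16×16 `Float32` configuration, that `fill_c` takes the fill value followed by the configuration, and that `a_frag`/`b_frag` have already been loaded:
```julia
conf = WMMA.Config{16, 16, 16, Float32}

c_frag = WMMA.fill_c(0.0f0, conf)               # accumulator fragment initialised to zero
d_frag = WMMA.mma(a_frag, b_frag, c_frag, conf)
```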

### Element access and broadcasting

Similar to the CUDA C++ WMMA API, [`WMMA.Fragment`](@ref)s have an `x` member that can be used to access individual elements.
Note that, in contrast to the values returned by the LLVM intrinsics, the `x` member is flattened.
For example, while the `Float16` variants of the `load_a` intrinsics return `NTuple{8, NTuple{2, VecElement{Float16}}}`, the `x` member has type `NTuple{16, Float16}`.
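
For illustration, a minimal device-side sketch of working with the flattened elements directly; `conf` and `c_dev` are assumed to be set up as in the examples.
```julia
c_frag = WMMA.load_c(pointer(c_dev), 16, WMMA.ColMajor, conf)

# c_frag.x is a plain NTuple, so it can be indexed and iterated like any tuple
total = 0.0f0
for i in 1:length(c_frag.x)
    total += c_frag.x[i]
end
```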

Typically, you will only need to access the `x` member to perform elementwise operations.
This can be more succinctly expressed using Julia's broadcast mechanism.
For example, to double each element in a fragment, you can simply use:
```julia
frag = 2.0f0 .* frag
```

### Example

````@eval
lines = readlines("../../../examples/wmma/high-level.jl")
start = findfirst(x -> x == "### START", lines) + 1
stop = findfirst(x -> x == "### END", lines) - 1
example = join(lines[start:stop], '\n')

using Markdown
Markdown.parse("""
```julia
$(example)
```
""")
````
46 changes: 46 additions & 0 deletions examples/wmma/high-level.jl
@@ -0,0 +1,46 @@
# Need https://github.com/JuliaLang/julia/pull/33970
# and https://github.com/JuliaLang/julia/pull/34043
if VERSION < v"1.4.0-DEV.666"
exit()
end

using CUDAnative
if CUDAnative.current_capability() < v"7.0"
exit()
end

### START
using CUDAnative
using CuArrays
using Test

a = rand(Float16, (16, 16))
b = rand(Float16, (16, 16))
c = rand(Float32, (16, 16))

a_dev = CuArray(a)
b_dev = CuArray(b)
c_dev = CuArray(c)
d_dev = similar(c_dev)

function kernel(a_dev, b_dev, c_dev, d_dev)
    conf = WMMA.Config{16, 16, 16, Float32}

    a_frag = WMMA.load_a(pointer(a_dev), 16, WMMA.ColMajor, conf)
    b_frag = WMMA.load_b(pointer(b_dev), 16, WMMA.ColMajor, conf)
    c_frag = WMMA.load_c(pointer(c_dev), 16, WMMA.ColMajor, conf)

    c_frag = 0.5f0 .* c_frag

    d_frag = WMMA.mma(a_frag, b_frag, c_frag, conf)

    WMMA.store_d(pointer(d_dev), d_frag, 16, WMMA.ColMajor, conf)

    return
end

@cuda threads=32 kernel(a_dev, b_dev, c_dev, d_dev)
d = Array(d_dev)

@test all(isapprox.(a * b + 0.5 * c, d; rtol=0.01))
### END
42 changes: 42 additions & 0 deletions examples/wmma/low-level.jl
@@ -0,0 +1,42 @@
# Need https://github.com/JuliaLang/julia/pull/33970
# and https://github.com/JuliaLang/julia/pull/34043
if VERSION < v"1.4.0-DEV.666"
exit()
end

using CUDAnative
if CUDAnative.current_capability() < v"7.0"
exit()
end

### START
using CUDAnative
using CuArrays
using Test

# Generate input matrices
a = rand(Float16, (16, 16))
a_dev = CuArray(a)
b = rand(Float16, (16, 16))
b_dev = CuArray(b)
c = rand(Float32, (16, 16))
c_dev = CuArray(c)

# Allocate space for result
d_dev = similar(c_dev)

# Matrix multiply-accumulate kernel (D = A * B + C)
function kernel(a_dev, b_dev, c_dev, d_dev)
    a_frag = WMMA.llvm_wmma_load_a_col_m16n16k16_stride_f16(pointer(a_dev), 16)
    b_frag = WMMA.llvm_wmma_load_b_col_m16n16k16_stride_f16(pointer(b_dev), 16)
    c_frag = WMMA.llvm_wmma_load_c_col_m16n16k16_stride_f32(pointer(c_dev), 16)

    d_frag = WMMA.llvm_wmma_mma_col_col_m16n16k16_f32_f32(a_frag, b_frag, c_frag)

    WMMA.llvm_wmma_store_d_col_m16n16k16_stride_f32(pointer(d_dev), d_frag, 16)
    return
end

@cuda threads=32 kernel(a_dev, b_dev, c_dev, d_dev)
@test all(isapprox.(a * b + c, Array(d_dev); rtol=0.01))
### END
1 change: 1 addition & 0 deletions src/device/cuda.jl
@@ -11,6 +11,7 @@ include("cuda/assertion.jl")
include("cuda/memory_dynamic.jl")
include("cuda/atomics.jl")
include("cuda/misc.jl")
include("cuda/wmma.jl")

# functionality from libdevice
#
5 changes: 3 additions & 2 deletions src/device/cuda/memory_shared.jl
@@ -83,8 +83,9 @@ end
 initializer!(gv, null(gv_typ))
 end
 # by requesting a larger-than-datatype alignment, we might be able to vectorize.
-# we pick 16 bytes since this is the largest transaction size as supported by PTX.
-alignment!(gv, Base.max(16, datatype_align(T)))
+# we pick 32 bytes here, since WMMA instructions require 32-byte alignment.
+# TODO: Make the alignment configurable
+alignment!(gv, Base.max(32, datatype_align(T)))

 # generate IR
 Builder(JuliaContext()) do builder