Commit 78db02a
Fix some typos in performance tips (#2086)
[skip tests]
Zentrik committed Sep 20, 2023
1 parent 2fa6572 commit 78db02a
Showing 1 changed file with 5 additions and 5 deletions.
10 changes: 5 additions & 5 deletions docs/src/tutorials/performance.jl
@@ -8,7 +8,7 @@
# * Identify problematic kernel invocations: you may be launching thousands of kernels which could be fused into a single call;
# * Find stalls, where the CPU isn't submitting work fast enough to keep the GPU busy.

-# If that isn't sufficient, and you identified a kernel that executes slowly, you can try using NSight Compute to analyze that kernel in detail. Some things to look out for in order of importance
+# If that isn't sufficient, and you identified a kernel that executes slowly, you can try using NSight Compute to analyze that kernel in detail. Some things to look out for in order of importance:
# * Memory optimizations are the most important area for performance. Hence optimizing memory accesses, e.g., avoiding needless global accesses (buffering in shared memory instead) and coalescing accesses can lead to big performance improvements;
# * Launching more threads on each streaming multiprocessor can be achieved by lowering register pressure and reducing shared memory usage; the tips below outline the various ways in which register pressure can be reduced;
# * Using Float32's instead of Float64's can provide significantly better performance;
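
To make the Float32 point concrete, here is a minimal sketch (not part of the patch) showing how a bare Float64 literal silently promotes a Float32 computation:

```julia
using CUDA

x = CUDA.rand(Float32, 1024)

y = 2.0 .* x   # `2.0` is a Float64 literal, so the result is promoted to Float64
z = 2f0 .* x   # `2f0` is a Float32 literal, so everything stays Float32

eltype(y), eltype(z)  # (Float64, Float32)
```

The same applies inside kernels: prefer `2f0` or `Float32(2)` over `2.0` so intermediate values keep the element type of the input arrays.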
@@ -36,7 +36,7 @@

# The [CUDA C++ Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html) is relevant for Julia.

-# The following notebooks also have some good tips: [JuliaCon 2021 GPU Workshop](https://github.com/maleadt/juliacon21-gpu_workshop/blob/main/deep_dive/CUDA.ipynb), [Advanced Julia GPU Training](https://github.com/JuliaComputing/Training/tree/master/AdvancedGPU)
+# The following notebooks also have some good tips: [JuliaCon 2021 GPU Workshop](https://github.com/maleadt/juliacon21-gpu_workshop/blob/main/deep_dive/CUDA.ipynb), [Advanced Julia GPU Training](https://github.com/JuliaComputing/Training/tree/master/AdvancedGPU).

# Also see the [perf](https://github.com/JuliaGPU/CUDA.jl/tree/master/perf) folder for some optimised code examples.

@@ -47,7 +47,7 @@
# Many common operations can throw errors at runtime in Julia; they often do this by branching and calling a function in that branch, both of which are slow on GPUs. Using `@inbounds` when indexing into arrays will eliminate exceptions due to bounds checking. You can also use `assume` from the package LLVM.jl to get rid of exceptions, e.g.

# ```julia
-# using LLVM, LLVM.Interop
+# using LLVM.Interop

# function test(x, y)
# assume(x > 0)
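
The diff view cuts the example off here. For reference, a self-contained version of the same pattern is sketched below; the `div(y, x)` body is an illustrative completion, chosen because integer `div` carries an error branch for division by zero that `assume` can eliminate:

```julia
using LLVM.Interop

function test(x, y)
    assume(x > 0)  # promise the compiler that x is strictly positive...
    div(y, x)      # ...so the division-by-zero error branch can be dropped
end
```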
@@ -61,7 +61,7 @@

# ### 32-bit Integers

-# Use 32-bit integers where possible. A common source of register pressure is the use of 64-bit integers when only 32 bits are required. For example, the hardware's indices are 32-bit integers, but Julia's literals are Int64's, which results in expressions like `blockIdx().x - 1` being promoted to 64-bit integers. To use 32-bit integers we can instead replace the `1` with `Int32(1)`, or more succinctly `1i32` if you run `using CUDA: i32`
+# Use 32-bit integers where possible. A common source of register pressure is the use of 64-bit integers when only 32 bits are required. For example, the hardware's indices are 32-bit integers, but Julia's literals are Int64's, which results in expressions like `blockIdx().x - 1` being promoted to 64-bit integers. To use 32-bit integers we can instead replace the `1` with `Int32(1)`, or more succinctly `1i32` if you run `using CUDA: i32`.
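
As a quick illustration of that shorthand (a sketch, not part of the patch), `1i32` is simply an `Int32`-valued literal:

```julia
using CUDA: i32

1i32 isa Int32  # true: `1i32` is equivalent to `Int32(1)`

# inside a kernel this keeps the index arithmetic entirely 32-bit:
# i = (blockIdx().x - 1i32) * blockDim().x + threadIdx().x
```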

# To see how much of a difference this makes, let's use a kernel introduced in the [introduction](../introduction) for in-place addition.

@@ -89,7 +89,7 @@ end
# 29
# ```

-# Our kernel using 32-bit integers is below
+# Our kernel using 32-bit integers is below:

function gpu_add4!(y, x)
    index = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
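
To check the register count of the 32-bit version against the 29 registers reported above, a sketch using CUDA.jl's kernel reflection (assuming `x` and `y` are `CuArray`s as in the tutorial):

```julia
kernel = @cuda launch=false gpu_add4!(y, x)  # compile the kernel without launching it
CUDA.registers(kernel)                       # registers used per thread
```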
