Fix some typos in performance tips #2086

Merged · 1 commit · Sep 20, 2023
10 changes: 5 additions & 5 deletions docs/src/tutorials/performance.jl
@@ -8,7 +8,7 @@
# * Identify problematic kernel invocations: you may be launching thousands of kernels which could be fused into a single call;
# * Find stalls, where the CPU isn't submitting work fast enough to keep the GPU busy (see the profiling sketch below).

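A minimal way to gather this kind of information from the Julia side is CUDA.jl's integrated profiler; the sketch below assumes a recent CUDA.jl, and the array names and sizes are made up.

```julia
using CUDA

x = CUDA.rand(Float32, 1_000_000)
y = CUDA.rand(Float32, 1_000_000)

# Prints a summary of host and device activity for the wrapped code;
# thousands of short kernel launches, or long gaps where the device
# sits idle, point at fusion opportunities or CPU-side stalls.
CUDA.@profile y .= y .+ x
```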
- # If that isn't sufficient, and you identified a kernel that executes slowly, you can try using NSight Compute to analyze that kernel in detail. Some things to look out for in order of importance
+ # If that isn't sufficient, and you identified a kernel that executes slowly, you can try using NSight Compute to analyze that kernel in detail. Some things to look out for in order of importance:
# * Memory optimizations are the most important area for performance. Hence optimizing memory accesses, e.g., avoiding needless global accesses (buffering in shared memory instead) and coalescing accesses, can lead to big performance improvements;
# * Launching more threads on each streaming multiprocessor can be achieved by lowering register pressure and reducing shared memory usage; the tips below outline the various ways in which register pressure can be reduced;
# * Using Float32's instead of Float64's can provide significantly better performance (see the sketch below);
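To illustrate the Float32 point above: a stray Float64 literal silently promotes a whole computation to double precision. A minimal sketch (the array name and size are made up):

```julia
using CUDA

x = CUDA.rand(Float32, 1024)

y = 2.0 .* x   # Float64 literal: the result is promoted to a Float64 array
z = 2f0 .* x   # Float32 literal: everything stays in single precision
```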
@@ -36,7 +36,7 @@

# The [CUDA C++ Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html) is relevant for Julia.

- # The following notebooks also have some good tips: [JuliaCon 2021 GPU Workshop](https://github.com/maleadt/juliacon21-gpu_workshop/blob/main/deep_dive/CUDA.ipynb), [Advanced Julia GPU Training](https://github.com/JuliaComputing/Training/tree/master/AdvancedGPU)
+ # The following notebooks also have some good tips: [JuliaCon 2021 GPU Workshop](https://github.com/maleadt/juliacon21-gpu_workshop/blob/main/deep_dive/CUDA.ipynb), [Advanced Julia GPU Training](https://github.com/JuliaComputing/Training/tree/master/AdvancedGPU).

# Also see the [perf](https://github.com/JuliaGPU/CUDA.jl/tree/master/perf) folder for some optimised code examples.

@@ -47,7 +47,7 @@
# Many common operations can throw errors at runtime in Julia; they often do this by branching and calling a function in that branch, both of which are slow on GPUs. Using `@inbounds` when indexing into arrays will eliminate exceptions due to bounds checking. You can also use `assume` from the package LLVM.jl to get rid of exceptions, e.g.:

# ```julia
- # using LLVM, LLVM.Interop
+ # using LLVM.Interop

# function test(x, y)
# assume(x > 0)
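The diff view collapses the rest of this snippet; a complete, runnable sketch of the same idea (the `div` body and the `@inbounds` helper are assumptions based on the surrounding text) would be:

```julia
using LLVM.Interop

function test(x, y)
    # Promise the compiler that x is strictly positive, so the branch
    # that would throw a DivideError for x == 0 can be optimized away.
    assume(x > 0)
    div(y, x)
end

# `@inbounds` plays the same role for bounds checks:
first_elem(a) = @inbounds a[1]
```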
@@ -61,7 +61,7 @@

# ### 32-bit Integers

- # Use 32-bit integers where possible. A common source of register pressure is the use of 64-bit integers when only 32 bits are required. For example, the hardware's indices are 32-bit integers, but Julia's literals are Int64's, which results in expressions like `blockIdx().x - 1` being promoted to 64-bit integers. To use 32-bit integers we can instead replace the `1` with `Int32(1)` or, more succinctly, `1i32` if you run `using CUDA: i32`
+ # Use 32-bit integers where possible. A common source of register pressure is the use of 64-bit integers when only 32 bits are required. For example, the hardware's indices are 32-bit integers, but Julia's literals are Int64's, which results in expressions like `blockIdx().x - 1` being promoted to 64-bit integers. To use 32-bit integers we can instead replace the `1` with `Int32(1)` or, more succinctly, `1i32` if you run `using CUDA: i32`.

# To see how much of a difference this makes, let's use a kernel introduced in the [introduction](../introduction) for in-place addition.

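The 64-bit kernel that the register count below refers to is collapsed in this diff; for context, here is a sketch of that kernel together with one way to query register usage (the kernel body is assumed from the introduction page, and `CUDA.registers` is what produces counts like the `29` shown below):

```julia
using CUDA

x = CUDA.fill(1.0f0, 2^20)
y = CUDA.fill(2.0f0, 2^20)

function gpu_add3!(y, x)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x  # Int64 arithmetic
    stride = gridDim().x * blockDim().x
    for i = index:stride:length(y)
        @inbounds y[i] += x[i]
    end
    return nothing
end

# Compile without launching, then ask how many registers the kernel uses.
kernel = @cuda launch=false gpu_add3!(y, x)
CUDA.registers(kernel)
```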
@@ -89,7 +89,7 @@ end
# 29
# ```

- # Our kernel using 32-bit integers is below
+ # Our kernel using 32-bit integers is below:

function gpu_add4!(y, x)
index = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
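The diff cuts `gpu_add4!` off after its first line; assuming the body mirrors the 64-bit version (a sketch, not necessarily the file's exact content):

```julia
function gpu_add4!(y, x)
    index = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    for i = index:stride:length(y)
        @inbounds y[i] += x[i]
    end
    return nothing
end

# Comparing register counts should show the Int32 variant using fewer
# registers than the 64-bit kernel above.
kernel = @cuda launch=false gpu_add4!(y, x)
CUDA.registers(kernel)
```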