Commit

Update the documentation.

maleadt committed May 25, 2020
1 parent 1c12449 commit e2b6735

Showing 11 changed files with 172 additions and 207 deletions.
2 changes: 1 addition & 1 deletion docs/make.jl
@@ -1,5 +1,5 @@
using Documenter, Literate
-using CUDAapi, CUDAdrv, CUDAnative, CuArrays
+using CUDA

const src = "https://github.com/JuliaGPU/CUDA.jl"
const dst = "https://juliagpu.gitlab.io/CUDA.jl/"
70 changes: 35 additions & 35 deletions docs/src/development/profiling.md
@@ -10,28 +10,28 @@ packages, provide several tools and APIs to remedy this.
## Time measurements

To accurately measure execution time in the presence of asynchronously-executing kernels,
-CUDAdrv.jl provides an `@elapsed` macro that, much like `Base.@elapsed`, measures the total
+CUDA.jl provides an `@elapsed` macro that, much like `Base.@elapsed`, measures the total
execution time of a block of code on the GPU:

```julia
-julia> a = CuArrays.rand(1024,1024,1024);
+julia> a = CUDA.rand(1024,1024,1024);

julia> Base.@elapsed sin.(a) # WRONG!
0.008714211

-julia> CUDAdrv.@elapsed sin.(a)
+julia> CUDA.@elapsed sin.(a)
0.051607586f0
```

This macro is a low-level utility, assumes the GPU is synchronized before calling, and is
useful if you need execution timings in your application. For most purposes, you should use
-`CuArrays.@time` which mimics `Base.@time` by printing execution times as well as memory
+`CUDA.@time` which mimics `Base.@time` by printing execution times as well as memory
allocation stats:

```julia
-julia> a = CuArrays.rand(1024,1024,1024);
+julia> a = CUDA.rand(1024,1024,1024);

-julia> CuArrays.@time sin.(a);
+julia> CUDA.@time sin.(a);
0.046063 seconds (96 CPU allocations: 3.750 KiB) (1 GPU allocation: 4.000 GiB, 14.33% gc time of which 99.89% spent allocating)
```

@@ -43,13 +43,13 @@ For robust measurements however, it is advised to use the
[BenchmarkTools.jl](https://github.com/JuliaCI/BenchmarkTools.jl) package which goes to
great lengths to perform accurate measurements. Due to the asynchronous nature of GPUs, you
need to ensure the GPU is synchronized at the end of every sample, e.g. by calling
-`CUDAdrv.synchronize()`. An easier, and better-performing alternative is to use the `@sync`
-macro from the CuArrays.jl package:
+`synchronize()`. An easier and better-performing alternative is to use the unexported `@sync`
+macro:

```julia
-julia> a = CuArrays.rand(1024,1024,1024);
+julia> a = CUDA.rand(1024,1024,1024);

-julia> @benchmark CuArrays.@sync sin.($a)
+julia> @benchmark CUDA.@sync sin.($a)
BenchmarkTools.Trial:
memory estimate: 3.73 KiB
allocs estimate: 95
@@ -64,7 +64,7 @@ BenchmarkTools.Trial:
```

Note that the allocations as reported by BenchmarkTools are CPU allocations. For the GPU
-allocation behavior you need to consult `CuArrays.@time`.
+allocation behavior you need to consult `CUDA.@time`.


## Application profiling
@@ -76,19 +76,19 @@ find which kernels need optimization.
As we cannot use the Julia profiler for this task, we will be using external profiling
software as part of the CUDA toolkit. To inform those external tools which code needs to be
profiled (e.g., to exclude warm-up iterations or other noninteresting elements) you can use
-the `CUDAdrv.@profile` macro to surround interesting code with. Again, this macro mimics an
+the `CUDA.@profile` macro to surround interesting code with. Again, this macro mimics an
equivalent from the standard library, but this time requires external software to actually
perform the profiling:

```julia
-julia> a = CuArrays.rand(1024,1024,1024);
+julia> a = CUDA.rand(1024,1024,1024);

julia> sin.(a); # warmup

-julia> CUDAdrv.@profile sin.(a);
-┌ Warning: Calling CUDAdrv.@profile only informs an external profiler to start.
+julia> CUDA.@profile sin.(a);
+┌ Warning: Calling CUDA.@profile only informs an external profiler to start.
│ The user is responsible for launching Julia under a CUDA profiler like `nvprof`.
-└ @ CUDAdrv.Profile ~/Julia/pkg/CUDAdrv/src/profile.jl:42
+└ @ CUDA.Profile ~/Julia/pkg/CUDA/src/profile.jl:42
```

### `nvprof` and `nvvp`
@@ -99,19 +99,19 @@ julia> CUDAdrv.@profile sin.(a);
Prefer to use the Nsight tools described below.

For simple profiling, prefix your Julia command-line invocation with the `nvprof` utility.
-For a better timeline, be sure to use `CUDAdrv.@profile` to delimit interesting code and
+For a better timeline, be sure to use `CUDA.@profile` to delimit interesting code and
start `nvprof` with the option `--profile-from-start off`:

```
$ nvprof --profile-from-start off julia
-julia> using CuArrays, CUDAdrv
+julia> using CUDA
-julia> a = CuArrays.rand(1024,1024,1024);
+julia> a = CUDA.rand(1024,1024,1024);
julia> sin.(a);
-julia> CUDAdrv.@profile sin.(a);
+julia> CUDA.@profile sin.(a);
julia> exit()
==156406== Profiling application: julia
@@ -149,17 +149,17 @@ $ nsys launch julia

You can then execute whatever code you want in the REPL, including e.g. loading Revise so
that you can modify your application as you go. When you call into code that is wrapped by
-`CUDAdrv.@profile`, the profiler will become active and generate a profile output file in
+`CUDA.@profile`, the profiler will become active and generate a profile output file in
the current folder:

```julia
-julia> using CuArrays, CUDAdrv
+julia> using CUDA

-julia> a = CuArrays.rand(1024,1024,1024);
+julia> a = CUDA.rand(1024,1024,1024);

julia> sin.(a);

-julia> CUDAdrv.@profile sin.(a);
+julia> CUDA.@profile sin.(a);
start executed
Processing events...
Capturing symbol files...
@@ -175,7 +175,7 @@ stop executed
Even with a warm-up iteration, the first kernel or API call might seem to take
significantly longer in the profiler. If you are analyzing short executions, instead
of whole applications, repeat the operation twice (optionally separated by a call to
-`CUDAdrv.synchronize()` or wrapping in `CuArrays.@sync`)
+`CUDA.synchronize()` or wrapping in `CUDA.@sync`).
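
As a sketch, repeating the measurement this way (reusing `a` from the example above) might look like:

```julia
julia> CUDA.@profile sin.(a);   # first run: may still include one-time costs

julia> CUDA.synchronize()       # wait until the first run has fully completed

julia> CUDA.@profile sin.(a);   # second run: more representative timings
```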

You can open the resulting `.qdrep` file with `nsight-sys`:

@@ -186,7 +186,7 @@ You can open the resulting `.qdrep` file with `nsight-sys`:
If you want details on the execution properties of a kernel, or inspect API interactions,
Nsight Compute is the tool for you. It is again possible to use this profiler with an
interactive session of Julia, and debug or profile only those sections of your application
-that are marked with `CUDAdrv.@profile`.
+that are marked with `CUDA.@profile`.

Start with launching Julia under the Nsight Compute CLI tool:

Expand All @@ -197,14 +197,14 @@ $ nv-nsight-cu-cli --mode=launch julia
You will get an interactive REPL, where you can execute whatever code you want:

```julia
-julia> using CuArrays, CUDAdrv
+julia> using CUDA

# Julia hangs!
```

-As soon as you import any CUDA package, your Julia process will hang. This is expected, as
-the tool breaks upon the very first call to the CUDA API, at which point you are expected to
-launch the Nsight Compute GUI utility and attach to the running session:
+As soon as you import CUDA.jl, your Julia process will hang. This is expected, as the tool
+breaks upon the very first call to the CUDA API, at which point you are expected to launch
+the Nsight Compute GUI utility and attach to the running session:

!["NVIDIA Nsight Compute - Attaching to a session"](nsight_compute-attach.png)

@@ -215,11 +215,11 @@ You will see that the tool has stopped execution on the call to `cuInit`. Now ch
Now our CLI session comes to life again, and we can enter the rest of our script:

```julia
-julia> a = CuArrays.rand(1024,1024,1024);
+julia> a = CUDA.rand(1024,1024,1024);

julia> sin.(a);

-julia> CUDAdrv.@profile sin.(a);
+julia> CUDA.@profile sin.(a);
```

Once that's finished, the Nsight Compute GUI window will have plenty of details on our kernel:
@@ -236,10 +236,10 @@ the API calls that have been made:

If you want to put additional information in the profile, e.g. phases of your application,
or expensive CPU operations, you can use the NVTX library. Wrappers for this library are
-included in recent versions of CUDAnative:
+included in recent versions of CUDA.jl:

```julia
-using CUDAnative
+using CUDA

NVTX.@range "doing X" begin
    ...
end

NVTX.@mark "reached Y"
```
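
For reference, a runnable sketch of these NVTX annotations (the range and mark labels here are illustrative placeholders):

```julia
using CUDA

NVTX.@range "generating data" begin
    a = CUDA.rand(1024, 1024)   # this phase shows up as a named range in the timeline
end

NVTX.@mark "data ready"         # a point event in the timeline

NVTX.@range "computing" begin
    sin.(a)
end
```

@@ -252,7 +252,7 @@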
## Compiler options

Some tools, like `nvvp`, Nsight Systems, and Nsight Compute, also make it possible to do source-level
-profiling. CUDAnative will by default emit the necessary source line information, which you
+profiling. CUDA.jl will by default emit the necessary source line information, which you
can disable by launching Julia with `-g0`. Conversely, launching with `-g2` will emit
additional debug information, which can be useful in combination with tools like `cuda-gdb`,
but might hurt performance or code size.
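
As a sketch, the corresponding invocations (`script.jl` is a hypothetical entry point) would be:

```
$ julia -g0 script.jl    # strip debug info: source-level profiling unavailable
$ julia -g2 script.jl    # extra debug info, e.g. for use with `cuda-gdb`
```
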
8 changes: 4 additions & 4 deletions docs/src/faq.md
@@ -6,10 +6,10 @@ This page is a compilation of frequently asked questions and answers.
## Can you wrap this or that CUDA API?

If a certain API isn't wrapped with some high-level functionality, you can always use the
-underlying C APIs which are always available as unexported methods. For example, with
-CUDAdrv.jl you can access the CUDA driver library as `cu` prefixed, unexported functions
-like `CUDAdrv.cuDriverGetVersion`. Similarly, vendor libraries like CUBLAS are available
-through their modules in CuArrays.jl, e.g., `CuArrays.CUBLAS.cublasGetVersion_v2`.
+underlying C APIs which are always available as unexported methods. For example, you can
+access the CUDA driver library as `cu` prefixed, unexported functions like
+`CUDA.cuDriverGetVersion`. Similarly, vendor libraries like CUBLAS are available through
+their exported submodule handles, e.g., `CUBLAS.cublasGetVersion_v2`.
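
As an illustration, a minimal sketch of calling such a raw API (assuming a functional CUDA.jl installation; the exact generated wrapper signatures may differ between versions):

```julia
using CUDA

# query the driver version through the raw, unexported driver API
version = Ref{Cint}()
CUDA.cuDriverGetVersion(version)
println("CUDA driver version: ", version[])
```
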

Any help on designing or implementing high-level wrappers for this low-level functionality
is greatly appreciated, so please consider contributing your uses of these APIs on the
20 changes: 9 additions & 11 deletions docs/src/index.md
@@ -1,10 +1,8 @@
# CUDA programming in Julia

-Julia has several packages for programming NVIDIA GPUs using CUDA. Some of these packages
-focus on performance and flexibility, while others aim to raise the abstraction level and
-improve performance. This website will introduce the different options, how to use them, and
-what best to choose for your application. For more specific details, such as API references
-or development practices, refer to each package's own documentation.
+The CUDA.jl package is the main entrypoint for programming NVIDIA GPUs using CUDA. The
+package makes it possible to do so at various abstraction levels, from easy-to-use arrays
+down to hand-written kernels using low-level CUDA APIs.

If you have any questions, please feel free to use the `#gpu` channel on the [Julia
slack](https://julialang.slack.com/), or the [GPU domain of the Julia
@@ -14,21 +14,21 @@ Discourse](https://discourse.julialang.org/c/domain/gpu).
## Quick Start

The Julia CUDA stack requires a functional CUDA-setup, which includes both a driver and
-matching toolkit. Once you've set that up, continue by installing the three core packages:
+matching toolkit. Once you've set that up, continue by installing the CUDA.jl package:

```julia
using Pkg
-Pkg.add(["CUDAdrv", "CUDAnative", "CuArrays"])
+Pkg.add("CUDA")
```

-To make sure everything works as expected, try to load the packages and if you have the time
-execute their test suites:
+To make sure everything works as expected, try to load the package and if you have the time
+execute its test suite:

```julia
-using CUDAdrv, CUDAnative, CuArrays
+using CUDA

using Pkg
-Pkg.test(["CUDAdrv", "CUDAnative", "CuArrays"])
+Pkg.test("CUDA")
```
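
As a quick smoke test (a hypothetical minimal example, not part of the installation instructions), try a simple array operation:

```julia
using CUDA

a = CUDA.rand(2, 2)   # allocate a small random array on the GPU
b = a .+ 1            # broadcasting compiles and runs a GPU kernel
```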

For more details on the installation process, consult the [Installation](@ref
24 changes: 12 additions & 12 deletions docs/src/installation/conditional.md
@@ -12,8 +12,8 @@ stack will be taken into account by the package resolver when installing your pa
If the packages fail to initialize, a message will be printed:

```julia
-julia> using CuArrays
-[ Info: CuArrays.jl failed to initialize, GPU functionality unavailable (set JULIA_CUDA_SILENT or JULIA_CUDA_VERBOSE to silence or expand this message)
+julia> using CUDA
+[ Info: CUDA.jl failed to initialize, GPU functionality unavailable (set JULIA_CUDA_SILENT or JULIA_CUDA_VERBOSE to silence or expand this message)
```
To silence this message in your application, set the environment variable
@@ -22,8 +22,8 @@ print more information, and is required information for debugging or for filing
```julia
julia> ENV["JULIA_CUDA_VERBOSE"] = true
-julia> using CuArrays
-┌ Error: CuArrays.jl failed to initialize
+julia> using CUDA
+┌ Error: CUDA.jl failed to initialize
│ exception =
│ could not load library "libcuda"
│ libcuda.so: cannot open shared object file: No such file or directory
@@ -43,8 +43,8 @@ If your application requires a GPU, and its functionality is not designed to wor
CUDA, you should just import the necessary packages and inspect if they are functional:
```julia
-using CuArrays
-@assert CuArrays.functional()
+using CUDA
+@assert CUDA.functional()
```
If you are developing a package, you should take care only to perform this check at run
@@ -54,9 +54,9 @@ GPU:
```julia
module MyApplication

-using CuArrays
+using CUDA

-__init__() = @assert CuArrays.functional()
+__init__() = @assert CUDA.functional()

end
```
@@ -74,9 +74,9 @@ available:
```julia
module MyApplication

-using CuArrays
+using CUDA

-if CuArrays.functional()
+if CUDA.functional()
    to_gpu_or_not_to_gpu(x::AbstractArray) = CuArray(x)
else
    to_gpu_or_not_to_gpu(x::AbstractArray) = x
@@ -90,7 +90,7 @@ without CUDA. One option is to evaluate code at run time:
```julia
function __init__()
-    if CuArrays.functional()
+    if CUDA.functional()
        @eval to_gpu_or_not_to_gpu(x::AbstractArray) = CuArray(x)
    else
        @eval to_gpu_or_not_to_gpu(x::AbstractArray) = x
@@ -106,7 +106,7 @@ const use_gpu = Ref(false)
to_gpu_or_not_to_gpu(x::AbstractArray) = use_gpu[] ? CuArray(x) : x

function __init__()
-    use_gpu[] = CuArrays.functional()
+    use_gpu[] = CUDA.functional()
end
```
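
A hypothetical use of this pattern (`to_gpu_or_not_to_gpu` as defined in the examples above):

```julia
x = rand(Float32, 1024)
gx = to_gpu_or_not_to_gpu(x)   # CuArray on a functional GPU, plain Array otherwise
sum(gx)                        # generic code that works with either array type
```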