Releases: JuliaGPU/CUDA.jl

v5.4.2

29 May 07:35
7e6a57a

CUDA v5.4.2

Diff since v5.4.1

Merged pull requests:

v5.4.1

28 May 18:53
5bbd9a7

CUDA v5.4.1

Diff since v5.4.0

Merged pull requests:

v5.4.0

28 May 06:45

CUDA v5.4.0

Diff since v5.3.5

Merged pull requests:

Closed issues:

  • CUTENSOR breaks after device_reset! (#2319)
  • cuBLASXt's xt_gemm! incompatible with stream-ordered allocated memory (#2320)
  • Add helper function to recompile CUDA stack (#2364)

v5.3.5

24 May 13:29
7232f85

CUDA v5.3.5

Diff since v5.3.4

Merged pull requests:

  • Avoid constructing MulAddMuls on Julia v1.12+ (#2277) (@dkarrasch)
  • CompatHelper: bump compat for LLVM to 7, (keep existing compat) (#2365) (@github-actions[bot])
  • Enzyme: allocation functions (#2386) (@wsmoses)
  • Tweaks to prevent context construction on some operations (#2387) (@maleadt)
  • Fixes for Julia 1.12 / LLVM 17 (#2390) (@maleadt)
  • CUBLAS: Make sure CUBLASLt wrappers use the correct library. (#2391) (@maleadt)
  • Backport: Enzyme allocation fns (#2393) (@wsmoses)

Closed issues:

  • Indexing a view uses scalar indexing (#1472)
  • EnzymeCore is an unconditional dependency. (#2380)
  • cuBLASLt wrappers ccall into cuBLAS (#2388)
  • generic_trimatmul! error (#2389)

v5.3.4

15 May 19:28
c373258

CUDA v5.3.4

Diff since v5.3.3

Merged pull requests:

Closed issues:

  • Native Softmax (#175)
  • CUSOLVER: support eigendecomposition (#173)
  • backslash with gpu matrices crashes julia (#161)
  • at-benchmark captures GPU arrays (#156)
  • Support kernels returning Union{} (#62)
  • mul! falls back to generic implementation (#148)
  • \ on qr factorization objects gives a method error (#138)
  • Compiler failure if dependent module only contains a japi1 function (#49)
  • copy!(dst, src) and copyto!(dst, src) are significantly slower and allocate more memory than copyto!(dest, do, src, so[, N]) (#126)
  • Calling Flux.gpu on a view dumps core (#125)
  • Creating CuArray{Tracker.TrackedReal{Float64},1} a few times causes segfaults (#121)
  • Guard against exceeding maximum kernel parameter size (#32)
  • Detect common API misuse in error handlers (#31)
  • rand and friends default to Float64 (#108)
  • \ does not work for least squares (#104)
  • ERROR_ILLEGAL_ADDRESS when broadcasting modular arithmetic (#94)
  • CuIterator assumes batches to consist of multiple arrays (#86)
  • Algebra with UniformScaling Uses Generic Fallback Scalar Indexing (#85)
  • Document (un)supported language features for kernel programming (#13)
  • Missing dispatch for indexing of reshaped arrays (#556)
  • Track array ownership to avoid illegal memory accesses (#763)
  • NVPTX i128 support broken on LLVM 11 / Julia 1.6 (#793)
  • Support for sm_80 cp.async: asynchronous on-device copies (#850)
  • Profiling Julia with Nsight Systems on Windows results in blank window (#862)
  • sort! and partialsort! are considerably slower than CPU versions (#937)
  • mul! does not dispatch on Adjoint (#1363)
  • Cross-device copy of wrapped arrays fails (#1377)
  • Memory allocation becomes very slow when reserved bytes is large (#1540)
  • Cannot reclaim GPU Memory; CUDA.reclaim() (#1562)
  • Add eigen for general purpose computation of eigenvectors/eigenvalues (#1572)
  • device_reset! does not seem to work anymore (#1579)
  • device-side rand() are not random between successive kernel launches (#1633)
  • Add EnzymeRules support for CUDA.jl (for forward mode here) (#1811)
  • cusparseSetStream_v2 not defined (#1820)
  • Feature request: Integrating the latest CUDA library "cuLitho" into CUDA.jl (#1821)
  • KernelAbstractions.jl-related issues (#1838)
  • lock failing in multithreaded plan_fft() (#1921)
  • CUSolver finalizer tries to take ReentrantLock (#1923)
  • Testsuite could be more careful about parallel testing (#2192)
  • Opportunistic GC collection (#2303)
  • Unable to use local CUDA runtime toolkit (#2367)
  • Enzyme prevents testing on 1.11 (#2376)

v5.3.3

27 Apr 10:11

CUDA v5.3.3

Diff since v5.3.2

Merged pull requests:

Closed issues:

  • Excessive allocations when running on multiple threads (#1429)
  • Fix and test multigpu support (#2218)
  • Bitonic sort exceeds launch resources (#2331)

v5.3.2

26 Apr 13:59

CUDA v5.3.2

Diff since v5.3.1

Merged pull requests:

Closed issues:

  • CuArrays don't seem to display correctly in VS code (#875)
  • Task scheduling can result in delays when synchronizing (#1525)
  • Docs: add example on task-based parallelism with explicit synchronization (#1566)
  • Exception output from many threads is not helpful (#1780)
  • Autodetect external profiler (#2176)
  • LazyInitialized is not GC-safe (#2216)
  • Track CuArray stream usage (#2236)
  • Improve cross-device usage (#2323)
  • CUBLASLt wrapper for cublasLtMatmulDescSetAttribute can have device buffers as input (#2337)
  • Improve error message when assigning real valued array with complex numbers (#2341)
  • @device_code_sass broken (#2343)
  • Readme says Cuda 11 is supported but also the last version to support it is v4.4 (#2345)
  • @gcsafe_ccall breaks inlining of ccall wrappers (#2347)

v5.3.1

19 Apr 07:16
9c9a05f

CUDA v5.3.1

Diff since v5.3.0

Merged pull requests:

Closed issues:

  • Missing CUBLASLt wrappers (#2322)
  • error when switching device (#2323)
  • v5.3.0: regression in Zygote performance (#2333)

v5.3.0

12 Apr 14:27
5da4d1d

CUDA v5.3.0

Diff since v5.2.0

Merged pull requests:

Closed issues:

  • Failed to compile PTX code when using NSight on Win11 (#1601)
  • sortperm fails with dims keyword (#2061)
  • NVTX-related segfault on Windows under compute-sanitizer (#2204)
  • Inverse Complex-to-Real FFT allocates GPU memory (#2249)
  • cuDNN not available for your platform (#2252)
  • Cannot reset CuArray to zero (#2257)
  • Cannot take gradient of sort on 2D CuArray (#2259)
  • Multi-threaded code hanging forever with Julia 1.10 (#2261)
  • CUBLAS: nrm2 support for StridedCuArray with length requiring Int64 (#2268)
  • Adjoint not supported on Diagonal arrays (#2275)
  • Regression in broadcast: getting Array (Julia 1.10) instead of CuArray (Julia 1.9) (#2276)
  • Release v5.3? (#2283)
  • Wrap CUDSS? (#2287)
  • Bug concerning broadcast between device array and unified array (#2289)
  • StackOverflowError trying to throw OutOfGPUMemoryError, subsequent errors (#2292)
  • BUG: sortperm! seems to perform much slower than it should (#2293)
  • Multiplying CuSparseMatrixCSC by CuMatrix results in Out of GPU memory (#2296)
  • BFloat16 support broken on Julia 1.11 (#2306)
  • does not emit line info for debugging/profiling (#2312)
  • Kernel using StaticArray compiles in julia v1.9.4 but not in v1.10.2 (#2313)
  • Using copyto! with SharedArray trigger scalar indexing disallowed error (#2317)

v4.4.2

04 Apr 09:27

CUDA v4.4.2

Diff since v4.4.1

Merged pull requests:

Closed issues:

  • Element-wise conversion to Duals (#127)
  • IDEA: CuHostArray (#28)
  • Make Ref pass by-reference (#267)
  • Failed to compile PTX code when using NSight on Win11 (#1601)
  • view(data, idx) boundschecking is disproportionately expensive (#1678)
  • [CUSOLVER] Add a with_workspaces function to allocate two buffers (Device / Host) (#1767)
  • Trouble using nsight systems for profiling CUDA in Julia (#1779)
  • dlopen("libcudart") results in duplicate libraries (#1814)
  • Support for JLD2 (#1833)
  • Windows Defender mis-labels artifacts as threat (#1836)
  • Support Cholesky factorization of CuSparseMatrixCSR (#1855)
  • Runtime not re-selected after driver upgrade (#1877)
  • Failure to initialize with CUDA_VISIBLE_DEVICES='' (#1945)
  • Cannot precompile GPU code with PrecompileTools (#2006)
  • Evaluating sparse matrices in the REPL has a huge memory footprint (#2016)
  • CUDA_SDK_jll: cuda.h in different locations depending on the platform (#2066)
  • StaticArrays.SHermitianCompact not working in kernels in Julia 1.10.0-beta2 (#2069)
  • Support for LinearAlgebra.pinv (#2070)
  • PTX ISA 8.1 support (#2080)
  • Segmentation fault when importing CUDA (#2083)
  • "No system CUDA driver found" on NixOS (#2089)
  • CUDA.rand(Int64, m, n) can not be used when m or n is zero (#2093)
  • Miss...