Releases: JuliaGPU/CUDA.jl
v4.4.2
CUDA v4.4.2
Merged pull requests:
- Added support for more transform directions (#1903) (@RainerHeintzmann)
- CuSparseArrayCSR (N dim array) with batched matmatmul (bmm) (#1944) (@nikopj)
- Add some performance tips to the documentation (#1999) (@Zentrik)
- Re-introduce the 'blocking' kwargs to at-sync. (#2060) (@maleadt)
- Adapt to GPUCompiler#master. (#2062) (@maleadt)
- Batched SVD added (gesvdjBatched and gesvdaStridedBatched) (#2063) (@nikopj)
- Use released GPUCompiler. (#2064) (@maleadt)
- Fixes for Windows. (#2065) (@maleadt)
- Switch to GPUArrays buffer management. (#2068) (@maleadt)
- Update CUDA 12 to Update 2. (#2071) (@maleadt)
- [CUSOLVER] Add generic routines (#2074) (@amontoison)
- Update manifest (#2076) (@github-actions[bot])
- Test improvements (#2079) (@maleadt)
- Rework and extend the cooperative groups API. (#2081) (@maleadt)
- Update manifest (#2082) (@github-actions[bot])
- [CUSOLVER] Add a method for geqrf! (#2085) (@amontoison)
- Fix some typos in performance tips (#2086) (@Zentrik)
- Improve PTX ISA selection (#2088) (@maleadt)
- Update manifest (#2090) (@github-actions[bot])
- support ChainRulesCore inplaceability (#2091) (@piever)
- Add a method inv(CuMatrix) (#2095) (@amontoison)
- Add mul!(A, B, C) where B or C is a diagonal matrix (#2096) (@amontoison)
- Add CUDA_Runtime_Discovery dependency to sublibraries. (#2097) (@maleadt)
- Handle and test zero-size inputs to RNGs. (#2098) (@maleadt)
- Add a with_workspaces function (#2099) (@amontoison)
- [CUSOLVER] Add a method for getrf! (#2100) (@amontoison)
- [CUSOLVER] Fix a typo with jobu / jobvt in gesvd (#2101) (@amontoison)
- Call exit when handling exceptions. (#2103) (@maleadt)
- Bump packages. (#2104) (@maleadt)
- Bump actions/checkout from 3 to 4 (#2106) (@dependabot[bot])
- Update manifest (#2107) (@github-actions[bot])
- Make Ref mutable on the GPU. (#2109) (@maleadt)
- CompatHelper: bump compat for CEnum to 0.5, (keep existing compat) (#2110) (@github-actions[bot])
- Small profiler improvements (#2113) (@maleadt)
- Update manifest (#2114) (@github-actions[bot])
- [CUSPARSE] Wrap new functions added with CUDA 12.2 (#2116) (@amontoison)
- [CUSOLVER] Add new methods for \ and inv (#2117) (@amontoison)
- Fix incorrect timing results for CUDA.@elapsed (#2118) (@thomasfaingnaert)
- [CUSOLVER] Interface sparse Cholesky and QR factorizations (#2121) (@amontoison)
- Update manifest (#2123) (@github-actions[bot])
- Profiler: Show used local memory. (#2124) (@maleadt)
- Support for CUDA 12.3 (#2125) (@maleadt)
- [CUSOLVER] Add Xsyevdx! and Xgesvdr! (#2127) (@amontoison)
- [CUSOLVER] Add Xgesvdp (#2128) (@amontoison)
- Profiler: don't crop when rendering to a file. (#2131) (@maleadt)
- Regenerate headers for CUDA 12.3. (#2132) (@maleadt)
- [CUSPARSE] Fix a bug with triangular solves (#2134) (@amontoison)
- CompatHelper: add new compat entry for Statistics at version 1, (keep existing compat) (#2135) (@github-actions[bot])
- CompatHelper: add new compat entry for LazyArtifacts at version 1, (keep existing compat) (#2136) (@github-actions[bot])
- Profiler: Parse and visualize NVTX marker data. (#2137) (@maleadt)
- Better support for unified and host memory (#2138) (@maleadt)
- Profiler: Improve compatibility with Pluto.jl and friends. (#2139) (@maleadt)
- Avoid allocations during derived array construction. (#2142) (@maleadt)
- More performance tweaks for memory copying (#2143) (@maleadt)
- Don't use libdevice's fmin/fmax. (#2144) (@maleadt)
- Update documentation (#2146) (@maleadt)
- Fixes for sm_61 (#2151) (@maleadt)
- Update sparse factorizations (#2152) (@amontoison)
- Don't call into LLVM's fmin/fmax on <sm_80. (#2154) (@maleadt)
- Only prefetch unified memory if concurrent access is possible. (#2155) (@maleadt)
- Support wrapping an Array with a CuArray without HMM. (#2156) (@maleadt)
- Sanitizer improvements. (#2157) (@maleadt)
- [CUSPARSE] Update the wrapper of cusparseSpSV_updateMatrix (#2159) (@amontoison)
- Profiler improvements: (textual) time distribution, at-bprofile. (#2162) (@maleadt)
- [CUSPARSE] Update the interface for triangular solves (#2164) (@amontoison)
- [CUSPARSE] Remove code related to old CUDA toolkits (#2165) (@amontoison)
- Detect compute-exclusive mode and adjust testing. (#2166) (@maleadt)
- expand docs on launch parameters (#2167) (@simonbyrne)
- Make CUDA.set_runtime_version force the default behavior. (#2169) (@maleadt)
- kernel docs: fix formatting, clean up awkward sentence (#2172) (@simonbyrne)
- [CUSOLVER] Don't reuse the sparse handles (#2173) (@amontoison)
- Added kronecker product support for dense matrices (#2177) (@albertomercurio)
- Update to CUTENSOR 2.0 (#2178) (@maleadt)
- Fix typos and simplify wording in performance tips docs (#2179) (@Zentrik)
- provide more information on kernel compilation error (#2180) (@simonbyrne)
- [CUSPARSE] Test CUSPARSE_SPMV_COO_ALG2 (#2182) (@amontoison)
- [CUSPARSE] Use cusparseSpMM_preprocess (#2183) (@amontoison)
- [CUSPARSE] Use cusparseSDDMM_preprocess (#2184) (@amontoison)
- Add the structures ILU0Info() and IC0Info() for the preconditioners (#2187) (@amontoison)
- [CUSOLVER] Add a structure CuSolverParameters for the generic API (#2188) (@amontoison)
- Support more kwarg syntax with kernel launches (#2189) (@maleadt)
- Fix typo in docs/src/development/troubleshooting.md (#2193) (@jcsahnwaldt)
- NVML: Add support for clock queries. (#2194) (@maleadt)
- Fix Random.jl seeding for 1.11 (#2199) (@IanButterworth)
- Improvements to context handling (#2200) (@maleadt)
- Add a concurrent kwarg to profiling macros. (#2201) (@maleadt)
- Rework unique context management. (#2202) (@maleadt)
- Preserve the buffer type when broadcasting. (#2203) (@maleadt)
- Fixes for Windows (#2206) (@maleadt)
- Bump Aqua. (#2207) (@maleadt)
- Updates for new CUQUANTUM (#2210) (@kshyatt)
- CUSPARSE: Eagerly combine duplicate elements on construction. (#2213) (@maleadt)
- CompatHelper: bump compat for BFloat16s to 0.5, (keep existing compat) (#2214) (@github-actions[bot])
- Bump the CUDA Runtime for CUDA 12.3.2. (#2217) (@maleadt)
- Default to testing with only a single device. (#2221) (@maleadt)
- Backports for v5.1 (#2224) (@maleadt)
- Take care not to spawn tasks during precompilation. (#2226) (@maleadt)
- cuTensor fixes (#2228) (@maleadt)
- Bump versions. (#2229) (@maleadt)
- Add a note about threaded for-blocks. (#2232) (@kshyatt)
- cuTENSOR plan handling changes. (#2234) (@maleadt)
- Fix dynamic dispatch issues (#2235) (@MilesCranmer)
- CUPTI: Add high-level wrappers for the callback API. (#2239) (@maleadt)
- Fixes for nightly (#2240) (@maleadt)
- CUBLAS: Support more strided inputs (#2242) (@maleadt)
- CuSparseArrayCSR (fixed cat ambiguities from #1944) (#2244) (@nikopj)
- Slightly rework error handling (#2245) (@maleadt)
- cuTENSOR improvements (#2246) (@maleadt)
- Make @device_code_sass work with non-Julia kernels. (#2247) (@maleadt)
- Improve Tegra detection. (#2251) (@maleadt)
- Added few SparseArrays functions (#2254) (@albertomercurio)
- Reduce locking in the handle cache (#2256) (@maleadt)
- Mark all CUDA ccalls as GC safe (#2262) (@vchuravy)
- cuTENSOR: Fix reference to undefined variable (#2263) (@lkdvos)
- cuTENSOR: refactor obtaining compute_type as part of plan (#2264) (@lkdvos)
- Re-generate headers. (#2265) (@maleadt)
- Update to CUDNN 9. (#2267) (@maleadt)
- [CUBLAS] Use the ILP64 API with CUDA 12 (#2270) (@amontoison)
- CompatHelper: bump compat for GPUCompiler to 0.26, (keep existing compat) (#2271) (@github-actions[bot])
- Minor improvements to nonblocking synchronization. (#2272) (@maleadt)
- Add extension package for StaticArrays (#2273) (@trahflow)
- Fix cuTensor, cuTensorNet and cuStateVec when using local Toolkit (#2274) (@bjoe2k4)
- Cached workspace prototype for custatevec (#2279) (@kshyatt)
- Update the Julia wrappers for v12.4 (#2282) (@amontoison)
- Add support for CUDA 12.4. (#2286) (@maleadt)
- Test suite changes (#2288) (@maleadt)
- Fix mixed-buffer/mixed-shape broadcasts. (#2290) (@maleadt)
- Fix typo in performance tips (#2294) (@Zentrik)
- Make it possible to customize the CuIterator adaptor. (#2297) (@maleadt)
- Set default buffer size in CUSPARSE mm! functions (#2298) (@lpawela)
- Avoid OOMs during OOM handling. (#2299) (@maleadt)
- [CUSOLVER] Add tests for geqrf, orgqr and ormqr (#2300) (@amontoison)
- [CUSOLVER] Interface larft! (#2301) (@amontoison)
- Fix RNG determinism when using wrapped arrays. (#2307) (@maleadt)
- [CUBLAS] Interface gemm_grouped_batched (#2310) (@amontoison)
- [CUSPARSE] Add a method convert for the type cusparseSpSMUpdate_t (#2311) (@amontoison)
Closed issues:
- Element-wise conversion to Duals (#127)
- IDEA: CuHostArray (#28)
- Make Ref pass by-reference (#267)
- Failed to compile PTX code when using NSight on Win11 (#1601)
- view(data, idx) boundschecking is disproportionately expensive (#1678)
- [CUSOLVER] Add a with_workspaces function to allocate two buffers (Device / Host) (#1767)
- Trouble using nsight systems for profiling CUDA in Julia (#1779)
- dlopen("libcudart") results in duplicate libraries (#1814)
- Support for JLD2 (#1833)
- Windows Defender mis-labels artifacts as threat (#1836)
- Support Cholesky factorization of CuSparseMatrixCSR (#1855)
- Runtime not re-selected after driver upgrade (#1877)
- Failure to initialize with CUDA_VISIBLE_DEVICES='' (#1945)
- Cannot precompile GPU code with PrecompileTools (#2006)
- Evaluating sparse matrices in the REPL has a huge memory footprint (#2016)
- CUDA_SDK_jll: cuda.h in different locations depending on the platform (#2066)
- StaticArrays.SHermitianCompact not working in kernels in Julia 1.10.0-beta2 (#2069)
- Support for LinearAlgebra.pinv (#2070)
- PTX ISA 8.1 support (#2080)
- Segmentation fault when importing CUDA (#2083)
- "No system CUDA driver found" on NixOS (#2089)
- CUDA.rand(Int64, m, n) can not be used when m or n is zero (#2093)
- Missing CUDA_Runtime_Discovery as a dependency in cuDNN (#2094)
v5.2.0
CUDA v5.2.0
Merged pull requests:
- CuSparseArrayCSR (N dim array) with batched matmatmul (bmm) (#1944) (@nikopj)
- Update to CUTENSOR 2.0 (#2178) (@maleadt)
- Updates for new CUQUANTUM (#2210) (@kshyatt)
- Take care not to spawn tasks during precompilation. (#2226) (@maleadt)
- cuTensor fixes (#2228) (@maleadt)
- Bump versions. (#2229) (@maleadt)
- Add a note about threaded for-blocks. (#2232) (@kshyatt)
- cuTENSOR plan handling changes. (#2234) (@maleadt)
- Fix dynamic dispatch issues (#2235) (@MilesCranmer)
- CUPTI: Add high-level wrappers for the callback API. (#2239) (@maleadt)
- Fixes for nightly (#2240) (@maleadt)
- CUBLAS: Support more strided inputs (#2242) (@maleadt)
Closed issues:
- Trouble using nsight systems for profiling CUDA in Julia (#1779)
- Evaluating sparse matrices in the REPL has a huge memory footprint (#2016)
- Intermittent CI failure: Segfault during nonblocking synchronization (#2141)
- First test for Julia/CUDA with 15 failures (#2158)
- Update to CUTENSOR 2.0 (#2174)
- Tests fail for CUDA#master (#2223)
- Test failures on Nvidia GH200 (#2227)
- mul! should support strided outputs (#2230)
- Please add support for older cuda versions (cuda 8 and older) (#2231)
- NSight Compute: prevent API calls during precompilation (#2233)
- Integrated profiler: detect lack of permissions (#2237)
v5.1.2
CUDA v5.1.2
Merged pull requests:
- kernel docs: fix formatting, clean up awkward sentence (#2172) (@simonbyrne)
- [CUSOLVER] Don't reuse the sparse handles (#2173) (@amontoison)
- Added kronecker product support for dense matrices (#2177) (@albertomercurio)
- Fix typos and simplify wording in performance tips docs (#2179) (@Zentrik)
- provide more information on kernel compilation error (#2180) (@simonbyrne)
- [CUSPARSE] Test CUSPARSE_SPMV_COO_ALG2 (#2182) (@amontoison)
- [CUSPARSE] Use cusparseSpMM_preprocess (#2183) (@amontoison)
- [CUSPARSE] Use cusparseSDDMM_preprocess (#2184) (@amontoison)
- Add the structures ILU0Info() and IC0Info() for the preconditioners (#2187) (@amontoison)
- [CUSOLVER] Add a structure CuSolverParameters for the generic API (#2188) (@amontoison)
- Support more kwarg syntax with kernel launches (#2189) (@maleadt)
- Fix typo in docs/src/development/troubleshooting.md (#2193) (@jcsahnwaldt)
- NVML: Add support for clock queries. (#2194) (@maleadt)
- Fix Random.jl seeding for 1.11 (#2199) (@IanButterworth)
- Improvements to context handling (#2200) (@maleadt)
- Add a concurrent kwarg to profiling macros. (#2201) (@maleadt)
- Rework unique context management. (#2202) (@maleadt)
- Preserve the buffer type when broadcasting. (#2203) (@maleadt)
- Fixes for Windows (#2206) (@maleadt)
- Bump Aqua. (#2207) (@maleadt)
- CUSPARSE: Eagerly combine duplicate elements on construction. (#2213) (@maleadt)
- CompatHelper: bump compat for BFloat16s to 0.5, (keep existing compat) (#2214) (@github-actions[bot])
- Bump the CUDA Runtime for CUDA 12.3.2. (#2217) (@maleadt)
- Default to testing with only a single device. (#2221) (@maleadt)
- Backports for v5.1 (#2224) (@maleadt)
Closed issues:
- More informative errors when parameter size is too big (#2119)
- Modifying struct containing CuArray fails in threads in 5.0.0 and 5.1.0 (#2171)
- Matmul of CuArray{ComplexF32} and CuArray{Float32} is slow (#2175)
- Support for combining duplicate elements in sparse matrices (#2185)
- Interactive sessions: periodically trim the memory pool (#2190)
- Broadcast does not preserve buffer type (#2191)
- CUDA doesn't precompile on Julia nightly/1.11 (#2195)
- Latest julia: UndefVarError: make_seed not defined in Random (#2198)
- CUDA installation fails on Apple Silicon/Julia 1.10 (#2211)
- Most recent package versions not supported on CUDA.jl (#2212)
- Testing of CUDA fails (#2222)
- --debug-info=2 makes NNlibCUDACUDNNExt precompilation run forever (#2225)
v5.1.1
CUDA v5.1.1
Merged pull requests:
- Sanitizer improvements. (#2157) (@maleadt)
- [CUSPARSE] Update the wrapper of cusparseSpSV_updateMatrix (#2159) (@amontoison)
- Profiler improvements: (textual) time distribution, at-bprofile. (#2162) (@maleadt)
- [CUSPARSE] Update the interface for triangular solves (#2164) (@amontoison)
- [CUSPARSE] Remove code related to old CUDA toolkits (#2165) (@amontoison)
- Detect compute-exclusive mode and adjust testing. (#2166) (@maleadt)
- expand docs on launch parameters (#2167) (@simonbyrne)
- Make CUDA.set_runtime_version force the default behavior. (#2169) (@maleadt)
Closed issues:
- High CPU load during GPU synchronization (#2161)
v5.1.0
CUDA v5.1.0
CUDA.jl 5.1 greatly improves the support of two important parts of the CUDA toolkit: unified memory, for accessing GPU memory on the CPU and vice versa, and cooperative groups, which offer a more modular approach to kernel programming. For more details, see the blog post.
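As a rough illustration of the two headline features, a minimal sketch (the unified keyword to cu and the CG submodule names are taken from the 5.1 release material; exact spellings may vary between versions, and running this requires a CUDA-capable GPU):

```julia
using CUDA

# Unified memory: the array is accessible from both host and device,
# so scalar reads on the CPU work without an explicit copy.
A = cu([1.0, 2.0, 3.0]; unified=true)
@show A[1]      # CPU-side access
A .+= 1         # still usable in GPU broadcasts and kernels

# Cooperative groups: synchronize via an explicit group handle
# instead of the implicit sync_threads().
function kernel()
    block = CG.this_thread_block()
    # ... per-thread work ...
    CG.sync(block)
    return
end
@cuda threads=32 kernel()
```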
Merged pull requests:
- [CUSOLVER] Add generic routines (#2074) (@amontoison)
- Rework and extend the cooperative groups API. (#2081) (@maleadt)
- [CUSOLVER] Add a method for geqrf! (#2085) (@amontoison)
- Fix some typos in performance tips (#2086) (@Zentrik)
- Improve PTX ISA selection (#2088) (@maleadt)
- Update manifest (#2090) (@github-actions[bot])
- support ChainRulesCore inplaceability (#2091) (@piever)
- Add a method inv(CuMatrix) (#2095) (@amontoison)
- Add mul!(A, B, C) where B or C is a diagonal matrix (#2096) (@amontoison)
- Add CUDA_Runtime_Discovery dependency to sublibraries. (#2097) (@maleadt)
- Handle and test zero-size inputs to RNGs. (#2098) (@maleadt)
- Add a with_workspaces function (#2099) (@amontoison)
- [CUSOLVER] Add a method for getrf! (#2100) (@amontoison)
- [CUSOLVER] Fix a typo with jobu / jobvt in gesvd (#2101) (@amontoison)
- Call exit when handling exceptions. (#2103) (@maleadt)
- Bump packages. (#2104) (@maleadt)
- Bump actions/checkout from 3 to 4 (#2106) (@dependabot[bot])
- Update manifest (#2107) (@github-actions[bot])
- Make Ref mutable on the GPU. (#2109) (@maleadt)
- CompatHelper: bump compat for CEnum to 0.5, (keep existing compat) (#2110) (@github-actions[bot])
- Small profiler improvements (#2113) (@maleadt)
- Update manifest (#2114) (@github-actions[bot])
- [CUSPARSE] Wrap new functions added with CUDA 12.2 (#2116) (@amontoison)
- [CUSOLVER] Add new methods for \ and inv (#2117) (@amontoison)
- Fix incorrect timing results for CUDA.@elapsed (#2118) (@thomasfaingnaert)
- [CUSOLVER] Interface sparse Cholesky and QR factorizations (#2121) (@amontoison)
- Update manifest (#2123) (@github-actions[bot])
- Profiler: Show used local memory. (#2124) (@maleadt)
- Support for CUDA 12.3 (#2125) (@maleadt)
- [CUSOLVER] Add Xsyevdx! and Xgesvdr! (#2127) (@amontoison)
- [CUSOLVER] Add Xgesvdp (#2128) (@amontoison)
- Profiler: don't crop when rendering to a file. (#2131) (@maleadt)
- Regenerate headers for CUDA 12.3. (#2132) (@maleadt)
- [CUSPARSE] Fix a bug with triangular solves (#2134) (@amontoison)
- CompatHelper: add new compat entry for Statistics at version 1, (keep existing compat) (#2135) (@github-actions[bot])
- CompatHelper: add new compat entry for LazyArtifacts at version 1, (keep existing compat) (#2136) (@github-actions[bot])
- Profiler: Parse and visualize NVTX marker data. (#2137) (@maleadt)
- Better support for unified and host memory (#2138) (@maleadt)
- Profiler: Improve compatibility with Pluto.jl and friends. (#2139) (@maleadt)
- Avoid allocations during derived array construction. (#2142) (@maleadt)
- More performance tweaks for memory copying (#2143) (@maleadt)
- Don't use libdevice's fmin/fmax. (#2144) (@maleadt)
- Update documentation (#2146) (@maleadt)
- Fixes for sm_61 (#2151) (@maleadt)
- Update sparse factorizations (#2152) (@amontoison)
- Don't call into LLVM's fmin/fmax on <sm_80. (#2154) (@maleadt)
- Only prefetch unified memory if concurrent access is possible. (#2155) (@maleadt)
- Support wrapping an Array with a CuArray without HMM. (#2156) (@maleadt)
Closed issues:
- Element-wise conversion to Duals (#127)
- IDEA: CuHostArray (#28)
- Make Ref pass by-reference (#267)
- view(data, idx) boundschecking is disproportionately expensive (#1678)
- [CUSOLVER] Add a with_workspaces function to allocate two buffers (Device / Host) (#1767)
- dlopen("libcudart") results in duplicate libraries (#1814)
- Support for JLD2 (#1833)
- Windows Defender mis-labels artifacts as threat (#1836)
- Support Cholesky factorization of CuSparseMatrixCSR (#1855)
- Runtime not re-selected after driver upgrade (#1877)
- Failure to initialize with CUDA_VISIBLE_DEVICES='' (#1945)
- Cannot precompile GPU code with PrecompileTools (#2006)
- CUDA_SDK_jll: cuda.h in different locations depending on the platform (#2066)
- PTX ISA 8.1 support (#2080)
- Segmentation fault when importing CUDA (#2083)
- "No system CUDA driver found" on NixOS (#2089)
- CUDA.rand(Int64, m, n) can not be used when m or n is zero (#2093)
- Missing CUDA_Runtime_Discovery as a dependency in cuDNN (#2094)
- Binaries for Jetson (#2105)
- Minimum/maximum of array of NaNs is infinity (#2111)
- Performance regression for multiple @sync copyto! on CUDA v5 (#2112)
- [CUBLAS] Regenerate the wrappers with updated argument types (#2115)
- Unable to allocate unified memory buffers (#2120)
- CUDA 12.3 has been released (#2122)
- atomic min, max for Float32 and Float64 (#2129)
- Native profiler output is limited to around 100 columns when printing to a file (#2130)
- LLVM generates max.NaN which only works on sm_80 (#2148)
- Unified memory-related error on Tegra T194 (#2149)
- Errors on sm_61 (#2150)
v5.0.0
CUDA v5.0.0
Blog post: https://info.juliahub.com/cuda-jl-5-0-changes
This is a breaking release, but the breaking changes are minimal (see the blog post for details):
- Julia 1.8 is now required, and only CUDA 11.4+ is supported
- selection of local toolkits has changed slightly
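The changed local-toolkit selection is configured through a preference; a hedged sketch (the local_toolkit keyword is assumed from the toolkit-selection rework in #2058 and the blog post; it writes to LocalPreferences.toml and takes effect after restarting Julia):

```julia
using CUDA

# Pin the artifact-provided CUDA runtime to a specific version:
CUDA.set_runtime_version!(v"12.3")

# Or prefer a locally-installed CUDA toolkit over artifacts:
CUDA.set_runtime_version!(local_toolkit=true)
```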
Merged pull requests:
- Added support for more transform directions (#1903) (@RainerHeintzmann)
- Add some performance tips to the documentation (#1999) (@Zentrik)
- Re-introduce the 'blocking' kwargs to at-sync. (#2060) (@maleadt)
- Adapt to GPUCompiler#master. (#2062) (@maleadt)
- Batched SVD added (gesvdjBatched and gesvdaStridedBatched) (#2063) (@nikopj)
- Use released GPUCompiler. (#2064) (@maleadt)
- Fixes for Windows. (#2065) (@maleadt)
- Switch to GPUArrays buffer management. (#2068) (@maleadt)
- Update CUDA 12 to Update 2. (#2071) (@maleadt)
- Update manifest (#2076) (@github-actions[bot])
- Test improvements (#2079) (@maleadt)
- Update manifest (#2082) (@github-actions[bot])
Closed issues:
v4.4.1
CUDA v4.4.1
Closed issues:
- CUDA driver device support does not match toolkit (#70)
- Launching kernels should not allocate (#66)
- sync_threads() appears to not be sync'ing threads (#61)
- Exception when using CuArrays with Flux (#129)
- Kernel using MVector fails to compile or crashes at runtime due to heap allocation (#45)
- Performance regression on matrix multiplication between CUDA.jl 1.3.3 and 2.1.0/master (#538)
- Improve 'VS C++ redistributable' error message (#764)
- CUSPARSE does not support reductions (#1406)
- CUDA test failed (#1690)
- Type constructor in broadcast doesn't compile (#1761)
- accumulate(+) gives different results for CuArray compared to Array. (#1810)
- Compat driver: preload all libraries (#1859)
- Stream synchronization is slow when waiting on the event from CUDA (#1910)
- cuDNN: Store convolution algorithm choice to disk. (#1947)
- Disable 'No CUDA-capable device found' error log (#1955)
- CUDNN_STATUS_NOT_SUPPORTED using 1D CNN model (#1977)
- Memory allocations during in-place sparse matrix-vector multiplication (#1982)
- CUSPARSE.sum_dim1 sums the absolute values of elements (#1983)
- Update to CUDA 12.2 (#1984)
- unsafe_wrap fails on zero element CuArrays (#1985)
- rand in kernel works in a deterministic way (#2008)
- Scalar indexing with CuArray * ReshapedArray{SubArray{CuArray}} (#2009)
- volumerhs performance regression (#2010)
- CuSparseMatrix constructors allocate too much memory? (#2015)
- Native profiler using CUPTI (#2017)
- libLLVM-15jl.so (#2018)
- "symbol multiply defined" error (#2021)
- Confusion on row major vs column major (#2023)
- Printing of CuArrays gives zeros or random numbers (#2033)
- sortperm! fails when output is UInt vector (#2046)
- Re-introduce spinning loop before nonblocking synchronization (#2057)
Merged pull requests:
- Check mathType only if not Float32 (#1943) (@RomeoV)
- 1.10 enablement (#1946) (@dkarrasch)
- Implement reverse lookup (Ptr->Tuple) for CUDNN descriptors. (#1948) (@RomeoV)
- Wrapper with tests for gemmBatchedEx! (#1975) (@lpawela)
- Add wrappers for gemv_batched! (#1981) (@lpawela)
- Update CUSPARSE.sum_dim<n> to allow for arbitrary function on elements (#1987) (@lpawela)
- Update manifest (#1988) (@github-actions[bot])
- Add vectorized cached loads (#1993) (@Zentrik)
- Update manifest (#1995) (@github-actions[bot])
- Fix typo in captured macro example (#1996) (@Zentrik)
- Adapt Type call broadcasting to a function (#2000) (@simonbyrne)
- [CUSPARSE] Added support for generalized dot product dot(x, A, y) = dot(x, A * y) without allocating A * y (#2001) (@albertomercurio)
- Update manifest (#2002) (@github-actions[bot])
- Support for printing types. (#2003) (@maleadt)
- Fix accumulate bug (#2005) (@chrstphrbrns)
- Update manifest (#2013) (@github-actions[bot])
- Add a raw mode to code_sass. (#2019) (@maleadt)
- Update manifest (#2022) (@github-actions[bot])
- Add a native profiler. (#2024) (@maleadt)
- Perform synchronization on a worker thread (#2025) (@maleadt)
- Remove broken video link in docs (#2028) (@christiangnrd)
- When freeing memory, use the high-level device getter. (#2029) (@maleadt)
- Add support for @cuda fastmath (#2030) (@maleadt)
- Make "CUDA.jl" a link on the doc entry page (#2031) (@carstenbauer)
- Add support for CUDA 12.2. (#2034) (@maleadt)
- rand: seed kernels from the host. (#2035) (@maleadt)
- Update wrappers for CUDA 12.2. (#2039) (@maleadt)
- On CUDA 12.2, have the memory pool enforce hard memory limits. (#2040) (@maleadt)
- Delay all initialization errors until run time. (#2041) (@maleadt)
- JLL/CI/Julia changes. (#2042) (@maleadt)
- Add support for NVTX events to the integrated profiler. (#2043) (@maleadt)
- Update cuStateVec to cuQuantum 23.6. (#2044) (@maleadt)
- Add some more fastmath functions (#2047) (@Zentrik)
- Fixup wrong key lookup. (#2048) (@RomeoV)
- Update manifest (#2049) (@github-actions[bot])
- Make sortperm! resilient to type mismatches. (#2051) (@maleadt)
- Disable tests that cause GC corruption on 1.10. (#2053) (@maleadt)
- enable dependabot for GitHub actions (#2054) (@ranocha)
- Bump actions/checkout from 2 to 3 (#2055) (@dependabot[bot])
- Bump peter-evans/create-pull-request from 3 to 5 (#2056) (@dependabot[bot])
- Rework how local toolkits are selected. (#2058) (@maleadt)
- Busy-wait before doing nonblocking synchronization. (#2059) (@maleadt)
v4.4.0
CUDA v4.4.0
Closed issues:
- Unreachable control flow leads to illegal divergent barriers (#1746)
- CUBLAS fails on new CUDA.jl v4 (#1852)
- Sort fails on Lovelace (sm8.9) GPUs (#1874)
- gesvd! crashes on Pascal and v12.0 (#1932)
- No effect for calling "nsys launch" (#1938)
- Basic math operations with nested adjoint and transpose (#1940)
- CPU and GPU implementations return results at dissimilar scales, even in double precision arithmetics (#1950)
- Failed CUDA.jl initialization breaks Flux? (#1952)
- Recent mul! changes break multiplication with matrices that have StaticArray elements (#1953)
- Test infrastructure: define test groups (#1961)
- Strange rand errors when sampling large matrices (#1963)
- Add aqua tests (#1964)
- Support of Orin GPU from Nvidia ? (#1966)
- Crash in LLVM (#1971)
- Warning cuDNN Convolution (#1972)
- Strange behaviour when installed at system level (#1973)
Merged pull requests:
- Update benchmarks for 1.8 and 1.9 (#1933) (@maleadt)
- CUSOLVER: Explicitly pass NULL when not requesting svd outputs. (#1934) (@maleadt)
- Detect and complain about loading system libraries. (#1935) (@maleadt)
- Update manifest (#1936) (@github-actions[bot])
- Avoid stack overflow with early OOM reporting. (#1937) (@maleadt)
- [CUSPARSE] Improved support for UniformScaling and Diagonal (#1941) (@albertomercurio)
- Update manifest (#1949) (@github-actions[bot])
- Update GPUCompiler to fix unreachable control flow. (#1951) (@maleadt)
- Allow StaticArray eltype in matmat{vec,mul} (#1954) (@lcw)
- Bump CUDNN to v8.9. (#1959) (@maleadt)
- Bump CUTENSOR to v1.7. (#1960) (@maleadt)
- Add and fix some aqua tests (#1965) (@charleskawczynski)
- Fix compatibility of CUDA 11.4 to support Orin. (#1967) (@maleadt)
- Don't use Int32 indices in rand kernels. (#1969) (@maleadt)
- CI simplifications (#1970) (@maleadt)
- Use Base.pkgversion on 1.9. (#1974) (@maleadt)
- Update to LLVM.jl 6. (#1976) (@maleadt)
- fix launch config bug in bitonic sort (#1979) (@xaellison)
- Update manifest (#1980) (@github-actions[bot])
v4.3.2
v4.3.1
CUDA v4.3.1
Closed issues:
- Array testsuite compiles kernel with large types (#1902)
- CUDA.jl v4 installs CUDA runtime despite version=local (#1922)
- Occasional "CUSOLVERError: an internal operation failed (code 7, CUSOLVER_STATUS_INTERNAL_ERROR)" (#1924)
- Does [email protected] need [email protected]? (#1929)
Merged pull requests: