CUDA v4.4.2
Merged pull requests:
- Added support for more transform directions (#1903) (@RainerHeintzmann)
- CuSparseArrayCSR (N dim array) with batched matmatmul (bmm) (#1944) (@nikopj)
- Add some performance tips to the documentation (#1999) (@Zentrik)
- Re-introduce the 'blocking' kwarg to `@sync`. (#2060) (@maleadt)
- Adapt to GPUCompiler#master. (#2062) (@maleadt)
- Batched SVD added (gesvdjBatched and gesvdaStridedBatched) (#2063) (@nikopj)
- Use released GPUCompiler. (#2064) (@maleadt)
- Fixes for Windows. (#2065) (@maleadt)
- Switch to GPUArrays buffer management. (#2068) (@maleadt)
- Update CUDA 12 to Update 2. (#2071) (@maleadt)
- [CUSOLVER] Add generic routines (#2074) (@amontoison)
- Update manifest (#2076) (@github-actions[bot])
- Test improvements (#2079) (@maleadt)
- Rework and extend the cooperative groups API. (#2081) (@maleadt)
- Update manifest (#2082) (@github-actions[bot])
- [CUSOLVER] Add a method for geqrf! (#2085) (@amontoison)
- Fix some typos in performance tips (#2086) (@Zentrik)
- Improve PTX ISA selection (#2088) (@maleadt)
- Update manifest (#2090) (@github-actions[bot])
- support ChainRulesCore inplaceability (#2091) (@piever)
- Add a method inv(CuMatrix) (#2095) (@amontoison)
- Add mul!(A, B, C) where B or C is a diagonal matrix (#2096) (@amontoison)
- Add CUDA_Runtime_Discovery dependency to sublibraries. (#2097) (@maleadt)
- Handle and test zero-size inputs to RNGs. (#2098) (@maleadt)
- Add a with_workspaces function (#2099) (@amontoison)
- [CUSOLVER] Add a method for getrf! (#2100) (@amontoison)
- [CUSOLVER] Fix a typo with jobu / jobvt in gesvd (#2101) (@amontoison)
- Call exit when handling exceptions. (#2103) (@maleadt)
- Bump packages. (#2104) (@maleadt)
- Bump actions/checkout from 3 to 4 (#2106) (@dependabot[bot])
- Update manifest (#2107) (@github-actions[bot])
- Make Ref mutable on the GPU. (#2109) (@maleadt)
- CompatHelper: bump compat for CEnum to 0.5, (keep existing compat) (#2110) (@github-actions[bot])
- Small profiler improvements (#2113) (@maleadt)
- Update manifest (#2114) (@github-actions[bot])
- [CUSPARSE] Wrap new functions added with CUDA 12.2 (#2116) (@amontoison)
- [CUSOLVER] Add new methods for \ and inv (#2117) (@amontoison)
- Fix incorrect timing results for CUDA.@elapsed (#2118) (@thomasfaingnaert)
- [CUSOLVER] Interface sparse Cholesky and QR factorizations (#2121) (@amontoison)
- Update manifest (#2123) (@github-actions[bot])
- Profiler: Show used local memory. (#2124) (@maleadt)
- Support for CUDA 12.3 (#2125) (@maleadt)
- [CUSOLVER] Add Xsyevdx! and Xgesvdr! (#2127) (@amontoison)
- [CUSOLVER] Add Xgesvdp (#2128) (@amontoison)
- Profiler: don't crop when rendering to a file. (#2131) (@maleadt)
- Regenerate headers for CUDA 12.3. (#2132) (@maleadt)
- [CUSPARSE] Fix a bug with triangular solves (#2134) (@amontoison)
- CompatHelper: add new compat entry for Statistics at version 1, (keep existing compat) (#2135) (@github-actions[bot])
- CompatHelper: add new compat entry for LazyArtifacts at version 1, (keep existing compat) (#2136) (@github-actions[bot])
- Profiler: Parse and visualize NVTX marker data. (#2137) (@maleadt)
- Better support for unified and host memory (#2138) (@maleadt)
- Profiler: Improve compatibility with Pluto.jl and friends. (#2139) (@maleadt)
- Avoid allocations during derived array construction. (#2142) (@maleadt)
- More performance tweaks for memory copying (#2143) (@maleadt)
- Don't use libdevice's fmin/fmax. (#2144) (@maleadt)
- Update documentation (#2146) (@maleadt)
- Fixes for sm_61 (#2151) (@maleadt)
- Update sparse factorizations (#2152) (@amontoison)
- Don't call into LLVM's fmin/fmax on <sm_80. (#2154) (@maleadt)
- Only prefetch unified memory if concurrent access is possible. (#2155) (@maleadt)
- Support wrapping an Array with a CuArray without HMM. (#2156) (@maleadt)
- Sanitizer improvements. (#2157) (@maleadt)
- [CUSPARSE] Update the wrapper of cusparseSpSV_updateMatrix (#2159) (@amontoison)
- Profiler improvements: (textual) time distribution, `@bprofile`. (#2162) (@maleadt)
- [CUSPARSE] Update the interface for triangular solves (#2164) (@amontoison)
- [CUSPARSE] Remove code related to old CUDA toolkits (#2165) (@amontoison)
- Detect compute-exclusive mode and adjust testing. (#2166) (@maleadt)
- expand docs on launch parameters (#2167) (@simonbyrne)
- Make CUDA.set_runtime_version force the default behavior. (#2169) (@maleadt)
- kernel docs: fix formatting, clean up awkward sentence (#2172) (@simonbyrne)
- [CUSOLVER] Don't reuse the sparse handles (#2173) (@amontoison)
- Added kronecker product support for dense matrices (#2177) (@albertomercurio)
- Update to CUTENSOR 2.0 (#2178) (@maleadt)
- Fix typos and simplify wording in performance tips docs (#2179) (@Zentrik)
- provide more information on kernel compilation error (#2180) (@simonbyrne)
- [CUSPARSE] Test CUSPARSE_SPMV_COO_ALG2 (#2182) (@amontoison)
- [CUSPARSE] Use cusparseSpMM_preprocess (#2183) (@amontoison)
- [CUSPARSE] Use cusparseSDDMM_preprocess (#2184) (@amontoison)
- Add the structures ILU0Info() and IC0Info() for the preconditioners (#2187) (@amontoison)
- [CUSOLVER] Add a structure CuSolverParameters for the generic API (#2188) (@amontoison)
- Support more kwarg syntax with kernel launches (#2189) (@maleadt)
- Fix typo in docs/src/development/troubleshooting.md (#2193) (@jcsahnwaldt)
- NVML: Add support for clock queries. (#2194) (@maleadt)
- Fix Random.jl seeding for 1.11 (#2199) (@IanButterworth)
- Improvements to context handling (#2200) (@maleadt)
- Add a concurrent kwarg to profiling macros. (#2201) (@maleadt)
- Rework unique context management. (#2202) (@maleadt)
- Preserve the buffer type when broadcasting. (#2203) (@maleadt)
- Fixes for Windows (#2206) (@maleadt)
- Bump Aqua. (#2207) (@maleadt)
- Updates for new CUQUANTUM (#2210) (@kshyatt)
- CUSPARSE: Eagerly combine duplicate elements on construction. (#2213) (@maleadt)
- CompatHelper: bump compat for BFloat16s to 0.5, (keep existing compat) (#2214) (@github-actions[bot])
- Bump the CUDA Runtime for CUDA 12.3.2. (#2217) (@maleadt)
- Default to testing with only a single device. (#2221) (@maleadt)
- Backports for v5.1 (#2224) (@maleadt)
- Take care not to spawn tasks during precompilation. (#2226) (@maleadt)
- cuTensor fixes (#2228) (@maleadt)
- Bump versions. (#2229) (@maleadt)
- Add a note about threaded for-blocks. (#2232) (@kshyatt)
- cuTENSOR plan handling changes. (#2234) (@maleadt)
- Fix dynamic dispatch issues (#2235) (@MilesCranmer)
- CUPTI: Add high-level wrappers for the callback API. (#2239) (@maleadt)
- Fixes for nightly (#2240) (@maleadt)
- CUBLAS: Support more strided inputs (#2242) (@maleadt)
- CuSparseArrayCSR (fixed cat ambiguities from #1944) (#2244) (@nikopj)
- Slightly rework error handling (#2245) (@maleadt)
- cuTENSOR improvements (#2246) (@maleadt)
- Make `@device_code_sass` work with non-Julia kernels. (#2247) (@maleadt)
- Improve Tegra detection. (#2251) (@maleadt)
- Added a few SparseArrays functions (#2254) (@albertomercurio)
- Reduce locking in the handle cache (#2256) (@maleadt)
- Mark all CUDA ccalls as GC safe (#2262) (@vchuravy)
- cuTENSOR: Fix reference to undefined variable (#2263) (@lkdvos)
- cuTENSOR: refactor obtaining compute_type as part of plan (#2264) (@lkdvos)
- Re-generate headers. (#2265) (@maleadt)
- Update to CUDNN 9. (#2267) (@maleadt)
- [CUBLAS] Use the ILP64 API with CUDA 12 (#2270) (@amontoison)
- CompatHelper: bump compat for GPUCompiler to 0.26, (keep existing compat) (#2271) (@github-actions[bot])
- Minor improvements to nonblocking synchronization. (#2272) (@maleadt)
- Add extension package for StaticArrays (#2273) (@trahflow)
- Fix cuTensor, cuTensorNet and cuStateVec when using local Toolkit (#2274) (@bjoe2k4)
- Cached workspace prototype for custatevec (#2279) (@kshyatt)
- Update the Julia wrappers for v12.4 (#2282) (@amontoison)
- Add support for CUDA 12.4. (#2286) (@maleadt)
- Test suite changes (#2288) (@maleadt)
- Fix mixed-buffer/mixed-shape broadcasts. (#2290) (@maleadt)
- Fix typo in performance tips (#2294) (@Zentrik)
- Make it possible to customize the CuIterator adaptor. (#2297) (@maleadt)
- Set default buffer size in CUSPARSE `mm!` functions (#2298) (@lpawela)
- Avoid OOMs during OOM handling. (#2299) (@maleadt)
- [CUSOLVER] Add tests for geqrf, orgqr and ormqr (#2300) (@amontoison)
- [CUSOLVER] Interface larft! (#2301) (@amontoison)
- Fix RNG determinism when using wrapped arrays. (#2307) (@maleadt)
- [CUBLAS] Interface gemm_grouped_batched (#2310) (@amontoison)
- [CUSPARSE] Add a method convert for the type cusparseSpSMUpdate_t (#2311) (@amontoison)
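Several of the dense linear-algebra additions in this release (`inv(CuMatrix)` from #2095, generic `\` from #2117, and dense `kron` from #2177) plug into the standard LinearAlgebra API. A minimal illustrative sketch, not part of the release notes, assuming a CUDA-capable GPU:

```julia
using CUDA, LinearAlgebra

# Dense matrix inverse on the GPU (#2095).
A = CuArray(rand(Float32, 4, 4) + 4I)  # diagonally dominant, so well-conditioned
Ainv = inv(A)

# Generic left-division, dispatched to CUSOLVER (#2117).
b = CUDA.rand(Float32, 4)
x = A \ b

# Kronecker product of dense GPU matrices (#2177).
B = CUDA.rand(Float32, 2, 2)
K = kron(A, B)  # an 8×8 CuMatrix
```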
Closed issues:
- Element-wise conversion to Duals (#127)
- IDEA: CuHostArray (#28)
- Make Ref pass by-reference (#267)
- Failed to compile PTX code when using NSight on Win11 (#1601)
- view(data, idx) boundschecking is disproportionately expensive (#1678)
- [CUSOLVER] Add a with_workspaces function to allocate two buffers (Device / Host) (#1767)
- Trouble using nsight systems for profiling CUDA in Julia (#1779)
- dlopen("libcudart") results in duplicate libraries (#1814)
- Support for JLD2 (#1833)
- Windows Defender mis-labels artifacts as threat (#1836)
- Support Cholesky factorization of CuSparseMatrixCSR (#1855)
- Runtime not re-selected after driver upgrade (#1877)
- Failure to initialize with CUDA_VISIBLE_DEVICES='' (#1945)
- Cannot precompile GPU code with PrecompileTools (#2006)
- Evaluating sparse matrices in the REPL has a huge memory footprint (#2016)
- CUDA_SDK_jll: cuda.h in different locations depending on the platform (#2066)
- `StaticArrays.SHermitianCompact` not working in kernels in Julia 1.10.0-beta2 (#2069)
- Support for LinearAlgebra.pinv (#2070)
- PTX ISA 8.1 support (#2080)
- Segmentation fault when importing CUDA (#2083)
- "No system CUDA driver found" on NixOS (#2089)
- `CUDA.rand(Int64, m, n)` cannot be used when `m` or `n` is zero (#2093)
- Missing CUDA_Runtime_Discovery as a dependency in cuDNN (#2094)
- Binaries for Jetson (#2105)
- Minimum/maximum of array of NaNs is infinity (#2111)
- Performance regression for multiple @sync copyto! on CUDA v5 (#2112)
- [CUBLAS] Regenerate the wrappers with updated argument types (#2115)
- More informative errors when parameter size is too big (#2119)
- Unable to allocate unified memory buffers (#2120)
- CUDA 12.3 has been released (#2122)
- atomic min, max for Float32 and Float64 (#2129)
- Native profiler output is limited to around 100 columns when printing to a file (#2130)
- Intermittent CI failure: Segfault during nonblocking synchronization (#2141)
- LLVM generates max.NaN which only works on sm_80 (#2148)
- Unified memory-related error on Tegra T194 (#2149)
- Errors on sm_61 (#2150)
- First test for Julia/CUDA with 15 failures (#2158)
- High CPU load during GPU synchronization (#2161)
- Modifying `struct` containing `CuArray` fails in threads in 5.0.0 and 5.1.0 (#2171)
- Update to CUTENSOR 2.0 (#2174)
- Matmul of CuArray{ComplexF32} and CuArray{Float32} is slow (#2175)
- Support for combining duplicate elements in sparse matrices (#2185)
- Interactive sessions: periodically trim the memory pool (#2190)
- Broadcast does not preserve buffer type (#2191)
- CUDA doesn't precompile on Julia nightly/1.11 (#2195)
- Latest julia: UndefVarError: `make_seed` not defined in `Random` (#2198)
- NVTX-related segfault on Windows under compute-sanitizer (#2204)
- CUDA installation fails on Apple Silicon/Julia 1.10 (#2211)
- Most recent package versions not supported on CUDA.jl (#2212)
- Testing of CUDA fails (#2222)
- Tests fail for CUDA#master (#2223)
- `--debug-info=2` makes NNlibCUDACUDNNExt precompilation run forever (#2225)
- Test failures on Nvidia GH200 (#2227)
- mul! should support strided outputs (#2230)
- Please add support for older cuda versions (cuda 8 and older) (#2231)
- NSight Compute: prevent API calls during precompilation (#2233)
- Integrated profiler: detect lack of permissions (#2237)
- Inverse Complex-to-Real FFT allocates GPU memory (#2249)
- cuDNN not available for your platform (#2252)
- Cannot reset CuArray to zero (#2257)
- Cannot take gradient of `sort` on 2D CuArray (#2259)
- Multi-threaded code hanging forever with Julia 1.10 (#2261)
- CUBLAS: nrm2 support for StridedCuArray with length requiring Int64 (#2268)
- Adjoint not supported on Diagonal arrays (#2275)
- Regression in broadcast: getting Array (Julia 1.10) instead of CuArray (Julia 1.9) (#2276)
- Release v5.3? (#2283)
- Wrap CUDSS? (#2287)
- Bug concerning broadcast between device array and unified array (#2289)
- `StackOverflowError` trying to throw `OutOfGPUMemoryError`, subsequent errors (#2292)
- BUG: sortperm! seems to perform much slower than it should (#2293)
- Multiplying `CuSparseMatrixCSC` by `CuMatrix` results in `Out of GPU memory` (#2296)
- BFloat16 support broken on Julia 1.11 (#2306)