CUDA v4.4.2
Merged pull requests:
- Added support for more transform directions (#1903) (@RainerHeintzmann)
- CuSparseArrayCSR (N dim array) with batched matmatmul (bmm) (#1944) (@nikopj)
- Add some performance tips to the documentation (#1999) (@Zentrik)
- Re-introduce the 'blocking' kwarg to `@sync`. (#2060) (@maleadt)
- Adapt to GPUCompiler#master. (#2062) (@maleadt)
- Batched SVD added (gesvdjBatched and gesvdaStridedBatched) (#2063) (@nikopj)
- Use released GPUCompiler. (#2064) (@maleadt)
- Fixes for Windows. (#2065) (@maleadt)
- Switch to GPUArrays buffer management. (#2068) (@maleadt)
- Update CUDA 12 to Update 2. (#2071) (@maleadt)
- [CUSOLVER] Add generic routines (#2074) (@amontoison)
- Update manifest (#2076) (@github-actions[bot])
- Test improvements (#2079) (@maleadt)
- Rework and extend the cooperative groups API. (#2081) (@maleadt)
- Update manifest (#2082) (@github-actions[bot])
- [CUSOLVER] Add a method for geqrf! (#2085) (@amontoison)
- Fix some typos in performance tips (#2086) (@Zentrik)
- Improve PTX ISA selection (#2088) (@maleadt)
- Update manifest (#2090) (@github-actions[bot])
- support ChainRulesCore inplaceability (#2091) (@piever)
- Add a method inv(CuMatrix) (#2095) (@amontoison)
- Add mul!(A, B, C) where B or C is a diagonal matrix (#2096) (@amontoison)
- Add CUDA_Runtime_Discovery dependency to sublibraries. (#2097) (@maleadt)
- Handle and test zero-size inputs to RNGs. (#2098) (@maleadt)
- Add a with_workspaces function (#2099) (@amontoison)
- [CUSOLVER] Add a method for getrf! (#2100) (@amontoison)
- [CUSOLVER] Fix a typo with jobu / jobvt in gesvd (#2101) (@amontoison)
- Call exit when handling exceptions. (#2103) (@maleadt)
- Bump packages. (#2104) (@maleadt)
- Bump actions/checkout from 3 to 4 (#2106) (@dependabot[bot])
- Update manifest (#2107) (@github-actions[bot])
- Make Ref mutable on the GPU. (#2109) (@maleadt)
- CompatHelper: bump compat for CEnum to 0.5, (keep existing compat) (#2110) (@github-actions[bot])
- Small profiler improvements (#2113) (@maleadt)
- Update manifest (#2114) (@github-actions[bot])
- [CUSPARSE] Wrap new functions added with CUDA 12.2 (#2116) (@amontoison)
- [CUSOLVER] Add new methods for \ and inv (#2117) (@amontoison)
- Fix incorrect timing results for CUDA.@elapsed (#2118) (@thomasfaingnaert)
- [CUSOLVER] Interface sparse Cholesky and QR factorizations (#2121) (@amontoison)
- Update manifest (#2123) (@github-actions[bot])
- Profiler: Show used local memory. (#2124) (@maleadt)
- Support for CUDA 12.3 (#2125) (@maleadt)
- [CUSOLVER] Add Xsyevdx! and Xgesvdr! (#2127) (@amontoison)
- [CUSOLVER] Add Xgesvdp (#2128) (@amontoison)
- Profiler: don't crop when rendering to a file. (#2131) (@maleadt)
- Regenerate headers for CUDA 12.3. (#2132) (@maleadt)
- [CUSPARSE] Fix a bug with triangular solves (#2134) (@amontoison)
- CompatHelper: add new compat entry for Statistics at version 1, (keep existing compat) (#2135) (@github-actions[bot])
- CompatHelper: add new compat entry for LazyArtifacts at version 1, (keep existing compat) (#2136) (@github-actions[bot])
- Profiler: Parse and visualize NVTX marker data. (#2137) (@maleadt)
- Better support for unified and host memory (#2138) (@maleadt)
- Profiler: Improve compatibility with Pluto.jl and friends. (#2139) (@maleadt)
- Avoid allocations during derived array construction. (#2142) (@maleadt)
- More performance tweaks for memory copying (#2143) (@maleadt)
- Don't use libdevice's fmin/fmax. (#2144) (@maleadt)
- Update documentation (#2146) (@maleadt)
- Fixes for sm_61 (#2151) (@maleadt)
- Update sparse factorizations (#2152) (@amontoison)
- Don't call into LLVM's fmin/fmax on <sm_80. (#2154) (@maleadt)
- Only prefetch unified memory if concurrent access is possible. (#2155) (@maleadt)
- Support wrapping an Array with a CuArray without HMM. (#2156) (@maleadt)
- Sanitizer improvements. (#2157) (@maleadt)
- [CUSPARSE] Update the wrapper of cusparseSpSV_updateMatrix (#2159) (@amontoison)
- Profiler improvements: (textual) time distribution, `@bprofile`. (#2162) (@maleadt)
- [CUSPARSE] Update the interface for triangular solves (#2164) (@amontoison)
- [CUSPARSE] Remove code related to old CUDA toolkits (#2165) (@amontoison)
- Detect compute-exclusive mode and adjust testing. (#2166) (@maleadt)
- expand docs on launch parameters (#2167) (@simonbyrne)
- Make CUDA.set_runtime_version force the default behavior. (#2169) (@maleadt)
- kernel docs: fix formatting, clean up awkward sentence (#2172) (@simonbyrne)
- [CUSOLVER] Don't reuse the sparse handles (#2173) (@amontoison)
- Added kronecker product support for dense matrices (#2177) (@albertomercurio)
- Update to CUTENSOR 2.0 (#2178) (@maleadt)
- Fix typos and simplify wording in performance tips docs (#2179) (@Zentrik)
- provide more information on kernel compilation error (#2180) (@simonbyrne)
- [CUSPARSE] Test CUSPARSE_SPMV_COO_ALG2 (#2182) (@amontoison)
- [CUSPARSE] Use cusparseSpMM_preprocess (#2183) (@amontoison)
- [CUSPARSE] Use cusparseSDDMM_preprocess (#2184) (@amontoison)
- Add the structures ILU0Info() and IC0Info() for the preconditioners (#2187) (@amontoison)
- [CUSOLVER] Add a structure CuSolverParameters for the generic API (#2188) (@amontoison)
- Support more kwarg syntax with kernel launches (#2189) (@maleadt)
- Fix typo in docs/src/development/troubleshooting.md (#2193) (@jcsahnwaldt)
- NVML: Add support for clock queries. (#2194) (@maleadt)
- Fix Random.jl seeding for 1.11 (#2199) (@IanButterworth)
- Improvements to context handling (#2200) (@maleadt)
- Add a concurrent kwarg to profiling macros. (#2201) (@maleadt)
- Rework unique context management. (#2202) (@maleadt)
- Preserve the buffer type when broadcasting. (#2203) (@maleadt)
- Fixes for Windows (#2206) (@maleadt)
- Bump Aqua. (#2207) (@maleadt)
- Updates for new CUQUANTUM (#2210) (@kshyatt)
- CUSPARSE: Eagerly combine duplicate elements on construction. (#2213) (@maleadt)
- CompatHelper: bump compat for BFloat16s to 0.5, (keep existing compat) (#2214) (@github-actions[bot])
- Bump the CUDA Runtime for CUDA 12.3.2. (#2217) (@maleadt)
- Default to testing with only a single device. (#2221) (@maleadt)
- Backports for v5.1 (#2224) (@maleadt)
- Take care not to spawn tasks during precompilation. (#2226) (@maleadt)
- cuTensor fixes (#2228) (@maleadt)
- Bump versions. (#2229) (@maleadt)
- Add a note about threaded for-blocks. (#2232) (@kshyatt)
- cuTENSOR plan handling changes. (#2234) (@maleadt)
- Fix dynamic dispatch issues (#2235) (@MilesCranmer)
- CUPTI: Add high-level wrappers for the callback API. (#2239) (@maleadt)
- Fixes for nightly (#2240) (@maleadt)
- CUBLAS: Support more strided inputs (#2242) (@maleadt)
- CuSparseArrayCSR (fixed cat ambiguities from #1944) (#2244) (@nikopj)
- Slightly rework error handling (#2245) (@maleadt)
- cuTENSOR improvements (#2246) (@maleadt)
- Make `@device_code_sass` work with non-Julia kernels. (#2247) (@maleadt)
- Improve Tegra detection. (#2251) (@maleadt)
- Added a few SparseArrays functions (#2254) (@albertomercurio)
- Reduce locking in the handle cache (#2256) (@maleadt)
- Mark all CUDA ccalls as GC safe (#2262) (@vchuravy)
- cuTENSOR: Fix reference to undefined variable (#2263) (@lkdvos)
- cuTENSOR: refactor obtaining compute_type as part of plan (#2264) (@lkdvos)
- Re-generate headers. (#2265) (@maleadt)
- Update to CUDNN 9. (#2267) (@maleadt)
- [CUBLAS] Use the ILP64 API with CUDA 12 (#2270) (@amontoison)
- CompatHelper: bump compat for GPUCompiler to 0.26, (keep existing compat) (#2271) (@github-actions[bot])
- Minor improvements to nonblocking synchronization. (#2272) (@maleadt)
- Add extension package for StaticArrays (#2273) (@trahflow)
- Fix cuTensor, cuTensorNet and cuStateVec when using local Toolkit (#2274) (@bjoe2k4)
- Cached workspace prototype for custatevec (#2279) (@kshyatt)
- Update the Julia wrappers for v12.4 (#2282) (@amontoison)
- Add support for CUDA 12.4. (#2286) (@maleadt)
- Test suite changes (#2288) (@maleadt)
- Fix mixed-buffer/mixed-shape broadcasts. (#2290) (@maleadt)
- Fix typo in performance tips (#2294) (@Zentrik)
- Make it possible to customize the CuIterator adaptor. (#2297) (@maleadt)
- Set default buffer size in CUSPARSE `mm!` functions (#2298) (@lpawela)
- Avoid OOMs during OOM handling. (#2299) (@maleadt)
- [CUSOLVER] Add tests for geqrf, orgqr and ormqr (#2300) (@amontoison)
- [CUSOLVER] Interface larft! (#2301) (@amontoison)
- Fix RNG determinism when using wrapped arrays. (#2307) (@maleadt)
- [CUBLAS] Interface gemm_grouped_batched (#2310) (@amontoison)
- [CUSPARSE] Add a method convert for the type cusparseSpSMUpdate_t (#2311) (@amontoison)
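Several of the dense linear-algebra additions in this release (`inv(CuMatrix)` from #2095, generic `\` from #2117, and dense `kron` from #2177) plug into the standard LinearAlgebra API. A minimal illustrative sketch, not part of the release notes, assuming a CUDA-capable GPU:

```julia
using CUDA, LinearAlgebra

# Dense matrix inverse on the GPU (#2095).
A = CuArray(rand(Float32, 4, 4) + 4I)  # diagonally dominant, so well-conditioned
Ainv = inv(A)

# Generic left-division, dispatched to CUSOLVER (#2117).
b = CUDA.rand(Float32, 4)
x = A \ b

# Kronecker product of dense GPU matrices (#2177).
B = CUDA.rand(Float32, 2, 2)
K = kron(A, B)  # an 8×8 CuMatrix
```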
Closed issues:
- Element-wise conversion to Duals (#127)
- IDEA: CuHostArray (#28)
- Make Ref pass by-reference (#267)
- Failed to compile PTX code when using NSight on Win11 (#1601)
- view(data, idx) boundschecking is disproportionately expensive (#1678)
- [CUSOLVER] Add a with_workspaces function to allocate two buffers (Device / Host) (#1767)
- Trouble using nsight systems for profiling CUDA in Julia (#1779)
- dlopen("libcudart") results in duplicate libraries (#1814)
- Support for JLD2 (#1833)
- Windows Defender mis-labels artifacts as threat (#1836)
- Support Cholesky factorization of CuSparseMatrixCSR (#1855)
- Runtime not re-selected after driver upgrade (#1877)
- Failure to initialize with CUDA_VISIBLE_DEVICES='' (#1945)
- Cannot precompile GPU code with PrecompileTools (#2006)
- Evaluating sparse matrices in the REPL has a huge memory footprint (#2016)
- CUDA_SDK_jll: cuda.h in different locations depending on the platform (#2066)
- `StaticArrays.SHermitianCompact` not working in kernels in Julia 1.10.0-beta2 (#2069)
- Support for LinearAlgebra.pinv (#2070)
- PTX ISA 8.1 support (#2080)
- Segmentation fault when importing CUDA (#2083)
- "No system CUDA driver found" on NixOS (#2089)
- `CUDA.rand(Int64, m, n)` cannot be used when `m` or `n` is zero (#2093)
- Missing CUDA_Runtime_Discovery as a dependency in cuDNN (#2094)
- Binaries for Jetson (#2105)
- Minimum/maximum of array of NaNs is infinity (#2111)
- Performance regression for multiple @sync copyto! on CUDA v5 (#2112)
- [CUBLAS] Regenerate the wrappers with updated argument types (#2115)
- More informative errors when parameter size is too big (#2119)
- Unable to allocate unified memory buffers (#2120)
- CUDA 12.3 has been released (#2122)
- atomic min, max for Float32 and Float64 (#2129)
- Native profiler output is limited to around 100 columns when printing to a file (#2130)
- Intermittent CI failure: Segfault during nonblocking synchronization (#2141)
- LLVM generates max.NaN which only works on sm_80 (#2148)
- Unified memory-related error on Tegra T194 (#2149)
- Errors on sm_61 (#2150)
- First test for Julia/CUDA with 15 failures (#2158)
- High CPU load during GPU synchronization (#2161)
- Modifying `struct` containing `CuArray` fails in threads in 5.0.0 and 5.1.0 (#2171)
- Update to CUTENSOR 2.0 (#2174)
- Matmul of CuArray{ComplexF32} and CuArray{Float32} is slow (#2175)
- Support for combining duplicate elements in sparse matrices (#2185)
- Interactive sessions: periodically trim the memory pool (#2190)
- Broadcast does not preserve buffer type (#2191)
- CUDA doesn't precompile on Julia nightly/1.11 (#2195)
- Latest julia: UndefVarError: `make_seed` not defined in `Random` (#2198)
- NVTX-related segfault on Windows under compute-sanitizer (#2204)
- CUDA installation fails on Apple Silicon/Julia 1.10 (#2211)
- Most recent package versions not supported on CUDA.jl (#2212)
- Testing of CUDA fails (#2222)
- Tests fail for CUDA#master (#2223)
- `--debug-info=2` makes NNlibCUDACUDNNExt precompilation run forever (#2225)
- Test failures on Nvidia GH200 (#2227)
- mul! should support strided outputs (#2230)
- Please add support for older cuda versions (cuda 8 and older) (#2231)
- NSight Compute: prevent API calls during precompilation (#2233)
- Integrated profiler: detect lack of permissions (#2237)
- Inverse Complex-to-Real FFT allocates GPU memory (#2249)
- cuDNN not available for your platform (#2252)
- Cannot reset CuArray to zero (#2257)
- Cannot take gradient of `sort` on 2D CuArray (#2259)
- Multi-threaded code hanging forever with Julia 1.10 (#2261)
- CUBLAS: nrm2 support for StridedCuArray with length requiring Int64 (#2268)
- Adjoint not supported on Diagonal arrays (#2275)
- Regression in broadcast: getting Array (Julia 1.10) instead of CuArray (Julia 1.9) (#2276)
- Release v5.3? (#2283)
- Wrap CUDSS? (#2287)
- Bug concerning broadcast between device array and unified array (#2289)
- `StackOverflowError` trying to throw `OutOfGPUMemoryError`, subsequent errors (#2292)
- BUG: sortperm! seems to perform much slower than it should (#2293)
- Multiplying `CuSparseMatrixCSC` by `CuMatrix` results in `Out of GPU memory` (#2296)
- BFloat16 support broken on Julia 1.11 (#2306)