
Tags: JuliaGPU/CUDA.jl

## CUDA v5.4.3

[Diff since v5.4.2](v5.4.2...v5.4.3)


**Merged pull requests:**
- add cublas<t>getrsBatched (#2385) (@bjarthur)
- add two quirks for rationals (#2403) (@lanceXwq)
- Bump cuDNN (#2404) (@maleadt)
- Add convert method for ScaledPlan (#2409) (@david-macmahon)
- Conditionalize a quirk. (#2411) (@maleadt)
- Relax signature of generic matvecmul! (#2414) (@dkarrasch)
- Fix kron launch configuration. (#2418) (@maleadt)
- Run full GC when under very high memory pressure. (#2421) (@maleadt)
- Enzyme: Fix cuarray return type (#2425) (@wsmoses)
- CompatHelper: bump compat for LLVM to 8, (keep existing compat) (#2426) (@github-actions[bot])
- pre-allocated pivot and info buffers for getrf_batched (#2431) (@bjarthur)
- Profiler tweaks. (#2432) (@maleadt)
- Update the Julia wrappers for CUDA v12.5.1 (#2436) (@amontoison)
- Correct workspace handling (#2437) (@maleadt)

**Closed issues:**
- Legacy cuIpc* APIs incompatible with stream-ordered allocator (#1053)
- Broadcasted multiplication with a rational doesn't work (#1926)
- Incorrect grid size in `kron` (#2410)
- GEMM of non-contiguous inputs should dispatch to fallback implementation (#2412)
- Failure of Eigenvalue Decomposition for Large Matrices (#2413)
- CUDA_Driver_jll's lazy artifacts cause a precompilation-time warning (#2415)
- Recurrence of integer overflow bug (#1880) for a large matrix (#2427)
- CUDA kernel crash very occasionally when MPI.jl is just loaded. (#2429)
- CUDA_Runtime_Discovery Did not find cupti on Arm system with nvhpc (#2433)
- CUDA.jl won't install/run on Jetson Orin NX (#2435)
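The rational-number quirks added in #2403 (closing #1926) can be illustrated with a short sketch. This is not code from the PR; the array values and sizes are illustrative, and the GPU part only runs when a device is available:

```julia
using CUDA

# Sketch of the behavior fixed by #2403 (closing #1926): broadcasting a
# Rational scalar against a CuArray now promotes to the array's element
# type instead of erroring.
if CUDA.functional()
    x = CUDA.ones(Float32, 4)
    y = x .* 1//2                        # Rational scalar in a GPU broadcast
    @assert Array(y) == fill(0.5f0, 4)   # result promoted to Float32
end
```

The same promotion rule applies on the CPU: `1.0f0 * 1//2 == 0.5f0`, so broadcasts mixing `Float32` arrays and `Rational` scalars stay `Float32`.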

## CUDA v5.4.2

[Diff since v5.4.1](v5.4.1...v5.4.2)


**Merged pull requests:**
- Fix and test the legacy memory pool. (#2402) (@maleadt)

## CUDA v5.4.1

[Diff since v5.4.0](v5.4.0...v5.4.1)


**Merged pull requests:**
- Fixup Enzyme: Mark CuArray as noalias (#2401) (@wsmoses)

## CUDA v5.4.0

[Diff since v5.3.5](v5.3.5...v5.4.0)


**Merged pull requests:**
- Support CUDA 12.5 (#2392) (@maleadt)
- Mark cuarray as noalias (#2395) (@wsmoses)
- Update Julia wrappers for CUDA v12.5 (#2396) (@amontoison)
- Enable correct pool access for cublasXt. (#2398) (@maleadt)
- More fine-grained CUPTI version checks. (#2399) (@maleadt)

**Closed issues:**
- CUTENSOR breaks after device_reset! (#2319)
- cuBLASXt's `xt_gemm!` incompatible with stream-ordered allocated memory (#2320)
- Add helper function to recompile CUDA stack (#2364)

## CUDA v5.3.5

[Diff since v5.3.4](v5.3.4...v5.3.5)


**Merged pull requests:**
- Avoid constructing `MulAddMul`s on Julia v1.12+ (#2277) (@dkarrasch)
- CompatHelper: bump compat for LLVM to 7, (keep existing compat) (#2365) (@github-actions[bot])
- Enzyme: allocation functions (#2386) (@wsmoses)
- Tweaks to prevent context construction on some operations (#2387) (@maleadt)
- Fixes for Julia 1.12 / LLVM 17 (#2390) (@maleadt)
- CUBLAS: Make sure CUBLASLt wrappers use the correct library. (#2391) (@maleadt)
- Backport: Enzyme allocation fns (#2393) (@wsmoses)

**Closed issues:**
- Indexing a view uses scalar indexing (#1472)
- EnzymeCore is an unconditional dependency. (#2380)
- cuBLASLt wrappers ccall into cuBLAS (#2388)
- generic_trimatmul! error (#2389)

## CUDA v5.3.4

[Diff since v5.3.3](v5.3.3...v5.3.4)


**Merged pull requests:**
- Add Enzyme Forward mode custom rule (#1869) (@wsmoses)
- Handle cache improvements (#2352) (@maleadt)
- Fix cuTensorNet compat (#2354) (@maleadt)
- Optimize array allocation. (#2355) (@maleadt)
- Change type restrictions in cuTENSOR operations (#2356) (@lkdvos)
- Bump julia-actions/setup-julia from 1 to 2 (#2357) (@dependabot[bot])
- Suggest use of 32 bit types over 64 instead of just Float32 over Float64 [skip ci] (#2358) (@Zentrik)
- Make generic_trimatmul more specific (#2359) (@tgymnich)
- Return the correct memory type when wrapping system memory. (#2363) (@maleadt)
- Mark cublas version/handle as non-differentiable (#2368) (@wsmoses)
- Enzyme: Forward mode sync (#2369) (@wsmoses)
- Enzyme: support fill (#2371) (@wsmoses)
- unsafe_wrap: unconditionally use the memory type provided by the user. (#2372) (@maleadt)
- Remove external_gvars. (#2373) (@maleadt)
- Tegra support with artifacts (#2374) (@maleadt)
- Backport Enzyme extension (#2375) (@wsmoses)
- Add note about --check-bounds=yes (#2378) (@Zinoex)
- Test Enzyme in a separate CI job. (#2379) (@maleadt)
- Fix tests for Tegra. (#2381) (@maleadt)
- Update Project.toml [remove EnzymeCore unconditional dep] (#2382) (@wsmoses)

**Closed issues:**
- Native Softmax (#175)
- CUSOLVER: support eigendecomposition (#173)
- backslash with gpu matrices crashes julia (#161)
- at-benchmark captures GPU arrays (#156)
- Support kernels returning Union{} (#62)
- mul! falls back to generic implementation (#148)
- \ on qr factorization objects gives a method error (#138)
- Compiler failure if dependent module only contains a `japi1`  function (#49)
- copy!(dst, src) and copyto!(dst, src) are significantly slower and allocate more memory than copyto!(dest, do, src, so[, N]) (#126)
- Calling Flux.gpu on a view dumps core (#125)
- Creating `CuArray{Tracker.TrackedReal{Float64},1}` a few times causes segfaults (#121)
- Guard against exceeding maximum kernel parameter size (#32)
- Detect common API misuse in error handlers (#31)
- `rand` and friends default to `Float64` (#108)
- \ does not work for least squares (#104)
- ERROR_ILLEGAL_ADDRESS when broadcasting modular arithmetic (#94)
- CuIterator assumes batches to consist of multiple arrays (#86)
- Algebra with UniformScaling Uses Generic Fallback Scalar Indexing (#85)
- Document (un)supported language features for kernel programming (#13)
- Missing dispatch for indexing of reshaped arrays (#556)
- Track array ownership to avoid illegal memory accesses (#763)
- NVPTX i128 support broken on LLVM 11 / Julia 1.6 (#793)
- Support for `sm_80` `cp.async`: asynchronous on-device copies (#850)
- Profiling Julia with Nsight Systems on Windows results in blank window (#862)
- sort! and partialsort! are considerably slower than CPU versions (#937)
- mul! does not dispatch on Adjoint (#1363)
- Cross-device copy of wrapped arrays fails (#1377)
- Memory allocation becomes very slow when reserved bytes is large (#1540)
- Cannot reclaim GPU Memory; CUDA.reclaim() (#1562)
- Add eigen for general purpose computation of eigenvectors/eigenvalues (#1572)
- device_reset! does not seem to work anymore (#1579)
- device-side rand() are not random between successive kernel launches  (#1633)
- Add EnzymeRules support for CUDA.jl (for forward mode here) (#1811)
- `cusparseSetStream_v2` not defined (#1820)
- Feature request: Integrating the latest CUDA library "cuLitho" into CUDA.jl (#1821)
- KernelAbstractions.jl-related issues (#1838)
- lock failing in multithreaded plan_fft() (#1921)
- CUSolver finalizer tries to take ReentrantLock (#1923)
- Testsuite could be more careful about parallel testing (#2192)
- Opportunistic GC collection (#2303)
- Unable to use local CUDA runtime toolkit (#2367)
- Enzyme prevents testing on 1.11 (#2376)

## CUDA v5.3.3

[Diff since v5.3.2](v5.3.2...v5.3.3)


**Merged pull requests:**
- Rework context handling (#2346) (@maleadt)
- fix kernel launch logic (#2353) (@xaellison)

**Closed issues:**
- Excessive allocations when running on multiple threads  (#1429)
- Fix and test multigpu support (#2218)
- Bitonic sort exceeds launch resources (#2331)

## CUDA v5.3.2

[Diff since v5.3.1](v5.3.1...v5.3.2)


**Merged pull requests:**
- Add EnzymeCore extension for parent_job (#2281) (@vchuravy)
- Consider running GC when allocating and synchronizing (#2304) (@maleadt)
- Refactor memory wrappers (#2335) (@maleadt)
- Auto-detect external profilers. (#2339) (@maleadt)
- Fix performance of indexing unified memory. (#2340) (@maleadt)
- Improve exception output (#2342) (@maleadt)
- Test multigpu on CI (#2348) (@maleadt)
- cuQuantum 24.3: Bump cuTensorNet. (#2350) (@maleadt)
- cuQuantum 24.3: Bump cuStateVec. (#2351) (@maleadt)

**Closed issues:**
- CuArrays don't seem to display correctly in VS code (#875)
- Task scheduling can result in delays when synchronizing (#1525)
- Docs: add example on task-based parallelism with explicit synchronization (#1566)
- Exception output from many threads is not helpful (#1780)
- Autodetect external profiler (#2176)
- LazyInitialized is not GC-safe (#2216)
- Track CuArray stream usage (#2236)
- Improve cross-device usage (#2323)
- CUBLASLt wrapper for `cublasLtMatmulDescSetAttribute` can have device buffers as input (#2337)
- Improve error message when assigning real valued array with complex numbers (#2341)
- `@device_code_sass` broken (#2343)
- Readme says Cuda 11 is supported but also the last version to support it is v4.4 (#2345)
- `@gcsafe_ccall` breaks inlining of ccall wrappers (#2347)

## CUDA v5.3.1

[Diff since v5.3.0](v5.3.0...v5.3.1)


**Merged pull requests:**
- [CUSOLVER] Fix the dispatch for syevd! and heevd! (#2309) (@amontoison)
- Regenerate headers (#2324) (@maleadt)
- Add some installation tips to docs/README.md (#2326) (@jlchan)
- fix broadcast defaulting to Mem.Unified() (#2327) (@vpuri3)
- Diagnose kernel limits on launch failure. (#2329) (@maleadt)
- Work around a CUPTI bug in CUDA 12.4 Update 1. (#2330) (@maleadt)

**Closed issues:**
- Missing CUBLASLt wrappers (#2322)
- error when switching device (#2323)
- v5.3.0: regression in Zygote performance (#2333)

## CUDA v5.3.0

[Diff since v5.2.0](v5.2.0...v5.3.0)


**Merged pull requests:**
- CuSparseArrayCSR (fixed cat ambiguities from #1944) (#2244) (@nikopj)
- Slightly rework error handling (#2245) (@maleadt)
- cuTENSOR improvements (#2246) (@maleadt)
- Make `@device_code_sass` work with non-Julia kernels. (#2247) (@maleadt)
- Improve Tegra detection. (#2251) (@maleadt)
- Added few SparseArrays functions (#2254) (@albertomercurio)
- Reduce locking in the handle cache (#2256) (@maleadt)
- Mark all CUDA ccalls as GC safe (#2262) (@vchuravy)
- cuTENSOR: Fix reference to undefined variable (#2263) (@lkdvos)
- cuTENSOR: refactor obtaining compute_type as part of plan (#2264) (@lkdvos)
- Re-generate headers. (#2265) (@maleadt)
- Update to CUDNN 9. (#2267) (@maleadt)
- [CUBLAS] Use the ILP64 API with CUDA 12 (#2270) (@amontoison)
- CompatHelper: bump compat for GPUCompiler to 0.26, (keep existing compat) (#2271) (@github-actions[bot])
- Minor improvements to nonblocking synchronization. (#2272) (@maleadt)
- Add extension package for StaticArrays (#2273) (@trahflow)
- Fix cuTensor, cuTensorNet and cuStateVec when using local Toolkit (#2274) (@bjoe2k4)
- Cached workspace prototype for custatevec (#2279) (@kshyatt)
- Update the Julia wrappers for v12.4 (#2282) (@amontoison)
- Add support for CUDA 12.4. (#2286) (@maleadt)
- Test suite changes (#2288) (@maleadt)
- Fix mixed-buffer/mixed-shape broadcasts. (#2290) (@maleadt)
- Towards supporting Julia 1.11 (#2291) (@maleadt)
- Fix typo in performance tips (#2294) (@Zentrik)
- Make it possible to customize the CuIterator adaptor. (#2297) (@maleadt)
- Set default buffer size in `CUSPARSE` `mm!` functions (#2298) (@lpawela)
- Avoid OOMs during OOM handling. (#2299) (@maleadt)
- [CUSOLVER] Add tests for geqrf, orgqr and ormqr (#2300) (@amontoison)
- [CUSOLVER] Interface larft! (#2301) (@amontoison)
- Fix RNG determinism when using wrapped arrays. (#2307) (@maleadt)
- sortperm with dims (#2308) (@xaellison)
- [CUBLAS] Interface gemm_grouped_batched (#2310) (@amontoison)
- [CUSPARSE] Add a method convert for the type cusparseSpSMUpdate_t (#2311) (@amontoison)
- Avoid capturing `AbstractArray`s in `BoundsError` (#2314) (@lcw)
- Clarify debug level hint. (#2316) (@maleadt)

**Closed issues:**
- Failed to compile PTX code when using NSight on Win11 (#1601)
- `sortperm` fails with `dims` keyword (#2061)
- NVTX-related segfault on Windows under compute-sanitizer (#2204)
- Inverse Complex-to-Real FFT allocates GPU memory (#2249)
- cuDNN not available for your platform (#2252)
- Cannot reset CuArray to zero (#2257)
- Cannot take gradient of `sort` on 2D CuArray  (#2259)
- Multi-threaded code hanging forever with Julia 1.10  (#2261)
- CUBLAS: nrm2 support for StridedCuArray with length requiring Int64 (#2268)
- Adjoint not supported on Diagonal arrays (#2275)
- Regression in broadcast: getting Array (Julia 1.10) instead of CuArray (Julia 1.9) (#2276)
- Release v5.3? (#2283)
- Wrap CUDSS? (#2287)
- Bug concerning broadcast between device array and unified array (#2289)
- `StackOverflowError` trying to throw `OutOfGPUMemoryError`, subsequent errors (#2292)
- BUG: sortperm! seems to perform much slower than it should (#2293)
- Multiplying `CuSparseMatrixCSC` by `CuMatrix` results in `Out of GPU memory` (#2296)
- BFloat16 support broken on Julia 1.11 (#2306)
- does not emit line info for debugging/profiling (#2312)
- Kernel using `StaticArray` compiles in julia v1.9.4 but not in v1.10.2 (#2313)
- Using copyto! with SharedArray trigger scalar indexing disallowed error (#2317)
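The `sortperm` support for the `dims` keyword added in #2308 (closing #2061) mirrors Base's `sortperm(A; dims)` (available since Julia 1.9). A hedged sketch with an illustrative matrix; the GPU part only runs when a device is available:

```julia
using CUDA

# Sketch of #2308 (closing #2061): `sortperm` on a CuArray now accepts
# the `dims` keyword, matching Base semantics.
A = [3 1; 2 4]
p = sortperm(A; dims=1)              # CPU reference: per-column permutation
@assert A[p] == sort(A; dims=1)

if CUDA.functional()
    dA = cu(A)
    dp = sortperm(dA; dims=1)        # previously threw; now works on the GPU
    @assert Array(dA[dp]) == sort(A; dims=1)
end
```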