
Tags: JuliaGPU/CUDA.jl

## CUDA v5.4.3

[Diff since v5.4.2](v5.4.2...v5.4.3)


**Merged pull requests:**
- add cublas<t>getrsBatched (#2385) (@bjarthur)
- add two quirks for rationals (#2403) (@lanceXwq)
- Bump cuDNN (#2404) (@maleadt)
- Add convert method for ScaledPlan (#2409) (@david-macmahon)
- Conditionalize a quirk. (#2411) (@maleadt)
- Relax signature of generic matvecmul! (#2414) (@dkarrasch)
- Fix kron launch configuration. (#2418) (@maleadt)
- Run full GC when under very high memory pressure. (#2421) (@maleadt)
- Enzyme: Fix cuarray return type (#2425) (@wsmoses)
- CompatHelper: bump compat for LLVM to 8, (keep existing compat) (#2426) (@github-actions[bot])
- pre-allocated pivot and info buffers for getrf_batched (#2431) (@bjarthur)
- Profiler tweaks. (#2432) (@maleadt)
- Update the Julia wrappers for CUDA v12.5.1 (#2436) (@amontoison)
- Correct workspace handling (#2437) (@maleadt)

**Closed issues:**
- Legacy cuIpc* APIs incompatible with stream-ordered allocator (#1053)
- Broadcasted multiplication with a rational doesn't work (#1926)
- Incorrect grid size in `kron` (#2410)
- GEMM of non-contiguous inputs should dispatch to fallback implementation (#2412)
- Failure of Eigenvalue Decomposition for Large Matrices (#2413)
- CUDA_Driver_jll's lazy artifacts cause a precompilation-time warning (#2415)
- Recurrence of integer overflow bug (#1880) for a large matrix (#2427)
- CUDA kernel crash very occasionally when MPI.jl is just loaded. (#2429)
- CUDA_Runtime_Discovery Did not find cupti on Arm system with nvhpc (#2433)
- CUDA.jl won't install/run on Jetson Orin NX (#2435)
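The rational-number quirks added in #2403 (closing #1926) can be illustrated with a short sketch. This is not code from the PR; the array values and sizes are illustrative, and the GPU part only runs when a device is available:

```julia
using CUDA

# Sketch of the behavior fixed by #2403 (closing #1926): broadcasting a
# Rational scalar against a CuArray now promotes to the array's element
# type instead of erroring.
if CUDA.functional()
    x = CUDA.ones(Float32, 4)
    y = x .* 1//2                        # Rational scalar in a GPU broadcast
    @assert Array(y) == fill(0.5f0, 4)   # result promoted to Float32
end
```

The same promotion rule applies on the CPU: `1.0f0 * 1//2 == 0.5f0`, so broadcasts mixing `Float32` arrays and `Rational` scalars stay `Float32`.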

## CUDA v5.4.2

[Diff since v5.4.1](v5.4.1...v5.4.2)


**Merged pull requests:**
- Fix and test the legacy memory pool. (#2402) (@maleadt)

## CUDA v5.4.1

[Diff since v5.4.0](v5.4.0...v5.4.1)


**Merged pull requests:**
- Fixup Enzyme: Mark CuArray as noalias (#2401) (@wsmoses)

## CUDA v5.4.0

[Diff since v5.3.5](v5.3.5...v5.4.0)


**Merged pull requests:**
- Support CUDA 12.5 (#2392) (@maleadt)
- Mark cuarray as noalias (#2395) (@wsmoses)
- Update Julia wrappers for CUDA v12.5 (#2396) (@amontoison)
- Enable correct pool access for cublasXt. (#2398) (@maleadt)
- More fine-grained CUPTI version checks. (#2399) (@maleadt)

**Closed issues:**
- CUTENSOR breaks after device_reset! (#2319)
- cuBLASXt's `xt_gemm!` incompatible with stream-ordered allocated memory (#2320)
- Add helper function to recompile CUDA stack (#2364)

## CUDA v5.3.5

[Diff since v5.3.4](v5.3.4...v5.3.5)


**Merged pull requests:**
- Avoid constructing `MulAddMul`s on Julia v1.12+ (#2277) (@dkarrasch)
- CompatHelper: bump compat for LLVM to 7, (keep existing compat) (#2365) (@github-actions[bot])
- Enzyme: allocation functions (#2386) (@wsmoses)
- Tweaks to prevent context construction on some operations (#2387) (@maleadt)
- Fixes for Julia 1.12 / LLVM 17 (#2390) (@maleadt)
- CUBLAS: Make sure CUBLASLt wrappers use the correct library. (#2391) (@maleadt)
- Backport: Enzyme allocation fns (#2393) (@wsmoses)

**Closed issues:**
- Indexing a view uses scalar indexing (#1472)
- EnzymeCore is an unconditional dependency. (#2380)
- cuBLASLt wrappers ccall into cuBLAS (#2388)
- generic_trimatmul! error (#2389)

## CUDA v5.3.4

[Diff since v5.3.3](v5.3.3...v5.3.4)


**Merged pull requests:**
- Add Enzyme Forward mode custom rule (#1869) (@wsmoses)
- Handle cache improvements (#2352) (@maleadt)
- Fix cuTensorNet compat (#2354) (@maleadt)
- Optimize array allocation. (#2355) (@maleadt)
- Change type restrictions in cuTENSOR operations (#2356) (@lkdvos)
- Bump julia-actions/setup-julia from 1 to 2 (#2357) (@dependabot[bot])
- Suggest use of 32 bit types over 64 instead of just Float32 over Float64 [skip ci] (#2358) (@Zentrik)
- Make generic_trimatmul more specific (#2359) (@tgymnich)
- Return the correct memory type when wrapping system memory. (#2363) (@maleadt)
- Mark cublas version/handle as non-differentiable (#2368) (@wsmoses)
- Enzyme: Forward mode sync (#2369) (@wsmoses)
- Enzyme: support fill (#2371) (@wsmoses)
- unsafe_wrap: unconditionally use the memory type provided by the user. (#2372) (@maleadt)
- Remove external_gvars. (#2373) (@maleadt)
- Tegra support with artifacts (#2374) (@maleadt)
- Backport Enzyme extension (#2375) (@wsmoses)
- Add note about --check-bounds=yes (#2378) (@Zinoex)
- Test Enzyme in a separate CI job. (#2379) (@maleadt)
- Fix tests for Tegra. (#2381) (@maleadt)
- Update Project.toml [remove EnzymeCore unconditional dep] (#2382) (@wsmoses)

**Closed issues:**
- Native Softmax (#175)
- CUSOLVER: support eigendecomposition (#173)
- backslash with gpu matrices crashes julia (#161)
- at-benchmark captures GPU arrays (#156)
- Support kernels returning Union{} (#62)
- mul! falls back to generic implementation (#148)
- \ on qr factorization objects gives a method error (#138)
- Compiler failure if dependent module only contains a `japi1`  function (#49)
- copy!(dst, src) and copyto!(dst, src) are significantly slower and allocate more memory than copyto!(dest, do, src, so[, N]) (#126)
- Calling Flux.gpu on a view dumps core (#125)
- Creating `CuArray{Tracker.TrackedReal{Float64},1}` a few times causes segfaults (#121)
- Guard against exceeding maximum kernel parameter size (#32)
- Detect common API misuse in error handlers (#31)
- `rand` and friends default to `Float64` (#108)
- \ does not work for least squares (#104)
- ERROR_ILLEGAL_ADDRESS when broadcasting modular arithmetic (#94)
- CuIterator assumes batches to consist of multiple arrays (#86)
- Algebra with UniformScaling Uses Generic Fallback Scalar Indexing (#85)
- Document (un)supported language features for kernel programming (#13)
- Missing dispatch for indexing of reshaped arrays (#556)
- Track array ownership to avoid illegal memory accesses (#763)
- NVPTX i128 support broken on LLVM 11 / Julia 1.6 (#793)
- Support for `sm_80` `cp.async`: asynchronous on-device copies (#850)
- Profiling Julia with Nsight Systems on Windows results in blank window (#862)
- sort! and partialsort! are considerably slower than CPU versions (#937)
- mul! does not dispatch on Adjoint (#1363)
- Cross-device copy of wrapped arrays fails (#1377)
- Memory allocation becomes very slow when reserved bytes is large (#1540)
- Cannot reclaim GPU Memory; CUDA.reclaim() (#1562)
- Add eigen for general purpose computation of eigenvectors/eigenvalues (#1572)
- device_reset! does not seem to work anymore (#1579)
- device-side rand() are not random between successive kernel launches  (#1633)
- Add EnzymeRules support for CUDA.jl (for forward mode here) (#1811)
- `cusparseSetStream_v2` not defined (#1820)
- Feature request: Integrating the latest CUDA library "cuLitho" into CUDA.jl (#1821)
- KernelAbstractions.jl-related issues (#1838)
- lock failing in multithreaded plan_fft() (#1921)
- CUSolver finalizer tries to take ReentrantLock (#1923)
- Testsuite could be more careful about parallel testing (#2192)
- Opportunistic GC collection (#2303)
- Unable to use local CUDA runtime toolkit (#2367)
- Enzyme prevents testing on 1.11 (#2376)

## CUDA v5.3.3

[Diff since v5.3.2](v5.3.2...v5.3.3)


**Merged pull requests:**
- Rework context handling (#2346) (@maleadt)
- fix kernel launch logic (#2353) (@xaellison)

**Closed issues:**
- Excessive allocations when running on multiple threads  (#1429)
- Fix and test multigpu support (#2218)
- Bitonic sort exceeds launch resources (#2331)

## CUDA v5.3.2

[Diff since v5.3.1](v5.3.1...v5.3.2)


**Merged pull requests:**
- Add EnzymeCore extension for parent_job (#2281) (@vchuravy)
- Consider running GC when allocating and synchronizing (#2304) (@maleadt)
- Refactor memory wrappers (#2335) (@maleadt)
- Auto-detect external profilers. (#2339) (@maleadt)
- Fix performance of indexing unified memory. (#2340) (@maleadt)
- Improve exception output (#2342) (@maleadt)
- Test multigpu on CI (#2348) (@maleadt)
- cuQuantum 24.3: Bump cuTensorNet. (#2350) (@maleadt)
- cuQuantum 24.3: Bump cuStateVec. (#2351) (@maleadt)

**Closed issues:**
- CuArrays don't seem to display correctly in VS code (#875)
- Task scheduling can result in delays when synchronizing (#1525)
- Docs: add example on task-based parallelism with explicit synchronization (#1566)
- Exception output from many threads is not helpful (#1780)
- Autodetect external profiler (#2176)
- LazyInitialized is not GC-safe (#2216)
- Track CuArray stream usage (#2236)
- Improve cross-device usage (#2323)
- CUBLASLt wrapper for `cublasLtMatmulDescSetAttribute` can have device buffers as input (#2337)
- Improve error message when assigning real valued array with complex numbers (#2341)
- `@device_code_sass` broken (#2343)
- Readme says Cuda 11 is supported but also the last version to support it is v4.4 (#2345)
- `@gcsafe_ccall` breaks inlining of ccall wrappers (#2347)

## CUDA v5.3.1

[Diff since v5.3.0](v5.3.0...v5.3.1)


**Merged pull requests:**
- [CUSOLVER] Fix the dispatch for syevd! and heevd! (#2309) (@amontoison)
- Regenerate headers (#2324) (@maleadt)
- Add some installation tips to docs/README.md (#2326) (@jlchan)
- fix broadcast defaulting to Mem.Unified() (#2327) (@vpuri3)
- Diagnose kernel limits on launch failure. (#2329) (@maleadt)
- Work around a CUPTI bug in CUDA 12.4 Update 1. (#2330) (@maleadt)

**Closed issues:**
- Missing CUBLASLt wrappers (#2322)
- error when switching device (#2323)
- v5.3.0: regression in Zygote performance (#2333)

## CUDA v5.3.0

[Diff since v5.2.0](v5.2.0...v5.3.0)


**Merged pull requests:**
- CuSparseArrayCSR (fixed cat ambiguities from #1944) (#2244) (@nikopj)
- Slightly rework error handling (#2245) (@maleadt)
- cuTENSOR improvements (#2246) (@maleadt)
- Make `@device_code_sass` work with non-Julia kernels. (#2247) (@maleadt)
- Improve Tegra detection. (#2251) (@maleadt)
- Added few SparseArrays functions (#2254) (@albertomercurio)
- Reduce locking in the handle cache (#2256) (@maleadt)
- Mark all CUDA ccalls as GC safe (#2262) (@vchuravy)
- cuTENSOR: Fix reference to undefined variable (#2263) (@lkdvos)
- cuTENSOR: refactor obtaining compute_type as part of plan (#2264) (@lkdvos)
- Re-generate headers. (#2265) (@maleadt)
- Update to CUDNN 9. (#2267) (@maleadt)
- [CUBLAS] Use the ILP64 API with CUDA 12 (#2270) (@amontoison)
- CompatHelper: bump compat for GPUCompiler to 0.26, (keep existing compat) (#2271) (@github-actions[bot])
- Minor improvements to nonblocking synchronization. (#2272) (@maleadt)
- Add extension package for StaticArrays (#2273) (@trahflow)
- Fix cuTensor, cuTensorNet and cuStateVec when using local Toolkit (#2274) (@bjoe2k4)
- Cached workspace prototype for custatevec (#2279) (@kshyatt)
- Update the Julia wrappers for v12.4 (#2282) (@amontoison)
- Add support for CUDA 12.4. (#2286) (@maleadt)
- Test suite changes (#2288) (@maleadt)
- Fix mixed-buffer/mixed-shape broadcasts. (#2290) (@maleadt)
- Towards supporting Julia 1.11 (#2291) (@maleadt)
- Fix typo in performance tips (#2294) (@Zentrik)
- Make it possible to customize the CuIterator adaptor. (#2297) (@maleadt)
- Set default buffer size in `CUSPARSE` `mm!` functions (#2298) (@lpawela)
- Avoid OOMs during OOM handling. (#2299) (@maleadt)
- [CUSOLVER] Add tests for geqrf, orgqr and ormqr (#2300) (@amontoison)
- [CUSOLVER] Interface larft! (#2301) (@amontoison)
- Fix RNG determinism when using wrapped arrays. (#2307) (@maleadt)
- sortperm with dims (#2308) (@xaellison)
- [CUBLAS] Interface gemm_grouped_batched (#2310) (@amontoison)
- [CUSPARSE] Add a method convert for the type cusparseSpSMUpdate_t (#2311) (@amontoison)
- Avoid capturing `AbstractArray`s in `BoundsError` (#2314) (@lcw)
- Clarify debug level hint. (#2316) (@maleadt)

**Closed issues:**
- Failed to compile PTX code when using NSight on Win11 (#1601)
- `sortperm` fails with `dims` keyword (#2061)
- NVTX-related segfault on Windows under compute-sanitizer (#2204)
- Inverse Complex-to-Real FFT allocates GPU memory (#2249)
- cuDNN not available for your platform (#2252)
- Cannot reset CuArray to zero (#2257)
- Cannot take gradient of `sort` on 2D CuArray  (#2259)
- Multi-threaded code hanging forever with Julia 1.10  (#2261)
- CUBLAS: nrm2 support for StridedCuArray with length requiring Int64 (#2268)
- Adjoint not supported on Diagonal arrays (#2275)
- Regression in broadcast: getting Array (Julia 1.10) instead of CuArray (Julia 1.9) (#2276)
- Release v5.3? (#2283)
- Wrap CUDSS? (#2287)
- Bug concerning broadcast between device array and unified array (#2289)
- `StackOverflowError` trying to throw `OutOfGPUMemoryError`, subsequent errors (#2292)
- BUG: sortperm! seems to perform much slower than it should (#2293)
- Multiplying `CuSparseMatrixCSC` by `CuMatrix` results in `Out of GPU memory` (#2296)
- BFloat16 support broken on Julia 1.11 (#2306)
- does not emit line info for debugging/profiling (#2312)
- Kernel using `StaticArray` compiles in julia v1.9.4 but not in v1.10.2 (#2313)
- Using copyto! with SharedArray trigger scalar indexing disallowed error (#2317)
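The `sortperm` support for the `dims` keyword added in #2308 (closing #2061) mirrors Base's `sortperm(A; dims)` (available since Julia 1.9). A hedged sketch with an illustrative matrix; the GPU part only runs when a device is available:

```julia
using CUDA

# Sketch of #2308 (closing #2061): `sortperm` on a CuArray now accepts
# the `dims` keyword, matching Base semantics.
A = [3 1; 2 4]
p = sortperm(A; dims=1)              # CPU reference: per-column permutation
@assert A[p] == sort(A; dims=1)

if CUDA.functional()
    dA = cu(A)
    dp = sortperm(dA; dims=1)        # previously threw; now works on the GPU
    @assert Array(dA[dp]) == sort(A; dims=1)
end
```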