Tags: JuliaGPU/CUDA.jl
## CUDA v5.4.3

[Diff since v5.4.2](v5.4.2...v5.4.3)

**Merged pull requests:**

- add `cublas<t>getrsBatched` (#2385) (@bjarthur)
- add two quirks for rationals (#2403) (@lanceXwq)
- Bump cuDNN (#2404) (@maleadt)
- Add convert method for ScaledPlan (#2409) (@david-macmahon)
- Conditionalize a quirk. (#2411) (@maleadt)
- Relax signature of generic matvecmul! (#2414) (@dkarrasch)
- Fix kron launch configuration. (#2418) (@maleadt)
- Run full GC when under very high memory pressure. (#2421) (@maleadt)
- Enzyme: Fix cuarray return type (#2425) (@wsmoses)
- CompatHelper: bump compat for LLVM to 8, (keep existing compat) (#2426) (@github-actions[bot])
- pre-allocated pivot and info buffers for getrf_batched (#2431) (@bjarthur)
- Profiler tweaks. (#2432) (@maleadt)
- Update the Julia wrappers for CUDA v12.5.1 (#2436) (@amontoison)
- Correct workspace handling (#2437) (@maleadt)

**Closed issues:**

- Legacy cuIpc* APIs incompatible with stream-ordered allocator (#1053)
- Broadcasted multiplication with a rational doesn't work (#1926)
- Incorrect grid size in `kron` (#2410)
- GEMM of non-contiguous inputs should dispatch to fallback implementation (#2412)
- Failure of Eigenvalue Decomposition for Large Matrices. (#2413)
- CUDA_Driver_jll's lazy artifacts cause a precompilation-time warning (#2415)
- Recurrence of integer overflow bug (#1880) for a large matrix (#2427)
- CUDA kernel crash very occasionally when MPI.jl is just loaded. (#2429)
- CUDA_Runtime_Discovery Did not find cupti on Arm system with nvhpc (#2433)
- CUDA.jl won't install/run on Jetson Orin NX (#2435)
## CUDA v5.4.0

[Diff since v5.3.5](v5.3.5...v5.4.0)

**Merged pull requests:**

- Support CUDA 12.5 (#2392) (@maleadt)
- Mark cuarray as noalias (#2395) (@wsmoses)
- Update Julia wrappers for CUDA v12.5 (#2396) (@amontoison)
- Enable correct pool access for cublasXt. (#2398) (@maleadt)
- More fine-grained CUPTI version checks. (#2399) (@maleadt)

**Closed issues:**

- CUTENSOR breaks after device_reset! (#2319)
- cuBLASXt's `xt_gemm!` incompatible with stream-ordered allocated memory (#2320)
- Add helper function to recompile CUDA stack (#2364)
## CUDA v5.3.5

[Diff since v5.3.4](v5.3.4...v5.3.5)

**Merged pull requests:**

- Avoid constructing `MulAddMul`s on Julia v1.12+ (#2277) (@dkarrasch)
- CompatHelper: bump compat for LLVM to 7, (keep existing compat) (#2365) (@github-actions[bot])
- Enzyme: allocation functions (#2386) (@wsmoses)
- Tweaks to prevent context construction on some operations (#2387) (@maleadt)
- Fixes for Julia 1.12 / LLVM 17 (#2390) (@maleadt)
- CUBLAS: Make sure CUBLASLt wrappers use the correct library. (#2391) (@maleadt)
- Backport: Enzyme allocation fns (#2393) (@wsmoses)

**Closed issues:**

- Indexing a view uses scalar indexing (#1472)
- EnzymeCore is an unconditional dependency. (#2380)
- cuBLASLt wrappers ccall into cuBLAS (#2388)
- generic_trimatmul! error (#2389)
## CUDA v5.3.4

[Diff since v5.3.3](v5.3.3...v5.3.4)

**Merged pull requests:**

- Add Enzyme Forward mode custom rule (#1869) (@wsmoses)
- Handle cache improvements (#2352) (@maleadt)
- Fix cuTensorNet compat (#2354) (@maleadt)
- Optimize array allocation. (#2355) (@maleadt)
- Change type restrictions in cuTENSOR operations (#2356) (@lkdvos)
- Bump julia-actions/setup-julia from 1 to 2 (#2357) (@dependabot[bot])
- Suggest use of 32 bit types over 64 instead of just Float32 over Float64 [skip ci] (#2358) (@Zentrik)
- Make generic_trimatmul more specific (#2359) (@tgymnich)
- Return the currect memory type when wrapping system memory. (#2363) (@maleadt)
- Mark cublas version/handle as non-differentiable (#2368) (@wsmoses)
- Enzyme: Forward mode sync (#2369) (@wsmoses)
- Enzyme: support fill (#2371) (@wsmoses)
- unsafe_wrap: unconditionally use the memory type provided by the user. (#2372) (@maleadt)
- Remove external_gvars. (#2373) (@maleadt)
- Tegra support with artifacts (#2374) (@maleadt)
- Backport Enzyme extension (#2375) (@wsmoses)
- Add note about --check-bounds=yes (#2378) (@Zinoex)
- Test Enzyme in a separate CI job. (#2379) (@maleadt)
- Fix tests for Tegra. (#2381) (@maleadt)
- Update Project.toml [remove EnzymeCore unconditional dep] (#2382) (@wsmoses)

**Closed issues:**

- Native Softmax (#175)
- CUSOLVER: support eigendecomposition (#173)
- backslash with gpu matrices crashes julia (#161)
- at-benchmark captures GPU arrays (#156)
- Support kernels returning Union{} (#62)
- mul! falls back to generic implementation (#148)
- \ on qr factorization objects gives a method error (#138)
- Compiler failure if dependent module only contains a `japi1` function (#49)
- copy!(dst, src) and copyto!(dst, src) are significantly slower and allocate more memory than copyto!(dest, do, src, so[, N]) (#126)
- Calling Flux.gpu on a view dumps core (#125)
- Creating `CuArray{Tracker.TrackedReal{Float64},1}` a few times causes segfaults (#121)
- Guard against exceeding maximum kernel parameter size (#32)
- Detect common API misuse in error handlers (#31)
- `rand` and friends default to `Float64` (#108)
- \ does not work for least squares (#104)
- ERROR_ILLEGAL_ADDRESS when broadcasting modular arithmetic (#94)
- CuIterator assumes batches to consist of multiple arrays (#86)
- Algebra with UniformScaling Uses Generic Fallback Scalar Indexing (#85)
- Document (un)supported language features for kernel programming (#13)
- Missing dispatch for indexing of reshaped arrays (#556)
- Track array ownership to avoid illegal memory accesses (#763)
- NVPTX i128 support broken on LLVM 11 / Julia 1.6 (#793)
- Support for `sm_80` `cp.async`: asynchronous on-device copies (#850)
- Profiling Julia with Nsight Systems on Windows results in blank window (#862)
- sort! and partialsort! are considerably slower than CPU versions (#937)
- mul! does not dispatch on Adjoint (#1363)
- Cross-device copy of wrapped arrays fails (#1377)
- Memory allocation becomes very slow when reserved bytes is large (#1540)
- Cannot reclaim GPU Memory; CUDA.reclaim() (#1562)
- Add eigen for general purpose computation of eigenvectors/eigenvalues (#1572)
- device_reset! does not seem to work anymore (#1579)
- device-side rand() are not random between successive kernel launches (#1633)
- Add EnzymeRules support for CUDA.jl (for forward mode here) (#1811)
- `cusparseSetStream_v2` not defined (#1820)
- Feature request: Integrating the latest CUDA library "cuLitho" into CUDA.jl (#1821)
- KernelAbstractions.jl-related issues (#1838)
- lock failing in multithreaded plan_fft() (#1921)
- CUSolver finalizer tries to take ReentrantLock (#1923)
- Testsuite could be more careful about parallel testing (#2192)
- Opportunistic GC collection (#2303)
- Unable to use local CUDA runtime toolkit (#2367)
- Enzyme prevents testing on 1.11 (#2376)
## CUDA v5.3.3

[Diff since v5.3.2](v5.3.2...v5.3.3)

**Merged pull requests:**

- Rework context handling (#2346) (@maleadt)
- fix kernel launch logic (#2353) (@xaellison)

**Closed issues:**

- Excessive allocations when running on multiple threads (#1429)
- Fix and test multigpu support (#2218)
- Bitonic sort exceeds launch resources (#2331)
## CUDA v5.3.2

[Diff since v5.3.1](v5.3.1...v5.3.2)

**Merged pull requests:**

- Add EnzymeCore extension for parent_job (#2281) (@vchuravy)
- Consider running GC when allocating and synchronizing (#2304) (@maleadt)
- Refactor memory wrappers (#2335) (@maleadt)
- Auto-detect external profilers. (#2339) (@maleadt)
- Fix performance of indexing unified memory. (#2340) (@maleadt)
- Improve exception output (#2342) (@maleadt)
- Test multigpu on CI (#2348) (@maleadt)
- cuQuantum 24.3: Bump cuTensorNet. (#2350) (@maleadt)
- cuQuantum 24.3: Bump cuStateVec. (#2351) (@maleadt)

**Closed issues:**

- CuArrays don't seem to display correctly in VS code (#875)
- Task scheduling can result in delays when synchronizing (#1525)
- Docs: add example on task-based parallelism with explicit synchronization (#1566)
- Exception output from many threads is not helpful (#1780)
- Autodetect external profiler (#2176)
- LazyInitialized is not GC-safe (#2216)
- Track CuArray stream usage (#2236)
- Improve cross-device usage (#2323)
- CUBLASLt wrapper for `cublasLtMatmulDescSetAttribute` can have device buffers as input (#2337)
- Improve error message when assigning real valued arrray with complex numbers (#2341)
- `@device_code_sass` broken (#2343)
- Readme says Cuda 11 is supported but also the last version to support it is v4.4 (#2345)
- `@gcsafe_ccall` breaks inlining of ccall wrappers (#2347)
## CUDA v5.3.1

[Diff since v5.3.0](v5.3.0...v5.3.1)

**Merged pull requests:**

- [CUSOLVER] Fix the dispatch for syevd! and heevd! (#2309) (@amontoison)
- Regenerate headers (#2324) (@maleadt)
- Add some installation tips to docs/README.md (#2326) (@jlchan)
- fix broadcast defaulting to Mem.Unified() (#2327) (@vpuri3)
- Diagnose kernel limits on launch failure. (#2329) (@maleadt)
- Work around a CUPTI bug in CUDA 12.4 Update 1. (#2330) (@maleadt)

**Closed issues:**

- Missing CUBLASLt wrappers (#2322)
- error when switching device (#2323)
- v5.3.0: regression in Zygote performance (#2333)
## CUDA v5.3.0

[Diff since v5.2.0](v5.2.0...v5.3.0)

**Merged pull requests:**

- CuSparseArrayCSR (fixed cat ambiguitites from #1944) (#2244) (@nikopj)
- Slightly rework error handling (#2245) (@maleadt)
- cuTENSOR improvements (#2246) (@maleadt)
- Make `@device_code_sass` work with non-Julia kernels. (#2247) (@maleadt)
- Improve Tegra detection. (#2251) (@maleadt)
- Added few SparseArrays functions (#2254) (@albertomercurio)
- Reduce locking in the handle cache (#2256) (@maleadt)
- Mark all CUDA ccalls as GC safe (#2262) (@vchuravy)
- cuTENSOR: Fix reference to undefined variable (#2263) (@lkdvos)
- cuTENSOR: refactor obtaining compute_type as part of plan (#2264) (@lkdvos)
- Re-generate headers. (#2265) (@maleadt)
- Update to CUDNN 9. (#2267) (@maleadt)
- [CUBLAS] Use the ILP64 API with CUDA 12 (#2270) (@amontoison)
- CompatHelper: bump compat for GPUCompiler to 0.26, (keep existing compat) (#2271) (@github-actions[bot])
- Minor improvements to nonblocking synchronization. (#2272) (@maleadt)
- Add extension package for StaticArrays (#2273) (@trahflow)
- Fix cuTensor, cuTensorNet and cuStateVec when using local Toolkit (#2274) (@bjoe2k4)
- Cached workspace prototype for custatevec (#2279) (@kshyatt)
- Update the Julia wrappers for v12.4 (#2282) (@amontoison)
- Add support for CUDA 12.4. (#2286) (@maleadt)
- Test suite changes (#2288) (@maleadt)
- Fix mixed-buffer/mixed-shape broadcasts. (#2290) (@maleadt)
- Towards supporting Julia 1.11 (#2291) (@maleadt)
- Fix typo in performance tips (#2294) (@Zentrik)
- Make it possible to customize the CuIterator adaptor. (#2297) (@maleadt)
- Set default buffer size in `CUSPARSE` `mm!` functions (#2298) (@lpawela)
- Avoid OOMs during OOM handling. (#2299) (@maleadt)
- [CUSOLVER] Add tests for geqrf, orgqr and ormqr (#2300) (@amontoison)
- [CUSOLVER] Interface larft! (#2301) (@amontoison)
- Fix RNG determinism when using wrapped arrays. (#2307) (@maleadt)
- sortperm with dims (#2308) (@xaellison)
- [CUBLAS] Interface gemm_grouped_batched (#2310) (@amontoison)
- [CUSPARSE] Add a method convert for the type cusparseSpSMUpdate_t (#2311) (@amontoison)
- Avoid capturing `AbstractArray`s in `BoundsError` (#2314) (@lcw)
- Clarify debug level hint. (#2316) (@maleadt)

**Closed issues:**

- Failed to compile PTX code when using NSight on Win11 (#1601)
- `sortperm` fails with `dims` keyword (#2061)
- NVTX-related segfault on Windows under compute-sanitizer (#2204)
- Inverse Complex-to-Real FFT allocates GPU memory (#2249)
- cuDNN not available for your platform (#2252)
- Cannot reset CuArray to zero (#2257)
- Cannot take gradient of `sort` on 2D CuArray (#2259)
- Multi-threaded code hanging forever with Julia 1.10 (#2261)
- CUBLAS: nrm2 support for StridedCuArray with length requiring Int64 (#2268)
- Adjoint not supported on Diagonal arrays (#2275)
- Regression in broadcast: getting Array (Julia 1.10) instead of CuArray (Julia 1.9) (#2276)
- Release v5.3? (#2283)
- Wrap CUDSS? (#2287)
- Bug concerning broadcast between device array and unified array (#2289)
- `StackOverflowError` trying to throw `OutOfGPUMemoryError`, subsequent errors (#2292)
- BUG: sortperm! seems to perform much slower than it should (#2293)
- Multiplying `CuSparseMatrixCSC` by `CuMatrix` results in `Out of GPU memory` (#2296)
- BFloat16 support broken on Julia 1.11 (#2306)
- does not emit line info for debbuging/profiling (#2312)
- Kernel using `StaticArray` compiles in julia v1.9.4 but not in v1.10.2 (#2313)
- Using copyto! with SharedArray trigger scalar indexing disallowed error (#2317)