Support Cholesky factorization of CuSparseMatrixCSR #1855

Closed
shakedregev opened this issue Apr 4, 2023 · 12 comments
Labels
cuda libraries (Stuff about CUDA library wrappers.) · enhancement (New feature or request) · help wanted (Extra attention is needed)

Comments

shakedregev commented Apr 4, 2023

Describe the bug

When you factorize a GPU matrix, the output lands on the CPU. I believe
https://github.com/JuliaGPU/CUDA.jl/blob/6d2e9ab2f2a1bb1df7791bb30178e0fe956940a3/lib/cusolver/linalg.jl#L484 is the only mention of Cholesky in the entire package, and it looks like it is coded to work this way (copy to the CPU, then factorize). This is not the desired behavior, as it defeats the point of doing the operation on the GPU.

To reproduce

The Minimal Working Example (MWE) for this bug:

using Random
using SparseArrays
using LinearAlgebra
using CUDA
using CUDA.CUSPARSE

n = 4  # number of variables
rng = MersenneTwister(1235) # this is the seed
Q1 = sprand(rng, n, n, 1.0/n)
Q = Q1'*Q1 + I
A = CuSparseMatrixCSR(Q)
cholA = cholesky(A)
b = CUDA.fill(1.0, n)
x = cholA\b
println("Type of cholA: ", typeof(cholA))
println("Type of b ", typeof(b))
println("Type of x ", typeof(x))

Output

Type of cholA: Cholesky{Float64, Matrix{Float64}}
Type of b CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}
Type of x Vector{Float64}
Manifest.toml

# This file is machine-generated - editing it directly is not advised

julia_version = "1.9.0-alpha1"
manifest_format = "2.0"
project_hash = "fe1c204c663e26329f68ccc2911a2d13920bea4e"

[[deps.AbstractFFTs]]
deps = ["ChainRulesCore", "LinearAlgebra"]
git-tree-sha1 = "69f7020bd72f069c219b5e8c236c1fa90d2cb409"
uuid = "621f4979-c628-5d54-868e-fcf4e3e8185c"
version = "1.2.1"

[[deps.Adapt]]
deps = ["LinearAlgebra"]
git-tree-sha1 = "0310e08cb19f5da31d08341c6120c047598f5b9c"
uuid = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
version = "3.5.0"

[[deps.ArgTools]]
uuid = "0dad84c5-d112-42e6-8d28-ef12dabb789f"
version = "1.1.1"

[[deps.Artifacts]]
uuid = "56f22d72-fd6d-98f1-02f0-08ddc0907c33"

[[deps.BFloat16s]]
deps = ["LinearAlgebra", "Printf", "Random", "Test"]
git-tree-sha1 = "dbf84058d0a8cbbadee18d25cf606934b22d7c66"
uuid = "ab4f0b2a-ad5b-11e8-123f-65d77653426b"
version = "0.4.2"

[[deps.Base64]]
uuid = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f"

[[deps.CEnum]]
git-tree-sha1 = "eb4cb44a499229b3b8426dcfb5dd85333951ff90"
uuid = "fa961155-64e5-5f13-b03f-caf6b980ea82"
version = "0.4.2"

[[deps.CUDA]]
deps = ["AbstractFFTs", "Adapt", "BFloat16s", "CEnum", "CUDA_Driver_jll", "CUDA_Runtime_Discovery", "CUDA_Runtime_jll", "CompilerSupportLibraries_jll", "ExprTools", "GPUArrays", "GPUCompiler", "LLVM", "LazyArtifacts", "Libdl", "LinearAlgebra", "Logging", "Preferences", "Printf", "Random", "Random123", "RandomNumbers", "Reexport", "Requires", "SparseArrays", "SpecialFunctions"]
git-tree-sha1 = "edff14c60784c8f7191a62a23b15a421185bc8a8"
uuid = "052768ef-5323-5732-b1bb-66c8b64840ba"
version = "4.0.1"

[[deps.CUDA_Driver_jll]]
deps = ["Artifacts", "JLLWrappers", "LazyArtifacts", "Libdl", "Pkg"]
git-tree-sha1 = "75d7896d1ec079ef10d3aee8f3668c11354c03a1"
uuid = "4ee394cb-3365-5eb0-8335-949819d2adfc"
version = "0.2.0+0"

[[deps.CUDA_Runtime_Discovery]]
deps = ["Libdl"]
git-tree-sha1 = "58dd8ec29f54f08c04b052d2c2fa6760b4f4b3a4"
uuid = "1af6417a-86b4-443c-805f-a4643ffb695f"
version = "0.1.1"

[[deps.CUDA_Runtime_jll]]
deps = ["Artifacts", "CUDA_Driver_jll", "JLLWrappers", "LazyArtifacts", "Libdl", "Pkg", "TOML"]
git-tree-sha1 = "d3e6ccd30f84936c1a3a53d622d85d7d3f9b9486"
uuid = "76a88914-d11a-5bdc-97e0-2f5a05c973a2"
version = "0.2.3+2"

[[deps.ChainRulesCore]]
deps = ["Compat", "LinearAlgebra", "SparseArrays"]
git-tree-sha1 = "c6d890a52d2c4d55d326439580c3b8d0875a77d9"
uuid = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4"
version = "1.15.7"

[[deps.ChangesOfVariables]]
deps = ["ChainRulesCore", "LinearAlgebra", "Test"]
git-tree-sha1 = "844b061c104c408b24537482469400af6075aae4"
uuid = "9e997f8a-9a97-42d5-a9f1-ce6bfc15e2c0"
version = "0.1.5"

[[deps.Compat]]
deps = ["Dates", "LinearAlgebra", "UUIDs"]
git-tree-sha1 = "61fdd77467a5c3ad071ef8277ac6bd6af7dd4c04"
uuid = "34da2185-b29b-5c13-b0c7-acf172513d20"
version = "4.6.0"

[[deps.CompilerSupportLibraries_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "e66e0078-7015-5450-92f7-15fbd957f2ae"
version = "1.0.1+0"

[[deps.Dates]]
deps = ["Printf"]
uuid = "ade2ca70-3891-5945-98fb-dc099432e06a"

[[deps.DocStringExtensions]]
deps = ["LibGit2"]
git-tree-sha1 = "2fb1e02f2b635d0845df5d7c167fec4dd739b00d"
uuid = "ffbed154-4ef7-542d-bbb7-c09d3a79fcae"
version = "0.9.3"

[[deps.Downloads]]
deps = ["ArgTools", "FileWatching", "LibCURL", "NetworkOptions"]
uuid = "f43a241f-c20a-4ad4-852c-f6b1247861c6"
version = "1.6.0"

[[deps.ExprTools]]
git-tree-sha1 = "56559bbef6ca5ea0c0818fa5c90320398a6fbf8d"
uuid = "e2ba6199-217a-4e67-a87a-7c52f15ade04"
version = "0.1.8"

[[deps.FileIO]]
deps = ["Pkg", "Requires", "UUIDs"]
git-tree-sha1 = "7be5f99f7d15578798f338f5433b6c432ea8037b"
uuid = "5789e2e9-d7fb-5bc7-8068-2c6fae9b9549"
version = "1.16.0"

[[deps.FileWatching]]
uuid = "7b1f6079-737a-58dc-b8bc-7a2ca5c1b5ee"

[[deps.GPUArrays]]
deps = ["Adapt", "GPUArraysCore", "LLVM", "LinearAlgebra", "Printf", "Random", "Reexport", "Serialization", "Statistics"]
git-tree-sha1 = "4dfaff044eb2ce11a897fecd85538310e60b91e6"
uuid = "0c68f7d7-f131-5f86-a1c3-88cf8149b2d7"
version = "8.6.2"

[[deps.GPUArraysCore]]
deps = ["Adapt"]
git-tree-sha1 = "57f7cde02d7a53c9d1d28443b9f11ac5fbe7ebc9"
uuid = "46192b85-c4d5-4398-a991-12ede77f4527"
version = "0.1.3"

[[deps.GPUCompiler]]
deps = ["ExprTools", "InteractiveUtils", "LLVM", "Libdl", "Logging", "TimerOutputs", "UUIDs"]
git-tree-sha1 = "95185985a5d2388c6d0fedb06181ad4ddd40e0cb"
uuid = "61eb1bfa-7361-4325-ad38-22787b887f55"
version = "0.17.2"

[[deps.InteractiveUtils]]
deps = ["Markdown"]
uuid = "b77e0a4c-d291-57a0-90e8-8db25a27a240"

[[deps.InverseFunctions]]
deps = ["Test"]
git-tree-sha1 = "49510dfcb407e572524ba94aeae2fced1f3feb0f"
uuid = "3587e190-3f89-42d0-90ee-14403ec27112"
version = "0.1.8"

[[deps.IrrationalConstants]]
git-tree-sha1 = "7fd44fd4ff43fc60815f8e764c0f352b83c49151"
uuid = "92d709cd-6900-40b7-9082-c6be49f344b6"
version = "0.1.1"

[[deps.JLLWrappers]]
deps = ["Preferences"]
git-tree-sha1 = "abc9885a7ca2052a736a600f7fa66209f96506e1"
uuid = "692b3bcd-3c85-4b1f-b108-f13ce0eb3210"
version = "1.4.1"

[[deps.Krylov]]
deps = ["LinearAlgebra", "Printf", "SparseArrays"]
path = "/home/shakedregev/.julia/dev/Krylov"
uuid = "ba0b0d4f-ebba-5204-a429-3ac8c609bfb7"
version = "0.9.0"

[[deps.LLVM]]
deps = ["CEnum", "LLVMExtra_jll", "Libdl", "Printf", "Unicode"]
git-tree-sha1 = "df115c31f5c163697eede495918d8e85045c8f04"
uuid = "929cbde3-209d-540e-8aea-75f648917ca0"
version = "4.16.0"

[[deps.LLVMExtra_jll]]
deps = ["Artifacts", "JLLWrappers", "LazyArtifacts", "Libdl", "Pkg", "TOML"]
git-tree-sha1 = "771bfe376249626d3ca12bcd58ba243d3f961576"
uuid = "dad2f222-ce93-54a1-a47d-0025e8a3acab"
version = "0.0.16+0"

[[deps.LazyArtifacts]]
deps = ["Artifacts", "Pkg"]
uuid = "4af54fe1-eca0-43a8-85a7-787d91b784e3"

[[deps.LibCURL]]
deps = ["LibCURL_jll", "MozillaCACerts_jll"]
uuid = "b27032c2-a3e7-50c8-80cd-2d36dbcbfd21"
version = "0.6.3"

[[deps.LibCURL_jll]]
deps = ["Artifacts", "LibSSH2_jll", "Libdl", "MbedTLS_jll", "Zlib_jll", "nghttp2_jll"]
uuid = "deac9b47-8bc7-5906-a0fe-35ac56dc84c0"
version = "7.84.0+0"

[[deps.LibGit2]]
deps = ["Base64", "NetworkOptions", "Printf", "SHA"]
uuid = "76f85450-5226-5b5a-8eaa-529ad045b433"

[[deps.LibSSH2_jll]]
deps = ["Artifacts", "Libdl", "MbedTLS_jll"]
uuid = "29816b5a-b9ab-546f-933c-edad1886dfa8"
version = "1.10.2+0"

[[deps.Libdl]]
uuid = "8f399da3-3557-5675-b5ff-fb832c97cbdb"

[[deps.LinearAlgebra]]
deps = ["Libdl", "OpenBLAS_jll", "libblastrampoline_jll"]
uuid = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"

[[deps.LogExpFunctions]]
deps = ["ChainRulesCore", "ChangesOfVariables", "DocStringExtensions", "InverseFunctions", "IrrationalConstants", "LinearAlgebra"]
git-tree-sha1 = "680e733c3a0a9cea9e935c8c2184aea6a63fa0b5"
uuid = "2ab3a3ac-af41-5b50-aa03-7779005ae688"
version = "0.3.21"

[[deps.Logging]]
uuid = "56ddb016-857b-54e1-b83d-db4d58db5568"

[[deps.Markdown]]
deps = ["Base64"]
uuid = "d6f4376e-aef5-505a-96c1-9c027394607a"

[[deps.MbedTLS_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "c8ffd9c3-330d-5841-b78e-0817d7145fa1"
version = "2.28.0+0"

[[deps.MozillaCACerts_jll]]
uuid = "14a3606d-f60d-562e-9121-12d972cd8159"
version = "2022.10.11"

[[deps.NPZ]]
deps = ["FileIO", "ZipFile"]
git-tree-sha1 = "60a8e272fe0c5079363b28b0953831e2dd7b7e6f"
uuid = "15e1cf62-19b3-5cfa-8e77-841668bca605"
version = "0.4.3"

[[deps.NetworkOptions]]
uuid = "ca575930-c2e3-43a9-ace4-1e988b2c1908"
version = "1.2.0"

[[deps.OpenBLAS_jll]]
deps = ["Artifacts", "CompilerSupportLibraries_jll", "Libdl"]
uuid = "4536629a-c528-5b80-bd46-f80d51c5b363"
version = "0.3.21+0"

[[deps.OpenLibm_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "05823500-19ac-5b8b-9628-191a04bc5112"
version = "0.8.1+0"

[[deps.OpenSpecFun_jll]]
deps = ["Artifacts", "CompilerSupportLibraries_jll", "JLLWrappers", "Libdl", "Pkg"]
git-tree-sha1 = "13652491f6856acfd2db29360e1bbcd4565d04f1"
uuid = "efe28fd5-8261-553b-a9e1-b2916fc3738e"
version = "0.5.5+0"

[[deps.Pkg]]
deps = ["Artifacts", "Dates", "Downloads", "FileWatching", "LibGit2", "Libdl", "Logging", "Markdown", "Printf", "REPL", "Random", "SHA", "Serialization", "TOML", "Tar", "UUIDs", "p7zip_jll"]
uuid = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f"
version = "1.8.0"

[[deps.Preferences]]
deps = ["TOML"]
git-tree-sha1 = "47e5f437cc0e7ef2ce8406ce1e7e24d44915f88d"
uuid = "21216c6a-2e73-6563-6e65-726566657250"
version = "1.3.0"

[[deps.Printf]]
deps = ["Unicode"]
uuid = "de0858da-6303-5e67-8744-51eddeeeb8d7"

[[deps.REPL]]
deps = ["InteractiveUtils", "Markdown", "Sockets", "Unicode"]
uuid = "3fa0cd96-eef1-5676-8a61-b3b8758bbffb"

[[deps.Random]]
deps = ["SHA", "Serialization"]
uuid = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"

[[deps.Random123]]
deps = ["Random", "RandomNumbers"]
git-tree-sha1 = "7a1a306b72cfa60634f03a911405f4e64d1b718b"
uuid = "74087812-796a-5b5d-8853-05524746bad3"
version = "1.6.0"

[[deps.RandomNumbers]]
deps = ["Random", "Requires"]
git-tree-sha1 = "043da614cc7e95c703498a491e2c21f58a2b8111"
uuid = "e6cf234a-135c-5ec9-84dd-332b85af5143"
version = "1.5.3"

[[deps.Reexport]]
git-tree-sha1 = "45e428421666073eab6f2da5c9d310d99bb12f9b"
uuid = "189a3867-3050-52da-a836-e630ba90ab69"
version = "1.2.2"

[[deps.Requires]]
deps = ["UUIDs"]
git-tree-sha1 = "838a3a4188e2ded87a4f9f184b4b0d78a1e91cb7"
uuid = "ae029012-a4dd-5104-9daa-d747884805df"
version = "1.3.0"

[[deps.SHA]]
uuid = "ea8e919c-243c-51af-8825-aaa63cd721ce"
version = "0.7.0"

[[deps.Serialization]]
uuid = "9e88b42a-f829-5b0c-bbe9-9e923198166b"

[[deps.Sockets]]
uuid = "6462fe0b-24de-5631-8697-dd941f90decc"

[[deps.SparseArrays]]
deps = ["Libdl", "LinearAlgebra", "Random", "Serialization", "SuiteSparse_jll"]
uuid = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"

[[deps.SpecialFunctions]]
deps = ["ChainRulesCore", "IrrationalConstants", "LogExpFunctions", "OpenLibm_jll", "OpenSpecFun_jll"]
git-tree-sha1 = "d75bda01f8c31ebb72df80a46c88b25d1c79c56d"
uuid = "276daf66-3868-5448-9aa4-cd146d93841b"
version = "2.1.7"

[[deps.Statistics]]
deps = ["LinearAlgebra", "SparseArrays"]
uuid = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
version = "1.9.0"

[[deps.SuiteSparse_jll]]
deps = ["Artifacts", "Libdl", "Pkg", "libblastrampoline_jll"]
uuid = "bea87d4a-7f5b-5778-9afe-8cc45184846c"
version = "5.10.1+0"

[[deps.TOML]]
deps = ["Dates"]
uuid = "fa267f1f-6049-4f14-aa54-33bafae1ed76"
version = "1.0.3"

[[deps.Tar]]
deps = ["ArgTools", "SHA"]
uuid = "a4e569a6-e804-4fa4-b0f3-eef7a1d5b13e"
version = "1.10.0"

[[deps.Test]]
deps = ["InteractiveUtils", "Logging", "Random", "Serialization"]
uuid = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[[deps.TimerOutputs]]
deps = ["ExprTools", "Printf"]
git-tree-sha1 = "f2fd3f288dfc6f507b0c3a2eb3bac009251e548b"
uuid = "a759f4b9-e2f1-59dc-863e-4aeb61b1ea8f"
version = "0.5.22"

[[deps.UUIDs]]
deps = ["Random", "SHA"]
uuid = "cf7118a7-6976-5b1a-9a39-7adc72f591a4"

[[deps.Unicode]]
uuid = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5"

[[deps.ZipFile]]
deps = ["Libdl", "Printf", "Zlib_jll"]
git-tree-sha1 = "f492b7fe1698e623024e873244f10d89c95c340a"
uuid = "a5390f91-8eb1-5f08-bee0-b1d1ffed6cea"
version = "0.10.1"

[[deps.Zlib_jll]]
deps = ["Libdl"]
uuid = "83775a58-1f1d-513f-b197-d71354ab007a"
version = "1.2.13+0"

[[deps.libblastrampoline_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "8e850b90-86db-534c-a0d3-1478176c7d93"
version = "5.2.0+0"

[[deps.nghttp2_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "8e850ede-7688-5339-a07c-302acd2aaf8d"
version = "1.48.0+0"

[[deps.p7zip_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "3f19e933-33d8-53b3-aaab-bd5110c3b7a0"
version = "17.4.0+0"

Expected behavior

I expect the factorization to stay on the GPU and the solution to be on the GPU as well; that is,
typeof(x) == typeof(b), and typeof(cholA) should be a GPU-related type.

Version info

Details on Julia:

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 12 × Intel(R) Xeon(R) CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, cascadelake)
  Threads: 1 on 12 virtual cores

Details on CUDA:

CUDA runtime 11.8, artifact installation
CUDA driver 11.8
NVIDIA driver 470.161.3, originally for CUDA 11.4

Libraries: 
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 11.0.0+470.161.3

Toolchain:
- Julia: 1.8.5
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)


Additional context

Posted on Discourse: https://discourse.julialang.org/t/complete-and-incomplete-sparse-cholesky-factorization/96918. The functionality was clearly available in previous versions of Julia, when CuSolver, CuArrays, and CuSparse were separate packages. Searching the web reveals old documentation with options like csrlsvchol! and ic0, which also did not work. My guess is that this functionality was intended to be wrapped in cholesky(A), but the GPU version was never fully implemented, and the factorization instead happens on a CPU copy.
shakedregev added the bug (Something isn't working) label Apr 4, 2023
maleadt (Member) commented Apr 5, 2023

I believe this is the only mention of Cholesky in the entire package and it looks like it's coded to work this way (just copy to CPU and factorize).

That's incorrect; cholesky on dense inputs works fine:

julia> using CUDA, LinearAlgebra

julia> CUDA.allowscalar(false)

julia> A = CUDA.rand(10, 10);
julia> A = A*A'+I;

julia> cholesky(A)
Cholesky{Float32, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}
U factor:
10×10 UpperTriangular{Float32, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}:
 2.07792  1.04945  1.31206   1.20788    1.16905    1.25365   1.19039   0.792913    1.1288     1.07848
         1.794    0.943456  0.393756   0.655787   0.66578   0.537175  0.213479    0.737071   0.155549
                 1.36309   0.394237   0.370576   0.603764  0.570361  0.414203    0.541527   0.262899
                          1.38618   -0.0552797  0.39563   0.190204  0.00359829  0.141378   0.322178
                                    1.53461    0.564346  0.130046  0.114851    0.20222    0.332031
                                              1.54097   0.301906  0.191423    0.368203   0.00617357
                                                       1.21205   0.350676    0.424222   0.0375677
                                                                1.32711     0.19541   -0.0467459
                                                                           1.36569   -0.0209019
                                                                                     1.1683
A = CuSparseMatrixCSR(Q)
cholA = cholesky(A)

What CUSOLVER/CUSPARSE API calls do you expect this to use?

maleadt added the enhancement (New feature or request), help wanted (Extra attention is needed), and cuda libraries (Stuff about CUDA library wrappers.) labels, and removed the bug (Something isn't working) label, Apr 5, 2023
maleadt changed the title from "Cholesky factorization of CuSparseMatrixCSR in CUDA.jl happening on CPU" to "Support Cholesky factorization of CuSparseMatrixCSR" Apr 5, 2023
junyuan-chen commented Apr 9, 2023

There is a wrapper for solving a sparse system with Cholesky factorization here:

for (fname, elty, relty) in ((:cusolverSpScsrlsvchol, :Float32, :Float32),
                             (:cusolverSpDcsrlsvchol, :Float64, :Float64),
                             (:cusolverSpCcsrlsvchol, :ComplexF32, :Float32),
                             (:cusolverSpZcsrlsvchol, :ComplexF64, :Float64))
    @eval begin
        function csrlsvchol!(A::CuSparseMatrixCSR{$elty},
                             b::CuVector{$elty},
                             x::CuVector{$elty},
                             tol::$relty,
                             reorder::Cint,
                             inda::Char)
            n = size(A, 1)
            if size(A, 2) != n
                throw(DimensionMismatch("Cholesky factorization is only possible for square matrices!"))
            end
            if size(A, 2) != length(b)
                throw(DimensionMismatch("second dimension of A, $(size(A,2)), must match the length of b, $(length(b))"))
            end
            if length(x) != length(b)
                throw(DimensionMismatch("length of x, $(length(x)), must match the length of b, $(length(b))"))
            end
            desca = CuMatrixDescriptor('G', 'L', 'N', inda)
            singularity = zeros(Cint, 1)
            $fname(sparse_handle(), n, A.nnz, desca, A.nzVal, A.rowPtr, A.colVal, b, tol, reorder, x, singularity)
            if singularity[1] != -1
                throw(SingularException(singularity[1]))
            end
            x
        end
    end
end

It seems that this solver does what cholesky! and ldiv! would do together, so the factorization step cannot be wrapped separately through the Julia interface.

Also, according to the CUDA documentation here, there is no equivalent of cholesky! for sparse matrices in the high-level interface.
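
For completeness, here is a minimal sketch of driving that wrapper directly to solve Ax = b entirely on the GPU. This is an assumption-laden sketch: it assumes the wrapper is reachable as CUDA.CUSOLVER.csrlsvchol!, and the exact module path and argument defaults may differ across CUDA.jl versions.

using CUDA, CUDA.CUSPARSE, CUDA.CUSOLVER
using LinearAlgebra, SparseArrays, Random

# Same SPD test matrix as the MWE above.
rng = MersenneTwister(1235)
n = 4
Q1 = sprand(rng, n, n, 1.0/n)
A = CuSparseMatrixCSR(Q1' * Q1 + I)
b = CUDA.fill(1.0, n)
x = CUDA.zeros(Float64, n)

# Factorize and solve in one call: tol is the singularity tolerance,
# reorder = 1 requests a fill-reducing reordering, and 'O' selects
# one-based indexing. Throws SingularException on a zero pivot.
CUSOLVER.csrlsvchol!(A, b, x, 1e-10, Cint(1), 'O')

Note that this fuses factorization and solve, which is exactly the limitation discussed above: nothing reusable survives the call.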

shakedregev (Author) commented Apr 11, 2023

Thanks, but this is not useful in my case. I'm not sure why, but NVIDIA removed the ability to keep your factorizations. That move simplifies things when you only solve one system, but not in the many applications where you iterate, such as optimization problems or preconditioned conjugate gradient. Back when I wrote in C++, I simply used older CUDA versions, which worked fine for me; the newer ones did not have what I needed. It seems that CUDA.jl (perhaps) followed NVIDIA's lead.

I am solving a system Ax = b, where A is dense and is only available as an operator (a product and sum of sparse matrices). I am forming a sparse approximation of A, called M, which I want to factorize and solve systems with.
I need to keep the factorization in order to solve multiple systems (because I am using this as a preconditioner in conjugate gradient). I also need the symbolic information, because I have multiple systems of the form Ax = b in which the sparsity structure of M stays constant.
Basically, I need to be able to do three separate things:

  1. Take the symbolic factorization of a csr matrix
  2. Given this symbolic factorization, compute a numeric factorization
  3. Solve systems with said numeric factorization.

This may or may not be useful, but for a matrix M with CSR representation (rows, cols, vals), the C++-style CUDA code for what I want to do is:

// Symbolic factorization - happens once
csrcholInfo_t info = NULL;
cusolverSpCreateCsrcholInfo(&info);
cusolverSpXcsrcholAnalysis(handle_cusolver, M->n, nnzM, descrM, rows, cols, info);
size_t internalDataInBytes, workspaceInBytes;
cusolverSpDcsrcholBufferInfo(handle_cusolver, M->n, nnzM, descrM, vals, rows,
                             cols, info, &internalDataInBytes, &workspaceInBytes);
void* buffer_gpu = NULL;
cudaMalloc(&buffer_gpu, sizeof(char) * workspaceInBytes);

// Numerical factorization - happens once per new set of values in M
cusolverSpDcsrcholFactor(handle_cusolver, M->n, nnzM, descrM, vals, rows, cols,
                         info, buffer_gpu);
cusolverSpDcsrcholZeroPivot(handle_cusolver, info, tol, &singularity);

// Solve - happens every conjugate gradient iteration
cusolverSpDcsrcholSolve(handle_cusolver, M->n, x, b, info, buffer_gpu);
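
In Julia, the same three phases might look like the sketch below. This is a sketch only: the csrchol_* names are hypothetical placeholders for bindings to cusolverSpXcsrcholAnalysis, cusolverSpDcsrcholBufferInfo/Factor/ZeroPivot, and cusolverSpDcsrcholSolve; no such functions exist in CUDA.jl under these names.

# Hypothetical sketch; csrchol_analyze, csrchol_factor!, and csrchol_solve!
# are placeholder names, not an existing CUDA.jl API.

# 1. Symbolic factorization: once per sparsity pattern.
#    Would wrap CreateCsrcholInfo + Analysis + BufferInfo + workspace allocation.
info, buffer = csrchol_analyze(M)

# 2. Numeric factorization: once per new set of values in M.
#    Would wrap Factor plus the ZeroPivot singularity check.
csrchol_factor!(info, M, buffer; tol = 1e-10)

# 3. Solve: every conjugate gradient iteration, reusing info and buffer.
csrchol_solve!(info, x, b, buffer)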

shakedregev (Author) commented Apr 11, 2023

Actually, an incomplete factorization would probably suffice for my purposes, but ic02 returns a matrix that I cannot make heads or tails of. It is neither lower triangular nor symmetric, so I am not sure how it encodes the LLᵀ factors.

Still, I would argue that a full factorization should be doable; in many cases a full Cholesky factorization stays sparse.
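
For reference, CUSPARSE's ic02 overwrites the input's sparsity pattern in place, and (in the pattern used by Krylov.jl's GPU documentation) only the lower triangle holds the incomplete factor L, which would explain why the result looks neither triangular nor symmetric. A hedged sketch of applying it as an LLᵀ preconditioner, assuming ic02(A) defaults to one-based indexing (older CUDA.jl versions take an explicit 'O' argument):

using CUDA, CUDA.CUSPARSE, LinearAlgebra

# A: symmetric positive definite CuSparseMatrixCSR; b: CuVector.
P = CUSPARSE.ic02(A)        # incomplete Cholesky, zero fill-in, in A's pattern

# Only the lower triangle of P is meaningful; wrap it before the solves.
L = LowerTriangular(P)
z = similar(b)
x = similar(b)
ldiv!(z, L, b)              # forward solve:  L * z = b
ldiv!(x, L', z)             # backward solve: Lᵀ * x = z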

@shakedregev (Author)

@maleadt would you be willing to develop this functionality together so it can be part of the library? I can take care of the linear algebra; I just don't have experience with developing backend Julia functions (I do have experience with CUDA, though).

maleadt (Member) commented Apr 24, 2023

I'm totally unfamiliar with what you're trying to accomplish, so I can only offer high-level guidance. Since you know which API calls you want to perform: what we normally do first is build a slightly more generic wrapper around these APIs, e.g.,

# csrlsvchol
for (fname, elty, relty) in ((:cusolverSpScsrlsvchol, :Float32, :Float32),
                             (:cusolverSpDcsrlsvchol, :Float64, :Float64),
                             (:cusolverSpCcsrlsvchol, :ComplexF32, :Float32),
                             (:cusolverSpZcsrlsvchol, :ComplexF64, :Float64))
    @eval begin
        function csrlsvchol!(A::CuSparseMatrixCSR{$elty},
                             b::CuVector{$elty},
                             x::CuVector{$elty},
                             tol::$relty,
                             reorder::Cint,
                             inda::Char)
            n = size(A, 1)
            if size(A, 2) != n
                throw(DimensionMismatch("Cholesky factorization is only possible for square matrices!"))
            end
            if size(A, 2) != length(b)
                throw(DimensionMismatch("second dimension of A, $(size(A,2)), must match the length of b, $(length(b))"))
            end
            if length(x) != length(b)
                throw(DimensionMismatch("length of x, $(length(x)), must match the length of b, $(length(b))"))
            end
            desca = CuMatrixDescriptor('G', 'L', 'N', inda)
            singularity = zeros(Cint, 1)
            $fname(sparse_handle(), n, A.nnz, desca, A.nzVal, A.rowPtr, A.colVal, b, tol, reorder, x, singularity)
            if singularity[1] != -1
                throw(SingularException(singularity[1]))
            end
            x
        end
    end
end

Then, we use these wrappers to write methods that integrate with interfaces/stdlibs/packages like LinearAlgebra.jl, e.g.,

# LinearAlgebra.jl defines a comparable method with these restrictions on T, so we need
# to define with the same type parameters to resolve method ambiguity for these cases
function LinearAlgebra.ldiv!(F::LU{T,<:StridedCuMatrix{T}}, B::CuVecOrMat{T}) where {T <: Union{Float32, Float64, ComplexF32, ComplexF64}}
return getrs!('N', F.factors, F.ipiv, B)
end

loonatick-src commented Aug 3, 2023

Thanks, but this is not useful in my case. I'm not sure why, but NVIDIA removed the ability to keep your factorizations. [...] Basically, I need to be able to do three separate things: (1) take the symbolic factorization of a CSR matrix, (2) given this symbolic factorization, compute a numeric factorization, and (3) solve systems with said numeric factorization. [...]

@shakedregev Using cusolverSpDcsrcholFactor etc. to expose sparse Cholesky factorizations and solves might be problematic, since they appear to be an undocumented, low-level API [1,2], and I cannot tell whether they will be stable or maintained.

It seems likely that they won't be supported, since the CUDA component of CHOLMOD from SuiteSparse appears to be the officially supported sparse Cholesky library for CUDA. I guess this would require depending on SuiteSparse.

@maleadt I would like to contribute, but I imagine this particular direction would be more involved because of the dependency on SuiteSparse. Would you be open to contributions in this direction, and to offering some high-level guidance?

maleadt (Member) commented Aug 8, 2023

It seems likely that they won't be supported, since the CUDA component of CHOLMOD from SuiteSparse appears to be the officially supported sparse Cholesky library for CUDA. I guess this would require depending on SuiteSparse.

@maleadt I would like to contribute, but I imagine this particular direction would be more involved because of the dependency on SuiteSparse. Would you be open to contributions in this direction, and to offering some high-level guidance?

Sure, happy to offer some guidance. I'd think that this functionality would need to be a part of SuiteSparse.jl, ideally as a package extension that gets loaded when CUDA.jl is available. We've already built some artifacts, https://github.com/JuliaPackaging/Yggdrasil/tree/master/S/SuiteSparse/SuiteSparse_GPU%405, but they aren't used.
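
For reference, a minimal sketch of what such a package extension could look like. All names here are illustrative, not settled API; the CUDA UUID is the one from the manifest above.

# Illustrative skeleton only: a package extension that loads when CUDA.jl
# is available, as described above. Declared in the host package's
# Project.toml via:
#
#   [weakdeps]
#   CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
#
#   [extensions]
#   SuiteSparseCUDAExt = "CUDA"
#
# ext/SuiteSparseCUDAExt.jl:
module SuiteSparseCUDAExt

using SuiteSparse, CUDA

# GPU-accelerated CHOLMOD methods, backed by the SuiteSparse_GPU artifacts
# linked above, would be defined here.

end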

@amontoison (Member)

@shakedregev I interfaced the low-level API for the sparse QR and sparse Cholesky decompositions in #2121.

@shakedregev (Author)

@amontoison - Thank you!

maleadt (Member) commented Oct 31, 2023

Can this be closed then, or is there more to this issue?

@loonatick-src
CHOLMOD support for CUDA is very likely to be implemented as a package extension to the SuiteSparse standard library (JuliaSparse/SparseArrays.jl#443), so I think there is nothing more to this issue and it can be closed as completed.

maleadt closed this as completed Oct 31, 2023