
Support for JLD2 #1833

Closed
denizyuret opened this issue Mar 25, 2023 · 15 comments
Labels
bug (Something isn't working) · help wanted (Extra attention is needed)

Comments

@denizyuret
Contributor

Here is what I do to be able to save/load CuArrays with JLD2 files:

using CUDA
import JLD2, FileIO
struct JLD2CuArray{T,N}; array::Array{T,N}; end                                                                                              
JLD2.writeas(::Type{CuArray{T,N,D}}) where {T,N,D} = JLD2CuArray{T,N}                                                                        
JLD2.wconvert(::Type{JLD2CuArray{T,N}}, x::CuArray{T,N,D}) where {T,N,D} = JLD2CuArray(Array(x))                                             
JLD2.rconvert(::Type{CuArray{T,N,D}}, x::JLD2CuArray{T,N}) where {T,N,D} = CuArray(x.array)                                                  

This used to work with CuArray{T,N} but no longer works with CuArray{T,N,D}. Here is the error I get:

julia> a = CUDA.rand(3,5)
julia> FileIO.save("foo.jld2", "a", a)
julia> d = FileIO.load("foo.jld2")
Dict{String, Any} with 1 entry:
Error showing value of type Dict{String, Any}:
ERROR: CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)                                                                            
Stacktrace:                                                                                                                                  
  [1] throw_api_error(res::CUDA.cudaError_enum)                                                                                              
    @ CUDA /userfiles/dyuret/.julia/packages/CUDA/BbliS/lib/cudadrv/error.jl:89                                                              
  [2] macro expansion                                                                                                                        
    @ /userfiles/dyuret/.julia/packages/CUDA/BbliS/lib/cudadrv/error.jl:97 [inlined]                                                         
  [3] cuMemcpyDtoHAsync_v2                                                                                                                   
    @ /userfiles/dyuret/.julia/packages/CUDA/BbliS/lib/utils/call.jl:26 [inlined]                                                            
  [4] #unsafe_copyto!#8                                                                                                                      
    @ /userfiles/dyuret/.julia/packages/CUDA/BbliS/lib/cudadrv/memory.jl:397 [inlined]                                                       
  [5] (::CUDA.var"#189#190"{Float32, Matrix{Float32}, Int64, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, Int64, Int64})()                    
    @ CUDA /userfiles/dyuret/.julia/packages/CUDA/BbliS/src/array.jl:413                                                                     
  [6] #context!#63                                                                                                                           
    @ /userfiles/dyuret/.julia/packages/CUDA/BbliS/lib/cudadrv/state.jl:164 [inlined]                                                        
  [7] context!                                                                                                                               
    @ /userfiles/dyuret/.julia/packages/CUDA/BbliS/lib/cudadrv/state.jl:159 [inlined]                                                        
  [8] unsafe_copyto!(dest::Matrix{Float32}, doffs::Int64, src::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, soffs::Int64, n::Int64)           
    @ CUDA /userfiles/dyuret/.julia/packages/CUDA/BbliS/src/array.jl:406                                                                     
  [9] copyto!                                                                                                                                
    @ /userfiles/dyuret/.julia/packages/CUDA/BbliS/src/array.jl:360 [inlined]                                                                
 [10] copyto!                                                                                                                                
    @ /userfiles/dyuret/.julia/packages/CUDA/BbliS/src/array.jl:364 [inlined]                                                                
 [11] copyto_axcheck!(dest::Matrix{Float32}, src::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})                                                
    @ Base ./abstractarray.jl:1127                                                                                                           
 [12] Array                                                                                                                                  
    @ ./array.jl:626 [inlined]                                                                                                               
 [13] Array                                                                                                                                  
    @ ./boot.jl:483 [inlined]                                                                                                                
 [14] convert                                                                                                                                
    @ ./array.jl:617 [inlined]                                                                                                               
 [15] adapt_storage                                                                                                                          
    @ /userfiles/dyuret/.julia/packages/GPUArrays/XR4WO/src/host/abstractarray.jl:23 [inlined]                                               
 [16] adapt_structure                                                                                                                        
    @ /userfiles/dyuret/.julia/packages/Adapt/xviDc/src/Adapt.jl:57 [inlined]                                                                
 [17] adapt                                                                                                                                  
    @ /userfiles/dyuret/.julia/packages/Adapt/xviDc/src/Adapt.jl:40 [inlined]                                                                
 [18] _show_nonempty                                                                                                                         
    @ /userfiles/dyuret/.julia/packages/GPUArrays/XR4WO/src/host/abstractarray.jl:30 [inlined]                                               
 [19] show(io::IOContext{IOBuffer}, X::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})                                                           
    @ Base ./arrayshow.jl:489                                                                                                                
 [20] sprint(f::Function, args::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}; context::IOContext{Base.TTY}, sizehint::Int64)                   
    @ Base ./strings/io.jl:112                                                                                                               
 [21] show(io::IOContext{Base.TTY}, #unused#::MIME{Symbol("text/plain")}, t::Dict{String, Any})                                              
    @ Base ./show.jl:112                                                                                                                     

When I compare the original array with the loaded version they seem similar except for the refcount:

julia> dump(a)                                                                                                                               
CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}                                                                                                   
  storage: CUDA.ArrayStorage{CUDA.Mem.DeviceBuffer}                                                                                          
    buffer: CUDA.Mem.DeviceBuffer                                                                                                            
      ctx: CuContext                                                                                                                         
        handle: Ptr{Nothing} @0x0000000002bbbe80                                                                                             
        valid: Bool true                                                                                                                     
      ptr: CuPtr{Nothing} CuPtr{Nothing}(0x0000000200e00000)                                                                                 
      bytesize: Int64 60                                                                                                                     
      async: Bool true                                                                                                                       
    refcount: Base.Threads.Atomic{Int64}                                                                                                     
      value: Int64 1                                                                                                                         
  maxsize: Int64 60                                                                                                                          
  offset: Int64 0                                                                                                                            
  dims: Tuple{Int64, Int64}                                                                                                                  
    1: Int64 3                                                                                                                               
    2: Int64 5                                                                                                                               
julia> dump(d["a"])                                                                                                                          
CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}                                                                                                   
  storage: CUDA.ArrayStorage{CUDA.Mem.DeviceBuffer}                                                                                          
    buffer: CUDA.Mem.DeviceBuffer                                                                                                            
      ctx: CuContext                                                                                                                         
        handle: Ptr{Nothing} @0x0000000002bbbe80                                                                                             
        valid: Bool true                                                                                                                     
      ptr: CuPtr{Nothing} CuPtr{Nothing}(0x0000000200e00200)                                                                                 
      bytesize: Int64 60                                                                                                                     
      async: Bool true                                                                                                                       
    refcount: Base.Threads.Atomic{Int64}                                                                                                     
      value: Int64 0                                                                                                                         
  maxsize: Int64 60                                                                                                                          
  offset: Int64 0                                                                                                                            
  dims: Tuple{Int64, Int64}                                                                                                                  
    1: Int64 3                                                                                                                               
    2: Int64 5                                                                                                                               

Finally, if I assign the value read to a global variable in rconvert, it works without any errors:

julia> JLD2.rconvert(::Type{CuArray{T,N,D}}, x::JLD2CuArray{T,N}) where {T,N,D} = (y=CuArray(x.array); global dbg=y; y)
julia> d = FileIO.load("foo.jld2")
julia> d["a"] # works with no problems
@denizyuret denizyuret added the bug Something isn't working label Mar 25, 2023
@maleadt
Member

maleadt commented Mar 25, 2023

JLD2 has never really been supported. I guess the fact that it worked was just sheer luck? In any case, I'm not familiar with JLD2, so I'll defer to anybody who is to take a look 🙂

@maleadt maleadt changed the title Saving and loading CuArray to JLD2 no longer works Support for JLD2 Mar 25, 2023
@maleadt maleadt added enhancement New feature or request help wanted Extra attention is needed and removed bug Something isn't working labels Mar 25, 2023
@JonasIsensee

JonasIsensee commented Mar 30, 2023

Hi @denizyuret,

from the perspective of JLD2, your code looks absolutely fine.
What versions are you on? I can't reproduce the problem.

@denizyuret
Contributor Author

[052768ef] CUDA v3.13.1 # (haven't upgraded to 4.x yet, but if it solves the JLD2 issue I will)
[5789e2e9] FileIO v1.16.0
[033835bb] JLD2 v0.4.31
julia> CUDA.versioninfo()
CUDA toolkit 11.7, artifact installation
NVIDIA driver 470.57.2, for CUDA 11.4
CUDA driver 11.7

Libraries: 
- CUBLAS: 11.10.1
- CURAND: 10.2.10
- CUFFT: 10.7.2
- CUSOLVER: 11.3.5
- CUSPARSE: 11.7.3
- CUPTI: 17.0.0
- NVML: 11.0.0+470.57.2
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.8.5
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

@denizyuret
Contributor Author

Alas, my hope was short-lived :( I get the same error with CUDA v4.1.2.

@JonasIsensee

I still can't reproduce your error. (I tried Julia 1.8.5 and 1.9.0-rc1 with CUDA 3.13.1 and JLD2 v0.4.31.)

@denizyuret
Contributor Author

Can you send your CUDA.versioninfo() so I can see what the difference may be? (Library/driver versions, GPU type, etc. could be a factor.)

@JonasIsensee

julia> CUDA.versioninfo()
CUDA toolkit 11.7, artifact installation
NVIDIA driver 515.86.1, for CUDA 11.7
CUDA driver 11.7

Libraries: 
- CUBLAS: 11.10.1
- CURAND: 10.2.10
- CUFFT: 10.7.2
- CUSOLVER: 11.3.5
- CUSPARSE: 11.7.3
- CUPTI: 17.0.0
- NVML: 11.0.0+515.86.1
  Downloaded artifact: CUDNN
- CUDNN: 8.30.2 (for CUDA 11.5.0)
  Downloaded artifact: CUTENSOR
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.8.5
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

@noyongkyoon

noyongkyoon commented Apr 4, 2023

I tried JLD2.writeas(), JLD2.wconvert(), and JLD2.rconvert() as you suggested. Now I get the following error message:

AssertionError: refcount != 0

Stacktrace:
 [1] _derived_array
   @ ~/.julia/packages/CUDA/BbliS/src/array.jl:729 [inlined]
 [2] reshape(a::CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}, dims::Tuple{Int64})
   @ CUDA ~/.julia/packages/CUDA/BbliS/src/array.jl:723
 [3] reshape
   @ ./reshapedarray.jl:117 [inlined]
 [4] vec(a::CuArray{Float32, 3, CUDA.Mem.DeviceBuffer})
   @ Base ./abstractarraymath.jl:41
 [5] (::RNN)(x::CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}; batchSizes::Nothing)
   @ Knet.Ops20 ~/.julia/packages/Knet/YIFWC/src/ops20/rnn.jl:332
 [6] (::RNN)(x::CuArray{Float32, 3, CUDA.Mem.DeviceBuffer})
   @ Knet.Ops20 ~/.julia/packages/Knet/YIFWC/src/ops20/rnn.jl:329
 [7] (::Chain)(x::Matrix{UInt16})
   @ Main ./In[5]:6
 [8] tag(tagger::Chain, s::String)
   @ Main ./In[29]:6
 [9] top-level scope
   @ In[30]:1

What is "refcount"? What purpose does it serve? How can one alter its value, if altering it is necessary?
You say above that "they seem similar except for the refcount." Can you elaborate?

@JonasIsensee

Finally, if I assign the value read to a global variable in rconvert it works without any errors:
julia> JLD2.rconvert(::Type{CuArray{T,N,D}}, x::JLD2CuArray{T,N}) where {T,N,D} = (y=CuArray(x.array); global dbg=y; y)
julia> d = FileIO.load("foo.jld2")
julia> d["a"] # works with no problems

This (and also the refcount) makes me think that this is a problem with memory management when creating the CuArray: JLD2 allocates the underlying array, passes it to the CuArray(data) constructor, and then ceases to keep track of it (leading to refcount = 0).
This would explain why the global-scope workaround could fix it.
@denizyuret Could you try a few functions of this type?

function f()
    data = rand(10, 10)
    CuArray(data)
end

@denizyuret
Contributor Author

@denizyuret Could you try a few functions of this type?

The f() function you suggested works without problems. The refcount of the resulting array is 1.

JLD2 allocates the underlying array and passes it to the CuArray(data) constructor and then ceases to keep track of it. (leading to refcount = 0).

CuArray copies the contents of data (stored in RAM) to GPU memory, and once the GPU array is constructed I don't think it cares what happens to the RAM array. But I am not sure what refcount is for or how it is set, so I may be talking nonsense. If I manually change the value of refcount to 0, for example, things don't break.

@maleadt any idea how refcount=0 may appear and whether it may be the source of our problems?

@maleadt
Member

maleadt commented Apr 5, 2023

But I am not sure what refcount is for and how it is set, so I may be talking nonsense.

The refcount field is to keep track of the underlying buffer, so that multiple CuArrays can share the same memory (e.g., when you take a view, or reinterpret an array, or reshape it).

refcount=0 may happen when you're serializing a freed array.
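The sharing behavior described above can be sketched as follows (a hedged illustration requiring a CUDA-capable GPU; the refcount field is CUDA.jl internal state, not public API, so the comments describe expected rather than guaranteed behavior):

```julia
using CUDA  # assumes a CUDA-capable GPU is available

a = CUDA.rand(3, 5)   # a fresh allocation owns its buffer
b = reshape(a, 15)    # b aliases a's buffer rather than copying it
c = view(a, :, 1)     # so does a view; all three share one DeviceBuffer

# Freeing one handle only drops one reference; the buffer itself is
# released when the last reference is gone. A handle observed after its
# buffer was freed would show refcount == 0 -- the state seen in this issue.
CUDA.unsafe_free!(a)
sum(b)                # still valid: b holds a live reference to the buffer
```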

@JonasIsensee

The refcount field is to keep track of the underlying buffer, so that multiple CuArrays can share the same memory (e.g., when you take a view, or reinterpret an array, or reshape it).

refcount=0 may happen when you're serializing a freed array.

Thank you for this info.
It is a bit odd, though: the problem here is most certainly during deserialization. (Otherwise the workarounds above couldn't work.)

@maleadt
Member

maleadt commented Apr 5, 2023

Hmm, I was misunderstanding how JLD2 serializes objects. If we're really just calling Array(...) and CuArray(...) (i.e., not serializing CuArray objects directly), I fail to see how we would ever run into refcount=0. FWIW, I also can't reproduce this issue.

@maleadt maleadt added bug Something isn't working and removed enhancement New feature or request labels Apr 5, 2023
@JonasIsensee

JonasIsensee commented Apr 5, 2023

Yeah, that's the curious bit. Let me summarize it quickly:

  • default: JLD2 attempts to serialize structs by going through their fields. This fails for CuArray, since it doesn't actually contain the data.
  • custom serialization: This is what @denizyuret attempted here:
using CUDA
import JLD2, FileIO
struct JLD2CuArray{T,N}; array::Array{T,N}; end                                                                                              
JLD2.writeas(::Type{CuArray{T,N,D}}) where {T,N,D} = JLD2CuArray{T,N}                                                                        
JLD2.wconvert(::Type{JLD2CuArray{T,N}}, x::CuArray{T,N,D}) where {T,N,D} = JLD2CuArray(Array(x))                                             
JLD2.rconvert(::Type{CuArray{T,N,D}}, x::JLD2CuArray{T,N}) where {T,N,D} = CuArray(x.array)                                                  

We define a struct JLD2CuArray that contains data JLD2 can safely store, along with conversion methods for both directions (rconvert and wconvert; Base.convert also works, but that is risky due to invalidations).

When you give JLD2 any object, it always asks JLD2.writeas what type to store it as (the default is writeas(::Type{T}) where {T} = T), and it will then call the conversion methods as necessary.

Therefore, with this code, we store the data in Array form AND the full CuArray{T,N,D} type signature (not shown) to call the correct rconvert method upon loading.
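The writeas/wconvert/rconvert mechanism is independent of CUDA; here is a minimal CPU-only sketch using a hypothetical pair of types (the names are invented for illustration):

```julia
import JLD2, FileIO

struct Temperature            # the "live" type user code works with
    celsius::Float64
end
struct TemperatureSerialized  # the plain surrogate JLD2 actually stores
    celsius::Float64
end

# Tell JLD2 which type to store, and how to convert in each direction.
JLD2.writeas(::Type{Temperature}) = TemperatureSerialized
JLD2.wconvert(::Type{TemperatureSerialized}, x::Temperature) = TemperatureSerialized(x.celsius)
JLD2.rconvert(::Type{Temperature}, x::TemperatureSerialized) = Temperature(x.celsius)

FileIO.save("temp.jld2", "t", Temperature(21.5))
FileIO.load("temp.jld2")["t"]   # rconvert rebuilds a Temperature on load
```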

@maleadt
Member

maleadt commented Apr 5, 2023

The fact that the deserialized object contains a different buffer pointer indicates that the rconvert function has run. This seems to point to a GC-related issue, but if JLD2 just stores the deserialized object in a regular dictionary, the finalizer should never run.

@denizyuret since only you seem to be able to reproduce this, I'd add some logging to the CuArray finalizer that decrements the refcount, to see when and from where it gets run (e.g. by adding sprint(Base.show_backtrace, backtrace()) or so to your log messages).
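Following that suggestion might look roughly like this (a hypothetical local patch; the finalization path is a CUDA.jl internal whose exact shape varies across versions, so adapt the names to the local checkout):

```julia
# Sketch of a logging patch inside CUDA.jl's array finalization path --
# function and field names are internal and version-dependent.
function unsafe_finalize!(xs::CuArray)
    # Core.println is safe to call inside finalizers (no task switches).
    Core.println("finalizing ", typeof(xs), " from:")
    Core.println(sprint(Base.show_backtrace, backtrace()))
    # ... the original finalization logic, which decrements refcount ...
end
```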

@maleadt maleadt closed this as completed Nov 6, 2023