Parallel Julia crashes with composite datatype containing SparseMatrixCSC #12848

Closed
cschwarzbach opened this issue Aug 28, 2015 · 4 comments
Labels: domain:parallelism, kind:bug, kind:regression


@cschwarzbach

Julia crashes when trying to copy a composite datatype containing two references to the same SparseMatrixCSC object from the master process to a parallel worker. The problem appears in Julia 0.4 builds from late June 2015 onwards. The following code reproduces the error when run in parallel (julia -p N with N >= 1):

@everywhere begin
    type Two{T}
        A::T
        B::T
    end
end

p = workers()[1]

function send2(A,B)
    x = Two(A,B)
    try
        xref = remotecall_wait(p, identity, x)
        println("Success")
    catch
        println("Failure")
    end
end

S =    ["dense",    "sparse"    ]
A = Any[zeros(0,0), spzeros(0,0)]
B = Any[zeros(0,0), spzeros(0,0)]

println()

for k = 1:2

    println("Testing send2(A, B) with ", S[k], " matrices")
    send2(A[k], B[k])
    println()

    println("Testing send2(A, A) with ", S[k], " matrices")
    send2(A[k], A[k])
    println()

end

Screen output from Julia 0.3 and an older Julia 0.4 build (e.g., commit 75432c9*):


Testing send2(A, B) with dense matrices
Success

Testing send2(A, A) with dense matrices
Success

Testing send2(A, B) with sparse matrices
Success

Testing send2(A, A) with sparse matrices
Success

Screen output from a recent Julia 0.4 build (e.g., commit f42b222*):


Testing send2(A, B) with dense matrices
Success

Testing send2(A, A) with dense matrices
Success

Testing send2(A, B) with sparse matrices
Success

Testing send2(A, A) with sparse matrices
fatal error on 2: ERROR: KeyError: 3 not found
 in handle_deserialize at serialize.jl:455
 in deserialize at serialize.jl:683
 in deserialize_datatype at serialize.jl:636
 in handle_deserialize at serialize.jl:457
 in deserialize at serialize.jl:429
 in anonymous at serialize.jl:472
 in ntuple at ./tuple.jl:32
 in deserialize_tuple at serialize.jl:472
 in handle_deserialize at serialize.jl:450
 in deserialize at serialize.jl:683
 in deserialize_datatype at serialize.jl:636
 in handle_deserialize at serialize.jl:457
 in message_handler_loop at multi.jl:844
 in process_tcp_streams at multi.jl:833
 in anonymous at task.jl:67
Worker 2 terminated.Failure


ERROR (unhandled task failure): EOFError: read end of file

The Julia and OS versions I'm using are (output of versioninfo()):

Julia Version 0.4.0-dev+7002
Commit f42b222* (2015-08-26 20:27 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
@pao added the kind:bug, domain:parallelism, and domain:arrays:sparse labels on Aug 28, 2015
@amitmurthy (Contributor)

I can confirm that this works fine when fields A and B of Two refer to two different spzeros objects, but fails when they refer to the same object.

This might be related to #12079 (comment), but it doesn't look like it: ser/deser works fine over a socket connection. The observation above stands: the failure occurs only when the fields of Two refer to the same object.
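
For reference, a minimal check of that observation using plain serialize/deserialize through an IOBuffer (this deliberately bypasses the remotecall path and assumes the Two type from the report above is already defined):

# Fields refer to two distinct sparse matrices.
io = IOBuffer()
serialize(io, Two(spzeros(0,0), spzeros(0,0)))
seekstart(io)
deserialize(io)                 # round-trips without error

# Fields alias the same sparse matrix.
x = spzeros(0,0)
io = IOBuffer()
serialize(io, Two(x, x))
seekstart(io)
y = deserialize(io)             # also appears fine in this direct form
println(y.A === y.B)            # does the aliasing survive the round trip?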

@amitmurthy (Contributor)

Reduced case:

type Two{T}
    A::T
    B::T
end

x = spzeros(0,0)
X = Two(x, x)    # both fields reference the same object

foo = Base.CallWaitMsg(println, (X,), (1,2), (1,3))

io = IOBuffer()
serialize(io, foo)
seekstart(io)
deserialize(io)

results in

ERROR: KeyError: 3 not found
 in handle_deserialize at serialize.jl:455
 in deserialize at serialize.jl:684
 in deserialize_datatype at serialize.jl:637
 in handle_deserialize at serialize.jl:457
 in deserialize at serialize.jl:429
 in anonymous at serialize.jl:472
 in ntuple at ./tuple.jl:32
 in deserialize_tuple at serialize.jl:472
 in handle_deserialize at serialize.jl:450
 in deserialize at serialize.jl:684
 in deserialize_datatype at serialize.jl:637
 in handle_deserialize at serialize.jl:457
 in deserialize at serialize.jl:429
 in deserialize at serialize.jl:426

CallWaitMsg is defined as

type CallWaitMsg <: AbstractMsg
    f::Function
    args::Tuple
    response_oid::Tuple
    notify_oid::Tuple
end

@JeffBezanson, do you think 78b999f, in the context of SparseMatrix, could be the cause?
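
For context, one rough way to observe the shared-reference tracking being discussed (without claiming anything about what 78b999f actually changed) is to compare serialized sizes; the exact wire format is an implementation detail, so the byte counts are only illustrative:

x = spzeros(0,0)

# Tuple holding two independent sparse matrices.
io1 = IOBuffer()
serialize(io1, (spzeros(0,0), spzeros(0,0)))

# Tuple holding the same sparse matrix twice.
io2 = IOBuffer()
serialize(io2, (x, x))

# If repeated objects are written as back-references, the aliased stream
# should come out noticeably shorter.
println("distinct: ", position(io1), " bytes, aliased: ", position(io2), " bytes")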

@mattcbro

I'm getting the same kind of error without any sparse matrices at all. In particular, I have a @parallel for loop that looks something like this:

    Npts = size(deltaq,1)
    pout = SharedArray(Float64,Npts)
    xmitantpos = copy(xmitprams.xmitantpos)
    @sync @parallel for n=1:Npts
        # theoretically this is a copied version of xmitantpos and so I can write to it
        #xmitantpos[qq,:] = deltaq[n,:] 
        pout[n] = perfmet(rcons, wgts, xmitprams, xmitprams.xmitantpos) 
    end
    # extract data from shared array
    pret = sdata(pout)

So xmitprams is a composite type, and there are two references to it in the input arguments. However, even if I copy it, i.e., if the last argument to perfmet uses the copy xmitantpos, I get a similar error, namely:
julia> fatal error on 3: fatal error on 6: fatal error on 13: fatal error on 7: fatal error on 10: fatal error on fatal error on ERROR: stack overflow
in deserialize at serialize.jl:357
in handle_deserialize at serialize.jl:352
in deserialize at serialize.jl:361

I'll try to get a simpler standalone test case running later. Right now, though, it's a showstopper, since I can't see how to parallelize calls to my function perfmet. The same error occurs if I configure it to use pmap instead.

By the way, this is happening on 0.3.11 on a 64-bit Linux OS.
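
A stripped-down sketch of that pattern is below; every name in it (Params, perfmet, the array sizes) is a placeholder standing in for the real code, and it is not a confirmed reproducer of the crash:

# Placeholder type and function standing in for xmitprams / perfmet.
@everywhere begin
    type Params
        xmitantpos::Matrix{Float64}
    end
    perfmet(p::Params, pos) = sum(pos) + sum(p.xmitantpos)
end

prms = Params(zeros(4, 3))
Npts = 100
pout = SharedArray(Float64, Npts)

@sync @parallel for n = 1:Npts
    # The composite type is referenced twice in the call: directly and
    # through one of its own fields, mirroring the loop above.
    pout[n] = perfmet(prms, prms.xmitantpos)
end

pret = sdata(pout)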

@tkelman (Contributor) commented Sep 16, 2015

Given that there's a PR for this (#13134), moving the backport label over there.
