
Segfault caused by deserialize (?) or too many tasks (?) #8551

Closed
eschnett opened this issue Oct 2, 2014 · 45 comments
Labels: kind:bug (Indicates an unexpected problem or unintended behavior)

Comments

@eschnett
Contributor

eschnett commented Oct 2, 2014

For the past two weeks I've been trying to find the cause of a rather persistent segfault in a Julia program of mine. So far, I have whittled it down to fewer than 1,000 lines of Julia code, with no external dependencies except Deque from DataStructures.jl (as a replacement for MPI.jl).

I create many tasks, many of them active (and yielding from time to time). I'm sure there's a lot of memory allocation and garbage collection going on. I also serialize and de-serialize many objects, including functions and lambdas.

The error is a segfault. I assume that there is a safe subset of Julia programs that should never segfault (no ccall, no @inbounds, etc.). I believe my program is safe in this respect.

Here is the code: https://bitbucket.org/eschnett/funhpc.jl/branch/memdebug. I apologize for the size -- I've already greatly reduced it. This is how to run it, and how it fails (after a few seconds):

$ ~/julia/bin/julia-debug Wave.jl
Wave[FunHPC.jl]
Initialization

signal (11): Segmentation fault: 11
unknown function (ip: 0)
Segmentation fault: 11

This is with the current development version of Julia.

I believe the problem is somehow connected to or triggered by deserialization and/or by using many tasks. Deserialization is called from the file Comm.jl, routine recv_item, line 39. The result of the deserialize call is never used. If this call is commented out, the program runs fine (and outputs "Done."). (In my attempts to reduce code size, termination detection probably got wonky, but if the text "Done." is output, everything is fine and the segfault is avoided.)

Similarly, when I reduce the number of concurrent tasks (i.e. tasks that are runnable simultaneously, as opposed to waiting), the segfault becomes more sporadic, and disappears if there are just a few tasks. The current example runs 1000 tasks simultaneously.
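
For reference, here is a minimal illustrative sketch of the kind of workload described above -- many concurrent tasks, each repeatedly round-tripping a closure through serialize/deserialize and yielding in between. (This is not code from the linked repository; the names and counts are made up.)

# Serialize a value into an IOBuffer and immediately deserialize it again.
function roundtrip(x)
    buf = IOBuffer()
    serialize(buf, x)
    seekstart(buf)
    deserialize(buf)
end

# Each task round-trips closures and yields, so many tasks are runnable at once.
function worker(n)
    for i in 1:n
        roundtrip(y -> y + i)   # functions/lambdas get serialized too
        yield()
    end
end

tasks = Task[]
for k in 1:1000
    push!(tasks, @async worker(10))
end
for t in tasks
    wait(t)
end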

When I attach a debugger, the segfault happens in array.c or gc.c in an allocation routine, with a seemingly impossible internal state. I assume that malloc's internal heap data structures have been corrupted by then. Adding assert statements didn't reveal anything. I can't use MEMDEBUG, since it hasn't bootstrapped for about five weeks ( #8422 ).

I tried SANITIZE=1, but this didn't bootstrap for me either.

I tried building Julia with both LLVM 3.3 and LLVM 3.5.

I tried running this via Travis instead of on my laptop (OS X) and my workstation (Ubuntu), since I thought something might be wrong with my local setups, and Travis's setup should be well tested. However, I didn't have much luck there either -- debugging via Travis is too indirect to be productive.

I have looked in detail into the files array.c and gc.c (where the segfault is reported), as well as the files serialize.jl and iobuffer.jl (which handle deserialization), but I have not found any problems there. I would describe the programming style of these files as "real world" rather than "Julia showcase", and the number of comments as "strictly for experts only" ( #8492 ), but they seem otherwise sound and well optimized.

I would be grateful for any comments or suggestions. At the moment, the two most promising courses of action seem to be either (1) making MEMDEBUG work again, or (2) investigating what happens during deserialization.

@ViralBShah added the kind:bug label Oct 2, 2014
@eschnett
Contributor Author

@JeffBezanson Would you have time for a Skype call some time next week? I'm in the EST time zone. I want to continue to track this down. Investigating bootstrapping with MEMDEBUG and looking at how functions are serialized both seem like promising next steps. For that I need to know more details about what your branch jb/size0stuff changed, or how functions (lambdas) are represented internally.

@JeffBezanson
Member

I can't do a call, but I found and fixed a bug that might be related (13638cd). I will keep working on this off and on if the segfaults persist.

@eschnett
Contributor Author

The segfault persists.

@JeffBezanson
Member

Another possible segfault fix: 819d9c8

@JeffBezanson
Member

I've fixed a couple more memory bugs. Worth checking again.

@eschnett
Contributor Author

Thanks. So far (i.e. as of a few days ago), things still reliably crash on two machines and work flawlessly on a third. I don't understand this, as I'm using the same toolchain to build the same Julia version. I fear that LLVM code generation may be involved... Anyway, I'll test this again after recovering from installing Yosemite.

@vtjnash
Member

vtjnash commented Oct 21, 2014

deserialize(s, t::DataType) creates objects that aren't properly initialized until they are returned, which would probably segfault if the gc were to ever run while this function was on the stack:

x = ccall(:jl_new_struct_uninit, Any, (Any,), t)

(edit: and I'm not really sure what it would do if it encountered an #undef, although it would perhaps just throw an undefined method deserialize(IO, UndefRefTag) error)

(edit2: between ser_tag -- which oddly isn't a WeakKeyDict --, deser_tag, and lambda_numbers, the serializer assumes there is exactly one global serialize/deserialize pair (worldwide). However, since you aren't using multiple Julia instances, I don't think this is your issue)
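
Roughly the pattern in question, as a sketch rather than the actual serialize.jl code (the exact field-setting call may differ): the object x is live but only partially filled in while the nested deserialize calls can allocate and trigger the gc, which is why zeroing in jl_new_struct_uninit matters.

# Sketch only, not the real Base method.
function deserialize_struct(s, t::DataType)
    x = ccall(:jl_new_struct_uninit, Any, (Any,), t)   # fields start out unset
    for i in 1:length(t.types)
        f = deserialize(s)   # may allocate and run the gc while x is half-built
        ccall(:jl_set_nth_field, Void, (Any, Csize_t, Any), x, i-1, f)
    end
    x
end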

@JeffBezanson
Member

I think that function zeros the new object, so it should be ok. The _uninit part is a bit of a misnomer.

@StefanKarpinski
Member

Ah, I had already written:

Should jl_new_struct_uninit maybe zero memory so that GC can at least know not to follow junk pointers?

@vtjnash
Member

vtjnash commented Oct 21, 2014

I think that function zeros the new object, so it should be ok. The _uninit part is a bit of a misnomer.

Ah, yes, I see that now.

@fundamental
Contributor

If any help is needed to verify this, I'm getting the same (or extremely similar) segfault just by transmitting 1D floating-point arrays and integers. It's a pretty deterministic segfault as well.

@JeffBezanson
Member

Is it possible to provide code that causes this? Ideally a fairly minimal test case? If you don't want to post it publicly you can email it to me directly.

@fundamental
Contributor

Right now it's in the middle of some unpublished research code, so I can't put it up directly, but I get the feeling that I should be able to reduce it to a minimal test case.
I'll update once the test is minimized and I've verified that my system LLVM isn't responsible.

@JeffBezanson
Member

Awesome, thanks!

@fundamental
Contributor

While trying to reduce the code, it looked like an early section might have caused the bug.
This early section was reduced to:

using PyPlot
DataSet = rand(9,9)
Spectra = nothing

function import_data()
    global DataSet = rand(4096*2)
end

function main()
    (Spectra,_) = specgram(DataSet, 4096*2, 100, noverlap=0)
    Spectra = Spectra[1:600,:]
    Spectra = mapslices(x->x./norm(sort(x)[floor(end/2)]), Spectra, 2)
    imshow(Spectra)
end

Even the names of things start to matter at this scale, which makes me think this might be a red herring created by an LLVM bug. LLVM is currently being recompiled in the background on my system.
To get the segfault, save the code as "julia-test-case.jl", start up a REPL, run reload("julia-test-case.jl"), run main() several times in the REPL (resulting in a bounds error each time), and then exit().

Hopefully with the rebuilt LLVM I'll get at something more relevant, though it is possible that interfacing with PyPlot leaves in a sleeper bug.

@mweastwood
Contributor

Probably related to JuliaPy/PyCall.jl#95. I have also been seeing a pretty consistent segfault while using PyCall + PyPlot on the latest master that I haven't been able to reduce to a minimal test case yet.

@fundamental
Contributor

The test case doesn't fail under LLVM r208372, but the segfault in the deserialize code still occurs.
Rebuilding LLVM to the most recent version, r220584, results in linker errors (undefined references to symbols that are defined in the LLVM libraries), so I am unable to test.
I'd guess the build system got confused, as there is currently a large variety of LLVM installs on this machine.

@JeffBezanson
Member

Here's the backtrace I get on exit:

julia> exit()
Exception SystemError: 'Objects/methodobject.c:120: bad argument to internal function' in 
signal (11): Segmentation fault
strlen at /usr/lib/libc.so.6 (unknown line)
PyString_FromFormatV at /usr/lib/libpython2.7.so (unknown line)
PyString_FromFormat at /usr/lib/libpython2.7.so (unknown line)
PyObject_Repr at /usr/lib/libpython2.7.so (unknown line)
unknown function (ip: 335546442)
PyFile_WriteObject at /usr/lib/libpython2.7.so (unknown line)
PyErr_WriteUnraisable at /usr/lib/libpython2.7.so (unknown line)
PyObject_ClearWeakRefs at /usr/lib/libpython2.7.so (unknown line)
unknown function (ip: 316758376)
pydecref at /home/jeff/.julia/v0.4/PyCall/src/PyCall.jl:71
jl_apply_generic at /home/jeff/src/julia2/julia/usr/bin/../lib/libjulia.so (unknown line)
unknown function (ip: 603255118)
unknown function (ip: 603258288)
uv_atexit_hook at /home/jeff/src/julia2/julia/usr/bin/../lib/libjulia.so (unknown line)
jl_exit at /home/jeff/src/julia2/julia/usr/bin/../lib/libjulia.so (unknown line)
exit at client.jl:37
Segmentation fault (core dumped)

I'd still like to see the "deserialize" part though.

@fundamental
Contributor

Unfortunately, after a reboot and relinking to the original LLVM version, I am no longer seeing this particular segfault.
Sorry for the noise.

@stevengj
Member

stevengj commented Nov 8, 2014

I did a bisect on JuliaPy/PyCall.jl#95, and the problem seems to originate in 0d60213, which is indeed related to serialization. (Alternatively, if it is some random memory-related bug, the bisect could be unreliable.) I'll try to verify this.

@stevengj
Member

stevengj commented Nov 8, 2014

Okay, I can confirm that the missing deserialize method seems to be the cause of the crash. Adding

function Base.deserialize(s, ::Type{TypeVar})
    name = deserialize(s)
    lb = deserialize(s)
    ub = deserialize(s)
    TypeVar(name, lb, ub)
end

at the top of the PyCall test program causes it to succeed.

I have no idea why this is relevant, though; when is deserialize called? (PyCall never calls it explicitly.)

@vtjnash
Member

vtjnash commented Nov 8, 2014

You can add ccall(:jl_breakpoint, Void, (Any,), x) in there (passing any handy value as x) and then interactively view the backtrace in a debugger to find out when it gets called.
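
For example, something along these lines (just a sketch, reusing the workaround method from above as the host; run julia under gdb, do "break jl_breakpoint" and then "run", and print the backtrace with "bt" when the breakpoint fires):

function Base.deserialize(s, ::Type{TypeVar})
    ccall(:jl_breakpoint, Void, (Any,), s)   # pauses here under the debugger
    name = deserialize(s)
    lb = deserialize(s)
    ub = deserialize(s)
    TypeVar(name, lb, ub)
end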

@stevengj
Member

stevengj commented Nov 8, 2014

Okay, false alarm: with a make debug build of Julia, the crash reappears even if I add the deserialize method, so it seems to be unrelated.

@stevengj
Member

stevengj commented Nov 8, 2014

Okay, it looks like it was a longstanding PyCall memory bug that just happened to have been exposed by Julia 0.4; should be fixed now on my end by PyCall 0.6.1.

@stevengj
Member

stevengj commented Nov 8, 2014

The bug in PyCall seems to have been triggered by a change in Julia 0.4 that I just mentioned on the mailing list: previously, passing an immutable value to ccall via & apparently passed an actual pointer to the data (so that you could mutate the immutable object in C), whereas now this no longer occurs.

It still "works" to pass a mutable value to ccall via & and mutate it in C. I'm still relying on this behavior in PyCall, although I probably shouldn't (unless you agree that this is a good behavior and document it). But right now I don't see any good alternative (you can't pass pointers to mutable types via arrays, as I understand it).
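
To illustrate the distinction (a sketch only, not PyCall code; rand_r is a POSIX libc function that advances a seed through the pointer you pass, so this won't run on Windows):

type MutableSeed
    seed::Cuint
end

immutable ImmutableSeed
    seed::Cuint
end

m = MutableSeed(42)
ccall(:rand_r, Cint, (Ptr{MutableSeed},), &m)
# m.seed has changed: per the behavior described above, & on a mutable object
# passes a pointer to the object itself, so the mutation made in C is visible.

imm = ImmutableSeed(42)
ccall(:rand_r, Cint, (Ptr{ImmutableSeed},), &imm)
# imm.seed is still 42: & on an immutable now passes a pointer to a temporary
# copy, so the mutation made in C is not reflected back.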

@stevengj
Member

stevengj commented Nov 8, 2014

I'm going to close this as we no longer seem to have a reproducible problem.

@stevengj closed this as completed Nov 8, 2014
@vtjnash
Member

vtjnash commented Nov 8, 2014

The behavior of immutables in ccall shouldn't have changed. It's always been the policy that you can mutate the pointed-to object however you want, with the proviso that the memory will be reclaimed as soon as the ccall returns and the changes will not be reflected back into the original immutable object.

@stevengj
Member

stevengj commented Nov 8, 2014

The manual's documentation for & in ccall says that changes are not reflected in the calling code. It sounds like this should be amended to say that changes are only reflected for mutable types?

@eschnett
Contributor Author

eschnett commented Nov 9, 2014

I've seen quite a few patches committed in the past few weeks that address memory (or GC) problems. Thanks!

A brief status report: My reduced test case at https://bitbucket.org/eschnett/funhpc.jl/branch/memdebug now hangs instead of segfaulting. The original code still segfaults.

@stevengj
Member

stevengj commented Nov 9, 2014

@eschnett, can you boil it down to a small amount of code and file a separate issue?

@eschnett
Contributor Author

Actually, I find that my original bug report still holds: There is a call to deserialize that causes a segfault; if I comment out the call, the segfault goes away. (The result of this call is never used.)

Above, I was confused because I built Julia with MEMDEBUG enabled; this seems to lead to a hang instead of a segfault. But stock Julia still segfaults, most likely caused by deserialize.

How can I re-open this issue?

@kmsquire reopened this Nov 10, 2014
@kmsquire
Member

@eschnett, at this point, you can just ask for it to be reopened.

@stevengj
Member

Yes, sorry, I got confused by a message above that said that the original segfault disappeared after a rebuild; now I realize that this didn't apply to your system.

I tried checking out your code, but it depended on MPI; you mentioned there was a version with the MPI dependency removed?

@vtjnash
Member

vtjnash commented Nov 22, 2014

I think I've tracked this down. Each new task was getting a small copy of the previous task's stack. After 1000 tasks, it seems that starts to cause issues.

@vtjnash changed the title from "Segfault caused by deserialize (?)" to "Segfault caused by deserialize (?) or too many tasks (?)" Nov 22, 2014
@JeffBezanson
Member

Very nice. Can you elaborate on why exactly that caused a segfault?

@vtjnash
Member

vtjnash commented Nov 22, 2014

I couldn't get a backtrace to be certain, and I couldn't use jl_ since the stack was broken, but I observed that most of jl_current_task was uninitialized (which is why we get unknown function (ip: 0), since neither ctx had been initialized). Additionally, jl_current_task->exception != jl_nothing, so it appears that something (a memory error?) may have been trying to throw an error simultaneously with trying to initialize the task.

@vtjnash
Member

vtjnash commented Nov 22, 2014

Also, another possible optimization that could be done here: we don't need to jump to the base_ctx at all if the task has already started. We only need to go there the first time we execute the task; otherwise, we could avoid the second call to setjmp.

waTeim pushed a commit to waTeim/julia that referenced this issue Nov 23, 2014
previously, each new stack would have a copy of a small portion of the
stack that started it. in small amounts, this was fine, but it is bad if
the user is starting many tasks. additionally, it wasted memory inside
every task to have a per-task base_ctx.
@tkelman
Contributor

tkelman commented Nov 25, 2014

Thoughts on whether the fix for this might be causing intermittent failures of the parallel test on Travis? https://travis-ci.org/JuliaLang/julia/jobs/42121382

@vtjnash
Member

vtjnash commented Nov 26, 2014

I don't have any reason to suspect this commit.

vtjnash added a commit that referenced this issue Dec 11, 2014
previously, each new stack would have a copy of a small portion of the
stack that started it. in small amounts, this was fine, but it is bad if
the user is starting many tasks. additionally, it wasted memory inside
every task to have a per-task base_ctx.

(cherry picked from commit ac13711)
@armgong
Contributor

armgong commented Feb 6, 2015

The parallel test failures at https://travis-ci.org/JuliaLang/julia/jobs/42121382 are the same as #10058.

@tkelman
Contributor

tkelman commented Feb 6, 2015

@armgong that was some time ago. The relevant part of that log is:

    From worker 9:       * bitarray
    From worker 5:       * math
    From worker 8:       * functional
    From worker 8:       * bigint
    From worker 8:       * sorting
    From worker 5:       * statistics
    From worker 6:       * spawn
    From worker 6:         [stdio passthrough ok]
    From worker 8:       * backtrace
    From worker 8:       * priorityqueue
    From worker 5:       * arpack
    From worker 6:       * file
    From worker 8:       * suitesparse
    From worker 8:       * version
    From worker 8:       * resolve
    From worker 6:       * pollfd
    From worker 6:       * mpfr
    From worker 8:       * broadcast
    From worker 5:       * complex
    From worker 6:       * socket
    From worker 6:       * floatapprox
    From worker 6:       * readdlm
    From worker 4:       * reflection
    From worker 2:       * regex
    From worker 4:       * float16
    From worker 2:       * combinatorics
    From worker 5:       * sysinfo
    From worker 4:       * rounding
    From worker 4:       * ranges
    From worker 5:       * mod2pi
    From worker 5:       * euler
    From worker 2:       * show
    From worker 5:       * lineedit
    From worker 2:       * replcompletions
    From worker 5:       * repl
    From worker 2:       * test
    From worker 2:       * goto
    From worker 2:       * llvmcall
    From worker 2:       * grisu
    From worker 6:       * nullable
    From worker 6:       * meta
    From worker 6:       * profile
    From worker 5:       * libgit2
    From worker 5:       * docs
    From worker 5:       * examples
    From worker 2:       * unicode
    From worker 6:       * parallel
exception on 6: Worker 9 terminated.

Note how worker 9 was terminated. The last test worker 9 was running was bitarray. That failure is another issue I reported that I don't think we ever fixed, #9176. Though I have been seeing that failure a little less often lately, it's probably just masked by other, recently more common failures.
