
Segfault caused by deserialize (?) or too many tasks (?) #8551

Closed
eschnett opened this issue Oct 2, 2014 · 45 comments
Labels: kind:bug (Indicates an unexpected problem or unintended behavior)

Comments

@eschnett
Contributor

eschnett commented Oct 2, 2014

For the past two weeks I've been trying to find the cause of a rather persistent segfault in a Julia program of mine. So far, I have whittled it down to fewer than 1,000 lines of Julia code, with no external dependencies except Deque from DataStructures.jl (as a replacement for MPI.jl).

I create many tasks, many of them active (and yielding from time to time). I'm sure there's a lot of memory allocation and garbage collection going on. I also serialize and de-serialize many objects, including functions and lambdas.

The error is a segfault. I assume that there is a safe subset of Julia programs that should never segfault (no ccall, no @inbounds, etc.). I believe my program is safe in this respect.

Here is the code: https://bitbucket.org/eschnett/funhpc.jl/branch/memdebug. I apologize for the size -- I've already greatly reduced it. This is how to run it, and how it fails (after a few seconds):

$ ~/julia/bin/julia-debug Wave.jl
Wave[FunHPC.jl]
Initialization

signal (11): Segmentation fault: 11
unknown function (ip: 0)
Segmentation fault: 11

This is with the current development version of Julia.

I believe the problem is somehow connected to or triggered by deserialization and/or by using many tasks. Deserialization is called from the file Comm.jl, routine recv_item, line 39. The result of the deserialize call is never used. If this call is commented out, the program runs fine (and outputs "Done."). (In my attempts to reduce code size, termination detection probably got wonky, but if the text "Done." is output, everything is fine and the segfault is avoided.)

Similarly, when I reduce the number of concurrent tasks (i.e. tasks that are runnable simultaneously, as opposed to waiting), the segfault becomes more sporadic, and disappears if there are just a few tasks. The current example runs 1000 tasks simultaneously.
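
For reference, here is a minimal illustrative sketch of the kind of workload described above -- many concurrent tasks, each repeatedly round-tripping a closure through serialize/deserialize and yielding in between. (This is not code from the linked repository; the names and counts are made up.)

# Serialize a value into an IOBuffer and immediately deserialize it again.
function roundtrip(x)
    buf = IOBuffer()
    serialize(buf, x)
    seekstart(buf)
    deserialize(buf)
end

# Each task round-trips closures and yields, so many tasks are runnable at once.
function worker(n)
    for i in 1:n
        roundtrip(y -> y + i)   # functions/lambdas get serialized too
        yield()
    end
end

tasks = Task[]
for k in 1:1000
    push!(tasks, @async worker(10))
end
for t in tasks
    wait(t)
end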

When I attach a debugger, the segfault happens in array.c or gc.c in an allocation routine, with a seemingly impossible internal state. I assume that malloc's internal heap data structures have been corrupted by then. Adding assert statements didn't reveal anything. I can't use MEMDEBUG, since it hasn't bootstrapped for about five weeks ( #8422 ).

I tried SANITIZE=1, but this didn't bootstrap for me either.

I tried building Julia with both LLVM 3.3 and LLVM 3.5.

I tried running this via Travis instead of on my laptop (OS X) and my workstation (Ubuntu), since I thought something might be wrong with my local setups, and Travis's setup should be well tested. However, I didn't have much luck there either -- debugging via Travis is too indirect to be productive.

I have looked in detail into the files array.c and gc.c (where the segfault is reported), as well as the files serialize.jl and iobuffer.jl (which handle deserialization), but I have not found any problems there. I would describe the programming style of these files as "real world" rather than "Julia showcase", and the number of comments as "strictly for experts only" ( #8492 ), but they seem otherwise sound and well optimized.

I would be grateful for any comments or suggestions. At the moment, the two most promising courses of action seem to be either (1) making MEMDEBUG work again, or (2) investigating what happens during deserialization.

@ViralBShah added the kind:bug label Oct 2, 2014
@eschnett
Contributor Author

@JeffBezanson Would you have time for a Skype call some time next week? I'm in the EST time zone. I want to continue to track this down. Investigating bootstrapping with MEMDEBUG and looking at how functions are serialized both seem like promising next steps. For that I need to know more details about what your branch jb/size0stuff changed, or how functions (lambdas) are represented internally.

@JeffBezanson
Member

I can't do a call, but I found and fixed a bug that might be related (13638cd). I will keep working on this off and on if the segfaults persist.

@eschnett
Contributor Author

The segfault persists.

@JeffBezanson
Member

Another possible segfault fix: 819d9c8

@JeffBezanson
Member

I've fixed a couple more memory bugs. Worth checking again.

@eschnett
Contributor Author

Thanks. So far (i.e. as of a few days ago), things still reliably crash on two machines and work flawlessly on a third. I don't understand this, as I'm using the same toolchain to build the same Julia version. I fear that LLVM code generation may be involved... Anyway, I'll test this again after recovering from installing Yosemite.

@vtjnash
Member

vtjnash commented Oct 21, 2014

deserialize(s, t::DataType) creates objects that aren't properly initialized until they are returned, which would probably segfault if the gc were to ever run while this function was on the stack:

x = ccall(:jl_new_struct_uninit, Any, (Any,), t)

(edit: and I'm not really sure what it would do if it encountered an #undef, although it would perhaps just throw an undefined method deserialize(IO, UndefRefTag) error)

(edit2: between ser_tag -- which oddly isn't a WeakKeyDict --, deser_tag, and lambda_numbers, the serializer assumes there is exactly one global serialize/deserialize pair (worldwide). However, since you aren't using multiple Julia instances, I don't think this is your issue)
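
Roughly the pattern in question, as a sketch rather than the actual serialize.jl code (the exact field-setting call may differ): the object x is live but only partially filled in while the nested deserialize calls can allocate and trigger the gc, which is why zeroing in jl_new_struct_uninit matters.

# Sketch only, not the real Base method.
function deserialize_struct(s, t::DataType)
    x = ccall(:jl_new_struct_uninit, Any, (Any,), t)   # fields start out unset
    for i in 1:length(t.types)
        f = deserialize(s)   # may allocate and run the gc while x is half-built
        ccall(:jl_set_nth_field, Void, (Any, Csize_t, Any), x, i-1, f)
    end
    x
end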

@JeffBezanson
Member

I think that function zeros the new object, so it should be ok. The _uninit part is a bit of a misnomer.

@StefanKarpinski
Member

Ah, I had already written:

Should jl_new_struct_uninit maybe zero memory so that GC can at least know not to follow junk pointers?

@vtjnash
Member

vtjnash commented Oct 21, 2014

I think that function zeros the new object, so it should be ok. The _uninit part is a bit of a misnomer.

Ah, yes, I see that now.

@fundamental
Contributor

If any help is needed to verify this, I'm getting the same (or extremely similar) segfault just by transmitting 1D floating-point arrays and integers. It's a pretty deterministic segfault as well.

@JeffBezanson
Member

Is it possible to provide code that causes this? Ideally a fairly minimal test case? If you don't want to post it publicly you can email it to me directly.

@fundamental
Contributor

Right now it's in the middle of some unpublished research code, so I can't put it up directly, but I get the feeling that I should be able to reduce it to a minimal test case.
I'll update once the test is minimized and I've verified that my system LLVM isn't responsible.

@JeffBezanson
Member

Awesome, thanks!

@fundamental
Contributor

While trying to reduce the code, it looked like an early section might have caused the bug.
This early section was reduced to:

using PyPlot
DataSet = rand(9,9)
Spectra = nothing

function import_data()
    global DataSet = rand(4096*2)
end

function main()
    (Spectra,_) = specgram(DataSet, 4096*2, 100, noverlap=0)
    Spectra = Spectra[1:600,:]
    Spectra = mapslices(x->x./norm(sort(x)[floor(end/2)]), Spectra, 2)
    imshow(Spectra)
end

Even the names of things start to matter at this scale, which makes me think this might be a red herring created by an LLVM bug. LLVM is currently being recompiled in the background on my system.
To get the segfault, save the code as "julia-test-case.jl", start up a REPL, run reload("julia-test-case.jl"), run main() several times in the REPL (resulting in a bounds error each time), and then exit().

Hopefully with the rebuilt LLVM I'll get at something more relevant, though it is possible that interfacing with PyPlot leaves in a sleeper bug.

@mweastwood
Contributor

Probably related to JuliaPy/PyCall.jl#95. I have also been seeing a pretty consistent segfault while using PyCall + PyPlot on the latest master that I haven't been able to reduce to a minimal test case yet.

@fundamental
Contributor

The test case doesn't fail under LLVM r208372, but the segfault in the deserialize code still occurs.
Rebuilding LLVM to the most recent version, r220584, results in linker errors (undefined references to symbols that are defined in the LLVM libraries), so I am unable to test.
I'd guess the build system got confused, as there is currently a large variety of LLVM installs on this machine.

@JeffBezanson
Member

Here's the backtrace I get on exit:

julia> exit()
Exception SystemError: 'Objects/methodobject.c:120: bad argument to internal function' in 
signal (11): Segmentation fault
strlen at /usr/lib/libc.so.6 (unknown line)
PyString_FromFormatV at /usr/lib/libpython2.7.so (unknown line)
PyString_FromFormat at /usr/lib/libpython2.7.so (unknown line)
PyObject_Repr at /usr/lib/libpython2.7.so (unknown line)
unknown function (ip: 335546442)
PyFile_WriteObject at /usr/lib/libpython2.7.so (unknown line)
PyErr_WriteUnraisable at /usr/lib/libpython2.7.so (unknown line)
PyObject_ClearWeakRefs at /usr/lib/libpython2.7.so (unknown line)
unknown function (ip: 316758376)
pydecref at /home/jeff/.julia/v0.4/PyCall/src/PyCall.jl:71
jl_apply_generic at /home/jeff/src/julia2/julia/usr/bin/../lib/libjulia.so (unknown line)
unknown function (ip: 603255118)
unknown function (ip: 603258288)
uv_atexit_hook at /home/jeff/src/julia2/julia/usr/bin/../lib/libjulia.so (unknown line)
jl_exit at /home/jeff/src/julia2/julia/usr/bin/../lib/libjulia.so (unknown line)
exit at client.jl:37
Segmentation fault (core dumped)

I'd still like to see the "deserialize" part though.

@fundamental
Contributor

Unfortunately, after a reboot and relinking to the original LLVM version, I am no longer seeing this particular segfault.
Sorry for the noise.

@stevengj
Member

stevengj commented Nov 8, 2014

I did a bisect on JuliaPy/PyCall.jl#95, and the problem seems to originate in 0d60213, which is indeed related to serialization. (Alternatively, if it is some random memory-related bug, the bisect could be unreliable.) I'll try to verify this.

@stevengj
Member

stevengj commented Nov 8, 2014

Okay, I can confirm that the missing deserialize method seems to be the cause of the crash. Adding

function Base.deserialize(s, ::Type{TypeVar})
    name = deserialize(s)
    lb = deserialize(s)
    ub = deserialize(s)
    TypeVar(name, lb, ub)
end

at the top of the PyCall test program causes it to succeed.

I have no idea why this is relevant, though; when is deserialize called? (PyCall never calls it explicitly.)

@vtjnash
Member

vtjnash commented Nov 8, 2014

You can add ccall(:jl_breakpoint, Void, (Any,), x) in there (passing any handy value as x) and then interactively view the backtrace in a debugger to find out when it gets called.
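
For example, something along these lines (just a sketch, reusing the workaround method from above as the host; run julia under gdb, do "break jl_breakpoint" and then "run", and print the backtrace with "bt" when the breakpoint fires):

function Base.deserialize(s, ::Type{TypeVar})
    ccall(:jl_breakpoint, Void, (Any,), s)   # pauses here under the debugger
    name = deserialize(s)
    lb = deserialize(s)
    ub = deserialize(s)
    TypeVar(name, lb, ub)
end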

@stevengj
Member

stevengj commented Nov 8, 2014

Okay, false alarm: with a make debug build of Julia, the crash reappears even if I add the deserialize method, so it seems to be unrelated.

@stevengj
Member

stevengj commented Nov 8, 2014

Okay, it looks like it was a longstanding PyCall memory bug that just happened to have been exposed by Julia 0.4; should be fixed now on my end by PyCall 0.6.1.

@stevengj
Member

stevengj commented Nov 8, 2014

The bug in PyCall seems to have been triggered by a change in Julia 0.4 that I just mentioned on the mailing list: previously, passing an immutable value to ccall via & apparently passed an actual pointer to the data (so that you could mutate the immutable object in C), whereas now this no longer occurs.

It still "works" to pass a mutable value to ccall via & and mutate it in C. I'm still relying on this behavior in PyCall, although I probably shouldn't (unless you agree that this is a good behavior and document it). But right now I don't see any good alternative (you can't pass pointers to mutable types via arrays, as I understand it).
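
To illustrate the distinction (a sketch only, not PyCall code; rand_r is a POSIX libc function that advances a seed through the pointer you pass, so this won't run on Windows):

type MutableSeed
    seed::Cuint
end

immutable ImmutableSeed
    seed::Cuint
end

m = MutableSeed(42)
ccall(:rand_r, Cint, (Ptr{MutableSeed},), &m)
# m.seed has changed: per the behavior described above, & on a mutable object
# passes a pointer to the object itself, so the mutation made in C is visible.

imm = ImmutableSeed(42)
ccall(:rand_r, Cint, (Ptr{ImmutableSeed},), &imm)
# imm.seed is still 42: & on an immutable now passes a pointer to a temporary
# copy, so the mutation made in C is not reflected back.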

@stevengj
Member

stevengj commented Nov 8, 2014

I'm going to close this as we no longer seem to have a reproducible problem.

@stevengj closed this as completed Nov 8, 2014
@vtjnash
Member

vtjnash commented Nov 8, 2014

The behavior of immutables in ccall shouldn't have changed. It's always been the policy that you can mutate the pointed-to object however you want, with the proviso that the memory will be reclaimed as soon as the ccall returns and the changes will not be reflected back into the original immutable object.

@stevengj
Member

stevengj commented Nov 8, 2014

The manual's documentation for & in ccall says that changes are not reflected in the calling code. It sounds like this should be amended to say that changes are only reflected for mutable types?

@eschnett
Contributor Author

eschnett commented Nov 9, 2014

I've seen quite a few patches committed in the past few weeks that address memory (or GC) problems. Thanks!

A brief status report: My reduced test case at https://bitbucket.org/eschnett/funhpc.jl/branch/memdebug now hangs instead of segfaulting. The original code still segfaults.

@stevengj
Member

stevengj commented Nov 9, 2014

@eschnett, can you boil it down to a small amount of code and file a separate issue?

@eschnett
Contributor Author

Actually, I find that my original bug report still holds: There is a call to deserialize that causes a segfault; if I comment out the call, the segfault goes away. (The result of this call is never used.)

Above, I was confused because I built Julia with MEMDEBUG enabled; this seems to lead to a hang instead of a segfault. But stock Julia still segfaults, most likely caused by deserialize.

How can I re-open this issue?

@kmsquire reopened this Nov 10, 2014
@kmsquire
Member

@eschnett, at this point, you can just ask for it to be reopened.

@stevengj
Member

Yes, sorry, I got confused by a message above that said that the original segfault disappeared after a rebuild; now I realize that this didn't apply to your system.

I tried checking out your code, but it depended on MPI; you mentioned there was a version with the MPI dependency removed?

@vtjnash
Member

vtjnash commented Nov 22, 2014

I think I've tracked this down. Each new task was getting a small copy of the previous task's stack. After 1000 tasks, it seems that starts to cause issues.

@vtjnash changed the title from "Segfault caused by deserialize (?)" to "Segfault caused by deserialize (?) or too many tasks (?)" Nov 22, 2014
@JeffBezanson
Member

Very nice. Can you elaborate on why exactly that caused a segfault?

@vtjnash
Member

vtjnash commented Nov 22, 2014

I couldn't get a backtrace to be certain, and I couldn't use jl_ since the stack was broken, but I observed that most of jl_current_task was uninitialized (which is why we get unknown function (ip: 0), since neither ctx had been initialized). Additionally, jl_current_task->exception != jl_nothing, so it appears that something (a memory error?) may have been trying to throw an error simultaneously with trying to initialize the task.

@vtjnash
Member

vtjnash commented Nov 22, 2014

Also, another possible optimization that could be done here: we don't need to jump to the base_ctx at all if the task has already started. We only need to go there the first time we execute the task; otherwise, we could avoid the second call to setjmp.

waTeim pushed a commit to waTeim/julia that referenced this issue Nov 23, 2014
previously, each new stack would have a copy of a small portion of the
stack that started it. in small amounts, this was fine, but it is bad if
the user is starting many tasks. additionally, it wasted memory inside
every task to have a per-task base_ctx.
@tkelman
Contributor

tkelman commented Nov 25, 2014

Thoughts on whether the fix for this might be causing intermittent failures of the parallel test on Travis? https://travis-ci.org/JuliaLang/julia/jobs/42121382

@vtjnash
Member

vtjnash commented Nov 26, 2014

I don't have any reason to suspect this commit.

vtjnash added a commit that referenced this issue Dec 11, 2014
previously, each new stack would have a copy of a small portion of the
stack that started it. in small amounts, this was fine, but it is bad if
the user is starting many tasks. additionally, it wasted memory inside
every task to have a per-task base_ctx.

(cherry picked from commit ac13711)
@armgong
Contributor

armgong commented Feb 6, 2015

The parallel test failures at https://travis-ci.org/JuliaLang/julia/jobs/42121382 are the same as #10058.

@tkelman
Contributor

tkelman commented Feb 6, 2015

@armgong that was some time ago. The relevant part of that log is:

    From worker 9:       * bitarray
    From worker 5:       * math
    From worker 8:       * functional
    From worker 8:       * bigint
    From worker 8:       * sorting
    From worker 5:       * statistics
    From worker 6:       * spawn
    From worker 6:         [stdio passthrough ok]
    From worker 8:       * backtrace
    From worker 8:       * priorityqueue
    From worker 5:       * arpack
    From worker 6:       * file
    From worker 8:       * suitesparse
    From worker 8:       * version
    From worker 8:       * resolve
    From worker 6:       * pollfd
    From worker 6:       * mpfr
    From worker 8:       * broadcast
    From worker 5:       * complex
    From worker 6:       * socket
    From worker 6:       * floatapprox
    From worker 6:       * readdlm
    From worker 4:       * reflection
    From worker 2:       * regex
    From worker 4:       * float16
    From worker 2:       * combinatorics
    From worker 5:       * sysinfo
    From worker 4:       * rounding
    From worker 4:       * ranges
    From worker 5:       * mod2pi
    From worker 5:       * euler
    From worker 2:       * show
    From worker 5:       * lineedit
    From worker 2:       * replcompletions
    From worker 5:       * repl
    From worker 2:       * test
    From worker 2:       * goto
    From worker 2:       * llvmcall
    From worker 2:       * grisu
    From worker 6:       * nullable
    From worker 6:       * meta
    From worker 6:       * profile
    From worker 5:       * libgit2
    From worker 5:       * docs
    From worker 5:       * examples
    From worker 2:       * unicode
    From worker 6:       * parallel
exception on 6: Worker 9 terminated.

Note how worker 9 was terminated. The last test worker 9 was running was bitarray. That failure is another issue I reported that I don't think we ever fixed, #9176. Though I have been seeing that failure a little less often lately, it's probably just masked by other, recently more common failures.
