Segfault caused by deserialize (?) or too many tasks (?) #8551
@JeffBezanson Would you have time for a Skype call some time next week? I'm in the EST time zone. I want to continue to track this down. Either investigating bootstrapping with MEMDEBUG or looking at how functions are serialized seems a promising next step. For that I need to know more details about what your branch jb/size0stuff changed, and how functions (lambdas) are represented internally.
I can't do a call, but I found and fixed a bug that might be related (13638cd). I will keep working on this off and on if the segfaults persist.
The segfault persists.
Another possible segfault fix: 819d9c8
I've fixed a couple more memory bugs. Worth checking again.
Thanks. So far (i.e. as of a few days ago), things still reliably crash on two machines, and work flawlessly on a third. I don't understand this, as I'm using the same toolchain to build the same Julia version. I fear that LLVM code generation may be involved... Anyway, I'll test this again after recovering from installing Yosemite.
Line 562 in e3a74ee
(edit: and I'm not really sure what it would do if it encountered an (edit2: between
I think that function zeros the new object, so it should be ok. The _uninit
Ah, I had already written
Ah, yes, I see that now
If any help is needed to verify this: I'm getting the same (or extremely similar) segfault just by transmitting 1D floating-point arrays and integers. It's a pretty deterministic segfault as well.
Is it possible to provide code that causes this? Ideally a fairly minimal test case? If you don't want to post it publicly, you can email it to me directly.
Right now it's in the middle of some unpublished research code, so I can't put it up directly, but I get the feeling that I should be able to reduce it to a minimal test case.
Awesome, thanks!
While trying to reduce the code, it looked like an early section might have caused the bug.
Even the names of things start to matter at this scale, which makes me think this might be a red herring created by an LLVM bug. LLVM is currently getting recompiled in the background on my system. Hopefully with the rebuilt LLVM I'll get at something more relevant, though it is possible that interfacing with PyPlot leaves in a sleeper bug.
Probably related to JuliaPy/PyCall.jl#95. I have also been seeing a pretty consistent segfault while using PyCall + PyPlot on the latest master that I haven't been able to reduce to a minimal test case yet.
The test case doesn't fail under LLVM r208372, but the segfault in the deserialize code still occurs.
Here's the backtrace I get on exit:
I'd still like to see the "deserialize" part, though.
Unfortunately, after a reboot and relinking to the original LLVM version, I am no longer seeing this particular segfault.
I did a bisect on JuliaPy/PyCall.jl#95, and the problem seems to originate in 0d60213, which is indeed related to serialization. (Alternatively, if it is some random memory-related bug, the bisect could be unreliable.) I'll try to verify this.
Okay, I can confirm that adding the missing method

```julia
function Base.deserialize(s, ::Type{TypeVar})
    name = deserialize(s)
    lb = deserialize(s)
    ub = deserialize(s)
    TypeVar(name, lb, ub)
end
```

at the top of the PyCall test program causes it to succeed. I have no idea why this is relevant, though; when is
You can add
Okay, false alarm: when
Okay, it looks like it was a longstanding PyCall memory bug that just happened to have been exposed by Julia 0.4; it should be fixed now on my end by PyCall 0.6.1.
The bug in PyCall seems to have been triggered by a change in Julia 0.4 that I just mentioned on the mailing list: previously, passing a pointer to an immutable type to a C function via & allowed the callee's modifications to be observed by the caller; this no longer works. It still "works" to pass a pointer to a mutable type via &.
I'm going to close this as we no longer seem to have a reproducible problem. |
The behavior of immutables in ccall shouldn't have changed. It's always been the policy that you can mutate the pointed-to object however you want, with the proviso that the memory will be reclaimed as soon as the ccall returns, and the changes will not be reflected back into the original immutable object.
The manual's documentation for & in ccall says that changes are not reflected in the calling code. It sounds like this should be amended to say that changes are only reflected for mutable types?
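The semantics discussed above can be sketched with a standalone snippet (my own illustration, not from this thread: it uses libc's memset as the C function, and the modern Ref syntax rather than the 0.4-era &). A Ref hands the C function memory whose contents the caller can observe afterwards, which is the supported way to get write-back for both mutable and immutable element types:

```julia
# Hedged sketch: let a C function (libc memset) write through a pointer
# obtained from a Ref. The caller sees the result via r[]; by contrast,
# the old `&immutable` form handed C a temporary copy, so writes were lost.
r = Ref{Cint}(0)
ccall(:memset, Ptr{Cvoid}, (Ptr{Cint}, Cint, Csize_t), r, 0xff, sizeof(Cint))
r[]  # every byte is now 0xff, i.e. the Cint value -1
```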
I've seen quite a few patches being committed that address memory (or GC) problems in the past weeks. Thanks! A brief status report: my reduced test case at https://bitbucket.org/eschnett/funhpc.jl/branch/memdebug now hangs instead of segfaulting. The original code still segfaults.
@eschnett, can you boil it down to a small amount of code and file a separate issue?
Actually, I find that my original bug report still holds: there is a call to deserialize that causes a segfault; if I comment out the call, the segfault goes away. (The result of this call is never used.) Above, I was confused because I built Julia with MEMDEBUG enabled; this seems to lead to a hang instead of a segfault. But stock Julia still segfaults, most likely caused by deserialize. How can I re-open this issue?
@eschnett, at this point, you can just ask for it to be reopened.
Yes, sorry, I got confused by a message above that said that the original segfault disappeared after a rebuild; now I realize that this didn't apply to your system. I tried checking out your code, but it depended on MPI; you mentioned there was a version with the MPI dependency removed?
I think I've tracked this down. Each new task was getting a small copy of the stack that started it.
Very nice. Can you elaborate on why exactly that caused a segfault?
I couldn't get a backtrace to be certain, and I couldn't use
Also, another possible optimization that could be done here: we don't need to jump to the base_ctx at all if the task has already been started. We only need to go there the first time we execute the task; otherwise we can avoid the second call to setjmp.
previously, each new stack would have a copy of a small portion of the stack that started it. in small amounts, this was fine, but it is bad if the user is starting many tasks. additionally, it wasted memory inside every task to have a per-task base_ctx.
Thoughts on whether the fix for this might be causing intermittent failures of the parallel test on Travis? https://travis-ci.org/JuliaLang/julia/jobs/42121382
I don't have any reason to suspect this commit.
previously, each new stack would have a copy of a small portion of the stack that started it. in small amounts, this was fine, but it is bad if the user is starting many tasks. additionally, it wasted memory inside every task to have a per-task base_ctx. (cherry picked from commit ac13711)
https://travis-ci.org/JuliaLang/julia/jobs/42121382 parallel test failures, same as #10058
@armgong that was some time ago. The relevant part of that log is:
Note how worker 9 was terminated. The last test worker 9 was running was bitarray. That failure was another issue I reported that I don't think we ever fixed, #9176. Though I have been seeing that failure a little less often lately, it's probably just masked by other recently-more-common failures.
For the past two weeks I've been trying to find the cause of a rather persistent segfault in a Julia program of mine. So far, I've whittled it down to fewer than 1,000 lines of Julia code, with no external dependencies except Deque from DataStructures.jl (as a replacement for MPI.jl).
I create many tasks, many of them active (and yielding from time to time). I'm sure there's a lot of memory allocation and garbage collection going on. I also serialize and de-serialize many objects, including functions and lambdas.
The error is a segfault. I assume that there is a safe subset of Julia programs that should never segfault (no ccall, no @inbounds, etc.). I believe my program is safe in this respect.
Here is the code: https://bitbucket.org/eschnett/funhpc.jl/branch/memdebug. I apologize for the size -- I've already greatly reduced it. This is how to run it, and how it fails (after a few seconds):
This is with the current development version of Julia.
I believe the problem is somehow connected to, or triggered by, deserialization and/or the use of many tasks. Deserialization is called from the file Comm.jl, routine recv_item, line 39. The result of the deserialize call is never used; if this call is commented out, the program runs fine (and outputs "Done."). (In my attempts to reduce code size, termination detection probably got wonky, but if the text "Done." is output, everything is fine and the segfault is avoided.)
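The pattern in question (serializing and deserializing objects, including closures, through a stream, with the deserialized result discarded) can be reproduced in isolation. This is only a sketch of the shape of that code, not the recv_item routine itself, and it uses an in-memory IOBuffer in place of the real communication channel:

```julia
using Serialization  # stdlib in modern Julia; in 0.4, serialize/deserialize lived in Base

# Round-trip a closure through a buffer, the way a receive path would read
# it from a stream. In the issue, the deserialized value is never used, yet
# removing the deserialize call made the segfault disappear.
buf = IOBuffer()
serialize(buf, x -> x + 1)
seekstart(buf)
f = deserialize(buf)
f(41)  # the closure survives the round trip
```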
Similarly, when I reduce the number of concurrent tasks (i.e. tasks that are runnable simultaneously, as opposed to waiting), the segfault becomes more sporadic, and disappears if there are just a few tasks. The current example runs 1000 tasks simultaneously.
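A minimal sketch of that kind of workload (my own illustration, not the author's code): a large number of simultaneously runnable tasks, each yielding repeatedly so the scheduler keeps switching between their stacks:

```julia
# Start many concurrently runnable tasks that repeatedly yield.
# At this scale, any per-task memory overhead in the runtime adds up quickly.
function worker()
    for _ in 1:10
        yield()
    end
end

tasks = [Task(worker) for _ in 1:1000]
foreach(schedule, tasks)
foreach(wait, tasks)
all(istaskdone, tasks)  # true once every task has finished
```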
When I attach a debugger, the segfault happens in array.c or gc.c in an allocation routine, with a seemingly impossible internal state. I assume that malloc's internal heap data structures have been corrupted by then. Adding assert statements didn't reveal anything. I can't use MEMDEBUG, since MEMDEBUG builds have failed to bootstrap for about five weeks ( #8422 ).
I tried SANITIZE=1, but this didn't bootstrap for me either.
I tried building Julia with both LLVM 3.3 and LLVM 3.5.
I tried running this via Travis instead of on my laptop (OS X) and my workstation (Ubuntu), since I thought something might be wrong with my local setups, and Travis's setup should be well tested. However, I didn't have much luck there either -- debugging via Travis is too indirect to be productive.
I have looked in detail into the files array.c and gc.c (where the segfault is reported), as well as the file serialize.jl and iobuffer.jl (which handle deserialization), but I have not found any problems there. I would describe the programming styles of these files as "real world" rather than "Julia showcase", and the number of comments as "strictly for experts only" ( #8492 ), but they seem otherwise sound and well optimized.
I would be grateful for any comments or suggestions. At the moment, two promising courses of action seem either (1) make MEMDEBUG work again, or (2) investigate what happens during deserialization.