Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ccall test races with GC #12122

Closed
JeffBezanson opened this issue Jul 12, 2015 · 12 comments · Fixed by #20207
Closed

ccall test races with GC #12122

JeffBezanson opened this issue Jul 12, 2015 · 12 comments · Fixed by #20207
Labels
test This change adds or pertains to unit tests

Comments

@JeffBezanson
Copy link
Sponsor Member

I once saw this in MEMDEBUG mode. It depends on precise timing but it's pretty clear what's going on:

     * ccall               exception on 1: ERROR: LoadError: test failed: -4919131752989213765 is 881
 in expression: unsafe_load(ccall_echo_func(IntLike(881),Ptr{IntLike},Ref{IntLike})).x === 881
 in error at ./error.jl:21
 in default_handler at test.jl:29
 in do_test at test.jl:52
 in include at ./boot.jl:254
 in runtests at /home/jeff/src/julia/test/testdefs.jl:197
 in anonymous at multi.jl:629
 in run_work_thunk at multi.jl:590
 in remotecall_fetch at multi.jl:663
 in remotecall_fetch at multi.jl:678
 in anonymous at task.jl:1401
while loading /home/jeff/src/julia/test/ccall.jl, in expression starting on line 20
ERROR: LoadError: LoadError: test failed: -4919131752989213765 is 881
 in expression: unsafe_load(ccall_echo_func(IntLike(881),Ptr{IntLike},Ref{IntLike})).x === 881
 in error at ./error.jl:21
 in default_handler at test.jl:29
 in do_test at test.jl:52
 in include at ./boot.jl:254
 in runtests at /home/jeff/src/julia/test/testdefs.jl:197
 in anonymous at multi.jl:629
 in run_work_thunk at multi.jl:590
 in remotecall_fetch at multi.jl:663
 in remotecall_fetch at multi.jl:678
 in anonymous at task.jl:1401
while loading /home/jeff/src/julia/test/ccall.jl, in expression starting on line 20
while loading /home/jeff/src/julia/test/runtests.jl, in expression starting on line 5

-4919131752989213765 is 0xbbbbbbbbbbbbbbbb, so the IntLike value must have been freed before its contents could be loaded.

@JeffBezanson JeffBezanson added the test This change adds or pertains to unit tests label Jul 12, 2015
@yuyichao
Copy link
Contributor

Pasting my gdb session for this issue from #12179:

This is weird. (GDB session below, it's a concatenation of two sessions with address space randomization off so the addresses should be coherent but the gdb variable names might be off..).

@carnaval
The watchpoint should be the address of the tag of the corrupted object (((void**)v) - 1) but it seems that all of the modification to this address are done in the GC. The modifications up to jl_box_int64 makes sense and the tag stored is indeed Int64. However, somehow sweep_page decides to overwrite the tag even though it is later reachable from the GC and is also not caught by the GC verifier? (the line in sweep_page seems to be off and I think the actually line that does the modification is *pfl = v; two lines above). The trace that reaches this object is also printed at the end of the session.

Allocations: 1358172 (Pool: 1357200; Other: 972); GC: 1499
Hardware watchpoint 1: *(void**)0x7ffdf2dd8498

Old value = (void *) 0x7ffdf2dd84a8
New value = (void *) 0x0
__pool_alloc (end_offset=16360, osize=16, p=0x7ffff7dbf2f8 <norm_pools+24>)
    at gc.c:1017
1017            p->nfree--;
(gdb) c
Continuing.
Hardware watchpoint 1: *(void**)0x7ffdf2dd8498

Old value = (void *) 0x0
New value = (void *) 0x7ffdf1364550
jl_box_int64 (x=553) at alloc.c:760
760     SIBOX_FUNC(int64,  int64_t, 1)
(gdb) c
Continuing.
Hardware watchpoint 1: *(void**)0x7ffdf2dd8498

Old value = (void *) 0x7ffdf1364550
New value = (void *) 0x7ffdf2dd85a8
sweep_page (osize=16, sweep_mask=3, pfl=0x7ffdf2dd85a8, pg=0x7ffff137d3c0, 
    p=0x7ffff7dbf2f8 <norm_pools+24>) at gc.c:1240
1240                    pfl_begin = pfl_begin ? pfl_begin : pfl;
(gdb) p jl_(0x7ffdf1364550)
Int64
$1 = void
(gdb) c
Continuing.
Hardware watchpoint 1: *(void**)0x7ffdf2dd8498

Old value = (void *) 0x7ffdf2dd85a8
New value = (void *) 0x7ffdf2dd84d8
sweep_page (osize=16, sweep_mask=3, pfl=0x7ffdf2dd84d8, pg=0x7ffff137d3c0, 
    p=0x7ffff7dbf2f8 <norm_pools+24>) at gc.c:1240
1240                    pfl_begin = pfl_begin ? pfl_begin : pfl;
(gdb) c
Continuing.
GC error (probable corruption) :
<?:::cconvert>

Program received signal SIGABRT, Aborted.
0x00007ffff6441528 in raise () from /usr/lib/libc.so.6
(gdb) p jl_(0x7ffdf2dd85a8)
<?::svec(UInt64, UInt32)>
$2 = void
(gdb) up
#1  0x00007ffff644293a in abort () from /usr/lib/libc.so.6
(gdb) up
#2  0x00007ffff6b55db1 in push_root (v=v@entry=0x7ffdf2dd84a0, d=5, d@entry=4, 
    bits=0) at gc.c:1725
1725            abort();
(gdb) bt
#0  0x00007ffff6441528 in raise () from /usr/lib/libc.so.6
#1  0x00007ffff644293a in abort () from /usr/lib/libc.so.6
#2  0x00007ffff6b55db1 in push_root (v=v@entry=0x7ffdf2dd84a0, d=5, d@entry=4, 
    bits=0) at gc.c:1725
#3  0x00007ffff6b5401a in gc_push_root (d=4, v=0x7ffdf2dd84a0) at gc.c:1441
#4  push_root (v=v@entry=0x7ffdf38f2590, d=4, d@entry=3, bits=3) at gc.c:1664
#5  0x00007ffff6b5388b in gc_push_root (d=3, v=0x7ffdf38f2590) at gc.c:1441
#6  push_root (v=v@entry=0x7ffdf34d5cf0, d=3, d@entry=2, bits=<optimized out>)
    at gc.c:1714
#7  0x00007ffff6b5388b in gc_push_root (d=2, v=0x7ffdf34d5cf0) at gc.c:1441
#8  push_root (v=v@entry=0x7ffdf34d5e70, d=2, d@entry=1, bits=<optimized out>)
    at gc.c:1714
#9  0x00007ffff6b56055 in gc_push_root (d=1, v=<optimized out>) at gc.c:1441
#10 gc_mark_stack (ta=ta@entry=0x7ffdf35007e0, s=0x7fffffffcb38, offset=0, d=1)
    at gc.c:1478
#11 0x00007ffff6b5761f in gc_mark_task_stack (d=1, ta=0x7ffdf35007e0)
    at gc.c:1540
#12 gc_mark_task (ta=ta@entry=0x3, d=d@entry=1) at gc.c:1558
#13 0x00007ffff6b54992 in push_root (v=0x3, d=1, d@entry=0, 
    bits=<optimized out>, bits@entry=1) at gc.c:1678
#14 0x00007ffff6b59608 in jl_gc_collect (full=<optimized out>) at gc.c:1993
#15 0x00007ffff6b5bec3 in jl_gc_alloc_2w () at gc.c:774
#16 0x00007ffff6b513e5 in newobj (nfields=2, type=0x7ffdf1364070)
    at julia_internal.h:25
#17 jl_alloc_svec_uninit (n=1) at simplevector.c:56
#18 0x00007ffff6b51531 in jl_svec_append (a=0x7ffdf2716410, b=0x7ffdf1368010)
    at simplevector.c:74
#19 0x00007ffff6b515ef in jl_svec_copy (a=<optimized out>) at simplevector.c:89
#20 0x00007ffff6ae65d0 in cache_method (mt=mt@entry=0x7ffdf2728790, 
    type=type@entry=0x7ffdf27251b0, method=0x7ffdf2716430, decl=0x7ffdf27251b0, 
    sparams=0x7ffdf1368010, isstaged=0 '\000') at gf.c:469
#21 0x00007ffff6ae7e54 in jl_mt_assoc_by_type (mt=mt@entry=0x7ffdf2728790, 
    tt=0x7ffdf27251b0, cache=cache@entry=1, inexact=inexact@entry=0) at gf.c:1027
#22 0x00007ffff6ae8542 in jl_apply_generic (F=0x7ffdf27163b0, 
    args=0x7fffffffcbb8, nargs=<optimized out>) at gf.c:1659
#23 0x00007ffdf123b3cc in julia_do_test_20851 (body=<optimized out>, 
    qex=<optimized out>) at test.jl:53
#24 0x00007ffff6ae848b in jl_apply (nargs=2, args=0x7fffffffcc98, 
    f=<optimized out>) at julia.h:1268
#25 jl_apply_generic (F=0x7ffdf2715f50, args=0x7fffffffcc98, 
    nargs=<optimized out>) at gf.c:1651
#26 0x00007ffff6b3a3a3 in jl_apply (nargs=2, args=0x7fffffffcc98, 
    f=0x7ffdf2715f50) at julia.h:1268
#27 do_call (f=f@entry=0x7ffdf2715f50, args=args@entry=0x7ffdf38f2328, 
    nargs=nargs@entry=2, eval0=eval0@entry=0x0, locals=locals@entry=0x0, 
    nl=nl@entry=0, ngensym=0) at interpreter.c:65
---Type <return> to continue, or q <return> to quit---q
Quit
(gdb) p jl_(0x7ffdf38f2590)
Array{Any, 1}[<?::<?:::cconvert>>, #<function>, 553]
$13 = void
(gdb) p jl_(0x7ffdf34d5cf0)
Expr(:comparison, <?::<?:::cconvert>>, #<function>, 553)::Any
$14 = void

@yuyichao
Copy link
Contributor

P.S. The command line I used was JL_GC_ALLOC_PRINT=0:100 JL_GC_ALLOC_OTHER=0 JL_GC_ALLOC_POOL=1192000:100 gdb --args ./julia -f test/runtests.jl ccall

@yuyichao
Copy link
Contributor

I've just reproduced the same error (with slightly different gc count because of the difference in configuration) on julia.mit.edu on a clean current master 97f62a9 built with make WITH_GC_DEBUG_ENV=1 and run with exactly the command line I pasted above.

Edit: I'm not suggesting debugging it on a server though since it's not a good idea to turn off address space randomization there...

@yuyichao
Copy link
Contributor

FWIW, the #<function> in that messed up expression/array is is/===

@carnaval
Copy link
Contributor

Well, if I read the ccall test correctly this is supposed to happen right ? Couldn't we just manually root the IntLike instance by hoisting it into a variable ?

@yuyichao
Copy link
Contributor

But the IntTest object should be a parameter to the function ccall_echo_func and should be rooted?

@yuyichao
Copy link
Contributor

Ahh, I see, so the echo function returns a pointer and that pointer is not valid anymore after ccall_echo_func returns....

@carnaval
Copy link
Contributor

yep, and even putting it in a variable is not technically safe if we become smarter about eliminating dead roots. At some point we may need a gc_root intrinsic in julia. It's not the case for now.

@yuyichao
Copy link
Contributor

become smarter about eliminating dead roots.

Right ... I'm just creating a wrapper function that does the load for now and I realized this because it feels really wierd that putting it in a function can fix it. I'll leave a comment there to state this (which is because the function argument "arbitrarily" extended the lifetime of the object...)...

@vtjnash
Copy link
Sponsor Member

vtjnash commented Jul 17, 2015

yep, and even putting it in a variable is not technically safe if we become smarter about eliminating dead roots. At some point we may need a gc_root intrinsic in julia. It's not the case for now.

agreed. one other idea i had for that was that any usage of a variable in value position (even if the compiler can eliminate any codegen from it) should suffice to ensure that the original object is still alive. this would allow you to place a statement that just lists the variables to keep alive at some point in the function and be certain that the gc-analysis algorithm will respect it.

@carnaval
Copy link
Contributor

that would simplify things but it would remove latitude from the (future ?) optimizer. Or maybe I didn't understand your idea and you want to special case statements which consists only of a variable to mean that ? I think we might as well have some kind of intrinsic for that then

@vtjnash
Copy link
Sponsor Member

vtjnash commented Jul 17, 2015

agreed, it would be a bit more conservative (like stack scanning?) and may be better to just have the intrinsic from the start.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test This change adds or pertains to unit tests
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants