-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
loading Oscar causes corruption #44661
Comments
Of course it might also be OSCAR that does something "sketchy" -- I'd like to draw attention to its dependency GAP.jl (I am one of the principal authors), which is (AFAIK) the unique package using Julia's "foreign type" kernel interface (which was added for and by us). Who knows, maybe something in there causes a problem. One thing that caused issues in the past: foreign types are indicated by |
Here is a current reproduce and backtrace (made on a Mac, but it can also be reproduced on Linux in exactly the same way):
|
Ok, this is interesting: julia> using JuliaInterpreter
julia> ex = :(f(x) = () -> x);
julia> JuliaInterpreter.finish_and_return!(JuliaInterpreter.finish_and_return!, Frame(Main, ex), true)
f (generic function with 1 method)
julia> JuliaInterpreter.finish_and_return!(JuliaInterpreter.finish_and_return!, Frame(Main, ex), true)
f (generic function with 1 method)
julia> import Oscar
[...]
julia> JuliaInterpreter.finish_and_return!(JuliaInterpreter.finish_and_return!, Frame(Main, ex), true)
f (generic function with 1 method)
julia> JuliaInterpreter.finish_and_return!(JuliaInterpreter.finish_and_return!, Frame(Main, ex), true)
signal (11): Segmentation fault
in expression starting at none:1
mtcache_hash_lookup at /cache/build/default-amdci4-4/julialang/julia-master/src/typemap.c:292 [inlined]
mtcache_hash_lookup at /cache/build/default-amdci4-4/julialang/julia-master/src/typemap.c:288 [inlined]
jl_typemap_level_assoc_exact at /cache/build/default-amdci4-4/julialang/julia-master/src/typemap.c:1025
jl_typemap_assoc_exact at /cache/build/default-amdci4-4/julialang/julia-master/src/julia_internal.h:1328 [inlined]
[...] So even though Oscar is never actually used anywhere, just loading it seems to cause corruption. I wonder whether this might be related to #36770? I don't really know what |
Ok, I looked more into it and if I run julia inside gdb, I already get a segfault if I just do The segfault in GDB
|
Interesting, I cannot reproduce this with just Instances of foreign types cannot be serialized at all (and I've inserted a check guarding against attempts to do so some time ago into AFAIK there is no special code at all for serializing the foreign types, yet so far I never run into this? But I'll add a guard in there now, too, to see if and where it triggers. I don't see how to serialize a foreign type in the current system, as they contain pointers to C functions. There are of course ways to do deal with that (e.g. associate a unique global identifier with the type which can be used to restore the function pointers during runtime). But I don't understand enough of what's going on to formulate a concrete plan. In particular, why would that type be serialized and then deserialized again in the first place? And isn't there logic in place which determines during (de)serialization that the type is in fact already in the system, and just uses a pointer to the existing type? I imagine you'd not want to have two copies of, say, type Sorry for all the dumb questions. |
Inserted an assertion into |
@simeonschaub ahha, just noticed you posted a gdb backtrace for that "segfault" you got when loading GAP.jl. But that one is "normal" -- it is caused by GAP scanning the stack during garbage collecting for references to objects. But since it only has imprecise information about the size of the stack, it may end up accessing guard pages at the end of stack, which cause a segfault -- but those are caught, and handled. Unfortunately, to gdb they look like a regular segfault. Unfortunately I don't know of a good way to deal with this in gdb -- one can tell it to ignore segfaults, but AFAIK there is no way to make this conditional (ie. ignore |
I compiled the commit right before 7b1e454 from source. I'm on Manjaro Linux and the following
Interesting, that's the first time I heard about "normal" segfaults! 😆 It does seem to be a literal null pointer though, I would have expected there to be some kind of check?
That could have changed after #43671, but that PR didn't actually use any typed globals yet, so the type field of a binding should only ever be |
It's pretty much standard for conservative stack scanning GC. And
Although I don't know whether that's still true. In any case, Julia's C kernel has facilities to deal with SIGSEGV which we use for GAP.jl (specifically,
Not sure what makes you say it's a null pointer? This is the code in question: static void FindLiveRangeReverse(PtrArray * arr, void * start, void * end)
{
if (lt_ptr(end, start)) {
SWAP(void *, start, end);
}
char * p = (char *)(align_ptr(start));
char * q = (char *)end - sizeof(void *);
while (!lt_ptr(q, p)) {
void * addr = *(void **)q; // <- line 504
if (addr && jl_gc_internal_obj_base_ptr(addr) == addr &&
jl_typeis(addr, datatype_mptr)) {
PtrArrayAdd(arr, addr);
}
q -= C_STACK_ALIGN;
}
} It segfaults dereferencing
Correct |
This might be a total red herring, but: in the "real" backtraces, I see functions named Going back to Finally, fully disclosure: current GAP.jl does something real evil, which I suspected for a time to be the cause of this, but I can remove this evilness and it still crashes: namely, it takes the type |
I don't really think this should be on the milestone. The discussion here seems to be way and beyond what is considered as Julia's public interface. Of course, it would be good to fix but it should not hold up a release. |
The crash is gone now. I have no idea what change caused it (maybe something we changed in GAP.jl or Oscar.jl). I tried all kinds of variations and both the 1.8 ( So from my POV this could be closed now. |
See timholy/Revise.jl#675. Was bisected to #43671, but I wasn't able to really extract anything from the rr trace, so I'm not sure if that might just have surfaced the issue or Revise might be doing something sketchy.
The text was updated successfully, but these errors were encountered: