CI stack overflow error #11658
For people who want to reproduce it locally, this is what I used:
for ((i = 0; i < 3000; i++)); do
    echo $i; gdb --eval-command='set height 0' --eval-command='start' --eval-command='br segv_handler' --eval-command='continue' --eval-command='quit' --args ./julia -f test/sparsedir/sparse.jl
done
And the test file:
yuyichao% cat test/sparsedir/sparse.jl | grep -v '^ *#' | grep -v '^ *$'
using Base.Test
A = sprandbool(5,5,0.2)
@test float(A) == float(full(A))
P.S. (Edit:) WARNING: It's very hard to get out of the (almost) infinite gdb loop. I end up running |
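Since the shell loop above is hard to break out of, here is a minimal Python sketch of the same reproduction loop that stops on the first Ctrl-C. The gdb commands and the `./julia -f test/sparsedir/sparse.jl` invocation are copied from the loop above; the names `build_gdb_command`, `run_repro` and the `runner` parameter are hypothetical, and how gdb itself reacts to the interrupt is an assumption that has not been checked against this setup.

```python
import subprocess

# gdb commands taken verbatim from the shell loop in the thread.
GDB_COMMANDS = ["set height 0", "start", "br segv_handler", "continue", "quit"]


def build_gdb_command(julia="./julia", script="test/sparsedir/sparse.jl"):
    """Build the argv list for one gdb run of the sparse test."""
    cmd = ["gdb"]
    for gdb_cmd in GDB_COMMANDS:
        cmd.append("--eval-command=" + gdb_cmd)
    cmd += ["--args", julia, "-f", script]
    return cmd


def run_repro(max_runs=3000, runner=subprocess.call):
    """Run the test under gdb up to max_runs times.

    Returns the number of completed runs. A KeyboardInterrupt (Ctrl-C)
    ends the whole loop instead of only the current iteration.
    `runner` is injectable so the loop can be exercised without gdb.
    """
    completed = 0
    try:
        for i in range(max_runs):
            print(i)
            runner(build_gdb_command())
            completed += 1
    except KeyboardInterrupt:
        print("interrupted after", completed, "completed runs")
    return completed
```

Calling `run_repro()` from the Julia source directory would mirror the original loop; passing a stub `runner` makes it possible to try the control flow without gdb installed.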
I get this memory error
when I want to print the LLVM IR of the function. I guess it's because there isn't any stack space left to run the function? Does anyone know if there's a way to tell GDB to use a new stack for it? |
Another observation: the name of this function is very consistent (it changed when I updated my branch to the current master just now, but is the same otherwise). Also, if the error doesn't happen, there's no such function in memory.... |
OK. I added a check for when that bad function is emitted, and here's the backtrace:
rec_backtrace at /home/yuyichao/projects/julia/master/src/task.c:648
gdbbacktrace at /home/yuyichao/projects/julia/master/src/task.c:776
emit_function at /home/yuyichao/projects/julia/master/src/codegen.cpp:4044
to_function at /home/yuyichao/projects/julia/master/src/codegen.cpp:646
jl_compile at /home/yuyichao/projects/julia/master/src/codegen.cpp:807
jl_get_specialization at /home/yuyichao/projects/julia/master/src/gf.c:1423
emit_known_call at /home/yuyichao/projects/julia/master/src/codegen.cpp:1990
emit_call at /home/yuyichao/projects/julia/master/src/codegen.cpp:2563
emit_known_call at /home/yuyichao/projects/julia/master/src/codegen.cpp:2027
emit_call at /home/yuyichao/projects/julia/master/src/codegen.cpp:2563
emit_function at /home/yuyichao/projects/julia/master/src/codegen.cpp:4723
to_function at /home/yuyichao/projects/julia/master/src/codegen.cpp:646
jl_compile at /home/yuyichao/projects/julia/master/src/codegen.cpp:807
jl_get_specialization at /home/yuyichao/projects/julia/master/src/gf.c:1423
emit_known_call at /home/yuyichao/projects/julia/master/src/codegen.cpp:1990
emit_call at /home/yuyichao/projects/julia/master/src/codegen.cpp:2563
emit_assignment at /home/yuyichao/projects/julia/master/src/codegen.cpp:2899
emit_assignment at /home/yuyichao/projects/julia/master/src/codegen.cpp:2920
emit_function at /home/yuyichao/projects/julia/master/src/codegen.cpp:4636
to_function at /home/yuyichao/projects/julia/master/src/codegen.cpp:646
jl_compile at /home/yuyichao/projects/julia/master/src/codegen.cpp:807
jl_trampoline_compile_function at /home/yuyichao/projects/julia/master/src/builtins.c:952
jl_apply_generic at /home/yuyichao/projects/julia/master/src/gf.c:1650
anonymous at ./test.jl:89
do_test at ./test.jl:49
jl_apply_generic at /home/yuyichao/projects/julia/master/src/gf.c:1650
do_call at /home/yuyichao/projects/julia/master/src/interpreter.c:66
eval at /home/yuyichao/projects/julia/master/src/interpreter.c:212
jl_toplevel_eval_flex at /home/yuyichao/projects/julia/master/src/toplevel.c:540
jl_toplevel_eval_flex at /home/yuyichao/projects/julia/master/src/toplevel.c:569
jl_load at /home/yuyichao/projects/julia/master/src/toplevel.c:616
include at ./boot.jl:253
jl_apply_generic at /home/yuyichao/projects/julia/master/src/gf.c:1652
include_from_node1 at ./loading.jl:133
jl_apply_generic at /home/yuyichao/projects/julia/master/src/gf.c:1650
process_options at ./client.jl:312
_start at ./client.jl:404
unknown function (ip: -200032791)
jl_apply_generic at /home/yuyichao/projects/julia/master/src/gf.c:1652
unknown function (ip: 4200983)
unknown function (ip: 4199951)
__libc_start_main at /usr/lib/libc.so.6 (unknown line)
unknown function (ip: 4200041)
unknown function (ip: 0) |
And here's the IR at the end of
; Function Attrs: noreturn
define void @julia_convert_21091(%jl_value_t*, i64) #0 {
top:
%x = alloca i64, !dbg !14
call void @llvm.dbg.declare(metadata i64* %x, metadata !15, metadata !17)
store i64 %1, i64* %x, !dbg !14
%2 = load i64* %x, !dbg !18
call void @julia_convert_21091(%jl_value_t* inttoptr (i64 140728668487712 to %jl_value_t*), i64 %2), !dbg !18, !julia_type !20
unreachable, !dbg !18
after_assert: ; No predecessors!
ret void, !dbg !18
}
It's an infinite loop waiting to happen.... |
Here's the patch I used to catch this (the exact function name might be different):
diff --git a/src/codegen.cpp b/src/codegen.cpp
index a37dfe9..35cfcc7 100644
--- a/src/codegen.cpp
+++ b/src/codegen.cpp
@@ -100,6 +100,9 @@ extern "C" {
#include "builtin_proto.h"
+void gdbbacktrace();
+void jl_breakpoint(jl_value_t *v);
+
#ifdef HAVE_SSP
extern uintptr_t __stack_chk_guard;
extern void __stack_chk_fail();
@@ -4036,6 +4039,11 @@ static Function *emit_function(jl_lambda_info_t *lam)
#endif
funcName << "_" << globalUnique++;
+ if (funcName.str() == "julia_convert_21091") {
+ gdbbacktrace();
+ jl_breakpoint(NULL);
+ }
+
if (specsig) { // assumes !va
std::vector<Type*> fsig(0);
for(size_t i=0; i < jl_nparams(lam->specTypes); i++) {
@@ -4757,6 +4765,12 @@ static Function *emit_function(jl_lambda_info_t *lam)
JL_GC_POP();
+ if (funcName.str() == "julia_convert_21091") {
+ f->dump();
+ gdbbacktrace();
+ jl_breakpoint(NULL);
+ }
+
return f;
}
diff --git a/src/options.h b/src/options.h
index 43aba33..8723425 100644
--- a/src/options.h
+++ b/src/options.h
@@ -31,7 +31,7 @@
// #define FORCE_ELF
// with KEEP_BODIES, we keep LLVM function bodies around for later debugging
-// #define KEEP_BODIES
+#define KEEP_BODIES
// GC options ----------------------------------------------------------------- |
I have 3 processes stopped at one breakpoint or the other right now. I will be out for a while, but I would love some advice on how I should figure out what's wrong. |
The convert is emitted for
(gdb) p jl_(lam)
Base.Dates.convert(Type{Union()}, Int64) |
Not sure how it gets called yet, but here's an easier way to trigger the stack overflow:
julia> convert(Union(), 1)
ERROR: StackOverflowError:
in convert at ./dates/periods.jl:22
julia> @which convert(Union(), 1)
convert{T<:Base.Dates.Period}(::Type{T<:Base.Dates.Period}, x::Real) at dates/periods.jl:22
julia> @code_llvm convert(Union(), 1)
; Function Attrs: noreturn
define void @julia_convert_21201(%jl_value_t*, i64) #-1 {
top:
call void @julia_convert_21201(%jl_value_t* inttoptr (i64 140728668487712 to %jl_value_t*), i64 %1)
unreachable
}
julia> @code_typed convert(Union(), 1)
1-element Array{Any,1}:
:($(Expr(:lambda, Any[symbol("#s323"),:x], Any[Any[Any[symbol("#s323"),Type{Union()},0],Any[:x,Int64,0]],Any[],Any[],Any[:T]], :(begin # dates/periods.jl, line 22:
return (top(typeassert))($(Expr(:call1, :convert, Union(), :(x::Int64))),Union())::Union()
end::Union()))))
julia> @code_lowered convert(Union(), 1)
1-element Array{Any,1}:
:($(Expr(:lambda, Any[symbol("#s323"),:x], Any[Any[Any[symbol("#s323"),:Any,0],Any[:x,:Any,0]],Any[],0,Any[:T]], :(begin # dates/periods.jl, line 22:
return T(x)
end)))) |
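The outputs above suggest why this never terminates: the lowered body is `return T(x)`, and the typed body shows that `T(x)` has become a call to `convert` with the same arguments. A toy Python model of that cycle is sketched below. It is only an illustration, not Julia's dispatch: the names `construct`, `convert` and `overflows` are hypothetical stand-ins, and Python's `RecursionError` plays the role of `StackOverflowError`.

```python
def construct(T, x):
    # Stand-in for `T(x)` when T has no constructor of its own:
    # the call falls back to convert(T, x), as in the typed code above.
    return convert(T, x)


def convert(T, x):
    # Stand-in for the method at dates/periods.jl:22, whose body is `T(x)`.
    return construct(T, x)


def overflows(T, x):
    """Return True if convert(T, x) recurses until the stack limit is hit."""
    try:
        convert(T, x)
    except RecursionError:
        return True
    return False


# "Bottom" is just a label here; nothing in the cycle depends on its value.
print(overflows("Bottom", 1))
```

In the real case the cycle is only reachable because `Union()` is a subtype of every type, so it satisfies the `T<:Base.Dates.Period` constraint shown by `@which`.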
@JeffBezanson What's the long term (and short term) plan for these |
I'm curious how one ends up calling this method. I don't think there is an easy way to prevent those since, after all, everything is correct here. We could disallow matching tparams with Bottom in some circumstances, but that feels rather hackish. Anyway, it's not like we can prevent the language from expressing non-termination after all ;-) |
Here's a better backtrace (the gdb one instead of the julia one) Edit: forgot the link.... |
And it happens during this:
|
Which happens in this function
(gdb) p jl_(((jl_function_t*)0x7ffdf4fc5f30)->linfo)
Base.==(Base.SparseMatrix.SparseMatrixCSC{Union(), Int64}, Array{Float64, 2})
$7 = void |
well you almost have it. So this Sparse{Union(),Int} is made somewhere, or perhaps it's an inference bug which infers this type for a Sparse{Float,Int}. If you have the address of the first argument you could check if the type tag of this matrix is really this type. If it is then you'll have to find out where it was made :-) |
Yes, that is what I'm doing (although I'm in the middle of sth else so it will be slow...) |
(but at least this is directly called from |
(gdb) p jl_(((jl_value_t**)0x7fffffffcc60)[0])
Base.SparseMatrix.SparseMatrixCSC{Union(), Int64}(m=5, n=5, colptr=Array{Int64, 1}[1, 1, 1, 1, 1, 1], rowval=Array{Int64, 1}[], nzval=Array{Union(), 1}[])
$10 = void
@carnaval How do I check the tag? |
Like this?
(gdb) p jl_(jl_typeof(((jl_value_t**)0x7fffffffcc60)[0]))
Base.SparseMatrix.SparseMatrixCSC{Union(), Int64}
$12 = void |
The function that calls the |
yes, but if it was in apply_generic then, as you said, the tags are read here anyway. There aren't thousands of places where we could create a Union() array. I would say an untyped comprehension gone wrong? That could be PTSD speaking, but it could be either a bug in type_goto or something which ended up looking like |
Hum. Rereading the test, this is probably happening only when the sparse matrix is empty, right? :) |
Here is the problem
julia> typeof(float(Bool[]))
Array{Union(),1} |
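One plausible way an empty input can end up with a bottom element type is sketched below in Python. This is an assumption for illustration only; the thread does not confirm the exact mechanism in Base. The idea is that if the result element type is computed by joining the types of the converted elements, an empty input has nothing to join and yields the bottom type, whereas deriving it from the declared element type does not depend on the contents. All names here (`join_types`, `eltype_from_values`, `eltype_from_declared`) are hypothetical, and `None` stands in for `Union()`.

```python
def join_types(types):
    """Join a sequence of types; None stands in for the bottom type."""
    result = None
    for t in types:
        if result is None:
            result = t          # first element fixes the type
        elif result is not t:
            result = object     # mixed types widen to the top type
    return result


def eltype_from_values(values):
    # Element type inferred from the converted values themselves.
    return join_types(type(float(v)) for v in values)


def eltype_from_declared(declared):
    # Element type derived from the declared input type, independent of
    # how many elements there are.
    return type(float(declared()))


print(eltype_from_values([True, False]))  # two elements to join
print(eltype_from_values([]))             # nothing to join
print(eltype_from_declared(bool))
```

Under this model the empty case is the only one where the two approaches disagree, which would be consistent with the failure showing up only for empty sparse matrices.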
I think I've seen it happen when the matrix is not empty but I'm not 100% sure |
And I think it's happening way too often (1% for me) if it only happens with an empty matrix |
Hm, I'm unfamiliar with This would be a potential downside to returning |
Well, if you run the test 300 times with 0.2 density and 25 elements, you have about a 70% probability of it being empty at least once, right? Anyway, I think the fact that the empty sparse matrix reproduces the bug is quite telling, but if you manage to have this happen for a non-empty matrix it's much more worrying. |
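The estimate above can be checked with a short calculation. The sketch below assumes each of the 25 entries is nonzero independently with probability 0.2, which may not match exactly how `sprandbool` samples; the function names are hypothetical.

```python
def prob_empty(n_entries=25, density=0.2):
    """Probability that a single random matrix has no nonzero entries."""
    return (1.0 - density) ** n_entries


def prob_at_least_one_empty(runs, n_entries=25, density=0.2):
    """Probability that at least one of `runs` independent matrices is empty."""
    return 1.0 - (1.0 - prob_empty(n_entries, density)) ** runs


print("per run:  ", prob_empty())
print("300 runs: ", prob_at_least_one_empty(300))
```

Under this assumption a single run is empty about 0.38% of the time, and 300 runs hit an empty matrix at least once with probability of roughly 68%, in line with the "about 70%" figure.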
With callable types, it could probably be totally replaced with regular |
@mbauman yep, but if we go this way we would want to make the std lib more robust to Bottom arrays. I don't think it's that bad but it's definitely something to keep in mind. |
I don't know what's going on here, but wanted to say this is epic debugging work @yuyichao, keep it up! |
Isn't this just another instance of not rigorously checking and special casing the empty array case? |
I remember seeing it non-empty when I actually print out the matrix constructed but I'm not 100% sure. Let me try again with printing. |
Although it is certainly possible that I was confused by the output in a loop since an empty matrix won't print anything? |
@carnaval you are totally right about the empty matrix. I think I was just confused yesterday because the empty sparse matrix prints nothing.... (which IMHO should be fixed...) So I guess the solution is to make sure |
yep @jakebolewski has a patch on the way. You're right that print(empty_sparse) also deserves a fix. |
Ref #11661 |
Good to know. Thanks! |
not really a fix, more of a band-aid |
Should be fixed by #11663 |
I am running GDB in a loop to reproduce it and here's what I've got when it runs into infinite recursion.
I don't have KEEP_BODIES on so I might not be able to get the LLVM IR in this run but here's the disassembly from GDB
And 0x00007ffff7e3a76a is the address shown in the backtrace