CI stack overflow error #11658

Closed
yuyichao opened this issue Jun 10, 2015 · 40 comments
Labels
kind:bug Indicates an unexpected problem or unintended behavior

Comments

@yuyichao
Contributor

I am running GDB in a loop to reproduce it and here's what I've got when it runs into infinite recursion.

I don't have KEEP_BODIES on so I might not be able to get the LLVM IR in this run but here's the disassembly from GDB

(gdb) disassemble julia_convert_21056
Dump of assembler code for function julia_convert_21056:
   0x00007ffff7e3a750 <+0>:     push   %rbp
   0x00007ffff7e3a751 <+1>:     mov    %rsp,%rbp
   0x00007ffff7e3a754 <+4>:     movabs $0x7ffff7e3a750,%rax
   0x00007ffff7e3a75e <+14>:    movabs $0x7ffdf24b8020,%rdi
=> 0x00007ffff7e3a768 <+24>:    callq  *%rax
End of assembler dump.
(gdb) disassemble (void*)0x00007ffff7e3a76a
No function contains specified address.

And 0x00007ffff7e3a76a is the address shown in the backtrace
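The addresses in the disassembly above can be checked with a little arithmetic. The following is a minimal sketch, assuming that `callq *%rax` is encoded in 2 bytes (the usual `FF D0` encoding; the other lengths follow from the offsets GDB prints). It shows that the address from the backtrace is the return address of the call, which sits one byte past the last instruction listed, and that the call target is the function's own entry point.

```python
# Entry point of julia_convert_21056, as printed by GDB.
FUNC_START = 0x00007FFFF7E3A750

# (mnemonic, offset, length in bytes). Lengths of the first four entries
# follow from the offsets in the dump; the length of the call is an
# assumption (2 bytes, the FF D0 encoding of `callq *%rax`).
instructions = [
    ("push %rbp", 0, 1),
    ("mov %rsp,%rbp", 1, 3),
    ("movabs $0x7ffff7e3a750,%rax", 4, 10),
    ("movabs $0x7ffdf24b8020,%rdi", 14, 10),
    ("callq *%rax", 24, 2),
]

# The value loaded into %rax before the indirect call.
call_target = 0x7FFFF7E3A750

_, call_offset, call_length = instructions[-1]
return_address = FUNC_START + call_offset + call_length

print(hex(return_address))        # 0x7ffff7e3a76a
print(call_target == FUNC_START)  # True
```

Under that assumption, the function unconditionally calls itself, and the return address lies just outside the range GDB attributes to the function, which is consistent with "No function contains specified address."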

@yuyichao
Contributor Author

For people who want to reproduce it locally, this is what I used.

for ((i = 0;i < 3000;i++)); do
echo $i; gdb --eval-command='set height 0' --eval-command='start' --eval-command='br segv_handler' --eval-command='continue' --eval-command='quit' --args ./julia -f test/sparsedir/sparse.jl
done

And test/sparsedir/sparse.jl is a stripped down version of the original file

yuyichao% cat test/sparsedir/sparse.jl| grep -v '^ *#' | grep -v '^ *$'
using Base.Test
A = sprandbool(5,5,0.2)
@test float(A) == float(full(A))

P.S. (Edit:) WARNING: it's very hard to get out of the (almost) infinite gdb loop. I ended up running killall gdb in a long loop while holding Ctrl-C down to break out of the shell loop....
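As an alternative to the shell loop, here is a minimal sketch of a driver that reruns a command until it first fails. It is not the script used in this thread: the name `run_until_failure`, the exit-status check, and the timeout are all assumptions, and it detects failure from the process exit code rather than from a GDB breakpoint. Because the loop runs in a single Python process, one Ctrl-C raises `KeyboardInterrupt` and ends the whole run.

```python
import subprocess
import sys


def run_until_failure(cmd, max_runs=3000, timeout=600):
    """Run cmd up to max_runs times.

    Returns the index of the first run that exits with a nonzero status
    or exceeds the timeout, or None if every run succeeds.
    """
    for i in range(max_runs):
        try:
            result = subprocess.run(
                cmd,
                timeout=timeout,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            # A hung run (for example, one stuck in a debugger) counts as a failure.
            return i
        if result.returncode != 0:
            return i
    return None


# Harmless demonstration with the Python interpreter itself:
failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
print(run_until_failure(failing, max_runs=5))  # 0
```

A hypothetical call for this issue would be `run_until_failure(["./julia", "-f", "test/sparsedir/sparse.jl"])`, assuming the crash surfaces as a nonzero exit status.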

@yuyichao
Contributor Author

I get this memory error

(gdb) p jl_function_ptr_by_llvm_name("julia_convert_21091")
Cannot access memory at address 0x7fffff7fef7f

when I try to print the LLVM IR of the function. I guess it's because there isn't any stack space left to run the function? Does anyone know if there's a way to tell GDB to use a new stack for it?

@yuyichao
Contributor Author

Another observation: the name of this function is very consistent (it changed when I updated my branch to the current master just now, but is the same otherwise). Also, if the error doesn't happen, there's no such function in memory....

@yuyichao
Contributor Author

OK. I added a check for when that bad function is emitted, and here's the backtrace.

rec_backtrace at /home/yuyichao/projects/julia/master/src/task.c:648
gdbbacktrace at /home/yuyichao/projects/julia/master/src/task.c:776
emit_function at /home/yuyichao/projects/julia/master/src/codegen.cpp:4044
to_function at /home/yuyichao/projects/julia/master/src/codegen.cpp:646
jl_compile at /home/yuyichao/projects/julia/master/src/codegen.cpp:807
jl_get_specialization at /home/yuyichao/projects/julia/master/src/gf.c:1423
emit_known_call at /home/yuyichao/projects/julia/master/src/codegen.cpp:1990
emit_call at /home/yuyichao/projects/julia/master/src/codegen.cpp:2563
emit_known_call at /home/yuyichao/projects/julia/master/src/codegen.cpp:2027
emit_call at /home/yuyichao/projects/julia/master/src/codegen.cpp:2563
emit_function at /home/yuyichao/projects/julia/master/src/codegen.cpp:4723
to_function at /home/yuyichao/projects/julia/master/src/codegen.cpp:646
jl_compile at /home/yuyichao/projects/julia/master/src/codegen.cpp:807
jl_get_specialization at /home/yuyichao/projects/julia/master/src/gf.c:1423
emit_known_call at /home/yuyichao/projects/julia/master/src/codegen.cpp:1990
emit_call at /home/yuyichao/projects/julia/master/src/codegen.cpp:2563
emit_assignment at /home/yuyichao/projects/julia/master/src/codegen.cpp:2899
emit_assignment at /home/yuyichao/projects/julia/master/src/codegen.cpp:2920
emit_function at /home/yuyichao/projects/julia/master/src/codegen.cpp:4636
to_function at /home/yuyichao/projects/julia/master/src/codegen.cpp:646
jl_compile at /home/yuyichao/projects/julia/master/src/codegen.cpp:807
jl_trampoline_compile_function at /home/yuyichao/projects/julia/master/src/builtins.c:952
jl_apply_generic at /home/yuyichao/projects/julia/master/src/gf.c:1650
anonymous at ./test.jl:89
do_test at ./test.jl:49
jl_apply_generic at /home/yuyichao/projects/julia/master/src/gf.c:1650
do_call at /home/yuyichao/projects/julia/master/src/interpreter.c:66
eval at /home/yuyichao/projects/julia/master/src/interpreter.c:212
jl_toplevel_eval_flex at /home/yuyichao/projects/julia/master/src/toplevel.c:540
jl_toplevel_eval_flex at /home/yuyichao/projects/julia/master/src/toplevel.c:569
jl_load at /home/yuyichao/projects/julia/master/src/toplevel.c:616
include at ./boot.jl:253
jl_apply_generic at /home/yuyichao/projects/julia/master/src/gf.c:1652
include_from_node1 at ./loading.jl:133
jl_apply_generic at /home/yuyichao/projects/julia/master/src/gf.c:1650
process_options at ./client.jl:312
_start at ./client.jl:404
unknown function (ip: -200032791)
jl_apply_generic at /home/yuyichao/projects/julia/master/src/gf.c:1652
unknown function (ip: 4200983)
unknown function (ip: 4199951)
__libc_start_main at /usr/lib/libc.so.6 (unknown line)
unknown function (ip: 4200041)
unknown function (ip: 0)

@yuyichao
Contributor Author

And here's the IR at the end of emit_function

; Function Attrs: noreturn
define void @julia_convert_21091(%jl_value_t*, i64) #0 {
top:
  %x = alloca i64, !dbg !14
  call void @llvm.dbg.declare(metadata i64* %x, metadata !15, metadata !17)
  store i64 %1, i64* %x, !dbg !14
  %2 = load i64* %x, !dbg !18
  call void @julia_convert_21091(%jl_value_t* inttoptr (i64 140728668487712 to %jl_value_t*), i64 %2), !dbg !18, !julia_type !20
  unreachable, !dbg !18

after_assert:                                     ; No predecessors!
  ret void, !dbg !18
}

It's an infinite recursion waiting to happen....

@yuyichao
Contributor Author

Here's the patch I used to catch this (the exact function name might be different).

diff --git a/src/codegen.cpp b/src/codegen.cpp
index a37dfe9..35cfcc7 100644
--- a/src/codegen.cpp
+++ b/src/codegen.cpp
@@ -100,6 +100,9 @@ extern "C" {

 #include "builtin_proto.h"

+void gdbbacktrace();
+void jl_breakpoint(jl_value_t *v);
+
 #ifdef HAVE_SSP
 extern uintptr_t __stack_chk_guard;
 extern void __stack_chk_fail();
@@ -4036,6 +4039,11 @@ static Function *emit_function(jl_lambda_info_t *lam)
 #endif
     funcName << "_" << globalUnique++;

+    if (funcName.str() == "julia_convert_21091") {
+        gdbbacktrace();
+        jl_breakpoint(NULL);
+    }
+
     if (specsig) { // assumes !va
         std::vector<Type*> fsig(0);
         for(size_t i=0; i < jl_nparams(lam->specTypes); i++) {
@@ -4757,6 +4765,12 @@ static Function *emit_function(jl_lambda_info_t *lam)

     JL_GC_POP();

+    if (funcName.str() == "julia_convert_21091") {
+        f->dump();
+        gdbbacktrace();
+        jl_breakpoint(NULL);
+    }
+
     return f;
 }

diff --git a/src/options.h b/src/options.h
index 43aba33..8723425 100644
--- a/src/options.h
+++ b/src/options.h
@@ -31,7 +31,7 @@
 // #define FORCE_ELF

 // with KEEP_BODIES, we keep LLVM function bodies around for later debugging
-// #define KEEP_BODIES
+#define KEEP_BODIES

 // GC options -----------------------------------------------------------------

@yuyichao
Contributor Author

I have 3 processes stopped at one breakpoint or the other right now. I will be out for a while, but I would love some advice on how I should figure out what's wrong.

@yuyichao
Contributor Author

The convert is emitted for

(gdb) p jl_(lam)
Base.Dates.convert(Type{Union()}, Int64)

@yuyichao
Contributor Author

Not sure how it gets called yet, but here's an easier way to trigger the stack overflow.

julia> convert(Union(), 1)
ERROR: StackOverflowError:
 in convert at ./dates/periods.jl:22

julia> @which convert(Union(), 1)
convert{T<:Base.Dates.Period}(::Type{T<:Base.Dates.Period}, x::Real) at dates/periods.jl:22
julia> @code_llvm convert(Union(), 1)

; Function Attrs: noreturn
define void @julia_convert_21201(%jl_value_t*, i64) #-1 {
top:
  call void @julia_convert_21201(%jl_value_t* inttoptr (i64 140728668487712 to %jl_value_t*), i64 %1)
  unreachable
}

julia> @code_typed convert(Union(), 1)
1-element Array{Any,1}:
 :($(Expr(:lambda, Any[symbol("#s323"),:x], Any[Any[Any[symbol("#s323"),Type{Union()},0],Any[:x,Int64,0]],Any[],Any[],Any[:T]], :(begin  # dates/periods.jl, line 22:
        return (top(typeassert))($(Expr(:call1, :convert, Union(), :(x::Int64))),Union())::Union()
    end::Union()))))

julia> @code_lowered convert(Union(), 1)
1-element Array{Any,1}:
 :($(Expr(:lambda, Any[symbol("#s323"),:x], Any[Any[Any[symbol("#s323"),:Any,0],Any[:x,:Any,0]],Any[],0,Any[:T]], :(begin  # dates/periods.jl, line 22:
        return T(x)
    end))))
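The lowered code returns `T(x)`, while the typed code shows that call resolved back to `convert(Union(), x)`. The shape of that cycle can be sketched outside Julia. The following is only an analogy in Python, assuming the constructor call falls back to `convert`; the names `construct`, `convert`, and `overflows` are hypothetical stand-ins, not Julia internals.

```python
def construct(T, x):
    # Stand-in for the constructor call T(x); assumed here to fall back
    # to convert(T, x) when T has no constructor of its own.
    return convert(T, x)


def convert(T, x):
    # Stand-in for the method at dates/periods.jl line 22, whose body is T(x).
    return construct(T, x)


def overflows(T, x):
    """Return True if convert(T, x) recurses until the interpreter gives up."""
    try:
        convert(T, x)
    except RecursionError:
        return True
    return False


print(overflows(None, 1))  # True
```

Neither function has a base case, so the only way out is the runtime's recursion limit, which corresponds to the StackOverflowError reported above.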

@yuyichao
Contributor Author

@JeffBezanson What's the long term (and short term) plan for these Union() issues? Will your next type system rework address these (I don't remember the issue number)?

@carnaval
Contributor

I'm curious how one ends up calling this method. I don't think there is an easy way to prevent those since, after all, everything is correct here. We could disallow matching tparams with Bottom in some circumstances but that feels rather hackish.

Anyway, it's not like we can prevent the language from expressing non termination after all ;-)

@yuyichao
Contributor Author

Here's a better backtrace (the gdb one instead of the julia one)

Edit: forgot the link....

@yuyichao
Contributor Author

And it happens during this:

(gdb) p jl_(0x7ffdf55539d0)
Expr(:call, :getindex, A::Base.SparseMatrix.SparseMatrixCSC{Union(), Int64}, Expr(:call, :getfield, GenSym(18), 1)::Int64, Expr(:call, :getfield, GenSym(18), 2)::Int64)::Union()
$2 = void

@yuyichao
Contributor Author

Which happens in this function

(gdb) p jl_(((jl_function_t*)0x7ffdf4fc5f30)->linfo)
Base.==(Base.SparseMatrix.SparseMatrixCSC{Union(), Int64}, Array{Float64, 2})
$7 = void

@carnaval
Contributor

well you almost have it. So this Sparse{Union(),Int} is made somewhere, or perhaps it's an inference bug which infers this type for a Sparse{Float,Int}. If you have the address of the first argument you could check if the type tag of this matrix is really this type. If it is then you'll have to find out where it was made :-)

@yuyichao
Contributor Author

Yes, that is what I'm doing (although I'm in the middle of something else so it will be slow...)

@yuyichao
Contributor Author

(but at least this is directly called from jl_apply_generic so probably not caused by type inference)

@yuyichao
Contributor Author

(gdb) p jl_(((jl_value_t**)0x7fffffffcc60)[0])
Base.SparseMatrix.SparseMatrixCSC{Union(), Int64}(m=5, n=5, colptr=Array{Int64, 1}[1, 1, 1, 1, 1, 1], rowval=Array{Int64, 1}[], nzval=Array{Union(), 1}[])
$10 = void

@carnaval How do I check the tag?

@yuyichao
Contributor Author

Like this?

(gdb) p jl_(jl_typeof(((jl_value_t**)0x7fffffffcc60)[0]))
Base.SparseMatrix.SparseMatrixCSC{Union(), Int64}
$12 = void

@yuyichao
Contributor Author

The function that calls the jl_apply_generic on the Union() matrix https://gist.github.com/yuyichao/19db7f32614c8f6979ee

@carnaval
Contributor

Yes, but if it was in apply_generic then, as you said, the tags are read here anyway. There aren't thousands of places where we could create a Union() array. I would say an untyped comprehension gone wrong? That could be PTSD speaking, but it could be either a bug in type_goto or something which ended up looking like [throw() for i in []]

@carnaval
Contributor

Hum. Rereading the test, this is probably happening only when the sparse matrix is empty, right? :)

@carnaval
Contributor

Here is the problem

julia> typeof(float(Bool[]))
Array{Union(),1}
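One way to see why an empty input can end up with the bottom element type is to model a map whose element type is derived only from the results it produced. The following is an analogy in Python, not a description of Julia's inference or of `map_promote`; `eltype_of`, `map_inferred`, and `map_declared` are hypothetical names, and `None` stands in for `Union()`.

```python
def eltype_of(results):
    """Join of the result types; None plays the role of Bottom (Union())."""
    types = {type(r) for r in results}
    if not types:
        return None  # no results, so nothing constrains the element type
    if len(types) == 1:
        return next(iter(types))
    return object  # mixed results widen to the top type


def map_inferred(f, xs):
    # The element type comes only from the values that were actually produced.
    results = [f(x) for x in xs]
    return eltype_of(results), results


def map_declared(f, xs, eltype):
    # The caller states the element type up front, so an empty input is harmless.
    results = [f(x) for x in xs]
    return eltype, results


print(map_inferred(float, [True, False])[0] is float)  # True
print(map_inferred(float, [])[0])                      # None
print(map_declared(float, [], float)[0] is float)      # True
```

In this model the fix is to carry the intended element type independently of the data, which is the general direction of the special case discussed in the following comments.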

@carnaval
Contributor

Ok so map_promote(f,::AbstractArray) probably needs a special case for f :: Type the same way map does IIRC. Or float() should do something else entirely but I'm not very versed in the art of our array hierarchy. Tentatively @timholy @mbauman ?

@yuyichao
Contributor Author

I think I've seen it happen when the matrix is not empty but I'm not 100% sure

@yuyichao
Contributor Author

And I think it's happening way too often (1% for me) if it only happens with an empty matrix.

@mbauman
Sponsor Member

mbauman commented Jun 10, 2015

Hm, I'm unfamiliar with map_promote. It looks like it's only used by complex and float — and it's not passing a type but rather a local generic function for convert(FloatingPoint, x).

This would be a potential downside to returning Bottom on all mapping operations, cf. #7258.

@carnaval
Contributor

Well if you run the test 300 times with 0.2 density and 25 elements you have about 70% proba of having it be empty at least once right ?

Anyway, I think the fact that the empty sparse matrix reproduces the bug is quite telling, but if you manage to have this happen for a non empty matrix it's much more worrying.
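The estimate above can be checked numerically. This is a minimal sketch, assuming each of the 25 entries of the 5x5 matrix is nonzero independently with probability 0.2, which is a simplification of whatever `sprandbool` actually does.

```python
density = 0.2    # per-entry probability of a stored value (from the test)
n_entries = 25   # 5 x 5 matrix
runs = 300       # number of repetitions considered in the comment above

# Probability that a single matrix has no stored entries at all.
p_empty = (1 - density) ** n_entries

# Probability of seeing at least one empty matrix over all runs.
p_any_empty = 1 - (1 - p_empty) ** runs

print(f"P(empty in one run)      = {p_empty:.4f}")
print(f"P(empty at least once)   = {p_any_empty:.2f}")
```

Under that assumption a single run produces an empty matrix roughly 0.4% of the time, and 300 runs hit at least one empty matrix with a probability of roughly 68%, in line with the "about 70%" figure.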

@mbauman
Sponsor Member

mbauman commented Jun 10, 2015

With callable types, it could probably be totally replaced with regular map(FloatingPoint, A).

@carnaval
Contributor

@mbauman yep, but if we go this way we would want to make the std lib more robust to Bottom arrays. I don't think it's that bad but it's definitely something to keep in mind.

@tkelman
Contributor

tkelman commented Jun 10, 2015

I don't know what's going on here, but wanted to say this is epic debugging work @yuyichao, keep it up!

@jakebolewski
Member

Isn't this just another instance of not rigorously checking and special casing the empty array case?

@yuyichao
Contributor Author

> Well if you run the test 300 times with 0.2 density and 25 elements you have about 70% proba of having it be empty at least once right ?

I remember seeing it non-empty when I actually print out the matrix constructed but I'm not 100% sure. Let me try again with printing.

@yuyichao
Contributor Author

> I remember seeing it non-empty when I actually print out the matrix constructed but I'm not 100% sure. Let me try again with printing.

Although it is certainly possible that I was confused by the output in a loop since an empty matrix won't print anything?

@yuyichao
Contributor Author

@carnaval you are totally right about the empty matrix. I think I was just confused yesterday because the empty sparse matrix prints nothing.... (which IMHO should be fixed...)

So I guess the solution is to make sure nzval has the right type by fixing map_promote?

@carnaval
Contributor

yep @jakebolewski has a patch on the way. You're right that print(empty_sparse) also deserves a fix.

@yuyichao
Contributor Author

Ref #11661

@yuyichao
Contributor Author

> yep @jakebolewski has a patch on the way.

Good to know. Thanks!

@jakebolewski
Member

not really a fix, more of a band-aid

@JeffBezanson
Sponsor Member

Should be fixed by #11663

JeffBezanson added a commit that referenced this issue Jun 17, 2015
fcard pushed a commit to fcard/julia that referenced this issue Jul 8, 2015
fcard pushed a commit to fcard/julia that referenced this issue Jul 8, 2015