LU decomposition crashes Windows for 35x35 matrices #2124

Closed
andreasnoack opened this issue Jan 25, 2013 · 39 comments
Labels
status:priority This should be addressed urgently system:windows Affects only Windows
Comments

@andreasnoack
Member

(edit: now named lufact(rand(33,33)))

julia>lud(rand(33)) ok!

but (see also #1543)

julia>lud(rand(34))
0x065D2EE3 (0x0062F810 0x00000000 0x004FC9F8 0x0EBD01C0), LAPACKE_csyr_work() + 0xC579E3 bytes(s)
0x05ACFCC6 (0x0062F810 0x00000000 0x00562E78 0x0EBD01C0), LAPACKE_csyr_work() + 0x1547C6 bytes(s)
0x05ACFCC6 (0x0062F810 0x00000000 0x005C92F8 0x0EBD01C0), LAPACKE_csyr_work() + 0x1547C6 bytes(s)
0x05ACFCC6 (0x0062F810 0x00000000 0x0062F778 0x0EBD01C0), LAPACKE_csyr_work() + 0x1547C6 bytes(s)
0x05ACFCC6 (0x0062F810 0x00000000 0x00000000 0x0EBD01C0), LAPACKE_csyr_work() + 0x1547C6 bytes(s)
0x05346996 (0x04309F68 0x0062F914 0x00000001 0x0FD03900), dgetrf_() + 0x146 bytes(s)
0x02DDD8A3 (0x04309F28 0x0062F94C 0x00000001 0x00000001) <unknown module>
0x02DDD809 (0x042896E0 0x0062F9CC 0x00000001 0x6F411541) <unknown module>
0x6F40B42D (0x037864C8 0x0062F9CC 0x00000001 0x6F411858), jl_apply_generic() + 0x5D bytes(s)
0x6F43D684 (0x00000000 0x00000000 0x0062FA58 0x6F446FBB), jl_dump_function() + 0xF64 bytes(s)
0x6F43D07B (0x0FCFE178 0x02543050 0x00000004 0x04099D30), jl_dump_function() + 0x95B bytes(s)
0x6F4475CB (0x0FCFE0E8 0x0062FB20 0x00000002 0x025ADE20), jl_uncompress_ast() + 0x1AEB bytes(s)
0x6F40F08B (0x037DBB20 0x0062FC40 0x00000002 0x0253CFE8), jl_enter_handler() + 0x18B bytes(s)
0x02DDCBC9 (0x0FCFE0E8 0x00000001 0x0062FCB8 0x6F40B42D) <unknown module>
0x02DDC99E (0x033D1F80 0x0062FCF4 0x00000002 0x0349E8B0) <unknown module>
0x6F40B42D (0x033D1F30 0x0062FCF4 0x00000002 0x02DA28FF), jl_apply_generic() + 0x5D bytes(s)
0x02DA0C8E (0x03786E8C 0x025554F8 0x00000001 0x02F44FD8) <unknown module>
0x02DA08BE (0x00000000 0x00000000 0x0062FE28 0x6F40B42D) <unknown module>
0x02DA01CB (0x03551F40 0x00000000 0x00000000 0x00000000) <unknown module>
0x6F40B42D (0x03551EF0 0x00000000 0x00000000 0x77BC6C74), jl_apply_generic() + 0x5D bytes(s)
0x00401888 (0x00000000 0x00BE6C8C 0x0062FFE0 0x00000004)
0x6F44068F (0x00000000 0x00BE6C8C 0x00401760 0x00BE6C20), julia_trampoline() + 0x4F bytes(s)
0x00404755 (0x00000000 0x00BE6C8C 0x00BE3B90 0x00000000), jl_readBuffer() + 0x2685 bytes(s)
0x004013EA (0x00000000 0x00000000 0x7EFDF000 0xC00000FD)
0x7D4E7D42 (0x004014D0 0x00000000 0x000000AA 0x0000000C), BaseProcessInitPostImport() + 0x8D bytes(s)
@ViralBShah
Member

Cc: @xianyi

@Keno
Member

Keno commented Jan 25, 2013

Might be a stack issue. Wouldn't be surprised.

@ViralBShah
Member

Is it an openblas bug, or something in our windows port?

@Keno
Member

Keno commented Jan 25, 2013

Can't tell yet. I can give you a windows VM if you want to debug.

@andreasnoack
Member Author

I should add that this seems to be limited to older Windows machines. The example is from Windows Server 2003.

@Keno
Member

Keno commented Jan 25, 2013

Don't have a test machine for that yet. Will set up a couple of VMs for it on julia.mit.edu

@andreasnoack
Member Author

I am able to get the same crash on a Windows Server 2008 but to do so I need a 65x65 matrix. I cannot crash Julia on Windows 8.

@xianyi

xianyi commented Jan 25, 2013

What's your compiler version? GCC 4.7?

Xianyi

@alanedelman
Contributor

+1
Running on Vista, lu(randn(33,33)) is OK, but lu(randn(34,34)) breaks.

@ViralBShah
Member

@xianyi Is there a way to get openblas to work reliably on Windows? Any specific compiler versions that you recommend?

@vtjnash Do you think we can build julia 0.1 with ATLAS as a backup, until some of these things are sorted out?

@vtjnash
Sponsor Member

vtjnash commented Feb 11, 2013

can someone tell me how to run the equivalent of lud(rand(33)) on the current version of julia? i'll bundle something that works when I make the 0.1 binaries for windows. however, this shouldn't be critical for the ubuntu release.

@ViralBShah
Member

lufact(rand(33,33))

@xianyi

xianyi commented Feb 12, 2013

Hi @ViralBShah ,

We are in Chinese New Year holiday. I think we can address this issue next week.

Xianyi

@ViralBShah
Member

Ok. Have fun! Let me know if I should file this as an issue on openblas.

@ViralBShah
Member

I think we should ship the Windows version with Reference BLAS if we can't get ATLAS working in the meantime, until OpenBLAS can be stabilized.

@ViralBShah
Member

@andreasnoackjensen We should probably add some of these windows crashes as tests in test/linalg.jl once we resolve them.

@xianyi

xianyi commented Feb 12, 2013

Hi all,

I don't know why it calls csyr in lufact(rand(33,33)). I think it is a double-precision real matrix.

I just uploaded a simple dgetrf sample to gist https://gist.github.com/xianyi/4771129
It works fine with OpenBLAS develop branch (gcc-4.7) on my Win7 64-bit box.

Xianyi

@andreasnoack
Member Author

No, it is not obvious why csyr is called. However, the problem again seems to be related to multithreading. If I set the number of threads to one, I don't get the error.

@xianyi

xianyi commented Feb 13, 2013

Hi @andreasnoackjensen ,

Is it 32 bit or 64 bit? Could you try OpenBLAS develop branch?

Could you try my dgetrf test https://gist.github.com/xianyi/4771129 ?

Thank you

Xianyi

@andreasnoack
Member Author

Hi @xianyi,

It was on a Windows Server 2008 64 bit machine, but I don't know much about the Windows build of Julia. Therefore I cannot try a build with the develop branch. Maybe @loladiro and @vtjnash can help here. I'll see if I can run your example, but I don't have access to a Windows machine with privileges to install programs.

@vtjnash
Sponsor Member

vtjnash commented Feb 13, 2013

i added comments to xianyi's gist.

current workaround for julia may be to add set OPENBLAS_NUM_THREADS=1 (the batch-file form of export OPENBLAS_NUM_THREADS=1) to prepare-julia-env.bat
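A minimal sketch of this workaround in a POSIX-style shell, such as the MSYS shell that comes with the MinGW toolchains used in this thread. Only the variable name `OPENBLAS_NUM_THREADS` comes from the thread. Launching Julia from the same shell afterwards is an assumption about how the binary is started.

```shell
# Limit OpenBLAS to a single thread for every process started from this shell.
export OPENBLAS_NUM_THREADS=1

# Show the value that child processes will inherit.
echo "OPENBLAS_NUM_THREADS=${OPENBLAS_NUM_THREADS}"

# Assumption: Julia is then started from this same shell, so that
# lufact(rand(34,34)) can be retried with the single-thread setting.
```

The setting only affects processes that inherit this environment, so it has to be in place before Julia loads libopenblas.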

@vtjnash
Sponsor Member

vtjnash commented Feb 13, 2013

@xianyi I've narrowed this down to the stack being corrupted by the line in your gist:
LAPACK_dgetrf(&N, &N, m, &LDA,ipiv, &info);
somewhere in _zpotrf. the apparent stack trace is

Program received signal SIGSEGV, Segmentation fault.
0x6d7e9243 in zupmtr_ () from c:\users\jameson\desktop\julia-64966d6e8c\libopenblas.dll
(gdb) bt
#0  0x6d7e9243 in zupmtr_ () from c:\users\jameson\desktop\julia-64966d6e8c\libopenblas.dll
#1  0x6cc2c9f6 in zupmtr_ () from c:\users\jameson\desktop\julia-64966d6e8c\libopenblas.dll
#2  0x6cc2c9f6 in zupmtr_ () from c:\users\jameson\desktop\julia-64966d6e8c\libopenblas.dll
#3  0x6cc2c9f6 in zupmtr_ () from c:\users\jameson\desktop\julia-64966d6e8c\libopenblas.dll
#4  0x6cc2c9f6 in zupmtr_ () from c:\users\jameson\desktop\julia-64966d6e8c\libopenblas.dll
#5  0x6c4d6996 in libopenblas!DLANSB () from c:\users\jameson\desktop\julia-64966d6e8c\libopenblas.dll
#6  0x0028fdf0 in ?? ()
#7  0x004013fa in __tmainCRTStartup ()
#8  0x749033aa in KERNEL32!BaseCleanupAppcompatCacheSupport () from C:\Windows\syswow64\kernel32.dll
#9  0x0028ffd4 in ?? ()
#10 0x77149ef2 in ntdll!RtlpNtSetValueKey () from C:\Windows\system32\ntdll.dll
#11 0x7efde000 in ?? ()
#12 0x77149ec5 in ntdll!RtlpNtSetValueKey () from C:\Windows\system32\ntdll.dll
#13 0x004014e0 in WinMainCRTStartup ()
#14 0x7efde000 in ?? ()
#15 0x00000000 in ?? ()
0x6d7e9243 in zupmtr_ () from c:\users\jameson\desktop\julia-
(gdb) info reg
eax            0x3440   13376
ecx            0x92b80  600960
edx            0x8      8
ebx            0x8      8
esp            0xf6b74  0xf6b74
ebp            0xf6bb8  0xf6bb8
esi            0x28fdf0 2686448
edi            0xffffc000       -16384
eip            0x6d7e9243       0x6d7e9243 <zupmtr_+13701139>
eflags         0x10202  [ IF RF ]
cs             0x23     35
ss             0x2b     43
ds             0x2b     43
es             0x2b     43
fs             0x53     83
gs             0x2b     43

@ViralBShah
Member

Does this happen only in LU, or does it happen for other decompositions too?

@andreasnoack
Member Author

I have tested the other factorizations and the problem seems to be for LU only. However, that includes the solution of a general linear system which also crashes Julia.

@ViralBShah
Member

@vtjnash Let's set the number of threads to 1 on Windows if that will solve the immediate release issue.

@xianyi

xianyi commented Feb 13, 2013

@vtjnash ,

I also added a comment to my gist.
You narrowed this issue down to the dgetrf function. Did you include cblas.h and lapacke.h?

Xianyi

@ViralBShah
Member

CBLAS does get linked into the openblas used by julia.

@StefanKarpinski
Sponsor Member

Bumping to post 0.1.

@ViralBShah
Member

@xianyi Would it be possible to fix this in a few days? If so, we can build julia windows binaries with openblas now that we have released 0.1.

@xianyi

xianyi commented Feb 14, 2013

@zchothia Could you investigate this issue? Thank you.

@xianyi

xianyi commented Feb 27, 2013

Hi @vtjnash ,

I read your comments in my gist. However, when I built OpenBLAS on Linux and ran test_dgetrf on Windows, I didn't hit the segfault.

What are the i686-w64-mingw32-gcc version on Linux and the gcc version on Windows?

Thank you

Xianyi

@vtjnash
Sponsor Member

vtjnash commented Feb 27, 2013

$ i686-w64-mingw32-gcc --version
i686-w64-mingw32-gcc (GCC) 4.6.3
Copyright (C) 2011 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

built with max OPENBLAS_NUM_THREADS of 80

tested with

$ /c/MinGW64/bin/gcc --version
gcc.exe (Built by MinGW-builds project) 4.7.2
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

and

$ gcc --version
gcc.exe (GCC) 4.6.1
Copyright (C) 2011 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

and (which is really the same as the first one):

$ /c/MinGW64/bin/x86_64-w64-mingw32-gcc --version -m32
x86_64-w64-mingw32-gcc.exe (Built by MinGW-builds project) 4.7.2
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Oh, and my machine is a VMware instance with 2 processors (sometimes 4) running on a Core i7 2620m with 4 processors (all x86_64 / 64-bit).

are any of these make flags for openblas potentially at fault (or insufficient)?

make CC="i686-w64-mingw32-gcc" FC="i686-w64-mingw32-gfortran" RANLIB="i686-w64-mingw32-ranlib" \
CFLAGS="-g" FFLAGS="-g -O2 " USE_THREAD=1 TARGET= DYNAMIC_ARCH=1 OSNAME=WINNT \
CROSS=1 BINARY=32

@xianyi

xianyi commented Mar 1, 2013

Your i686-w64-mingw32-gcc is version 4.6. Did you use gcc 4.6 on Windows? I remember that 4.6 and 4.7 have different calling conventions on Windows.

Xianyi

@vtjnash
Sponsor Member

vtjnash commented Mar 1, 2013

IIUC, it appears that only the calling convention for C++11 changed: https://gcc.gnu.org/gcc-4.7/changes.html.
I tried all three compilers mentioned above (4.6.1-i386, 4.7.2-i386, 4.7.2-x86_64).
I am putting together a Virtual Machine for more testing.

@xianyi

xianyi commented Mar 1, 2013

Hi @vtjnash ,

Please give me access to the VM. I cannot reproduce this bug on my machine :(

Xianyi

@vtjnash
Sponsor Member

vtjnash commented Mar 1, 2013

@xianyi I haven't started it yet (I think I need to find my windows install disk). However, I just identified the problem as stack overflow. The default stack on windows is 1MB, increasing it to 16MB fixes the problem (-Wl,--stack,16777216). Any idea what a good size would be and why this was a problem? (default stack on linux is 8MB, IIRC)
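The value in that flag is 16 MiB expressed in bytes. A short shell sketch of the arithmetic and of how the flag might be assembled. The variable names and the commented link line (`test_dgetrf.c`, `-lopenblas`) are illustrative assumptions, not the actual build commands used here.

```shell
# 16 MiB in bytes: the reserve size passed to the linker's --stack option.
stack_mib=16
stack_bytes=$((stack_mib * 1024 * 1024))
echo "$stack_bytes"

# Assemble the flag in the form quoted above (passed through gcc to ld).
stack_flag="-Wl,--stack,${stack_bytes}"
echo "$stack_flag"

# Hypothetical link line for a small test program:
#   i686-w64-mingw32-gcc test_dgetrf.c -lopenblas "$stack_flag" -o test_dgetrf.exe
```

Changing `stack_mib` to 8 would give the 8MB figure mentioned for Linux, which makes it easy to try several sizes when looking for the smallest one that avoids the overflow.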

@JeffBezanson
Sponsor Member

Julia itself can use quite a bit of stack space; can we bump the default to 8MB on windows (if that's enough to fix this)?

@vtjnash
Sponsor Member

vtjnash commented Mar 2, 2013

16MB was enough to bump the max number of openblas threads up to somewhere between 10 and 60, then we run into some other segfault (which appears to be caused by a null pointer)

@vtjnash
Sponsor Member

vtjnash commented Mar 2, 2013

note: fixing #1971 converted this segfault into a julia stack overflow exception for OPENBLAS_NUM_THREADS<30 (or so); above that it turns into a MemoryError (or an openblas/lapack crash?)
