Crashing with @everywhere using #12381

Closed
sbromberger opened this issue Jul 30, 2015 · 43 comments · Fixed by #22588
Labels
compiler:precompilation Precompilation of modules domain:parallelism Parallel or distributed computation kind:bug Indicates an unexpected problem or unintended behavior kind:regression Regression in behavior compared to a previous version

Comments

@sbromberger
Contributor

Per https://groups.google.com/d/msg/julia-users/FjGXSTzvfmc/j0ZDG629IwAJ

seth@schroeder:~/dev/julia/wip/LightGraphs.jl$  julia -p 4
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+6371 (2015-07-29 17:45 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 6fafbbc (0 days old master)
|__/                   |  x86_64-apple-darwin14.4.0

julia> @everywhere using LightGraphs
WARNING: replacing module LightGraphs
WARNING: replacing module LightGraphs
WARNING: replacing module LightGraphs
WARNING: replacing module LightGraphs
exception on 4:
signal (11): Segmentation fault: 11
ptrhash_peek_bp at /Users/seth/dev/julia/julia/src/support/ptrhash.c:26
jl_get_binding_ at /Users/seth/dev/julia/julia/src/module.c:172
jl_get_binding at /Users/seth/dev/julia/julia/src/module.c:406
eval at /Users/seth/dev/julia/julia/src/interpreter.c:119
fl_invoke_julia_macro at /Users/seth/dev/julia/julia/src/ast.c:76
apply_cl at /Users/seth/dev/julia/julia/src/flisp/flisp.c:1276
_applyn at /Users/seth/dev/julia/julia/src/flisp/flisp.c:729
fl_map1 at /Users/seth/dev/julia/julia/src/flisp/flisp.c:2269
apply_cl at /Users/seth/dev/julia/julia/src/flisp/flisp.c:1226
_applyn at /Users/seth/dev/julia/julia/src/flisp/flisp.c:729
fl_map1 at /Users/seth/dev/julia/julia/src/flisp/flisp.c:2269
apply_cl at /Users/seth/dev/julia/julia/src/flisp/flisp.c:1226
do_trycatch at /Users/seth/dev/julia/julia/src/flisp/flisp.c:950
apply_cl at /Users/seth/dev/julia/julia/src/flisp/flisp.c:1856
_applyn at /Users/seth/dev/julia/julia/src/flisp/flisp.c:729
fl_applyn at /Users/seth/dev/julia/julia/src/flisp/flisp.c:774
jl_expand at /Users/seth/dev/julia/julia/src/ast.c:581
jl_toplevel_eval_flex at /Users/seth/dev/julia/julia/src/toplevel.c:486
jl_eval_module_expr at /Users/seth/dev/julia/julia/src/toplevel.c:166
jl_parse_eval_all at /Users/seth/dev/julia/julia/src/toplevel.c:574
jl_load_file_string at /Users/seth/dev/julia/julia/src/ast.c:573
include_string at loading.jl:158
jl_apply at /Users/seth/dev/julia/julia/src/gf.c:1658
include_from_node1 at /usr/local/julia-latest/lib/julia/sys.dylib (unknown line)
jl_apply at /Users/seth/dev/julia/julia/src/interpreter.c:55
eval at /Users/seth/dev/julia/julia/src/interpreter.c:212
jl_toplevel_eval_flex at /Users/seth/dev/julia/julia/src/toplevel.c:524
jl_toplevel_eval_in at /Users/seth/dev/julia/julia/src/builtins.c:552
eval at sysimg.jl:14
anonymous at multi.jl:1303
jl_apply at /Users/seth/dev/julia/julia/src/./julia.h:1262
anonymous at multi.jl:877
run_work_thunk at multi.jl:619
run_work_thunk at multi.jl:628
jlcall_run_work_thunk_21133 at  (unknown line)
jl_apply at /Users/seth/dev/julia/julia/src/gf.c:1658
anonymous at task.jl:11
jl_apply at /Users/seth/dev/julia/julia/src/task.c:233

signal (11): Segmentation fault: 11
ptrhash_peek_bp at /Users/seth/dev/julia/julia/src/support/ptrhash.c:26
jl_get_binding_ at /Users/seth/dev/julia/julia/src/module.c:172
jl_get_binding at /Users/seth/dev/julia/julia/src/module.c:406
eval at /Users/seth/dev/julia/julia/src/interpreter.c:119
fl_invoke_julia_macro at /Users/seth/dev/julia/julia/src/ast.c:76
apply_cl at /Users/seth/dev/julia/julia/src/flisp/flisp.c:1276
_applyn at /Users/seth/dev/julia/julia/src/flisp/flisp.c:729
fl_map1 at /Users/seth/dev/julia/julia/src/flisp/flisp.c:2269
apply_cl at /Users/seth/dev/julia/julia/src/flisp/flisp.c:1226
_applyn at /Users/seth/dev/julia/julia/src/flisp/flisp.c:729
fl_map1 at /Users/seth/dev/julia/julia/src/flisp/flisp.c:2269
apply_cl at /Users/seth/dev/julia/julia/src/flisp/flisp.c:1226
do_trycatch at /Users/seth/dev/julia/julia/src/flisp/flisp.c:950
apply_cl at /Users/seth/dev/julia/julia/src/flisp/flisp.c:1856
_applyn at /Users/seth/dev/julia/julia/src/flisp/flisp.c:729
fl_applyn at /Users/seth/dev/julia/julia/src/flisp/flisp.c:774
jl_expand at /Users/seth/dev/julia/julia/src/ast.c:581
jl_toplevel_eval_flex at /Users/seth/dev/julia/julia/src/toplevel.c:486
jl_eval_module_expr at /Users/seth/dev/julia/julia/src/toplevel.c:166
jl_parse_eval_all at /Users/seth/dev/julia/julia/src/toplevel.c:574
jl_load_file_string at /Users/seth/dev/julia/julia/src/ast.c:573
include_string at loading.jl:158
jl_apply at /Users/seth/dev/julia/julia/src/gf.c:1658
include_from_node1 at /usr/local/julia-latest/lib/julia/sys.dylib (unknown line)
jl_apply at /Users/seth/dev/julia/julia/src/interpreter.c:55
eval at /Users/seth/dev/julia/julia/src/interpreter.c:212
jl_toplevel_eval_flex at /Users/seth/dev/julia/julia/src/toplevel.c:524
jl_toplevel_eval_in at /Users/seth/dev/julia/julia/src/builtins.c:552
eval at sysimg.jl:14
jl_apply at /Users/seth/dev/julia/julia/src/gf.c:1658
anonymous at multi.jl:1303
jl_apply at /Users/seth/dev/julia/julia/src/./julia.h:1262
anonymous at multi.jl:877
run_work_thunk at multi.jl:619
run_work_thunk at multi.jl:628
jlcall_run_work_thunk_21054 at  (unknown line)
jl_apply at /Users/seth/dev/julia/julia/src/gf.c:1658
anonymous at task.jl:11
jl_apply at /Users/seth/dev/julia/julia/src/task.c:233
ERROR: LoadError: ReadOnlyMemoryError()
 in include_string at loading.jl:158
 in include_from_node1 at /usr/local/julia-latest/lib/julia/sys.dylib
 in eval at sysimg.jl:14
 in anonymous at multi.jl:1303
 in anonymous at multi.jl:877
 in run_work_thunk at multi.jl:619
 in run_work_thunk at multi.jl:628
 in anonymous at task.jl:11
while loading /Users/seth/.julia/v0.4/LightGraphs/src/LightGraphs.jl, in expression starting on line 6
exception on 3: ERROR: LoadError: ReadOnlyMemoryError()
 in include_string at loading.jl:158
 in include_from_node1 at /usr/local/julia-latest/lib/julia/sys.dylib
 in eval at sysimg.jl:14
 in anonymous at multi.jl:1303
 in anonymous at multi.jl:877
 in run_work_thunk at multi.jl:619
 in run_work_thunk at multi.jl:628
 in anonymous at task.jl:11
while loading /Users/seth/.julia/v0.4/LightGraphs/src/LightGraphs.jl, in expression starting on line 6
Worker 5 terminated.
ERROR (unhandled task failure): ProcessExitedException()
 in yieldto at /usr/local/julia-latest/lib/julia/sys.dylib
 in wait at /usr/local/julia-latest/lib/julia/sys.dylib (repeats 2 times)
 in wait_full at /usr/local/julia-latest/lib/julia/sys.dylib
 in remotecall_fetch at multi.jl:702
 in remotecall_fetch at multi.jl:707
 in anonymous at task.jl:369
ERROR (unhandled task failure): EOFError: read end of file
 in yieldto at /usr/local/julia-latest/lib/julia/sys.dylib
 in wait at /usr/local/julia-latest/lib/julia/sys.dylib (repeats 2 times)
 in wait_full at /usr/local/julia-latest/lib/julia/sys.dylib
 in remotecall_fetch at multi.jl:702
 in remotecall_fetch at multi.jl:707
 in anonymous at task.jl:369
Worker 2 terminated.exception on ERROR (unhandled task failure): ProcessExitedException()
 in yieldto at /usr/local/julia-latest/lib/julia/sys.dylib
 in wait at /usr/local/julia-latest/lib/julia/sys.dylib (repeats 2 times)
 in wait_full at /usr/local/julia-latest/lib/julia/sys.dylib
 in remotecall_fetch at multi.jl:702
 in remotecall_fetch at multi.jl:707
 in anonymous at task.jl:369

1: ERROR (unhandled task failure): EOFError: read end of file
 in yieldto at /usr/local/julia-latest/lib/julia/sys.dylib
 in wait at /usr/local/julia-latest/lib/julia/sys.dylib (repeats 2 times)
 in wait_full at /usr/local/julia-latest/lib/julia/sys.dylib
 in remotecall_fetch at multi.jl:702
 in call_on_owner at /usr/local/julia-latest/lib/julia/sys.dylib
 in wait at /usr/local/julia-latest/lib/julia/sys.dylib
 in require at /usr/local/julia-latest/lib/julia/sys.dylib
 in eval at sysimg.jl:14
 in anonymous at multi.jl:1324
 in run_work_thunk at multi.jl:619
 in remotecall_fetch at multi.jl:692
 in remotecall_fetch at multi.jl:707
 in anonymous at task.jl:369
ERROR: ProcessExitedException()
 in yieldto at /usr/local/julia-latest/lib/julia/sys.dylib
 in wait at /usr/local/julia-latest/lib/julia/sys.dylib (repeats 2 times)
 in wait_full at /usr/local/julia-latest/lib/julia/sys.dylib
 in remotecall_fetch at multi.jl:702
 in call_on_owner at /usr/local/julia-latest/lib/julia/sys.dylib
 in wait at /usr/local/julia-latest/lib/julia/sys.dylib
 in require at /usr/local/julia-latest/lib/julia/sys.dylib
 in eval at sysimg.jl:14
 in anonymous at multi.jl:1324
 in run_work_thunk at multi.jl:619
 in remotecall_fetch at multi.jl:692
 in remotecall_fetch at multi.jl:707
 in anonymous at task.jl:369
ERROR: ProcessExitedException()
 in wait at /usr/local/julia-latest/lib/julia/sys.dylib
 in sync_end at /usr/local/julia-latest/lib/julia/sys.dylib
 in anonymous at multi.jl:348

This works on

0.4.0-dev+5008 (2015-05-26 16:08 UTC) Commit 0855ec9

Next step: git bisect. Stand by.

@sbromberger
Contributor Author

... and bisect is broken:

error: Your local changes to the following files would be overwritten by checkout:
    CMakeLists.txt
    src/openssl_stream.c
Please, commit your changes or stash them before you can switch branches.
Aborting
make[1]: *** [libgit2/CMakeLists.txt] Error 1
make: *** [julia-deps] Error 2

Will try some brute-force.

@sbromberger
Contributor Author

No crashes, but errors here:

 | | |_| | | | (_| |  |  Version 0.4.0-dev+6033 (2015-07-17 02:56 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit efcc709 (12 days old master)
|__/                   |  x86_64-apple-darwin14.4.0

julia> @everywhere using LightGraphs
exception on 1: exception on 3: exception on 5: exception on 4: exception on 2: ERROR: MethodError: `base_include` has no method matching base_include(::UTF8String, ::ASCIIString, ::Tuple{Int64,ASCIIString})
Closest candidates are:
  base_include(::Nullable{Union{UTF8String,ASCIIString}}, ::AbstractString, ::Any)
  base_include(::AbstractString, ::Any)
  base_include(::Nullable{Union{UTF8String,ASCIIString}}, ::AbstractString)
  ...
 in eval at sysimg.jl:14
 in anonymous at multi.jl:1297
 in run_work_thunk at multi.jl:584
 in run_work_thunk at multi.jl:593
 in anonymous at task.jl:8

I will assume that this is a separate issue and will attempt to pinpoint the version that crashes.

@jakebolewski
Member

did you precompile this package?

@sbromberger
Contributor Author

@jakebolewski at one time, yes, but ~/.julia/lib/v0.4 is currently empty.

@jakebolewski jakebolewski added kind:bug Indicates an unexpected problem or unintended behavior domain:parallelism Parallel or distributed computation labels Jul 30, 2015
@sbromberger
Contributor Author

According to bisect results,

e8a1c7440f47707be3329775fac91f0c4bf9c27d is the first bad commit
commit e8a1c7440f47707be3329775fac91f0c4bf9c27d
Author: Tim Holy <[email protected]>
Date:   Sat Jul 11 10:13:49 2015 -0500

    Add missing base_include method

    This fixes errors that crop up with multiple workers, e.g.,
    ERROR: MethodError: `base_include` has no method matching base_include(::ASCIIString, ::ASCIIString, ::Tuple{Int64,ASCIIString})
    Closest candidates are:
      base_include(::Nullable{Union{UTF8String,ASCIIString}}, ::AbstractString, ::Any)
      base_include(::AbstractString, ::Any)
      base_include(::Nullable{Union{UTF8String,ASCIIString}}, ::AbstractString)
      ...
     in eval at sysimg.jl:14
     in anonymous at multi.jl:1303
     in run_work_thunk at multi.jl:584
    ...

    Perhaps you'd prefer a call-site fix?

:040000 040000 dc8f6d703af0bc05b1dc811013d73a713fdf05dc 21a496ea73cbb8fcedeecb601d7a978e9067e283 M  base

Validating that the previous version actually works... there were several errors encountered throughout.

@sbromberger
Contributor Author

The version prior to bisect's "first bad commit" turns out to be #12381 (comment) so it looks like Tim's commit fixed that bug but may have uncovered the crash bug.

cc @timholy

@timholy
Sponsor Member

timholy commented Jul 30, 2015

There doesn't appear to be any way e8a1c74 is the real culprit. CC @vtjnash?

@tkelman
Contributor

tkelman commented Jul 30, 2015

re: #12381 (comment), if you go back far enough that dependencies changed versions you'll often need to do make -C deps distclean-libgit2, or similar for pcre. Make sure you're doing make cleanall at each step of bisect just to be sure (the only deps that cleanall deletes by default are small ones).

@jlapeyre
Contributor

The following is with no cleaning before the rebuild. I get different behavior on two machines.

Julia Version 0.4.0-dev+6033
Commit efcc709* (2015-07-17 02:56 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-2760QM CPU @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
> julia -p 4

Hangs (for at least several minutes) with:

ERROR: AssertionError: 
 in init_worker at ./multi.jl:1051
 in start_worker at multi.jl:964
 in process_options at ./client.jl:265
 in _start at ./client.jl:411

Different machine

Julia Version 0.4.0-dev+6033
Commit efcc709* (2015-07-17 02:56 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

> julia -p 4
> @everywhere using PowerSeries
exception on 1: exception on 4: exception on 2: ERROR: MethodError: `base_include` has no method matching base_include(::ASCIIString, ::ASCIIString, ::Tuple{Int64,ASCIIString})
Closest candidates are:
  base_include(::Nullable{Union{UTF8String,ASCIIString}}, ::AbstractString, ::Any)
  base_include(::AbstractString, ::Any)
  base_include(::Nullable{Union{UTF8String,ASCIIString}}, ::AbstractString)
  ...
 in eval at sysimg.jl:14
 in anonymous at multi.jl:1297
 in run_work_thunk at multi.jl:584
 in run_work_thunk at multi.jl:593
 in anonymous at task.jl:8
   ...
Warning: requiring "PowerSeries" did not define a corresponding module.

@rened
Member

rened commented Jul 31, 2015

A git bisect with make cleanall leads me to 416a23e as the first bad commit. 88bb2e9 is the last commit where

./julia -e "addprocs(2); @everywhere using FactCheck"

succeeds for me. (No pre-compilation used anywhere).

@timholy
Sponsor Member

timholy commented Jul 31, 2015

I also find that one to be strange as a culprit, but CCing @ScottPJones anyway.

@ScottPJones
Contributor

I'm not at home now, but I can't see how it would have an effect, unless there was invalid UTF-8 data that wasn't detected before, but then you would have seen a UnicodeError.

@rened
Member

rened commented Jul 31, 2015

Interesting. What I posted above was performed on OSX. On Linux, both commits work just fine. Current master on Linux fails, though, as on OSX. If nobody beats me to it I will bisect again on both systems tomorrow.

@vtjnash
Sponsor Member

vtjnash commented Jul 31, 2015

I don't think a bisect is entirely required for this one: from the issue description above, it's apparent that there's a race condition between the call to using on node 1 and the call to using on the other nodes, and the changes to the require logic do not properly account for it.

@ScottPJones
Contributor

bisects are not at all accurate for things like race conditions - they can only give you some idea as to some bounds where the bug was introduced.

@jlapeyre
Contributor

jlapeyre commented Aug 1, 2015

Yes, it looks like a race condition. The type of error produced (and whether they occur at all) changes from run to run.

@jlapeyre
Contributor

jlapeyre commented Aug 1, 2015

For testing, I find the bug more likely to occur with four processes than two.

@jlapeyre
Contributor

jlapeyre commented Aug 1, 2015

If I do

make cleanall
make distclean
make -C deps distcleanall

Then building commit 88bb2e9 shows the bug.

@rened
Member

rened commented Aug 3, 2015

FWIW, sleeping a little makes it pass for me again on current master:

./julia -e "addprocs(6); @everywhere sleep(0.1); using JSON, FactCheck, Compat"

Calling it without @everywhere altogether works as well:

./julia -e "addprocs(6); using JSON, FactCheck, Compat"

@carnaval
Contributor

carnaval commented Aug 3, 2015

sleeping a little makes everything pass

@amitmurthy
Contributor

#12581 does seem to fix the segfault on my local machine. Could other folks please test it out?

This is what I get.

amitm@amitm-macbookpro:~/Work/julia/julia$ julia -p 4
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+6662 (2015-08-13 16:13 UTC)
 _/ |\__'_|_|_|\__'_|  |  amitm/loading_fix/296a7db* (fork: 1 commits, 2 days)
|__/                   |  x86_64-linux-gnu

julia> @everywhere using LightGraphs
INFO: Precompiling module LightGraphs...
WARNING: Module StatsFuns uuid did not match cache file
WARNING: Module StatsFuns uuid did not match cache file
WARNING: Module StatsFuns uuid did not match cache file
WARNING: Module StatsFuns uuid did not match cache file
WARNING: node state is inconsistent: node 2 failed to load cache from /home/amitm/.julia/lib/v0.4/LightGraphs.ji
WARNING: node state is inconsistent: node 3 failed to load cache from /home/amitm/.julia/lib/v0.4/LightGraphs.ji
WARNING: node state is inconsistent: node 4 failed to load cache from /home/amitm/.julia/lib/v0.4/LightGraphs.ji
WARNING: node state is inconsistent: node 5 failed to load cache from /home/amitm/.julia/lib/v0.4/LightGraphs.ji

and

julia> 
amitm@amitm-macbookpro:~/Work/julia/julia$ ./julia -e "addprocs(6); using JSON, FactCheck, Compat"
WARNING: module DataStructures should explicitly import < from Base
WARNING: module DataStructures should explicitly import <= from Base
WARNING: module JSON should explicitly import colon from Base
WARNING: module JSON should explicitly import colon from Base
WARNING: module DataStructures should explicitly import < from Base
WARNING: module DataStructures should explicitly import <= from Base
WARNING: module DataStructures should explicitly import < from Base
WARNING: module DataStructures should explicitly import <= from Base
WARNING: module JSON should explicitly import colon from Base
WARNING: module JSON should explicitly import colon from Base
WARNING: module DataStructures should explicitly import < from Base
WARNING: module DataStructures should explicitly import <= from Base
WARNING: module DataStructures should explicitly import < from Base
WARNING: module DataStructures should explicitly import <= from Base
WARNING: module DataStructures should explicitly import < from Base
WARNING: module DataStructures should explicitly import <= from Base
WARNING: module DataStructures should explicitly import < from Base
WARNING: module DataStructures should explicitly import <= from Base
WARNING: module JSON should explicitly import colon from Base
WARNING: module JSON should explicitly import colon from Base
WARNING: module JSON should explicitly import colon from Base
WARNING: module JSON should explicitly import colon from Base
WARNING: module JSON should explicitly import colon from Base
WARNING: module JSON should explicitly import colon from Base
WARNING: module JSON should explicitly import colon from Base
WARNING: module JSON should explicitly import colon from Base
WARNING: module JSON should explicitly import colon from Base
WARNING: module JSON should explicitly import colon from Base

Warnings and errors but no segfault.

@vtjnash
Sponsor Member

vtjnash commented Aug 14, 2015

why are you getting "WARNING: node state is inconsistent" there? that generally is going to be really, really bad.

@amitmurthy
Contributor

I cleaned .cache.

Now with julia -p4, I see

julia> @everywhere using LightGraphs
WARNING: replacing module LightGraphs
WARNING: Method definition ==(Base.Pair{Int64, Int64}, Base.Pair{Int64, Int64}) in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:32 overwritten in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:32.
WARNING: Method definition show(Base.IO, Base.Pair{Int64, Int64}) in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:35 overwritten in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:35.
WARNING: replacing module LightGraphs
WARNING: Method definition ==(Base.Pair{Int64, Int64}, Base.Pair{Int64, Int64}) in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:32 overwritten in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:32.
WARNING: Method definition show(Base.IO, Base.Pair{Int64, Int64}) in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:35 overwritten in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:35.
WARNING: replacing module LightGraphs
WARNING: replacing module LightGraphs
WARNING: Method definition ==(Base.Pair{Int64, Int64}, Base.Pair{Int64, Int64}) in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:32 overwritten in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:32.
WARNING: Method definition show(Base.IO, Base.Pair{Int64, Int64}) in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:35 overwritten in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:35.
WARNING: Method definition ==(Base.Pair{Int64, Int64}, Base.Pair{Int64, Int64}) in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:32 overwritten in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:32.
WARNING: Method definition show(Base.IO, Base.Pair{Int64, Int64}) in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:35 overwritten in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:35.

Why did it precompile the first time around and not now?

@JeffBezanson
Sponsor Member

The package_locks mechanism was supposed to (used to?) solve this. If you try to do using X multiple times at once on a worker, it should actually happen only once.
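
As a rough illustration of the "should actually happen only once" behavior described above, here is a minimal sketch in current Julia syntax. It is not the package_locks code from Base: load_once, pending, loaded, and load_count are hypothetical names, and a sleep stands in for the real work of loading a module. It also only covers tasks inside a single process, not coordination between nodes.

```julia
# Hypothetical sketch of "load at most once" using cooperative tasks.
# Not Base's implementation; all names here are made up for illustration.
const pending = Dict{Symbol,Condition}()  # modules currently being loaded
const loaded = Set{Symbol}()              # modules that finished loading
const load_count = Ref(0)                 # how often the loader actually ran

function load_once(loader::Function, name::Symbol)
    name in loaded && return false        # already loaded: nothing to do
    if haskey(pending, name)
        wait(pending[name])               # another task is loading: block until it finishes
        return false
    end
    cond = Condition()
    pending[name] = cond
    try
        loader()                          # the expensive part; may yield to other tasks
        push!(loaded, name)
    finally
        delete!(pending, name)
        notify(cond)                      # wake every task that was waiting
    end
    return true
end

# Four concurrent requests for the same (pretend) module.
results = Bool[]
@sync for _ in 1:4
    @async push!(results, load_once(:Demo) do
        sleep(0.1)                        # stand-in for evaluating the module's source
        load_count[] += 1
    end)
end

println("loader ran ", load_count[], " time(s); ", count(results), " task did the work")
```

With four tasks requesting the same name, the loader body should run once while the other three tasks block and then return without doing any work. A waiter in this sketch does not retry if the loader throws, which a real implementation would need to handle.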

@JeffBezanson
Sponsor Member

@stevengj It makes sense that your change would fix the segfault. Is there evidence of some other problem as well, or are we done here?

@stevengj
Member

We haven't had any indication that it is still segfaulting. It would be good to have an issue for eliminating the warning, but probably that should be a separate issue.

@ViralBShah
Member

Should we then take this off the 0.4 milestone list?

@vtjnash
Sponsor Member

vtjnash commented Aug 21, 2015

i think there are a few improvements that can be made:

  1. to reduce the window of inconsistency, find_all_in_cache_path should block if package_locks[mod] indicates that the node is in the process of calling compilecache for that module
  2. to work harder to reduce this window for inconsistencies, __precompile__ should be handled on worker nodes by first attempting to convince node 1 to cachecompile the package (instead of ignoring this directive on worker nodes) before deciding whether to abort or continue running the source file
  3. the broadcast of top-level import from node 1 should include a conditional check of isdefined(Main, mod) to block accidental redefinition (unless the user explicitly does @everywhere reload("Mod"))
  4. to reduce potential confusion, rename require to reload and deprecate the old name entirely
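
Item 3 above can be illustrated with a small guard. This is only a sketch in current Julia syntax under assumed names: import_unless_defined is hypothetical, and the Dates standard library stands in for an arbitrary package. It is not the change that was eventually made in Base.

```julia
# Hypothetical guard: only import `mod` into Main if no binding with that
# name exists yet, so a repeated request cannot redefine the module.
function import_unless_defined(mod::Symbol)
    isdefined(Main, mod) && return false             # already there: skip
    Core.eval(Main, Expr(:import, Expr(:., mod)))    # same as `import <mod>` at top level
    return true
end

first_call = import_unless_defined(:Dates)    # performs the import
second_call = import_unless_defined(:Dates)   # sees the existing binding and does nothing
println((first_call, second_call))
```

In a fresh session the first call performs the import and the second is a no-op. An explicit reload, as mentioned in the item, would have to bypass this check on purpose.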

@oxinabox
Contributor

@izarov

izarov commented Mar 28, 2016

Is the following deserialization error on workers a manifestation of this race condition?

# higher number of workers relative to available cores seems to make it easier to reproduce
# e.g., try with 8 if 4x doesn’t work
workers = 4*Sys.CPU_CORES;
addprocs(workers);

@everywhere begin
  import Distributions

  immutable ParameterUnivariate{U<:Distributions.UnivariateDistribution}
    dist::U
  end
end

param = ParameterUnivariate(Distributions.Normal());
pmap(x->x, fill(param, 100));

results in reloading module warnings and then a large number of workers exiting with error:

ERROR: TypeError: ParameterUnivariate: in U, expected U<:Distributions.Distribution{Distributions.Univariate,S<:Distributions.ValueSupport}, got Type{Distributions.Normal}
 in deserialize_datatype at serialize.jl:646
...

Moving import Distributions outside of the @everywhere block as using Distributions seems to fix it. Reproducible on 0.4.5 and 0.5.

@cstjean
Contributor

cstjean commented Jul 12, 2016

We faced the same issue in DecisionTree.jl, and I've boiled it down to this. No precompilation necessary (on Julia 0.4, OSX)

# B.jl
module B
end

# C.jl
module C
abstract AbstractAbstract
end

# A.jl
module A
using B             # can be any module
include("incl.jl")  # problem disappears if the import is done in A.jl
end

# incl.jl
import C: AbstractAbstract
type Obj <: AbstractAbstract end

then interactively:

addprocs(3)
@everywhere using A
> On worker 2: UndefVarError: AbstractAbstract not defined

@stevengj
Member

import A; @everywhere using A is the best way to do this at the moment, I think.
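
Spelled out, the suggested pattern looks like the sketch below. It is written for current Julia, where the multi-process functions live in the Distributed standard library (on 0.4 to 0.6 they were part of Base), and it uses the Statistics standard library as a stand-in for A so that it runs without extra files. Both substitutions are assumptions made for illustration. Presumably the pattern helps because the master process finishes loading the module before any worker is asked to, which sidesteps the race described earlier in this thread.

```julia
using Distributed

addprocs(2)                    # start two local worker processes

import Statistics              # 1. load the module on the master process only
@everywhere using Statistics   # 2. then load it on every process, workers included

# Check that each worker can call a function exported by the module.
worker_means = [remotecall_fetch(() -> mean([1, 2, 3]), p) for p in workers()]
println(worker_means)

rmprocs(workers())             # shut the workers down again
```

Each worker should be able to resolve mean once the second step has completed.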

@pearcemc

pearcemc commented Oct 28, 2016

I'm not sure this is a closed issue (I'm on 0.5.0).
This was my workaround (I like reload for debugging purposes):

for p in procs()
    @fetchfrom p reload("Package")
end

@stevengj
Member

@pearcemc, do import Package; @everywhere using Package.

@stevengj
Member

Fixed by #21718?

andreasnoack pushed a commit that referenced this issue Jul 17, 2017
…ove include_from_node1 (#22588)

* include: assume that shared file systems exist for clusters

remove buggy support for emulating shared file-systems from Julia:
the kernel is much better at this, and can do it transparently

* only broadcast using/import to nodes which need it

fix #12381
fix #13999
jeffwong pushed a commit to jeffwong/julia that referenced this issue Jul 24, 2017
…ove include_from_node1 (JuliaLang#22588)

* include: assume that shared file systems exist for clusters

remove buggy support for emulating shared file-systems from Julia:
the kernel is much better at this, and can do it transparently

* only broadcast using/import to nodes which need it

fix JuliaLang#12381
fix JuliaLang#13999