
Distributed Serialization segfault on 1.5.1 #37545

Closed
mmattocks opened this issue Sep 12, 2020 · 2 comments

Comments

@mmattocks

I have recently begun experiencing a strange issue where a long-running process with many workers will segfault after about an hour with the following output:

signal (11): Segmentation fault
in expression starting at /srv/git/rys_nucleosomes/nested_sampling/dif_pos_learner.jl:82
sig_match_fast at /buildworker/worker/package_linux64/build/src/gf.c:2250 [inlined]
jl_lookup_generic_ at /buildworker/worker/package_linux64/build/src/gf.c:2332 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2394
serialize_any at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Serialization/src/Serialization.jl:648
serialize at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Serialization/src/Serialization.jl:627 [inlined]
serialize at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Serialization/src/Serialization.jl:272
serialize at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Serialization/src/Serialization.jl:2000
unknown function (ip: 0x7f303d394df5)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
serialize_msg at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/messages.jl:90
unknown function (ip: 0x7f303d392e05)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1690 [inlined]
do_apply at /buildworker/worker/package_linux64/build/src/builtins.c:655
jl_f__apply_latest at /buildworker/worker/package_linux64/build/src/builtins.c:705
#invokelatest#1 at ./essentials.jl:710 [inlined]    
invokelatest at ./essentials.jl:709 [inlined]     
send_msg_ at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/messages.jl:185
send_msg_now at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/messages.jl:130 [inlined]   
send_msg_now at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/messages.jl:125
deliver_result at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:111    
unknown function (ip: 0x7f303d394ab8)              
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:302 [inlined]
#105 at ./task.jl:356
unknown function (ip: 0x7f303d390b6c)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1690 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:707
unknown function (ip: (nil))
Allocations: 859407961 (Pool: 859209782; Big: 198179); GC: 217
fish: “julia” terminated by signal SIGSEGV (Address boundary error)

I really have no idea where to start producing an MWE because of the "unknown function" frames. Does anyone have a hint as to what might be happening here? I am confused by the "serialize" calls in this trace, since none of the code the remote workers are executing calls serialize() directly. Could this be arising from remote workers calling remotecall_fetch(deserialize,...)? That's the only thing in my code the remote workers execute that has anything to do with Serialization, so maybe it's something lower-level than that.
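For context, a minimal sketch of the pattern described above, with illustrative names (the issue does not show the actual worker code or file path): a remote worker asks process 1 to deserialize a file and ship the value back, so the reply travels through Distributed's serialize_msg / serialize_any, which is where the trace above crashes.

```julia
using Distributed, Serialization
addprocs(2)
@everywhere using Distributed, Serialization

path = tempname()                 # illustrative; the real file is unknown
serialize(path, collect(1:10))    # stand-in for the data being shipped

# A worker (here pid 2) calls remotecall_fetch(deserialize, 1, path),
# asking the master process to deserialize the file and return the value.
result = remotecall_fetch(2) do
    remotecall_fetch(deserialize, 1, path)
end
```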

julia> versioninfo()
Julia Version 1.5.1
Commit 697e782ab8 (2020-08-25 20:08 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-4670K CPU @ 3.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, haswell)
Environment:
  JULIA_NUM_THREADS = 2
@vtjnash commented Sep 15, 2020

Perhaps this could be related to @yuyichao's recent findings in #37511 or #37557? Do you have any additional context on what you were doing? It looks like the worker was trying to send the result back to thread 1, and got garbage.

@mmattocks (Author)

I no longer experience this issue after removing "===" checks in the worker functions; this appears to be a dupe. Thanks very much for the clue!
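For readers hitting the same crash: the workaround described above amounts to removing "===" (egal) checks from worker code and comparing fields instead. A hypothetical reconstruction, since the issue never shows the actual worker functions:

```julia
# Hypothetical worker-side struct and check of the kind removed as a
# workaround. On Julia 1.5.1, `===` comparisons involving certain
# immutable structs could miscompile (see #37511 / #37557); comparing
# fields explicitly avoids the affected egal code path.
struct Record
    id::Int
    score::Float64
end

function is_same_record(a::Record, b::Record)
    # before: a === b   (the kind of check removed in the fix above)
    return a.id == b.id && a.score == b.score
end
```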
