loading Oscar causes corruption #44661

Closed
simeonschaub opened this issue Mar 17, 2022 · 12 comments
Labels
kind:regression Regression in behavior compared to a previous version

Comments

@simeonschaub
Member

See timholy/Revise.jl#675. Was bisected to #43671, but I wasn't able to really extract anything from the rr trace, so I'm not sure if that might just have surfaced the issue or Revise might be doing something sketchy.

@simeonschaub simeonschaub added the kind:regression Regression in behavior compared to a previous version label Mar 17, 2022
@simeonschaub simeonschaub added this to the 1.8 milestone Mar 17, 2022
@fingolfin
Contributor

Of course it might also be OSCAR that does something "sketchy" -- I'd like to draw attention to its dependency GAP.jl (I am one of the principal authors), which is (AFAIK) the only package using Julia's "foreign type" kernel interface (which was added for and by us). Who knows, maybe something in there causes a problem. One thing that caused issues in the past: foreign types are indicated by fielddesc_type == 3. But I just had a quick look through the kernel and everything seemed fine in that regard. Anyway, this could be a total red herring.

@fingolfin
Contributor

Here is a current reproducer and backtrace (made on a Mac, but it can also be reproduced on Linux in exactly the same way):

   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.8.0-beta2.3 (2022-03-17)
 _/ |\__'_|_|_|\__'_|  |  release-1.8/e191c6e935 (fork: 79 commits, 29 days)
|__/                   |

julia> versioninfo()
Julia Version 1.8.0-beta2.3
Commit e191c6e935 (2022-03-17 11:35 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.4.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 1 on 8 virtual cores
Environment:
  JULIA_EDITOR = /usr/local/bin/bbedit

(@v1.8) pkg> update
    Updating registry at `~/.julia/registries/General`
    Updating git-repo `https://github.com/JuliaRegistries/General.git`
  No Changes to `~/.julia/environments/v1.8/Project.toml`
  No Changes to `~/.julia/environments/v1.8/Manifest.toml`

(@v1.8) pkg> status
Status `~/.julia/environments/v1.8/Project.toml`
  [6e4b80f9] BenchmarkTools v1.3.1
  [f1435218] Oscar v0.8.2-DEV `~/.julia/dev/Oscar`
  [295af30f] Revise v3.3.3

(@v1.8) pkg> status -m
Status `~/.julia/environments/v1.8/Manifest.toml`
  [c3fe647b] AbstractAlgebra v0.25.1
  [6e4b80f9] BenchmarkTools v1.3.1
  [f01c122e] BinaryWrappers v0.1.2
  [da1fd8a2] CodeTracking v1.0.8
  [1f15a43c] CxxWrap v0.12.0
  [ffbed154] DocStringExtensions v0.8.6
  [c863536a] GAP v0.7.7
  [d5909c97] GroupsCore v0.4.0
  [3e1990a7] Hecke v0.13.3
  [692b3bcd] JLLWrappers v1.4.1
  [682c06a0] JSON v0.21.3
  [aa1ae85d] JuliaInterpreter v0.9.11
  [6f1432cf] LoweredCodeUtils v2.2.1
  [1914dd2f] MacroTools v0.5.9
⌅ [4fe8b98c] Mongoc v0.6.2
  [2edaba10] Nemo v0.30.0
  [bac558e1] OrderedCollections v1.4.1
  [f1435218] Oscar v0.8.2-DEV `~/.julia/dev/Oscar`
  [69de0a69] Parsers v2.2.3
  [d720cf60] Polymake v0.7.1
  [21216c6a] Preferences v1.2.5
  [fb686558] RandomExtensions v0.4.3
  [ae029012] Requires v1.3.0
  [295af30f] Revise v3.3.3
  [6c6a2e73] Scratch v1.1.0
  [bcd08a7b] Singular v0.9.3
  [e21ec000] Antic_jll v0.200.501+0
  [d9960996] Arb_jll v200.2200.0+0
  [fcfa6d1b] Calcium_jll v0.400.102+0
  [e134572f] FLINT_jll v200.800.401+1
⌅ [5cd7a574] GAP_jll v400.1191.1+2
⌅ [de1ad85e] GAP_lib_jll v400.1191.0+0
⌅ [ba154793] GAP_pkg_juliainterface_jll v0.700.300+1
  [e8aa6df9] GLPK_jll v5.0.1+0
  [dd4b983a] LZO_jll v2.10.1+0
  [90100e71] MongoC_jll v1.19.1+0
  [68e3532b] Ncurses_jll v6.2.0+0
  [76642167] Ninja_jll v1.10.3+0
⌅ [656ef2d0] OpenBLAS32_jll v0.3.17+0
  [458c3c95] OpenSSL_jll v1.1.14+0
  [80dd9cbb] PPL_jll v1.2.1+0
  [83958c19] Perl_jll v5.34.0+1
  [05236dd9] Readline_jll v8.1.1+1
⌅ [43d676ae] Singular_jll v403.0.100+0
  [36f60fef] TOPCOM_jll v0.17.8+0
  [3161d3a3] Zstd_jll v1.5.2+0
  [508c9074] bliss_jll v0.77.0+1
  [28df3c45] boost_jll v1.76.0+0
  [f07e07eb] cddlib_jll v0.94.13+0
  [5558cf25] cohomCalg_jll v0.32.0+0
  [1493ae25] lib4ti2_jll v1.6.10+0
  [3eaa8342] libcxxwrap_julia_jll v0.9.0+1
  [4d8266f6] libpolymake_julia_jll v0.8.0+1
  [ae4fbd8f] libsingular_julia_jll v0.21.0+1
  [3873f7d0] lrslib_jll v0.3.3+0
  [6d01cc9a] msolve_jll v0.2.3+0
  [55c6dc9b] nauty_jll v2.6.13+0
  [6690c6e9] normaliz_jll v300.900.100+0
  [7c209550] polymake_jll v400.600.0+0
  [fe1e1685] snappy_jll v1.1.9+0
  [0dad84c5] ArgTools v1.1.1
  [56f22d72] Artifacts
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8ba89e20] Distributed
  [f43a241f] Downloads v1.6.0
  [7b1f6079] FileWatching
  [b77e0a4c] InteractiveUtils
  [b27032c2] LibCURL v0.6.3
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [a63ad114] Mmap
  [ca575930] NetworkOptions v1.2.0
  [44cfe95a] Pkg v1.8.0
  [de0858da] Printf
  [9abbd945] Profile
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA v0.7.0
  [9e88b42a] Serialization
  [6462fe0b] Sockets
  [2f01184e] SparseArrays
  [10745b16] Statistics
  [fa267f1f] TOML v1.0.0
  [a4e569a6] Tar v1.10.0
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
  [e66e0078] CompilerSupportLibraries_jll v0.5.2+0
  [781609d7] GMP_jll v6.2.1+1
  [deac9b47] LibCURL_jll v7.81.0+0
  [29816b5a] LibSSH2_jll v1.10.2+0
  [3a97d323] MPFR_jll v4.1.1+1
  [c8ffd9c3] MbedTLS_jll v2.28.0+0
  [14a3606d] MozillaCACerts_jll v2022.2.1
  [4536629a] OpenBLAS_jll v0.3.20+0
  [83775a58] Zlib_jll v1.2.12+1
  [8e850b90] libblastrampoline_jll v5.0.1+0
  [8e850ede] nghttp2_jll v1.41.0+1
  [3f19e933] p7zip_jll v16.2.1+1
Info Packages marked with ⌅ have new versions available but cannot be upgraded. To see why use `status --outdated`

julia> using Revise, Oscar
 -----    -----    -----      -      -----
|     |  |     |  |     |    | |    |     |
|     |  |        |         |   |   |     |
|     |   -----   |        |     |  |-----
|     |        |  |        |-----|  |   |
|     |  |     |  |     |  |     |  |    |
 -----    -----    -----   -     -  -     -

...combining (and extending) ANTIC, GAP, Polymake and Singular
Version 0.8.2-DEV ...
 ... which comes with absolutely no warranty whatsoever
Type: '?Oscar' for more information
(c) 2019-2022 by The Oscar Development Team

julia> touch(abspath(pathof(Oscar),"..","Groups","GAPGroups.jl"));

julia> 1+1

signal (11): Segmentation fault: 11
in expression starting at none:1
jl_typemap_level_assoc_exact at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/typemap.c:1026
jl_typemap_assoc_exact at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/./julia_internal.h:1324 [inlined]
jl_typemap_level_assoc_exact at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/typemap.c:1027
jl_typemap_assoc_exact at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/./julia_internal.h:1324 [inlined]
jl_lookup_generic_ at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/gf.c:2480 [inlined]
ijl_apply_generic at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/gf.c:2536
jl_apply at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/./julia.h:1829 [inlined]
do_call at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/interpreter.c:126
eval_body at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/interpreter.c:0
jl_interpret_toplevel_thunk at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/interpreter.c:750
jl_toplevel_eval_flex at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/toplevel.c:906
ijl_toplevel_eval at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/toplevel.c:915 [inlined]
ijl_toplevel_eval_in at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/toplevel.c:965
eval at ./boot.jl:368 [inlined]
do_assignment! at /Users/mhorn/.julia/packages/JuliaInterpreter/TeQ0I/src/interpret.jl:354
step_expr! at /Users/mhorn/.julia/packages/JuliaInterpreter/TeQ0I/src/interpret.jl:470
step_expr! at /Users/mhorn/.julia/packages/JuliaInterpreter/TeQ0I/src/interpret.jl:594
finish! at /Users/mhorn/.julia/packages/JuliaInterpreter/TeQ0I/src/commands.jl:14
step_expr! at /Users/mhorn/.julia/packages/JuliaInterpreter/TeQ0I/src/interpret.jl:513
step_through_methoddef at /Users/mhorn/.julia/packages/LoweredCodeUtils/c8DBv/src/signatures.jl:84
signature at /Users/mhorn/.julia/packages/LoweredCodeUtils/c8DBv/src/signatures.jl:45
unknown function (ip: 0x2e467538b)
_jl_invoke at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/gf.c:0 [inlined]
ijl_apply_generic at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/gf.c:2540
#methoddef!#7 at /Users/mhorn/.julia/packages/LoweredCodeUtils/c8DBv/src/signatures.jl:493
methoddef!##kw at /Users/mhorn/.julia/packages/LoweredCodeUtils/c8DBv/src/signatures.jl:445 [inlined]
#methods_by_execution!#24 at /Users/mhorn/.julia/packages/Revise/VskYC/src/lowered.jl:272
unknown function (ip: 0x2e46acee7)
unknown function (ip: 0x2c4f0662f)
unknown function (ip: 0x2c4f06603)
methods_by_execution!##kw at /Users/mhorn/.julia/packages/Revise/VskYC/src/lowered.jl:239 [inlined]
#methods_by_execution!#20 at /Users/mhorn/.julia/packages/Revise/VskYC/src/lowered.jl:217
methods_by_execution!##kw at /Users/mhorn/.julia/packages/Revise/VskYC/src/lowered.jl:175 [inlined]
#eval_with_signatures#90 at /Users/mhorn/.julia/packages/Revise/VskYC/src/packagedef.jl:464 [inlined]
eval_with_signatures##kw at /Users/mhorn/.julia/packages/Revise/VskYC/src/packagedef.jl:462 [inlined]
#instantiate_sigs!#91 at /Users/mhorn/.julia/packages/Revise/VskYC/src/packagedef.jl:472
instantiate_sigs! at /Users/mhorn/.julia/packages/Revise/VskYC/src/packagedef.jl:469 [inlined]
maybe_extract_sigs! at /Users/mhorn/.julia/packages/Revise/VskYC/src/pkgs.jl:141 [inlined]
handle_deletions at /Users/mhorn/.julia/packages/Revise/VskYC/src/packagedef.jl:641
#revise#96 at /Users/mhorn/.julia/packages/Revise/VskYC/src/packagedef.jl:747
revise at /Users/mhorn/.julia/packages/Revise/VskYC/src/packagedef.jl:735
unknown function (ip: 0x2c4f7401b)
_jl_invoke at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/gf.c:0 [inlined]
ijl_apply_generic at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/gf.c:2540
jl_apply at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/./julia.h:1829 [inlined]
jl_f__call_latest at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/builtins.c:769
#invokelatest#2 at ./essentials.jl:729 [inlined]
invokelatest at ./essentials.jl:727
unknown function (ip: 0x12cc00033)
_jl_invoke at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/gf.c:0 [inlined]
ijl_apply_generic at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/gf.c:2540
jl_apply at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/./julia.h:1829 [inlined]
do_call at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/interpreter.c:126
eval_body at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/interpreter.c:0
jl_interpret_toplevel_thunk at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/interpreter.c:750
jl_toplevel_eval_flex at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/toplevel.c:906
jl_toplevel_eval_flex at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/toplevel.c:850
ijl_toplevel_eval at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/toplevel.c:915 [inlined]
ijl_toplevel_eval_in at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/toplevel.c:965
eval at ./boot.jl:368 [inlined]
eval_user_input at /Users/mhorn/Projekte/Julia/julia.release-1.8/usr/share/julia/stdlib/v1.8/REPL/src/REPL.jl:151
repl_backend_loop at /Users/mhorn/Projekte/Julia/julia.release-1.8/usr/share/julia/stdlib/v1.8/REPL/src/REPL.jl:247
start_repl_backend at /Users/mhorn/Projekte/Julia/julia.release-1.8/usr/share/julia/stdlib/v1.8/REPL/src/REPL.jl:232
#run_repl#47 at /Users/mhorn/Projekte/Julia/julia.release-1.8/usr/share/julia/stdlib/v1.8/REPL/src/REPL.jl:369
run_repl at /Users/mhorn/Projekte/Julia/julia.release-1.8/usr/share/julia/stdlib/v1.8/REPL/src/REPL.jl:356
jfptr_run_repl_64278 at /Users/mhorn/Projekte/Julia/julia.release-1.8/usr/lib/julia/sys.dylib (unknown line)
_jl_invoke at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/gf.c:0 [inlined]
ijl_apply_generic at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/gf.c:2540
#961 at ./client.jl:419
jfptr_YY.961_46947 at /Users/mhorn/Projekte/Julia/julia.release-1.8/usr/lib/julia/sys.dylib (unknown line)
_jl_invoke at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/gf.c:0 [inlined]
ijl_apply_generic at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/gf.c:2540
jl_apply at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/./julia.h:1829 [inlined]
jl_f__call_latest at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/builtins.c:769
#invokelatest#2 at ./essentials.jl:729 [inlined]
invokelatest at ./essentials.jl:727 [inlined]
run_main_repl at ./client.jl:404
exec_options at ./client.jl:318
_start at ./client.jl:522
jfptr__start_51636 at /Users/mhorn/Projekte/Julia/julia.release-1.8/usr/lib/julia/sys.dylib (unknown line)
_jl_invoke at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/gf.c:0 [inlined]
ijl_apply_generic at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/gf.c:2540
jl_apply at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/./julia.h:1829 [inlined]
true_main at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/jlapi.c:567
jl_repl_entrypoint at /Users/mhorn/Projekte/Julia/julia.release-1.8/src/jlapi.c:711
Allocations: 36422603 (Pool: 36371362; Big: 51241); GC: 22

@simeonschaub
Member Author

Ok, this is interesting:

julia> using JuliaInterpreter

julia> ex = :(f(x) = () -> x);

julia> JuliaInterpreter.finish_and_return!(JuliaInterpreter.finish_and_return!, Frame(Main, ex), true)
f (generic function with 1 method)

julia> JuliaInterpreter.finish_and_return!(JuliaInterpreter.finish_and_return!, Frame(Main, ex), true)
f (generic function with 1 method)

julia> import Oscar
[...]
julia> JuliaInterpreter.finish_and_return!(JuliaInterpreter.finish_and_return!, Frame(Main, ex), true)
f (generic function with 1 method)

julia> JuliaInterpreter.finish_and_return!(JuliaInterpreter.finish_and_return!, Frame(Main, ex), true)

signal (11): Segmentation fault
in expression starting at none:1
mtcache_hash_lookup at /cache/build/default-amdci4-4/julialang/julia-master/src/typemap.c:292 [inlined]
mtcache_hash_lookup at /cache/build/default-amdci4-4/julialang/julia-master/src/typemap.c:288 [inlined]
jl_typemap_level_assoc_exact at /cache/build/default-amdci4-4/julialang/julia-master/src/typemap.c:1025
jl_typemap_assoc_exact at /cache/build/default-amdci4-4/julialang/julia-master/src/julia_internal.h:1328 [inlined]
[...]

So even though Oscar is never actually used anywhere, just loading it seems to cause corruption. I wonder whether this might be related to #36770? I don't really know what jl_new_foreign_type is doing, maybe something about that doesn't play nicely with serializing these types as part of a module binding?

@simeonschaub simeonschaub changed the title segfault when loading Revise and Oscar loading Oscar causes corruption Mar 18, 2022
@simeonschaub
Member Author

Ok, I looked more into it and if I run julia inside gdb, I already get a segfault if I just do import GAP, without having to load anything else. This happens even without 7b1e454, which seems to confirm my suspicion that #43671 just surfaced this issue. @fingolfin are you able to run GDB and confirm that?

The segfault in GDB
─── Assembly ───────────────────────────────────────────────────────────────────────────────────────────
 0x00007fff53b3db98  FindLiveRangeReverse+46 jmp    0x7fff53b3dba9 <SafeScanTaskStack+169>
 0x00007fff53b3db9a  FindLiveRangeReverse+48 nopw   0x0(%rax,%rax,1)
 0x00007fff53b3dba0  FindLiveRangeReverse+54 sub    $0x2,%rbx
 0x00007fff53b3dba4  FindLiveRangeReverse+58 cmp    %r12,%rbx
 0x00007fff53b3dba7  FindLiveRangeReverse+61 jb     0x7fff53b3db3a <SafeScanTaskStack+58>
 0x00007fff53b3dba9  FindLiveRangeReverse+63 mov    (%rbx),%rbp
 0x00007fff53b3dbac  FindLiveRangeReverse+66 test   %rbp,%rbp
 0x00007fff53b3dbaf  FindLiveRangeReverse+69 je     0x7fff53b3dba0 <SafeScanTaskStack+160>
 0x00007fff53b3dbb1  FindLiveRangeReverse+71 mov    %rbp,%rdi
 0x00007fff53b3dbb4  FindLiveRangeReverse+74 call   0x7fff53a3ab30 <jl_gc_internal_obj_base_ptr@plt>
─── Breakpoints ────────────────────────────────────────────────────────────────────────────────────────
─── Expressions ────────────────────────────────────────────────────────────────────────────────────────
─── History ────────────────────────────────────────────────────────────────────────────────────────────
─── Memory ─────────────────────────────────────────────────────────────────────────────────────────────
─── Registers ──────────────────────────────────────────────────────────────────────────────────────────
         rax 0x0000000000000000         rbx 0x00007fff5aa07ffe            rcx 0xf5ffffffffffffff
         rdx 0x000000000003d7ff         rsi 0x0000000000001368            rdi 0xf600000000000000
         rbp 0x0000000000000000         rsp 0x00007fffffff90a0             r8 0x0000000000000000
          r9 0x00005555558aec80         r10 0x00005555555ba8d0            r11 0x00007ffff7457530
         r12 0x00007fff5aa00000         r13 0x0000000000000008            r14 0x00007fff56933260
         r15 0x00005555577e2f20         rip 0x00007fff53b3dba9         eflags [ IF RF ]         
          cs 0x00000033                  ss 0x0000002b                     ds 0x00000000        
          es 0x00000000                  fs 0x00000000                     gs 0x00000000        
─── Source ─────────────────────────────────────────────────────────────────────────────────────────────
 499      for (Int i = 0; i < arr->len; i++) {
 500          JMark(arr->items[i]);
 501      }
 502  }
 503  
 504  #define ELEM_TYPE TaskInfo
 505  #define COMPARE CmpTaskInfo
 506  
 507  #include "baltree.h"
 508  
─── Stack ──────────────────────────────────────────────────────────────────────────────────────────────
[0] from 0x00007fff53b3dba9 in FindLiveRangeReverse+63 at src/julia_gc.c:504
[1] from 0x00007fff53b3dba9 in SafeScanTaskStack+169 at src/julia_gc.c:631
[2] from 0x00007fff53b3eda7 in ScanTaskStack+281 at src/julia_gc.c:656
[3] from 0x00007fff53b3eda7 in GapTaskScanner+391 at src/julia_gc.c:830
[4] from 0x00007ffff7452125 in gc_mark_loop+11189 at /home/simeon/Documents/Julia/julia/src/gc.c:2684
[5] from 0x00007ffff745451f in _jl_gc_collect+1375 at /home/simeon/Documents/Julia/julia/src/gc.c:3077
[6] from 0x00007ffff7455a16 in ijl_gc_collect+518 at /home/simeon/Documents/Julia/julia/src/gc.c:3306
[7] from 0x00007fffe515c1a2
[8] from 0x00007fffffff95c0
[9] from 0x00007ffff7f8a9c0 in last_environ
[+]
─── Threads ────────────────────────────────────────────────────────────────────────────────────────────
[17] id 1618377 name julia from 0x00007ffff7e1815a in __futex_abstimed_wait_common
[16] id 1618376 name julia from 0x00007ffff7e1815a in __futex_abstimed_wait_common
[15] id 1618375 name julia from 0x00007ffff7e1815a in __futex_abstimed_wait_common
[14] id 1618374 name julia from 0x00007ffff7e1815a in __futex_abstimed_wait_common
[13] id 1618373 name julia from 0x00007ffff7e1815a in __futex_abstimed_wait_common
[12] id 1618372 name julia from 0x00007ffff7e1815a in __futex_abstimed_wait_common
[11] id 1618371 name julia from 0x00007ffff7e1815a in __futex_abstimed_wait_common
[10] id 1618370 name julia from 0x00007ffff7e1815a in __futex_abstimed_wait_common
[9] id 1618369 name julia from 0x00007ffff7e1815a in __futex_abstimed_wait_common
[8] id 1618368 name julia from 0x00007ffff7e1815a in __futex_abstimed_wait_common
[7] id 1618367 name julia from 0x00007ffff7e1815a in __futex_abstimed_wait_common
[6] id 1618366 name julia from 0x00007ffff7e1815a in __futex_abstimed_wait_common
[5] id 1618365 name julia from 0x00007ffff7e1815a in __futex_abstimed_wait_common
[4] id 1618364 name julia from 0x00007ffff7e1815a in __futex_abstimed_wait_common
[3] id 1618363 name julia from 0x00007ffff7e1815a in __futex_abstimed_wait_common
[2] id 1618362 name julia from 0x00007ffff7dd124a in sigtimedwait
[1] id 1618358 name julia from 0x00007fff53b3dba9 in FindLiveRangeReverse+63 at src/julia_gc.c:504
─── Variables ──────────────────────────────────────────────────────────────────────────────────────────
arg end = 0x7fff5ae00000, start = <optimized out>, arr = 0x5555555f7600: {len = 0,cap = 1024,items = 0x555557897bd0}
loc addr = <optimized out>, q = 0x7fff5aa07ffe "": 0 '\000', old_safe_restore = 0x0: Cannot access memory at address 0x0, exc_buf = {[0] = {__jmpbuf = {[0] = 1, [1] = 1571102604749472072, [2] = 140734713823232, [3] = 140734718017536…
────────────────────────────────────────────────────────────────────────────────────────────────────────

@fingolfin
Contributor

Interesting, I cannot reproduce this with just import GAP. Is that exactly as with your example that used import Oscar? Which Julia version and which OS are you using (so that I can match more closely when trying to reproduce)?

Instances of foreign types cannot be serialized at all (and I've inserted a check guarding against attempts to do so some time ago into src/dump.c).

AFAIK there is no special code at all for serializing the foreign types, yet so far I have never run into this. But I'll add a guard in there now, too, to see if and where it triggers.

I don't see how to serialize a foreign type in the current system, as they contain pointers to C functions. There are of course ways to deal with that (e.g. associate a unique global identifier with the type which can be used to restore the function pointers during runtime). But I don't understand enough of what's going on to formulate a concrete plan. In particular, why would that type be serialized and then deserialized again in the first place? And isn't there logic in place which determines during (de)serialization that the type is in fact already in the system, and just uses a pointer to the existing type? I imagine you'd not want to have two copies of, say, type Int or Vector{Int} as a result of deserialization, so I'd assume there is some logic to deal with that?
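To make the "unique global identifier" idea above concrete, here is a minimal C sketch. It is not how Julia or GAP.jl handle foreign types; every name in it (`foreign_type_desc`, `register_foreign_type`, `serialize_foreign_type`, `deserialize_foreign_type`, the no-op callbacks) is hypothetical. The only point is that the cache would store a stable string, and loading would resolve that string to the descriptor already registered in the running process, so no function pointer is ever written out and no second copy of the type is created.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the C callbacks a foreign type carries. */
typedef void (*mark_fn)(void *obj);
typedef void (*sweep_fn)(void *obj);

typedef struct {
    const char *uid;   /* stable identifier; must outlive the registry */
    mark_fn mark;
    sweep_fn sweep;
} foreign_type_desc;

#define MAX_FOREIGN_TYPES 16
static foreign_type_desc registry[MAX_FOREIGN_TYPES];
static size_t registry_len = 0;

/* Register a type once per process. Returns 0 on success, -1 if the
   registry is full or the identifier is already taken. */
int register_foreign_type(const char *uid, mark_fn mark, sweep_fn sweep)
{
    assert(uid != NULL);
    if (registry_len == MAX_FOREIGN_TYPES)
        return -1;
    for (size_t i = 0; i < registry_len; i++) {
        if (strcmp(registry[i].uid, uid) == 0)
            return -1;
    }
    registry[registry_len].uid = uid;
    registry[registry_len].mark = mark;
    registry[registry_len].sweep = sweep;
    registry_len++;
    return 0;
}

/* "Serialization": only the identifier is emitted, never the pointers. */
const char *serialize_foreign_type(const foreign_type_desc *t)
{
    return t->uid;
}

/* "Deserialization": resolve the identifier to the descriptor that is
   already live in this process; NULL if nobody registered it. */
const foreign_type_desc *deserialize_foreign_type(const char *uid)
{
    for (size_t i = 0; i < registry_len; i++) {
        if (strcmp(registry[i].uid, uid) == 0)
            return &registry[i];
    }
    return NULL;
}

/* No-op callbacks so the sketch can be exercised. */
void noop_mark(void *obj) { (void)obj; }
void noop_sweep(void *obj) { (void)obj; }
```

With this shape, a round trip through `serialize_foreign_type` and `deserialize_foreign_type` hands back the same descriptor pointer, which is the "just use a pointer to the existing type" behaviour asked about above.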

Sorry for all the dumb questions.

@fingolfin
Contributor

Inserted an assertion into jl_serialize_datatype to learn more about when and where the (unique!) foreign type is serialized. Of course (obviously, in hindsight) it already happens when precompiling GAP.jl... And thus it is presumably deserialized when importing it. But so far this never caused any issues.

@fingolfin
Contributor

@simeonschaub aha, I just noticed you posted a gdb backtrace for that "segfault" you got when loading GAP.jl. But that one is "normal" -- it is caused by GAP scanning the stack during garbage collection for references to objects. Since it only has imprecise information about the size of the stack, it may end up accessing guard pages at the end of the stack, which causes a segfault -- but those are caught and handled.

Unfortunately, to gdb they look like a regular segfault, and I don't know of a good way to deal with this in gdb -- one can tell it to ignore segfaults, but AFAIK there is no way to make this conditional (i.e. ignore SIGSEGV but only if we are in this or that function), so this makes using gdb to debug things a pain :-(
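The "caught and handled" pattern can be sketched in plain POSIX C. This is an illustration under stated assumptions, not GAP's or Julia's actual code: it uses `sigsetjmp`/`siglongjmp` directly, and the helpers `can_read` and `demo_guard_page` are hypothetical names made up for this example. A page is mapped with `PROT_NONE` to play the role of a stack guard page, and a read from it is recovered from instead of killing the process.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc in strict modes */
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Jump target for the handler; only meaningful while probing != 0. */
static sigjmp_buf probe_env;
static volatile sig_atomic_t probing = 0;

static void on_fault(int sig)
{
    if (probing)
        siglongjmp(probe_env, 1);   /* expected fault: unwind into the probe */
    signal(sig, SIG_DFL);           /* unexpected fault: die the normal way */
    raise(sig);
}

/* Returns 1 if one byte at addr can be read, 0 if the read faults. */
int can_read(const volatile char *addr)
{
    struct sigaction sa, old_segv, old_bus;
    volatile int ok = 0;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old_segv);
    sigaction(SIGBUS, &sa, &old_bus);   /* some platforms report SIGBUS */

    if (sigsetjmp(probe_env, 1) == 0) { /* 1: save and restore signal mask */
        probing = 1;
        (void)*addr;                    /* may fault */
        ok = 1;
    }
    probing = 0;

    sigaction(SIGSEGV, &old_segv, NULL);
    sigaction(SIGBUS, &old_bus, NULL);
    return ok;
}

/* Maps two pages, revokes access to the second, and probes both.
   Returns 1 if the first is readable and the second faults,
   0 if not, and -1 if the mapping could not be set up. */
int demo_guard_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    char *mem = mmap(NULL, 2 * (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;
    if (mprotect(mem + page, (size_t)page, PROT_NONE) != 0) {
        munmap(mem, 2 * (size_t)page);
        return -1;
    }

    int first = can_read(mem);
    int second = can_read(mem + page);
    munmap(mem, 2 * (size_t)page);
    return first == 1 && second == 0;
}
```

Run under gdb with default settings, the second probe would still stop the debugger on SIGSEGV even though the program recovers from it on its own, which is the annoyance described above.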

@simeonschaub
Member Author

I compiled the commit right before 7b1e454 from source. I'm on Manjaro Linux with the following versioninfo:

julia> versioninfo()
Julia Version 1.8.0-DEV.1464
Commit 942697f949* (2022-02-08 18:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 7 PRO 4750U with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver2)

> @simeonschaub aha, I just noticed you posted a gdb backtrace for that "segfault" you got when loading GAP.jl. But that one is "normal" -- it is caused by GAP scanning the stack during garbage collection for references to objects. Since it only has imprecise information about the size of the stack, it may end up accessing guard pages at the end of the stack, which causes a segfault -- but those are caught and handled.

Interesting, that's the first time I heard about "normal" segfaults! 😆 It does seem to be a literal null pointer though, I would have expected there to be some kind of check?

> Inserted an assertion into jl_serialize_datatype to learn more about when and where the (unique!) foreign type is serialized. Of course (obviously, in hindsight) it already happens when precompiling GAP.jl... And thus it is presumably deserialized when importing it. But so far this never caused any issues.

That could have changed after #43671, but that PR didn't actually use any typed globals yet, so the type field of a binding should only ever be Any. I assume nothing in Oscar already uses typed globals?

@fingolfin
Contributor

> Interesting, that's the first time I heard about "normal" segfaults! 😆

It's pretty much standard for conservative stack scanning GC. And doc/src/devdocs/debuggingtips.md says

> the [Julia] garbage collector uses SIGSEGV for threads synchronization

Although I don't know whether that's still true. In any case, Julia's C kernel has facilities to deal with SIGSEGV which we use for GAP.jl (specifically, jl_get_safe_restore and jl_set_safe_restore).

> It does seem to be a literal null pointer though, I would have expected there to be some kind of check?

Not sure what makes you say it's a null pointer? This is the code in question:

static void FindLiveRangeReverse(PtrArray * arr, void * start, void * end)
{
    if (lt_ptr(end, start)) {
        SWAP(void *, start, end);
    }
    char * p = (char *)(align_ptr(start));
    char * q = (char *)end - sizeof(void *);
    while (!lt_ptr(q, p)) {
        void * addr = *(void **)q;    // <- line 504
        if (addr && jl_gc_internal_obj_base_ptr(addr) == addr &&
            jl_typeis(addr, datatype_mptr)) {
            PtrArrayAdd(arr, addr);
        }
        q -= C_STACK_ALIGN;
    }
}

It segfaults dereferencing q, which is 0x7fff5aa07ffe in your gdb log.
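For readers who want to experiment with this kind of scan outside of GAP, here is a self-contained sketch of the same idea. It makes several simplifying assumptions that are not in the original: `heap_t` and `obj_base_ptr` are hypothetical stand-ins for the GC query `jl_gc_internal_obj_base_ptr`, the type check is dropped, the start address is aligned upwards, and the scan steps one pointer-sized word at a time rather than by `C_STACK_ALIGN`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "heap": a flat list of known object base addresses. */
typedef struct {
    void **objs;
    size_t n;
} heap_t;

/* Stand-in for the GC query: returns addr if it is the base address of
   a known object, NULL otherwise. */
static void *obj_base_ptr(const heap_t *h, void *addr)
{
    for (size_t i = 0; i < h->n; i++) {
        if (h->objs[i] == addr)
            return addr;
    }
    return NULL;
}

/* Scans the address range between start and end from the top down and
   copies every word that looks like an object pointer into found
   (at most cap entries). Returns the number of entries written. */
size_t scan_range_reverse(const heap_t *heap, void *start, void *end,
                          void **found, size_t cap)
{
    const uintptr_t step = sizeof(void *);
    uintptr_t s = (uintptr_t)start;
    uintptr_t e = (uintptr_t)end;

    if (e < s) {                        /* accept the bounds in either order */
        uintptr_t tmp = s;
        s = e;
        e = tmp;
    }
    uintptr_t p = (s + step - 1) & ~(step - 1);  /* align start upwards */
    if (p > e || e - p < step)
        return 0;                       /* no room for even one word */

    size_t count = 0;
    uintptr_t q = e - step;
    while (count < cap) {
        void *addr;
        memcpy(&addr, (const void *)q, sizeof addr);  /* word at q */
        if (addr != NULL && obj_base_ptr(heap, addr) == addr)
            found[count++] = addr;
        if (q - p < step)
            break;                      /* next step would pass the start */
        q -= step;
    }
    return count;
}
```

Unlike the real scanner, this sketch is only safe on a range that is known to be fully readable; when scanning an actual stack, the read of the word at `q` is exactly the spot where a guard page would fault.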

> I assume nothing in Oscar already uses typed globals?

Correct

@fingolfin
Contributor

This might be a total red herring, but: in the "real" backtraces, I see functions named jl_typemap_level_assoc_exact and jl_typemap_assoc_exact pop up. I know nothing about that code, but these function names to me sound as if they are dealing with a hashtable / associative map storing types. I wonder if our foreign type might get stored in there and this somehow causes a problem?! But why, and why now and not before?!?

Going back to jl_serialize_datatype, when I inserted my check to see whether and when "my" foreign type is serialized, I learned that it is dealt with as tag = 6; // external primary type. I mainly write this to remind myself if/when I need to dig further on this. It's probably useless for you, sorry.

Finally, full disclosure: current GAP.jl does something really evil, which I suspected for a time to be the cause of this, but I can remove this evilness and it still crashes: namely, it takes the type GAP_jll.MPtr and hacks it into GAP.GapObj by changing the parentmodule and typename. We've been doing this for a very long time without a hitch (the motivation: to get the type printed in a more user friendly fashion -- unfortunately all attempts to achieve this in a "clean" way, e.g. by setting const GapObj = GAP_jll.MPtr in module GAP and then installing a show method, did not pan out, because e.g. Vector{GAP.GapObj} would still show GAP_jll.MPtr, no matter what). But of course now these fields are made const, which was a clear signal this should not be done anymore. For now, the current GAP.jl release still contains this hack. But I've verified that the crash still happens if I disable it (by commenting out the line ccall((:OverrideTypeNameAndModule, JuliaInterface_path()), Cvoid, (Any, Any, Any), GapObj, GAP, :GapObj) in src/GAP.jl).

@KristofferC
Sponsor Member

I don't really think this should be on the milestone. The discussion here seems to be way beyond what is considered Julia's public interface. Of course, it would be good to fix, but it should not hold up a release.

@KristofferC KristofferC removed this from the 1.8 milestone Apr 8, 2022
@fingolfin
Contributor

The crash is gone now. I have no idea what change made it go away (maybe something we changed in GAP.jl or Oscar.jl). I tried all kinds of variations on both the 1.8 (release-1.8) and 1.9 (master) branches.

So from my POV this could be closed now.
