-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
load path #7
Comments
Can be done as part of the redesign discussed in issue #65. |
Yeah, I can start looking at how to do these two together. Would be a big combined win. |
old load() is now called include() this also makes it easy to work on #7 array constructor tweaks
This is already in |
Ah, ok. This is pretty easy now. I think should just have
and always use that to look for files that get loaded. People can just manipulate the contents of that array like any other array. It's pretty straightforward. Does seem reasonable to you? |
Closes JuliaLang#7 Closes JuliaLang#10 Closes JuliaLang#13
When calling `jl_error()` or `jl_errorf()`, we must check to see if we are so early in the bringup process that it is dangerous to attempt to construct a backtrace because the data structures used to provide line information are not properly setup. This can be easily triggered by running: ``` julia -C invalid ``` On an `i686-linux-gnu` build, this will hit the "Invalid CPU Name" branch in `jitlayers.cpp`, which calls `jl_errorf()`. This in turn calls `jl_throw()`, which will eventually call `jl_DI_for_fptr` as part of the backtrace printing process, which fails as the object maps are not fully initialized. See the below `gdb` stacktrace for details: ``` $ gdb -batch -ex 'r' -ex 'bt' --args ./julia -C invalid ... fatal: error thrown and no exception handler available. ErrorException("Invalid CPU name "invalid".") Thread 1 "julia" received signal SIGSEGV, Segmentation fault. 0xf75bd665 in std::_Rb_tree<unsigned int, std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo>, std::_Select1st<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> >, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__k=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h:1277 1277 /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h: No such file or directory. #0 0xf75bd665 in std::_Rb_tree<unsigned int, std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo>, std::_Select1st<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> >, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__k=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h:1277 #1 std::map<unsigned int, JITDebugInfoRegistry::ObjectInfo, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__x=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_map.h:1258 #2 jl_DI_for_fptr (fptr=4155049385, symsize=symsize@entry=0xffffcfa8, slide=slide@entry=0xffffcfa0, Section=Section@entry=0xffffcfb8, context=context@entry=0xffffcf94) at /cache/build/default-amdci5-4/julialang/julia-master/src/debuginfo.cpp:1181 #3 0xf75c056a in jl_getFunctionInfo_impl (frames_out=0xffffd03c, pointer=4155049385, skipC=0, noInline=0) at /cache/build/default-amdci5-4/julialang/julia-master/src/debuginfo.cpp:1210 #4 0xf7a6ca98 in jl_print_native_codeloc (ip=4155049385) at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:636 #5 0xf7a6cd54 in jl_print_bt_entry_codeloc (bt_entry=0xf0798018) at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:657 #6 jlbacktrace () at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:1090 #7 0xf7a3cd2b in ijl_no_exc_handler (e=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:605 #8 0xf7a3d10a in throw_internal (ct=ct@entry=0xf070c010, exception=<optimized out>, exception@entry=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:638 #9 0xf7a3d330 in ijl_throw (e=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:654 #10 0xf7a905aa in ijl_errorf (fmt=fmt@entry=0xf7647cd4 "Invalid CPU name \"%s\".") at /cache/build/default-amdci5-4/julialang/julia-master/src/rtutils.c:77 #11 0xf75a4b22 in (anonymous namespace)::createTargetMachine () at /cache/build/default-amdci5-4/julialang/julia-master/src/jitlayers.cpp:823 #12 JuliaOJIT::JuliaOJIT (this=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/jitlayers.cpp:1044 #13 0xf7531793 in jl_init_llvm () at /cache/build/default-amdci5-4/julialang/julia-master/src/codegen.cpp:8585 #14 0xf75318a8 in jl_init_codegen_impl () at /cache/build/default-amdci5-4/julialang/julia-master/src/codegen.cpp:8648 #15 0xf7a51a52 in jl_restore_system_image_from_stream (f=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2131 #16 0xf7a55c03 in ijl_restore_system_image_data (buf=0xe859c1c0 <jl_system_image_data> "8'\031\003", len=125161105) at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2184 #17 0xf7a55cf9 in jl_load_sysimg_so () at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:424 #18 ijl_restore_system_image (fname=0x80a0900 "/build/bk_download/julia-d78fdad601/lib/julia/sys.so") at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2157 #19 0xf7a3bdfc in _finish_julia_init (rel=rel@entry=JL_IMAGE_JULIA_HOME, ct=<optimized out>, ptls=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/init.c:741 #20 0xf7a3c8ac in julia_init (rel=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/init.c:728 #21 0xf7a7f61d in jl_repl_entrypoint (argc=<optimized out>, argv=0xffffddf4) at /cache/build/default-amdci5-4/julialang/julia-master/src/jlapi.c:705 #22 0x080490a7 in main (argc=3, argv=0xffffddf4) at /cache/build/default-amdci5-4/julialang/julia-master/cli/loader_exe.c:59 ``` To prevent this, we simply avoid calling `jl_errorf` this early in the process, punting the problem to a later PR that can update guard conditions within `jl_error*`.
When calling `jl_error()` or `jl_errorf()`, we must check to see if we are so early in the bringup process that it is dangerous to attempt to construct a backtrace because the data structures used to provide line information are not properly setup. This can be easily triggered by running: ``` julia -C invalid ``` On an `i686-linux-gnu` build, this will hit the "Invalid CPU Name" branch in `jitlayers.cpp`, which calls `jl_errorf()`. This in turn calls `jl_throw()`, which will eventually call `jl_DI_for_fptr` as part of the backtrace printing process, which fails as the object maps are not fully initialized. See the below `gdb` stacktrace for details: ``` $ gdb -batch -ex 'r' -ex 'bt' --args ./julia -C invalid ... fatal: error thrown and no exception handler available. ErrorException("Invalid CPU name "invalid".") Thread 1 "julia" received signal SIGSEGV, Segmentation fault. 0xf75bd665 in std::_Rb_tree<unsigned int, std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo>, std::_Select1st<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> >, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__k=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h:1277 1277 /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h: No such file or directory. #0 0xf75bd665 in std::_Rb_tree<unsigned int, std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo>, std::_Select1st<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> >, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__k=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h:1277 #1 std::map<unsigned int, JITDebugInfoRegistry::ObjectInfo, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__x=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_map.h:1258 #2 jl_DI_for_fptr (fptr=4155049385, symsize=symsize@entry=0xffffcfa8, slide=slide@entry=0xffffcfa0, Section=Section@entry=0xffffcfb8, context=context@entry=0xffffcf94) at /cache/build/default-amdci5-4/julialang/julia-master/src/debuginfo.cpp:1181 #3 0xf75c056a in jl_getFunctionInfo_impl (frames_out=0xffffd03c, pointer=4155049385, skipC=0, noInline=0) at /cache/build/default-amdci5-4/julialang/julia-master/src/debuginfo.cpp:1210 #4 0xf7a6ca98 in jl_print_native_codeloc (ip=4155049385) at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:636 #5 0xf7a6cd54 in jl_print_bt_entry_codeloc (bt_entry=0xf0798018) at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:657 #6 jlbacktrace () at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:1090 #7 0xf7a3cd2b in ijl_no_exc_handler (e=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:605 #8 0xf7a3d10a in throw_internal (ct=ct@entry=0xf070c010, exception=<optimized out>, exception@entry=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:638 #9 0xf7a3d330 in ijl_throw (e=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:654 #10 0xf7a905aa in ijl_errorf (fmt=fmt@entry=0xf7647cd4 "Invalid CPU name \"%s\".") at /cache/build/default-amdci5-4/julialang/julia-master/src/rtutils.c:77 #11 0xf75a4b22 in (anonymous namespace)::createTargetMachine () at /cache/build/default-amdci5-4/julialang/julia-master/src/jitlayers.cpp:823 #12 JuliaOJIT::JuliaOJIT (this=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/jitlayers.cpp:1044 #13 0xf7531793 in jl_init_llvm () at /cache/build/default-amdci5-4/julialang/julia-master/src/codegen.cpp:8585 #14 0xf75318a8 in jl_init_codegen_impl () at /cache/build/default-amdci5-4/julialang/julia-master/src/codegen.cpp:8648 #15 0xf7a51a52 in jl_restore_system_image_from_stream (f=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2131 #16 0xf7a55c03 in ijl_restore_system_image_data (buf=0xe859c1c0 <jl_system_image_data> "8'\031\003", len=125161105) at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2184 #17 0xf7a55cf9 in jl_load_sysimg_so () at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:424 #18 ijl_restore_system_image (fname=0x80a0900 "/build/bk_download/julia-d78fdad601/lib/julia/sys.so") at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2157 #19 0xf7a3bdfc in _finish_julia_init (rel=rel@entry=JL_IMAGE_JULIA_HOME, ct=<optimized out>, ptls=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/init.c:741 #20 0xf7a3c8ac in julia_init (rel=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/init.c:728 #21 0xf7a7f61d in jl_repl_entrypoint (argc=<optimized out>, argv=0xffffddf4) at /cache/build/default-amdci5-4/julialang/julia-master/src/jlapi.c:705 #22 0x080490a7 in main (argc=3, argv=0xffffddf4) at /cache/build/default-amdci5-4/julialang/julia-master/cli/loader_exe.c:59 ``` To prevent this, we simply avoid calling `jl_errorf` this early in the process, punting the problem to a later PR that can update guard conditions within `jl_error*`. (cherry picked from commit 21ab24e)
…Lang#45790) Currently the `@nospecialize`-d `push!(::Vector{Any}, ...)` can only take a single item and we will end up with runtime dispatch when we try to call it with multiple items: ```julia julia> code_typed(push!, (Vector{Any}, Any)) 1-element Vector{Any}: CodeInfo( 1 ─ $(Expr(:foreigncall, :(:jl_array_grow_end), Nothing, svec(Any, UInt64), 0, :(:ccall), Core.Argument(2), 0x0000000000000001, 0x0000000000000001))::Nothing │ %2 = Base.arraylen(a)::Int64 │ Base.arrayset(true, a, item, %2)::Vector{Any} └── return a ) => Vector{Any} julia> code_typed(push!, (Vector{Any}, Any, Any)) 1-element Vector{Any}: CodeInfo( 1 ─ %1 = Base.append!(a, iter)::Vector{Any} └── return %1 ) => Vector{Any} ``` This commit adds a new specialization that it can take arbitrary-length items. Our compiler should still be able to optimize the single-input case as before via the dispatch mechanism. ```julia julia> code_typed(push!, (Vector{Any}, Any)) 1-element Vector{Any}: CodeInfo( 1 ─ $(Expr(:foreigncall, :(:jl_array_grow_end), Nothing, svec(Any, UInt64), 0, :(:ccall), Core.Argument(2), 0x0000000000000001, 0x0000000000000001))::Nothing │ %2 = Base.arraylen(a)::Int64 │ Base.arrayset(true, a, item, %2)::Vector{Any} └── return a ) => Vector{Any} julia> code_typed(push!, (Vector{Any}, Any, Any)) 1-element Vector{Any}: CodeInfo( 1 ─ %1 = Base.arraylen(a)::Int64 │ $(Expr(:foreigncall, :(:jl_array_grow_end), Nothing, svec(Any, UInt64), 0, :(:ccall), Core.Argument(2), 0x0000000000000002, 0x0000000000000002))::Nothing └── goto JuliaLang#7 if not true 2 ┄ %4 = φ (JuliaLang#1 => 1, JuliaLang#6 => %14)::Int64 │ %5 = φ (JuliaLang#1 => 1, JuliaLang#6 => %15)::Int64 │ %6 = Base.getfield(x, %4, true)::Any │ %7 = Base.add_int(%1, %4)::Int64 │ Base.arrayset(true, a, %6, %7)::Vector{Any} │ %9 = (%5 === 2)::Bool └── goto JuliaLang#4 if not %9 3 ─ goto JuliaLang#5 4 ─ %12 = Base.add_int(%5, 1)::Int64 └── goto JuliaLang#5 5 ┄ %14 = φ (JuliaLang#4 => %12)::Int64 │ %15 = φ (JuliaLang#4 => %12)::Int64 │ %16 = φ (JuliaLang#3 => true, JuliaLang#4 => false)::Bool │ %17 = Base.not_int(%16)::Bool └── goto JuliaLang#7 if not %17 6 ─ goto JuliaLang#2 7 ┄ return a ) => Vector{Any} ``` This commit also adds the equivalent implementations for `pushfirst!`.
When calling `jl_error()` or `jl_errorf()`, we must check to see if we are so early in the bringup process that it is dangerous to attempt to construct a backtrace because the data structures used to provide line information are not properly setup. This can be easily triggered by running: ``` julia -C invalid ``` On an `i686-linux-gnu` build, this will hit the "Invalid CPU Name" branch in `jitlayers.cpp`, which calls `jl_errorf()`. This in turn calls `jl_throw()`, which will eventually call `jl_DI_for_fptr` as part of the backtrace printing process, which fails as the object maps are not fully initialized. See the below `gdb` stacktrace for details: ``` $ gdb -batch -ex 'r' -ex 'bt' --args ./julia -C invalid ... fatal: error thrown and no exception handler available. ErrorException("Invalid CPU name "invalid".") Thread 1 "julia" received signal SIGSEGV, Segmentation fault. 0xf75bd665 in std::_Rb_tree<unsigned int, std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo>, std::_Select1st<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> >, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__k=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h:1277 1277 /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h: No such file or directory. #0 0xf75bd665 in std::_Rb_tree<unsigned int, std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo>, std::_Select1st<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> >, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__k=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h:1277 JuliaLang#1 std::map<unsigned int, JITDebugInfoRegistry::ObjectInfo, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__x=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_map.h:1258 JuliaLang#2 jl_DI_for_fptr (fptr=4155049385, symsize=symsize@entry=0xffffcfa8, slide=slide@entry=0xffffcfa0, Section=Section@entry=0xffffcfb8, context=context@entry=0xffffcf94) at /cache/build/default-amdci5-4/julialang/julia-master/src/debuginfo.cpp:1181 JuliaLang#3 0xf75c056a in jl_getFunctionInfo_impl (frames_out=0xffffd03c, pointer=4155049385, skipC=0, noInline=0) at /cache/build/default-amdci5-4/julialang/julia-master/src/debuginfo.cpp:1210 JuliaLang#4 0xf7a6ca98 in jl_print_native_codeloc (ip=4155049385) at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:636 JuliaLang#5 0xf7a6cd54 in jl_print_bt_entry_codeloc (bt_entry=0xf0798018) at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:657 JuliaLang#6 jlbacktrace () at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:1090 JuliaLang#7 0xf7a3cd2b in ijl_no_exc_handler (e=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:605 JuliaLang#8 0xf7a3d10a in throw_internal (ct=ct@entry=0xf070c010, exception=<optimized out>, exception@entry=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:638 JuliaLang#9 0xf7a3d330 in ijl_throw (e=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:654 JuliaLang#10 0xf7a905aa in ijl_errorf (fmt=fmt@entry=0xf7647cd4 "Invalid CPU name \"%s\".") at /cache/build/default-amdci5-4/julialang/julia-master/src/rtutils.c:77 JuliaLang#11 0xf75a4b22 in (anonymous namespace)::createTargetMachine () at /cache/build/default-amdci5-4/julialang/julia-master/src/jitlayers.cpp:823 JuliaLang#12 JuliaOJIT::JuliaOJIT (this=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/jitlayers.cpp:1044 JuliaLang#13 0xf7531793 in jl_init_llvm () at /cache/build/default-amdci5-4/julialang/julia-master/src/codegen.cpp:8585 JuliaLang#14 0xf75318a8 in jl_init_codegen_impl () at /cache/build/default-amdci5-4/julialang/julia-master/src/codegen.cpp:8648 JuliaLang#15 0xf7a51a52 in jl_restore_system_image_from_stream (f=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2131 JuliaLang#16 0xf7a55c03 in ijl_restore_system_image_data (buf=0xe859c1c0 <jl_system_image_data> "8'\031\003", len=125161105) at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2184 JuliaLang#17 0xf7a55cf9 in jl_load_sysimg_so () at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:424 JuliaLang#18 ijl_restore_system_image (fname=0x80a0900 "/build/bk_download/julia-d78fdad601/lib/julia/sys.so") at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2157 JuliaLang#19 0xf7a3bdfc in _finish_julia_init (rel=rel@entry=JL_IMAGE_JULIA_HOME, ct=<optimized out>, ptls=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/init.c:741 JuliaLang#20 0xf7a3c8ac in julia_init (rel=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/init.c:728 JuliaLang#21 0xf7a7f61d in jl_repl_entrypoint (argc=<optimized out>, argv=0xffffddf4) at /cache/build/default-amdci5-4/julialang/julia-master/src/jlapi.c:705 JuliaLang#22 0x080490a7 in main (argc=3, argv=0xffffddf4) at /cache/build/default-amdci5-4/julialang/julia-master/cli/loader_exe.c:59 ``` To prevent this, we simply avoid calling `jl_errorf` this early in the process, punting the problem to a later PR that can update guard conditions within `jl_error*`.
This commit tries to fix and improve performance for calling keyword funcs whose arguments types are not fully known but `@nospecialize`-d. The final result would look like (this particular example is taken from our Julia-level compiler implementation): ```julia abstract type CallInfo end struct NoCallInfo <: CallInfo end struct NewInstruction stmt::Any type::Any info::CallInfo line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo), line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing}) return new(stmt, type, info, line, flag) end end @nospecialize function NewInstruction(newinst::NewInstruction; stmt=newinst.stmt, type=newinst.type, info::CallInfo=newinst.info, line::Union{Int32,Nothing}=newinst.line, flag::Union{UInt8,Nothing}=newinst.flag) return NewInstruction(stmt, type, info, line, flag) end @Specialize using BenchmarkTools struct VirtualKwargs stmt::Any type::Any info::CallInfo end vkws = VirtualKwargs(nothing, Any, NoCallInfo()) newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing) runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info) @benchmark runner($newinst, $vkws) ``` > on master ``` BenchmarkTools.Trial: 10000 samples with 186 evaluations. Range (min … max): 559.898 ns … 4.173 μs ┊ GC (min … max): 0.00% … 85.29% Time (median): 605.608 ns ┊ GC (median): 0.00% Time (mean ± σ): 638.170 ns ± 125.080 ns ┊ GC (mean ± σ): 0.06% ± 0.85% █▇▂▆▄ ▁█▇▄▂ ▂ ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █ 560 ns Histogram: log(frequency) by time 1.23 μs < Memory estimate: 32 bytes, allocs estimate: 2. ``` > on this commit ```julia BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 3.080 ns … 83.177 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 3.098 ns ┊ GC (median): 0.00% Time (mean ± σ): 3.118 ns ± 0.885 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▅▇█▆▅▄▂ ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃ 3.08 ns Histogram: frequency by time 3.19 ns < Memory estimate: 0 bytes, allocs estimate: 0. ``` So for this particular case it achieves roughly 200x speed up. This is because this commit allows inlining of a call to keyword sorter as well as removal of `NamedTuple` call. Especially this commit is composed of the following improvements: - add early return case for `structdiff`: This change improves the return type inference for a case when compared `NamedTuple`s are type unstable but there is no difference in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s. And in such case the optimizer will remove `structdiff` and succeeding `pairs` calls, letting the keyword sorter to be inlined. - add special SROA handling for `NamedTuple` generated by keyword sorter: With the change on `structdiff`, IR for a call with type-unstable keyword arguments after inlining would look like: ``` %1 = tuple(a, b, c)::Tuple{Any, Any, Any} %2 = NamedTuple{(:a, :b, :c)(%1)::NamedTuple{(:a, :b, :c), _A} where _A<:Tuple{Any, Any, Any} %3 = Core.getfield(%2, :a)::Any %4 = Core.getfield(%2, :b)::Any %5 = Core.getfield(%2, :c)::Any [... other body of the keyword func ...] ``` We can implement a bit hacky special handling within our SROA pass that checks if this definition (%2) is partly well-known `NamedTuple` construction, where its names are fully known, and also checks if its call argument (%1) is fully-known `tuple` call. In a case when the length of the `NamedTuple` names and the length of the arguments for the `tuple` call, we can safely replace those `getfield` calls with the corresponding `tuple` call argument, while letting the later DCE pass to delete the constructions of tuple and named-tuple altogether. With these changes, the IR for the example `NewInstruction` constructor is fairly optimized, like: ```julia julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info NewInstruction(newinst; stmt, type, info) end |> only 2 1 ── %1 = Base.getfield(_2, :line)::Union{Nothing, Int32} │╻╷ Type##kw │ %2 = Base.getfield(_2, :flag)::Union{Nothing, UInt8} ││┃ getproperty │ %3 = (isa)(%1, Nothing)::Bool ││ │ %4 = (isa)(%2, Nothing)::Bool ││ │ %5 = (Core.Intrinsics.and_int)(%3, %4)::Bool ││ └─── goto #3 if not %5 ││ 2 ── %7 = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction NewInstruction └─── goto #10 ││ 3 ── %9 = (isa)(%1, Int32)::Bool ││ │ %10 = (isa)(%2, Nothing)::Bool ││ │ %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool ││ └─── goto #5 if not %11 ││ 4 ── %13 = π (%1, Int32) ││ │ %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 5 ── %16 = (isa)(%1, Nothing)::Bool ││ │ %17 = (isa)(%2, UInt8)::Bool ││ │ %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool ││ └─── goto #7 if not %18 ││ 6 ── %20 = π (%2, UInt8) ││ │ %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 7 ── %23 = (isa)(%1, Int32)::Bool ││ │ %24 = (isa)(%2, UInt8)::Bool ││ │ %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool ││ └─── goto #9 if not %25 ││ 8 ── %27 = π (%1, Int32) ││ │ %28 = π (%2, UInt8) ││ │ %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction │││╻ NewInstruction └─── goto #10 ││ 9 ── Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{} └─── unreachable ││ 10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction ││ └─── goto #11 ││ 11 ─ return %33 │ => NewInstruction ```
This commit tries to fix and improve performance for calling keyword funcs whose arguments types are not fully known but `@nospecialize`-d. The final result would look like (this particular example is taken from our Julia-level compiler implementation): ```julia abstract type CallInfo end struct NoCallInfo <: CallInfo end struct NewInstruction stmt::Any type::Any info::CallInfo line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo), line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing}) return new(stmt, type, info, line, flag) end end @nospecialize function NewInstruction(newinst::NewInstruction; stmt=newinst.stmt, type=newinst.type, info::CallInfo=newinst.info, line::Union{Int32,Nothing}=newinst.line, flag::Union{UInt8,Nothing}=newinst.flag) return NewInstruction(stmt, type, info, line, flag) end @Specialize using BenchmarkTools struct VirtualKwargs stmt::Any type::Any info::CallInfo end vkws = VirtualKwargs(nothing, Any, NoCallInfo()) newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing) runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info) @benchmark runner($newinst, $vkws) ``` > on master ``` BenchmarkTools.Trial: 10000 samples with 186 evaluations. Range (min … max): 559.898 ns … 4.173 μs ┊ GC (min … max): 0.00% … 85.29% Time (median): 605.608 ns ┊ GC (median): 0.00% Time (mean ± σ): 638.170 ns ± 125.080 ns ┊ GC (mean ± σ): 0.06% ± 0.85% █▇▂▆▄ ▁█▇▄▂ ▂ ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █ 560 ns Histogram: log(frequency) by time 1.23 μs < Memory estimate: 32 bytes, allocs estimate: 2. ``` > on this commit ```julia BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 3.080 ns … 83.177 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 3.098 ns ┊ GC (median): 0.00% Time (mean ± σ): 3.118 ns ± 0.885 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▅▇█▆▅▄▂ ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃ 3.08 ns Histogram: frequency by time 3.19 ns < Memory estimate: 0 bytes, allocs estimate: 0. ``` So for this particular case it achieves roughly 200x speed up. This is because this commit allows inlining of a call to keyword sorter as well as removal of `NamedTuple` call. Especially this commit is composed of the following improvements: - add early return case for `structdiff`: This change improves the return type inference for a case when compared `NamedTuple`s are type unstable but there is no difference in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s. And in such case the optimizer will remove `structdiff` and succeeding `pairs` calls, letting the keyword sorter to be inlined. - add special SROA handling for `NamedTuple` generated by keyword sorter: With the change on `structdiff`, IR for a call with type-unstable keyword arguments after inlining would look like: ``` %1 = tuple(a, b, c)::Tuple{Any, Any, Any} %2 = NamedTuple{(:a, :b, :c)(%1)::NamedTuple{(:a, :b, :c), _A} where _A<:Tuple{Any, Any, Any} %3 = Core.getfield(%2, :a)::Any %4 = Core.getfield(%2, :b)::Any %5 = Core.getfield(%2, :c)::Any [... other body of the keyword func ...] ``` We can implement a bit hacky special handling within our SROA pass that checks if this definition (%2) is partly well-known `NamedTuple` construction, where its names are fully known, and also checks if its call argument (%1) is fully-known `tuple` call. In a case when the length of the `NamedTuple` names and the length of the arguments for the `tuple` call, we can safely replace those `getfield` calls with the corresponding `tuple` call argument, while letting the later DCE pass to delete the constructions of tuple and named-tuple altogether. With these changes, the IR for the example `NewInstruction` constructor is fairly optimized, like: ```julia julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info NewInstruction(newinst; stmt, type, info) end |> only 2 1 ── %1 = Base.getfield(_2, :line)::Union{Nothing, Int32} │╻╷ Type##kw │ %2 = Base.getfield(_2, :flag)::Union{Nothing, UInt8} ││┃ getproperty │ %3 = (isa)(%1, Nothing)::Bool ││ │ %4 = (isa)(%2, Nothing)::Bool ││ │ %5 = (Core.Intrinsics.and_int)(%3, %4)::Bool ││ └─── goto #3 if not %5 ││ 2 ── %7 = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction NewInstruction └─── goto #10 ││ 3 ── %9 = (isa)(%1, Int32)::Bool ││ │ %10 = (isa)(%2, Nothing)::Bool ││ │ %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool ││ └─── goto #5 if not %11 ││ 4 ── %13 = π (%1, Int32) ││ │ %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 5 ── %16 = (isa)(%1, Nothing)::Bool ││ │ %17 = (isa)(%2, UInt8)::Bool ││ │ %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool ││ └─── goto #7 if not %18 ││ 6 ── %20 = π (%2, UInt8) ││ │ %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 7 ── %23 = (isa)(%1, Int32)::Bool ││ │ %24 = (isa)(%2, UInt8)::Bool ││ │ %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool ││ └─── goto #9 if not %25 ││ 8 ── %27 = π (%1, Int32) ││ │ %28 = π (%2, UInt8) ││ │ %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction │││╻ NewInstruction └─── goto #10 ││ 9 ── Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{} └─── unreachable ││ 10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction ││ └─── goto #11 ││ 11 ─ return %33 │ => NewInstruction ```
This commit tries to fix and improve performance for calling keyword funcs whose arguments types are not fully known but `@nospecialize`-d. The final result would look like (this particular example is taken from our Julia-level compiler implementation): ```julia abstract type CallInfo end struct NoCallInfo <: CallInfo end struct NewInstruction stmt::Any type::Any info::CallInfo line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo), line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing}) return new(stmt, type, info, line, flag) end end @nospecialize function NewInstruction(newinst::NewInstruction; stmt=newinst.stmt, type=newinst.type, info::CallInfo=newinst.info, line::Union{Int32,Nothing}=newinst.line, flag::Union{UInt8,Nothing}=newinst.flag) return NewInstruction(stmt, type, info, line, flag) end @Specialize using BenchmarkTools struct VirtualKwargs stmt::Any type::Any info::CallInfo end vkws = VirtualKwargs(nothing, Any, NoCallInfo()) newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing) runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info) @benchmark runner($newinst, $vkws) ``` > on master ``` BenchmarkTools.Trial: 10000 samples with 186 evaluations. Range (min … max): 559.898 ns … 4.173 μs ┊ GC (min … max): 0.00% … 85.29% Time (median): 605.608 ns ┊ GC (median): 0.00% Time (mean ± σ): 638.170 ns ± 125.080 ns ┊ GC (mean ± σ): 0.06% ± 0.85% █▇▂▆▄ ▁█▇▄▂ ▂ ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █ 560 ns Histogram: log(frequency) by time 1.23 μs < Memory estimate: 32 bytes, allocs estimate: 2. ``` > on this commit ```julia BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 3.080 ns … 83.177 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 3.098 ns ┊ GC (median): 0.00% Time (mean ± σ): 3.118 ns ± 0.885 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▅▇█▆▅▄▂ ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃ 3.08 ns Histogram: frequency by time 3.19 ns < Memory estimate: 0 bytes, allocs estimate: 0. ``` So for this particular case it achieves roughly 200x speed up. This is because this commit allows inlining of a call to keyword sorter as well as removal of `NamedTuple` call. Especially this commit is composed of the following improvements: - add early return case for `structdiff`: This change improves the return type inference for a case when compared `NamedTuple`s are type unstable but there is no difference in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s. And in such case the optimizer will remove `structdiff` and succeeding `pairs` calls, letting the keyword sorter to be inlined. - add special SROA handling for `NamedTuple` generated by keyword sorter: With the change on `structdiff`, IR for a call with type-unstable keyword arguments after inlining would look like: ``` %1 = tuple(a, b, c)::Tuple{Any, Any, Any} %2 = NamedTuple{(:a, :b, :c)(%1)::NamedTuple{(:a, :b, :c), _A} where _A<:Tuple{Any, Any, Any} %3 = Core.getfield(%2, :a)::Any %4 = Core.getfield(%2, :b)::Any %5 = Core.getfield(%2, :c)::Any [... other body of the keyword func ...] ``` We can implement a bit hacky special handling within our SROA pass that checks if this definition (%2) is partly well-known `NamedTuple` construction, where its names are fully known, and also checks if its call argument (%1) is fully-known `tuple` call. In a case when the length of the `NamedTuple` names and the length of the arguments for the `tuple` call, we can safely replace those `getfield` calls with the corresponding `tuple` call argument, while letting the later DCE pass to delete the constructions of tuple and named-tuple altogether. With these changes, the IR for the example `NewInstruction` constructor is fairly optimized, like: ```julia julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info NewInstruction(newinst; stmt, type, info) end |> only 2 1 ── %1 = Base.getfield(_2, :line)::Union{Nothing, Int32} │╻╷ Type##kw │ %2 = Base.getfield(_2, :flag)::Union{Nothing, UInt8} ││┃ getproperty │ %3 = (isa)(%1, Nothing)::Bool ││ │ %4 = (isa)(%2, Nothing)::Bool ││ │ %5 = (Core.Intrinsics.and_int)(%3, %4)::Bool ││ └─── goto #3 if not %5 ││ 2 ── %7 = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction NewInstruction └─── goto #10 ││ 3 ── %9 = (isa)(%1, Int32)::Bool ││ │ %10 = (isa)(%2, Nothing)::Bool ││ │ %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool ││ └─── goto #5 if not %11 ││ 4 ── %13 = π (%1, Int32) ││ │ %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 5 ── %16 = (isa)(%1, Nothing)::Bool ││ │ %17 = (isa)(%2, UInt8)::Bool ││ │ %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool ││ └─── goto #7 if not %18 ││ 6 ── %20 = π (%2, UInt8) ││ │ %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 7 ── %23 = (isa)(%1, Int32)::Bool ││ │ %24 = (isa)(%2, UInt8)::Bool ││ │ %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool ││ └─── goto #9 if not %25 ││ 8 ── %27 = π (%1, Int32) ││ │ %28 = π (%2, UInt8) ││ │ %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction │││╻ NewInstruction └─── goto #10 ││ 9 ── Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{} └─── unreachable ││ 10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction ││ └─── goto #11 ││ 11 ─ return %33 │ => NewInstruction ```
This commit tries to fix and improve performance for calling keyword funcs whose arguments types are not fully known but `@nospecialize`-d. The final result would look like (this particular example is taken from our Julia-level compiler implementation): ```julia abstract type CallInfo end struct NoCallInfo <: CallInfo end struct NewInstruction stmt::Any type::Any info::CallInfo line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo), line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing}) return new(stmt, type, info, line, flag) end end @nospecialize function NewInstruction(newinst::NewInstruction; stmt=newinst.stmt, type=newinst.type, info::CallInfo=newinst.info, line::Union{Int32,Nothing}=newinst.line, flag::Union{UInt8,Nothing}=newinst.flag) return NewInstruction(stmt, type, info, line, flag) end @Specialize using BenchmarkTools struct VirtualKwargs stmt::Any type::Any info::CallInfo end vkws = VirtualKwargs(nothing, Any, NoCallInfo()) newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing) runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info) @benchmark runner($newinst, $vkws) ``` > on master ``` BenchmarkTools.Trial: 10000 samples with 186 evaluations. Range (min … max): 559.898 ns … 4.173 μs ┊ GC (min … max): 0.00% … 85.29% Time (median): 605.608 ns ┊ GC (median): 0.00% Time (mean ± σ): 638.170 ns ± 125.080 ns ┊ GC (mean ± σ): 0.06% ± 0.85% █▇▂▆▄ ▁█▇▄▂ ▂ ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █ 560 ns Histogram: log(frequency) by time 1.23 μs < Memory estimate: 32 bytes, allocs estimate: 2. ``` > on this commit ```julia BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 3.080 ns … 83.177 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 3.098 ns ┊ GC (median): 0.00% Time (mean ± σ): 3.118 ns ± 0.885 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▅▇█▆▅▄▂ ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃ 3.08 ns Histogram: frequency by time 3.19 ns < Memory estimate: 0 bytes, allocs estimate: 0. ``` So for this particular case it achieves roughly 200x speed up. This is because this commit allows inlining of a call to keyword sorter as well as removal of `NamedTuple` call. Especially this commit is composed of the following improvements: - Add early return case for `structdiff`: This change improves the return type inference for a case when compared `NamedTuple`s are type unstable but there is no difference in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s. And in such case the optimizer will remove `structdiff` and succeeding `pairs` calls, letting the keyword sorter to be inlined. - Tweak the core `NamedTuple{names}(args::Tuple)` constructor so that it directly forms `:splatnew` allocation rather than redirects to the general `NamedTuple` constructor, that could be confused for abstract input tuple type. - Improve `nfields_tfunc` accuracy as for abstract `NamedTuple` types. This improvement lets `inline_splatnew` to handle more abstract `NamedTuple`s, especially whose names are fully known but its fields tuple type is abstract. Those improvements are combined to allow our SROA pass to optimize away `NamedTuple` and `tuple` calls generated for keyword argument handling. E.g. the IR for the example `NewInstruction` constructor is now fairly optimized, like: ```julia julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info NewInstruction(newinst; stmt, type, info) end |> only 2 1 ── %1 = Base.getfield(_2, :line)::Union{Nothing, Int32} │╻╷ Type##kw │ %2 = Base.getfield(_2, :flag)::Union{Nothing, UInt8} ││┃ getproperty │ %3 = (isa)(%1, Nothing)::Bool ││ │ %4 = (isa)(%2, Nothing)::Bool ││ │ %5 = (Core.Intrinsics.and_int)(%3, %4)::Bool ││ └─── goto #3 if not %5 ││ 2 ── %7 = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction NewInstruction └─── goto #10 ││ 3 ── %9 = (isa)(%1, Int32)::Bool ││ │ %10 = (isa)(%2, Nothing)::Bool ││ │ %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool ││ └─── goto #5 if not %11 ││ 4 ── %13 = π (%1, Int32) ││ │ %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 5 ── %16 = (isa)(%1, Nothing)::Bool ││ │ %17 = (isa)(%2, UInt8)::Bool ││ │ %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool ││ └─── goto #7 if not %18 ││ 6 ── %20 = π (%2, UInt8) ││ │ %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 7 ── %23 = (isa)(%1, Int32)::Bool ││ │ %24 = (isa)(%2, UInt8)::Bool ││ │ %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool ││ └─── goto #9 if not %25 ││ 8 ── %27 = π (%1, Int32) ││ │ %28 = π (%2, UInt8) ││ │ %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction │││╻ NewInstruction └─── goto #10 ││ 9 ── Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{} └─── unreachable ││ 10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction ││ └─── goto #11 ││ 11 ─ return %33 │ => NewInstruction ```
This commit tries to fix and improve performance for calling keyword funcs whose arguments types are not fully known but `@nospecialize`-d. The final result would look like (this particular example is taken from our Julia-level compiler implementation): ```julia abstract type CallInfo end struct NoCallInfo <: CallInfo end struct NewInstruction stmt::Any type::Any info::CallInfo line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo), line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing}) return new(stmt, type, info, line, flag) end end @nospecialize function NewInstruction(newinst::NewInstruction; stmt=newinst.stmt, type=newinst.type, info::CallInfo=newinst.info, line::Union{Int32,Nothing}=newinst.line, flag::Union{UInt8,Nothing}=newinst.flag) return NewInstruction(stmt, type, info, line, flag) end @Specialize using BenchmarkTools struct VirtualKwargs stmt::Any type::Any info::CallInfo end vkws = VirtualKwargs(nothing, Any, NoCallInfo()) newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing) runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info) @benchmark runner($newinst, $vkws) ``` > on master ``` BenchmarkTools.Trial: 10000 samples with 186 evaluations. Range (min … max): 559.898 ns … 4.173 μs ┊ GC (min … max): 0.00% … 85.29% Time (median): 605.608 ns ┊ GC (median): 0.00% Time (mean ± σ): 638.170 ns ± 125.080 ns ┊ GC (mean ± σ): 0.06% ± 0.85% █▇▂▆▄ ▁█▇▄▂ ▂ ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █ 560 ns Histogram: log(frequency) by time 1.23 μs < Memory estimate: 32 bytes, allocs estimate: 2. ``` > on this commit ```julia BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 3.080 ns … 83.177 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 3.098 ns ┊ GC (median): 0.00% Time (mean ± σ): 3.118 ns ± 0.885 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▅▇█▆▅▄▂ ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃ 3.08 ns Histogram: frequency by time 3.19 ns < Memory estimate: 0 bytes, allocs estimate: 0. ``` So for this particular case it achieves roughly 200x speed up. This is because this commit allows inlining of a call to keyword sorter as well as removal of `NamedTuple` call. Especially this commit is composed of the following improvements: - Add early return case for `structdiff`: This change improves the return type inference for a case when compared `NamedTuple`s are type unstable but there is no difference in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s. And in such case the optimizer will remove `structdiff` and succeeding `pairs` calls, letting the keyword sorter to be inlined. - Tweak the core `NamedTuple{names}(args::Tuple)` constructor so that it directly forms `:splatnew` allocation rather than redirects to the general `NamedTuple` constructor, that could be confused for abstract input tuple type. - Improve `nfields_tfunc` accuracy as for abstract `NamedTuple` types. This improvement lets `inline_splatnew` to handle more abstract `NamedTuple`s, especially whose names are fully known but its fields tuple type is abstract. Those improvements are combined to allow our SROA pass to optimize away `NamedTuple` and `tuple` calls generated for keyword argument handling. E.g. the IR for the example `NewInstruction` constructor is now fairly optimized, like: ```julia julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info NewInstruction(newinst; stmt, type, info) end |> only 2 1 ── %1 = Base.getfield(_2, :line)::Union{Nothing, Int32} │╻╷ Type##kw │ %2 = Base.getfield(_2, :flag)::Union{Nothing, UInt8} ││┃ getproperty │ %3 = (isa)(%1, Nothing)::Bool ││ │ %4 = (isa)(%2, Nothing)::Bool ││ │ %5 = (Core.Intrinsics.and_int)(%3, %4)::Bool ││ └─── goto #3 if not %5 ││ 2 ── %7 = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction NewInstruction └─── goto #10 ││ 3 ── %9 = (isa)(%1, Int32)::Bool ││ │ %10 = (isa)(%2, Nothing)::Bool ││ │ %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool ││ └─── goto #5 if not %11 ││ 4 ── %13 = π (%1, Int32) ││ │ %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 5 ── %16 = (isa)(%1, Nothing)::Bool ││ │ %17 = (isa)(%2, UInt8)::Bool ││ │ %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool ││ └─── goto #7 if not %18 ││ 6 ── %20 = π (%2, UInt8) ││ │ %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 7 ── %23 = (isa)(%1, Int32)::Bool ││ │ %24 = (isa)(%2, UInt8)::Bool ││ │ %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool ││ └─── goto #9 if not %25 ││ 8 ── %27 = π (%1, Int32) ││ │ %28 = π (%2, UInt8) ││ │ %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction │││╻ NewInstruction └─── goto #10 ││ 9 ── Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{} └─── unreachable ││ 10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction ││ └─── goto #11 ││ 11 ─ return %33 │ => NewInstruction ```
This commit tries to fix and improve performance for calling keyword funcs whose arguments types are not fully known but `@nospecialize`-d. The final result would look like (this particular example is taken from our Julia-level compiler implementation): ```julia abstract type CallInfo end struct NoCallInfo <: CallInfo end struct NewInstruction stmt::Any type::Any info::CallInfo line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo), line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing}) return new(stmt, type, info, line, flag) end end @nospecialize function NewInstruction(newinst::NewInstruction; stmt=newinst.stmt, type=newinst.type, info::CallInfo=newinst.info, line::Union{Int32,Nothing}=newinst.line, flag::Union{UInt8,Nothing}=newinst.flag) return NewInstruction(stmt, type, info, line, flag) end @Specialize using BenchmarkTools struct VirtualKwargs stmt::Any type::Any info::CallInfo end vkws = VirtualKwargs(nothing, Any, NoCallInfo()) newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing) runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info) @benchmark runner($newinst, $vkws) ``` > on master ``` BenchmarkTools.Trial: 10000 samples with 186 evaluations. Range (min … max): 559.898 ns … 4.173 μs ┊ GC (min … max): 0.00% … 85.29% Time (median): 605.608 ns ┊ GC (median): 0.00% Time (mean ± σ): 638.170 ns ± 125.080 ns ┊ GC (mean ± σ): 0.06% ± 0.85% █▇▂▆▄ ▁█▇▄▂ ▂ ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █ 560 ns Histogram: log(frequency) by time 1.23 μs < Memory estimate: 32 bytes, allocs estimate: 2. ``` > on this commit ```julia BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 3.080 ns … 83.177 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 3.098 ns ┊ GC (median): 0.00% Time (mean ± σ): 3.118 ns ± 0.885 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▅▇█▆▅▄▂ ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃ 3.08 ns Histogram: frequency by time 3.19 ns < Memory estimate: 0 bytes, allocs estimate: 0. ``` So for this particular case it achieves roughly 200x speed up. This is because this commit allows inlining of a call to keyword sorter as well as removal of `NamedTuple` call. Especially this commit is composed of the following improvements: - Add early return case for `structdiff`: This change improves the return type inference for a case when compared `NamedTuple`s are type unstable but there is no difference in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s. And in such case the optimizer will remove `structdiff` and succeeding `pairs` calls, letting the keyword sorter to be inlined. - Tweak the core `NamedTuple{names}(args::Tuple)` constructor so that it directly forms `:splatnew` allocation rather than redirects to the general `NamedTuple` constructor, that could be confused for abstract input tuple type. - Improve `nfields_tfunc` accuracy as for abstract `NamedTuple` types. This improvement lets `inline_splatnew` to handle more abstract `NamedTuple`s, especially whose names are fully known but its fields tuple type is abstract. Those improvements are combined to allow our SROA pass to optimize away `NamedTuple` and `tuple` calls generated for keyword argument handling. E.g. the IR for the example `NewInstruction` constructor is now fairly optimized, like: ```julia julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info NewInstruction(newinst; stmt, type, info) end |> only 2 1 ── %1 = Base.getfield(_2, :line)::Union{Nothing, Int32} │╻╷ Type##kw │ %2 = Base.getfield(_2, :flag)::Union{Nothing, UInt8} ││┃ getproperty │ %3 = (isa)(%1, Nothing)::Bool ││ │ %4 = (isa)(%2, Nothing)::Bool ││ │ %5 = (Core.Intrinsics.and_int)(%3, %4)::Bool ││ └─── goto #3 if not %5 ││ 2 ── %7 = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction NewInstruction └─── goto #10 ││ 3 ── %9 = (isa)(%1, Int32)::Bool ││ │ %10 = (isa)(%2, Nothing)::Bool ││ │ %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool ││ └─── goto #5 if not %11 ││ 4 ── %13 = π (%1, Int32) ││ │ %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 5 ── %16 = (isa)(%1, Nothing)::Bool ││ │ %17 = (isa)(%2, UInt8)::Bool ││ │ %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool ││ └─── goto #7 if not %18 ││ 6 ── %20 = π (%2, UInt8) ││ │ %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 7 ── %23 = (isa)(%1, Int32)::Bool ││ │ %24 = (isa)(%2, UInt8)::Bool ││ │ %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool ││ └─── goto #9 if not %25 ││ 8 ── %27 = π (%1, Int32) ││ │ %28 = π (%2, UInt8) ││ │ %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction │││╻ NewInstruction └─── goto #10 ││ 9 ── Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{} └─── unreachable ││ 10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction ││ └─── goto #11 ││ 11 ─ return %33 │ => NewInstruction ```
This commit tries to fix and improve performance for calling keyword funcs whose arguments types are not fully known but `@nospecialize`-d. The final result would look like (this particular example is taken from our Julia-level compiler implementation): ```julia abstract type CallInfo end struct NoCallInfo <: CallInfo end struct NewInstruction stmt::Any type::Any info::CallInfo line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo), line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing}) return new(stmt, type, info, line, flag) end end @nospecialize function NewInstruction(newinst::NewInstruction; stmt=newinst.stmt, type=newinst.type, info::CallInfo=newinst.info, line::Union{Int32,Nothing}=newinst.line, flag::Union{UInt8,Nothing}=newinst.flag) return NewInstruction(stmt, type, info, line, flag) end @Specialize using BenchmarkTools struct VirtualKwargs stmt::Any type::Any info::CallInfo end vkws = VirtualKwargs(nothing, Any, NoCallInfo()) newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing) runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info) @benchmark runner($newinst, $vkws) ``` > on master ``` BenchmarkTools.Trial: 10000 samples with 186 evaluations. Range (min … max): 559.898 ns … 4.173 μs ┊ GC (min … max): 0.00% … 85.29% Time (median): 605.608 ns ┊ GC (median): 0.00% Time (mean ± σ): 638.170 ns ± 125.080 ns ┊ GC (mean ± σ): 0.06% ± 0.85% █▇▂▆▄ ▁█▇▄▂ ▂ ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █ 560 ns Histogram: log(frequency) by time 1.23 μs < Memory estimate: 32 bytes, allocs estimate: 2. ``` > on this commit ```julia BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 3.080 ns … 83.177 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 3.098 ns ┊ GC (median): 0.00% Time (mean ± σ): 3.118 ns ± 0.885 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▅▇█▆▅▄▂ ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃ 3.08 ns Histogram: frequency by time 3.19 ns < Memory estimate: 0 bytes, allocs estimate: 0. ``` So for this particular case it achieves roughly 200x speed up. This is because this commit allows inlining of a call to keyword sorter as well as removal of `NamedTuple` call. Especially this commit is composed of the following improvements: - Add early return case for `structdiff`: This change improves the return type inference for a case when compared `NamedTuple`s are type unstable but there is no difference in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s. And in such case the optimizer will remove `structdiff` and succeeding `pairs` calls, letting the keyword sorter to be inlined. - Tweak the core `NamedTuple{names}(args::Tuple)` constructor so that it directly forms `:splatnew` allocation rather than redirects to the general `NamedTuple` constructor, that could be confused for abstract input tuple type. - Improve `nfields_tfunc` accuracy as for abstract `NamedTuple` types. This improvement lets `inline_splatnew` to handle more abstract `NamedTuple`s, especially whose names are fully known but its fields tuple type is abstract. Those improvements are combined to allow our SROA pass to optimize away `NamedTuple` and `tuple` calls generated for keyword argument handling. E.g. the IR for the example `NewInstruction` constructor is now fairly optimized, like: ```julia julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info NewInstruction(newinst; stmt, type, info) end |> only 2 1 ── %1 = Base.getfield(_2, :line)::Union{Nothing, Int32} │╻╷ Type##kw │ %2 = Base.getfield(_2, :flag)::Union{Nothing, UInt8} ││┃ getproperty │ %3 = (isa)(%1, Nothing)::Bool ││ │ %4 = (isa)(%2, Nothing)::Bool ││ │ %5 = (Core.Intrinsics.and_int)(%3, %4)::Bool ││ └─── goto #3 if not %5 ││ 2 ── %7 = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction NewInstruction └─── goto #10 ││ 3 ── %9 = (isa)(%1, Int32)::Bool ││ │ %10 = (isa)(%2, Nothing)::Bool ││ │ %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool ││ └─── goto #5 if not %11 ││ 4 ── %13 = π (%1, Int32) ││ │ %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 5 ── %16 = (isa)(%1, Nothing)::Bool ││ │ %17 = (isa)(%2, UInt8)::Bool ││ │ %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool ││ └─── goto #7 if not %18 ││ 6 ── %20 = π (%2, UInt8) ││ │ %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 7 ── %23 = (isa)(%1, Int32)::Bool ││ │ %24 = (isa)(%2, UInt8)::Bool ││ │ %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool ││ └─── goto #9 if not %25 ││ 8 ── %27 = π (%1, Int32) ││ │ %28 = π (%2, UInt8) ││ │ %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction │││╻ NewInstruction └─── goto #10 ││ 9 ── Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{} └─── unreachable ││ 10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction ││ └─── goto #11 ││ 11 ─ return %33 │ => NewInstruction ```
This commit tries to fix and improve performance for calling keyword funcs whose arguments types are not fully known but `@nospecialize`-d. The final result would look like (this particular example is taken from our Julia-level compiler implementation): ```julia abstract type CallInfo end struct NoCallInfo <: CallInfo end struct NewInstruction stmt::Any type::Any info::CallInfo line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo), line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing}) return new(stmt, type, info, line, flag) end end @nospecialize function NewInstruction(newinst::NewInstruction; stmt=newinst.stmt, type=newinst.type, info::CallInfo=newinst.info, line::Union{Int32,Nothing}=newinst.line, flag::Union{UInt8,Nothing}=newinst.flag) return NewInstruction(stmt, type, info, line, flag) end @Specialize using BenchmarkTools struct VirtualKwargs stmt::Any type::Any info::CallInfo end vkws = VirtualKwargs(nothing, Any, NoCallInfo()) newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing) runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info) @benchmark runner($newinst, $vkws) ``` > on master ``` BenchmarkTools.Trial: 10000 samples with 186 evaluations. Range (min … max): 559.898 ns … 4.173 μs ┊ GC (min … max): 0.00% … 85.29% Time (median): 605.608 ns ┊ GC (median): 0.00% Time (mean ± σ): 638.170 ns ± 125.080 ns ┊ GC (mean ± σ): 0.06% ± 0.85% █▇▂▆▄ ▁█▇▄▂ ▂ ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █ 560 ns Histogram: log(frequency) by time 1.23 μs < Memory estimate: 32 bytes, allocs estimate: 2. ``` > on this commit ```julia BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 3.080 ns … 83.177 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 3.098 ns ┊ GC (median): 0.00% Time (mean ± σ): 3.118 ns ± 0.885 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▅▇█▆▅▄▂ ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃ 3.08 ns Histogram: frequency by time 3.19 ns < Memory estimate: 0 bytes, allocs estimate: 0. ``` So for this particular case it achieves roughly 200x speed up. This is because this commit allows inlining of a call to keyword sorter as well as removal of `NamedTuple` call. Especially this commit is composed of the following improvements: - Add early return case for `structdiff`: This change improves the return type inference for a case when compared `NamedTuple`s are type unstable but there is no difference in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s. And in such case the optimizer will remove `structdiff` and succeeding `pairs` calls, letting the keyword sorter to be inlined. - Tweak the core `NamedTuple{names}(args::Tuple)` constructor so that it directly forms `:splatnew` allocation rather than redirects to the general `NamedTuple` constructor, that could be confused for abstract input tuple type. - Improve `nfields_tfunc` accuracy as for abstract `NamedTuple` types. This improvement lets `inline_splatnew` to handle more abstract `NamedTuple`s, especially whose names are fully known but its fields tuple type is abstract. Those improvements are combined to allow our SROA pass to optimize away `NamedTuple` and `tuple` calls generated for keyword argument handling. E.g. the IR for the example `NewInstruction` constructor is now fairly optimized, like: ```julia julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info NewInstruction(newinst; stmt, type, info) end |> only 2 1 ── %1 = Base.getfield(_2, :line)::Union{Nothing, Int32} │╻╷ Type##kw │ %2 = Base.getfield(_2, :flag)::Union{Nothing, UInt8} ││┃ getproperty │ %3 = (isa)(%1, Nothing)::Bool ││ │ %4 = (isa)(%2, Nothing)::Bool ││ │ %5 = (Core.Intrinsics.and_int)(%3, %4)::Bool ││ └─── goto #3 if not %5 ││ 2 ── %7 = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction NewInstruction └─── goto #10 ││ 3 ── %9 = (isa)(%1, Int32)::Bool ││ │ %10 = (isa)(%2, Nothing)::Bool ││ │ %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool ││ └─── goto #5 if not %11 ││ 4 ── %13 = π (%1, Int32) ││ │ %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 5 ── %16 = (isa)(%1, Nothing)::Bool ││ │ %17 = (isa)(%2, UInt8)::Bool ││ │ %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool ││ └─── goto #7 if not %18 ││ 6 ── %20 = π (%2, UInt8) ││ │ %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 7 ── %23 = (isa)(%1, Int32)::Bool ││ │ %24 = (isa)(%2, UInt8)::Bool ││ │ %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool ││ └─── goto #9 if not %25 ││ 8 ── %27 = π (%1, Int32) ││ │ %28 = π (%2, UInt8) ││ │ %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction │││╻ NewInstruction └─── goto #10 ││ 9 ── Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{} └─── unreachable ││ 10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction ││ └─── goto #11 ││ 11 ─ return %33 │ => NewInstruction ```
This makes it easier to correlate LLVM IR with the originating source code by including both argument name and argument type in the LLVM argument variable. <details> <summary>Example 1</summary> ```julia julia> function f(a, b, c, d, g...) e = a + b + c + d f = does_not_exist(e) + e f end f (generic function with 1 method) julia> @code_llvm f(0,0,0,0,0) ``` ```llvm ; @ REPL[1]:1 within `f` define nonnull {}* @julia_f_141(i64 signext %"a::Int64", i64 signext %"b::Int64", i64 signext %"c::Int64", i64 signext %"d::Int64", i64 signext %"g[0]::Int64") #0 { top: %0 = alloca [2 x {}*], align 8 %gcframe3 = alloca [4 x {}*], align 16 %gcframe3.sub = getelementptr inbounds [4 x {}*], [4 x {}*]* %gcframe3, i64 0, i64 0 %1 = bitcast [4 x {}*]* %gcframe3 to i8* call void @llvm.memset.p0i8.i64(i8* align 16 %1, i8 0, i64 32, i1 true) %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"() #7 %tls_ppgcstack = getelementptr i8, i8* %thread_ptr, i64 -8 %2 = bitcast i8* %tls_ppgcstack to {}**** %tls_pgcstack = load {}***, {}**** %2, align 8 ; @ REPL[1]:3 within `f` %3 = bitcast [4 x {}*]* %gcframe3 to i64* store i64 8, i64* %3, align 16 %4 = getelementptr inbounds [4 x {}*], [4 x {}*]* %gcframe3, i64 0, i64 1 %5 = bitcast {}** %4 to {}*** %6 = load {}**, {}*** %tls_pgcstack, align 8 store {}** %6, {}*** %5, align 8 %7 = bitcast {}*** %tls_pgcstack to {}*** store {}** %gcframe3.sub, {}*** %7, align 8 %Main.does_not_exist.cached = load atomic {}*, {}** @0 unordered, align 8 %iscached.not = icmp eq {}* %Main.does_not_exist.cached, null br i1 %iscached.not, label %notfound, label %found notfound: ; preds = %top %Main.does_not_exist.found = call {}* @ijl_get_binding_or_error({}* nonnull inttoptr (i64 139831437630272 to {}*), {}* nonnull inttoptr (i64 139831600565400 to {}*)) store atomic {}* %Main.does_not_exist.found, {}** @0 release, align 8 br label %found found: ; preds = %notfound, %top %Main.does_not_exist = phi {}* [ %Main.does_not_exist.cached, %top ], [ %Main.does_not_exist.found, %notfound ] %8 = bitcast {}* %Main.does_not_exist to {}** %does_not_exist.checked = load atomic {}*, {}** %8 unordered, align 8 %.not = icmp eq {}* %does_not_exist.checked, null br i1 %.not, label %err, label %ok err: ; preds = %found call void @ijl_undefined_var_error({}* inttoptr (i64 139831600565400 to {}*)) unreachable ok: ; preds = %found %.sub = getelementptr inbounds [2 x {}*], [2 x {}*]* %0, i64 0, i64 0 ; @ REPL[1]:2 within `f` ; ┌ @ operators.jl:587 within `+` @ int.jl:87 %9 = add i64 %"b::Int64", %"a::Int64" %10 = add i64 %9, %"c::Int64" ; │ @ operators.jl:587 within `+` ; │┌ @ operators.jl:544 within `afoldl` ; ││┌ @ int.jl:87 within `+` %11 = add i64 %10, %"d::Int64" %12 = getelementptr inbounds [4 x {}*], [4 x {}*]* %gcframe3, i64 0, i64 3 store {}* %does_not_exist.checked, {}** %12, align 8 ; └└└ ; @ REPL[1]:3 within `f` %13 = call nonnull {}* @ijl_box_int64(i64 signext %11) %14 = getelementptr inbounds [4 x {}*], [4 x {}*]* %gcframe3, i64 0, i64 2 store {}* %13, {}** %14, align 16 store {}* %13, {}** %.sub, align 8 %15 = call nonnull {}* @ijl_apply_generic({}* nonnull %does_not_exist.checked, {}** nonnull %.sub, i32 1) store {}* %15, {}** %12, align 8 %16 = call nonnull {}* @ijl_box_int64(i64 signext %11) store {}* %16, {}** %14, align 16 store {}* %15, {}** %.sub, align 8 %17 = getelementptr inbounds [2 x {}*], [2 x {}*]* %0, i64 0, i64 1 store {}* %16, {}** %17, align 8 %18 = call nonnull {}* @ijl_apply_generic({}* inttoptr (i64 139831370516384 to {}*), {}** nonnull %.sub, i32 2) %19 = load {}*, {}** %4, align 8 %20 = bitcast {}*** %tls_pgcstack to {}** store {}* %19, {}** %20, align 8 ; @ REPL[1]:4 within `f` ret {}* %18 } ``` </details> <details> <summary>Example 2</summary> ```julia julia> function g(a, b, c, d; kwarg=0) a + b + c + d + kwarg end g (generic function with 1 method) julia> @code_llvm g(0,0,0,0,kwarg=0) ``` ```llvm ; @ REPL[3]:1 within `g` define i64 @julia_g_160([1 x i64]* nocapture noundef nonnull readonly align 8 dereferenceable(8) %"#1::NamedTuple", i64 signext %"a::Int64", i64 signext %"b::Int64", i64 signext %"c::Int64", i64 signext %"d::Int64") #0 { top: %0 = getelementptr inbounds [1 x i64], [1 x i64]* %"#1::NamedTuple", i64 0, i64 0 ; ┌ @ REPL[3]:2 within `#g#1` ; │┌ @ operators.jl:587 within `+` @ int.jl:87 %1 = add i64 %"b::Int64", %"a::Int64" %2 = add i64 %1, %"c::Int64" ; ││ @ operators.jl:587 within `+` ; ││┌ @ operators.jl:544 within `afoldl` ; │││┌ @ int.jl:87 within `+` %3 = add i64 %2, %"d::Int64" ; │││└ ; │││ @ operators.jl:545 within `afoldl` ; │││┌ @ int.jl:87 within `+` %unbox = load i64, i64* %0, align 8 %4 = add i64 %3, %unbox ; └└└└ ret i64 %4 } ``` </details>
Pass the types to the allocator functions. ------- Before this PR, we were missing the types for allocations in two cases: 1. allocations from codegen 2. allocations in `gc_managed_realloc_` The second one is easy: those are always used for buffers, right? For the first one: we extend the allocation functions called from codegen, to take the type as a parameter, and set the tag there. I kept the old interfaces around, since I think that they cannot be removed due to supporting legacy code? ------ An example of the generated code: ```julia %ptls_field6 = getelementptr inbounds {}**, {}*** %4, i64 2 %13 = bitcast {}*** %ptls_field6 to i8** %ptls_load78 = load i8*, i8** %13, align 8 %box = call noalias nonnull dereferenceable(32) {}* @ijl_gc_pool_alloc_typed(i8* %ptls_load78, i32 1184, i32 32, i64 4366152144) #7 ``` Fixes #43688. Fixes #45268. Co-authored-by: Valentin Churavy <[email protected]>
Pass the types to the allocator functions. ------- Before this PR, we were missing the types for allocations in two cases: 1. allocations from codegen 2. allocations in `gc_managed_realloc_` The second one is easy: those are always used for buffers, right? For the first one: we extend the allocation functions called from codegen, to take the type as a parameter, and set the tag there. I kept the old interfaces around, since I think that they cannot be removed due to supporting legacy code? ------ An example of the generated code: ```julia %ptls_field6 = getelementptr inbounds {}**, {}*** %4, i64 2 %13 = bitcast {}*** %ptls_field6 to i8** %ptls_load78 = load i8*, i8** %13, align 8 %box = call noalias nonnull dereferenceable(32) {}* @ijl_gc_pool_alloc_typed(i8* %ptls_load78, i32 1184, i32 32, i64 4366152144) #7 ``` Fixes #43688. Fixes #45268. Co-authored-by: Valentin Churavy <[email protected]>
Backports PR #50337 for RAI julia v1.9.2 Original description: =================== Pass the types to the allocator functions. ------- Before this PR, we were missing the types for allocations in two cases: 1. allocations from codegen 2. allocations in `gc_managed_realloc_` The second one is easy: those are always used for buffers, right? For the first one: we extend the allocation functions called from codegen, to take the type as a parameter, and set the tag there. I kept the old interfaces around, since I think that they cannot be removed due to supporting legacy code? ------ An example of the generated code: ```julia %ptls_field6 = getelementptr inbounds {}**, {}*** %4, i64 2 %13 = bitcast {}*** %ptls_field6 to i8** %ptls_load78 = load i8*, i8** %13, align 8 %box = call noalias nonnull dereferenceable(32) {}* @ijl_gc_pool_alloc_typed(i8* %ptls_load78, i32 1184, i32 32, i64 4366152144) #7 ``` Fixes #43688. Fixes #45268. Co-authored-by: Valentin Churavy <[email protected]>
Fix `getproperty` calls and handle Core.Typeof(Vararg)
This is part of the work to address #51352 by attempting to allow the compiler to perform SRAO on persistent data structures like `PersistentDict` as if they were regular immutable data structures. These sorts of data structures have very complicated internals (with lots of mutation, memory sharing, etc.), but a relatively simple interface. As such, it is unlikely that our compiler will have sufficient power to optimize this interface by analyzing the implementation. We thus need to come up with some other mechanism that gives the compiler license to perform the requisite optimization. One way would be to just hardcode `PersistentDict` into the compiler, optimizing it like any of the other builtin datatypes. However, this is of course very unsatisfying. At the other end of the spectrum would be something like a generic rewrite rule system (e-graphs anyone?) that would let the PersistentDict implementation declare its interface to the compiler and the compiler would use this for optimization (in a perfect world, the actual rewrite would then be checked using some sort of formal methods). I think that would be interesting, but we're very far from even being able to design something like that (at least in Base - experiments with external AbstractInterpreters in this direction are encouraged). This PR tries to come up with a reasonable middle ground, where the compiler gets some knowledge of the protocol hardcoded without having to know about the implementation details of the data structure. The basic ideas is that `Core` provides some magic generic functions that implementations can extend. Semantically, they are not special. They dispatch as usual, and implementations are expected to work properly even in the absence of any compiler optimizations. However, the compiler is semantically permitted to perform structural optimization using these magic generic functions. In the concrete case, this PR introduces the `KeyValue` interface which consists of two generic functions, `get` and `set`. The core optimization is that the compiler is allowed to rewrite any occurrence of `get(set(x, k, v), k)` into `v` without additional legality checks. In particular, the compiler performs no type checks, conversions, etc. The higher level implementation code is expected to do all that. This approach closely matches the general direction we've been taking in external AbstractInterpreters for embedding additional semantics and optimization opportunities into Julia code (although we generally use methods there, rather than full generic functions), so I think we have some evidence that this sort of approach works reasonably well. Nevertheless, this is certainly an experiment and the interface is explicitly declared unstable. ## Current Status This is fully working and implemented, but the optimization currently bails on anything but the simplest cases. Filling all those cases in is not particularly hard, but should be done along with a more invasive refactoring of SROA, so we should figure out the general direction here first and then we can finish all that up in a follow-up cleanup. ## Obligatory benchmark Before: ``` julia> using BenchmarkTools julia> function foo() a = Base.PersistentDict(:a => 1) return a[:a] end foo (generic function with 1 method) julia> @benchmark foo() BenchmarkTools.Trial: 10000 samples with 993 evaluations. Range (min … max): 32.940 ns … 28.754 μs ┊ GC (min … max): 0.00% … 99.76% Time (median): 49.647 ns ┊ GC (median): 0.00% Time (mean ± σ): 57.519 ns ± 333.275 ns ┊ GC (mean ± σ): 10.81% ± 2.22% ▃█▅ ▁▃▅▅▃▁ ▁▃▂ ▂ ▁▂▄▃▅▇███▇▃▁▂▁▁▁▁▁▁▁▁▂▂▅██████▅▂▁▁▁▁▁▁▁▁▁▁▂▃▃▇███▇▆███▆▄▃▃▂▂ ▃ 32.9 ns Histogram: frequency by time 68.6 ns < Memory estimate: 128 bytes, allocs estimate: 4. julia> @code_typed foo() CodeInfo( 1 ─ %1 = invoke Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}(Base.HashArrayMappedTries.undef::UndefInitializer, 1::Int64)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %2 = %new(Base.HashArrayMappedTries.HAMT{Symbol, Int64}, %1, 0x00000000)::Base.HashArrayMappedTries.HAMT{Symbol, Int64} │ %3 = %new(Base.HashArrayMappedTries.Leaf{Symbol, Int64}, :a, 1)::Base.HashArrayMappedTries.Leaf{Symbol, Int64} │ %4 = Base.getfield(%2, :data)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %5 = $(Expr(:boundscheck, true))::Bool └── goto #5 if not %5 2 ─ %7 = Base.sub_int(1, 1)::Int64 │ %8 = Base.bitcast(UInt64, %7)::UInt64 │ %9 = Base.getfield(%4, :size)::Tuple{Int64} │ %10 = $(Expr(:boundscheck, true))::Bool │ %11 = Base.getfield(%9, 1, %10)::Int64 │ %12 = Base.bitcast(UInt64, %11)::UInt64 │ %13 = Base.ult_int(%8, %12)::Bool └── goto #4 if not %13 3 ─ goto #5 4 ─ %16 = Core.tuple(1)::Tuple{Int64} │ invoke Base.throw_boundserror(%4::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}, %16::Tuple{Int64})::Union{} └── unreachable 5 ┄ %19 = Base.getfield(%4, :ref)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %20 = Base.memoryref(%19, 1, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ Base.memoryrefset!(%20, %3, :not_atomic, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} └── goto #6 6 ─ %23 = Base.getfield(%2, :bitmap)::UInt32 │ %24 = Base.or_int(%23, 0x00010000)::UInt32 │ Base.setfield!(%2, :bitmap, %24)::UInt32 └── goto #7 7 ─ %27 = %new(Base.PersistentDict{Symbol, Int64}, %2)::Base.PersistentDict{Symbol, Int64} └── goto #8 8 ─ %29 = invoke Base.getindex(%27::Base.PersistentDict{Symbol, Int64},🅰️ :Symbol)::Int64 └── return %29 ``` After: ``` julia> using BenchmarkTools julia> function foo() a = Base.PersistentDict(:a => 1) return a[:a] end foo (generic function with 1 method) julia> @benchmark foo() BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 2.459 ns … 11.320 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 2.460 ns ┊ GC (median): 0.00% Time (mean ± σ): 2.469 ns ± 0.183 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂ █ ▁ █ ▂ █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█ █ 2.46 ns Histogram: log(frequency) by time 2.47 ns < Memory estimate: 0 bytes, allocs estimate: 0. julia> @code_typed foo() CodeInfo( 1 ─ return 1 ```
This is part of the work to address #51352 by attempting to allow the compiler to perform SRAO on persistent data structures like `PersistentDict` as if they were regular immutable data structures. These sorts of data structures have very complicated internals (with lots of mutation, memory sharing, etc.), but a relatively simple interface. As such, it is unlikely that our compiler will have sufficient power to optimize this interface by analyzing the implementation. We thus need to come up with some other mechanism that gives the compiler license to perform the requisite optimization. One way would be to just hardcode `PersistentDict` into the compiler, optimizing it like any of the other builtin datatypes. However, this is of course very unsatisfying. At the other end of the spectrum would be something like a generic rewrite rule system (e-graphs anyone?) that would let the PersistentDict implementation declare its interface to the compiler and the compiler would use this for optimization (in a perfect world, the actual rewrite would then be checked using some sort of formal methods). I think that would be interesting, but we're very far from even being able to design something like that (at least in Base - experiments with external AbstractInterpreters in this direction are encouraged). This PR tries to come up with a reasonable middle ground, where the compiler gets some knowledge of the protocol hardcoded without having to know about the implementation details of the data structure. The basic ideas is that `Core` provides some magic generic functions that implementations can extend. Semantically, they are not special. They dispatch as usual, and implementations are expected to work properly even in the absence of any compiler optimizations. However, the compiler is semantically permitted to perform structural optimization using these magic generic functions. In the concrete case, this PR introduces the `KeyValue` interface which consists of two generic functions, `get` and `set`. The core optimization is that the compiler is allowed to rewrite any occurrence of `get(set(x, k, v), k)` into `v` without additional legality checks. In particular, the compiler performs no type checks, conversions, etc. The higher level implementation code is expected to do all that. This approach closely matches the general direction we've been taking in external AbstractInterpreters for embedding additional semantics and optimization opportunities into Julia code (although we generally use methods there, rather than full generic functions), so I think we have some evidence that this sort of approach works reasonably well. Nevertheless, this is certainly an experiment and the interface is explicitly declared unstable. This is fully working and implemented, but the optimization currently bails on anything but the simplest cases. Filling all those cases in is not particularly hard, but should be done along with a more invasive refactoring of SROA, so we should figure out the general direction here first and then we can finish all that up in a follow-up cleanup. Before: ``` julia> using BenchmarkTools julia> function foo() a = Base.PersistentDict(:a => 1) return a[:a] end foo (generic function with 1 method) julia> @benchmark foo() BenchmarkTools.Trial: 10000 samples with 993 evaluations. Range (min … max): 32.940 ns … 28.754 μs ┊ GC (min … max): 0.00% … 99.76% Time (median): 49.647 ns ┊ GC (median): 0.00% Time (mean ± σ): 57.519 ns ± 333.275 ns ┊ GC (mean ± σ): 10.81% ± 2.22% ▃█▅ ▁▃▅▅▃▁ ▁▃▂ ▂ ▁▂▄▃▅▇███▇▃▁▂▁▁▁▁▁▁▁▁▂▂▅██████▅▂▁▁▁▁▁▁▁▁▁▁▂▃▃▇███▇▆███▆▄▃▃▂▂ ▃ 32.9 ns Histogram: frequency by time 68.6 ns < Memory estimate: 128 bytes, allocs estimate: 4. julia> @code_typed foo() CodeInfo( 1 ─ %1 = invoke Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}(Base.HashArrayMappedTries.undef::UndefInitializer, 1::Int64)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %2 = %new(Base.HashArrayMappedTries.HAMT{Symbol, Int64}, %1, 0x00000000)::Base.HashArrayMappedTries.HAMT{Symbol, Int64} │ %3 = %new(Base.HashArrayMappedTries.Leaf{Symbol, Int64}, :a, 1)::Base.HashArrayMappedTries.Leaf{Symbol, Int64} │ %4 = Base.getfield(%2, :data)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %5 = $(Expr(:boundscheck, true))::Bool └── goto #5 if not %5 2 ─ %7 = Base.sub_int(1, 1)::Int64 │ %8 = Base.bitcast(UInt64, %7)::UInt64 │ %9 = Base.getfield(%4, :size)::Tuple{Int64} │ %10 = $(Expr(:boundscheck, true))::Bool │ %11 = Base.getfield(%9, 1, %10)::Int64 │ %12 = Base.bitcast(UInt64, %11)::UInt64 │ %13 = Base.ult_int(%8, %12)::Bool └── goto #4 if not %13 3 ─ goto #5 4 ─ %16 = Core.tuple(1)::Tuple{Int64} │ invoke Base.throw_boundserror(%4::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}, %16::Tuple{Int64})::Union{} └── unreachable 5 ┄ %19 = Base.getfield(%4, :ref)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %20 = Base.memoryref(%19, 1, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ Base.memoryrefset!(%20, %3, :not_atomic, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} └── goto #6 6 ─ %23 = Base.getfield(%2, :bitmap)::UInt32 │ %24 = Base.or_int(%23, 0x00010000)::UInt32 │ Base.setfield!(%2, :bitmap, %24)::UInt32 └── goto #7 7 ─ %27 = %new(Base.PersistentDict{Symbol, Int64}, %2)::Base.PersistentDict{Symbol, Int64} └── goto #8 8 ─ %29 = invoke Base.getindex(%27::Base.PersistentDict{Symbol, Int64},🅰️ :Symbol)::Int64 └── return %29 ``` After: ``` julia> using BenchmarkTools julia> function foo() a = Base.PersistentDict(:a => 1) return a[:a] end foo (generic function with 1 method) julia> @benchmark foo() BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 2.459 ns … 11.320 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 2.460 ns ┊ GC (median): 0.00% Time (mean ± σ): 2.469 ns ± 0.183 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂ █ ▁ █ ▂ █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█ █ 2.46 ns Histogram: log(frequency) by time 2.47 ns < Memory estimate: 0 bytes, allocs estimate: 0. julia> @code_typed foo() CodeInfo( 1 ─ return 1 ```
This is part of the work to address #51352 by attempting to allow the compiler to perform SRAO on persistent data structures like `PersistentDict` as if they were regular immutable data structures. These sorts of data structures have very complicated internals (with lots of mutation, memory sharing, etc.), but a relatively simple interface. As such, it is unlikely that our compiler will have sufficient power to optimize this interface by analyzing the implementation. We thus need to come up with some other mechanism that gives the compiler license to perform the requisite optimization. One way would be to just hardcode `PersistentDict` into the compiler, optimizing it like any of the other builtin datatypes. However, this is of course very unsatisfying. At the other end of the spectrum would be something like a generic rewrite rule system (e-graphs anyone?) that would let the PersistentDict implementation declare its interface to the compiler and the compiler would use this for optimization (in a perfect world, the actual rewrite would then be checked using some sort of formal methods). I think that would be interesting, but we're very far from even being able to design something like that (at least in Base - experiments with external AbstractInterpreters in this direction are encouraged). This PR tries to come up with a reasonable middle ground, where the compiler gets some knowledge of the protocol hardcoded without having to know about the implementation details of the data structure. The basic ideas is that `Core` provides some magic generic functions that implementations can extend. Semantically, they are not special. They dispatch as usual, and implementations are expected to work properly even in the absence of any compiler optimizations. However, the compiler is semantically permitted to perform structural optimization using these magic generic functions. In the concrete case, this PR introduces the `KeyValue` interface which consists of two generic functions, `get` and `set`. The core optimization is that the compiler is allowed to rewrite any occurrence of `get(set(x, k, v), k)` into `v` without additional legality checks. In particular, the compiler performs no type checks, conversions, etc. The higher level implementation code is expected to do all that. This approach closely matches the general direction we've been taking in external AbstractInterpreters for embedding additional semantics and optimization opportunities into Julia code (although we generally use methods there, rather than full generic functions), so I think we have some evidence that this sort of approach works reasonably well. Nevertheless, this is certainly an experiment and the interface is explicitly declared unstable. This is fully working and implemented, but the optimization currently bails on anything but the simplest cases. Filling all those cases in is not particularly hard, but should be done along with a more invasive refactoring of SROA, so we should figure out the general direction here first and then we can finish all that up in a follow-up cleanup. Before: ``` julia> using BenchmarkTools julia> function foo() a = Base.PersistentDict(:a => 1) return a[:a] end foo (generic function with 1 method) julia> @benchmark foo() BenchmarkTools.Trial: 10000 samples with 993 evaluations. Range (min … max): 32.940 ns … 28.754 μs ┊ GC (min … max): 0.00% … 99.76% Time (median): 49.647 ns ┊ GC (median): 0.00% Time (mean ± σ): 57.519 ns ± 333.275 ns ┊ GC (mean ± σ): 10.81% ± 2.22% ▃█▅ ▁▃▅▅▃▁ ▁▃▂ ▂ ▁▂▄▃▅▇███▇▃▁▂▁▁▁▁▁▁▁▁▂▂▅██████▅▂▁▁▁▁▁▁▁▁▁▁▂▃▃▇███▇▆███▆▄▃▃▂▂ ▃ 32.9 ns Histogram: frequency by time 68.6 ns < Memory estimate: 128 bytes, allocs estimate: 4. julia> @code_typed foo() CodeInfo( 1 ─ %1 = invoke Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}(Base.HashArrayMappedTries.undef::UndefInitializer, 1::Int64)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %2 = %new(Base.HashArrayMappedTries.HAMT{Symbol, Int64}, %1, 0x00000000)::Base.HashArrayMappedTries.HAMT{Symbol, Int64} │ %3 = %new(Base.HashArrayMappedTries.Leaf{Symbol, Int64}, :a, 1)::Base.HashArrayMappedTries.Leaf{Symbol, Int64} │ %4 = Base.getfield(%2, :data)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %5 = $(Expr(:boundscheck, true))::Bool └── goto #5 if not %5 2 ─ %7 = Base.sub_int(1, 1)::Int64 │ %8 = Base.bitcast(UInt64, %7)::UInt64 │ %9 = Base.getfield(%4, :size)::Tuple{Int64} │ %10 = $(Expr(:boundscheck, true))::Bool │ %11 = Base.getfield(%9, 1, %10)::Int64 │ %12 = Base.bitcast(UInt64, %11)::UInt64 │ %13 = Base.ult_int(%8, %12)::Bool └── goto #4 if not %13 3 ─ goto #5 4 ─ %16 = Core.tuple(1)::Tuple{Int64} │ invoke Base.throw_boundserror(%4::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}, %16::Tuple{Int64})::Union{} └── unreachable 5 ┄ %19 = Base.getfield(%4, :ref)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %20 = Base.memoryref(%19, 1, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ Base.memoryrefset!(%20, %3, :not_atomic, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} └── goto #6 6 ─ %23 = Base.getfield(%2, :bitmap)::UInt32 │ %24 = Base.or_int(%23, 0x00010000)::UInt32 │ Base.setfield!(%2, :bitmap, %24)::UInt32 └── goto #7 7 ─ %27 = %new(Base.PersistentDict{Symbol, Int64}, %2)::Base.PersistentDict{Symbol, Int64} └── goto #8 8 ─ %29 = invoke Base.getindex(%27::Base.PersistentDict{Symbol, Int64},🅰️ :Symbol)::Int64 └── return %29 ``` After: ``` julia> using BenchmarkTools julia> function foo() a = Base.PersistentDict(:a => 1) return a[:a] end foo (generic function with 1 method) julia> @benchmark foo() BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 2.459 ns … 11.320 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 2.460 ns ┊ GC (median): 0.00% Time (mean ± σ): 2.469 ns ± 0.183 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂ █ ▁ █ ▂ █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█ █ 2.46 ns Histogram: log(frequency) by time 2.47 ns < Memory estimate: 0 bytes, allocs estimate: 0. julia> @code_typed foo() CodeInfo( 1 ─ return 1 ```
This is part of the work to address #51352 by attempting to allow the compiler to perform SRAO on persistent data structures like `PersistentDict` as if they were regular immutable data structures. These sorts of data structures have very complicated internals (with lots of mutation, memory sharing, etc.), but a relatively simple interface. As such, it is unlikely that our compiler will have sufficient power to optimize this interface by analyzing the implementation. We thus need to come up with some other mechanism that gives the compiler license to perform the requisite optimization. One way would be to just hardcode `PersistentDict` into the compiler, optimizing it like any of the other builtin datatypes. However, this is of course very unsatisfying. At the other end of the spectrum would be something like a generic rewrite rule system (e-graphs anyone?) that would let the PersistentDict implementation declare its interface to the compiler and the compiler would use this for optimization (in a perfect world, the actual rewrite would then be checked using some sort of formal methods). I think that would be interesting, but we're very far from even being able to design something like that (at least in Base - experiments with external AbstractInterpreters in this direction are encouraged). This PR tries to come up with a reasonable middle ground, where the compiler gets some knowledge of the protocol hardcoded without having to know about the implementation details of the data structure. The basic ideas is that `Core` provides some magic generic functions that implementations can extend. Semantically, they are not special. They dispatch as usual, and implementations are expected to work properly even in the absence of any compiler optimizations. However, the compiler is semantically permitted to perform structural optimization using these magic generic functions. In the concrete case, this PR introduces the `KeyValue` interface which consists of two generic functions, `get` and `set`. The core optimization is that the compiler is allowed to rewrite any occurrence of `get(set(x, k, v), k)` into `v` without additional legality checks. In particular, the compiler performs no type checks, conversions, etc. The higher level implementation code is expected to do all that. This approach closely matches the general direction we've been taking in external AbstractInterpreters for embedding additional semantics and optimization opportunities into Julia code (although we generally use methods there, rather than full generic functions), so I think we have some evidence that this sort of approach works reasonably well. Nevertheless, this is certainly an experiment and the interface is explicitly declared unstable. This is fully working and implemented, but the optimization currently bails on anything but the simplest cases. Filling all those cases in is not particularly hard, but should be done along with a more invasive refactoring of SROA, so we should figure out the general direction here first and then we can finish all that up in a follow-up cleanup. Before: ``` julia> using BenchmarkTools julia> function foo() a = Base.PersistentDict(:a => 1) return a[:a] end foo (generic function with 1 method) julia> @benchmark foo() BenchmarkTools.Trial: 10000 samples with 993 evaluations. Range (min … max): 32.940 ns … 28.754 μs ┊ GC (min … max): 0.00% … 99.76% Time (median): 49.647 ns ┊ GC (median): 0.00% Time (mean ± σ): 57.519 ns ± 333.275 ns ┊ GC (mean ± σ): 10.81% ± 2.22% ▃█▅ ▁▃▅▅▃▁ ▁▃▂ ▂ ▁▂▄▃▅▇███▇▃▁▂▁▁▁▁▁▁▁▁▂▂▅██████▅▂▁▁▁▁▁▁▁▁▁▁▂▃▃▇███▇▆███▆▄▃▃▂▂ ▃ 32.9 ns Histogram: frequency by time 68.6 ns < Memory estimate: 128 bytes, allocs estimate: 4. julia> @code_typed foo() CodeInfo( 1 ─ %1 = invoke Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}(Base.HashArrayMappedTries.undef::UndefInitializer, 1::Int64)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %2 = %new(Base.HashArrayMappedTries.HAMT{Symbol, Int64}, %1, 0x00000000)::Base.HashArrayMappedTries.HAMT{Symbol, Int64} │ %3 = %new(Base.HashArrayMappedTries.Leaf{Symbol, Int64}, :a, 1)::Base.HashArrayMappedTries.Leaf{Symbol, Int64} │ %4 = Base.getfield(%2, :data)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %5 = $(Expr(:boundscheck, true))::Bool └── goto #5 if not %5 2 ─ %7 = Base.sub_int(1, 1)::Int64 │ %8 = Base.bitcast(UInt64, %7)::UInt64 │ %9 = Base.getfield(%4, :size)::Tuple{Int64} │ %10 = $(Expr(:boundscheck, true))::Bool │ %11 = Base.getfield(%9, 1, %10)::Int64 │ %12 = Base.bitcast(UInt64, %11)::UInt64 │ %13 = Base.ult_int(%8, %12)::Bool └── goto #4 if not %13 3 ─ goto #5 4 ─ %16 = Core.tuple(1)::Tuple{Int64} │ invoke Base.throw_boundserror(%4::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}, %16::Tuple{Int64})::Union{} └── unreachable 5 ┄ %19 = Base.getfield(%4, :ref)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %20 = Base.memoryref(%19, 1, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ Base.memoryrefset!(%20, %3, :not_atomic, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} └── goto #6 6 ─ %23 = Base.getfield(%2, :bitmap)::UInt32 │ %24 = Base.or_int(%23, 0x00010000)::UInt32 │ Base.setfield!(%2, :bitmap, %24)::UInt32 └── goto #7 7 ─ %27 = %new(Base.PersistentDict{Symbol, Int64}, %2)::Base.PersistentDict{Symbol, Int64} └── goto #8 8 ─ %29 = invoke Base.getindex(%27::Base.PersistentDict{Symbol, Int64},🅰️ :Symbol)::Int64 └── return %29 ``` After: ``` julia> using BenchmarkTools julia> function foo() a = Base.PersistentDict(:a => 1) return a[:a] end foo (generic function with 1 method) julia> @benchmark foo() BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 2.459 ns … 11.320 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 2.460 ns ┊ GC (median): 0.00% Time (mean ± σ): 2.469 ns ± 0.183 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂ █ ▁ █ ▂ █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█ █ 2.46 ns Histogram: log(frequency) by time 2.47 ns < Memory estimate: 0 bytes, allocs estimate: 0. julia> @code_typed foo() CodeInfo( 1 ─ return 1 ```
This is part of the work to address #51352 by attempting to allow the compiler to perform SRAO on persistent data structures like `PersistentDict` as if they were regular immutable data structures. These sorts of data structures have very complicated internals (with lots of mutation, memory sharing, etc.), but a relatively simple interface. As such, it is unlikely that our compiler will have sufficient power to optimize this interface by analyzing the implementation. We thus need to come up with some other mechanism that gives the compiler license to perform the requisite optimization. One way would be to just hardcode `PersistentDict` into the compiler, optimizing it like any of the other builtin datatypes. However, this is of course very unsatisfying. At the other end of the spectrum would be something like a generic rewrite rule system (e-graphs anyone?) that would let the PersistentDict implementation declare its interface to the compiler and the compiler would use this for optimization (in a perfect world, the actual rewrite would then be checked using some sort of formal methods). I think that would be interesting, but we're very far from even being able to design something like that (at least in Base - experiments with external AbstractInterpreters in this direction are encouraged). This PR tries to come up with a reasonable middle ground, where the compiler gets some knowledge of the protocol hardcoded without having to know about the implementation details of the data structure. The basic ideas is that `Core` provides some magic generic functions that implementations can extend. Semantically, they are not special. They dispatch as usual, and implementations are expected to work properly even in the absence of any compiler optimizations. However, the compiler is semantically permitted to perform structural optimization using these magic generic functions. In the concrete case, this PR introduces the `KeyValue` interface which consists of two generic functions, `get` and `set`. The core optimization is that the compiler is allowed to rewrite any occurrence of `get(set(x, k, v), k)` into `v` without additional legality checks. In particular, the compiler performs no type checks, conversions, etc. The higher level implementation code is expected to do all that. This approach closely matches the general direction we've been taking in external AbstractInterpreters for embedding additional semantics and optimization opportunities into Julia code (although we generally use methods there, rather than full generic functions), so I think we have some evidence that this sort of approach works reasonably well. Nevertheless, this is certainly an experiment and the interface is explicitly declared unstable. ## Current Status This is fully working and implemented, but the optimization currently bails on anything but the simplest cases. Filling all those cases in is not particularly hard, but should be done along with a more invasive refactoring of SROA, so we should figure out the general direction here first and then we can finish all that up in a follow-up cleanup. ## Obligatory benchmark Before: ``` julia> using BenchmarkTools julia> function foo() a = Base.PersistentDict(:a => 1) return a[:a] end foo (generic function with 1 method) julia> @benchmark foo() BenchmarkTools.Trial: 10000 samples with 993 evaluations. Range (min … max): 32.940 ns … 28.754 μs ┊ GC (min … max): 0.00% … 99.76% Time (median): 49.647 ns ┊ GC (median): 0.00% Time (mean ± σ): 57.519 ns ± 333.275 ns ┊ GC (mean ± σ): 10.81% ± 2.22% ▃█▅ ▁▃▅▅▃▁ ▁▃▂ ▂ ▁▂▄▃▅▇███▇▃▁▂▁▁▁▁▁▁▁▁▂▂▅██████▅▂▁▁▁▁▁▁▁▁▁▁▂▃▃▇███▇▆███▆▄▃▃▂▂ ▃ 32.9 ns Histogram: frequency by time 68.6 ns < Memory estimate: 128 bytes, allocs estimate: 4. julia> @code_typed foo() CodeInfo( 1 ─ %1 = invoke Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}(Base.HashArrayMappedTries.undef::UndefInitializer, 1::Int64)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %2 = %new(Base.HashArrayMappedTries.HAMT{Symbol, Int64}, %1, 0x00000000)::Base.HashArrayMappedTries.HAMT{Symbol, Int64} │ %3 = %new(Base.HashArrayMappedTries.Leaf{Symbol, Int64}, :a, 1)::Base.HashArrayMappedTries.Leaf{Symbol, Int64} │ %4 = Base.getfield(%2, :data)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %5 = $(Expr(:boundscheck, true))::Bool └── goto #5 if not %5 2 ─ %7 = Base.sub_int(1, 1)::Int64 │ %8 = Base.bitcast(UInt64, %7)::UInt64 │ %9 = Base.getfield(%4, :size)::Tuple{Int64} │ %10 = $(Expr(:boundscheck, true))::Bool │ %11 = Base.getfield(%9, 1, %10)::Int64 │ %12 = Base.bitcast(UInt64, %11)::UInt64 │ %13 = Base.ult_int(%8, %12)::Bool └── goto #4 if not %13 3 ─ goto #5 4 ─ %16 = Core.tuple(1)::Tuple{Int64} │ invoke Base.throw_boundserror(%4::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}, %16::Tuple{Int64})::Union{} └── unreachable 5 ┄ %19 = Base.getfield(%4, :ref)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %20 = Base.memoryref(%19, 1, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ Base.memoryrefset!(%20, %3, :not_atomic, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} └── goto #6 6 ─ %23 = Base.getfield(%2, :bitmap)::UInt32 │ %24 = Base.or_int(%23, 0x00010000)::UInt32 │ Base.setfield!(%2, :bitmap, %24)::UInt32 └── goto #7 7 ─ %27 = %new(Base.PersistentDict{Symbol, Int64}, %2)::Base.PersistentDict{Symbol, Int64} └── goto #8 8 ─ %29 = invoke Base.getindex(%27::Base.PersistentDict{Symbol, Int64},🅰️ :Symbol)::Int64 └── return %29 ``` After: ``` julia> using BenchmarkTools julia> function foo() a = Base.PersistentDict(:a => 1) return a[:a] end foo (generic function with 1 method) julia> @benchmark foo() BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 2.459 ns … 11.320 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 2.460 ns ┊ GC (median): 0.00% Time (mean ± σ): 2.469 ns ± 0.183 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂ █ ▁ █ ▂ █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█ █ 2.46 ns Histogram: log(frequency) by time 2.47 ns < Memory estimate: 0 bytes, allocs estimate: 0. julia> @code_typed foo() CodeInfo( 1 ─ return 1 ```
This is part of the work to address JuliaLang#51352 by attempting to allow the compiler to perform SRAO on persistent data structures like `PersistentDict` as if they were regular immutable data structures. These sorts of data structures have very complicated internals (with lots of mutation, memory sharing, etc.), but a relatively simple interface. As such, it is unlikely that our compiler will have sufficient power to optimize this interface by analyzing the implementation. We thus need to come up with some other mechanism that gives the compiler license to perform the requisite optimization. One way would be to just hardcode `PersistentDict` into the compiler, optimizing it like any of the other builtin datatypes. However, this is of course very unsatisfying. At the other end of the spectrum would be something like a generic rewrite rule system (e-graphs anyone?) that would let the PersistentDict implementation declare its interface to the compiler and the compiler would use this for optimization (in a perfect world, the actual rewrite would then be checked using some sort of formal methods). I think that would be interesting, but we're very far from even being able to design something like that (at least in Base - experiments with external AbstractInterpreters in this direction are encouraged). This PR tries to come up with a reasonable middle ground, where the compiler gets some knowledge of the protocol hardcoded without having to know about the implementation details of the data structure. The basic ideas is that `Core` provides some magic generic functions that implementations can extend. Semantically, they are not special. They dispatch as usual, and implementations are expected to work properly even in the absence of any compiler optimizations. However, the compiler is semantically permitted to perform structural optimization using these magic generic functions. In the concrete case, this PR introduces the `KeyValue` interface which consists of two generic functions, `get` and `set`. The core optimization is that the compiler is allowed to rewrite any occurrence of `get(set(x, k, v), k)` into `v` without additional legality checks. In particular, the compiler performs no type checks, conversions, etc. The higher level implementation code is expected to do all that. This approach closely matches the general direction we've been taking in external AbstractInterpreters for embedding additional semantics and optimization opportunities into Julia code (although we generally use methods there, rather than full generic functions), so I think we have some evidence that this sort of approach works reasonably well. Nevertheless, this is certainly an experiment and the interface is explicitly declared unstable. ## Current Status This is fully working and implemented, but the optimization currently bails on anything but the simplest cases. Filling all those cases in is not particularly hard, but should be done along with a more invasive refactoring of SROA, so we should figure out the general direction here first and then we can finish all that up in a follow-up cleanup. ## Obligatory benchmark Before: ``` julia> using BenchmarkTools julia> function foo() a = Base.PersistentDict(:a => 1) return a[:a] end foo (generic function with 1 method) julia> @benchmark foo() BenchmarkTools.Trial: 10000 samples with 993 evaluations. Range (min … max): 32.940 ns … 28.754 μs ┊ GC (min … max): 0.00% … 99.76% Time (median): 49.647 ns ┊ GC (median): 0.00% Time (mean ± σ): 57.519 ns ± 333.275 ns ┊ GC (mean ± σ): 10.81% ± 2.22% ▃█▅ ▁▃▅▅▃▁ ▁▃▂ ▂ ▁▂▄▃▅▇███▇▃▁▂▁▁▁▁▁▁▁▁▂▂▅██████▅▂▁▁▁▁▁▁▁▁▁▁▂▃▃▇███▇▆███▆▄▃▃▂▂ ▃ 32.9 ns Histogram: frequency by time 68.6 ns < Memory estimate: 128 bytes, allocs estimate: 4. julia> @code_typed foo() CodeInfo( 1 ─ %1 = invoke Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}(Base.HashArrayMappedTries.undef::UndefInitializer, 1::Int64)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %2 = %new(Base.HashArrayMappedTries.HAMT{Symbol, Int64}, %1, 0x00000000)::Base.HashArrayMappedTries.HAMT{Symbol, Int64} │ %3 = %new(Base.HashArrayMappedTries.Leaf{Symbol, Int64}, :a, 1)::Base.HashArrayMappedTries.Leaf{Symbol, Int64} │ %4 = Base.getfield(%2, :data)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %5 = $(Expr(:boundscheck, true))::Bool └── goto JuliaLang#5 if not %5 2 ─ %7 = Base.sub_int(1, 1)::Int64 │ %8 = Base.bitcast(UInt64, %7)::UInt64 │ %9 = Base.getfield(%4, :size)::Tuple{Int64} │ %10 = $(Expr(:boundscheck, true))::Bool │ %11 = Base.getfield(%9, 1, %10)::Int64 │ %12 = Base.bitcast(UInt64, %11)::UInt64 │ %13 = Base.ult_int(%8, %12)::Bool └── goto JuliaLang#4 if not %13 3 ─ goto JuliaLang#5 4 ─ %16 = Core.tuple(1)::Tuple{Int64} │ invoke Base.throw_boundserror(%4::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}, %16::Tuple{Int64})::Union{} └── unreachable 5 ┄ %19 = Base.getfield(%4, :ref)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %20 = Base.memoryref(%19, 1, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ Base.memoryrefset!(%20, %3, :not_atomic, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} └── goto JuliaLang#6 6 ─ %23 = Base.getfield(%2, :bitmap)::UInt32 │ %24 = Base.or_int(%23, 0x00010000)::UInt32 │ Base.setfield!(%2, :bitmap, %24)::UInt32 └── goto JuliaLang#7 7 ─ %27 = %new(Base.PersistentDict{Symbol, Int64}, %2)::Base.PersistentDict{Symbol, Int64} └── goto JuliaLang#8 8 ─ %29 = invoke Base.getindex(%27::Base.PersistentDict{Symbol, Int64},🅰️ :Symbol)::Int64 └── return %29 ``` After: ``` julia> using BenchmarkTools julia> function foo() a = Base.PersistentDict(:a => 1) return a[:a] end foo (generic function with 1 method) julia> @benchmark foo() BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 2.459 ns … 11.320 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 2.460 ns ┊ GC (median): 0.00% Time (mean ± σ): 2.469 ns ± 0.183 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂ █ ▁ █ ▂ █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█ █ 2.46 ns Histogram: log(frequency) by time 2.47 ns < Memory estimate: 0 bytes, allocs estimate: 0. julia> @code_typed foo() CodeInfo( 1 ─ return 1 ```
Pass the types to the allocator functions. ------- Before this PR, we were missing the types for allocations in two cases: 1. allocations from codegen 2. allocations in `gc_managed_realloc_` The second one is easy: those are always used for buffers, right? For the first one: we extend the allocation functions called from codegen, to take the type as a parameter, and set the tag there. I kept the old interfaces around, since I think that they cannot be removed due to supporting legacy code? ------ An example of the generated code: ```julia %ptls_field6 = getelementptr inbounds {}**, {}*** %4, i64 2 %13 = bitcast {}*** %ptls_field6 to i8** %ptls_load78 = load i8*, i8** %13, align 8 %box = call noalias nonnull dereferenceable(32) {}* @ijl_gc_pool_alloc_typed(i8* %ptls_load78, i32 1184, i32 32, i64 4366152144) #7 ``` Fixes #43688. Fixes #45268. Co-authored-by: Valentin Churavy <[email protected]>
Currently we have a half-baked, implicit load path for
.j
files loaded via the builtinload()
function. It should be possible to add a list of directories to look in for.j
files. The rule should probably be that names without/
in them are looked up via the load path, while names beginning with/
are considered to be absolute and names with/
anywhere else are relative to the current directory.The text was updated successfully, but these errors were encountered: