load path #7

StefanKarpinski · 2011-04-27T05:43:54Z

Currently we have a half-baked, implicit load path for .j files loaded via the builtin load() function. It should be possible to add a list of directories to look in for .j files. The rule should probably be that names without / in them are looked up via the load path, while names beginning with / are considered to be absolute and names with / anywhere else are relative to the current directory.

The text was updated successfully, but these errors were encountered:

JeffBezanson · 2011-06-28T01:14:05Z

Can be done as part of the redesign discussed in issue #65.

StefanKarpinski · 2011-06-28T01:24:21Z

Yeah, I can start looking at how to do these two together. Would be a big combined win.

old load() is now called include() this also makes it easy to work on #7 array constructor tweaks

JeffBezanson · 2012-02-10T03:38:46Z

This is already in util.j, but there is no user interface to it. What should it be?

StefanKarpinski · 2012-02-10T04:13:11Z

Ah, ok. This is pretty easy now. I think should just have

const LOAD_PATH = String["", "$JULIA_HOME/", "$JULIA_HOME/j/"]

and always use that to look for files that get loaded. People can just manipulate the contents of that array like any other array. It's pretty straightforward. Does seem reasonable to you?

Closes JuliaLang#7 Closes JuliaLang#10 Closes JuliaLang#13

When calling `jl_error()` or `jl_errorf()`, we must check to see if we are so early in the bringup process that it is dangerous to attempt to construct a backtrace because the data structures used to provide line information are not properly setup. This can be easily triggered by running: ``` julia -C invalid ``` On an `i686-linux-gnu` build, this will hit the "Invalid CPU Name" branch in `jitlayers.cpp`, which calls `jl_errorf()`. This in turn calls `jl_throw()`, which will eventually call `jl_DI_for_fptr` as part of the backtrace printing process, which fails as the object maps are not fully initialized. See the below `gdb` stacktrace for details: ``` $ gdb -batch -ex 'r' -ex 'bt' --args ./julia -C invalid ... fatal: error thrown and no exception handler available. ErrorException("Invalid CPU name "invalid".") Thread 1 "julia" received signal SIGSEGV, Segmentation fault. 0xf75bd665 in std::_Rb_tree<unsigned int, std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo>, std::_Select1st<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> >, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__k=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h:1277 1277 /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h: No such file or directory. #0 0xf75bd665 in std::_Rb_tree<unsigned int, std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo>, std::_Select1st<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> >, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__k=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h:1277 #1 std::map<unsigned int, JITDebugInfoRegistry::ObjectInfo, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__x=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_map.h:1258 #2 jl_DI_for_fptr (fptr=4155049385, symsize=symsize@entry=0xffffcfa8, slide=slide@entry=0xffffcfa0, Section=Section@entry=0xffffcfb8, context=context@entry=0xffffcf94) at /cache/build/default-amdci5-4/julialang/julia-master/src/debuginfo.cpp:1181 #3 0xf75c056a in jl_getFunctionInfo_impl (frames_out=0xffffd03c, pointer=4155049385, skipC=0, noInline=0) at /cache/build/default-amdci5-4/julialang/julia-master/src/debuginfo.cpp:1210 #4 0xf7a6ca98 in jl_print_native_codeloc (ip=4155049385) at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:636 #5 0xf7a6cd54 in jl_print_bt_entry_codeloc (bt_entry=0xf0798018) at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:657 #6 jlbacktrace () at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:1090 #7 0xf7a3cd2b in ijl_no_exc_handler (e=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:605 #8 0xf7a3d10a in throw_internal (ct=ct@entry=0xf070c010, exception=<optimized out>, exception@entry=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:638 #9 0xf7a3d330 in ijl_throw (e=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:654 #10 0xf7a905aa in ijl_errorf (fmt=fmt@entry=0xf7647cd4 "Invalid CPU name \"%s\".") at /cache/build/default-amdci5-4/julialang/julia-master/src/rtutils.c:77 #11 0xf75a4b22 in (anonymous namespace)::createTargetMachine () at /cache/build/default-amdci5-4/julialang/julia-master/src/jitlayers.cpp:823 #12 JuliaOJIT::JuliaOJIT (this=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/jitlayers.cpp:1044 #13 0xf7531793 in jl_init_llvm () at /cache/build/default-amdci5-4/julialang/julia-master/src/codegen.cpp:8585 #14 0xf75318a8 in jl_init_codegen_impl () at /cache/build/default-amdci5-4/julialang/julia-master/src/codegen.cpp:8648 #15 0xf7a51a52 in jl_restore_system_image_from_stream (f=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2131 #16 0xf7a55c03 in ijl_restore_system_image_data (buf=0xe859c1c0 <jl_system_image_data> "8'\031\003", len=125161105) at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2184 #17 0xf7a55cf9 in jl_load_sysimg_so () at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:424 #18 ijl_restore_system_image (fname=0x80a0900 "/build/bk_download/julia-d78fdad601/lib/julia/sys.so") at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2157 #19 0xf7a3bdfc in _finish_julia_init (rel=rel@entry=JL_IMAGE_JULIA_HOME, ct=<optimized out>, ptls=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/init.c:741 #20 0xf7a3c8ac in julia_init (rel=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/init.c:728 #21 0xf7a7f61d in jl_repl_entrypoint (argc=<optimized out>, argv=0xffffddf4) at /cache/build/default-amdci5-4/julialang/julia-master/src/jlapi.c:705 #22 0x080490a7 in main (argc=3, argv=0xffffddf4) at /cache/build/default-amdci5-4/julialang/julia-master/cli/loader_exe.c:59 ``` To prevent this, we simply avoid calling `jl_errorf` this early in the process, punting the problem to a later PR that can update guard conditions within `jl_error*`.

When calling `jl_error()` or `jl_errorf()`, we must check to see if we are so early in the bringup process that it is dangerous to attempt to construct a backtrace because the data structures used to provide line information are not properly setup. This can be easily triggered by running: ``` julia -C invalid ``` On an `i686-linux-gnu` build, this will hit the "Invalid CPU Name" branch in `jitlayers.cpp`, which calls `jl_errorf()`. This in turn calls `jl_throw()`, which will eventually call `jl_DI_for_fptr` as part of the backtrace printing process, which fails as the object maps are not fully initialized. See the below `gdb` stacktrace for details: ``` $ gdb -batch -ex 'r' -ex 'bt' --args ./julia -C invalid ... fatal: error thrown and no exception handler available. ErrorException("Invalid CPU name "invalid".") Thread 1 "julia" received signal SIGSEGV, Segmentation fault. 0xf75bd665 in std::_Rb_tree<unsigned int, std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo>, std::_Select1st<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> >, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__k=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h:1277 1277 /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h: No such file or directory. #0 0xf75bd665 in std::_Rb_tree<unsigned int, std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo>, std::_Select1st<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> >, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__k=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h:1277 #1 std::map<unsigned int, JITDebugInfoRegistry::ObjectInfo, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__x=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_map.h:1258 #2 jl_DI_for_fptr (fptr=4155049385, symsize=symsize@entry=0xffffcfa8, slide=slide@entry=0xffffcfa0, Section=Section@entry=0xffffcfb8, context=context@entry=0xffffcf94) at /cache/build/default-amdci5-4/julialang/julia-master/src/debuginfo.cpp:1181 #3 0xf75c056a in jl_getFunctionInfo_impl (frames_out=0xffffd03c, pointer=4155049385, skipC=0, noInline=0) at /cache/build/default-amdci5-4/julialang/julia-master/src/debuginfo.cpp:1210 #4 0xf7a6ca98 in jl_print_native_codeloc (ip=4155049385) at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:636 #5 0xf7a6cd54 in jl_print_bt_entry_codeloc (bt_entry=0xf0798018) at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:657 #6 jlbacktrace () at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:1090 #7 0xf7a3cd2b in ijl_no_exc_handler (e=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:605 #8 0xf7a3d10a in throw_internal (ct=ct@entry=0xf070c010, exception=<optimized out>, exception@entry=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:638 #9 0xf7a3d330 in ijl_throw (e=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:654 #10 0xf7a905aa in ijl_errorf (fmt=fmt@entry=0xf7647cd4 "Invalid CPU name \"%s\".") at /cache/build/default-amdci5-4/julialang/julia-master/src/rtutils.c:77 #11 0xf75a4b22 in (anonymous namespace)::createTargetMachine () at /cache/build/default-amdci5-4/julialang/julia-master/src/jitlayers.cpp:823 #12 JuliaOJIT::JuliaOJIT (this=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/jitlayers.cpp:1044 #13 0xf7531793 in jl_init_llvm () at /cache/build/default-amdci5-4/julialang/julia-master/src/codegen.cpp:8585 #14 0xf75318a8 in jl_init_codegen_impl () at /cache/build/default-amdci5-4/julialang/julia-master/src/codegen.cpp:8648 #15 0xf7a51a52 in jl_restore_system_image_from_stream (f=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2131 #16 0xf7a55c03 in ijl_restore_system_image_data (buf=0xe859c1c0 <jl_system_image_data> "8'\031\003", len=125161105) at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2184 #17 0xf7a55cf9 in jl_load_sysimg_so () at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:424 #18 ijl_restore_system_image (fname=0x80a0900 "/build/bk_download/julia-d78fdad601/lib/julia/sys.so") at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2157 #19 0xf7a3bdfc in _finish_julia_init (rel=rel@entry=JL_IMAGE_JULIA_HOME, ct=<optimized out>, ptls=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/init.c:741 #20 0xf7a3c8ac in julia_init (rel=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/init.c:728 #21 0xf7a7f61d in jl_repl_entrypoint (argc=<optimized out>, argv=0xffffddf4) at /cache/build/default-amdci5-4/julialang/julia-master/src/jlapi.c:705 #22 0x080490a7 in main (argc=3, argv=0xffffddf4) at /cache/build/default-amdci5-4/julialang/julia-master/cli/loader_exe.c:59 ``` To prevent this, we simply avoid calling `jl_errorf` this early in the process, punting the problem to a later PR that can update guard conditions within `jl_error*`. (cherry picked from commit 21ab24e)

…Lang#45790) Currently the `@nospecialize`-d `push!(::Vector{Any}, ...)` can only take a single item and we will end up with runtime dispatch when we try to call it with multiple items: ```julia julia> code_typed(push!, (Vector{Any}, Any)) 1-element Vector{Any}: CodeInfo( 1 ─ $(Expr(:foreigncall, :(:jl_array_grow_end), Nothing, svec(Any, UInt64), 0, :(:ccall), Core.Argument(2), 0x0000000000000001, 0x0000000000000001))::Nothing │ %2 = Base.arraylen(a)::Int64 │ Base.arrayset(true, a, item, %2)::Vector{Any} └── return a ) => Vector{Any} julia> code_typed(push!, (Vector{Any}, Any, Any)) 1-element Vector{Any}: CodeInfo( 1 ─ %1 = Base.append!(a, iter)::Vector{Any} └── return %1 ) => Vector{Any} ``` This commit adds a new specialization that it can take arbitrary-length items. Our compiler should still be able to optimize the single-input case as before via the dispatch mechanism. ```julia julia> code_typed(push!, (Vector{Any}, Any)) 1-element Vector{Any}: CodeInfo( 1 ─ $(Expr(:foreigncall, :(:jl_array_grow_end), Nothing, svec(Any, UInt64), 0, :(:ccall), Core.Argument(2), 0x0000000000000001, 0x0000000000000001))::Nothing │ %2 = Base.arraylen(a)::Int64 │ Base.arrayset(true, a, item, %2)::Vector{Any} └── return a ) => Vector{Any} julia> code_typed(push!, (Vector{Any}, Any, Any)) 1-element Vector{Any}: CodeInfo( 1 ─ %1 = Base.arraylen(a)::Int64 │ $(Expr(:foreigncall, :(:jl_array_grow_end), Nothing, svec(Any, UInt64), 0, :(:ccall), Core.Argument(2), 0x0000000000000002, 0x0000000000000002))::Nothing └── goto JuliaLang#7 if not true 2 ┄ %4 = φ (JuliaLang#1 => 1, JuliaLang#6 => %14)::Int64 │ %5 = φ (JuliaLang#1 => 1, JuliaLang#6 => %15)::Int64 │ %6 = Base.getfield(x, %4, true)::Any │ %7 = Base.add_int(%1, %4)::Int64 │ Base.arrayset(true, a, %6, %7)::Vector{Any} │ %9 = (%5 === 2)::Bool └── goto JuliaLang#4 if not %9 3 ─ goto JuliaLang#5 4 ─ %12 = Base.add_int(%5, 1)::Int64 └── goto JuliaLang#5 5 ┄ %14 = φ (JuliaLang#4 => %12)::Int64 │ %15 = φ (JuliaLang#4 => %12)::Int64 │ %16 = φ (JuliaLang#3 => true, JuliaLang#4 => false)::Bool │ %17 = Base.not_int(%16)::Bool └── goto JuliaLang#7 if not %17 6 ─ goto JuliaLang#2 7 ┄ return a ) => Vector{Any} ``` This commit also adds the equivalent implementations for `pushfirst!`.

When calling `jl_error()` or `jl_errorf()`, we must check to see if we are so early in the bringup process that it is dangerous to attempt to construct a backtrace because the data structures used to provide line information are not properly setup. This can be easily triggered by running: ``` julia -C invalid ``` On an `i686-linux-gnu` build, this will hit the "Invalid CPU Name" branch in `jitlayers.cpp`, which calls `jl_errorf()`. This in turn calls `jl_throw()`, which will eventually call `jl_DI_for_fptr` as part of the backtrace printing process, which fails as the object maps are not fully initialized. See the below `gdb` stacktrace for details: ``` $ gdb -batch -ex 'r' -ex 'bt' --args ./julia -C invalid ... fatal: error thrown and no exception handler available. ErrorException("Invalid CPU name "invalid".") Thread 1 "julia" received signal SIGSEGV, Segmentation fault. 0xf75bd665 in std::_Rb_tree<unsigned int, std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo>, std::_Select1st<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> >, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__k=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h:1277 1277 /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h: No such file or directory. #0 0xf75bd665 in std::_Rb_tree<unsigned int, std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo>, std::_Select1st<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> >, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__k=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h:1277 JuliaLang#1 std::map<unsigned int, JITDebugInfoRegistry::ObjectInfo, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__x=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_map.h:1258 JuliaLang#2 jl_DI_for_fptr (fptr=4155049385, symsize=symsize@entry=0xffffcfa8, slide=slide@entry=0xffffcfa0, Section=Section@entry=0xffffcfb8, context=context@entry=0xffffcf94) at /cache/build/default-amdci5-4/julialang/julia-master/src/debuginfo.cpp:1181 JuliaLang#3 0xf75c056a in jl_getFunctionInfo_impl (frames_out=0xffffd03c, pointer=4155049385, skipC=0, noInline=0) at /cache/build/default-amdci5-4/julialang/julia-master/src/debuginfo.cpp:1210 JuliaLang#4 0xf7a6ca98 in jl_print_native_codeloc (ip=4155049385) at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:636 JuliaLang#5 0xf7a6cd54 in jl_print_bt_entry_codeloc (bt_entry=0xf0798018) at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:657 JuliaLang#6 jlbacktrace () at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:1090 JuliaLang#7 0xf7a3cd2b in ijl_no_exc_handler (e=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:605 JuliaLang#8 0xf7a3d10a in throw_internal (ct=ct@entry=0xf070c010, exception=<optimized out>, exception@entry=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:638 JuliaLang#9 0xf7a3d330 in ijl_throw (e=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:654 JuliaLang#10 0xf7a905aa in ijl_errorf (fmt=fmt@entry=0xf7647cd4 "Invalid CPU name \"%s\".") at /cache/build/default-amdci5-4/julialang/julia-master/src/rtutils.c:77 JuliaLang#11 0xf75a4b22 in (anonymous namespace)::createTargetMachine () at /cache/build/default-amdci5-4/julialang/julia-master/src/jitlayers.cpp:823 JuliaLang#12 JuliaOJIT::JuliaOJIT (this=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/jitlayers.cpp:1044 JuliaLang#13 0xf7531793 in jl_init_llvm () at /cache/build/default-amdci5-4/julialang/julia-master/src/codegen.cpp:8585 JuliaLang#14 0xf75318a8 in jl_init_codegen_impl () at /cache/build/default-amdci5-4/julialang/julia-master/src/codegen.cpp:8648 JuliaLang#15 0xf7a51a52 in jl_restore_system_image_from_stream (f=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2131 JuliaLang#16 0xf7a55c03 in ijl_restore_system_image_data (buf=0xe859c1c0 <jl_system_image_data> "8'\031\003", len=125161105) at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2184 JuliaLang#17 0xf7a55cf9 in jl_load_sysimg_so () at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:424 JuliaLang#18 ijl_restore_system_image (fname=0x80a0900 "/build/bk_download/julia-d78fdad601/lib/julia/sys.so") at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2157 JuliaLang#19 0xf7a3bdfc in _finish_julia_init (rel=rel@entry=JL_IMAGE_JULIA_HOME, ct=<optimized out>, ptls=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/init.c:741 JuliaLang#20 0xf7a3c8ac in julia_init (rel=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/init.c:728 JuliaLang#21 0xf7a7f61d in jl_repl_entrypoint (argc=<optimized out>, argv=0xffffddf4) at /cache/build/default-amdci5-4/julialang/julia-master/src/jlapi.c:705 JuliaLang#22 0x080490a7 in main (argc=3, argv=0xffffddf4) at /cache/build/default-amdci5-4/julialang/julia-master/cli/loader_exe.c:59 ``` To prevent this, we simply avoid calling `jl_errorf` this early in the process, punting the problem to a later PR that can update guard conditions within `jl_error*`.

@benchmark

This commit tries to fix and improve performance for calling keyword funcs whose arguments types are not fully known but `@nospecialize`-d. The final result would look like (this particular example is taken from our Julia-level compiler implementation): ```julia abstract type CallInfo end struct NoCallInfo <: CallInfo end struct NewInstruction stmt::Any type::Any info::CallInfo line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo), line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing}) return new(stmt, type, info, line, flag) end end @nospecialize function NewInstruction(newinst::NewInstruction; stmt=newinst.stmt, type=newinst.type, info::CallInfo=newinst.info, line::Union{Int32,Nothing}=newinst.line, flag::Union{UInt8,Nothing}=newinst.flag) return NewInstruction(stmt, type, info, line, flag) end @Specialize using BenchmarkTools struct VirtualKwargs stmt::Any type::Any info::CallInfo end vkws = VirtualKwargs(nothing, Any, NoCallInfo()) newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing) runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info) @benchmark runner($newinst, $vkws) ``` > on master ``` BenchmarkTools.Trial: 10000 samples with 186 evaluations. Range (min … max): 559.898 ns … 4.173 μs ┊ GC (min … max): 0.00% … 85.29% Time (median): 605.608 ns ┊ GC (median): 0.00% Time (mean ± σ): 638.170 ns ± 125.080 ns ┊ GC (mean ± σ): 0.06% ± 0.85% █▇▂▆▄ ▁█▇▄▂ ▂ ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █ 560 ns Histogram: log(frequency) by time 1.23 μs < Memory estimate: 32 bytes, allocs estimate: 2. ``` > on this commit ```julia BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 3.080 ns … 83.177 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 3.098 ns ┊ GC (median): 0.00% Time (mean ± σ): 3.118 ns ± 0.885 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▅▇█▆▅▄▂ ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃ 3.08 ns Histogram: frequency by time 3.19 ns < Memory estimate: 0 bytes, allocs estimate: 0. ``` So for this particular case it achieves roughly 200x speed up. This is because this commit allows inlining of a call to keyword sorter as well as removal of `NamedTuple` call. Especially this commit is composed of the following improvements: - add early return case for `structdiff`: This change improves the return type inference for a case when compared `NamedTuple`s are type unstable but there is no difference in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s. And in such case the optimizer will remove `structdiff` and succeeding `pairs` calls, letting the keyword sorter to be inlined. - add special SROA handling for `NamedTuple` generated by keyword sorter: With the change on `structdiff`, IR for a call with type-unstable keyword arguments after inlining would look like: ``` %1 = tuple(a, b, c)::Tuple{Any, Any, Any} %2 = NamedTuple{(:a, :b, :c)(%1)::NamedTuple{(:a, :b, :c), _A} where _A<:Tuple{Any, Any, Any} %3 = Core.getfield(%2, :a)::Any %4 = Core.getfield(%2, :b)::Any %5 = Core.getfield(%2, :c)::Any [... other body of the keyword func ...] ``` We can implement a bit hacky special handling within our SROA pass that checks if this definition (%2) is partly well-known `NamedTuple` construction, where its names are fully known, and also checks if its call argument (%1) is fully-known `tuple` call. In a case when the length of the `NamedTuple` names and the length of the arguments for the `tuple` call, we can safely replace those `getfield` calls with the corresponding `tuple` call argument, while letting the later DCE pass to delete the constructions of tuple and named-tuple altogether. With these changes, the IR for the example `NewInstruction` constructor is fairly optimized, like: ```julia julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info NewInstruction(newinst; stmt, type, info) end |> only 2 1 ── %1 = Base.getfield(_2, :line)::Union{Nothing, Int32} │╻╷ Type##kw │ %2 = Base.getfield(_2, :flag)::Union{Nothing, UInt8} ││┃ getproperty │ %3 = (isa)(%1, Nothing)::Bool ││ │ %4 = (isa)(%2, Nothing)::Bool ││ │ %5 = (Core.Intrinsics.and_int)(%3, %4)::Bool ││ └─── goto #3 if not %5 ││ 2 ── %7 = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction NewInstruction └─── goto #10 ││ 3 ── %9 = (isa)(%1, Int32)::Bool ││ │ %10 = (isa)(%2, Nothing)::Bool ││ │ %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool ││ └─── goto #5 if not %11 ││ 4 ── %13 = π (%1, Int32) ││ │ %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 5 ── %16 = (isa)(%1, Nothing)::Bool ││ │ %17 = (isa)(%2, UInt8)::Bool ││ │ %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool ││ └─── goto #7 if not %18 ││ 6 ── %20 = π (%2, UInt8) ││ │ %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 7 ── %23 = (isa)(%1, Int32)::Bool ││ │ %24 = (isa)(%2, UInt8)::Bool ││ │ %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool ││ └─── goto #9 if not %25 ││ 8 ── %27 = π (%1, Int32) ││ │ %28 = π (%2, UInt8) ││ │ %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction │││╻ NewInstruction └─── goto #10 ││ 9 ── Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{} └─── unreachable ││ 10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction ││ └─── goto #11 ││ 11 ─ return %33 │ => NewInstruction ```

@benchmark

This commit tries to fix and improve performance for calling keyword funcs whose arguments types are not fully known but `@nospecialize`-d. The final result would look like (this particular example is taken from our Julia-level compiler implementation): ```julia abstract type CallInfo end struct NoCallInfo <: CallInfo end struct NewInstruction stmt::Any type::Any info::CallInfo line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo), line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing}) return new(stmt, type, info, line, flag) end end @nospecialize function NewInstruction(newinst::NewInstruction; stmt=newinst.stmt, type=newinst.type, info::CallInfo=newinst.info, line::Union{Int32,Nothing}=newinst.line, flag::Union{UInt8,Nothing}=newinst.flag) return NewInstruction(stmt, type, info, line, flag) end @Specialize using BenchmarkTools struct VirtualKwargs stmt::Any type::Any info::CallInfo end vkws = VirtualKwargs(nothing, Any, NoCallInfo()) newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing) runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info) @benchmark runner($newinst, $vkws) ``` > on master ``` BenchmarkTools.Trial: 10000 samples with 186 evaluations. Range (min … max): 559.898 ns … 4.173 μs ┊ GC (min … max): 0.00% … 85.29% Time (median): 605.608 ns ┊ GC (median): 0.00% Time (mean ± σ): 638.170 ns ± 125.080 ns ┊ GC (mean ± σ): 0.06% ± 0.85% █▇▂▆▄ ▁█▇▄▂ ▂ ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █ 560 ns Histogram: log(frequency) by time 1.23 μs < Memory estimate: 32 bytes, allocs estimate: 2. ``` > on this commit ```julia BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 3.080 ns … 83.177 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 3.098 ns ┊ GC (median): 0.00% Time (mean ± σ): 3.118 ns ± 0.885 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▅▇█▆▅▄▂ ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃ 3.08 ns Histogram: frequency by time 3.19 ns < Memory estimate: 0 bytes, allocs estimate: 0. ``` So for this particular case it achieves roughly 200x speed up. This is because this commit allows inlining of a call to keyword sorter as well as removal of `NamedTuple` call. Especially this commit is composed of the following improvements: - add early return case for `structdiff`: This change improves the return type inference for a case when compared `NamedTuple`s are type unstable but there is no difference in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s. And in such case the optimizer will remove `structdiff` and succeeding `pairs` calls, letting the keyword sorter to be inlined. - add special SROA handling for `NamedTuple` generated by keyword sorter: With the change on `structdiff`, IR for a call with type-unstable keyword arguments after inlining would look like: ``` %1 = tuple(a, b, c)::Tuple{Any, Any, Any} %2 = NamedTuple{(:a, :b, :c)(%1)::NamedTuple{(:a, :b, :c), _A} where _A<:Tuple{Any, Any, Any} %3 = Core.getfield(%2, :a)::Any %4 = Core.getfield(%2, :b)::Any %5 = Core.getfield(%2, :c)::Any [... other body of the keyword func ...] ``` We can implement a bit hacky special handling within our SROA pass that checks if this definition (%2) is partly well-known `NamedTuple` construction, where its names are fully known, and also checks if its call argument (%1) is fully-known `tuple` call. In a case when the length of the `NamedTuple` names and the length of the arguments for the `tuple` call, we can safely replace those `getfield` calls with the corresponding `tuple` call argument, while letting the later DCE pass to delete the constructions of tuple and named-tuple altogether. With these changes, the IR for the example `NewInstruction` constructor is fairly optimized, like: ```julia julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info NewInstruction(newinst; stmt, type, info) end |> only 2 1 ── %1 = Base.getfield(_2, :line)::Union{Nothing, Int32} │╻╷ Type##kw │ %2 = Base.getfield(_2, :flag)::Union{Nothing, UInt8} ││┃ getproperty │ %3 = (isa)(%1, Nothing)::Bool ││ │ %4 = (isa)(%2, Nothing)::Bool ││ │ %5 = (Core.Intrinsics.and_int)(%3, %4)::Bool ││ └─── goto #3 if not %5 ││ 2 ── %7 = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction NewInstruction └─── goto #10 ││ 3 ── %9 = (isa)(%1, Int32)::Bool ││ │ %10 = (isa)(%2, Nothing)::Bool ││ │ %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool ││ └─── goto #5 if not %11 ││ 4 ── %13 = π (%1, Int32) ││ │ %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 5 ── %16 = (isa)(%1, Nothing)::Bool ││ │ %17 = (isa)(%2, UInt8)::Bool ││ │ %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool ││ └─── goto #7 if not %18 ││ 6 ── %20 = π (%2, UInt8) ││ │ %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 7 ── %23 = (isa)(%1, Int32)::Bool ││ │ %24 = (isa)(%2, UInt8)::Bool ││ │ %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool ││ └─── goto #9 if not %25 ││ 8 ── %27 = π (%1, Int32) ││ │ %28 = π (%2, UInt8) ││ │ %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction │││╻ NewInstruction └─── goto #10 ││ 9 ── Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{} └─── unreachable ││ 10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction ││ └─── goto #11 ││ 11 ─ return %33 │ => NewInstruction ```

@benchmark

This commit tries to fix and improve performance for calling keyword funcs whose arguments types are not fully known but `@nospecialize`-d. The final result would look like (this particular example is taken from our Julia-level compiler implementation): ```julia abstract type CallInfo end struct NoCallInfo <: CallInfo end struct NewInstruction stmt::Any type::Any info::CallInfo line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo), line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing}) return new(stmt, type, info, line, flag) end end @nospecialize function NewInstruction(newinst::NewInstruction; stmt=newinst.stmt, type=newinst.type, info::CallInfo=newinst.info, line::Union{Int32,Nothing}=newinst.line, flag::Union{UInt8,Nothing}=newinst.flag) return NewInstruction(stmt, type, info, line, flag) end @Specialize using BenchmarkTools struct VirtualKwargs stmt::Any type::Any info::CallInfo end vkws = VirtualKwargs(nothing, Any, NoCallInfo()) newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing) runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info) @benchmark runner($newinst, $vkws) ``` > on master ``` BenchmarkTools.Trial: 10000 samples with 186 evaluations. Range (min … max): 559.898 ns … 4.173 μs ┊ GC (min … max): 0.00% … 85.29% Time (median): 605.608 ns ┊ GC (median): 0.00% Time (mean ± σ): 638.170 ns ± 125.080 ns ┊ GC (mean ± σ): 0.06% ± 0.85% █▇▂▆▄ ▁█▇▄▂ ▂ ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █ 560 ns Histogram: log(frequency) by time 1.23 μs < Memory estimate: 32 bytes, allocs estimate: 2. ``` > on this commit ```julia BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 3.080 ns … 83.177 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 3.098 ns ┊ GC (median): 0.00% Time (mean ± σ): 3.118 ns ± 0.885 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▅▇█▆▅▄▂ ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃ 3.08 ns Histogram: frequency by time 3.19 ns < Memory estimate: 0 bytes, allocs estimate: 0. ``` So for this particular case it achieves roughly 200x speed up. This is because this commit allows inlining of a call to keyword sorter as well as removal of `NamedTuple` call. Especially this commit is composed of the following improvements: - add early return case for `structdiff`: This change improves the return type inference for a case when compared `NamedTuple`s are type unstable but there is no difference in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s. And in such case the optimizer will remove `structdiff` and succeeding `pairs` calls, letting the keyword sorter to be inlined. - add special SROA handling for `NamedTuple` generated by keyword sorter: With the change on `structdiff`, IR for a call with type-unstable keyword arguments after inlining would look like: ``` %1 = tuple(a, b, c)::Tuple{Any, Any, Any} %2 = NamedTuple{(:a, :b, :c)(%1)::NamedTuple{(:a, :b, :c), _A} where _A<:Tuple{Any, Any, Any} %3 = Core.getfield(%2, :a)::Any %4 = Core.getfield(%2, :b)::Any %5 = Core.getfield(%2, :c)::Any [... other body of the keyword func ...] ``` We can implement a bit hacky special handling within our SROA pass that checks if this definition (%2) is partly well-known `NamedTuple` construction, where its names are fully known, and also checks if its call argument (%1) is fully-known `tuple` call. In a case when the length of the `NamedTuple` names and the length of the arguments for the `tuple` call, we can safely replace those `getfield` calls with the corresponding `tuple` call argument, while letting the later DCE pass to delete the constructions of tuple and named-tuple altogether. With these changes, the IR for the example `NewInstruction` constructor is fairly optimized, like: ```julia julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info NewInstruction(newinst; stmt, type, info) end |> only 2 1 ── %1 = Base.getfield(_2, :line)::Union{Nothing, Int32} │╻╷ Type##kw │ %2 = Base.getfield(_2, :flag)::Union{Nothing, UInt8} ││┃ getproperty │ %3 = (isa)(%1, Nothing)::Bool ││ │ %4 = (isa)(%2, Nothing)::Bool ││ │ %5 = (Core.Intrinsics.and_int)(%3, %4)::Bool ││ └─── goto #3 if not %5 ││ 2 ── %7 = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction NewInstruction └─── goto #10 ││ 3 ── %9 = (isa)(%1, Int32)::Bool ││ │ %10 = (isa)(%2, Nothing)::Bool ││ │ %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool ││ └─── goto #5 if not %11 ││ 4 ── %13 = π (%1, Int32) ││ │ %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 5 ── %16 = (isa)(%1, Nothing)::Bool ││ │ %17 = (isa)(%2, UInt8)::Bool ││ │ %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool ││ └─── goto #7 if not %18 ││ 6 ── %20 = π (%2, UInt8) ││ │ %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 7 ── %23 = (isa)(%1, Int32)::Bool ││ │ %24 = (isa)(%2, UInt8)::Bool ││ │ %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool ││ └─── goto #9 if not %25 ││ 8 ── %27 = π (%1, Int32) ││ │ %28 = π (%2, UInt8) ││ │ %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction │││╻ NewInstruction └─── goto #10 ││ 9 ── Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{} └─── unreachable ││ 10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction ││ └─── goto #11 ││ 11 ─ return %33 │ => NewInstruction ```

@benchmark

This commit tries to fix and improve performance for calling keyword funcs whose arguments types are not fully known but `@nospecialize`-d. The final result would look like (this particular example is taken from our Julia-level compiler implementation): ```julia abstract type CallInfo end struct NoCallInfo <: CallInfo end struct NewInstruction stmt::Any type::Any info::CallInfo line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo), line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing}) return new(stmt, type, info, line, flag) end end @nospecialize function NewInstruction(newinst::NewInstruction; stmt=newinst.stmt, type=newinst.type, info::CallInfo=newinst.info, line::Union{Int32,Nothing}=newinst.line, flag::Union{UInt8,Nothing}=newinst.flag) return NewInstruction(stmt, type, info, line, flag) end @Specialize using BenchmarkTools struct VirtualKwargs stmt::Any type::Any info::CallInfo end vkws = VirtualKwargs(nothing, Any, NoCallInfo()) newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing) runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info) @benchmark runner($newinst, $vkws) ``` > on master ``` BenchmarkTools.Trial: 10000 samples with 186 evaluations. Range (min … max): 559.898 ns … 4.173 μs ┊ GC (min … max): 0.00% … 85.29% Time (median): 605.608 ns ┊ GC (median): 0.00% Time (mean ± σ): 638.170 ns ± 125.080 ns ┊ GC (mean ± σ): 0.06% ± 0.85% █▇▂▆▄ ▁█▇▄▂ ▂ ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █ 560 ns Histogram: log(frequency) by time 1.23 μs < Memory estimate: 32 bytes, allocs estimate: 2. ``` > on this commit ```julia BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 3.080 ns … 83.177 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 3.098 ns ┊ GC (median): 0.00% Time (mean ± σ): 3.118 ns ± 0.885 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▅▇█▆▅▄▂ ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃ 3.08 ns Histogram: frequency by time 3.19 ns < Memory estimate: 0 bytes, allocs estimate: 0. ``` So for this particular case it achieves roughly 200x speed up. This is because this commit allows inlining of a call to keyword sorter as well as removal of `NamedTuple` call. Especially this commit is composed of the following improvements: - Add early return case for `structdiff`: This change improves the return type inference for a case when compared `NamedTuple`s are type unstable but there is no difference in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s. And in such case the optimizer will remove `structdiff` and succeeding `pairs` calls, letting the keyword sorter to be inlined. - Tweak the core `NamedTuple{names}(args::Tuple)` constructor so that it directly forms `:splatnew` allocation rather than redirects to the general `NamedTuple` constructor, that could be confused for abstract input tuple type. - Improve `nfields_tfunc` accuracy as for abstract `NamedTuple` types. This improvement lets `inline_splatnew` to handle more abstract `NamedTuple`s, especially whose names are fully known but its fields tuple type is abstract. Those improvements are combined to allow our SROA pass to optimize away `NamedTuple` and `tuple` calls generated for keyword argument handling. E.g. the IR for the example `NewInstruction` constructor is now fairly optimized, like: ```julia julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info NewInstruction(newinst; stmt, type, info) end |> only 2 1 ── %1 = Base.getfield(_2, :line)::Union{Nothing, Int32} │╻╷ Type##kw │ %2 = Base.getfield(_2, :flag)::Union{Nothing, UInt8} ││┃ getproperty │ %3 = (isa)(%1, Nothing)::Bool ││ │ %4 = (isa)(%2, Nothing)::Bool ││ │ %5 = (Core.Intrinsics.and_int)(%3, %4)::Bool ││ └─── goto #3 if not %5 ││ 2 ── %7 = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction NewInstruction └─── goto #10 ││ 3 ── %9 = (isa)(%1, Int32)::Bool ││ │ %10 = (isa)(%2, Nothing)::Bool ││ │ %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool ││ └─── goto #5 if not %11 ││ 4 ── %13 = π (%1, Int32) ││ │ %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 5 ── %16 = (isa)(%1, Nothing)::Bool ││ │ %17 = (isa)(%2, UInt8)::Bool ││ │ %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool ││ └─── goto #7 if not %18 ││ 6 ── %20 = π (%2, UInt8) ││ │ %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 7 ── %23 = (isa)(%1, Int32)::Bool ││ │ %24 = (isa)(%2, UInt8)::Bool ││ │ %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool ││ └─── goto #9 if not %25 ││ 8 ── %27 = π (%1, Int32) ││ │ %28 = π (%2, UInt8) ││ │ %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction │││╻ NewInstruction └─── goto #10 ││ 9 ── Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{} └─── unreachable ││ 10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction ││ └─── goto #11 ││ 11 ─ return %33 │ => NewInstruction ```

@benchmark

This commit tries to fix and improve performance for calling keyword funcs whose arguments types are not fully known but `@nospecialize`-d. The final result would look like (this particular example is taken from our Julia-level compiler implementation): ```julia abstract type CallInfo end struct NoCallInfo <: CallInfo end struct NewInstruction stmt::Any type::Any info::CallInfo line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo), line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing}) return new(stmt, type, info, line, flag) end end @nospecialize function NewInstruction(newinst::NewInstruction; stmt=newinst.stmt, type=newinst.type, info::CallInfo=newinst.info, line::Union{Int32,Nothing}=newinst.line, flag::Union{UInt8,Nothing}=newinst.flag) return NewInstruction(stmt, type, info, line, flag) end @Specialize using BenchmarkTools struct VirtualKwargs stmt::Any type::Any info::CallInfo end vkws = VirtualKwargs(nothing, Any, NoCallInfo()) newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing) runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info) @benchmark runner($newinst, $vkws) ``` > on master ``` BenchmarkTools.Trial: 10000 samples with 186 evaluations. Range (min … max): 559.898 ns … 4.173 μs ┊ GC (min … max): 0.00% … 85.29% Time (median): 605.608 ns ┊ GC (median): 0.00% Time (mean ± σ): 638.170 ns ± 125.080 ns ┊ GC (mean ± σ): 0.06% ± 0.85% █▇▂▆▄ ▁█▇▄▂ ▂ ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █ 560 ns Histogram: log(frequency) by time 1.23 μs < Memory estimate: 32 bytes, allocs estimate: 2. ``` > on this commit ```julia BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 3.080 ns … 83.177 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 3.098 ns ┊ GC (median): 0.00% Time (mean ± σ): 3.118 ns ± 0.885 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▅▇█▆▅▄▂ ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃ 3.08 ns Histogram: frequency by time 3.19 ns < Memory estimate: 0 bytes, allocs estimate: 0. ``` So for this particular case it achieves roughly 200x speed up. This is because this commit allows inlining of a call to keyword sorter as well as removal of `NamedTuple` call. Especially this commit is composed of the following improvements: - Add early return case for `structdiff`: This change improves the return type inference for a case when compared `NamedTuple`s are type unstable but there is no difference in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s. And in such case the optimizer will remove `structdiff` and succeeding `pairs` calls, letting the keyword sorter to be inlined. - Tweak the core `NamedTuple{names}(args::Tuple)` constructor so that it directly forms `:splatnew` allocation rather than redirects to the general `NamedTuple` constructor, that could be confused for abstract input tuple type. - Improve `nfields_tfunc` accuracy as for abstract `NamedTuple` types. This improvement lets `inline_splatnew` to handle more abstract `NamedTuple`s, especially whose names are fully known but its fields tuple type is abstract. Those improvements are combined to allow our SROA pass to optimize away `NamedTuple` and `tuple` calls generated for keyword argument handling. E.g. the IR for the example `NewInstruction` constructor is now fairly optimized, like: ```julia julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info NewInstruction(newinst; stmt, type, info) end |> only 2 1 ── %1 = Base.getfield(_2, :line)::Union{Nothing, Int32} │╻╷ Type##kw │ %2 = Base.getfield(_2, :flag)::Union{Nothing, UInt8} ││┃ getproperty │ %3 = (isa)(%1, Nothing)::Bool ││ │ %4 = (isa)(%2, Nothing)::Bool ││ │ %5 = (Core.Intrinsics.and_int)(%3, %4)::Bool ││ └─── goto #3 if not %5 ││ 2 ── %7 = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction NewInstruction └─── goto #10 ││ 3 ── %9 = (isa)(%1, Int32)::Bool ││ │ %10 = (isa)(%2, Nothing)::Bool ││ │ %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool ││ └─── goto #5 if not %11 ││ 4 ── %13 = π (%1, Int32) ││ │ %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 5 ── %16 = (isa)(%1, Nothing)::Bool ││ │ %17 = (isa)(%2, UInt8)::Bool ││ │ %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool ││ └─── goto #7 if not %18 ││ 6 ── %20 = π (%2, UInt8) ││ │ %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 7 ── %23 = (isa)(%1, Int32)::Bool ││ │ %24 = (isa)(%2, UInt8)::Bool ││ │ %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool ││ └─── goto #9 if not %25 ││ 8 ── %27 = π (%1, Int32) ││ │ %28 = π (%2, UInt8) ││ │ %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction │││╻ NewInstruction └─── goto #10 ││ 9 ── Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{} └─── unreachable ││ 10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction ││ └─── goto #11 ││ 11 ─ return %33 │ => NewInstruction ```

@benchmark

This commit tries to fix and improve performance for calling keyword funcs whose arguments types are not fully known but `@nospecialize`-d. The final result would look like (this particular example is taken from our Julia-level compiler implementation): ```julia abstract type CallInfo end struct NoCallInfo <: CallInfo end struct NewInstruction stmt::Any type::Any info::CallInfo line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo), line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing}) return new(stmt, type, info, line, flag) end end @nospecialize function NewInstruction(newinst::NewInstruction; stmt=newinst.stmt, type=newinst.type, info::CallInfo=newinst.info, line::Union{Int32,Nothing}=newinst.line, flag::Union{UInt8,Nothing}=newinst.flag) return NewInstruction(stmt, type, info, line, flag) end @Specialize using BenchmarkTools struct VirtualKwargs stmt::Any type::Any info::CallInfo end vkws = VirtualKwargs(nothing, Any, NoCallInfo()) newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing) runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info) @benchmark runner($newinst, $vkws) ``` > on master ``` BenchmarkTools.Trial: 10000 samples with 186 evaluations. Range (min … max): 559.898 ns … 4.173 μs ┊ GC (min … max): 0.00% … 85.29% Time (median): 605.608 ns ┊ GC (median): 0.00% Time (mean ± σ): 638.170 ns ± 125.080 ns ┊ GC (mean ± σ): 0.06% ± 0.85% █▇▂▆▄ ▁█▇▄▂ ▂ ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █ 560 ns Histogram: log(frequency) by time 1.23 μs < Memory estimate: 32 bytes, allocs estimate: 2. ``` > on this commit ```julia BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 3.080 ns … 83.177 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 3.098 ns ┊ GC (median): 0.00% Time (mean ± σ): 3.118 ns ± 0.885 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▅▇█▆▅▄▂ ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃ 3.08 ns Histogram: frequency by time 3.19 ns < Memory estimate: 0 bytes, allocs estimate: 0. ``` So for this particular case it achieves roughly 200x speed up. This is because this commit allows inlining of a call to keyword sorter as well as removal of `NamedTuple` call. Especially this commit is composed of the following improvements: - Add early return case for `structdiff`: This change improves the return type inference for a case when compared `NamedTuple`s are type unstable but there is no difference in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s. And in such case the optimizer will remove `structdiff` and succeeding `pairs` calls, letting the keyword sorter to be inlined. - Tweak the core `NamedTuple{names}(args::Tuple)` constructor so that it directly forms `:splatnew` allocation rather than redirects to the general `NamedTuple` constructor, that could be confused for abstract input tuple type. - Improve `nfields_tfunc` accuracy as for abstract `NamedTuple` types. This improvement lets `inline_splatnew` to handle more abstract `NamedTuple`s, especially whose names are fully known but its fields tuple type is abstract. Those improvements are combined to allow our SROA pass to optimize away `NamedTuple` and `tuple` calls generated for keyword argument handling. E.g. the IR for the example `NewInstruction` constructor is now fairly optimized, like: ```julia julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info NewInstruction(newinst; stmt, type, info) end |> only 2 1 ── %1 = Base.getfield(_2, :line)::Union{Nothing, Int32} │╻╷ Type##kw │ %2 = Base.getfield(_2, :flag)::Union{Nothing, UInt8} ││┃ getproperty │ %3 = (isa)(%1, Nothing)::Bool ││ │ %4 = (isa)(%2, Nothing)::Bool ││ │ %5 = (Core.Intrinsics.and_int)(%3, %4)::Bool ││ └─── goto #3 if not %5 ││ 2 ── %7 = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction NewInstruction └─── goto #10 ││ 3 ── %9 = (isa)(%1, Int32)::Bool ││ │ %10 = (isa)(%2, Nothing)::Bool ││ │ %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool ││ └─── goto #5 if not %11 ││ 4 ── %13 = π (%1, Int32) ││ │ %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 5 ── %16 = (isa)(%1, Nothing)::Bool ││ │ %17 = (isa)(%2, UInt8)::Bool ││ │ %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool ││ └─── goto #7 if not %18 ││ 6 ── %20 = π (%2, UInt8) ││ │ %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 7 ── %23 = (isa)(%1, Int32)::Bool ││ │ %24 = (isa)(%2, UInt8)::Bool ││ │ %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool ││ └─── goto #9 if not %25 ││ 8 ── %27 = π (%1, Int32) ││ │ %28 = π (%2, UInt8) ││ │ %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction │││╻ NewInstruction └─── goto #10 ││ 9 ── Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{} └─── unreachable ││ 10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction ││ └─── goto #11 ││ 11 ─ return %33 │ => NewInstruction ```

@benchmark

This commit tries to fix and improve performance for calling keyword funcs whose arguments types are not fully known but `@nospecialize`-d. The final result would look like (this particular example is taken from our Julia-level compiler implementation): ```julia abstract type CallInfo end struct NoCallInfo <: CallInfo end struct NewInstruction stmt::Any type::Any info::CallInfo line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo), line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing}) return new(stmt, type, info, line, flag) end end @nospecialize function NewInstruction(newinst::NewInstruction; stmt=newinst.stmt, type=newinst.type, info::CallInfo=newinst.info, line::Union{Int32,Nothing}=newinst.line, flag::Union{UInt8,Nothing}=newinst.flag) return NewInstruction(stmt, type, info, line, flag) end @Specialize using BenchmarkTools struct VirtualKwargs stmt::Any type::Any info::CallInfo end vkws = VirtualKwargs(nothing, Any, NoCallInfo()) newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing) runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info) @benchmark runner($newinst, $vkws) ``` > on master ``` BenchmarkTools.Trial: 10000 samples with 186 evaluations. Range (min … max): 559.898 ns … 4.173 μs ┊ GC (min … max): 0.00% … 85.29% Time (median): 605.608 ns ┊ GC (median): 0.00% Time (mean ± σ): 638.170 ns ± 125.080 ns ┊ GC (mean ± σ): 0.06% ± 0.85% █▇▂▆▄ ▁█▇▄▂ ▂ ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █ 560 ns Histogram: log(frequency) by time 1.23 μs < Memory estimate: 32 bytes, allocs estimate: 2. ``` > on this commit ```julia BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 3.080 ns … 83.177 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 3.098 ns ┊ GC (median): 0.00% Time (mean ± σ): 3.118 ns ± 0.885 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▅▇█▆▅▄▂ ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃ 3.08 ns Histogram: frequency by time 3.19 ns < Memory estimate: 0 bytes, allocs estimate: 0. ``` So for this particular case it achieves roughly 200x speed up. This is because this commit allows inlining of a call to keyword sorter as well as removal of `NamedTuple` call. Especially this commit is composed of the following improvements: - Add early return case for `structdiff`: This change improves the return type inference for a case when compared `NamedTuple`s are type unstable but there is no difference in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s. And in such case the optimizer will remove `structdiff` and succeeding `pairs` calls, letting the keyword sorter to be inlined. - Tweak the core `NamedTuple{names}(args::Tuple)` constructor so that it directly forms `:splatnew` allocation rather than redirects to the general `NamedTuple` constructor, that could be confused for abstract input tuple type. - Improve `nfields_tfunc` accuracy as for abstract `NamedTuple` types. This improvement lets `inline_splatnew` to handle more abstract `NamedTuple`s, especially whose names are fully known but its fields tuple type is abstract. Those improvements are combined to allow our SROA pass to optimize away `NamedTuple` and `tuple` calls generated for keyword argument handling. E.g. the IR for the example `NewInstruction` constructor is now fairly optimized, like: ```julia julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info NewInstruction(newinst; stmt, type, info) end |> only 2 1 ── %1 = Base.getfield(_2, :line)::Union{Nothing, Int32} │╻╷ Type##kw │ %2 = Base.getfield(_2, :flag)::Union{Nothing, UInt8} ││┃ getproperty │ %3 = (isa)(%1, Nothing)::Bool ││ │ %4 = (isa)(%2, Nothing)::Bool ││ │ %5 = (Core.Intrinsics.and_int)(%3, %4)::Bool ││ └─── goto #3 if not %5 ││ 2 ── %7 = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction NewInstruction └─── goto #10 ││ 3 ── %9 = (isa)(%1, Int32)::Bool ││ │ %10 = (isa)(%2, Nothing)::Bool ││ │ %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool ││ └─── goto #5 if not %11 ││ 4 ── %13 = π (%1, Int32) ││ │ %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 5 ── %16 = (isa)(%1, Nothing)::Bool ││ │ %17 = (isa)(%2, UInt8)::Bool ││ │ %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool ││ └─── goto #7 if not %18 ││ 6 ── %20 = π (%2, UInt8) ││ │ %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 7 ── %23 = (isa)(%1, Int32)::Bool ││ │ %24 = (isa)(%2, UInt8)::Bool ││ │ %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool ││ └─── goto #9 if not %25 ││ 8 ── %27 = π (%1, Int32) ││ │ %28 = π (%2, UInt8) ││ │ %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction │││╻ NewInstruction └─── goto #10 ││ 9 ── Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{} └─── unreachable ││ 10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction ││ └─── goto #11 ││ 11 ─ return %33 │ => NewInstruction ```

@benchmark

This commit tries to fix and improve performance for calling keyword funcs whose arguments types are not fully known but `@nospecialize`-d. The final result would look like (this particular example is taken from our Julia-level compiler implementation): ```julia abstract type CallInfo end struct NoCallInfo <: CallInfo end struct NewInstruction stmt::Any type::Any info::CallInfo line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo), line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing}) return new(stmt, type, info, line, flag) end end @nospecialize function NewInstruction(newinst::NewInstruction; stmt=newinst.stmt, type=newinst.type, info::CallInfo=newinst.info, line::Union{Int32,Nothing}=newinst.line, flag::Union{UInt8,Nothing}=newinst.flag) return NewInstruction(stmt, type, info, line, flag) end @Specialize using BenchmarkTools struct VirtualKwargs stmt::Any type::Any info::CallInfo end vkws = VirtualKwargs(nothing, Any, NoCallInfo()) newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing) runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info) @benchmark runner($newinst, $vkws) ``` > on master ``` BenchmarkTools.Trial: 10000 samples with 186 evaluations. Range (min … max): 559.898 ns … 4.173 μs ┊ GC (min … max): 0.00% … 85.29% Time (median): 605.608 ns ┊ GC (median): 0.00% Time (mean ± σ): 638.170 ns ± 125.080 ns ┊ GC (mean ± σ): 0.06% ± 0.85% █▇▂▆▄ ▁█▇▄▂ ▂ ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █ 560 ns Histogram: log(frequency) by time 1.23 μs < Memory estimate: 32 bytes, allocs estimate: 2. ``` > on this commit ```julia BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 3.080 ns … 83.177 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 3.098 ns ┊ GC (median): 0.00% Time (mean ± σ): 3.118 ns ± 0.885 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▅▇█▆▅▄▂ ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃ 3.08 ns Histogram: frequency by time 3.19 ns < Memory estimate: 0 bytes, allocs estimate: 0. ``` So for this particular case it achieves roughly 200x speed up. This is because this commit allows inlining of a call to keyword sorter as well as removal of `NamedTuple` call. Especially this commit is composed of the following improvements: - Add early return case for `structdiff`: This change improves the return type inference for a case when compared `NamedTuple`s are type unstable but there is no difference in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s. And in such case the optimizer will remove `structdiff` and succeeding `pairs` calls, letting the keyword sorter to be inlined. - Tweak the core `NamedTuple{names}(args::Tuple)` constructor so that it directly forms `:splatnew` allocation rather than redirects to the general `NamedTuple` constructor, that could be confused for abstract input tuple type. - Improve `nfields_tfunc` accuracy as for abstract `NamedTuple` types. This improvement lets `inline_splatnew` to handle more abstract `NamedTuple`s, especially whose names are fully known but its fields tuple type is abstract. Those improvements are combined to allow our SROA pass to optimize away `NamedTuple` and `tuple` calls generated for keyword argument handling. E.g. the IR for the example `NewInstruction` constructor is now fairly optimized, like: ```julia julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info NewInstruction(newinst; stmt, type, info) end |> only 2 1 ── %1 = Base.getfield(_2, :line)::Union{Nothing, Int32} │╻╷ Type##kw │ %2 = Base.getfield(_2, :flag)::Union{Nothing, UInt8} ││┃ getproperty │ %3 = (isa)(%1, Nothing)::Bool ││ │ %4 = (isa)(%2, Nothing)::Bool ││ │ %5 = (Core.Intrinsics.and_int)(%3, %4)::Bool ││ └─── goto #3 if not %5 ││ 2 ── %7 = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction NewInstruction └─── goto #10 ││ 3 ── %9 = (isa)(%1, Int32)::Bool ││ │ %10 = (isa)(%2, Nothing)::Bool ││ │ %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool ││ └─── goto #5 if not %11 ││ 4 ── %13 = π (%1, Int32) ││ │ %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 5 ── %16 = (isa)(%1, Nothing)::Bool ││ │ %17 = (isa)(%2, UInt8)::Bool ││ │ %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool ││ └─── goto #7 if not %18 ││ 6 ── %20 = π (%2, UInt8) ││ │ %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 7 ── %23 = (isa)(%1, Int32)::Bool ││ │ %24 = (isa)(%2, UInt8)::Bool ││ │ %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool ││ └─── goto #9 if not %25 ││ 8 ── %27 = π (%1, Int32) ││ │ %28 = π (%2, UInt8) ││ │ %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction │││╻ NewInstruction └─── goto #10 ││ 9 ── Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{} └─── unreachable ││ 10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction ││ └─── goto #11 ││ 11 ─ return %33 │ => NewInstruction ```

@0

This makes it easier to correlate LLVM IR with the originating source code by including both argument name and argument type in the LLVM argument variable. <details> <summary>Example 1</summary> ```julia julia> function f(a, b, c, d, g...) e = a + b + c + d f = does_not_exist(e) + e f end f (generic function with 1 method) julia> @code_llvm f(0,0,0,0,0) ``` ```llvm ; @ REPL[1]:1 within `f` define nonnull {}* @julia_f_141(i64 signext %"a::Int64", i64 signext %"b::Int64", i64 signext %"c::Int64", i64 signext %"d::Int64", i64 signext %"g[0]::Int64") #0 { top: %0 = alloca [2 x {}*], align 8 %gcframe3 = alloca [4 x {}*], align 16 %gcframe3.sub = getelementptr inbounds [4 x {}*], [4 x {}*]* %gcframe3, i64 0, i64 0 %1 = bitcast [4 x {}*]* %gcframe3 to i8* call void @llvm.memset.p0i8.i64(i8* align 16 %1, i8 0, i64 32, i1 true) %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"() #7 %tls_ppgcstack = getelementptr i8, i8* %thread_ptr, i64 -8 %2 = bitcast i8* %tls_ppgcstack to {}**** %tls_pgcstack = load {}***, {}**** %2, align 8 ; @ REPL[1]:3 within `f` %3 = bitcast [4 x {}*]* %gcframe3 to i64* store i64 8, i64* %3, align 16 %4 = getelementptr inbounds [4 x {}*], [4 x {}*]* %gcframe3, i64 0, i64 1 %5 = bitcast {}** %4 to {}*** %6 = load {}**, {}*** %tls_pgcstack, align 8 store {}** %6, {}*** %5, align 8 %7 = bitcast {}*** %tls_pgcstack to {}*** store {}** %gcframe3.sub, {}*** %7, align 8 %Main.does_not_exist.cached = load atomic {}*, {}** @0 unordered, align 8 %iscached.not = icmp eq {}* %Main.does_not_exist.cached, null br i1 %iscached.not, label %notfound, label %found notfound: ; preds = %top %Main.does_not_exist.found = call {}* @ijl_get_binding_or_error({}* nonnull inttoptr (i64 139831437630272 to {}*), {}* nonnull inttoptr (i64 139831600565400 to {}*)) store atomic {}* %Main.does_not_exist.found, {}** @0 release, align 8 br label %found found: ; preds = %notfound, %top %Main.does_not_exist = phi {}* [ %Main.does_not_exist.cached, %top ], [ %Main.does_not_exist.found, %notfound ] %8 = bitcast {}* %Main.does_not_exist to {}** %does_not_exist.checked = load atomic {}*, {}** %8 unordered, align 8 %.not = icmp eq {}* %does_not_exist.checked, null br i1 %.not, label %err, label %ok err: ; preds = %found call void @ijl_undefined_var_error({}* inttoptr (i64 139831600565400 to {}*)) unreachable ok: ; preds = %found %.sub = getelementptr inbounds [2 x {}*], [2 x {}*]* %0, i64 0, i64 0 ; @ REPL[1]:2 within `f` ; ┌ @ operators.jl:587 within `+` @ int.jl:87 %9 = add i64 %"b::Int64", %"a::Int64" %10 = add i64 %9, %"c::Int64" ; │ @ operators.jl:587 within `+` ; │┌ @ operators.jl:544 within `afoldl` ; ││┌ @ int.jl:87 within `+` %11 = add i64 %10, %"d::Int64" %12 = getelementptr inbounds [4 x {}*], [4 x {}*]* %gcframe3, i64 0, i64 3 store {}* %does_not_exist.checked, {}** %12, align 8 ; └└└ ; @ REPL[1]:3 within `f` %13 = call nonnull {}* @ijl_box_int64(i64 signext %11) %14 = getelementptr inbounds [4 x {}*], [4 x {}*]* %gcframe3, i64 0, i64 2 store {}* %13, {}** %14, align 16 store {}* %13, {}** %.sub, align 8 %15 = call nonnull {}* @ijl_apply_generic({}* nonnull %does_not_exist.checked, {}** nonnull %.sub, i32 1) store {}* %15, {}** %12, align 8 %16 = call nonnull {}* @ijl_box_int64(i64 signext %11) store {}* %16, {}** %14, align 16 store {}* %15, {}** %.sub, align 8 %17 = getelementptr inbounds [2 x {}*], [2 x {}*]* %0, i64 0, i64 1 store {}* %16, {}** %17, align 8 %18 = call nonnull {}* @ijl_apply_generic({}* inttoptr (i64 139831370516384 to {}*), {}** nonnull %.sub, i32 2) %19 = load {}*, {}** %4, align 8 %20 = bitcast {}*** %tls_pgcstack to {}** store {}* %19, {}** %20, align 8 ; @ REPL[1]:4 within `f` ret {}* %18 } ``` </details> <details> <summary>Example 2</summary> ```julia julia> function g(a, b, c, d; kwarg=0) a + b + c + d + kwarg end g (generic function with 1 method) julia> @code_llvm g(0,0,0,0,kwarg=0) ``` ```llvm ; @ REPL[3]:1 within `g` define i64 @julia_g_160([1 x i64]* nocapture noundef nonnull readonly align 8 dereferenceable(8) %"#1::NamedTuple", i64 signext %"a::Int64", i64 signext %"b::Int64", i64 signext %"c::Int64", i64 signext %"d::Int64") #0 { top: %0 = getelementptr inbounds [1 x i64], [1 x i64]* %"#1::NamedTuple", i64 0, i64 0 ; ┌ @ REPL[3]:2 within `#g#1` ; │┌ @ operators.jl:587 within `+` @ int.jl:87 %1 = add i64 %"b::Int64", %"a::Int64" %2 = add i64 %1, %"c::Int64" ; ││ @ operators.jl:587 within `+` ; ││┌ @ operators.jl:544 within `afoldl` ; │││┌ @ int.jl:87 within `+` %3 = add i64 %2, %"d::Int64" ; │││└ ; │││ @ operators.jl:545 within `afoldl` ; │││┌ @ int.jl:87 within `+` %unbox = load i64, i64* %0, align 8 %4 = add i64 %3, %unbox ; └└└└ ret i64 %4 } ``` </details>

Pass the types to the allocator functions. ------- Before this PR, we were missing the types for allocations in two cases: 1. allocations from codegen 2. allocations in `gc_managed_realloc_` The second one is easy: those are always used for buffers, right? For the first one: we extend the allocation functions called from codegen, to take the type as a parameter, and set the tag there. I kept the old interfaces around, since I think that they cannot be removed due to supporting legacy code? ------ An example of the generated code: ```julia %ptls_field6 = getelementptr inbounds {}**, {}*** %4, i64 2 %13 = bitcast {}*** %ptls_field6 to i8** %ptls_load78 = load i8*, i8** %13, align 8 %box = call noalias nonnull dereferenceable(32) {}* @ijl_gc_pool_alloc_typed(i8* %ptls_load78, i32 1184, i32 32, i64 4366152144) #7 ``` Fixes #43688. Fixes #45268. Co-authored-by: Valentin Churavy <[email protected]>

Backports PR #50337 for RAI julia v1.9.2 Original description: =================== Pass the types to the allocator functions. ------- Before this PR, we were missing the types for allocations in two cases: 1. allocations from codegen 2. allocations in `gc_managed_realloc_` The second one is easy: those are always used for buffers, right? For the first one: we extend the allocation functions called from codegen, to take the type as a parameter, and set the tag there. I kept the old interfaces around, since I think that they cannot be removed due to supporting legacy code? ------ An example of the generated code: ```julia %ptls_field6 = getelementptr inbounds {}**, {}*** %4, i64 2 %13 = bitcast {}*** %ptls_field6 to i8** %ptls_load78 = load i8*, i8** %13, align 8 %box = call noalias nonnull dereferenceable(32) {}* @ijl_gc_pool_alloc_typed(i8* %ptls_load78, i32 1184, i32 32, i64 4366152144) #7 ``` Fixes #43688. Fixes #45268. Co-authored-by: Valentin Churavy <[email protected]>

Fixes #7

Fix `getproperty` calls and handle Core.Typeof(Vararg)

@benchmark

This is part of the work to address #51352 by attempting to allow the compiler to perform SRAO on persistent data structures like `PersistentDict` as if they were regular immutable data structures. These sorts of data structures have very complicated internals (with lots of mutation, memory sharing, etc.), but a relatively simple interface. As such, it is unlikely that our compiler will have sufficient power to optimize this interface by analyzing the implementation. We thus need to come up with some other mechanism that gives the compiler license to perform the requisite optimization. One way would be to just hardcode `PersistentDict` into the compiler, optimizing it like any of the other builtin datatypes. However, this is of course very unsatisfying. At the other end of the spectrum would be something like a generic rewrite rule system (e-graphs anyone?) that would let the PersistentDict implementation declare its interface to the compiler and the compiler would use this for optimization (in a perfect world, the actual rewrite would then be checked using some sort of formal methods). I think that would be interesting, but we're very far from even being able to design something like that (at least in Base - experiments with external AbstractInterpreters in this direction are encouraged). This PR tries to come up with a reasonable middle ground, where the compiler gets some knowledge of the protocol hardcoded without having to know about the implementation details of the data structure. The basic ideas is that `Core` provides some magic generic functions that implementations can extend. Semantically, they are not special. They dispatch as usual, and implementations are expected to work properly even in the absence of any compiler optimizations. However, the compiler is semantically permitted to perform structural optimization using these magic generic functions. In the concrete case, this PR introduces the `KeyValue` interface which consists of two generic functions, `get` and `set`. The core optimization is that the compiler is allowed to rewrite any occurrence of `get(set(x, k, v), k)` into `v` without additional legality checks. In particular, the compiler performs no type checks, conversions, etc. The higher level implementation code is expected to do all that. This approach closely matches the general direction we've been taking in external AbstractInterpreters for embedding additional semantics and optimization opportunities into Julia code (although we generally use methods there, rather than full generic functions), so I think we have some evidence that this sort of approach works reasonably well. Nevertheless, this is certainly an experiment and the interface is explicitly declared unstable. ## Current Status This is fully working and implemented, but the optimization currently bails on anything but the simplest cases. Filling all those cases in is not particularly hard, but should be done along with a more invasive refactoring of SROA, so we should figure out the general direction here first and then we can finish all that up in a follow-up cleanup. ## Obligatory benchmark Before: ``` julia> using BenchmarkTools julia> function foo() a = Base.PersistentDict(:a => 1) return a[:a] end foo (generic function with 1 method) julia> @benchmark foo() BenchmarkTools.Trial: 10000 samples with 993 evaluations. Range (min … max): 32.940 ns … 28.754 μs ┊ GC (min … max): 0.00% … 99.76% Time (median): 49.647 ns ┊ GC (median): 0.00% Time (mean ± σ): 57.519 ns ± 333.275 ns ┊ GC (mean ± σ): 10.81% ± 2.22% ▃█▅ ▁▃▅▅▃▁ ▁▃▂ ▂ ▁▂▄▃▅▇███▇▃▁▂▁▁▁▁▁▁▁▁▂▂▅██████▅▂▁▁▁▁▁▁▁▁▁▁▂▃▃▇███▇▆███▆▄▃▃▂▂ ▃ 32.9 ns Histogram: frequency by time 68.6 ns < Memory estimate: 128 bytes, allocs estimate: 4. julia> @code_typed foo() CodeInfo( 1 ─ %1 = invoke Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}(Base.HashArrayMappedTries.undef::UndefInitializer, 1::Int64)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %2 = %new(Base.HashArrayMappedTries.HAMT{Symbol, Int64}, %1, 0x00000000)::Base.HashArrayMappedTries.HAMT{Symbol, Int64} │ %3 = %new(Base.HashArrayMappedTries.Leaf{Symbol, Int64}, :a, 1)::Base.HashArrayMappedTries.Leaf{Symbol, Int64} │ %4 = Base.getfield(%2, :data)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %5 = $(Expr(:boundscheck, true))::Bool └── goto #5 if not %5 2 ─ %7 = Base.sub_int(1, 1)::Int64 │ %8 = Base.bitcast(UInt64, %7)::UInt64 │ %9 = Base.getfield(%4, :size)::Tuple{Int64} │ %10 = $(Expr(:boundscheck, true))::Bool │ %11 = Base.getfield(%9, 1, %10)::Int64 │ %12 = Base.bitcast(UInt64, %11)::UInt64 │ %13 = Base.ult_int(%8, %12)::Bool └── goto #4 if not %13 3 ─ goto #5 4 ─ %16 = Core.tuple(1)::Tuple{Int64} │ invoke Base.throw_boundserror(%4::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}, %16::Tuple{Int64})::Union{} └── unreachable 5 ┄ %19 = Base.getfield(%4, :ref)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %20 = Base.memoryref(%19, 1, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ Base.memoryrefset!(%20, %3, :not_atomic, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} └── goto #6 6 ─ %23 = Base.getfield(%2, :bitmap)::UInt32 │ %24 = Base.or_int(%23, 0x00010000)::UInt32 │ Base.setfield!(%2, :bitmap, %24)::UInt32 └── goto #7 7 ─ %27 = %new(Base.PersistentDict{Symbol, Int64}, %2)::Base.PersistentDict{Symbol, Int64} └── goto #8 8 ─ %29 = invoke Base.getindex(%27::Base.PersistentDict{Symbol, Int64}, 🅰️:Symbol)::Int64 └── return %29 ``` After: ``` julia> using BenchmarkTools julia> function foo() a = Base.PersistentDict(:a => 1) return a[:a] end foo (generic function with 1 method) julia> @benchmark foo() BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 2.459 ns … 11.320 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 2.460 ns ┊ GC (median): 0.00% Time (mean ± σ): 2.469 ns ± 0.183 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂ █ ▁ █ ▂ █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█ █ 2.46 ns Histogram: log(frequency) by time 2.47 ns < Memory estimate: 0 bytes, allocs estimate: 0. julia> @code_typed foo() CodeInfo( 1 ─ return 1 ```

@benchmark

This is part of the work to address #51352 by attempting to allow the compiler to perform SRAO on persistent data structures like `PersistentDict` as if they were regular immutable data structures. These sorts of data structures have very complicated internals (with lots of mutation, memory sharing, etc.), but a relatively simple interface. As such, it is unlikely that our compiler will have sufficient power to optimize this interface by analyzing the implementation. We thus need to come up with some other mechanism that gives the compiler license to perform the requisite optimization. One way would be to just hardcode `PersistentDict` into the compiler, optimizing it like any of the other builtin datatypes. However, this is of course very unsatisfying. At the other end of the spectrum would be something like a generic rewrite rule system (e-graphs anyone?) that would let the PersistentDict implementation declare its interface to the compiler and the compiler would use this for optimization (in a perfect world, the actual rewrite would then be checked using some sort of formal methods). I think that would be interesting, but we're very far from even being able to design something like that (at least in Base - experiments with external AbstractInterpreters in this direction are encouraged). This PR tries to come up with a reasonable middle ground, where the compiler gets some knowledge of the protocol hardcoded without having to know about the implementation details of the data structure. The basic ideas is that `Core` provides some magic generic functions that implementations can extend. Semantically, they are not special. They dispatch as usual, and implementations are expected to work properly even in the absence of any compiler optimizations. However, the compiler is semantically permitted to perform structural optimization using these magic generic functions. In the concrete case, this PR introduces the `KeyValue` interface which consists of two generic functions, `get` and `set`. The core optimization is that the compiler is allowed to rewrite any occurrence of `get(set(x, k, v), k)` into `v` without additional legality checks. In particular, the compiler performs no type checks, conversions, etc. The higher level implementation code is expected to do all that. This approach closely matches the general direction we've been taking in external AbstractInterpreters for embedding additional semantics and optimization opportunities into Julia code (although we generally use methods there, rather than full generic functions), so I think we have some evidence that this sort of approach works reasonably well. Nevertheless, this is certainly an experiment and the interface is explicitly declared unstable. This is fully working and implemented, but the optimization currently bails on anything but the simplest cases. Filling all those cases in is not particularly hard, but should be done along with a more invasive refactoring of SROA, so we should figure out the general direction here first and then we can finish all that up in a follow-up cleanup. Before: ``` julia> using BenchmarkTools julia> function foo() a = Base.PersistentDict(:a => 1) return a[:a] end foo (generic function with 1 method) julia> @benchmark foo() BenchmarkTools.Trial: 10000 samples with 993 evaluations. Range (min … max): 32.940 ns … 28.754 μs ┊ GC (min … max): 0.00% … 99.76% Time (median): 49.647 ns ┊ GC (median): 0.00% Time (mean ± σ): 57.519 ns ± 333.275 ns ┊ GC (mean ± σ): 10.81% ± 2.22% ▃█▅ ▁▃▅▅▃▁ ▁▃▂ ▂ ▁▂▄▃▅▇███▇▃▁▂▁▁▁▁▁▁▁▁▂▂▅██████▅▂▁▁▁▁▁▁▁▁▁▁▂▃▃▇███▇▆███▆▄▃▃▂▂ ▃ 32.9 ns Histogram: frequency by time 68.6 ns < Memory estimate: 128 bytes, allocs estimate: 4. julia> @code_typed foo() CodeInfo( 1 ─ %1 = invoke Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}(Base.HashArrayMappedTries.undef::UndefInitializer, 1::Int64)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %2 = %new(Base.HashArrayMappedTries.HAMT{Symbol, Int64}, %1, 0x00000000)::Base.HashArrayMappedTries.HAMT{Symbol, Int64} │ %3 = %new(Base.HashArrayMappedTries.Leaf{Symbol, Int64}, :a, 1)::Base.HashArrayMappedTries.Leaf{Symbol, Int64} │ %4 = Base.getfield(%2, :data)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %5 = $(Expr(:boundscheck, true))::Bool └── goto #5 if not %5 2 ─ %7 = Base.sub_int(1, 1)::Int64 │ %8 = Base.bitcast(UInt64, %7)::UInt64 │ %9 = Base.getfield(%4, :size)::Tuple{Int64} │ %10 = $(Expr(:boundscheck, true))::Bool │ %11 = Base.getfield(%9, 1, %10)::Int64 │ %12 = Base.bitcast(UInt64, %11)::UInt64 │ %13 = Base.ult_int(%8, %12)::Bool └── goto #4 if not %13 3 ─ goto #5 4 ─ %16 = Core.tuple(1)::Tuple{Int64} │ invoke Base.throw_boundserror(%4::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}, %16::Tuple{Int64})::Union{} └── unreachable 5 ┄ %19 = Base.getfield(%4, :ref)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %20 = Base.memoryref(%19, 1, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ Base.memoryrefset!(%20, %3, :not_atomic, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} └── goto #6 6 ─ %23 = Base.getfield(%2, :bitmap)::UInt32 │ %24 = Base.or_int(%23, 0x00010000)::UInt32 │ Base.setfield!(%2, :bitmap, %24)::UInt32 └── goto #7 7 ─ %27 = %new(Base.PersistentDict{Symbol, Int64}, %2)::Base.PersistentDict{Symbol, Int64} └── goto #8 8 ─ %29 = invoke Base.getindex(%27::Base.PersistentDict{Symbol, Int64}, 🅰️:Symbol)::Int64 └── return %29 ``` After: ``` julia> using BenchmarkTools julia> function foo() a = Base.PersistentDict(:a => 1) return a[:a] end foo (generic function with 1 method) julia> @benchmark foo() BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 2.459 ns … 11.320 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 2.460 ns ┊ GC (median): 0.00% Time (mean ± σ): 2.469 ns ± 0.183 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂ █ ▁ █ ▂ █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█ █ 2.46 ns Histogram: log(frequency) by time 2.47 ns < Memory estimate: 0 bytes, allocs estimate: 0. julia> @code_typed foo() CodeInfo( 1 ─ return 1 ```

@benchmark

This is part of the work to address #51352 by attempting to allow the compiler to perform SRAO on persistent data structures like `PersistentDict` as if they were regular immutable data structures. These sorts of data structures have very complicated internals (with lots of mutation, memory sharing, etc.), but a relatively simple interface. As such, it is unlikely that our compiler will have sufficient power to optimize this interface by analyzing the implementation. We thus need to come up with some other mechanism that gives the compiler license to perform the requisite optimization. One way would be to just hardcode `PersistentDict` into the compiler, optimizing it like any of the other builtin datatypes. However, this is of course very unsatisfying. At the other end of the spectrum would be something like a generic rewrite rule system (e-graphs anyone?) that would let the PersistentDict implementation declare its interface to the compiler and the compiler would use this for optimization (in a perfect world, the actual rewrite would then be checked using some sort of formal methods). I think that would be interesting, but we're very far from even being able to design something like that (at least in Base - experiments with external AbstractInterpreters in this direction are encouraged). This PR tries to come up with a reasonable middle ground, where the compiler gets some knowledge of the protocol hardcoded without having to know about the implementation details of the data structure. The basic ideas is that `Core` provides some magic generic functions that implementations can extend. Semantically, they are not special. They dispatch as usual, and implementations are expected to work properly even in the absence of any compiler optimizations. However, the compiler is semantically permitted to perform structural optimization using these magic generic functions. In the concrete case, this PR introduces the `KeyValue` interface which consists of two generic functions, `get` and `set`. The core optimization is that the compiler is allowed to rewrite any occurrence of `get(set(x, k, v), k)` into `v` without additional legality checks. In particular, the compiler performs no type checks, conversions, etc. The higher level implementation code is expected to do all that. This approach closely matches the general direction we've been taking in external AbstractInterpreters for embedding additional semantics and optimization opportunities into Julia code (although we generally use methods there, rather than full generic functions), so I think we have some evidence that this sort of approach works reasonably well. Nevertheless, this is certainly an experiment and the interface is explicitly declared unstable. This is fully working and implemented, but the optimization currently bails on anything but the simplest cases. Filling all those cases in is not particularly hard, but should be done along with a more invasive refactoring of SROA, so we should figure out the general direction here first and then we can finish all that up in a follow-up cleanup. Before: ``` julia> using BenchmarkTools julia> function foo() a = Base.PersistentDict(:a => 1) return a[:a] end foo (generic function with 1 method) julia> @benchmark foo() BenchmarkTools.Trial: 10000 samples with 993 evaluations. Range (min … max): 32.940 ns … 28.754 μs ┊ GC (min … max): 0.00% … 99.76% Time (median): 49.647 ns ┊ GC (median): 0.00% Time (mean ± σ): 57.519 ns ± 333.275 ns ┊ GC (mean ± σ): 10.81% ± 2.22% ▃█▅ ▁▃▅▅▃▁ ▁▃▂ ▂ ▁▂▄▃▅▇███▇▃▁▂▁▁▁▁▁▁▁▁▂▂▅██████▅▂▁▁▁▁▁▁▁▁▁▁▂▃▃▇███▇▆███▆▄▃▃▂▂ ▃ 32.9 ns Histogram: frequency by time 68.6 ns < Memory estimate: 128 bytes, allocs estimate: 4. julia> @code_typed foo() CodeInfo( 1 ─ %1 = invoke Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}(Base.HashArrayMappedTries.undef::UndefInitializer, 1::Int64)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %2 = %new(Base.HashArrayMappedTries.HAMT{Symbol, Int64}, %1, 0x00000000)::Base.HashArrayMappedTries.HAMT{Symbol, Int64} │ %3 = %new(Base.HashArrayMappedTries.Leaf{Symbol, Int64}, :a, 1)::Base.HashArrayMappedTries.Leaf{Symbol, Int64} │ %4 = Base.getfield(%2, :data)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %5 = $(Expr(:boundscheck, true))::Bool └── goto #5 if not %5 2 ─ %7 = Base.sub_int(1, 1)::Int64 │ %8 = Base.bitcast(UInt64, %7)::UInt64 │ %9 = Base.getfield(%4, :size)::Tuple{Int64} │ %10 = $(Expr(:boundscheck, true))::Bool │ %11 = Base.getfield(%9, 1, %10)::Int64 │ %12 = Base.bitcast(UInt64, %11)::UInt64 │ %13 = Base.ult_int(%8, %12)::Bool └── goto #4 if not %13 3 ─ goto #5 4 ─ %16 = Core.tuple(1)::Tuple{Int64} │ invoke Base.throw_boundserror(%4::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}, %16::Tuple{Int64})::Union{} └── unreachable 5 ┄ %19 = Base.getfield(%4, :ref)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %20 = Base.memoryref(%19, 1, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ Base.memoryrefset!(%20, %3, :not_atomic, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} └── goto #6 6 ─ %23 = Base.getfield(%2, :bitmap)::UInt32 │ %24 = Base.or_int(%23, 0x00010000)::UInt32 │ Base.setfield!(%2, :bitmap, %24)::UInt32 └── goto #7 7 ─ %27 = %new(Base.PersistentDict{Symbol, Int64}, %2)::Base.PersistentDict{Symbol, Int64} └── goto #8 8 ─ %29 = invoke Base.getindex(%27::Base.PersistentDict{Symbol, Int64}, 🅰️:Symbol)::Int64 └── return %29 ``` After: ``` julia> using BenchmarkTools julia> function foo() a = Base.PersistentDict(:a => 1) return a[:a] end foo (generic function with 1 method) julia> @benchmark foo() BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 2.459 ns … 11.320 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 2.460 ns ┊ GC (median): 0.00% Time (mean ± σ): 2.469 ns ± 0.183 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂ █ ▁ █ ▂ █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█ █ 2.46 ns Histogram: log(frequency) by time 2.47 ns < Memory estimate: 0 bytes, allocs estimate: 0. julia> @code_typed foo() CodeInfo( 1 ─ return 1 ```

@benchmark

This is part of the work to address #51352 by attempting to allow the compiler to perform SRAO on persistent data structures like `PersistentDict` as if they were regular immutable data structures. These sorts of data structures have very complicated internals (with lots of mutation, memory sharing, etc.), but a relatively simple interface. As such, it is unlikely that our compiler will have sufficient power to optimize this interface by analyzing the implementation. We thus need to come up with some other mechanism that gives the compiler license to perform the requisite optimization. One way would be to just hardcode `PersistentDict` into the compiler, optimizing it like any of the other builtin datatypes. However, this is of course very unsatisfying. At the other end of the spectrum would be something like a generic rewrite rule system (e-graphs anyone?) that would let the PersistentDict implementation declare its interface to the compiler and the compiler would use this for optimization (in a perfect world, the actual rewrite would then be checked using some sort of formal methods). I think that would be interesting, but we're very far from even being able to design something like that (at least in Base - experiments with external AbstractInterpreters in this direction are encouraged). This PR tries to come up with a reasonable middle ground, where the compiler gets some knowledge of the protocol hardcoded without having to know about the implementation details of the data structure. The basic ideas is that `Core` provides some magic generic functions that implementations can extend. Semantically, they are not special. They dispatch as usual, and implementations are expected to work properly even in the absence of any compiler optimizations. However, the compiler is semantically permitted to perform structural optimization using these magic generic functions. In the concrete case, this PR introduces the `KeyValue` interface which consists of two generic functions, `get` and `set`. The core optimization is that the compiler is allowed to rewrite any occurrence of `get(set(x, k, v), k)` into `v` without additional legality checks. In particular, the compiler performs no type checks, conversions, etc. The higher level implementation code is expected to do all that. This approach closely matches the general direction we've been taking in external AbstractInterpreters for embedding additional semantics and optimization opportunities into Julia code (although we generally use methods there, rather than full generic functions), so I think we have some evidence that this sort of approach works reasonably well. Nevertheless, this is certainly an experiment and the interface is explicitly declared unstable. This is fully working and implemented, but the optimization currently bails on anything but the simplest cases. Filling all those cases in is not particularly hard, but should be done along with a more invasive refactoring of SROA, so we should figure out the general direction here first and then we can finish all that up in a follow-up cleanup. Before: ``` julia> using BenchmarkTools julia> function foo() a = Base.PersistentDict(:a => 1) return a[:a] end foo (generic function with 1 method) julia> @benchmark foo() BenchmarkTools.Trial: 10000 samples with 993 evaluations. Range (min … max): 32.940 ns … 28.754 μs ┊ GC (min … max): 0.00% … 99.76% Time (median): 49.647 ns ┊ GC (median): 0.00% Time (mean ± σ): 57.519 ns ± 333.275 ns ┊ GC (mean ± σ): 10.81% ± 2.22% ▃█▅ ▁▃▅▅▃▁ ▁▃▂ ▂ ▁▂▄▃▅▇███▇▃▁▂▁▁▁▁▁▁▁▁▂▂▅██████▅▂▁▁▁▁▁▁▁▁▁▁▂▃▃▇███▇▆███▆▄▃▃▂▂ ▃ 32.9 ns Histogram: frequency by time 68.6 ns < Memory estimate: 128 bytes, allocs estimate: 4. julia> @code_typed foo() CodeInfo( 1 ─ %1 = invoke Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}(Base.HashArrayMappedTries.undef::UndefInitializer, 1::Int64)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %2 = %new(Base.HashArrayMappedTries.HAMT{Symbol, Int64}, %1, 0x00000000)::Base.HashArrayMappedTries.HAMT{Symbol, Int64} │ %3 = %new(Base.HashArrayMappedTries.Leaf{Symbol, Int64}, :a, 1)::Base.HashArrayMappedTries.Leaf{Symbol, Int64} │ %4 = Base.getfield(%2, :data)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %5 = $(Expr(:boundscheck, true))::Bool └── goto #5 if not %5 2 ─ %7 = Base.sub_int(1, 1)::Int64 │ %8 = Base.bitcast(UInt64, %7)::UInt64 │ %9 = Base.getfield(%4, :size)::Tuple{Int64} │ %10 = $(Expr(:boundscheck, true))::Bool │ %11 = Base.getfield(%9, 1, %10)::Int64 │ %12 = Base.bitcast(UInt64, %11)::UInt64 │ %13 = Base.ult_int(%8, %12)::Bool └── goto #4 if not %13 3 ─ goto #5 4 ─ %16 = Core.tuple(1)::Tuple{Int64} │ invoke Base.throw_boundserror(%4::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}, %16::Tuple{Int64})::Union{} └── unreachable 5 ┄ %19 = Base.getfield(%4, :ref)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %20 = Base.memoryref(%19, 1, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ Base.memoryrefset!(%20, %3, :not_atomic, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} └── goto #6 6 ─ %23 = Base.getfield(%2, :bitmap)::UInt32 │ %24 = Base.or_int(%23, 0x00010000)::UInt32 │ Base.setfield!(%2, :bitmap, %24)::UInt32 └── goto #7 7 ─ %27 = %new(Base.PersistentDict{Symbol, Int64}, %2)::Base.PersistentDict{Symbol, Int64} └── goto #8 8 ─ %29 = invoke Base.getindex(%27::Base.PersistentDict{Symbol, Int64}, 🅰️:Symbol)::Int64 └── return %29 ``` After: ``` julia> using BenchmarkTools julia> function foo() a = Base.PersistentDict(:a => 1) return a[:a] end foo (generic function with 1 method) julia> @benchmark foo() BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 2.459 ns … 11.320 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 2.460 ns ┊ GC (median): 0.00% Time (mean ± σ): 2.469 ns ± 0.183 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂ █ ▁ █ ▂ █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█ █ 2.46 ns Histogram: log(frequency) by time 2.47 ns < Memory estimate: 0 bytes, allocs estimate: 0. julia> @code_typed foo() CodeInfo( 1 ─ return 1 ```

@benchmark

This is part of the work to address #51352 by attempting to allow the compiler to perform SRAO on persistent data structures like `PersistentDict` as if they were regular immutable data structures. These sorts of data structures have very complicated internals (with lots of mutation, memory sharing, etc.), but a relatively simple interface. As such, it is unlikely that our compiler will have sufficient power to optimize this interface by analyzing the implementation. We thus need to come up with some other mechanism that gives the compiler license to perform the requisite optimization. One way would be to just hardcode `PersistentDict` into the compiler, optimizing it like any of the other builtin datatypes. However, this is of course very unsatisfying. At the other end of the spectrum would be something like a generic rewrite rule system (e-graphs anyone?) that would let the PersistentDict implementation declare its interface to the compiler and the compiler would use this for optimization (in a perfect world, the actual rewrite would then be checked using some sort of formal methods). I think that would be interesting, but we're very far from even being able to design something like that (at least in Base - experiments with external AbstractInterpreters in this direction are encouraged). This PR tries to come up with a reasonable middle ground, where the compiler gets some knowledge of the protocol hardcoded without having to know about the implementation details of the data structure. The basic ideas is that `Core` provides some magic generic functions that implementations can extend. Semantically, they are not special. They dispatch as usual, and implementations are expected to work properly even in the absence of any compiler optimizations. However, the compiler is semantically permitted to perform structural optimization using these magic generic functions. In the concrete case, this PR introduces the `KeyValue` interface which consists of two generic functions, `get` and `set`. The core optimization is that the compiler is allowed to rewrite any occurrence of `get(set(x, k, v), k)` into `v` without additional legality checks. In particular, the compiler performs no type checks, conversions, etc. The higher level implementation code is expected to do all that. This approach closely matches the general direction we've been taking in external AbstractInterpreters for embedding additional semantics and optimization opportunities into Julia code (although we generally use methods there, rather than full generic functions), so I think we have some evidence that this sort of approach works reasonably well. Nevertheless, this is certainly an experiment and the interface is explicitly declared unstable. ## Current Status This is fully working and implemented, but the optimization currently bails on anything but the simplest cases. Filling all those cases in is not particularly hard, but should be done along with a more invasive refactoring of SROA, so we should figure out the general direction here first and then we can finish all that up in a follow-up cleanup. ## Obligatory benchmark Before: ``` julia> using BenchmarkTools julia> function foo() a = Base.PersistentDict(:a => 1) return a[:a] end foo (generic function with 1 method) julia> @benchmark foo() BenchmarkTools.Trial: 10000 samples with 993 evaluations. Range (min … max): 32.940 ns … 28.754 μs ┊ GC (min … max): 0.00% … 99.76% Time (median): 49.647 ns ┊ GC (median): 0.00% Time (mean ± σ): 57.519 ns ± 333.275 ns ┊ GC (mean ± σ): 10.81% ± 2.22% ▃█▅ ▁▃▅▅▃▁ ▁▃▂ ▂ ▁▂▄▃▅▇███▇▃▁▂▁▁▁▁▁▁▁▁▂▂▅██████▅▂▁▁▁▁▁▁▁▁▁▁▂▃▃▇███▇▆███▆▄▃▃▂▂ ▃ 32.9 ns Histogram: frequency by time 68.6 ns < Memory estimate: 128 bytes, allocs estimate: 4. julia> @code_typed foo() CodeInfo( 1 ─ %1 = invoke Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}(Base.HashArrayMappedTries.undef::UndefInitializer, 1::Int64)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %2 = %new(Base.HashArrayMappedTries.HAMT{Symbol, Int64}, %1, 0x00000000)::Base.HashArrayMappedTries.HAMT{Symbol, Int64} │ %3 = %new(Base.HashArrayMappedTries.Leaf{Symbol, Int64}, :a, 1)::Base.HashArrayMappedTries.Leaf{Symbol, Int64} │ %4 = Base.getfield(%2, :data)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %5 = $(Expr(:boundscheck, true))::Bool └── goto #5 if not %5 2 ─ %7 = Base.sub_int(1, 1)::Int64 │ %8 = Base.bitcast(UInt64, %7)::UInt64 │ %9 = Base.getfield(%4, :size)::Tuple{Int64} │ %10 = $(Expr(:boundscheck, true))::Bool │ %11 = Base.getfield(%9, 1, %10)::Int64 │ %12 = Base.bitcast(UInt64, %11)::UInt64 │ %13 = Base.ult_int(%8, %12)::Bool └── goto #4 if not %13 3 ─ goto #5 4 ─ %16 = Core.tuple(1)::Tuple{Int64} │ invoke Base.throw_boundserror(%4::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}, %16::Tuple{Int64})::Union{} └── unreachable 5 ┄ %19 = Base.getfield(%4, :ref)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %20 = Base.memoryref(%19, 1, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ Base.memoryrefset!(%20, %3, :not_atomic, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} └── goto #6 6 ─ %23 = Base.getfield(%2, :bitmap)::UInt32 │ %24 = Base.or_int(%23, 0x00010000)::UInt32 │ Base.setfield!(%2, :bitmap, %24)::UInt32 └── goto #7 7 ─ %27 = %new(Base.PersistentDict{Symbol, Int64}, %2)::Base.PersistentDict{Symbol, Int64} └── goto #8 8 ─ %29 = invoke Base.getindex(%27::Base.PersistentDict{Symbol, Int64}, 🅰️:Symbol)::Int64 └── return %29 ``` After: ``` julia> using BenchmarkTools julia> function foo() a = Base.PersistentDict(:a => 1) return a[:a] end foo (generic function with 1 method) julia> @benchmark foo() BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 2.459 ns … 11.320 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 2.460 ns ┊ GC (median): 0.00% Time (mean ± σ): 2.469 ns ± 0.183 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂ █ ▁ █ ▂ █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█ █ 2.46 ns Histogram: log(frequency) by time 2.47 ns < Memory estimate: 0 bytes, allocs estimate: 0. julia> @code_typed foo() CodeInfo( 1 ─ return 1 ```

@benchmark

This is part of the work to address JuliaLang#51352 by attempting to allow the compiler to perform SRAO on persistent data structures like `PersistentDict` as if they were regular immutable data structures. These sorts of data structures have very complicated internals (with lots of mutation, memory sharing, etc.), but a relatively simple interface. As such, it is unlikely that our compiler will have sufficient power to optimize this interface by analyzing the implementation. We thus need to come up with some other mechanism that gives the compiler license to perform the requisite optimization. One way would be to just hardcode `PersistentDict` into the compiler, optimizing it like any of the other builtin datatypes. However, this is of course very unsatisfying. At the other end of the spectrum would be something like a generic rewrite rule system (e-graphs anyone?) that would let the PersistentDict implementation declare its interface to the compiler and the compiler would use this for optimization (in a perfect world, the actual rewrite would then be checked using some sort of formal methods). I think that would be interesting, but we're very far from even being able to design something like that (at least in Base - experiments with external AbstractInterpreters in this direction are encouraged). This PR tries to come up with a reasonable middle ground, where the compiler gets some knowledge of the protocol hardcoded without having to know about the implementation details of the data structure. The basic ideas is that `Core` provides some magic generic functions that implementations can extend. Semantically, they are not special. They dispatch as usual, and implementations are expected to work properly even in the absence of any compiler optimizations. However, the compiler is semantically permitted to perform structural optimization using these magic generic functions. In the concrete case, this PR introduces the `KeyValue` interface which consists of two generic functions, `get` and `set`. The core optimization is that the compiler is allowed to rewrite any occurrence of `get(set(x, k, v), k)` into `v` without additional legality checks. In particular, the compiler performs no type checks, conversions, etc. The higher level implementation code is expected to do all that. This approach closely matches the general direction we've been taking in external AbstractInterpreters for embedding additional semantics and optimization opportunities into Julia code (although we generally use methods there, rather than full generic functions), so I think we have some evidence that this sort of approach works reasonably well. Nevertheless, this is certainly an experiment and the interface is explicitly declared unstable. ## Current Status This is fully working and implemented, but the optimization currently bails on anything but the simplest cases. Filling all those cases in is not particularly hard, but should be done along with a more invasive refactoring of SROA, so we should figure out the general direction here first and then we can finish all that up in a follow-up cleanup. ## Obligatory benchmark Before: ``` julia> using BenchmarkTools julia> function foo() a = Base.PersistentDict(:a => 1) return a[:a] end foo (generic function with 1 method) julia> @benchmark foo() BenchmarkTools.Trial: 10000 samples with 993 evaluations. Range (min … max): 32.940 ns … 28.754 μs ┊ GC (min … max): 0.00% … 99.76% Time (median): 49.647 ns ┊ GC (median): 0.00% Time (mean ± σ): 57.519 ns ± 333.275 ns ┊ GC (mean ± σ): 10.81% ± 2.22% ▃█▅ ▁▃▅▅▃▁ ▁▃▂ ▂ ▁▂▄▃▅▇███▇▃▁▂▁▁▁▁▁▁▁▁▂▂▅██████▅▂▁▁▁▁▁▁▁▁▁▁▂▃▃▇███▇▆███▆▄▃▃▂▂ ▃ 32.9 ns Histogram: frequency by time 68.6 ns < Memory estimate: 128 bytes, allocs estimate: 4. julia> @code_typed foo() CodeInfo( 1 ─ %1 = invoke Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}(Base.HashArrayMappedTries.undef::UndefInitializer, 1::Int64)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %2 = %new(Base.HashArrayMappedTries.HAMT{Symbol, Int64}, %1, 0x00000000)::Base.HashArrayMappedTries.HAMT{Symbol, Int64} │ %3 = %new(Base.HashArrayMappedTries.Leaf{Symbol, Int64}, :a, 1)::Base.HashArrayMappedTries.Leaf{Symbol, Int64} │ %4 = Base.getfield(%2, :data)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %5 = $(Expr(:boundscheck, true))::Bool └── goto JuliaLang#5 if not %5 2 ─ %7 = Base.sub_int(1, 1)::Int64 │ %8 = Base.bitcast(UInt64, %7)::UInt64 │ %9 = Base.getfield(%4, :size)::Tuple{Int64} │ %10 = $(Expr(:boundscheck, true))::Bool │ %11 = Base.getfield(%9, 1, %10)::Int64 │ %12 = Base.bitcast(UInt64, %11)::UInt64 │ %13 = Base.ult_int(%8, %12)::Bool └── goto JuliaLang#4 if not %13 3 ─ goto JuliaLang#5 4 ─ %16 = Core.tuple(1)::Tuple{Int64} │ invoke Base.throw_boundserror(%4::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}, %16::Tuple{Int64})::Union{} └── unreachable 5 ┄ %19 = Base.getfield(%4, :ref)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ %20 = Base.memoryref(%19, 1, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} │ Base.memoryrefset!(%20, %3, :not_atomic, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}} └── goto JuliaLang#6 6 ─ %23 = Base.getfield(%2, :bitmap)::UInt32 │ %24 = Base.or_int(%23, 0x00010000)::UInt32 │ Base.setfield!(%2, :bitmap, %24)::UInt32 └── goto JuliaLang#7 7 ─ %27 = %new(Base.PersistentDict{Symbol, Int64}, %2)::Base.PersistentDict{Symbol, Int64} └── goto JuliaLang#8 8 ─ %29 = invoke Base.getindex(%27::Base.PersistentDict{Symbol, Int64}, 🅰️:Symbol)::Int64 └── return %29 ``` After: ``` julia> using BenchmarkTools julia> function foo() a = Base.PersistentDict(:a => 1) return a[:a] end foo (generic function with 1 method) julia> @benchmark foo() BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 2.459 ns … 11.320 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 2.460 ns ┊ GC (median): 0.00% Time (mean ± σ): 2.469 ns ± 0.183 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂ █ ▁ █ ▂ █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█ █ 2.46 ns Histogram: log(frequency) by time 2.47 ns < Memory estimate: 0 bytes, allocs estimate: 0. julia> @code_typed foo() CodeInfo( 1 ─ return 1 ```

Pass the types to the allocator functions. ------- Before this PR, we were missing the types for allocations in two cases: 1. allocations from codegen 2. allocations in `gc_managed_realloc_` The second one is easy: those are always used for buffers, right? For the first one: we extend the allocation functions called from codegen, to take the type as a parameter, and set the tag there. I kept the old interfaces around, since I think that they cannot be removed due to supporting legacy code? ------ An example of the generated code: ```julia %ptls_field6 = getelementptr inbounds {}**, {}*** %4, i64 2 %13 = bitcast {}*** %ptls_field6 to i8** %ptls_load78 = load i8*, i8** %13, align 8 %box = call noalias nonnull dereferenceable(32) {}* @ijl_gc_pool_alloc_typed(i8* %ptls_load78, i32 1184, i32 32, i64 4366152144) #7 ``` Fixes #43688. Fixes #45268. Co-authored-by: Valentin Churavy <[email protected]>

ghost assigned StefanKarpinski Apr 27, 2011

JeffBezanson added a commit that referenced this issue Feb 6, 2012

implementing remote parallel load, closes #81

d42e36a

old load() is now called include() this also makes it easy to work on #7 array constructor tweaks

JeffBezanson closed this as completed in 5587896 Feb 10, 2012

ViralBShah mentioned this issue Mar 16, 2012

WIP: libuv rewrite and Windows port #521

Merged

vtjnash mentioned this issue Mar 17, 2012

build fails at sys.ji #600

Closed

ViralBShah mentioned this issue Mar 18, 2012

Building julia with clang: tests fail #602

Closed

kk49 mentioned this issue May 9, 2012

finfer causes segmentation fault #816

Closed

forkandwait mentioned this issue May 12, 2012

Failed build on FreeBSD #829

Closed

6e441f9c mentioned this issue May 30, 2012

Julia segfaults while loading stage0.jl #901

Closed

vtjnash pushed a commit to vtjnash/julia that referenced this issue Jun 1, 2012

Close most issues for the windows port

23cb6ad

Closes JuliaLang#7 Closes JuliaLang#10 Closes JuliaLang#13

dioptre mentioned this issue Jul 30, 2012

error during bootstrap on OS X #1089

Closed

nolta mentioned this issue Oct 4, 2012

quantile chokes on integral vectors if it has to interpolate #1333

Closed

fundamental mentioned this issue Apr 11, 2013

Regression in reload() Results in REPL Hanging #2829

Closed

railskipl mentioned this issue May 3, 2013

Tab-Expansion of Base in REPL causes SIGABRT #2064

Closed

staticfloat mentioned this issue Jan 18, 2014

Build failure on Fedora #5422

Closed

cmundi mentioned this issue Mar 18, 2014

RFC: Windows CI #6028

Merged

svaksha mentioned this issue May 3, 2014

[openblas] make testall - error in spawn.jl #6728

Closed

jameskyle mentioned this issue Nov 26, 2014

Illegal hardware instruction when performing inner products #9154

Closed

staticfloat mentioned this issue Dec 18, 2014

Current master does not start on Mac OS 10.10 #9380

Closed

yuyichao mentioned this issue May 26, 2015

RFC: Try to generate reusable jlcall wrapper #11439

Open

MaheshWaidande mentioned this issue Jul 27, 2015

Julia binary goes in infinite wait on Ubuntu 14.10 (ppc64le) #12326

Closed

chipkent mentioned this issue Jul 19, 2016

Possible LLVM segfault #17495

Closed

vchuravy mentioned this issue Jan 1, 2017

Basic support for builtins from compiler-rt #18734

Closed

5 tasks

Keno mentioned this issue Sep 5, 2023

Standardize the entry-point for Julia execution #50974

Merged

Keno added a commit that referenced this issue Oct 9, 2023

Automatically step through kw wrappers

5ebe035

Fixes #7

Keno pushed a commit that referenced this issue Oct 9, 2023

Merge pull request #7 from JuliaDebug/teh/varargs

4d11b77

Fix `getproperty` calls and handle Core.Typeof(Vararg)

jcphill mentioned this issue Apr 18, 2024

broadcast of short-circuiting boolean operator || generates runtime dispatch and per-element allocations #54141

Closed

Andrew-Milestone mentioned this issue Apr 28, 2024

Error adding NeuralOperators in Julia v1.10.2 #54292

Closed

Keno pushed a commit that referenced this issue Jun 5, 2024

docs: Add make script (#7)

8799660

gbaraldi mentioned this issue Jul 2, 2024

Bump LLVM to v18 #54848

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

load path #7

load path #7

StefanKarpinski commented Apr 27, 2011

JeffBezanson commented Jun 28, 2011

StefanKarpinski commented Jun 28, 2011

JeffBezanson commented Feb 10, 2012

StefanKarpinski commented Feb 10, 2012

load path #7

load path #7

Comments

StefanKarpinski commented Apr 27, 2011

JeffBezanson commented Jun 28, 2011

StefanKarpinski commented Jun 28, 2011

JeffBezanson commented Feb 10, 2012

StefanKarpinski commented Feb 10, 2012