
The Dynamic Runtime Inlining (DRTI) project

With this project it is possible to take the output from an LLVM compiler such as clang and allow selected parts of the code to recompile themselves at runtime. This can inline function calls made via function pointers including virtual function dispatch and across shared-object boundaries. In these cases the targets of the calls are generally known only at runtime, hence the name "Runtime Inlining".

This is different to LLVM's Link Time Optimizations and can inline cases that LTO doesn't handle. It also doesn't involve an LLVM bitcode interpreter; it runs native machine code generated by Ahead of Time (AOT) compilation that can make calls to a Just in Time (JIT) runtime compiler when needed.

The basic concept is to get the normal compiler front end like clang to emit its LLVM Intermediate Representation (IR) as a bitcode file, then run a custom LLVM pass over this to inject DRTI code at the appropriate places. The custom pass also embeds the original bitcode as a binary string in the output so that the bitcode is available at runtime for recompilation. The command-line tool llc compiles the adapted bitcode into a native object file.

Although the concept is fairly simple, there were (and still are) several challenges in its implementation. Currently the project is a proof of concept: it demonstrates that the idea can work, but it is unlikely to improve the speed of any real-world application.

Supported Platforms

For now the code only works on Linux on the x86-64 architecture. It requires a small patch to the LLVM sources, which I've tested with LLVM versions 12 and 13.

Building the code

  1. You'll need to download and unpack the LLVM sources (e.g. llvm-project-12.0.1.src.tar.xz)
  2. Clone the drti git repository, e.g. git clone https://github.com/drti/drti.git
  3. Apply the small LLVM source patch llvm-drti.patch from the drti root directory into the llvm subdirectory, e.g. cd llvm-project-12.0.1.src/llvm && patch -p1 <$HOME/drti/llvm-drti.patch
  4. Configure, build and install LLVM and clang. Some suggested cmake options are -DLLVM_TARGETS_TO_BUILD="X86" -DLLVM_ENABLE_PROJECTS="clang;compiler-rt" -DCMAKE_INSTALL_PREFIX=$HOME/install/llvm-12

You can then cd to the drti root directory to build drti and run its tests using make, providing the path to the root of your fresh LLVM installation, e.g.

make LLVM_EXE_ROOT_DIR=$HOME/install/llvm-12

Just to be clear, LLVM_EXE_ROOT_DIR should be set so that the Makefile can run $(LLVM_EXE_ROOT_DIR)/bin/llvm-config.

Please send the author feedback if you do try this out; I would be interested to hear if anyone else gets it working.

Examples and tests

There are some tests in the tests subdirectory which demonstrate function pointer and virtual function inlining. The Makefiles contain the logic for building the DRTI decoration pass and the runtime support functions, and for invoking the decoration pass on the test code. As noted in the "Implementation" section, DRTI currently requires an explicit list of candidate functions to decorate, which is contained in tests/drti_test_targets.txt

Implementation

This section details some of the complexities of making DRTI work, including some remaining areas for further development.

Call-target identification

As mentioned in the introduction, DRTI has an ahead-of-time pass that "decorates" functions in preparation for runtime inlining. However it does not require that an entire program and all its libraries be decorated. Decorated functions may call in to undecorated ones and vice-versa.

The problem, then, is for one decorated function to discover when the target of a function call is another decorated function. Or to put it another way, two decorated functions have to be able to recognise an edge between them in the call graph, but only when it is a direct edge with no intervening calls. The second condition is very important, since a call chain from A* -> B -> C* (where only A* and C* are decorated functions) must not result in mistakenly recompiling A with a direct call to C.

Unfortunately that means that a simple solution such as using thread-specific storage (TSS) to hold a per-thread DRTI call stack won't work, because only decorated functions would update the state, and intervening non-decorated functions would be difficult to detect.

I thought about this problem for a while and have come up with what I believe is a robust mechanism for communication between decorated functions. This is only partly implemented at the moment, for reasons explained in the following description.

The first step is to find a register in which a decorated function can pass a pointer to the latest node in the DRTI call tree. Within the Linux x86-64 Application Binary Interface (ABI), r12 to r15 are suitable. The other registers are not: they may already be in use for the function call itself or, in the case of r11, can be overwritten before control reaches the called function. I've chosen r14 to avoid unnecessary incompatibility with the swiftcall convention after coming across D18092 and D18108. For this part TSS would also work, but the register mechanism is a bit simpler.

The second and trickier part is to get a decorated function to recognise when the r14 register is valid on entry, i.e. when the call came directly from another decorated function. Perhaps surprisingly, there is a way to do this by passing information via the return address which is pushed onto the stack by the caller. The return address normally points to the next instruction after a call, but if we arrange for the bytes preceding the return address to contain a "magic" value instead of a call instruction, the callee can take the return address from the stack and dereference it (minus an offset) to check for the magic value. This assumes that the magic value could not occur in a real call instruction, and also that the return address is not too close to the start of a page, since the preceding page isn't guaranteed to be readable and we don't want to SEGV under any circumstances. If any other runtime instrumentation systems use the same trick, they would of course have to choose a distinct magic value.

This actually leaves one gap in the detection mechanism since a "tail" call to a decorated function via a jump does not push a return address; it leaves its own caller's return address on top of the stack. So in the call chain A* -> B -> C* where B -> C* is a jump rather than a call, C* would mistakenly conclude that it had been called directly by A*. There is a way out of this problem though, since B is required to preserve the callee-saved r14 register. This means that C* can safely read the call tree information provided by A* and examine the final call's target address to determine whether the call would have gone directly to C* or not. This is made more complicated by the presence of the procedure linkage table (PLT) and virtual function call "thunks" but I believe it is doable.

In the existing proof of concept the tail-call detection is not yet implemented, but there is a test case in the code that demonstrates the problem. The magic value is also stored slightly further away from the return address than described above. This last issue can be fixed by making each decorated call go via a small trampoline function that overwrites the return address and then jumps to the original call target (of course, this will upset the CPU's return address stack a little; see Microbenchmarking Return Address Branch Prediction for a nice discussion of this). The manipulated return address can then point to an instruction that jumps back to the code after the original call site and can be immediately preceded by the magic value. This avoids any risk of mistaking a normal call for a decorated one.

Protection against unreadable preceding pages is already implemented, by forcing the return addresses to be highly aligned. This means the return is either to the zeroth byte of a page or sufficiently far into the page that the magic value is guaranteed to be readable.
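The readability guard and the magic-value check can be sketched in C++. Everything here is illustrative: the magic constant, the page size, and the helper names are hypothetical, not the values DRTI actually uses, and a return address exactly at a page boundary is simply treated as "not decorated" since the preceding bytes cannot be safely read.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical constants - not the values DRTI actually uses.
constexpr std::uint32_t kMagic = 0x44525449u;  // "DRTI" in ASCII
constexpr std::size_t kPageSize = 4096;

// Reading the bytes immediately before the return address is safe only
// when they lie on the same page, since the preceding page may be unmapped.
bool safe_to_read_magic(std::uintptr_t return_address)
{
    return (return_address % kPageSize) >= sizeof(kMagic);
}

// Check whether the caller planted the magic value just before the return
// address, marking the call site as decorated.
bool called_from_decorated(const void* return_address)
{
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(return_address);
    if (!safe_to_read_magic(addr))
        return false;  // too close to a page boundary: assume undecorated
    std::uint32_t value;
    std::memcpy(&value, reinterpret_cast<const char*>(addr) - sizeof(value),
                sizeof(value));
    return value == kMagic;
}
```

In the real scheme the callee would obtain the return address from the stack (e.g. via the frame) before calling a check like this; here the address is passed in explicitly so the logic can be exercised on an ordinary buffer.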

Deciding what calls to inline

Currently the developer must provide an explicit list of functions to be decorated by the DRTI pass. This is done via an environment variable and/or a text file. All chains of three decorated functions will be recompiled (once only) the first time the chain is discovered at runtime.

In theory the explicit list is not necessary but doing without would require significantly more work to reduce the overheads of the injected code and to implement runtime heuristics to decide when to recompile. There is some background on efficient implementation techniques in the paper Deriving Code Coverage Information from Profiling Data Recorded for a Trace-based Just-in-time Compiler by Christian Häubl, Christian Wimmer and Hanspeter Mössenböck.
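Reading the candidate list from a text file and an environment variable, as described above, could look something like the following sketch. The environment variable name, the one-name-per-line file format, and the colon separator are assumptions for illustration, not necessarily DRTI's actual conventions.

```cpp
#include <cstdlib>
#include <fstream>
#include <set>
#include <sstream>
#include <string>

// Collect the set of functions to decorate: one mangled name per line in a
// text file, optionally supplemented by a colon-separated environment
// variable. Names and formats here are hypothetical.
std::set<std::string> read_targets(const std::string& path)
{
    std::set<std::string> targets;
    std::ifstream file(path);
    std::string line;
    while (std::getline(file, line))
        if (!line.empty())
            targets.insert(line);
    if (const char* extra = std::getenv("DRTI_TARGETS")) {
        std::stringstream stream(extra);
        while (std::getline(stream, line, ':'))
            if (!line.empty())
                targets.insert(line);
    }
    return targets;
}
```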

Call re-targeting

DRTI does not have an implementation of "on stack replacement". This means that the only way that inlining can actually happen is where there is a chain of at least three decorated calls. The first call has to be decorated so that it can be retargeted to a recompiled version of its target function, and that target function must make a call to another decorated function that can be inlined during recompilation.

For example in a chain of decorated calls A* -> B* -> C*, the intermediate function B could be recompiled with C inlined, resulting in B+. Then, assuming the original B* ever returns, A* could call B+ instead of B* the next time it needs to. This works because the decorated code in A* includes support for this "retargeting".
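Conceptually, a decorated call site behaves like a mutable slot that the runtime can later point at a recompiled function. This is a model of the behaviour, not DRTI's generated code; the names are hypothetical.

```cpp
#include <atomic>

// Model of a decorated call site in A*: the target is loaded from a slot
// on every call, so the runtime can swap in a recompiled version (B+
// replacing B*) once it exists. Purely illustrative.
using target_fn = int (*)(int);

struct decorated_call_site
{
    std::atomic<target_fn> target;

    int invoke(int arg)
    {
        return target.load(std::memory_order_acquire)(arg);
    }

    // Called by the runtime after recompilation of the target finishes.
    void retarget(target_fn recompiled)
    {
        target.store(recompiled, std::memory_order_release);
    }
};
```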

This can probably be improved using something like the stack maps that were developed for the WebKit JavaScript runtime compiler. As I understand it WebKit has since moved away from using LLVM but the stack map code could still be useful for DRTI.

De-optimization

Currently the recompiled code does not perform any profiling and so can't detect the need for "de-optimization". This means that the first call target encountered will be the one hard-coded into the recompiled version, and any other call targets will go via a slower path without any inlining.
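The resulting behaviour resembles a monomorphic inline cache with no rewrite path: one target gets the fast route, everything else takes a generic indirect call. The sketch below models that effect with counters; it is illustrative, not DRTI's generated code.

```cpp
// Model of a recompiled call site: the target seen at recompile time is
// baked in as the fast ("inlined") path; any other runtime target falls
// back to an un-inlined indirect call. Names are hypothetical.
using fn = int (*)(int);

struct recompiled_site
{
    fn cached;            // the single target hard-coded at recompile time
    int fast_calls = 0;   // taken when the runtime target matches
    int slow_calls = 0;   // taken for every other target

    int call(fn target, int arg)
    {
        if (target == cached) {
            ++fast_calls;      // stands in for the inlined body
            return cached(arg);
        }
        ++slow_calls;          // slower generic path, no inlining
        return target(arg);
    }
};
```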

Symbol resolution

Ahead of time compilation depends on a linker to handle external symbols. At runtime it's a bit trickier and in some cases the symbolic information simply won't be available, for example a stripped program and/or symbols with "hidden" visibility. For this reason the DRTI decoration pass not only saves the original bitcode in the output module but also an array with the address of every global that the original bitcode requires. During runtime recompilation it can use this array to resolve symbols as needed.
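The embedded globals array amounts to a name-to-address table that the JIT consults before (or instead of) ordinary symbol lookup. The layout and function names below are illustrative assumptions, not DRTI's actual data structures.

```cpp
#include <cstddef>
#include <cstring>

// Sketch: the decoration pass emits a table pairing each global the
// original bitcode needs with its link-time address, so recompilation can
// resolve symbols even in stripped binaries or for hidden-visibility
// symbols. Layout and names are hypothetical.
struct global_entry
{
    const char* name;
    void* address;
};

void* resolve(const global_entry* table, std::size_t count, const char* name)
{
    for (std::size_t i = 0; i < count; ++i)
        if (std::strcmp(table[i].name, name) == 0)
            return table[i].address;
    return nullptr;  // fall back to ordinary symbol resolution
}
```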

Static data

When recompiling a module at runtime, DRTI takes care to ensure that it does not end up with an extra copy of any static data defined in a module. For example a C++ function with a static variable such as the test function below results in two global variables as shown in the following LLVM assembly code.

const void* test_target2()
{
    static unsigned& counter = drti_test::new_counter("test_target2");
    ++counter;
    return drti_test::instruction_pointer();
}
@_ZZ12test_target2vE7counter = internal unnamed_addr global i32* null, align 8, !dbg !0
@_ZGVZ12test_target2vE7counter = internal global i64 0, align 8

These names demangle to the following, by the way:

@test_target2()::counter = internal unnamed_addr global i32* null, align 8, !dbg !0
@guard variable for test_target2()::counter = internal global i64 0, align 8

When recompiling this module DRTI converts these variables from "internal" linkage to "external" and resolves them against the global variables array, as mentioned above in the section on "Symbol resolution". This ensures that there is still only one copy of the static data in the program, and therefore prevents the initialisation function being called again the first time the recompiled version of the function runs. One further optimization that is not implemented, but could be in theory, is to elide the initialisation code entirely from the recompiled version of the function.

Virtual function calls

Virtual function calls are more difficult to inline than normal function pointer calls, since overridden functions require an implicit type conversion and, in some cases, a pointer adjustment on the implicit "this" argument. To see the issue, consider the following classes:

struct Base { virtual void foo() = 0; };
struct Derived : Base { void foo() override; };
void bar(Base* b)
{
    b->foo();
}

The call to b->foo() looks like this in LLVM bitcode, where it's important to note that it is calling a function of type void(%struct.Base*):

  %vtable = load void (%struct.Base*)**, void (%struct.Base*)*** %0, align 8, !tbaa !1600
  %1 = load void (%struct.Base*)*, void (%struct.Base*)** %vtable, align 8
  tail call void %1(%struct.Base* %b)
                    ^^^^^^^^^^^^^

However the implementation function expects a Derived* as shown here:

define void @_ZN7Derived3fooEv(%struct.Derived* nocapture %this)
                               ^^^^^^^^^^^^^^^^

This is inevitable since a member function within the class Derived must have a "this" pointer of the correct type. So when DRTI attempts at runtime to recompile the function bar() with a direct call to Derived::foo, there is an argument type mismatch. There isn't enough information in the LLVM bitcode to deduce that a type conversion is valid, and neither is there information about whether a "this" pointer fixup is required, which can arise due to multiple inheritance.
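The "this" pointer adjustment that multiple inheritance introduces can be seen in a small standalone example. On the Itanium C++ ABI used on x86-64 Linux, the second base subobject sits at a non-zero offset, so converting between the derived and base pointer types actually changes the address; the class names here are made up for illustration.

```cpp
// Under multiple inheritance the Right subobject of Both lives at a
// non-zero offset (after Left's vptr and member on the Itanium ABI), so
// an upcast to Right* must adjust the pointer value - the same kind of
// fixup a virtual call thunk performs on "this".
struct Left  { virtual ~Left()  = default; long l = 1; };
struct Right { virtual ~Right() = default; long r = 2; };
struct Both : Left, Right { long b = 3; };
```

This is exactly why a bitwise reinterpretation of the pointer is not enough and a real conversion (or thunk-equivalent adjustment) is needed when retargeting a virtual call.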

In principle the necessary information could either be provided automatically by the C++ front-end, or derived from inspection of the instructions within any virtual call thunk. As an aside, this kind of instruction inspection is probably necessary to detect tail-calls and procedure-linkage table calls as well. In any case, neither of these automatic mechanisms is implemented in the proof of concept and the developer has to provide pointer conversion function(s) explicitly using a DRTI macro called DRTI_CONVERTIBLE, for example:

DRTI_CONVERTIBLE(Base*, Derived*);

This results in a conversion function that DRTI will attempt to find if needed during runtime compilation:

__attribute__((used, always_inline))
static inline Derived* __drti_converter(Base* value, Derived*)
{
    return static_cast<Derived*>(value);
};

As mentioned this is just a workaround; in a more fully developed solution there would be automatic support for virtual function call type conversion and "this" pointer fixups.

The LLVM source code patch

DRTI needs to manipulate LLVM machine instructions to enable the passing of call tree information between functions. Although LLVM provides various mechanisms for external code to add custom passes over the machine-independent Intermediate Representation (IR), I couldn't find anything equivalent for the target-specific "machine" passes. In general, for code to have access to the MachineFunction and MachineInstr objects it has to be built into the LLVM target from within the LLVM source tree. There is some support for replacing the register allocator via the RegisterRegAlloc class as described in WritingAnLLVMPass, but DRTI isn't a register allocator.

For this reason, the llvm-drti.patch file in the drti root directory contains minor changes to TargetPassConfig.h and .cpp to support custom machine passes in third-party code such as DRTI.

Conclusions

DRTI demonstrates runtime re-compilation of the output from LLVM-based ahead-of-time compilers. By doing this it can inline function calls whose targets are not known until runtime. This is something that JIT compilers do as a matter of course but which, in general, ahead-of-time compilers cannot. It is currently a proof of concept.

Please contact the author via [email protected] with any comments or questions.
