This is work in progress. When finished, this document will contain most of the material presented in the lecture on Microkernel Construction at the University of Karlsruhe in summer 2008. It is intended as a learning aid to be used in conjunction with the lecture slides (available from https://i30www.ira.uka.de/teaching/courses/lecture.php?courseid=167).
The material on which this document is based has been developed by Prof. Jochen Liedtke in 1999, with contributions and additions by Uwe Dannowski, Kevin Elphinstone, Joshua LeVasseur, Espen Skoglund, Jan Stoess, and Volkmar Uhlig.
May this document help in understanding the design of microkernels in general and the L4 microkernel in particular.
... Raphael Neider (September 2008)
Chapter 1. Introduction
Monolithic operating system kernels abstract from the hardware and offer more convenient services such as threads, address spaces, files, synchronization primitives, ... to applications. New features are added by adding code to the kernel, which leads to code size explosion, e.g. in the Linux kernel. In order to perform physical memory management, fair scheduling, and hardware management, monolithic kernels have all their code execute in the privileged processor mode. Hence each subsystem (file system, networking stack, ...) and each driver must be fully trusted, as malicious or faulty code can affect the whole system.
Another problem with monolithic kernels is their poor maintainability: As all subsystems share the same binary, it is easy to take shortcuts and directly access data structures of other subsystems from anywhere. As a consequence, changing the structures can require changes throughout the kernel and thus hampers development.
If the subsystems had clean interfaces, whose use would be enforced by the compiler (layered, modular, or object-oriented kernels), at least the second problem (maintenance) could be solved, but the first one (robustness) still remains as long as all submodules share the same address space.
The goal of microkernel-based operating systems is to minimize the amount of code that is executed in the privileged processor mode. For this purpose, each driver and each convenience OS subsystem is moved into an unprivileged user-level application, a so-called server. Only a slim layer beneath the servers executes in privileged mode and deals only with safety-critical tasks such as granting access to physical memory/devices, routing interrupts to the appropriate driver, and communication between the threads/address spaces.
Such a scheme would allow multiple operating systems to execute on top of the microkernel at the same time (virtualization), special services to run directly on top of the microkernel (no high-level OS involved), or real-time services to be provided next to general purpose services without having to port the latter to a real-time-capable OS.
Due to the separation of the subsystems and drivers into different address spaces, clean interfaces are required and enforced by offering only a limited communication API. As a consequence, both the robustness of the resulting system and its maintainability should improve compared with the monolithic approach.
However, early microkernel-based OSes such as IBM's Workplace OS and early microkernels such as Mach uncovered a significant performance penalty due to the dramatically increased number of context switches: Instead of referencing data structures in some submodule, microkernel-based systems need to issue a request-IPC to the server, switch to its address space, probably copy the requested data into kernel buffers and/or the client's address space, switch back to the client's address space, and finally access the data.
Two lessons are to be learned from these projects:

1. Communication across address space boundaries is the most frequently used service in microkernel-based systems and, consequently, this operation must be as fast as possible.

2. The whole microkernel must be designed to allow for really fast IPC. To achieve this, the target architecture must be carefully analyzed for possible performance features (e.g., sysenter/sysexit on x86) and bottlenecks (avoid slow DRAM accesses, obey the cache structure during layout of the central data structures, avoid mandatory TLB flushes on context switches if possible, minimize kernel impact on branch prediction logic, ...).
The L4 microkernel, which is the basis for this lecture, was designed from scratch (rather than stripping existing systems down) and implemented to provide fast IPC. It offers two abstractions: threads to abstract from the CPU, and address spaces to protect threads against each other. In order to be able to manipulate these abstractions, the L4 kernel also provides two mechanisms: inter-process communication (IPC) allows threads to communicate and to synchronize, whereas mapping allows address spaces to be populated recursively.
The kernel maps all hardware properties to these abstractions: All activities (including I/O devices) are represented as threads; interrupts are transformed into an IPC from the device thread to the driver thread. Pagefaults are wrapped into a pagefault IPC from the faulting thread to another thread that is responsible for providing memory (a so-called pager), and (physical) memory is given to an address space by mapping memory from the pager (the pager itself requests the memory from its respective pager; the root pager idempotently maps physical memory to its clients on demand).
These two abstractions (threads and address spaces) are just one of potentially many valid choices. However, the provided abstractions should be chosen so that everything that needs to be managed in privileged mode (e.g., because loading the page table base register %cr3 on x86 is a privileged instruction) is managed by the microkernel (in privileged mode), and everything else is delegated to unprivileged user-mode tasks.
An additional criterion is to avoid duplicated functionality in the kernel: There should always be exactly one way to do a job, no more and no less. As a consequence, all abstractions and mechanisms must be general enough to allow all desired uses. On the other hand, all abstractions and mechanisms must be orthogonal, so that none can be emulated using the others (no superfluous abstractions or mechanisms are allowed inside the microkernel [1]).
[1] This is not true for large kernels such as Linux: Files can be manipulated using either the open(), read(), write(), and close() syscalls or via mmap()/munmap().
Chapter 2. Threads
Threads are one of the abstractions offered by the L4 microkernel. They abstract from the CPU and thus allow a single CPU to be shared among multiple activities (tasks, programs, jobs, ...). The thread abstraction enables multi-programming systems by switching between threads whenever another activity is to be executed on the CPU.
To this end, threads must be represented by a data structure that can hold all the CPU state, including the current instruction pointer, the stack pointer, and the contents of general purpose and special purpose registers (e.g., floating point or status registers). Additionally, to enable controlled switching between threads (scheduling), scheduling parameters such as thread priorities, time slice lengths, and thread status (e.g., running, ready, blocked, invalid) are also required. All this information is stored in so-called thread control blocks (TCBs), which will be discussed in more detail in Chapter 4, TCBs.
Switching threads is conceptually easy: save the current thread's CPU state into its TCB, select the next thread, load its state from its TCB, and resume execution.
However, two problems arise: First, TCBs must be protected kernel objects, so that no thread at user level can manipulate other threads' states. Furthermore, threads must not be allowed to modify even their own TCB, as the TCBs contain scheduling information (e.g., the remaining timeslice length).
As a consequence, thread switches must (most of the time, see Chapter 10, Local IPC) be performed by the kernel, which means that the kernel must run in order to do so. If the kernel is running, the user-level state of the thread must be preserved somewhere in order to be able to correctly resume it. For this second problem, some hardware support on entering the kernel is required: On kernel entry, the user-level CPU state could be saved on a stack (x86), or the kernel can use a different set of CPU state (register banks, found in MIPS or Itanium machines).
On stack-based machines (x86), a thread switch thus requires the following steps: The system enters kernel mode, the hardware and the kernel save the user-level state of the current thread on its kernel stack, the kernel switches to the kernel stack of the next thread, restores that thread's previously saved state, and returns to user-mode, causing the hardware to resume the next thread where it was interrupted.
Concerning the stack being used in kernel mode, different models are possible. The kernel stack should always be different from the user-level stack for security reasons: First, the user can set up an invalid stack pointer before entering kernel mode; loading a well-known, valid stack pointer on kernel entry is likely to be easier than validating the current value. Secondly, the kernel entry might be caused by a page fault on the user-level stack. Storing user-level state on this stack in order to handle the page fault would thus raise further page faults (it accesses the same unmapped memory), which can never be handled.
The presented models differ in the number of kernel-level stacks.
All threads executing on one CPU can use the same kernel stack. [2] As a consequence, either only one thread can execute in kernel mode at any time (i.e., threads executing in kernel mode cannot be preempted), or unusual approaches such as continuations must be used.
This model is efficient with respect to cache lines used during a context switch (only one stack), but uncomfortable from the kernel developer's point of view.
Alternatively, each thread can have its own kernel-level stack. This model resembles user-level code and makes kernel mode nothing special from the kernel developer's point of view. However, this model is more costly both with respect to virtual memory use (only one stack must be addressable in the previous model, whereas this model requires (number of threads) stacks to be present in virtual memory) and with respect to cache line use on thread switch: Both the stack of the current thread and the stack of the next thread are accessed, effectively doubling the number of cache lines used for them.
Due to its simplicity, L4 employs the per-thread stack model and integrates the kernel stacks into the TCBs. (TCB plus stack fit into a single page to avoid additional TLB misses.)
Code to cooperatively switch threads in kernel mode on x86:
/* save all registers on the stack */
pusha
/* compute TCB address of current thread
 * (esp already points to the end of the TCB,
 * i.e., the top of the kernel stack) */
mov  esp, ebp
and  -sizeof_tcb, ebp
/* assume: edi = address of TCB of B */
mov  esp, TCB_Off_esp(ebp)
mov  TCB_Off_esp(edi), esp
/* setup kernel stack for next kernel entry */
add  sizeof_tcb, edi
mov  edi, 0(esp0_ptr)
/* restore all registers from the stack */
popa
Note:
[2] Threads on different CPUs should always have separate kernel stacks in order to exploit possible parallelism and to avoid synchronization issues among the CPUs on kernel entries.
Chapter 3. System Calls
Manipulation of the microkernel abstractions thread and address space is exposed to the user-level applications via system calls. System calls are provided to create threads (potentially inside a new address space), manipulate their scheduling parameters (in L4: priority, processor affinity) in a controlled way, and for IPC (to bridge address space boundaries and to allow for synchronization among threads).
In order to use one of these services, the user-level application must be able to enter the kernel and to pass arguments to (and receive results from) it.
Due to the separate kernel stacks, passing parameters on the stack is not a viable option. For this reason (and for performance reasons, i.e., to avoid memory accesses), user level and kernel level exchange data solely in registers. [3]
For kernel entry/exit, hardware support is required to atomically exchange the user-level and the kernel-level state (at least stack pointer and instruction pointer). Most architectures offer a special instruction for the single purpose of entering/leaving the kernel.
Intel traditionally offered the int n instruction to allow software interrupts, which are handled exactly like hardware interrupts: When the user executes the int instruction, the hardware

- loads %esp0 (the %esp for privilege ring 0, i.e., kernel mode) for the current thread from a known memory location into %esp and its segment into %ss,

- stores the user-level stack pointer (segment %ss and offset %esp), flags, instruction pointer (%eip and its code segment %cs), and (potentially) an error code (used in case of exceptions rather than interrupts, e.g., to indicate page faults, division by zero, ...) on the kernel stack,

- and loads the address of the handler function (from the interrupt descriptor table entry n) into %eip and %cs.
After this, the handler function can store the remaining registers on the kernel-level stack and call a C-function to implement the requested service. Having completed the request, the kernel restores the user-level registers (or stores results in there, depending on the API/ABI), and returns to user-mode using the iret (return from interrupt) instruction, which atomically pops the user-level %eip/%cs and the %esp/%ss pairs as well as the flags from the kernel stack and enables interrupts again.
The int instruction is easy to use, as the kernel needs to handle interrupts anyway (at least timer interrupts for time-sliced, round robin scheduling and inter-processor interrupts for inter-processor IPC), so the infrastructure is there already. On the other hand, using int is quite slow, especially due to the memory lookup of both the kernel stack pointer (%esp0) and the entry point of the handler routine in the interrupt descriptor table (IDT).
As kernel entries/syscalls are frequent (especially in microkernel-based systems), Intel introduced a less costly alternative kernel entry facility via the sysenter/sysexit instructions. [4] These new instructions avoid memory lookups by providing CPU-internal special purpose registers (model-specific registers, MSRs) to store a single entry point into the kernel as well as a single kernel stack pointer to load on kernel entry via sysenter.
Furthermore, these instructions effectively disable segmentation by setting up flat segments (full 4 GB segments starting at linear/virtual address 0), which allows the additional costly segmentation-related checks on mode changes between kernel mode and user mode to be skipped. As most of today's operating systems (including L4) ignore segmentation and use flat segments already (using only paging for access control and physical memory management), this is usually not a problem (but see Chapter 9, Small Spaces).
As sysenter always jumps to the same handler routine in the kernel, an argument (a register) would be required to select one of the syscalls offered by the kernel. Instead of sacrificing a register for this purpose, L4 implements only the most frequently used system call, namely IPC, via sysenter; all other system calls use the traditional kernel entry method via int and either have an explicit argument denoting the system call number or use different interrupt vectors (int 0x80, int 0x81, ...) for the different system calls.
As L4 aims not only at maximum speed but also at portability, systems without the (relatively) new sysenter instruction shall also be supported and the optimal method of entering the kernel shall be selected at runtime. A problem here is that the stack layout after kernel entry via sysenter differs from the layout after using int:
Using sysenter, the hardware only

- loads %eip from an MSR and forces %cs to the flat segment (0..4 GB), and

- loads %esp from an MSR and forces %ss to the flat segment (0..4 GB).

Nothing is stored on the stack; the user-level instruction pointer and stack pointer to use when returning to user-level must be passed in registers (by convention and because sysexit expects them there, %eip is passed in %edx and %esp in %ecx).
In order to obtain a stack layout that matches the int entry path, the sysenter handler (in the kernel) first stores the user %eip (passed from user-level in the register %edx) and the user %esp (from %ecx) as well as the error code ("IPC syscall") on the stack at the "correct" locations. The segment registers (%cs, %ss) are always flat and thus need not be stored (they are still correct and in place since thread creation), and the flags are undefined after the IPC syscall as per the microkernel API and thus need not be stored either.
Example 3.1. Emulate effects of int after sysenter

mov  (esp), esp    /* load esp from TSS.esp0 */
sub  $20, esp      /* create interrupt frame */
mov  ecx, 16(esp)  /* store %esp and %eip where... */
mov  edx, 4(esp)   /* ... the HW would have done */
mov  $5, (esp)     /* indicate IPC syscall */
Having set up this stack frame, the sysenter handler can jump right to the int entry code to handle the rest, thus enabling faster kernel entry and compatibility without any code duplication.
A similar problem arises for threads that execute in virtual x86 mode (real mode): Whenever they cause a kernel entry/exception, all segment registers are pushed onto the kernel stack before the regular exception frame is set up. To accommodate this, L4 makes %esp0 point to the real top-of-stack only before switching to a virtual x86-mode thread; in all other (the usual) cases, %esp0 points to 4 words before the top-of-stack, effectively pretending that the next kernel entry had pushed 4 nonsense words before its stack frame. This allows the kernel to address, e.g., the user-level return address (%eip) at a fixed offset in the kernel stack under all circumstances and thus simplifies the kernel code by avoiding special cases.
The return path via sysexit has to load the user-level %eip and %esp (from the stack) into %edx and %ecx respectively and must enable interrupts prior to issuing the sysexit instruction. All of this plus segment register reloading and checking would also be done by the iret instruction on the traditional exit path, but much slower.
Example 3.2. Emulate iret using sysexit

mov  16(esp), ecx  /* load %esp and %eip from where iret would */
mov  4(esp), edx
sti                /* enable interrupts as iret would */
sysexit            /* return to user mode */
[3] These might be virtual registers, see the section called "Virtual Registers and the UTCB".
[4] AMD introduced similar instructions, syscall and sysret, which implement the same ideas and shall thus not be discussed in more detail here.
Chapter 4. TCBs
TCBs implement threads. Most importantly, a thread's TCB stores the thread's CPU state (at least a stack pointer; the remaining state such as registers, instruction pointer, and flags can be stored on the kernel stack) while the thread is not executing on a CPU. Besides the state, thread-local management data is also stored in the TCB: the thread's global and local IDs, its thread state, scheduling parameters, list pointers (e.g., for the ready list), a resources bitfield, and a reference to the thread's address space (the page table base pointer (%cr3) for Intel or an ASID/TLB tag for architectures with tagged TLBs).
The resources bitfield indicates whether the fast (hand assembled) IPC path can be used or whether the (slower) C path must be used: If any resources are in use (indicated by at least one set bit in the resources bitfield), the C path must be used to additionally save the state of the resources in use (see the section called "Fast Path IPC").
The thread state and the list pointers are redundant, but speed up access in various situations: When deciding which thread to run next, lists of ready threads per priority level are useful, as with these simply the successor of the current thread can be chosen (round robin, assuming no higher priority thread is ready, see Chapter 7, Dispatching). As list manipulations are rather costly (the current TCB and both its predecessor and its successor TCBs must be updated to insert or remove a thread into/from a list, probably causing cache misses and costly TLB misses), having an additional thread state field in the TCB allows lazy list handling (see the section called "Lazy Dispatching").
Also note that the per-state lists are embedded into the TCBs; no additional list items have to be allocated/released. This simplifies the kernel's memory management (no malloc/free required) and avoids having to locate free memory on demand (increases performance). Furthermore, embedding the list into the TCB guarantees that the list pointers and the other items in the TCB reside in the same memory page: All thread-related data (including the kernel stack, which is "appended" to the TCB) thus share a single TLB entry, avoiding superfluous TLB misses where possible.
A further note on performance: The layout of the TCB structure is performance critical. All items that need to be accessed during IPC (at least global ID, state, and space) should reside in the same cache line/in as few cache lines as possible. Each cache line used by the microkernel introduces a likely cache miss on first access and probably evicts data from user-level from the cache, adding another cache miss on the next access.
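To make the preceding description concrete, a TCB along these lines might look as follows in C. This is a minimal sketch only: the field names, types, and grouping are assumptions for illustration, not the actual L4 definition.

/* Hypothetical TCB layout; the IPC-critical fields are grouped at
 * the start so that they share as few cache lines as possible. */
typedef struct tcb tcb_t;
struct tcb {
    /* hot during IPC: */
    unsigned long  global_id;      /* thread number + version */
    unsigned long  local_id;       /* == address of the UTCB */
    unsigned long  thread_state;   /* running, ready, waiting, ... */
    void          *space;          /* address space (page table base) */
    unsigned long  kernel_sp;      /* saved kernel stack pointer */
    unsigned long  resources;      /* non-zero => slow IPC path */
    /* scheduling: */
    unsigned char  priority;
    unsigned long  timeslice_left;
    tcb_t         *ready_next;     /* embedded ready-list links */
    tcb_t         *ready_prev;
    /* ... the kernel stack occupies the rest of the 2^k byte TCB */
};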
In normal (i.e., expected) use, TCBs have to be located for three reasons: to find the TCB of the currently executing thread, to find the TCB of a thread given its thread ID (e.g., the partner of an IPC operation), and to find the TCB of the next thread to run.
The first task could be accomplished by using a kernel global variable to always store a pointer to the current thread's TCB. However, each access would then require a memory access to read this pointer in addition to accessing the TCB itself. L4 instead makes use of the design decision of integrating the kernel stack into the TCB: Defining TCBs (including the stack) to be 2^k bytes in size (k in [9..12]) and forcing them to be naturally aligned (i.e., the least significant k bits of the start address of the TCB are all 0) guarantees that any address within the TCB masked with ~((1 << k) - 1) yields the start address of the TCB. Given the fact that the kernel stack pointer points somewhere into the TCB of the current thread and is immediately loaded into the %esp register upon kernel entry, the L4 kernel can efficiently locate the current TCB by simply masking %esp with a constant mask.
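In C with a bit of inline assembly, this lookup could be rendered as follows (a sketch, assuming k = 11, i.e., 2 kByte TCBs, and the tcb_t type from above; names are hypothetical):

/* Compute the current TCB from the kernel stack pointer: TCBs are
 * 2^k bytes large and naturally aligned, and the kernel stack
 * (and thus %esp while in the kernel) lies inside the TCB. */
#define KTCB_BITS 11UL

static inline tcb_t *current_tcb(void)
{
    unsigned long sp;
    asm volatile ("mov %%esp, %0" : "=r" (sp));
    return (tcb_t *)(sp & ~((1UL << KTCB_BITS) - 1));
}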
To motivate the scheme used to locate TCBs by thread ID, let us first look at how TCBs are to be stored. As mentioned previously, a microkernel should not rely on fine-grained memory allocation/heap management a la malloc/free, but rather use preallocated data structures whenever possible. This approach avoids the management overheads (both (memory) space and runtime) as well as certain types of errors (out of memory). As a consequence, L4 allocates virtual address space regions for all TCBs on startup.
To further simplify TCB management, all TCBs are allocated en bloc, that is in a single large virtual array, indexed by the thread number. As a consequence, the number of threads that can be active in an L4-based system is limited by the number of TCBs allocated at startup. Given a 32-bit system and 2 kByte per TCB (typical for x86 systems), this approach requires 21 bit thread numbers (up to 2 M threads). However, in this constellation, the TCB array would fill the complete 4 GB address space, leaving no space for other kernel data structures or even code in the same address space. Consequently, L4 limits the number of threads to (an arbitrarily chosen) 256 k (18 bit thread numbers) and thus dedicates 512 MB of virtual memory to the TCB array.
The simplest choice for thread IDs would now be to use the address of the TCB as its thread's ID. This choice is bad because thread IDs are passed from the user to the kernel (e.g., to denote the partner thread of an IPC operation), and using pointers would require sanity checks on the IDs. A better choice would be to use the 18 bit thread number as thread ID, which can efficiently be masked to produce a valid index into the TCB array. However, this approach would waste 14 bits of the 32 bits per word used to hold a thread ID.
Thinking of the thread number as a distinction of threads in space (different thread numbers denote different TCBs) leads to the idea of using the yet unused 14 bits as a distinction of threads in time (the same thread number might (have to) be reused during the uptime of the system). Following this scheme, L4 thread IDs are in fact split into a thread number field (index into the TCB array) and a version field, the use of which is not prescribed by the microkernel but rather left to the user-level thread or task manager on top.
The version field is also the reason for why the global thread ID is stored in the TCB: Any thread ID with the same thread number would resolve to the same TCB, but only a specific version of the thread is currently active. Mismatches in the version field indicate stale thread IDs (e.g., the intended thread has terminated and its TCB reused by another thread) and should thus cause, e.g., IPC operations to fail.
Given the above constraints (32 bit thread IDs, 18 bit thread number, 14 bit version number, 2 kByte TCBs), the most efficient way to encode the thread IDs is to split the version field and place it around the thread number field, so that a single mask operation suffices to compute the byte offset of the desired TCB in the TCB array. By using the least significant 11 bits and the most significant 3 bits of the thread ID for the version field, the thread number field in bits 11..28 directly encodes the byte offset in the TCB array.
Example 4.1. Locating a TCB by its thread ID by masking out a split version field
mov  thread_id, eax
mov  eax, ebp
and  mask_thread_no, eax
add  offset, eax
cmp  TCB_Off_Myself(eax), ebp
jne  invalid_thread_id
Masking out the version field and adding the base address of the TCB array (offset in the sample code) yields a pointer to the addressed TCB iff the versions match. This is checked by the cmp instruction, which compares the given thread ID with the one stored in the TCB. The CPU jumps to the error handling code at invalid_thread_id if the two thread IDs are not equal, otherwise it simply continues.
This approach is efficient (no superfluous memory/cache/TLB accesses), but the split version field is a bit awkward.
With just a single register-internal shift-instruction more, the version field can be left intact and be placed entirely before (or after) the thread number field.
Example 4.2. Locating a TCB by its thread ID by shifting and masking the thread ID
mov  thread_id, eax
mov  eax, ebp
and  mask_thread_no, eax
shr  threadno_shift, eax
add  offset, eax
cmp  TCB_Off_Myself(eax), ebp
jne  invalid_thread_id
Both approaches take the thread number directly as an index into the TCB array. If thread numbers are not assigned sequentially (or if many threads have terminated but their TCBs have not yet been reused), the TCBs of active threads will be spread over the whole TCB array. As a consequence, the TCBs must be backed by standard (e.g., 4 kByte) pages in order to avoid allocating much physical memory (a superpage, e.g., 4 MB) of which only a small fraction is used. This leads to a situation in which it is likely for operations that involve two or more TCBs (e.g., IPC) to require two TLB entries (one per TCB). If the active TCBs could be migrated to/clustered in few superpages, only one TLB entry might be required, covering both TCBs in IPC operations. With the fixed relation between thread number and TCB location above, migration would also change the thread ID, which is not feasible (all references to the previous thread ID would have to be identified and updated).
Introducing a level of indirection in the form of a lookup table solves this problem: The thread number is just an index into a table (thread_table in the sample code), which holds in each entry a pointer to the associated TCB. By placing the version field into bits 18..31 of the thread ID and making use of Intel's advanced addressing modes (scaling the offset by the word/pointer size of 4 bytes), this boils down to the following piece of code:
Example 4.3. Locating a TCB by its thread ID using a lookup table
mov  thread_id, eax
mov  eax, ebp
and  mask_thread_no, eax
mov  thread_table(4*eax), eax
cmp  TCB_Off_Myself(eax), ebp
jne  invalid_thread_id
When a thread terminates, the released TCB can be reused by the next created thread (or even another active thread's TCB can be moved into the gap, updating only the table pointer). With 2 kByte TCBs and 4 MB superpages (typical for x86), the TCBs of 2048 active threads fit into a single superpage, reducing the TLB footprint (and TLB misses) of the kernel.
On the downside, the lookup table itself requires 256 k x 4 B = 1 MB of memory, and accesses to it also require it to be mapped in the TLB. A clever scheme could integrate the lookup table into the first 1 MB of the superpage that also covers the active TCBs, so that table and TCBs usually share a single TLB entry. This approach would leave room for the TCBs of 3 MB / 2 kB = 1536 active threads in the first superpage together with the lookup table; still sufficient for many scenarios. Since more threads only introduce another TLB entry per 2048 TCBs, this approach scales even better than the direct approaches with respect to TLB utilization.
Another drawback of this indirect approach is the added memory access to read the table. This might (probably will) cause a cache miss and thus increases the cost again slightly, but probably not as much as the additional TLB misses in the direct approaches.
The indirect approach has probably not been evaluated well enough in the implementation of the L4/x86 kernel, but has proven its feasibility with the introduction of communication spaces (see Chapter 12, Security). L4 currently employs the direct scheme with shifting and masking and the version field kept intact.
The next ready-to-run thread is neither known by its ID nor is it (usually) the current thread. Locating it highly depends on the scheduler implemented and shall thus be discussed in more detail later (see Chapter 7, Dispatching).
The basic idea is to have a linked list of ready threads, with the current thread being the list head. The next ready thread is then simply the next thread in the list (this description ignores priorities for brevity), which is reachable via the list pointers in the current thread's TCB.
The TCB array with its 256 k entries of 2 kByte each requires 512 MB of memory. For obvious reasons, the microkernel should not claim 512 MB of physical memory from the start on but should rather grow only on demand. Thus the TCB array is allocated only in virtual memory. The first access to a page in the TCB array will cause a pagefault, which is handled specially by the kernel: On a read fault, a shared, 0-filled page is mapped read-only at the faulting address (a zeroed TCB reads as an invalid/unused thread, e.g., carrying the nilthread ID); only on a write fault (i.e., when a TCB is actually created or modified) is a fresh, writeable frame allocated.
This is a variant of the copy-on-write scheme, which saves memory even if a malicious user-level thread probes every thread ID in the system; there will only be a single physical frame mapped multiple times into the TCB array. The initial pagefaults can be avoided by premapping the 0-filled page to back the whole TCB array. Afterwards, only the first TCB modification on each page will cause a pagefault.
This scheme works both for standard (4 kB) pages and for superpages. If the TCB array is to be backed with superpages, a small 0-filled page can still be used to back the unused parts. However, simplicity dictates that the first write requires a superpage to be mapped (otherwise physical memory management, which aims at allowing later promotion to superpages, becomes very difficult)!
In the indirect TCB addressing scheme, the lookup table should not be backed by a shared 0-filled read-only frame, but rather by a dedicated, writable frame containing the virtual address of the first TCB in every word: As long as the first TCB is read-only mapped to the 0-page, all accesses to any TCB will fail at the validation of the thread ID. Once the first TCB is allocated (and thus mapped to a writeable page), all invalid accesses will still fail the validation procedure (the requested thread ID does not match the thread ID in the first TCB). This guarantees that reading from the lookup table always yields a valid pointer into the TCB array, thus avoiding additional checks (the 0-page would yield NULL-pointers, which would have to be checked for). If the lookup table is integrated into the superpage that also stores the initial portion of the TCB array, this page must still be allocated exclusively and writable from the beginning; only the remaining parts of the TCB array can/should use the 0-mapping trick as described earlier.
Having proposed to dedicate 512 MB of virtual address space to TCBs, this is the right time to look at the address space layout to show how this fits in. Generally, there are two possible approaches to store the kernel code and data: One can use an address space dedicated to the kernel, which would require an address space switch whenever a user-level thread invokes a system call, or one can split the 32 bit (4 GB) virtual address space into a user region and a kernel region. As address space switches are comparatively expensive on the main target platform (x86) due to untagged TLBs (also see Chapter 9, Small Spaces), logically splitting the address space is the preferred way to go.
With 512 MB virtual memory being required for the TCBs, a split into 3 GB for the user-level and 1 GB for the kernel seems reasonable (though debatable). L4 reserves the upper 1 GB (0xC0000000..0xFFFFFFFF) of each address space for its own purposes, the lower 3 GB can (nearly) freely be used by the user-level thread (for limitations see the section called "Virtual Registers and the UTCB"). While the lower 3 GB are (naturally) potentially different in each address space, the kernel's GB is synchronized across all address spaces so that the kernel has a consistent view on its memory irrespective of in which address space it currently executes (exceptions to this rule are processor-local memory and per-space regions in kernel memory).
The kernel's 1 GB is further split into the aforementioned per-space region of (maybe) 128 MB, the TCB array (512 MB), and 128 MB for kernel code and data (such as the mapping database, see Chapter 8, Virtual Memory Mapping).
The last 256 MB are used as a window to the first 256 MB of physical memory to allow for efficient manipulation of objects that are either not mapped into virtual memory or are only known by their physical address (this is handy, e.g., when walking multi-level hardware-pagetables (x86) in software because these reference the next level by its physical address).
While most of the kernel's address space is rather static (code, data, window to physmem) or intentionally not synchronized across address spaces (per-space region), the TCB region is highly dynamic and must be kept consistent across all address spaces.
Imagine thread A creates a new thread N, whose TCB resides in a yet unused page. The kernel would create the mapping for the TCB array page in the kernel region within A's address space and continue executing A. At the end of A's timeslice, the kernel decides to run thread B in a different address space. B also wants to talk to N, checks if it exists and gets back the NULL mapping from the TCB array in B's kernel region and falsely concludes that the thread does not exist. Even worse, B might now try to and succeed in creating another thread N, which is probably different from the one A created before (B's N might, e.g., execute different code than A's N). Now the kernel has an inconsistent view (two TCBs for the same thread number), the global IDs are no longer global.
To circumvent this problem, all mappings in the TCB array are first added into a master page table (e.g., the page table of the first user-level task to run). Initially, all TCBs are mapped read-only to the 0-page in this master table. On pagefaults in the TCB array, the mapping in the master table is read and copied to the faulting page table entry. This already suffices to resolve read pagefaults: The faulting thread will now retry the memory access and read either the correct TCB or the 0-page (and thus detect invalid accesses). For pagefaults on write accesses, the master mapping is copied only if it does not map to the 0-page. In the latter case, the faulting thread was trying to create a new thread (using a previously unused TCB). To allow for this operation to succeed, instead of duplicating the 0-page mapping, a new frame is allocated, cleared, and mapped writeable into the master page table (replacing the read-only mapping to the 0-page). Only after this is the new (writeable) mapping copied to the page table of the faulting address space.
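The following C fragment sketches this lazy synchronization (a simplified, hypothetical rendering of the scheme just described; all helper names are assumptions):

#include <string.h>

/* Pagefault in the TCB area: synchronize from the master page table. */
void handle_tcb_area_fault(unsigned long fault_addr, int is_write)
{
    pte_t *master = lookup_pte(master_pgtable, fault_addr);

    if (is_write && pte_maps_zero_page(master)) {
        /* First write to a fresh TCB page: allocate and clear a
         * frame, then make it writeable in the master table. */
        unsigned long frame = alloc_frame();
        memset(phys_to_virt(frame), 0, PAGE_SIZE);
        pte_set(master, frame, PTE_WRITEABLE);
    }

    /* Copy the (possibly upgraded) master entry into the page
     * table of the faulting address space and resume. */
    pte_t *local = lookup_pte(current_pgtable(), fault_addr);
    *local = *master;
}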
This lazy scheme ensures consistency across all address spaces without having to update all page tables at every change (i.e., we update/synchronize lazily). A potential problem could be invalidated mappings; these would have to be propagated to all page tables immediately. However, this case does not arise here: mappings in the TCB area are only ever upgraded (from the read-only 0-page to a writeable frame) and never removed, as TCBs are preallocated once and reused rather than freed.
Some data such as the pointers to the heads of the ready lists per priority level should be CPU-local to avoid conflicts in the caches (ping-pong) and to remove the requirement to serialize concurrent accesses to the data structures from different CPUs. CPU-local data can be implemented either by using an array for each structure, which is indexed by the number of the accessing CPU, or by consolidating all CPU-local data structures in few pages and backing the virtual addresses with different physical frames depending on the accessing CPU. The latter approach is "cleaner" as the source code does not need to be cluttered with artificial array accesses but rather accesses all data structures on all CPUs at the same virtual address, which is transparently mapped to different memory locations; it is thus favored by L4. On the downside, this approach requires a top-level page directory per address space and per CPU in order to selectively replace some mappings to point off to CPU-local data. These additional page directories must be kept in sync, which is implemented on the page directory entry level as with the TCB array (see the section called "Synchronization of the Kernel Address Space").
Chapter 5. IPC
The introduction of address spaces isolates threads in different address spaces from each other, so that they cannot interfere with each other. On the downside, threads in different address spaces can also no longer work together, as there can be no communication between them. To overcome this (micro)kernel-imposed limitation, the kernel provides inter-process communication (IPC) primitives.
When implementing communication primitives, one fundamental design decision has to be made: Shall the sender block until the IPC operation returns (successfully or not) or shall the sending thread continue immediately and let the kernel deal with the actual message transfer?
The latter case (asynchronous send) usually requires the kernel to buffer the message until the receiver is ready to accept it, implying that each message is copied twice: from the sender into a kernel buffer, and later from the kernel buffer to the receiver.
Buffering messages in the kernel makes the kernel vulnerable for denial-of-service attacks (e.g., by letting threads send messages that will never be received). Furthermore, copying the messages twice not only wastes processor cycles, but also increases the kernel's cache (and probably TLB) footprint.
Assuming that IPC is the most frequently invoked microkernel service, L4 implements only synchronous IPC, which allows to keep the message in the sender's address space until it can be copied to the receiver. This scheme avoids both the in-kernel buffers and a second copy of the message plus the sender can be informed about success or failure when the IPC operation returns.
For message-based communication, at least two primitives are required: send (a message to a specified thread) and receive (a message from a specified thread).
These operations allow two threads to communicate with each other and prevent interference from other threads: The receiver only accepts messages from the intended sender (a so-called closed receive). Other threads must wait until the receiver accepts messages from them. However, servers that offer their services to all threads are difficult to implement using the closed receive primitive for two reasons:
First, the server cannot know in advance which client will issue the next request and thus does not know which sender to specify for the closed receive. Secondly, as the thread blocks during receive (assuming the client has not yet issued a request), the server would wait for one particular client while the requests of all other clients remain unserviced.
To avoid these problems, the kernel should offer an additional primitive (known as an open receive): receive (a message from any thread willing to send to the invoking thread).
Looking at the (expected) common interaction patterns in a multi-server system, clients will typically send requests to servers and then wait for the answer (cf. RPC). If in such a scenario the client is preempted after having sent the request but before having issued the accompanying receive, the server might try to deliver the result (answer message) to the client before the client is receiving. As a consequence, the server would block and thus be prevented from servicing other requests. To solve this problem, two approaches can be considered:
Combined send-and-receive primitives could be: call (send a message to a thread, then perform a closed receive from that same thread; the typical client operation) and reply-and-wait (send a message to a thread, then perform an open receive; the typical server operation).
Depending on the size of the messages to be transferred, we can distinguish three message types: short messages, which fit entirely into the (message) registers; string messages, for which the kernel copies memory buffers between the address spaces; and map messages, which transfer access to memory pages (mappings) rather than data.
Besides memory pages, other resources can be mapped as well (e.g., access to I/O ports on IA32).
The distinction becomes important when implementing the IPC primitives, as we shall discuss in Chapter 6, IPC Implementation.
In order to enable threads to recover from a (blocking) communication attempt with an unresponsive (malicious, faulty, or uncooperative) partner, the duration of each IPC operation can be bounded with a number of timeouts.
The time a sender waits for the receiver to call the corresponding recv can be limited using the send timeout. A 0-timeout enables polling (sending the message only if the receiver is ready).
The time a receiver waits for a sender to arrive can be bounded using a receive timeout. A 0-timeout enables polling/receiving messages only if a sender is waiting to deliver.
Using transfer timeouts (also known as xfer timeouts), both the sender and the receiver can limit the duration of the actual message transfer. Transfer timeouts are only considered for string IPCs, because all other types are guaranteed to be free of user-level pagefaults during message delivery. Since strings need to be copied from the sender's address space to the receiver's, pagefaults can arise both in the sender's address space (while reading the message) and in the receiver's address space (while writing the message). Since pagefaults are resolved by user-level pagers (see Chapter 8, Virtual Memory Mapping), which are under the control of the user and thus not trustworthy, the sender can specify an xfer timeout to limit the total time available to the receiver's pager to resolve all occurring pagefaults. Conversely, the receiver can specify an xfer timeout to limit the total time available for the sender's pager. Both xfer timeouts are combined and the minimum determines the effective xfer timeout that bounds the duration of the message transfer.
Concerning the encoding of timeouts, there are a number of constraints to consider: timeouts should occupy as few bits as possible (so that they can be passed in registers alongside the other IPC parameters), short timeouts should offer fine granularity while very long timeouts must still be expressible, and the special values 0 and infinity must be encodable.
As a consequence, L4 employs 16 bit wide, logarithmic timeouts with a 10 bit mantissa m (in microseconds) and a 5 bit exponent e (for relative timeouts; the 16th bit discerns relative and absolute timeouts), allowing relative timeouts to be specified as m * 2^e, i.e., between 0 and 1023 * 2^31 microseconds (about 610 hours).
For absolute timeouts, L4 similarly uses 16 bits: a 4 bit exponent e, a 1 bit epoch designator d, and a 10 bit mantissa m (in microseconds); the 16th bit again distinguishes absolute from relative timeouts. Computing absolute timeouts from 16 bits relies on a reference clock R, which is an L4-provided 64 bit monotonically increasing microsecond counter (though the update rate of the counter is architecture dependent). The absolute timeout triggers as soon as the least significant 10+e bits of R equal m << e and bit 10+e of R equals d (or, C-like, when (R & ((1ULL << (11+e)) - 1)) == ((d << (10+e)) | (m << e))).
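A minimal sketch of encoding a relative timeout under this layout (mantissa in bits 0..9, exponent in bits 10..14; the helper name is made up):

#include <stdint.h>

/* Encode a duration in microseconds as a relative timeout
 * m * 2^e with a 10 bit mantissa and a 5 bit exponent
 * (rounding down; bit 15 = 0 marks the timeout as relative). */
uint16_t encode_rel_timeout(uint64_t us)
{
    unsigned e = 0;
    while (us > 1023 && e < 31) {
        us >>= 1;      /* halve the mantissa, bump the exponent */
        e++;
    }
    if (us > 1023)
        us = 1023;     /* clamp to the ~610 hour maximum */
    return (uint16_t)((e << 10) | us);
}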
For every IPC operation, a number of parameters such as the intended receiver, the type and contents of the message and the desired operation (send, receive, call, ...), or the desired timeouts (if any) must be specified. Clever encoding of these can avoid superfluous memory accesses and thus help to make IPC fast.
To avoid code duplication, L4 only implements a single IPC system call, which implements the atomic send-and-receive operation between threads. As a consequence, the system call requires two thread IDs: a destination specifier, denoting the thread to which a message shall be sent (or NIL if no send phase is desired), and a receive specifier. If not NIL, the receive specifier denotes the thread that is expected to send a message to the invoking (i.e., the current) thread. The receive phase always follows the send phase (if any): Typically, having received a message requires some sort of processing before a meaningful answer can be sent. An exception to this rule would be a simple "ACK", which is implicitly passed to the originator as the return code of the send operation: If the send returns without errors, the message has been delivered (see the section called "Synchronous Communication").
If the receive specifier is NIL, no receive phase is initiated and the syscall returns right after the send phase (if any).
For an open receive, which would allow the current thread to receive a message from any thread, another special thread ID can be used: the wildcard thread ID (-1).
This scheme allows the desired operation (send to, receive from, receive from any, call, or reply and wait) and all relevant threads to be encoded in only two words.
As both addresses must be inspected in all IPC calls, these should be passed in registers rather than memory to the kernel.
Four timeouts accompany each IPC: a send timeout, a receive timeout, and the two xfer timeouts (one specified by the sender, one by the receiver, as described above).
With all timeout values being 16 bit wide, (at least) two timeouts can be stored in a single word. As the send and/or receive timeouts are always required whereas the xfer timeouts are only required for string IPC operations, we combine send and receive timeout in one 32 bit word, which should be passed to the kernel in a register to avoid a memory access. The xfer timeouts are combined in a second word, which can be passed in memory: we only access them for string IPCs, which touch a lot of memory anyway, so that the (relative) overhead will be small.
The message itself is stored in 64 word-wide virtual message registers (MR0 .. MR63), each of which is realized either in a physical processor register or in a thread-local region of memory (see the section called "Virtual Registers and the UTCB").
The message itself consists of a message tag in MR0, followed by 0..63 untyped words (which are simply copied during an IPC), followed by 0..63 typed words. Typed words (or typed items) are interpreted by the kernel and serve to encode strings and mappings.
String items instruct the kernel to copy a contiguous region of memory (a "string") from the sender's address space to the receiver's address space. To this end, string items must convey the length of the string and the start address(es) of its substring(s) in the sender's address space.
The first word in a string item (the string specifier) consists of the following fields:

- a continuation flag (C, bit 0), indicating whether further typed items follow this string item,

- a cacheability hint (hh, bits 1..2), with which the sender can limit cache pollution during the copy operation,

- the number of substrings minus 1 (N, bits 4..8),

- a compound flag (c, bit 9), indicating that another string specifier follows that belongs to the same string, and

- the length of the substrings in bytes (bits 10..31).

The N+1 words that follow the string specifier denote the start addresses of the substrings, which need not be aligned and thus require a full word each.
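As an illustration, the string specifier could be rendered in C roughly as follows (a sketch only: bitfield layout is compiler dependent, and the field positions are as assumed above):

#include <stdint.h>

/* Hypothetical rendering of a 32 bit string specifier. */
typedef struct {
    uint32_t C      :  1;  /* more typed items follow */
    uint32_t hh     :  2;  /* cacheability hint */
    uint32_t        :  1;  /* assumed 0: marks a string item */
    uint32_t N      :  5;  /* number of substrings - 1 */
    uint32_t c      :  1;  /* compound: another specifier follows */
    uint32_t length : 22;  /* substring length in bytes */
} string_specifier_t;
/* ...followed by N+1 words holding the substring start addresses */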
Map items enable the sending thread to establish a region of shared memory with the receiver. In order to allow the mapper to restrict the access of the "mappee" to the memory (e.g., read only), map items must specify the region of the sender's address space to be mapped (an fpage: base address and size) together with the access rights granted to the receiver.
Since multiple map items should be allowed in each message, but only a single receive window can be specified by the receiver (arbitrarily large, but only one, see the section called "Receive Buffers"), each map item also specifies an offset within the receive window.
All this is encoded in two words. The first word of a map item consists of:

- a continuation flag (C, bit 0), as in string items,

- a type field (bits 1..3, reusing i.a. the bits that hold the cacheability hint hh in string items), marking the item as a map item, and

- the send base (bits 10..31), i.e., the offset within the receiver's receive window at which the mapping shall be placed.

The second word holds the fpage (base address, size, and access rights) to be mapped.
Grant items allow an fpage to be mapped and simultaneously removed from the sender's address space; the fpage is effectively moved to the receiver's address space.
In their encoding, grant items differ from map items only in the first bit of the type field (bit 1 of the send base): Map items set this bit to 0, grant items set it to 1.
The effect of the grant operation cannot be achieved with any series of map and unmap operations and is thus a required additional, not a redundant operation (also see Chapter 8, Virtual Memory Mapping for the effects of grant on derived mappings and the mapping database).
In order to properly decode the message, the first message register (the message tag) always provides some meta-information about the remaining message:

- the number of untyped words (u, bits 0..5),

- the number of typed words (t, bits 6..11),

- flags (bits 12..15), and

- a label (the remaining bits 16..31/63), which is not interpreted by the kernel and can, e.g., encode the type of request.
The flags in the message tag are used to signal the use of a kernel-based IPC control mechanism (IPC propagation or IPC redirection), indicate inter-processor IPCs, and to store the error indicator after the IPC.
The flags are typically set by the kernel for inspection by the receiver and the sender (error bit).
As the message tag describes the message, it must be inspected on every IPC and should thus be passed to the kernel in a register rather than via memory.
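Composing such a tag is a matter of a few shifts and masks; a sketch under the layout above (the helper name is made up):

#include <stdint.h>

/* Build a message tag: u untyped words, t typed words, flags,
 * and a user-defined label (32 bit variant). */
uint32_t msg_tag(unsigned u, unsigned t, unsigned flags, unsigned label)
{
    return  (u     & 0x3f)         /* bits  0..5  */
         | ((t     & 0x3f) <<  6)  /* bits  6..11 */
         | ((flags & 0x0f) << 12)  /* bits 12..15 */
         |  (label         << 16); /* bits 16..31 */
}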
Whenever an IPC operation includes a receive phase, the receiver must specify which messages it is willing to accept: buffers for incoming strings and a receive window for incoming mappings.
Untyped words are simply copied into the receiver's message registers, so nothing needs to be specified here (the number of message registers is the same for all threads, so no overflow can occur).
For strings, string items as in the send case are employed to denote the buffers into which the strings shall be received (overly long send strings are truncated to the length of the according receive string).
In order to accept mappings, the receiver can specify a single contiguous virtual address region (an fpage, see Chapter 8, Virtual Memory Mapping); all incoming map items must fit into this region or an error is signaled.
The receive window is given as part of the acceptor in BR0, which consists of:

- a continuation flag (C, bit 0), indicating whether receive string items follow in the subsequent buffer registers,

- the size of the receive window (s, bits 4..9), and

- its base address (bits 10..31/63).

The size of the receive window is actually 2^s bytes. Due to encoding and portability problems, s must be at least 10 and the base address must be naturally aligned (its last s bits are forced to 0 by the kernel). Disallowing s<10 guarantees that all valid base addresses can be stored in the upper 22/54 bits. As an exception, the special nilpage (0) can be used to refuse all mappings.
The remaining 33 (virtual) buffer registers BR1..BR33 provide a list of receive string items, its end being indicated by the first string specifier with a clear C bit.
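Constructing the acceptor word under this encoding could look as follows (a sketch; base must be aligned to 2^s, and the function name is made up):

#include <stdint.h>

/* Build BR0: accept mappings into a 2^s byte window at base;
 * strings_follow indicates that BR1.. hold receive string items. */
uint32_t acceptor(uint32_t base, unsigned s, int strings_follow)
{
    return (base & ~((1u << 10) - 1))  /* base address: bits 10..31 */
         | ((s & 0x3f) << 4)           /* window size s: bits 4..9  */
         | (strings_follow & 1);       /* C flag: bit 0 */
}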
Finally, the completion status of an IPC operation (successful, aborted after timeout, partner thread invalid, mappings rejected, ...) should be reported to the caller. Assuming success as the common case, L4 reports the outcome in only one bit (error flag in the message tag, MR0). Details about the error (if any) are given in a predefined, thread-local memory location (see the section called "Virtual Registers and the UTCB").
In order to set up an architecture independent API without limiting all architectures to the greatest common divisor of them all, L4 bases its IPC API on a set of virtual registers to store the IPC arguments (MRs, BRs, xfer timeouts, error details). Virtual registers are mapped either to physical registers or to a kernel provided, user accessible, thread local, unpaged memory location, which is called the user-level thread control block (UTCB). Accesses to this memory never raise pagefaults; the kernel guarantees that the UTCB is always present in physical memory.
The kernel can locate the UTCB easily because the local thread ID, which is stored next to the global thread ID in the kernel TCB, is defined as the address of the thread's UTCB.
The user, however, needs a different way to locate the UTCBs:
On IA32, L4 employs an otherwise unused segment register (%gs) to grant access to a segment that contains as its first (and only) word a pointer to the UTCB of the currently running thread.
Using a pointer in a user-accessible but fixed segment has several advantages over other schemes: The pointer can easily and safely be manipulated by both the kernel and the user (see Chapter 10, Local IPC), thus allowing cheap thread switching purely at user-level (among threads in the same address space). If the segment was instead set up to directly give access to the UTCB, every thread switch would require the segment descriptor to be rewritten, which is a privileged operation and would thus rule out cheap user-level switching.
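From user level, fetching the UTCB pointer then boils down to a single segment-relative load; a sketch for IA32, assuming the %gs setup described above:

/* Read the current thread's UTCB address from %gs:0. */
static inline void *my_utcb(void)
{
    void *utcb;
    asm volatile ("mov %%gs:0, %0" : "=r" (utcb));
    return utcb;
}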
Chapter 6. IPC Implementation
The following elaboration mainly focuses on single-processor systems, i.e., only one thread executes on the CPU at any time. Furthermore, we assume a strict priority-based scheduling strategy with fixed priorities and round-robin per priority level as provided by L4 (see Chapter 7, Dispatching), though the ideas presented should apply more generally as well.
The implementation of synchronous IPC primitives (send, receive, and combined operations such as call or reply-and-wait) are tightly related to scheduling: Whenever a thread waits for an IPC partner (trying to send to a thread that is currently not accepting messages, or trying to receive a message while none is pending), the thread is blocked and another thread needs to be scheduled.
Conversely, if the send phase of a call operation completes, the sender will wait for the response of the receiver. This (common!) case can be improved by not going through the scheduler and instead dispatching the receiver thread directly, allowing it to use the remainder of the sender's timeslice to start fulfilling its request. Executing the receiver in the sender's timeslice is called timeslice donation; the sender would block anyway (after the call), but instead of losing its assigned CPU time, the sender passes it on to the receiver so that the latter can do some work on the sender's behalf.
This approach is fair in all respects if the sender used the call primitive: the sender deliberately waits for the receiver's answer anyway, so it loses nothing when its remaining CPU time is spent on producing exactly that answer.
Besides a call to a waiting receiver, other (attempted) IPC operations also require dispatching/scheduling actions:

- send, receive, or call while the partner is not waiting,

- receive from a partner that has previously called us,

- call a partner that is waiting for us,

- receive from a partner that has previously invoked a send for us,

- send to a partner that has previously invoked a recv or called us.
These cases are different from the previous ones: Before, either the sender or the receiver (or none of them) was ready to run after the invocation of the IPC primitive; the other thread blocked. If (as in these cases) both threads are ready to run after the IPC, a true scheduling decision must be made:
If the two threads have the same priority, a policy decision is required: Continue to execute the current thread until the end of its timeslice (lazy IPC operation) or immediately switch to the partner (eager)?
[The author of this document believes that the eager approach unjustifiably penalizes the current thread (cutting short its timeslice though it has work to do) and thus favors the lazy approach, which additionally avoids the overhead of a thread switch. L4 implements the lazy approach.]
IPC operations that only transfer messages in the message registers (no strings, no mappings) are called short IPC. Under the assumption that call-style IPC is prevalent in microkernel-based systems and that the receiver (often a server) is waiting, such a short IPC can on uni-processor machines be implemented via a (simplified) thread switch, carrying at least part of the message in the register file as follows:
The sender performs the call system call, thus entering kernel mode and switching to the kernel stack of the sender thread. All registers except the stack pointer are preserved. (The x86 segment registers %cs, %ss, %ds, %es, %fs, and %gs, the (extended) FLAGS, and the instruction pointer are overwritten on kernel entry and cannot be used to transfer messages.) The kernel then locates and validates the destination TCB, switches to the receiver's kernel stack, and records that stack for the next kernel entry (esp0). Some (x86: 2) additional registers are required for locating and verifying the destination TCB; all others are preserved and thus carry the message when the kernel finally returns to user mode in the context of the receiver.
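In (heavily simplified) C, the heart of this path might read as follows. This is a sketch only: the real fast path is assembly, and all names here are hypothetical.

/* Short-IPC call to a waiting receiver (uni-processor sketch).
 * The message itself stays in the untouched CPU registers. */
void ipc_call_short(tcb_t *dest)
{
    tcb_t *cur = current_tcb();

    cur->thread_state  = TS_WAITING;   /* caller awaits the reply */
    dest->thread_state = TS_RUNNING;

    cur->kernel_sp = get_kernel_sp();  /* save our kernel stack */
    set_kernel_sp(dest->kernel_sp);    /* adopt the receiver's stack */
    set_esp0(dest);                    /* stack for next kernel entry */
    switch_space(dest->space);         /* load receiver's page table */
    /* return to user mode: the receiver resumes, message in regs */
}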
Short IPC, meaning IPC with its message solely contained in untyped message registers without strings or mappings, is (assumed to be) the most frequent kernel operation in microkernel-based systems. This justifies taking special care on its implementation to avoid most of the possible performance problems (cache misses, TLB misses, memory references in general, down to microarchitectural effects such as pipeline stalls and flushes or instruction scheduling; the latter shall not be discussed here in more detail).
In order to retain full control over these effects, L4 features two IPC paths: a hand-optimized fast path for the common case and a slower, general path written in C.
The fast path is implemented in hand-crafted assembler, but handles only the easy (hopefully common) case:

- short IPC (untyped words only, no strings or mappings),

- call or reply-and-wait semantics, so that no scheduling decision is required, and

- no effective timeouts.
The selection between the two IPC paths is made at the beginning of the IPC entry point by checking the above criteria. If any criterion for the fast path is violated, the code branches off to (calls) the C code, otherwise the fast path is used.
In order to retain good performance, the target architecture must be inspected and searched for possible problems. On x86, checking the fast path criteria boils down to computing, for each criterion, a value that is non-zero iff the slow path is required; all these values are or'ed together: if the resulting value is 0, no criterion required the slow path and we remain on the fast path; if the result is not 0, we use the C path instead.
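A sketch of this branch-free check (the criteria shown and all names are illustrative, not the actual L4 code):

/* Each term is 0 iff the fast path remains allowed. */
static inline int need_slow_path(tcb_t *cur, tcb_t *dest,
                                 unsigned long tag, unsigned long timeouts)
{
    unsigned long slow = (tag & TAG_TYPED_MASK)            /* typed items?   */
                       | (timeouts ^ TIMEOUTS_DEFAULT)     /* real timeouts? */
                       | (dest->thread_state ^ TS_WAITING) /* not waiting?   */
                       | cur->resources;                   /* resources used? */
    return slow != 0;
}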
Attention is required to make the two paths compatible: If one thread enters the wait state on the slow path (e.g., because it used recv rather than call) and this thread is to be resumed by a thread that sends to it using the fast path (via a call), the receiver must be in a state that allows proper resumption on this path. Conversely, the reply might come in on the slow path, reactivating a thread that entered the wait state on the fast path.
Luckily this turns out to be no problem in practice, because both the fast and the slow IPC path leave the waiting thread in a consistent state (same thread state, same stack layout for the top few items). Exclusively cooperative switching of threads inside the kernel (the kernel is not preemptible, except during string IPC) avoids race conditions while the state of a thread is changed, i.e., during the thread switch.
However, with the fast path leading up to a sysexit (to activate the receiver), the fast path cannot be used to activate kernel threads such as interrupt threads, which never execute in user mode and do not even have a surrounding address space.
Long IPC denotes all IPC that includes string items or map/grant items, i.e., typed items. The complexity of these operations in conjunction with accessing the required data structures makes an assembly "fast path" implementation infeasible. Additionally, due to the string copy or pagetable manipulation, the costs of long IPC are generally so high that such a fast path would not pay off much and is thus not provided (which also saves maintenance overhead).
Long IPC requires the following steps:

- validate the partner thread and its state
- analyze and transfer the message
We want to enable interrupts during string copy operations in order to keep the interrupt latency low: Each string can be up to 4 MB in size, so the pure copy time approaches the length of a timer tick (order of milliseconds). We must disable interrupts afterwards, as the thread switch and other kernel code assume atomicity.
We must allow for pagefaults to be handled during string IPC, which requires that we prevent the reactivation of any of the partners. Otherwise another thread might concurrently be allowed to send a message to the receiver, partly overwriting its message registers and wreaking havoc. The sender must not continue before having sent the message so that it cannot mess with its message registers (besides other reasons).
The approach taken by L4 to avoid such problems is to lock both partners' TCBs while the interrupts are still disabled. Locking simply means to set their thread states to locked_running (sender) and locked_waiting (receiver), which are different from both ready/running and waiting and thus prevent anyone from communicating with or dispatching these threads.
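A minimal sketch of this locking scheme, with the state names taken from the text and everything else assumed:

    typedef enum {
        STATE_RUNNING,                 /* ready/running                      */
        STATE_WAITING,                 /* blocked in IPC                     */
        STATE_LOCKED_RUNNING,          /* sender locked during string copy   */
        STATE_LOCKED_WAITING           /* receiver locked during string copy */
    } thread_state_t;

    typedef struct tcb { thread_state_t state; } tcb_t;

    /* Assumed helpers, not the real L4 interfaces: */
    void enable_interrupts(void);
    void disable_interrupts(void);
    void copy_strings(tcb_t *from, tcb_t *to);  /* preemptible, may fault   */

    static void string_ipc_transfer(tcb_t *sender, tcb_t *receiver)
    {
        /* Lock both partners while interrupts are still disabled: the
         * locked states are neither ready nor waiting, so nobody can
         * communicate with or dispatch these threads in the meantime. */
        sender->state   = STATE_LOCKED_RUNNING;
        receiver->state = STATE_LOCKED_WAITING;

        enable_interrupts();            /* keep the interrupt latency low   */
        copy_strings(sender, receiver);
        disable_interrupts();           /* switch code below is atomic      */
    }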
We defer a detailed description of mapping IPC until Chapter 8, Virtual Memory Mapping, to present it in its natural context; the remainder of this chapter will thus focus on string IPC. Also keep in mind that this description of the IPC operations disregards the multi-processor case and thus ignores the additional SMP-safe locking requirements on accessing TCBs and such. The L4 implementation is SMP safe.
String IPC provides a means to safely (guarded by the kernel) copy large amounts of data between address spaces. Though not strictly required (short IPC would suffice), string IPC offers a much more comfortable and more efficient solution to this task than short IPC could, while still keeping many of the IPC properties: synchronous, atomic message transfer (the sender continues only after the message has been transferred completely) from sender-defined data into memory locations designated as such by the receiver. Furthermore, kernel-mediated string IPC guarantees that the originator of the accepted string data is known, as the sender's true identity (thread ID) is conveyed to the receiver during the IPC (a matter of trust).
The problem with inter-address-space string copies (or memory copies in general) is that neither the sender nor the receiver necessarily has access to both the source and the destination buffer; if both buffers reside in private memory (not shared between the sender's and the receiver's address spaces), no single pagetable contains mappings that make both buffers' physical frames accessible. In this case, not even the kernel can copy the data directly.
The traditional approach to copying data from one address space to another is to use an intermediate buffer in the kernel's address space (which is mapped into all address spaces (top 1 GB) and is synchronized to map to the same physical frames in all address spaces). First, running in the sender's address space, the kernel copies the strings into the intermediate buffer in kernel space, then it switches to the receiver's address space and copies the data out of the kernel buffer into the receiver's buffers.
Though this approach works well, it requires two copies of the message and (for long messages) trashes many cache lines (source buffer, intermediate buffer and destination buffer will occupy cache lines, probably evicting each other).
Assuming 4-byte words, 32-byte (8-word) cache lines, and an n-word string to copy, copying the string from the source to the intermediate buffer and from the intermediate to the destination buffer each requires n word-wide read operations and n word-wide write operations, thus 4n (cached) memory accesses in total.

Assuming further that the sender has just written the message into the source buffer and that the receiver will immediately read the complete string after reception, small messages will still be in the cache when copied into the kernel buffer (only n/8 cache misses on the kernel buffer). With the same argument, short strings will still be in the cache when copied from the kernel buffer into the destination buffer. The cache misses on the receive buffer are not considered here, as the receiver will read the message immediately: if the kernel writes the message there (suffering cache misses), the receiver will not suffer cache misses. Conversely, if the message were not in the cache, the receiver would suffer the cache misses; the kernel-generated cache misses on the buffer are no avoidable overhead. The total overhead of this approach is thus n/8 cache misses when writing to the kernel buffer; all other cache misses are due to the sender writing and the receiver reading the message and are thus "desired" (at least, not to be avoided).

For large n (n > size of the cache in words), the assumption of the message still residing in the cache before each copy does not hold: The later part of the string will evict the earlier part from the cache. Thus this approach yields n/8 cache misses when reading the source buffer, n/8 cache misses when writing the kernel buffer, n/8 cache misses when reading the kernel buffer [5], and n/8 cache misses when writing the receive buffer. All of these are overhead (both the sender and the receiver will suffer additional cache misses when accessing their buffers), so for large n the overhead is 4n/8 (or n/2) cache misses.
Compare this to the alternative approach to be discussed next.
The traditional copy-in/copy-out scheme works well for small n, but even then involves complex buffer management schemes in the kernel (pagefaults on the receive buffer might cause a pager to page in memory from disk; in the meantime, other threads might also perform string IPC; the size of the kernel buffer thus cannot be fixed nor known in advance).
To overcome these problems, L4 uses temporary mappings to make both the sender's (source) buffer and the receive buffer accessible in one address space at the same time. [6] For this purpose, the kernel reserves a mapping area in its address space region (the size of two consecutive superpages, 8 MB on x86) and maps the receive buffer into it (the buffer is at most 4 MB, but not necessarily aligned; it is only guaranteed to fit into an 8 MB region!). Afterwards, the sender can access the source buffer at its natural (virtual) address and the receive buffer via the (virtual) address of the mapping region in the kernel memory.
Having set this up, we can copy directly from source buffer to destination buffer (using the virtual addresses of the mapping region), without requiring a kernel buffer and without having to copy the string twice.
With the assumptions from the copy-in/copy-out scheme, this approach only requires 2n memory operations (instead of 4n) and touches 2n/8 cache lines (instead of 3n/8; no cache lines are required for the temporary mapping area if the cache is physically indexed, as the temporary mapping maps to the same physical frames as the receive buffer). It avoids all overhead cache misses for small n (the message is only copied within the cache, given it fits; possible cache misses on the receive buffer are attributed to the receiver as above and thus not counted as overhead), and it incurs only 2n/8 overhead cache misses for large n (instead of 4n/8 as derived above; the 3n/8 stated in the table in the lecture slides seems wrong): n/8 when reading the source buffer and n/8 when writing the receive buffer. The receiver will suffer additional cache misses when accessing the buffer, so these 2n/8 are true overhead (avoidable) misses.
On the downside, the temporary mapping also has a number of drawbacks: The setup cost is higher (set up the kernel mapping to the receive buffer), and more needs to be done should the copy operation be interrupted: The mapping area is a shared resource, whose state must be preserved across thread switches, but in contrast to the kernel buffer above, the mapping area can quite easily be shared, as detailed next.
Setting up the temporary mapping area only requires copying the receiver's pagetable entries that cover the receive buffer (up to 1024 of them) into the entries that cover the temporary mapping area in the sender's pagetable. Due to the two-level pagetable scheme employed by Intel x86, copying the up to two top-level pagetable entries suffices to make the complete receive buffer accessible. This effectively integrates (one or two of) the receiver's second-level pagetables into the sender's address space.
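As a sketch (structure layout, constants, and names are assumptions for illustration), establishing the temporary mapping on x86 boils down to two word-sized copies:

    #include <stdint.h>

    #define PDE_SHIFT     22                         /* one PDE covers 4 MB   */
    #define TEMP_AREA_IDX (0xE0000000u >> PDE_SHIFT) /* assumed area location */

    typedef uint32_t pde_t;
    typedef struct { pde_t pdir[1024]; } space_t;    /* x86 top-level table   */

    /* Copy the two top-level entries covering the receive buffer into the
     * slots covering the temporary mapping area. Each entry maps 4 MB, so
     * the 8 MB area always contains the (<= 4 MB, possibly unaligned)
     * receive buffer. */
    static void setup_temp_mapping(space_t *sender, const space_t *receiver,
                                   uint32_t rcv_buf)
    {
        unsigned idx = rcv_buf >> PDE_SHIFT;
        sender->pdir[TEMP_AREA_IDX]     = receiver->pdir[idx];
        sender->pdir[TEMP_AREA_IDX + 1] = receiver->pdir[idx + 1];
        /* The buffer then appears at the area's base address plus
         * (rcv_buf & ((1u << PDE_SHIFT) - 1)). */
    }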
Since we made long IPC preemptible (interrupts enabled during the message transfer), other threads might be scheduled while the temporary mapping is set up. These threads might themselves start a string IPC and re-establish the temporary mapping area for their own transfer, so that the preempted sender, once resumed, would write into the wrong receive buffer.
Even worse would be a scenario in which thread A is preempted in a string copy to thread B, and the preempting thread C is resumed in the middle of its (previously started and preempted) string copy to thread D: Without care, C would now write into the receive buffer of B, as that's the way the temp. mapping area is (still) set up when C resumes.
To avoid such problems, L4 always invalidates the temp. mapping area on a thread switch. As the mappings might already be in the TLB, we also have to flush the TLB, as always after revoking mappings. Flushing the TLB is costly (not the flush itself, but restoring the TLB entries afterwards requires a pagetable walk (two memory accesses) per TLB entry), and it is implicitly performed on x86 whenever the address space is switched (by loading the address of the top-level pagetable into %cr3). So we only need to explicitly flush the TLB if we switch to a thread in the same address space as the (preempted) sender of the string IPC.
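Continuing the sketch from above, the thread-switch hook might look like this (flush_tlb and the space bookkeeping are assumed helpers):

    void flush_tlb(void);   /* assumed helper; on x86: reload %cr3 */

    /* Invalidate the temporary mapping area on every thread switch. The
     * TLB only needs an explicit flush if we stay in the same address
     * space; otherwise the %cr3 reload done by the switch flushes it. */
    static void invalidate_temp_mapping(space_t *cur, const space_t *next)
    {
        cur->pdir[TEMP_AREA_IDX]     = 0;
        cur->pdir[TEMP_AREA_IDX + 1] = 0;
        if (next == cur)
            flush_tlb();
    }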
If now thread C resumes and writes to the temp. mapping area, it will raise a pagefault (A's/B's mappings there have been invalidated), which will be handled as a special case in the L4 kernel to reestablish C's temp. mapping area:
If the pagetable entry of the mapping area still matches the receiver's pagetable entry, we have a genuine pagefault on the receive buffer (the second-level pagetable entry is invalid or read-only), so the receiver's pager is notified to handle the pagefault via a tunneled pagefault. [7] Otherwise, the mapping area was invalidated by a thread switch and is simply re-established from the receiver's pagetable.
In order to be able to restore the temp. mapping after preemption, we need to store the thread ID of the receiver and the location of the receive buffer in the sender's TCB (done when the temp. mapping is set up before starting to copy).
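Putting the pieces together, the pagefault special case might be handled along these lines (building on the sketches above; the TCB fields rcv_thread and rcv_buf are the values remembered before copying, and tunnel_pagefault is an assumed helper):

    typedef struct xtcb {
        space_t     *space;       /* the thread's address space           */
        struct xtcb *rcv_thread;  /* IPC partner, remembered in the TCB   */
        uint32_t     rcv_buf;     /* partner's receive buffer address     */
    } xtcb_t;

    void tunnel_pagefault(xtcb_t *receiver, uint32_t fault_addr);

    static void temp_area_pagefault(xtcb_t *sender, uint32_t fault_addr)
    {
        space_t *rspace = sender->rcv_thread->space;
        unsigned idx    = sender->rcv_buf >> PDE_SHIFT;

        if (sender->space->pdir[TEMP_AREA_IDX] != rspace->pdir[idx])
            /* invalidated by a thread switch: re-establish the mapping */
            setup_temp_mapping(sender->space, rspace, sender->rcv_buf);
        else
            /* genuine fault on the receive buffer: notify its pager */
            tunnel_pagefault(sender->rcv_thread, fault_addr);
    }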
On multi-processor machines, each CPU should be able to perform a string IPC in parallel with all other CPUs. A single temp. mapping area in the kernel, however, would prohibit such parallelism and instead require locking and synchronization on the shared resource "mapping area".
As a workaround, each CPU can be given a private temp. mapping area in a distinct region within the kernel memory space. In effect, this approach allocates not one but N (the number of CPUs in the system) mapping areas in the kernel, each one being used exclusively by one CPU; no additional locking to access the mapping area is required. However, this approach requires limiting the maximum number of CPUs at compile time so that the kernel memory regions can be reserved. If the kernel is compiled for up to 32 CPUs but only 2 are present, 30 x 8 MB of virtual kernel memory are wasted. What's more, this approach does not scale well to larger, many-core systems with 100s of CPUs: one cannot dedicate 100 x 8 MB of the 1 GB of kernel memory to mapping areas!
As an alternative, L4 uses CPU-local memory: Each CPU has its own copy of the top-level pagetable and can thus establish its own temporary mapping area without disturbing the other CPUs' mapping areas. This approach not only scales better (you just need to duplicate the top-level pagetable (4 kB per CPU) instead of the mapping area (8 MB per CPU)), but it also places the mapping area at the same virtual addresses on all CPUs, which makes programming easier: instead of using map_area[get_current_cpu()], you can simply access map_area from all CPUs and still have it map to different receive buffers on each CPU.
[5] When copying the string backwards out of the kernel buffer, the later part (copied first) is still in the cache, reducing the number of cache misses; this is neglected here assuming that the string is always copied in order of ascending addresses.
[6] The mappings are temporary as they vanish at each thread switch.
[7] The sender thread is running, but the pagefault occurs in the receiver thread's address space (although none of the threads there might be running). Activating the pager of a thread that is not even running is called page fault tunneling.
L4 offers threads as one of its (few) abstractions. As a consequence, a mechanism to dispatch (i.e., switch to and execute) a thread is required as well.
The dispatching mechanism described here is only used if the next thread to run is not indicated by the user and thus a thread has to be selected according to some scheduling policy. The IPC paths, e.g., often imply a thread switch to the receiver; this path bypasses the dispatcher. Similarly, a ThreadSwitch system call that names an explicit destination thread switches directly to that thread, likewise bypassing the dispatcher.
Whenever L4 must choose the next thread to run, it consults its built-in scheduler. The scheduler uses hard priorities (0 to 255, with 255 being the highest): a runnable thread with a higher priority always takes precedence over threads with lower priorities.
Threads of the same priority are scheduled in round-robin fashion for a given timeslice; at the end of the timeslice, the thread is preempted and another thread is dispatched. For this purpose, the L4 scheduler maintains an array of ready lists, one per priority level.
In principle, the scheduler starts at the array entry for the highest priority and checks if there is a thread ready to run (the array entry is not nil). In this case, the thread is selected for dispatch; otherwise, the array is searched in order of decreasing priorities. If no thread is ready to run, a special idle thread is dispatched. When the current thread blocks, yields, or exhausts its timeslice, the next thread in that priority class is dispatched and given a new timeslice. Additionally, the pointer to the currently running thread in the array is updated to point to the selected thread.
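In (simplified, assumed) C, the basic selection loop could look like this:

    #define MAX_PRIO 255

    typedef struct tcb tcb_t;

    /* ready[p] points at the "current" thread of priority p's circular
     * ready list, or is NULL if no thread of that priority is runnable. */
    static tcb_t *ready[MAX_PRIO + 1];
    static tcb_t *idle_thread;

    static tcb_t *select_next_thread(void)
    {
        for (int prio = MAX_PRIO; prio >= 0; prio--)
            if (ready[prio] != NULL)
                return ready[prio];
        return idle_thread;          /* nothing ready: dispatch idle */
    }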
Timeslice accounting is slightly inaccurate, as it happens in the handler of the periodic timer interrupt by subtracting the length of the timer period from the remaining timeslice length. If the remaining timeslice becomes zero or negative, it is said to be exhausted, causing immediate rescheduling. If the thread blocks in between two timer ticks, the next thread will start with only half a timer period till the next timer tick. Whether it is billed a whole timer period or whether this case is correctly accounted for is beyond the author's knowledge.
Usually, threads have priorities around 100 on L4. To avoid checking the priority levels 255..101 for runnable threads, L4 remembers the highest priority with a runnable thread and only checks at and below that level during scheduling.
This so-called "highest active priority" must be updated whenever a thread with a higher priority becomes ready to run, and should be updated whenever the currently highest active priority level becomes empty (because the last thread on that level blocks). Both events can easily be registered in the kernel.
On x86, further optimizations are possible for sparsely populated priority arrays: x86 offers a bsr (bit scan reverse) instruction, which can be used to locate the index of the most significant set (1) bit in a word. If we maintain a bitfield of 256 bits (8 32-bit words), i.e., one "active" bit per priority level, we can use this instruction to locate the highest active priority without searching the array at all.
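A sketch of this idea (using GCC-style inline assembly; the bookkeeping that sets and clears the bits is omitted):

    #include <stdint.h>

    /* One bit per priority level; bit p is set iff a thread of priority p
     * is ready to run (256 levels in 8 32-bit words). */
    static uint32_t prio_bits[8];

    static int highest_active_prio(void)
    {
        for (int word = 7; word >= 0; word--)
            if (prio_bits[word] != 0) {
                uint32_t bit;
                /* bsr: index of the most significant set bit */
                __asm__ ("bsr %1, %0" : "=r" (bit) : "rm" (prio_bits[word]));
                return word * 32 + (int)bit;
            }
        return -1;    /* no thread ready at all: run the idle thread */
    }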
The current implementation always folds 16 priority levels into one bit, reducing the length of the bitfield to search to 16 bits. This approach requires scanning up to 15 (empty) array entries in order of decreasing priorities to identify the real highest active priority, and it complicates the conditions under which a bit can be cleared. The reasons for not using 32 bits are unclear; there is not even an apparent benefit of using only one word instead of 8: Both cache utilisation and TLB coverage would favor the 256-bit bitfield, and no difference with respect to the branch prediction logic is obvious.
Finding the next thread to run can be quite time consuming (especially if no bsr instruction can be used) and should thus be preemptible itself, in order to achieve a low interrupt latency. If the scheduler were to execute in the context of the current thread (the one that is to be replaced), this goal would be hard to achieve: On interrupts, the current state of the interrupted thread would be saved, and the (in-kernel) interrupt handler would still execute on the stack of the current thread and possibly invoke the scheduler again. When the current thread is scheduled again, it would resume its preempted scheduling activity, although the thread itself should continue! Additionally, the interrupt handler and the nested scheduler consume space on the thread's kernel stack, which, being included in the TCB, is quite limited.
Solving all these problems, L4 uses a dedicated thread for the scheduling decisions. This thread is never resumed after preemption but rather started anew, discarding its previous state (the scheduling-relevant data structures have possibly changed since the scheduler has been preempted, so a fresh start is recommended). This property also removes the requirement for an exceptionally large kernel stack; except for a single interrupt handler, nothing will be nested on top of the scheduler.
The consequences of this model are that dispatching via the scheduler requires two thread switches: first from the current to the dispatcher thread, and then from dispatcher to the next thread. However, as the dispatcher never executes in user mode, it does not have an associated address space, so that switching there is cheap (no address space switch, no TLB flush). As it is never referenced from user mode, it can even do without a thread ID, which further allows to freely place (and size) its stack; it need not be included in any TCB.
L4 uses circular ready lists per priority level.
If these lists were maintained eagerly, a thread would have to be removed whenever it blocks (e.g., during a call IPC, waiting for the reply). Removing a thread from the ready list requires touching its preceding list element (to update its next pointer) as well as its successor list element (to update its prev pointer), in addition to the list element itself.
With the list being embedded in the TCBs, this translates to having to touch three TCBs (predecessor, successor, and current), each one probably covered by a different TLB entry. We cannot avoid touching the current TCB, and since we will probably switch to the next thread anyway, touching the successor list element does not "waste" a TLB entry for the single purpose of updating the ready list. Touching the predecessor TCB, however, solely serves the purpose of keeping the ready list up-to-date.
The possible TLB miss on touching the predecessor TCB can be avoided by maintaining the ready lists lazily as follows: Instead of removing a blocking thread's TCB from the ready list immediately, we just leave it in there and simply set its state to blocked. If a thread that is selected for dispatching is marked as blocked, we remove it from the list and dispatch the next thread. This lazy handling of ready lists is called lazy dispatching.
... and what does this scheme buy us?!?

1. Blocking a thread no longer requires any list manipulation at all; in particular, the predecessor TCB is not touched.
2. When the dispatcher encounters a thread marked as blocked, we have already touched its predecessor TCB (in most cases, that is the currently executing/preempted thread)! Since we will also have to touch the successor TCB (since that will be the thread to dispatch), performing the list operation now does not introduce additional TLB misses. (Accessing the blocked TCB is as costly and as useless as touching the predecessor TCB in the eager case, but benefit 1 remains.)
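A sketch of lazy dispatching on a (simplified, singly-linked) circular ready list; the real lists are doubly linked and embedded in the TCBs:

    typedef enum { TS_READY, TS_BLOCKED } tstate_t;

    typedef struct ltcb {
        struct ltcb *next;     /* circular ready list, embedded in the TCB */
        tstate_t     state;
    } ltcb_t;

    /* Blocking only flips the state; the list stays untouched (benefit 1). */
    static void block(ltcb_t *t) { t->state = TS_BLOCKED; }

    /* Stale entries are unlinked when the dispatcher stumbles over them;
     * at that point the predecessor's TCB is hot anyway (benefit 2).
     * Assumes at least one ready thread (e.g., idle) is on the list. */
    static ltcb_t *next_ready(ltcb_t *cur)
    {
        ltcb_t *t = cur->next;
        while (t->state == TS_BLOCKED) {
            cur->next = t->next;    /* unlink the blocked entry */
            t = cur->next;
        }
        return t;
    }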
Threads that are blocked during IPC can be unblocked either by completing their IPC (the partner arrives and accepts/delivers a message) or by aborting it due to a timeout. With many blocked threads in a system, it would be infeasible to check all blocked threads' timeouts on each timer interrupt (aka tick). Instead, L4 maintains a (more or less ordered, see below) list of timeouts, with the one closest to its expiry at the head of the list. This only requires checking a single timeout at each tick (and would even allow tickless implementations, setting the timer to min(timeslice length, earliest timeout) on each kernel exit), but requires a clever data structure to keep track of the timeouts.
Since timeouts could be created with every IPC, inserting them into the "list" must be fast. Under the assumption that most timeouts are never actually raised (they are usually used to recover from error conditions/deadlocks/false assumptions; under normal conditions they are cancelled once the IPC completes), canceling a timeout must also be cheap. With the same argument (raised timeouts indicate some error in the application), handling an expired timeout may be more expensive.
Both unsorted and sorted true lists do not meet our requirements: Unsorted lists take O(n) time to find the nearest timeout, sorted lists take O(n) time to insert a new element. Using (balanced) trees to maintain the timeouts would probably allow cheap lookup/insert operations (O(1), O(log n)), but trees are deemed to be too complicated [8] and expensive (probably in terms of time (e.g., reorganizing an AVL tree) and memory (>= 3 pointers per tree node instead of 1 or 2 in lists)).
Anyway, an epoch-based multi-list scheme has been proposed for L4. [9] The idea is to define two absolute times on system startup, esoon and elate, with esoon being several micro- or milliseconds from now and elate being approx. an order of magnitude later. We can now use an ordered list (aka soon list) for the nearly expired timeouts (those that fire before esoon), an unsorted list (aka late late list) for timeouts in the far future (firing after elate), and another unsorted list (aka late list) for timeouts in between the other two (at or after esoon but before elate).
Insertion into the sorted soon list is cheap because (hopefully) there is only a small number of timeouts stored in that list (esoon should be chosen to maintain this property). Insertion into the unsorted late and late late lists is cheap because we can simply append (or prepend) new timeouts (O(1)). To facilitate quick removal of cancelled timeouts, we can store a pointer to the timeout list entry in the associated TCB (O(1)). As the current design precludes any thread from having more than one active timeout, we can even go further and include a wakeup list node in the TCBs (as is done with the ready lists as well), removing the necessity to store a pointer to the list node.
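The insertion rule could be sketched as follows (list helpers and epoch variable names are assumed):

    #include <stdint.h>

    typedef struct timeout {
        struct timeout *next;
        uint64_t        expiry;     /* absolute expiry time */
    } timeout_t;

    static uint64_t   e_soon, e_late;              /* epoch boundaries    */
    static timeout_t *soon_list, *late_list, *late_late_list;

    void sorted_insert(timeout_t **list, timeout_t *t); /* assumed helper */
    void prepend(timeout_t **list, timeout_t *t);       /* assumed helper */

    static void insert_timeout(timeout_t *t)
    {
        if (t->expiry < e_soon)
            sorted_insert(&soon_list, t);   /* small list, kept sorted    */
        else if (t->expiry < e_late)
            prepend(&late_list, t);         /* unsorted, O(1)             */
        else
            prepend(&late_late_list, t);    /* unsorted, O(1)             */
    }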
Once esoon is reached, a correction phase is required (note that the soon list is empty at this time!): First, we define a new esoon in the near future, then we iterate over the late list, moving all timeouts that fire before the new esoon into the (sorted) soon list, and then continue normal operation.
Whenever we reach the updated esoon, we process the late list as just described. Only once we reach elate do we need to inspect the timeouts in the late late list, which we process similarly to the late correction above: First update elate, then iterate over all entries of the late late list, moving them into the (sorted) soon list (if they expire before esoon) or the late list (if they expire before elate).
The epochs serve to reduce the sorting efforts (only nearly expired timeouts are sorted), the late list keeps the cost of the late correction phase small (if elate is not too far in the future), and the probably costly late late correction phase handles the remaining (long-term) timeouts.
Note that the correction phases are only required once the current time exceeds esoon or elate: The insertion strategy guarantees that no timeout in the late or late late lists can be missed (all expiry times are later than esoon or elate, respectively).
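The soon correction phase, again as an assumed sketch building on the names above:

    /* Run when the clock reaches e_soon (the soon list is empty then):
     * define a new e_soon, then move all late-list timeouts that fire
     * before it into the sorted soon list. */
    static void soon_correction(uint64_t now, uint64_t soon_span)
    {
        e_soon = now + soon_span;          /* soon_span: assumed policy */

        timeout_t **pp = &late_list;
        while (*pp != NULL) {
            timeout_t *t = *pp;
            if (t->expiry < e_soon) {
                *pp = t->next;             /* unlink from the late list */
                sorted_insert(&soon_list, t);
            } else {
                pp = &t->next;
            }
        }
    }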
As an additional optimization, we can reuse the lazy dispatching idea: Instead of removing the timeout list element when a timeout is cancelled, we can leave it in the list and just set its expiry time to never. Such items can be removed from the lists (at a low cost) during correction phases, or they can be reused for the next timeout iff it falls into the same late/late late list (the soon list must always be sorted).
[8] Trees being too complicated is a kind of strange argument; once you get them right, their complexity should be no problem, and data structures such as the mapping database are both tree-like and even more complex...
[9] The scheme presented has never been implemented. As non-zero and non-infinite timeouts have been found to be rarely used, L4 uses a single (ordered) wakeup list for all non-trivial timeouts.