This is work in progress. When finished, this document will contain most of the material presented in the lecture on Microkernel Construction at the University of Karlsruhe in summer 2008. It is intended as a learning aid to be used in conjunction with the lecture slides (available from https://i30www.ira.uka.de/teaching/courses/lecture.php?courseid=167).
The material on which this document is based has been developed by Prof. Jochen Liedtke in 1999, with contributions and additions by Uwe Dannowski, Kevin Elphinstone, Joshua LeVasseur, Espen Skoglund, Jan Stoess, and Volkmar Uhlig.
May this document help in understanding the design of microkernels in general and the L4 microkernel in particular.
... Raphael Neider (September 2008)
Chapter 1. Introduction
Monolithic operating system kernels abstract from the hardware and offer more convenient services such as threads, address spaces, files, synchronization primitives, ... to applications. New features are added by adding code to the kernel, which leads to code size explosion, e.g. in the Linux kernel. In order to perform physical memory management, fair scheduling, and hardware management, monolithic kernels have all their code execute in the privileged processor mode. Hence each subsystem (file system, networking stack, ...) and each driver must be fully trusted, as malicious or faulty code can affect the whole system.
Another problem with monolithic kernels is their poor maintainability: As all subsystems share the same binary, it is easy to take shortcuts and directly access data structures of other subsystems from anywhere. As a consequence, changing the structures can require changes throughout the kernel and thus hampers development.
If the subsystems had clean interfaces, whose use would be enforced by the compiler (layered, modular, or object-oriented kernels), at least the second problem (maintenance) could be solved, but the first one (robustness) still remains as long as all submodules share the same address space.
The goal of microkernel-based operating systems is to minimize the amount of code that is executed in the privileged processor mode. For this purpose, each driver and each convenience OS subsystem is moved into an unprivileged user-level application, a so-called server. Only a slim layer beneath the servers executes in privileged mode and deals only with safety-critical tasks such as granting access to physical memory/devices, routing interrupts to the appropriate driver, and communication between the threads/address spaces.
Such a scheme would allow multiple operating systems to execute on top of the microkernel at the same time (virtualization), special services to run directly on top of the microkernel (no high-level OS involved), or real-time services to be provided next to general purpose services without having to port the latter to a real-time-capable OS.
Due to the separation of the subsystems and drivers into different address spaces, clean interfaces are required and enforced by offering only a limited communication API. As a consequence, both the robustness of the resulting system and its maintainability should improve compared with the monolithic approach.
However, early microkernel-based OSes such as IBM's Workplace OS and early microkernels such as Mach uncovered a significant performance penalty due to the dramatically increased number of context switches: Instead of referencing data structures in some submodule, microkernel-based systems need to issue a request-IPC to the server, switch to its address space, probably copy the requested data into kernel buffers and/or the client's address space, switch back to the client's address space, and finally access the data.
Two lessons are to be learned from these projects:

1. Communication across address space boundaries is the most frequently used service in microkernel-based systems and, consequently, this operation must be as fast as possible.

2. The whole microkernel must be designed to allow for really fast IPC. To achieve this, the target architecture must be carefully analyzed for possible performance features (e.g., sysenter/sysexit on x86) and bottlenecks (avoid slow DRAM accesses, obey the cache structure during layout of the central data structures, avoid mandatory TLB flushes on context switches if possible, minimize kernel impact on branch prediction logic, ...).
The L4 microkernel, which is the basis for this lecture, was designed from scratch (rather than stripping existing systems down) and implemented to provide fast IPC. It offers two abstractions: threads to abstract from the CPU, and address spaces to protect threads against each other. In order to be able to manipulate these abstractions, the L4 kernel also provides two mechanisms: inter-process communication (IPC) allows threads to communicate and to synchronize, whereas mapping allows address spaces to be populated recursively.
The kernel maps all hardware properties to these abstractions: All activities (including I/O devices) are represented as threads; interrupts are transformed into an IPC from the device thread to the driver thread. Pagefaults are wrapped into a pagefault IPC from the faulting thread to another thread that is responsible for providing memory (a so-called pager), and (physical) memory is given to an address space by mapping memory from the pager (the pager itself requests the memory from its respective pager; the root pager idempotently maps physical memory to its clients on demand).
These two abstractions (threads and address spaces) are just one of potentially many valid choices. However, the provided abstractions should be chosen so that everything that needs to be managed in privileged mode (e.g., because loading the page table base register %cr3 on x86 is a privileged instruction) is managed by the microkernel (in privileged mode), and everything else is delegated to unprivileged user-mode tasks.
An additional criterion is to avoid duplicated functionality in the kernel: There should always be exactly one way to do a job, no more and no less. As a consequence, all abstractions and mechanisms must be general enough to allow all desired uses. On the other hand, all abstractions and mechanisms must be orthogonal, so that none can be emulated using the others (no superfluous abstractions or mechanisms are allowed inside the microkernel [1]).
[1] This is not true for large kernels such as Linux: Files can be manipulated using either the open(), read(), write(), and close() syscalls or via mmap()/munmap().
Chapter 2. Threads
Threads are one of the abstractions offered by the L4 microkernel. They abstract from the CPU and thus allow a single CPU to be shared among multiple activities (tasks, programs, jobs, ...). The thread abstraction enables multi-programming systems by switching between threads whenever another activity is to be executed on the CPU.
To this end, threads must be represented by a data structure that can hold all the CPU state, including the current instruction pointer, the stack pointer, and the contents of general purpose and special purpose registers (e.g., floating point or status registers). Additionally, to enable controlled switching between threads (scheduling), scheduling parameters such as thread priorities, time slice lengths, and thread status (e.g., running, ready, blocked, invalid) are also required. All this information is stored in so-called thread control blocks (TCBs), which will be discussed in more detail in Chapter 4, TCBs.
Switching threads is conceptually easy: save the current thread's CPU state into its TCB, select the next thread, load its state from its TCB, and resume execution.
However, two problems arise: First, TCBs must be protected kernel objects, so that no thread at user level can manipulate other threads' states. Furthermore, threads must not be allowed to modify even their own TCB, as the TCBs contain scheduling information (e.g., the remaining timeslice length).
As a consequence, thread switches must (most of the time, see Chapter 10, Local IPC) be performed by the kernel, which means that the kernel must run in order to do so. If the kernel is running, the user-level state of the thread must be preserved somewhere in order to be able to correctly resume it. For this second problem, some hardware support on entering the kernel is required: On kernel entry, the user-level CPU state could be saved on a stack (x86), or the kernel can use a different set of CPU state (register banks, found in MIPS or Itanium machines).
On stack-based machines (x86), a thread switch thus requires the following steps: The system enters kernel mode, the hardware and the kernel save the user-level state of the current thread on its kernel stack, the kernel switches to the kernel stack of the next thread, restores that thread's previously saved state, and returns to user-mode, causing the hardware to resume the next thread where it was interrupted.
Concerning the stack being used in kernel mode, different models are possible. The kernel stack should always be different from the user-level stack for security reasons: First, the user can set up an invalid stack pointer before entering kernel mode; loading a well-known, valid stack pointer on kernel entry is likely to be easier than validating the current value. Secondly, the kernel entry might be caused by a page fault on the user-level stack. Storing user-level state on this stack in order to handle the page fault would thus raise further page faults (it accesses the same unmapped memory), which can never be handled.
The presented models differ in the number of kernel-level stacks.
All threads executing on one CPU can use the same kernel stack. [2] As a consequence, either only one thread can execute in kernel mode at any time (i.e., threads executing in kernel mode cannot be preempted), or unusual approaches such as continuations must be used.
This model is efficient with respect to cache lines used during a context switch (only one stack), but uncomfortable from the kernel developer's point of view.
Alternatively, each thread can have its own kernel-level stack. This model resembles user-level code and makes kernel mode nothing special from the kernel developer's point of view. However, this model is more costly both with respect to virtual memory use (only one stack must be addressable in the previous model, whereas this model requires (number of threads) stacks to be present in virtual memory) and with respect to cache line use on thread switch: Both the stack of the current thread and the stack of the next thread are accessed, effectively doubling the number of cache lines used for them.
Due to its simplicity, L4 employs the per-thread stack model and integrates the kernel stacks into the TCBs. (TCB plus stack fit into a single page to avoid additional TLB misses.)
Code to cooperatively switch threads in kernel mode on x86:
/* save all registers on the stack */
pusha
/* compute TCB address of current thread
 * (esp already points to the end of the TCB,
 * i.e., the top of the kernel stack) */
mov  esp, ebp
and  -sizeof_tcb, ebp
/* assume: edi = address of TCB of B */
mov  esp, TCB_Off_esp(ebp)
mov  TCB_Off_esp(edi), esp
/* setup kernel stack for next kernel entry */
add  sizeof_tcb, edi
mov  edi, 0(esp0_ptr)
/* restore all registers from the stack */
popa
Note:
[2] Threads on different CPUs should always have separate kernel stacks in order to exploit possible parallelism and to avoid synchronization issues among the CPUs on kernel entries.
Chapter 3. System Calls
Manipulation of the microkernel abstractions thread and address space is exposed to the user-level applications via system calls. System calls are provided to create threads (potentially inside a new address space), manipulate their scheduling parameters (in L4: priority, processor affinity) in a controlled way, and for IPC (to bridge address space boundaries and to allow for synchronization among threads).
In order to use one of these services, the user-level application must be able to enter the kernel and to pass arguments to (and receive results from) it.
Due to the separate kernel stacks, passing parameters on the stack is not a viable option. For this reason (and for performance reasons, i.e., to avoid memory accesses), user level and kernel level exchange data solely in registers. [3]
For kernel entry/exit, hardware support is required to atomically exchange the user-level and the kernel-level state (at least stack pointer and instruction pointer). Most architectures offer a special instruction for the single purpose of entering/leaving the kernel.
Intel traditionally offered the int n instruction to allow software interrupts, which are handled exactly like hardware interrupts: When the user executes the int instruction, the hardware

- loads %esp0 (the %esp for privilege ring 0, i.e., kernel mode) for the current thread from a known memory location into %esp and its segment into %ss,

- stores the user-level stack pointer (segment %ss and offset %esp), flags, instruction pointer (%eip and its code segment %cs), and (potentially) an error code (used in case of exceptions rather than interrupts, e.g., to indicate page faults, division by zero, ...) on the kernel stack,

- and loads the address of the handler function (from the interrupt descriptor table entry n) into %eip and %cs.
After this, the handler function can store the remaining registers on the kernel-level stack and call a C-function to implement the requested service. Having completed the request, the kernel restores the user-level registers (or stores results in there, depending on the API/ABI), and returns to user-mode using the iret (return from interrupt) instruction, which atomically pops the user-level %eip/%cs and the %esp/%ss pairs as well as the flags from the kernel stack and enables interrupts again.
The int instruction is easy to use, as the kernel needs to handle interrupts anyway (at least timer interrupts for time-sliced, round robin scheduling and inter-processor interrupts for inter-processor IPC), so the infrastructure is there already. On the other hand, using int is quite slow, especially due to the memory lookup of both the kernel stack pointer (%esp0) and the entry point of the handler routine in the interrupt descriptor table (IDT).
As kernel entries/syscalls are frequent (especially in microkernel-based systems), Intel introduced a less costly alternative kernel entry facility via the sysenter/sysexit instructions. [4] These new instructions avoid memory lookups by providing CPU-internal special purpose registers (model-specific registers, MSRs) to store a single entry point into the kernel as well as a single kernel stack pointer to load on kernel entry via sysenter.
Furthermore, these instructions effectively disable segmentation by setting up flat segments (full 4 GB segments starting at linear/virtual address 0), which allows the additional costly segmentation-related checks on mode changes between kernel mode and user mode to be skipped. As most of today's operating systems (including L4) ignore segmentation and use flat segments already (using only paging for access control and physical memory management), this is usually not a problem (but see Chapter 9, Small Spaces).
As sysenter always jumps to the same handler routine in the kernel, an argument (a register) would be required to select one of the syscalls offered by the kernel. Instead of sacrificing a register for this purpose, L4 implements only the most frequently used system call, namely IPC, via sysenter; all other system calls use the traditional kernel entry method via int and either have an explicit argument denoting the system call number or use different interrupt vectors (int 0x80, int 0x81, ...) for the different system calls.
As L4 aims not only at maximum speed but also at portability, systems without the (relatively) new sysenter instruction shall also be supported and the optimal method of entering the kernel shall be selected at runtime. A problem here is that the stack layout after kernel entry via sysenter differs from the layout after using int:
Using sysenter, the hardware only

- loads %eip from an MSR and forces %cs to the flat segment (0..4 GB), and

- loads %esp from an MSR and forces %ss to the flat segment (0..4 GB).

Nothing is stored on the stack; the user-level instruction pointer and stack pointer to use when returning to user-level must be passed in registers (by convention and because sysexit expects them there, %eip is passed in %edx and %esp in %ecx).
In order to obtain a stack layout that matches the int entry path, the sysenter handler (in the kernel) first stores the user %eip (passed from user-level in the register %edx) and the user %esp (from %ecx) as well as the error code ("IPC syscall") on the stack at the "correct" locations. The segment registers (%cs, %ss) are always flat and thus need not be stored (they are still correct and in place since thread creation), and the flags are undefined after the IPC syscall as per the microkernel API and thus need not be stored either.
Example 3.1. Emulate effects of int after sysenter

mov  (esp), esp    /* load esp from TSS.esp0 */
sub  $20, esp      /* create interrupt frame */
mov  ecx, 16(esp)  /* store %esp and %eip where... */
mov  edx, 4(esp)   /* ... the HW would have done */
mov  $5, (esp)     /* indicate IPC syscall */
Having set up this stack frame, the sysenter handler can jump right to the int entry code to handle the rest, thus enabling faster kernel entry and compatibility without any code duplication.
A similar problem arises for threads that execute in virtual x86 mode (real mode): Whenever they cause a kernel entry/exception, all segment registers are pushed onto the kernel stack before the regular exception frame is set up. To accommodate this, L4 makes %esp0 point to the real top-of-stack only before switching to a virtual x86-mode thread; in all other (the usual) cases, %esp0 points to 4 words before the top-of-stack, effectively pretending that the next kernel entry had pushed 4 nonsense words before its stack frame. This allows the kernel to address, e.g., the user-level return address (%eip) at a fixed offset in the kernel stack under all circumstances and thus simplifies the kernel code by avoiding special cases.
The return path via sysexit has to load the user-level %eip and %esp (from the stack) into %edx and %ecx respectively and must enable interrupts prior to issuing the sysexit instruction. All of this plus segment register reloading and checking would also be done by the iret instruction on the traditional exit path, but much slower.
Example 3.2. Emulate iret using sysexit

mov  16(esp), ecx  /* load %esp and %eip from where iret would */
mov  4(esp), edx
sti                /* enable interrupts as iret would */
sysexit            /* return to user mode */
[3] These might be virtual registers, see the section called "Virtual Registers and the UTCB".
[4] AMD introduced similar instructions, syscall and sysret, which implement the same ideas and shall thus not be discussed in more detail here.
Chapter 4. TCBs
TCBs implement threads. Most importantly, a thread's TCB stores the thread's CPU state (at least a stack pointer; the remaining state such as registers, instruction pointer, and flags can be stored on the kernel stack) while the thread is not executing on a CPU. Besides the state, thread-local management data is also stored in the TCB: the thread's global and local IDs, its thread state, scheduling parameters, list pointers (e.g., for the ready list), a resources bitfield, and a reference to the thread's address space (the page table base pointer (%cr3) for Intel or an ASID/TLB tag for architectures with tagged TLBs).
The resources bitfield indicates whether the fast (hand assembled) IPC path can be used or whether the (slower) C path must be used: If any resources are in use (indicated by at least one set bit in the resources bitfield), the C path must be used to additionally save the state of the resources in use (see the section called "Fast Path IPC").
The thread state and the list pointers are redundant, but speed up access in various situations: When deciding which thread to run next, lists of ready threads per priority level are useful, as with these simply the successor of the current thread can be chosen (round robin, assuming no higher priority thread is ready, see Chapter 7, Dispatching). As list manipulations are rather costly (the current TCB and both its predecessor and its successor TCBs must be updated to insert or remove a thread into/from a list, probably causing cache misses and costly TLB misses), having an additional thread state field in the TCB allows lazy list handling (see the section called "Lazy Dispatching").
Also note that the per-state lists are embedded into the TCBs; no additional list items have to be allocated/released. This simplifies the kernel's memory management (no malloc/free required) and avoids having to locate free memory on demand (increases performance). Furthermore, embedding the list into the TCB guarantees that the list pointers and the other items in the TCB reside in the same memory page: All thread-related data (including the kernel stack, which is "appended" to the TCB) thus share a single TLB entry, avoiding superfluous TLB misses where possible.
A further note on performance: The layout of the TCB structure is performance critical. All items that need to be accessed during IPC (at least global ID, state, and space) should reside in the same cache line/in as few cache lines as possible. Each cache line used by the microkernel introduces a likely cache miss on first access and probably evicts data from user-level from the cache, adding another cache miss on the next access.
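To make the preceding description concrete, a TCB along these lines might look as follows in C. This is a minimal sketch only: the field names, types, and grouping are assumptions for illustration, not the actual L4 definition.

/* Hypothetical TCB layout; the IPC-critical fields are grouped at
 * the start so that they share as few cache lines as possible. */
typedef struct tcb tcb_t;
struct tcb {
    /* hot during IPC: */
    unsigned long  global_id;      /* thread number + version */
    unsigned long  local_id;       /* == address of the UTCB */
    unsigned long  thread_state;   /* running, ready, waiting, ... */
    void          *space;          /* address space (page table base) */
    unsigned long  kernel_sp;      /* saved kernel stack pointer */
    unsigned long  resources;      /* non-zero => slow IPC path */
    /* scheduling: */
    unsigned char  priority;
    unsigned long  timeslice_left;
    tcb_t         *ready_next;     /* embedded ready-list links */
    tcb_t         *ready_prev;
    /* ... the kernel stack occupies the rest of the 2^k byte TCB */
};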
In normal (i.e., expected) use, TCBs have to be located for three reasons: to find the TCB of the currently executing thread, to find the TCB of a thread given its thread ID (e.g., the partner of an IPC operation), and to find the TCB of the next thread to run.
The first task could be accomplished by using a kernel global variable to always store a pointer to the current thread's TCB. However, each access would then require a memory access to read this pointer in addition to accessing the TCB itself. L4 instead makes use of the design decision of integrating the kernel stack into the TCB: Defining TCBs (including the stack) to be 2^k bytes in size (k in [9..12]) and forcing them to be naturally aligned (i.e., the least significant k bits of the start address of the TCB are all 0) guarantees that any address within the TCB masked with ~((1 << k) - 1) yields the start address of the TCB. Given the fact that the kernel stack pointer points somewhere into the TCB of the current thread and is immediately loaded into the %esp register upon kernel entry, the L4 kernel can efficiently locate the current TCB by simply masking %esp with a constant mask.
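In C with a bit of inline assembly, this lookup could be rendered as follows (a sketch, assuming k = 11, i.e., 2 kByte TCBs, and the tcb_t type from above; names are hypothetical):

/* Compute the current TCB from the kernel stack pointer: TCBs are
 * 2^k bytes large and naturally aligned, and the kernel stack
 * (and thus %esp while in the kernel) lies inside the TCB. */
#define KTCB_BITS 11UL

static inline tcb_t *current_tcb(void)
{
    unsigned long sp;
    asm volatile ("mov %%esp, %0" : "=r" (sp));
    return (tcb_t *)(sp & ~((1UL << KTCB_BITS) - 1));
}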
To motivate the scheme used to locate TCBs by thread ID, let us first look at how TCBs are to be stored. As mentioned previously, a microkernel should not rely on fine-grained memory allocation/heap management a la malloc/free, but rather use preallocated data structures whenever possible. This approach avoids the management overheads (both (memory) space and runtime) as well as certain types of errors (out of memory). As a consequence, L4 allocates virtual address space regions for all TCBs on startup.
To further simplify TCB management, all TCBs are allocated en bloc, that is in a single large virtual array, indexed by the thread number. As a consequence, the number of threads that can be active in an L4-based system is limited by the number of TCBs allocated at startup. Given a 32-bit system and 2 kByte per TCB (typical for x86 systems), this approach requires 21 bit thread numbers (up to 2 M threads). However, in this constellation, the TCB array would fill the complete 4 GB address space, leaving no space for other kernel data structures or even code in the same address space. Consequently, L4 limits the number of threads to (an arbitrarily chosen) 256 k (18 bit thread numbers) and thus dedicates 512 MB of virtual memory to the TCB array.
The simplest choice for thread IDs would now be to use the address of the TCB as its thread's ID. This choice is bad because thread IDs are passed from the user to the kernel (e.g., to denote the partner thread of an IPC operation), and using pointers would require sanity checks on the IDs. A better choice would be to use the 18 bit thread number as thread ID, which can efficiently be masked to produce a valid index into the TCB array. However, this approach would waste 14 bits of the 32 bits per word used to hold a thread ID.
Thinking of the thread number as a distinction of threads in space (different thread numbers denote different TCBs) leads to the idea of using the yet unused 14 bits as a distinction of threads in time (the same thread number might (have to) be reused during the uptime of the system). Following this scheme, L4 thread IDs are in fact split into a thread number field (index into the TCB array) and a version field, the use of which is not prescribed by the microkernel but rather left to the user-level thread or task manager on top.
The version field is also the reason for why the global thread ID is stored in the TCB: Any thread ID with the same thread number would resolve to the same TCB, but only a specific version of the thread is currently active. Mismatches in the version field indicate stale thread IDs (e.g., the intended thread has terminated and its TCB reused by another thread) and should thus cause, e.g., IPC operations to fail.
Given the above constraints (32 bit thread IDs, 18 bit thread number, 14 bit version number, 2 kByte TCBs), the most efficient way to encode the thread IDs is to split the version field and place it around the thread number field, so that a single mask operation suffices to compute the byte offset of the desired TCB in the TCB array. By using the least significant 11 bits and the most significant 3 bits of the thread ID for the version field, the thread number field in bits 11..28 directly encodes the byte offset in the TCB array.
Example 4.1. Locating a TCB by its thread ID by masking out a split version field
mov  thread_id, eax
mov  eax, ebp
and  mask_thread_no, eax
add  offset, eax
cmp  TCB_Off_Myself(eax), ebp
jne  invalid_thread_id
Masking out the version field and adding the base address of the TCB array (offset in the sample code) yields a pointer to the addressed TCB iff the versions match. This is checked by the cmp instruction, which compares the given thread ID with the one stored in the TCB. The CPU jumps to the error handling code at invalid_thread_id if the two thread IDs are not equal, otherwise it simply continues.
This approach is efficient (no superfluous memory/cache/TLB accesses), but the split version field is a bit awkward.
With just a single register-internal shift-instruction more, the version field can be left intact and be placed entirely before (or after) the thread number field.
Example 4.2. Locating a TCB by its thread ID by shifting and masking the thread ID
mov  thread_id, eax
mov  eax, ebp
and  mask_thread_no, eax
shr  threadno_shift, eax
add  offset, eax
cmp  TCB_Off_Myself(eax), ebp
jne  invalid_thread_id
Both approaches take the thread number directly as an index into the TCB array. If thread numbers are not assigned sequentially (or if many threads have terminated but their TCBs have not yet been reused), the TCBs of active threads will be spread over the whole TCB array. As a consequence, the TCBs must be backed by standard (e.g., 4 kByte) pages in order to avoid allocating much physical memory (a superpage, e.g., 4 MB) of which only a small fraction is used. This leads to a situation in which it is likely for operations that involve two or more TCBs (e.g., IPC) to require two TLB entries (one per TCB). If the active TCBs could be migrated to/clustered in few superpages, only one TLB entry might be required, covering both TCBs in IPC operations. With the fixed relation between thread number and TCB location above, migration would also change the thread ID, which is not feasible (all references to the previous thread ID would have to be identified and updated).
Introducing a level of indirection in the form of a lookup table solves this problem: The thread number is just an index into a table (thread_table in the sample code), which holds in each entry a pointer to the associated TCB. By placing the version field into bits 18..31 of the thread ID and making use of Intel's advanced addressing modes (scaling the offset by the word/pointer size of 4 bytes), this boils down to the following piece of code:
Example 4.3. Locating a TCB by its thread ID using a lookup table
mov  thread_id, eax
mov  eax, ebp
and  mask_thread_no, eax
mov  thread_table(4*eax), eax
cmp  TCB_Off_Myself(eax), ebp
jne  invalid_thread_id
When a thread terminates, the released TCB can be reused by the next created thread (or even another active thread's TCB can be moved into the gap, updating only the table pointer). With 2 kByte TCBs and 4 MB superpages (typical for x86), the TCBs of 2048 active threads fit into a single superpage, reducing the TLB footprint (and TLB misses) of the kernel.
On the downside, the lookup table itself requires 256 k x 4 B = 1 MB of memory, and accesses to it also require it to be mapped in the TLB. A clever scheme could integrate the lookup table into the first 1 MB of the superpage that also covers the active TCBs, so that table and TCBs usually share a single TLB entry. This approach would leave room for the TCBs of 3 MB / 2 kB = 1536 active threads in the first superpage together with the lookup table; still sufficient for many scenarios. Since more threads only introduce another TLB entry per 2048 TCBs, this approach scales even better than the direct approaches with respect to TLB utilization.
Another drawback of this indirect approach is the added memory access to read the table. This might (probably will) cause a cache miss and thus increases the cost again slightly, but probably not as much as the additional TLB misses in the direct approaches.
The indirect approach has probably not been evaluated well enough in the implementation of the L4/x86 kernel, but has proven its feasibility with the introduction of communication spaces (see Chapter 12, Security). L4 currently employs the direct scheme with shifting and masking and the version field kept intact.
The next ready-to-run thread is neither known by its ID nor is it (usually) the current thread. Locating it highly depends on the scheduler implemented and shall thus be discussed in more detail later (see Chapter 7, Dispatching).
The basic idea is to have a linked list of ready threads, with the current thread being the list head. The next ready thread is then simply the next thread in the list (this description ignores priorities for brevity), which is reachable via the list pointers in the current thread's TCB.
The TCB array with its 256 k entries of 2 kByte each requires 512 MB of memory. For obvious reasons, the microkernel should not claim 512 MB of physical memory from the start on but should rather grow only on demand. Thus the TCB array is allocated only in virtual memory. The first access to a page in the TCB array will cause a pagefault, which is handled specially by the kernel: On a read fault, a shared, 0-filled page is mapped read-only at the faulting address (a zeroed TCB reads as an invalid/unused thread, e.g., carrying the nilthread ID); only on a write fault (i.e., when a TCB is actually created or modified) is a fresh, writeable frame allocated.
This is a variant of the copy-on-write scheme, which saves memory even if a malicious user-level thread probes every thread ID in the system; there will only be a single physical frame mapped multiple times into the TCB array. The initial pagefaults can be avoided by premapping the 0-filled page to back the whole TCB array. Afterwards, only the first TCB modification on each page will cause a pagefault.
This scheme works both for standard (4 kB) pages and for superpages. If the TCB array is to be backed with superpages, a small 0-filled page can still be used to back the unused parts. However, simplicity dictates that the first write requires a superpage to be mapped (otherwise physical memory management, which aims at allowing later promotion to superpages, becomes very difficult)!
In the indirect TCB addressing scheme, the lookup table should not be backed by a shared 0-filled read-only frame, but rather by a dedicated, writable frame containing the virtual address of the first TCB in every word: As long as the first TCB is read-only mapped to the 0-page, all accesses to any TCB will fail at the validation of the thread ID. Once the first TCB is allocated (and thus mapped to a writeable page), all invalid accesses will still fail the validation procedure (the requested thread ID does not match the thread ID in the first TCB). This guarantees that reading from the lookup table always yields a valid pointer into the TCB array, thus avoiding additional checks (the 0-page would yield NULL-pointers, which would have to be checked for). If the lookup table is integrated into the superpage that also stores the initial portion of the TCB array, this page must still be allocated exclusively and writable from the beginning; only the remaining parts of the TCB array can/should use the 0-mapping trick as described earlier.
Having proposed to dedicate 512 MB of virtual address space to TCBs, this is the right time to look at the address space layout to show how this fits in. Generally, there are two possible approaches to store the kernel code and data: One can use an address space dedicated to the kernel, which would require an address space switch whenever a user-level thread invokes a system call, or one can split the 32 bit (4 GB) virtual address space into a user region and a kernel region. As address space switches are comparatively expensive on the main target platform (x86) due to untagged TLBs (also see Chapter 9, Small Spaces), logically splitting the address space is the preferred way to go.
With 512 MB virtual memory being required for the TCBs, a split into 3 GB for the user-level and 1 GB for the kernel seems reasonable (though debatable). L4 reserves the upper 1 GB (0xC0000000..0xFFFFFFFF) of each address space for its own purposes, the lower 3 GB can (nearly) freely be used by the user-level thread (for limitations see the section called "Virtual Registers and the UTCB"). While the lower 3 GB are (naturally) potentially different in each address space, the kernel's GB is synchronized across all address spaces so that the kernel has a consistent view on its memory irrespective of in which address space it currently executes (exceptions to this rule are processor-local memory and per-space regions in kernel memory).
The kernel's 1 GB is further split into the aforementioned per-space region of (maybe) 128 MB, the TCB array (512 MB), and 128 MB for kernel code and data (such as the mapping database, see Chapter 8, Virtual Memory Mapping).
The last 256 MB are used as a window to the first 256 MB of physical memory to allow for efficient manipulation of objects that are either not mapped into virtual memory or are only known by their physical address (this is handy, e.g., when walking multi-level hardware-pagetables (x86) in software because these reference the next level by its physical address).
While most of the kernel's address space is rather static (code, data, window to physmem) or intentionally not synchronized across address spaces (per-space region), the TCB region is highly dynamic and must be kept consistent across all address spaces.
Imagine thread A creates a new thread N, whose TCB resides in a yet unused page. The kernel would create the mapping for the TCB array page in the kernel region within A's address space and continue executing A. At the end of A's timeslice, the kernel decides to run thread B in a different address space. B also wants to talk to N, checks if it exists and gets back the NULL mapping from the TCB array in B's kernel region and falsely concludes that the thread does not exist. Even worse, B might now try to and succeed in creating another thread N, which is probably different from the one A created before (B's N might, e.g., execute different code than A's N). Now the kernel has an inconsistent view (two TCBs for the same thread number), the global IDs are no longer global.
To circumvent this problem, all mappings in the TCB array are first added into a master page table (e.g., the page table of the first user-level task to run). Initially, all TCBs are mapped read-only to the 0-page in this master table. On pagefaults in the TCB array, the mapping in the master table is read and copied to the faulting page table entry. This already suffices to resolve read pagefaults: The faulting thread will now retry the memory access and read either the correct TCB or the 0-page (and thus detect invalid accesses). For pagefaults on write accesses, the master mapping is copied only if it does not map to the 0-page. In the latter case, the faulting thread was trying to create a new thread (using a previously unused TCB). To allow for this operation to succeed, instead of duplicating the 0-page mapping, a new frame is allocated, cleared, and mapped writeable into the master page table (replacing the read-only mapping to the 0-page). Only after this is the new (writeable) mapping copied to the page table of the faulting address space.
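The following C fragment sketches this lazy synchronization (a simplified, hypothetical rendering of the scheme just described; all helper names are assumptions):

#include <string.h>

/* Pagefault in the TCB area: synchronize from the master page table. */
void handle_tcb_area_fault(unsigned long fault_addr, int is_write)
{
    pte_t *master = lookup_pte(master_pgtable, fault_addr);

    if (is_write && pte_maps_zero_page(master)) {
        /* First write to a fresh TCB page: allocate and clear a
         * frame, then make it writeable in the master table. */
        unsigned long frame = alloc_frame();
        memset(phys_to_virt(frame), 0, PAGE_SIZE);
        pte_set(master, frame, PTE_WRITEABLE);
    }

    /* Copy the (possibly upgraded) master entry into the page
     * table of the faulting address space and resume. */
    pte_t *local = lookup_pte(current_pgtable(), fault_addr);
    *local = *master;
}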
This lazy scheme ensures consistency across all address spaces without having to update all page tables at every change (i.e., we update/synchronize lazily). A potential problem could be invalidated mappings; these would have to be propagated to all page tables immediately. However, this case does not arise here: mappings in the TCB area are only ever upgraded (from the read-only 0-page to a writeable frame) and never removed, as TCBs are preallocated once and reused rather than freed.
Some data such as the pointers to the heads of the ready lists per priority level should be CPU-local to avoid conflicts in the caches (ping-pong) and to remove the requirement to serialize concurrent accesses to the data structures from different CPUs. CPU-local data can be implemented either by using an array for each structure, which is indexed by the number of the accessing CPU, or by consolidating all CPU-local data structures in few pages and backing the virtual addresses with different physical frames depending on the accessing CPU. The latter approach is "cleaner" as the source code does not need to be cluttered with artificial array accesses but rather accesses all data structures on all CPUs at the same virtual address, which is transparently mapped to different memory locations; it is thus favored by L4. On the downside, this approach requires a top-level page directory per address space and per CPU in order to selectively replace some mappings to point off to CPU-local data. These additional page directories must be kept in sync, which is implemented on the page directory entry level as with the TCB array (see the section called "Synchronization of the Kernel Address Space").
Chapter 5. IPC
The introduction of address spaces isolates threads in different address spaces from each other, so that they cannot interfere with each other. On the downside, threads in different address spaces can also no longer work together, as there can be no communication between them. To overcome this (micro)kernel-imposed limitation, the kernel provides inter-process communication (IPC) primitives.
When implementing communication primitives, one fundamental design decision has to be made: Shall the sender block until the IPC operation returns (successfully or not) or shall the sending thread continue immediately and let the kernel deal with the actual message transfer?
The latter case (asynchronous send) usually requires the kernel to buffer the message until the receiver is ready to accept it, implying that each message is copied twice: from the sender into a kernel buffer, and later from the kernel buffer to the receiver.
Buffering messages in the kernel makes the kernel vulnerable for denial-of-service attacks (e.g., by letting threads send messages that will never be received). Furthermore, copying the messages twice not only wastes processor cycles, but also increases the kernel's cache (and probably TLB) footprint.
Assuming that IPC is the most frequently invoked microkernel service, L4 implements only synchronous IPC, which allows to keep the message in the sender's address space until it can be copied to the receiver. This scheme avoids both the in-kernel buffers and a second copy of the message plus the sender can be informed about success or failure when the IPC operation returns.
For message-based communication, at least two primitives are required: send (a message to a specified thread) and receive (a message from a specified thread).
These operations allow two threads to communicate with each other and prevent interference from other threads: The receiver only accepts messages from the intended sender (a so-called closed receive). Other threads must wait until the receiver accepts messages from them. However, servers that offer their services to all threads are difficult to implement using the closed receive primitive for two reasons:
First, the server cannot know in advance which client will issue the next request and thus does not know which sender to specify for the closed receive. Secondly, as the thread blocks during receive (assuming the client has not yet issued a request), the server would wait for one particular client while the requests of all other clients remain unserviced.
To avoid these problems, the kernel should offer an additional primitive (known as an open receive): receive (a message from any thread willing to send to the invoking thread).
Looking at the (expected) common interaction patterns in a multi-server system, clients will typically send requests to servers and then wait for the answer (cf. RPC). If in such a scenario the client is preempted after having sent the request but before having issued the accompanying receive, the server might try to deliver the result (answer message) to the client before the client is receiving. As a consequence, the server would block and thus be prevented from servicing other requests. To solve this problem, two approaches can be considered:
Combined send-and-receive primitives could be: call (send a message to a thread, then perform a closed receive from that same thread; the typical client operation) and reply-and-wait (send a message to a thread, then perform an open receive; the typical server operation).
Depending on the size of the messages to be transferred, we can distinguish three message types: short messages, which fit entirely into the (message) registers; string messages, for which the kernel copies memory buffers between the address spaces; and map messages, which transfer access to memory pages (mappings) rather than data.
Besides memory pages, other resources can be mapped as well (e.g., access to I/O ports on IA32).
The distinction becomes important when implementing the IPC primitives, as we shall discuss in Chapter 6, IPC Implementation.
In order to enable threads to recover from a (blocking) communication attempt with an unresponsive (malicious, faulty, or uncooperative) partner, the duration of each IPC operation can be bounded with a number of timeouts.
The time a sender waits for the receiver to call the corresponding recv can be limited using the send timeout. A 0-timeout enables polling (sending the message only if the receiver is ready).
The time a receiver waits for a sender to arrive can be bounded using a receive timeout. A 0-timeout enables polling/receiving messages only if a sender is waiting to deliver.
Using transfer timeouts (also known as xfer timeouts), both the sender and the receiver can limit the duration of the actual message transfer. Transfer timeouts are only considered for string IPCs, because all other types are guaranteed to be free of user-level pagefaults during message delivery. Since strings need to be copied from the sender's address space to the receiver's, pagefaults can arise both in the sender's address space (while reading the message) and in the receiver's address space (while writing the message). Since pagefaults are resolved by user-level pagers (see Chapter 8, Virtual Memory Mapping), which are under the control of the user and thus not trustworthy, the sender can specify an xfer timeout to limit the total time available to the receiver's pager to resolve all occurring pagefaults. Conversely, the receiver can specify an xfer timeout to limit the total time available for the sender's pager. Both xfer timeouts are combined and the minimum determines the effective xfer timeout that bounds the duration of the message transfer.
Concerning the encoding of timeouts, there are a number of constraints to consider: timeouts should occupy as few bits as possible (so that they can be passed in registers alongside the other IPC parameters), short timeouts should offer fine granularity while very long timeouts must still be expressible, and the special values 0 and infinity must be encodable.
As a consequence, L4 employs 16 bit wide, logarithmic timeouts with a 10 bit mantissa m (in microseconds) and a 5 bit exponent e (for relative timeouts; the 16th bit discerns relative and absolute timeouts), allowing relative timeouts to be specified as m * 2^e, i.e., between 0 and 1023 * 2^31 microseconds (about 610 hours).
For absolute timeouts, L4 similarly uses 16 bits: a 4 bit exponent e, a 1 bit epoch designator d, and a 10 bit mantissa m (in microseconds); the 16th bit again distinguishes absolute from relative timeouts. Computing absolute timeouts from 16 bits relies on a reference clock R, which is an L4-provided 64 bit monotonically increasing microsecond counter (though the update rate of the counter is architecture dependent). The absolute timeout triggers as soon as the least significant 10+e bits of R equal m << e and bit 10+e of R equals d (or, C-like, when (R & ((1ULL << (11+e)) - 1)) == ((d << (10+e)) | (m << e))).
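A minimal sketch of encoding a relative timeout under this layout (mantissa in bits 0..9, exponent in bits 10..14; the helper name is made up):

#include <stdint.h>

/* Encode a duration in microseconds as a relative timeout
 * m * 2^e with a 10 bit mantissa and a 5 bit exponent
 * (rounding down; bit 15 = 0 marks the timeout as relative). */
uint16_t encode_rel_timeout(uint64_t us)
{
    unsigned e = 0;
    while (us > 1023 && e < 31) {
        us >>= 1;      /* halve the mantissa, bump the exponent */
        e++;
    }
    if (us > 1023)
        us = 1023;     /* clamp to the ~610 hour maximum */
    return (uint16_t)((e << 10) | us);
}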
For every IPC operation, a number of parameters such as the intended receiver, the type and contents of the message and the desired operation (send, receive, call, ...), or the desired timeouts (if any) must be specified. Clever encoding of these can avoid superfluous memory accesses and thus help to make IPC fast.
To avoid code duplication, L4 only implements a single IPC system call, which implements the atomic send-and-receive operation between threads. As a consequence, the system call requires two thread IDs: a destination specifier, denoting the thread to which a message shall be sent (or NIL if no send phase is desired), and a receive specifier. If not NIL, the receive specifier denotes the thread that is expected to send a message to the invoking (i.e., the current) thread. The receive phase always follows the send phase (if any): Typically, having received a message requires some sort of processing before a meaningful answer can be sent. An exception to this rule would be a simple "ACK", which is implicitly passed to the originator as the return code of the send operation: If the send returns without errors, the message has been delivered (see the section called "Synchronous Communication").
If the receive specifier is NIL, no receive phase is initiated and the syscall returns right after the send phase (if any).
For an open receive, which would allow the current thread to receive a message from any thread, another special thread ID can be used: the wildcard thread ID (-1).
This scheme allows the desired operation (send to, receive from, receive from any, call, or reply and wait) and all relevant threads to be encoded in only two words.
As both addresses must be inspected in all IPC calls, these should be passed in registers rather than memory to the kernel.
Four timeouts accompany each IPC: a send timeout, a receive timeout, and the two xfer timeouts (one specified by the sender, one by the receiver, as described above).
With all timeout values being 16 bit wide, (at least) two timeouts can be stored in a single word. As the send and/or receive timeouts are always required whereas the xfer timeouts are only required for string IPC operations, we combine send and receive timeout in one 32 bit word, which should be passed to the kernel in a register to avoid a memory access. The xfer timeouts are combined in a second word, which can be passed in memory: we only access them for string IPCs, which touch a lot of memory anyway, so that the (relative) overhead will be small.
The message itself is stored in 64 word-wide virtual message registers (MR0 .. MR63), each of which is realized either in a physical processor register or in a thread-local region of memory (see the section called "Virtual Registers and the UTCB").
The message itself consists of a message tag in MR0, followed by 0..63 untyped words (which are simply copied during an IPC), followed by 0..63 typed words. Typed words (or typed items) are interpreted by the kernel and serve to encode strings and mappings.
String items instruct the kernel to copy a contiguous region of memory (a "string") from the sender's address space to the receiver's address space. To this end, string items must convey the length of the string and the start address(es) of its substring(s) in the sender's address space.
The first word in a string item (the string specifier) consists of the following fields:

- a continuation flag (C, bit 0), indicating whether further typed items follow this string item,

- a cacheability hint (hh, bits 1..2), with which the sender can limit cache pollution during the copy operation,

- the number of substrings minus 1 (N, bits 4..8),

- a compound flag (c, bit 9), indicating that another string specifier follows that belongs to the same string, and

- the length of the substrings in bytes (bits 10..31).

The N+1 words that follow the string specifier denote the start addresses of the substrings, which need not be aligned and thus require a full word each.
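As an illustration, the string specifier could be rendered in C roughly as follows (a sketch only: bitfield layout is compiler dependent, and the field positions are as assumed above):

#include <stdint.h>

/* Hypothetical rendering of a 32 bit string specifier. */
typedef struct {
    uint32_t C      :  1;  /* more typed items follow */
    uint32_t hh     :  2;  /* cacheability hint */
    uint32_t        :  1;  /* assumed 0: marks a string item */
    uint32_t N      :  5;  /* number of substrings - 1 */
    uint32_t c      :  1;  /* compound: another specifier follows */
    uint32_t length : 22;  /* substring length in bytes */
} string_specifier_t;
/* ...followed by N+1 words holding the substring start addresses */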
Map items enable the sending thread to establish a region of shared memory with the receiver. In order to allow the mapper to restrict the access of the "mappee" to the memory (e.g., read only), map items must specify the region of the sender's address space to be mapped (an fpage: base address and size) together with the access rights granted to the receiver.
Since multiple map items should be allowed in each message, but only a single receive window can be specified by the receiver (arbitrarily large, but only one, see the section called "Receive Buffers"), each map item also specifies an offset within the receive window.
All this is encoded in two words. The first word of a map item consists of:

- a continuation flag (C, bit 0), as in string items,

- a type field (bits 1..3, reusing i.a. the bits that hold the cacheability hint hh in string items), marking the item as a map item, and

- the send base (bits 10..31), i.e., the offset within the receiver's receive window at which the mapping shall be placed.

The second word holds the fpage (base address, size, and access rights) to be mapped.
Grant items allow an fpage to be mapped and simultaneously removed from the sender's address space; the fpage is effectively moved to the receiver's address space.
In their encoding, grant items differ from map items only in the first bit of the type field (bit 1 of the send base): Map items set this bit to 0, grant items set it to 1.
The effect of the grant operation cannot be achieved with any series of map and unmap operations and is thus a required additional, not a redundant operation (also see Chapter 8, Virtual Memory Mapping for the effects of grant on derived mappings and the mapping database).
In order to properly decode the message, the first message register (the message tag) always provides some meta-information about the remaining message:

- the number of untyped words (u, bits 0..5),

- the number of typed words (t, bits 6..11),

- flags (bits 12..15), and

- a label (the remaining bits 16..31/63), which is not interpreted by the kernel and can, e.g., encode the type of request.
The flags in the message tag are used to signal the use of a kernel-based IPC control mechanism (IPC propagation or IPC redirection), indicate inter-processor IPCs, and to store the error indicator after the IPC.
The flags are typically set by the kernel for inspection by the receiver and the sender (error bit).
As the message tag describes the message, it must be inspected on every IPC and should thus be passed to the kernel in a register rather than via memory.
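Composing such a tag is a matter of a few shifts and masks; a sketch under the layout above (the helper name is made up):

#include <stdint.h>

/* Build a message tag: u untyped words, t typed words, flags,
 * and a user-defined label (32 bit variant). */
uint32_t msg_tag(unsigned u, unsigned t, unsigned flags, unsigned label)
{
    return  (u     & 0x3f)         /* bits  0..5  */
         | ((t     & 0x3f) <<  6)  /* bits  6..11 */
         | ((flags & 0x0f) << 12)  /* bits 12..15 */
         |  (label         << 16); /* bits 16..31 */
}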
Whenever an IPC operation includes a receive phase, the receiver must specify which messages it is willing to accept: buffers for incoming strings and a receive window for incoming mappings.
Untyped words are simply copied into the receiver's message registers, so nothing needs to be specified here (the number of message registers is the same for all threads, so no overflow can occur).
For strings, string items as in the send case are employed to denote the buffers into which the strings shall be received (overly long send strings are truncated to the length of the according receive string).
In order to accept mappings, the receiver can specify a single contiguous virtual address region (an fpage, see Chapter 8, Virtual Memory Mapping); all incoming map items must fit into this region or an error is signaled.
The receive window is given as part of the acceptor in BR0, which consists of:

- a continuation flag (C, bit 0), indicating whether receive string items follow in the subsequent buffer registers,

- the size of the receive window (s, bits 4..9), and

- its base address (bits 10..31/63).

The size of the receive window is actually 2^s bytes. Due to encoding and portability problems, s must be at least 10 and the base address must be naturally aligned (its last s bits are forced to 0 by the kernel). Disallowing s<10 guarantees that all valid base addresses can be stored in the upper 22/54 bits. As an exception, the special nilpage (0) can be used to refuse all mappings.
The remaining 33 (virtual) buffer registers BR1..BR33 provide a list of receive string items, its end being indicated by the first string specifier with a clear C bit.
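Constructing the acceptor word under this encoding could look as follows (a sketch; base must be aligned to 2^s, and the function name is made up):

#include <stdint.h>

/* Build BR0: accept mappings into a 2^s byte window at base;
 * strings_follow indicates that BR1.. hold receive string items. */
uint32_t acceptor(uint32_t base, unsigned s, int strings_follow)
{
    return (base & ~((1u << 10) - 1))  /* base address: bits 10..31 */
         | ((s & 0x3f) << 4)           /* window size s: bits 4..9  */
         | (strings_follow & 1);       /* C flag: bit 0 */
}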
Finally, the completion status of an IPC operation (successful, aborted after timeout, partner thread invalid, mappings rejected, ...) should be reported to the caller. Assuming success as the common case, L4 reports the outcome in only one bit (error flag in the message tag, MR0). Details about the error (if any) are given in a predefined, thread-local memory location (see the section called "Virtual Registers and the UTCB").
In order to set up an architecture independent API without limiting all architectures to the greatest common divisor of them all, L4 bases its IPC API on a set of virtual registers to store the IPC arguments (MRs, BRs, xfer timeouts, error details). Virtual registers are mapped either to physical registers or to a kernel provided, user accessible, thread local, unpaged memory location, which is called the user-level thread control block (UTCB). Accesses to this memory never raise pagefaults; the kernel guarantees that the UTCB is always present in physical memory.
The kernel can locate the UTCB easily because the local thread ID, which is stored next to the global thread ID in the kernel TCB, is defined as the address of the thread's UTCB.
The user, however, needs a different way to locate the UTCBs:
On IA32, L4 employs an otherwise unused segment register (%gs) to grant access to a segment that contains as its first (and only) word a pointer to the UTCB of the currently running thread.
Using a pointer in a user-accessible but fixed segment has several advantages over other schemes: The pointer can easily and safely be manipulated by both the kernel and the user (see Chapter 10, Local IPC), thus allowing cheap thread switching purely at user-level (among threads in the same address space). If the segment was instead set up to directly give access to the UTCB, every thread switch would require the segment descriptor to be rewritten, which is a privileged operation and would thus rule out cheap user-level switching.
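From user level, fetching the UTCB pointer then boils down to a single segment-relative load; a sketch for IA32, assuming the %gs setup described above:

/* Read the current thread's UTCB address from %gs:0. */
static inline void *my_utcb(void)
{
    void *utcb;
    asm volatile ("mov %%gs:0, %0" : "=r" (utcb));
    return utcb;
}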
Chapter 6. IPC Implementation
The following elaboration mainly focuses on single-processor systems, i.e., only one thread executes on the CPU at any time. Furthermore, we assume a strict priority-based scheduling strategy with fixed priorities and round-robin per priority level as provided by L4 (see Chapter 7, Dispatching), though the ideas presented should apply more generally as well.
The implementation of synchronous IPC primitives (send, receive, and combined operations such as call or reply-and-wait) are tightly related to scheduling: Whenever a thread waits for an IPC partner (trying to send to a thread that is currently not accepting messages, or trying to receive a message while none is pending), the thread is blocked and another thread needs to be scheduled.
Conversely, if the send phase of a call operation completes, the sender will wait for the response of the receiver. This (common!) case can be improved by not going through the scheduler and instead dispatching the receiver thread directly, allowing it to use the remainder of the sender's timeslice to start fulfilling its request. Executing the receiver in the sender's timeslice is called timeslice donation; the sender would block anyway (after the call), but instead of losing its assigned CPU time, the sender passes it on to the receiver so that the latter can do some work on the sender's behalf.
This approach is fair in all respects if the sender used the call primitive: the sender deliberately waits for the receiver's answer anyway, so it loses nothing when its remaining CPU time is spent on producing exactly that answer.
Besides a call to a waiting receiver, other (attempted) IPC operations also require dispatching/scheduling actions:

- send, receive, or call while the partner is not waiting,

- receive from a partner that has previously called us,

- call a partner that is waiting for us,

- receive from a partner that has previously invoked a send for us,

- send to a partner that has previously invoked a recv or called us.
These cases are different from the previous ones: Before, either the sender or the receiver (or none of them) was ready to run after the invocation of the IPC primitive; the other thread blocked. If (as in these cases) both threads are ready to run after the IPC, a true scheduling decision must be made:
If the two threads have the same priority, a policy decision is required: Continue to execute the current thread until the end of its timeslice (lazy IPC operation) or immediately switch to the partner (eager)?
[The author of this document believes that the eager approach unjustifiably penalizes the current thread (cutting short its timeslice though it has work to do) and thus favors the lazy approach, which additionally avoids the overhead of a thread switch. L4 implements the lazy approach.]
IPC operations that only transfer messages in the message registers (no strings, no mappings) are called short IPC. Under the assumption that call-style IPC is prevalent in microkernel-based systems and that the receiver (often a server) is waiting, such a short IPC can on uni-processor machines be implemented via a (simplified) thread switch, carrying at least part of the message in the register file as follows:
The sender performs the call system call, thus entering kernel mode and switching to the kernel stack of the sender thread. All registers except the stack pointer are preserved. (The x86 segment registers %cs, %ss, %ds, %es, %fs, and %gs, the (extended) FLAGS, and the instruction pointer are overwritten on kernel entry and cannot be used to transfer messages.) The kernel then locates and validates the destination TCB, switches to the receiver's kernel stack, and records that stack for the next kernel entry (esp0). Some (x86: 2) additional registers are required for locating and verifying the destination TCB; all others are preserved and thus carry the message when the kernel finally returns to user mode in the context of the receiver.
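In (heavily simplified) C, the heart of this path might read as follows. This is a sketch only: the real fast path is assembly, and all names here are hypothetical.

/* Short-IPC call to a waiting receiver (uni-processor sketch).
 * The message itself stays in the untouched CPU registers. */
void ipc_call_short(tcb_t *dest)
{
    tcb_t *cur = current_tcb();

    cur->thread_state  = TS_WAITING;   /* caller awaits the reply */
    dest->thread_state = TS_RUNNING;

    cur->kernel_sp = get_kernel_sp();  /* save our kernel stack */
    set_kernel_sp(dest->kernel_sp);    /* adopt the receiver's stack */
    set_esp0(dest);                    /* stack for next kernel entry */
    switch_space(dest->space);         /* load receiver's page table */
    /* return to user mode: the receiver resumes, message in regs */
}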
Short IPC, meaning IPC with its message solely contained in untyped message registers without strings or mappings, is (assumed to be) the most frequent kernel operation in microkernel-based systems. This justifies taking special care on its implementation to avoid most of the possible performance problems (cache misses, TLB misses, memory references in general, down to microarchitectural effects such as pipeline stalls and flushes or instruction scheduling; the latter shall not be discussed here in more detail).
In order to retain full control over these effects, L4 features two IPC paths: a hand-optimized fast path for the common case and a slower, general path written in C.
The fast path is implemented in hand-crafted assembler, but handles only the easy (hopefully common) case:

- short IPC (untyped words only, no strings or mappings),

- call or reply-and-wait semantics, so that no scheduling decision is required, and

- no effective timeouts.
The selection between the two IPC paths is made at the beginning of the IPC entry point by checking the above criteria. If any criterion for the fast path is violated, the code branches off to (calls) the C code, otherwise the fast path is used.
In order to retain good performance, the target architecture must be inspected and searched for possible problems. On x86, checking the fast path criteria boils down to computing, for each criterion, a value that is non-zero iff the slow path is required; all these values are or'ed together: if the resulting value is 0, no criterion required the slow path and we remain on the fast path; if the result is not 0, we use the C path instead.
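A sketch of this branch-free check (the criteria shown and all names are illustrative, not the actual L4 code):

/* Each term is 0 iff the fast path remains allowed. */
static inline int need_slow_path(tcb_t *cur, tcb_t *dest,
                                 unsigned long tag, unsigned long timeouts)
{
    unsigned long slow = (tag & TAG_TYPED_MASK)            /* typed items?   */
                       | (timeouts ^ TIMEOUTS_DEFAULT)     /* real timeouts? */
                       | (dest->thread_state ^ TS_WAITING) /* not waiting?   */
                       | cur->resources;                   /* resources used? */
    return slow != 0;
}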
Attention is required to make the two paths compatible: If one thread enters the wait state on the slow path (e.g., because it used recv rather than call) and this thread is to be resumed by a thread that sends to it using the fast path (via a call), the receiver must be in a state that allows proper resumption on this path. Conversely, the reply might come in on the slow path, reactivating a thread that entered the wait state on the fast path.
Luckily this turns out to be no problem in practice, because both the fast and the slow IPC path leave the waiting thread in a consistent state (same thread state, same stack layout for the top few items). Exclusively cooperative switching of threads inside the kernel (the kernel is not preemptible, except during string IPC) avoids race conditions while the state of a thread is changed, i.e., during the thread switch.
However, with the fast path leading up to a sysexit (to activate the receiver), the fast path cannot be used to activate kernel threads such as interrupt threads, which never execute in user mode and do not even have a surrounding address space.
Long IPC denotes all IPC that includes string items or map/grant items, i.e., typed items. The complexity of these operations in conjunction with accessing the required data structures makes an assembly "fast path" implementation infeasible. Additionally, due to the string copy or pagetable manipulation, the costs of long IPC are generally so high that such a fast path would not pay off much and is thus not provided (which also saves maintenance overhead).
Long IPC requires the following steps:

- validate the partner thread and its state
- analyze and transfer the message
We want to enable interrupts during string copy operations in order to keep the interrupt latency low: Each string can be up to 4 MB in size, so the pure copy time approaches the length of a timer tick (order of milliseconds). We must disable interrupts afterwards, as the thread switch and other kernel code assume atomicity.
We must allow for pagefaults to be handled during string IPC, which requires that we prevent the reactivation of any of the partners. Otherwise another thread might concurrently be allowed to send a message to the receiver, partly overwriting its message registers and wreaking havoc. The sender must not continue before having sent the message so that it cannot mess with its message registers (besides other reasons).
The approach taken by L4 to avoid such problems is to lock both partners' TCBs while the interrupts are still disabled. Locking simply means to set their thread states to locked_running (sender) and locked_waiting (receiver), which are different from both ready/running and waiting and thus prevent anyone from communicating with or dispatching these threads.
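A minimal sketch of this locking scheme, with the state names taken from the text and everything else assumed:

    typedef enum {
        STATE_RUNNING,                 /* ready/running                      */
        STATE_WAITING,                 /* blocked in IPC                     */
        STATE_LOCKED_RUNNING,          /* sender locked during string copy   */
        STATE_LOCKED_WAITING           /* receiver locked during string copy */
    } thread_state_t;

    typedef struct tcb { thread_state_t state; } tcb_t;

    /* Assumed helpers, not the real L4 interfaces: */
    void enable_interrupts(void);
    void disable_interrupts(void);
    void copy_strings(tcb_t *from, tcb_t *to);  /* preemptible, may fault   */

    static void string_ipc_transfer(tcb_t *sender, tcb_t *receiver)
    {
        /* Lock both partners while interrupts are still disabled: the
         * locked states are neither ready nor waiting, so nobody can
         * communicate with or dispatch these threads in the meantime. */
        sender->state   = STATE_LOCKED_RUNNING;
        receiver->state = STATE_LOCKED_WAITING;

        enable_interrupts();            /* keep the interrupt latency low   */
        copy_strings(sender, receiver);
        disable_interrupts();           /* switch code below is atomic      */
    }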
We defer a detailed description of mapping IPC until Chapter 8, Virtual Memory Mapping, to present it in its natural context; the remainder of this chapter will thus focus on string IPC. Also keep in mind that this description of the IPC operations disregards the multi-processor case and thus ignores the additional SMP-safe locking requirements on accessing TCBs and such. The L4 implementation is SMP safe.
String IPC provides a means to safely (guarded by the kernel) copy large amounts of data between address spaces. Though not strictly required (short IPC would suffice), string IPC offers a much more comfortable and more efficient solution to this task than short IPC could, while still keeping many of the IPC properties: synchronous, atomic message transfer (the sender continues only after the message has been transferred completely) from sender-defined data into memory locations designated as such by the receiver. Furthermore, kernel-mediated string IPC guarantees that the originator of the accepted string data is known, as the sender's true identity (thread ID) is conveyed to the receiver during the IPC (a matter of trust).
The problem with inter-address-space string copies (or memory copies in general) is that neither the sender nor the receiver necessarily has access to both the source and the destination buffer; if both buffers reside in private memory (not shared between the sender's and the receiver's address spaces), no single pagetable contains mappings that make both buffers' physical frames accessible. In this case, not even the kernel can copy the data directly.
The traditional approach to copying data from one address space to another is to use an intermediate buffer in the kernel's address space (which is mapped into all address spaces (top 1 GB) and is synchronized to map to the same physical frames in all address spaces). First, running in the sender's address space, the kernel copies the strings into the intermediate buffer in kernel space, then it switches to the receiver's address space and copies the data out of the kernel buffer into the receiver's buffers.
Though this approach works well, it requires two copies of the message and (for long messages) trashes many cache lines (source buffer, intermediate buffer and destination buffer will occupy cache lines, probably evicting each other).
Assuming 4-byte words, 32-byte (8-word) cache lines, and an n-word string to copy, copying the string from the source to the intermediate buffer and from the intermediate to the destination buffer each requires n word-wide read operations and n word-wide write operations, thus 4n (cached) memory accesses in total.

Assuming further that the sender has just written the message into the source buffer and that the receiver will immediately read the complete string after reception, small messages will still be in the cache when copied into the kernel buffer (only n/8 cache misses on the kernel buffer). With the same argument, short strings will still be in the cache when copied from the kernel buffer into the destination buffer. The cache misses on the receive buffer are not considered here, as the receiver will read the message immediately: if the kernel writes the message there (suffering cache misses), the receiver will not suffer cache misses. Conversely, if the message were not in the cache, the receiver would suffer the cache misses; the kernel-generated cache misses on the buffer are no avoidable overhead. The total overhead of this approach is thus n/8 cache misses when writing to the kernel buffer; all other cache misses are due to the sender writing and the receiver reading the message and are thus "desired" (at least, not to be avoided).

For large n (n > size of the cache in words), the assumption of the message still residing in the cache before each copy does not hold: The later part of the string will evict the earlier part from the cache. Thus this approach yields n/8 cache misses when reading the source buffer, n/8 cache misses when writing the kernel buffer, n/8 cache misses when reading the kernel buffer [5], and n/8 cache misses when writing the receive buffer. All of these are overhead (both the sender and the receiver will suffer additional cache misses when accessing their buffers), so for large n the overhead is 4n/8 (or n/2) cache misses.
Compare this to the alternative approach to be discussed next.
The traditional copy-in/copy-out scheme works well for small n, but even then involves complex buffer management schemes in the kernel (pagefaults on the receive buffer might cause a pager to page in memory from disk; in the meantime, other threads might also perform string IPC; the size of the kernel buffer thus cannot be fixed nor known in advance).
To overcome these problems, L4 uses temporary mappings to make both the sender's (source) buffer and the receive buffer accessible in one address space at the same time. [6] For this purpose, the kernel reserves a mapping area in its address space region (the size of two consecutive superpages, 8 MB on x86) and maps the receive buffer into it (the buffer is at most 4 MB, but not necessarily aligned; it is only guaranteed to fit into an 8 MB region!). Afterwards, the sender can access the source buffer at its natural (virtual) address and the receive buffer via the (virtual) address of the mapping region in the kernel memory.
Having set this up, we can copy directly from source buffer to destination buffer (using the virtual addresses of the mapping region), without requiring a kernel buffer and without having to copy the string twice.
With the assumptions from the copy-in/copy-out scheme, this approach only requires 2n memory operations (instead of 4n) and touches 2n/8 cache lines (instead of 3n/8; no cache lines are required for the temporary mapping area if the cache is physically indexed, as the temporary mapping maps to the same physical frames as the receive buffer). It avoids all overhead cache misses for small n (the message is only copied within the cache, given it fits; possible cache misses on the receive buffer are attributed to the receiver as above and thus not counted as overhead), and it incurs only 2n/8 overhead cache misses for large n (instead of 4n/8 as derived above; the 3n/8 stated in the table in the lecture slides seems wrong): n/8 when reading the source buffer and n/8 when writing the receive buffer. The receiver will suffer additional cache misses when accessing the buffer, so these 2n/8 are true overhead (avoidable) misses.
On the downside, the temporary mapping also has a number of drawbacks: The setup cost is higher (set up the kernel mapping to the receive buffer), and more needs to be done should the copy operation be interrupted: The mapping area is a shared resource, whose state must be preserved across thread switches, but in contrast to the kernel buffer above, the mapping area can quite easily be shared, as detailed next.
Setting up the temporary mapping area only requires copying the receiver's pagetable entries that cover the receive buffer (up to 1024 of them) into the entries that cover the temporary mapping area in the sender's pagetable. Due to the two-level pagetable scheme employed by Intel x86, copying the up to two top-level pagetable entries suffices to make the complete receive buffer accessible. This effectively integrates (one or two of) the receiver's second-level pagetables into the sender's address space.
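As a sketch (structure layout, constants, and names are assumptions for illustration), establishing the temporary mapping on x86 boils down to two word-sized copies:

    #include <stdint.h>

    #define PDE_SHIFT     22                         /* one PDE covers 4 MB   */
    #define TEMP_AREA_IDX (0xE0000000u >> PDE_SHIFT) /* assumed area location */

    typedef uint32_t pde_t;
    typedef struct { pde_t pdir[1024]; } space_t;    /* x86 top-level table   */

    /* Copy the two top-level entries covering the receive buffer into the
     * slots covering the temporary mapping area. Each entry maps 4 MB, so
     * the 8 MB area always contains the (<= 4 MB, possibly unaligned)
     * receive buffer. */
    static void setup_temp_mapping(space_t *sender, const space_t *receiver,
                                   uint32_t rcv_buf)
    {
        unsigned idx = rcv_buf >> PDE_SHIFT;
        sender->pdir[TEMP_AREA_IDX]     = receiver->pdir[idx];
        sender->pdir[TEMP_AREA_IDX + 1] = receiver->pdir[idx + 1];
        /* The buffer then appears at the area's base address plus
         * (rcv_buf & ((1u << PDE_SHIFT) - 1)). */
    }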
Since we made long IPC preemptible (interrupts enabled during the message transfer), other threads might be scheduled while the temporary mapping is set up. These threads might themselves start a string IPC and re-establish the temporary mapping area for their own transfer, so that the preempted sender, once resumed, would write into the wrong receive buffer.
Even worse would be a scenario in which thread A is preempted in a string copy to thread B, and the preempting thread C is resumed in the middle of its (previously started and preempted) string copy to thread D: Without care, C would now write into the receive buffer of B, as that's the way the temp. mapping area is (still) set up when C resumes.
To avoid such problems, L4 always invalidates the temp. mapping area on a thread switch. As the mappings might already be in the TLB, we also have to flush the TLB, as always after revoking mappings. Flushing the TLB is costly (not the flush itself, but restoring the TLB entries afterwards requires a pagetable walk (two memory accesses) per TLB entry), and it is implicitly performed on x86 whenever the address space is switched (by loading the address of the top-level pagetable into %cr3). So we only need to explicitly flush the TLB if we switch to a thread in the same address space as the (preempted) sender of the string IPC.
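Continuing the sketch from above, the thread-switch hook might look like this (flush_tlb and the space bookkeeping are assumed helpers):

    void flush_tlb(void);   /* assumed helper; on x86: reload %cr3 */

    /* Invalidate the temporary mapping area on every thread switch. The
     * TLB only needs an explicit flush if we stay in the same address
     * space; otherwise the %cr3 reload done by the switch flushes it. */
    static void invalidate_temp_mapping(space_t *cur, const space_t *next)
    {
        cur->pdir[TEMP_AREA_IDX]     = 0;
        cur->pdir[TEMP_AREA_IDX + 1] = 0;
        if (next == cur)
            flush_tlb();
    }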
If now thread C resumes and writes to the temp. mapping area, it will raise a pagefault (A's/B's mappings there have been invalidated), which will be handled as a special case in the L4 kernel to reestablish C's temp. mapping area:
If the pagetable entry of the mapping area still matches the receiver's pagetable entry, we have a genuine pagefault on the receive buffer (the second-level pagetable entry is invalid or read-only), so the receiver's pager is notified to handle the pagefault via a tunneled pagefault. [7] Otherwise, the mapping area was invalidated by a thread switch and is simply re-established from the receiver's pagetable.
In order to be able to restore the temp. mapping after preemption, we need to store the thread ID of the receiver and the location of the receive buffer in the sender's TCB (done when the temp. mapping is set up before starting to copy).
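Putting the pieces together, the pagefault special case might be handled along these lines (building on the sketches above; the TCB fields rcv_thread and rcv_buf are the values remembered before copying, and tunnel_pagefault is an assumed helper):

    typedef struct xtcb {
        space_t     *space;       /* the thread's address space           */
        struct xtcb *rcv_thread;  /* IPC partner, remembered in the TCB   */
        uint32_t     rcv_buf;     /* partner's receive buffer address     */
    } xtcb_t;

    void tunnel_pagefault(xtcb_t *receiver, uint32_t fault_addr);

    static void temp_area_pagefault(xtcb_t *sender, uint32_t fault_addr)
    {
        space_t *rspace = sender->rcv_thread->space;
        unsigned idx    = sender->rcv_buf >> PDE_SHIFT;

        if (sender->space->pdir[TEMP_AREA_IDX] != rspace->pdir[idx])
            /* invalidated by a thread switch: re-establish the mapping */
            setup_temp_mapping(sender->space, rspace, sender->rcv_buf);
        else
            /* genuine fault on the receive buffer: notify its pager */
            tunnel_pagefault(sender->rcv_thread, fault_addr);
    }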
On multi-processor machines, each CPU should be able to perform a string IPC in parallel with all other CPUs. A single temp. mapping area in the kernel, however, would prohibit such parallelism and instead require locking and synchronization on the shared resource "mapping area".
As a workaround, each CPU can be given a private temp. mapping area in a distinct region within the kernel memory space. In effect, this approach allocates not one but N (the number of CPUs in the system) mapping areas in the kernel, each one being used exclusively by one CPU; no additional locking to access the mapping area is required. However, this approach requires limiting the maximum number of CPUs at compile time so that the kernel memory regions can be reserved. If the kernel is compiled for up to 32 CPUs but only 2 are present, 30 x 8 MB of virtual kernel memory are wasted. What's more, this approach does not scale well to larger, many-core systems with 100s of CPUs: one cannot dedicate 100 x 8 MB of the 1 GB of kernel memory to mapping areas!
As an alternative, L4 uses CPU-local memory: Each CPU has its own copy of the top-level pagetable and can thus establish its own temporary mapping area without disturbing the other CPUs' mapping areas. This approach not only scales better (you just need to duplicate the top-level pagetable (4 kB per CPU) instead of the mapping area (8 MB per CPU)), but it also places the mapping area at the same virtual addresses on all CPUs, which makes programming easier: instead of using map_area[get_current_cpu()], you can simply access map_area from all CPUs and still have it map to different receive buffers on each CPU.
[5] When copying the string backwards out of the kernel buffer, the later part (copied first) is still in the cache, reducing the number of cache misses; this is neglected here assuming that the string is always copied in order of ascending addresses.
[6] The mappings are temporary as they vanish at each thread switch.
[7] The sender thread is running, but the pagefault occurs in the receiver thread's address space (although none of the threads there might be running). Activating the pager of a thread that is not even running is called page fault tunneling.
L4 offers threads as one of its (few) abstractions. As a consequence, a mechanism to dispatch (i.e., switch to and execute) a thread is required as well.
The dispatching mechanism described here is only used if the next thread to run is not indicated by the user and thus a thread has to be selected according to some scheduling policy. The IPC paths, e.g., often imply a thread switch to the receiver; this path bypasses the dispatcher. Similarly, a ThreadSwitch system call that names an explicit destination thread switches directly to that thread, likewise bypassing the dispatcher.
Whenever L4 must choose the next thread to run, it consults its built-in scheduler. The scheduler uses hard priorities (0 to 255, with 255 being the highest): a runnable thread with a higher priority always takes precedence over threads with lower priorities.
Threads of the same priority are scheduled in round-robin fashion for a given timeslice; at the end of the timeslice, the thread is preempted and another thread is dispatched. For this purpose, the L4 scheduler maintains an array of ready lists, one per priority level.
In principle, the scheduler starts at the array entry for the highest priority and checks if there is a thread ready to run (the array entry is not nil). In this case, the thread is selected for dispatch; otherwise, the array is searched in order of decreasing priorities. If no thread is ready to run, a special idle thread is dispatched. When the current thread blocks, yields, or exhausts its timeslice, the next thread in that priority class is dispatched and given a new timeslice. Additionally, the pointer to the currently running thread in the array is updated to point to the selected thread.
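In (simplified, assumed) C, the basic selection loop could look like this:

    #define MAX_PRIO 255

    typedef struct tcb tcb_t;

    /* ready[p] points at the "current" thread of priority p's circular
     * ready list, or is NULL if no thread of that priority is runnable. */
    static tcb_t *ready[MAX_PRIO + 1];
    static tcb_t *idle_thread;

    static tcb_t *select_next_thread(void)
    {
        for (int prio = MAX_PRIO; prio >= 0; prio--)
            if (ready[prio] != NULL)
                return ready[prio];
        return idle_thread;          /* nothing ready: dispatch idle */
    }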
Timeslice accounting is slightly inaccurate, as it happens in the handler of the periodic timer interrupt by subtracting the length of the timer period from the remaining timeslice length. If the remaining timeslice becomes zero or negative, it is said to be exhausted, causing immediate rescheduling. If the thread blocks in between two timer ticks, the next thread will start with only half a timer period till the next timer tick. Whether it is billed a whole timer period or whether this case is correctly accounted for is beyond the author's knowledge.
Usually, threads have priorities around 100 on L4. To avoid checking the priority levels 255..101 for runnable threads, L4 remembers the highest priority with a runnable thread and only checks at and below that level during scheduling.
This so-called "highest active priority" must be updated whenever a thread with a higher priority becomes ready to run, and should be updated whenever the currently highest active priority level becomes empty (because the last thread on that level blocks). Both events can easily be registered in the kernel.
On x86, further optimizations are possible for sparsely populated priority arrays: x86 offers a bsr (bit scan reverse) instruction, which can be used to locate the index of the most significant set (1) bit in a word. If we maintain a bitfield of 256 bits (8 32-bit words), i.e., one "active" bit per priority level, we can use this instruction to locate the highest active priority without searching the array at all.
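A sketch of this idea (using GCC-style inline assembly; the bookkeeping that sets and clears the bits is omitted):

    #include <stdint.h>

    /* One bit per priority level; bit p is set iff a thread of priority p
     * is ready to run (256 levels in 8 32-bit words). */
    static uint32_t prio_bits[8];

    static int highest_active_prio(void)
    {
        for (int word = 7; word >= 0; word--)
            if (prio_bits[word] != 0) {
                uint32_t bit;
                /* bsr: index of the most significant set bit */
                __asm__ ("bsr %1, %0" : "=r" (bit) : "rm" (prio_bits[word]));
                return word * 32 + (int)bit;
            }
        return -1;    /* no thread ready at all: run the idle thread */
    }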
The current implementation always folds 16 priority levels into one bit, reducing the length of the bitfield to search to 16 bits. This approach requires scanning up to 15 (empty) array entries in order of decreasing priorities to identify the real highest active priority, and it complicates the conditions under which a bit can be cleared. The reasons for not using 32 bits are unclear; there is not even an apparent benefit of using only one word instead of 8: Both cache utilisation and TLB coverage would favor the 256-bit bitfield, and no difference with respect to the branch prediction logic is obvious.
Finding the next thread to run can be quite time consuming (especially if no bsr instruction can be used) and should thus be preemptible itself, in order to achieve a low interrupt latency. If the scheduler were to execute in the context of the current thread (the one that is to be replaced), this goal would be hard to achieve: On interrupts, the current state of the interrupted thread would be saved, and the (in-kernel) interrupt handler would still execute on the stack of the current thread and possibly invoke the scheduler again. When the current thread is scheduled again, it would resume its preempted scheduling activity, although the thread itself should continue! Additionally, the interrupt handler and the nested scheduler consume space on the thread's kernel stack, which, being included in the TCB, is quite limited.
Solving all these problems, L4 uses a dedicated thread for the scheduling decisions. This thread is never resumed after preemption but rather started anew, discarding its previous state (the scheduling-relevant data structures have possibly changed since the scheduler has been preempted, so a fresh start is recommended). This property also removes the requirement for an exceptionally large kernel stack; except for a single interrupt handler, nothing will be nested on top of the scheduler.
The consequences of this model are that dispatching via the scheduler requires two thread switches: first from the current to the dispatcher thread, and then from dispatcher to the next thread. However, as the dispatcher never executes in user mode, it does not have an associated address space, so that switching there is cheap (no address space switch, no TLB flush). As it is never referenced from user mode, it can even do without a thread ID, which further allows to freely place (and size) its stack; it need not be included in any TCB.
L4 uses circular ready lists per priority level.
If these lists were maintained eagerly, a thread would have to be removed whenever it blocks (e.g., during a call IPC, waiting for the reply). Removing a thread from the ready list requires touching its preceding list element (to update its next pointer) as well as its successor list element (to update its prev pointer), in addition to the list element itself.
With the list being embedded in the TCBs, this translates to having to touch three TCBs (predecessor, successor, and current), each one probably covered by a different TLB entry. We cannot avoid touching the current TCB, and since we will probably switch to the next thread anyway, touching the successor list element does not "waste" a TLB entry for the single purpose of updating the ready list. Touching the predecessor TCB, however, solely serves the purpose of keeping the ready list up-to-date.
The possible TLB miss on touching the predecessor TCB can be avoided by maintaining the ready lists lazily as follows: Instead of removing a blocking thread's TCB from the ready list immediately, we just leave it in there and simply set its state to blocked. If a thread that is selected for dispatching is marked as blocked, we remove it from the list and dispatch the next thread. This lazy handling of ready lists is called lazy dispatching.
... and what does this scheme buy us?!?

1. Blocking a thread no longer requires any list manipulation at all; in particular, the predecessor TCB is not touched.
2. When the dispatcher encounters a thread marked as blocked, we have already touched its predecessor TCB (in most cases, that is the currently executing/preempted thread)! Since we will also have to touch the successor TCB (since that will be the thread to dispatch), performing the list operation now does not introduce additional TLB misses. (Accessing the blocked TCB is as costly and as useless as touching the predecessor TCB in the eager case, but benefit 1 remains.)
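A sketch of lazy dispatching on a (simplified, singly-linked) circular ready list; the real lists are doubly linked and embedded in the TCBs:

    typedef enum { TS_READY, TS_BLOCKED } tstate_t;

    typedef struct ltcb {
        struct ltcb *next;     /* circular ready list, embedded in the TCB */
        tstate_t     state;
    } ltcb_t;

    /* Blocking only flips the state; the list stays untouched (benefit 1). */
    static void block(ltcb_t *t) { t->state = TS_BLOCKED; }

    /* Stale entries are unlinked when the dispatcher stumbles over them;
     * at that point the predecessor's TCB is hot anyway (benefit 2).
     * Assumes at least one ready thread (e.g., idle) is on the list. */
    static ltcb_t *next_ready(ltcb_t *cur)
    {
        ltcb_t *t = cur->next;
        while (t->state == TS_BLOCKED) {
            cur->next = t->next;    /* unlink the blocked entry */
            t = cur->next;
        }
        return t;
    }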
Threads that are blocked during IPC can be unblocked either by completing their IPC (the partner arrives and accepts/delivers a message) or by aborting it due to a timeout. With many blocked threads in a system, it would be infeasible to check all blocked threads' timeouts on each timer interrupt (aka tick). Instead, L4 maintains a (more or less ordered, see below) list of timeouts, with the one closest to its expiry at the head of the list. This only requires checking a single timeout at each tick (and would even allow tickless implementations, setting the timer to min(timeslice length, earliest timeout) on each kernel exit), but requires a clever data structure to keep track of the timeouts.
Since timeouts could be created with every IPC, inserting them into the "list" must be fast. Under the assumption that most timeouts are never actually raised (they are usually used to recover from error conditions/deadlocks/false assumptions; under normal conditions they are cancelled once the IPC completes), canceling a timeout must also be cheap. With the same argument (raised timeouts indicate some error in the application), handling an expired timeout may be more expensive.
Both unsorted and sorted true lists do not meet our requirements: Unsorted lists take O(n) time to find the nearest timeout, sorted lists take O(n) time to insert a new element. Using (balanced) trees to maintain the timeouts would probably allow cheap lookup/insert operations (O(1), O(log n)), but trees are deemed to be too complicated [8] and expensive (probably in terms of time (e.g., reorganizing an AVL tree) and memory (>= 3 pointers per tree node instead of 1 or 2 in lists)).
Anyway, an epoch-based multi-list scheme has been proposed for L4. [9] The idea is to define two absolute times on system startup, esoon and elate, with esoon being several micro- or milliseconds from now and elate being approx. an order of magnitude later. We can now use an ordered list (aka soon list) for the nearly expired timeouts (those that fire before esoon), an unsorted list (aka late late list) for timeouts in the far future (firing after elate), and another unsorted list (aka late list) for timeouts in between the other two (at or after esoon but before elate).
Insertion into the sorted soon list is cheap because (hopefully) there is only a small number of timeouts stored in that list (esoon should be chosen to maintain this property). Insertion into the unsorted late and late late lists is cheap because we can simply append (or prepend) new timeouts (O(1)). To facilitate quick removal of cancelled timeouts, we can store a pointer to the timeout list entry in the associated TCB (O(1)). As the current design precludes any thread from having more than one active timeout, we can even go further and include a wakeup list node in the TCBs (as is done with the ready lists as well), removing the necessity to store a pointer to the list node.
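The insertion rule could be sketched as follows (list helpers and epoch variable names are assumed):

    #include <stdint.h>

    typedef struct timeout {
        struct timeout *next;
        uint64_t        expiry;     /* absolute expiry time */
    } timeout_t;

    static uint64_t   e_soon, e_late;              /* epoch boundaries    */
    static timeout_t *soon_list, *late_list, *late_late_list;

    void sorted_insert(timeout_t **list, timeout_t *t); /* assumed helper */
    void prepend(timeout_t **list, timeout_t *t);       /* assumed helper */

    static void insert_timeout(timeout_t *t)
    {
        if (t->expiry < e_soon)
            sorted_insert(&soon_list, t);   /* small list, kept sorted    */
        else if (t->expiry < e_late)
            prepend(&late_list, t);         /* unsorted, O(1)             */
        else
            prepend(&late_late_list, t);    /* unsorted, O(1)             */
    }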
Once esoon is reached, a correction phase is required (note that the soon list is empty at this time!): First, we define a new esoon in the near future, then we iterate over the late list, moving all timeouts that fire before the new esoon into the (sorted) soon list, and then continue normal operation.
Whenever we reach the updated esoon, we process the late list as just described. Only once we reach elate do we need to inspect the timeouts in the late late list, which we process similarly to the late correction above: First update elate, then iterate over all entries of the late late list, moving them into the (sorted) soon list (if they expire before esoon) or the late list (if they expire before elate).
The epochs serve to reduce the sorting efforts (only nearly expired timeouts are sorted), the late list keeps the cost of the late correction phase small (if elate is not too far in the future), and the probably costly late late correction phase handles the remaining (long-term) timeouts.
Note that the correction phases are only required once the current time exceeds esoon or elate: The insertion strategy guarantees that no timeout in the late or late late lists can be missed (all expiry times are later than esoon or elate, respectively).
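The soon correction phase, again as an assumed sketch building on the names above:

    /* Run when the clock reaches e_soon (the soon list is empty then):
     * define a new e_soon, then move all late-list timeouts that fire
     * before it into the sorted soon list. */
    static void soon_correction(uint64_t now, uint64_t soon_span)
    {
        e_soon = now + soon_span;          /* soon_span: assumed policy */

        timeout_t **pp = &late_list;
        while (*pp != NULL) {
            timeout_t *t = *pp;
            if (t->expiry < e_soon) {
                *pp = t->next;             /* unlink from the late list */
                sorted_insert(&soon_list, t);
            } else {
                pp = &t->next;
            }
        }
    }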
As an additional optimization, we can reuse the lazy dispatching idea: Instead of removing the timeout list element when a timeout is cancelled, we can leave it in the list and just set its expiry time to never. Such items can be removed from the lists (at a low cost) during correction phases, or they can be reused for the next timeout iff it falls into the same late/late late list (the soon list must always be sorted).
[8] Trees being too complicated is a kind of strange argument; once you get them right, their complexity should be no problem, and data structures such as the mapping database are both tree-like and even more complex...
[9] The scheme presented has never been implemented. As non-zero and non-infinite timeouts have been found to be rarely used, L4 uses a single (ordered) wakeup list for all non-trivial timeouts.