% talk a little about initial page table conditions:
% paging not on, but virtual mostly mapped direct to physical,
% which is what things look like when we turn paging on as well
% since paging is turned on after we create first process.
% mention why still have SEG_UCODE/SEG_UDATA?
% do we ever really say what the low two bits of %cs do?
% in particular their interaction with PTE_U
% sidebar about why it is extern char[]

\chapter{Page tables}
\label{CH:MEM}

Page tables are the most popular mechanism through which the operating system provides each process with its own private address space and memory. Page tables determine what memory addresses mean, and what parts of physical memory can be accessed. They allow xv6 to isolate different processes' address spaces and to multiplex them onto a single physical memory. Page tables are a popular design because they provide a level of indirection that allows operating systems to perform many tricks. Xv6 performs a few tricks: mapping the same memory (a trampoline page) in several address spaces, and guarding kernel and user stacks with an unmapped page. The rest of this chapter explains the page tables that the RISC-V hardware provides and how xv6 uses them.

%%
\section{Paging hardware}
%%

As a reminder, RISC-V instructions (both user and kernel) manipulate virtual addresses. The machine's RAM, or physical memory, is indexed with physical addresses. The RISC-V page table hardware connects these two kinds of addresses, by mapping each virtual address to a physical address.

\begin{figure}[t]
\center
\includegraphics[scale=0.5]{fig/riscv_address.pdf}
\caption{RISC-V virtual and physical addresses, with a simplified logical page table.}
\label{fig:riscv_address}
\end{figure}

Xv6 runs on Sv39 RISC-V, which means that only the bottom 39 bits of a 64-bit virtual address are used; the top 25 bits are not used. In this Sv39 configuration, a RISC-V page table is logically an array of $2^{27}$ (134,217,728) \indextext{page table entries (PTEs)}. Each PTE contains a 44-bit physical page number (PPN) and some flags. The paging hardware translates a virtual address by using the top 27 bits of the 39 bits to index into the page table to find a PTE, and making a 56-bit physical address whose top 44 bits come from the PPN in the PTE and whose bottom 12 bits are copied from the original virtual address. Figure~\ref{fig:riscv_address} shows this process with a logical view of the page table as a simple array of PTEs (see Figure~\ref{fig:riscv_pagetable} for a fuller story). A page table gives the operating system control over virtual-to-physical address translations at the granularity of aligned chunks of 4096 ($2^{12}$) bytes. Such a chunk is called a \indextext{page}.

\begin{figure}[t]
\center
\includegraphics[scale=0.5]{fig/riscv_pagetable.pdf}
\caption{RISC-V address translation details.}
\label{fig:riscv_pagetable}
\end{figure}

In Sv39 RISC-V, the top 25 bits of a virtual address are not used for translation. The physical address also has room for growth: there is room in the PTE format for the physical page number to grow by another 10 bits. The designers of RISC-V chose these numbers based on technology predictions. $2^{39}$ bytes is 512 GB, which should be enough address space for applications running on RISC-V computers. $2^{56}$ is enough physical memory space for the near future to fit many I/O devices and DRAM chips. If more is needed, the RISC-V designers have defined Sv48 with 48-bit virtual addresses~\cite{riscv:priv}.
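In code, this bit-level layout boils down to a handful of constants and helper macros. The following is a sketch in the style of \fileref{kernel/riscv.h}; that file has the authoritative definitions.

\begin{verbatim}
#define PGSIZE 4096 // bytes per page
#define PGSHIFT 12  // bits of offset within a page

// shift a physical address into the PPN field of a PTE, and back.
#define PA2PTE(pa) ((((uint64)pa) >> 12) << 10)
#define PTE2PA(pte) (((pte) >> 10) << 12)

// extract one of the three 9-bit page-table indices from a virtual address.
#define PXMASK          0x1FF // 9 bits
#define PXSHIFT(level)  (PGSHIFT+(9*(level)))
#define PX(level, va)   ((((uint64) (va)) >> PXSHIFT(level)) & PXMASK)

// one beyond the highest virtual address xv6 uses; one bit less than the
// Sv39 maximum, to avoid having to sign-extend addresses with bit 38 set.
#define MAXVA (1L << (9 + 9 + 9 + 12 - 1))
\end{verbatim}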
As Figure~\ref{fig:riscv_pagetable} shows, a RISC-V CPU translates a virtual address into a physical address in three steps. A page table is stored in physical memory as a three-level tree. The root of the tree is a 4096-byte page-table page that contains 512 PTEs, which contain the physical addresses for page-table pages in the next level of the tree. Each of those pages contains 512 PTEs for the final level in the tree. The paging hardware uses the top 9 bits of the 27 bits to select a PTE in the root page-table page, the middle 9 bits to select a PTE in a page-table page in the next level of the tree, and the bottom 9 bits to select the final PTE. (In Sv48 RISC-V a page table has four levels, and bits 39 through 47 of a virtual address index into the top-level.) If any of the three PTEs required to translate an address is not present, the paging hardware raises a \indextext{page-fault exception}, leaving it up to the kernel to handle the exception (see Chapter~\ref{CH:TRAP}).

The three-level structure of Figure~\ref{fig:riscv_pagetable} allows a memory-efficient way of recording PTEs, compared to the single-level design of Figure~\ref{fig:riscv_address}. In the common case in which large ranges of virtual addresses have no mappings, the three-level structure can omit entire page directories. For example, if an application uses only a few pages starting at address zero, then the entries 1 through 511 of the top-level page directory are invalid, and the kernel doesn't have to allocate pages for those 511 intermediate page directories. Furthermore, the kernel also doesn't have to allocate pages for the bottom-level page directories for those 511 intermediate page directories. So, in this example, the three-level design saves 511 pages for intermediate page directories and $511\times512$ pages for bottom-level page directories.

Although a CPU walks the three-level structure in hardware as part of executing a load or store instruction, a potential downside of three levels is that the CPU must load three PTEs from memory to perform the translation of the virtual address in the load/store instruction to a physical address. To avoid the cost of loading PTEs from physical memory, a RISC-V CPU caches page table entries in a \indextext{Translation Look-aside Buffer (TLB)}.

Each PTE contains flag bits that tell the paging hardware how the associated virtual address is allowed to be used. \indexcode{PTE_V} indicates whether the PTE is present: if it is not set, a reference to the page causes an exception (i.e., is not allowed). \indexcode{PTE_R} controls whether instructions are allowed to read the page. \indexcode{PTE_W} controls whether instructions are allowed to write to the page. \indexcode{PTE_X} controls whether the CPU may interpret the content of the page as instructions and execute them. \indexcode{PTE_U} controls whether instructions in user mode are allowed to access the page; if \indexcode{PTE_U} is not set, the PTE can be used only in supervisor mode. Figure~\ref{fig:riscv_pagetable} shows how it all works. The flags and all other paging-hardware-related structures are defined in \fileref{kernel/riscv.h}.

To tell a CPU to use a page table, the kernel must write the physical address of the root page-table page into the \texttt{satp}\index{satp@\lstinline{satp}} register. A CPU will translate all addresses generated by subsequent instructions using the page table pointed to by its own \texttt{satp}.
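In code, the flag bits and the \texttt{satp} plumbing look roughly as follows; this is a sketch based on \fileref{kernel/riscv.h}, whose definitions are authoritative.

\begin{verbatim}
#define PTE_V (1L << 0) // valid
#define PTE_R (1L << 1) // readable
#define PTE_W (1L << 2) // writable
#define PTE_X (1L << 3) // executable
#define PTE_U (1L << 4) // accessible in user mode

// use riscv's Sv39 page table scheme.
#define SATP_SV39 (8L << 60)
// combine the mode bits with the root page-table page's physical page number.
#define MAKE_SATP(pagetable) (SATP_SV39 | (((uint64)pagetable) >> 12))

// write the satp register, switching this CPU to the given page table.
static inline void
w_satp(uint64 x)
{
  asm volatile("csrw satp, %0" : : "r" (x));
}

// flush this CPU's TLB.
static inline void
sfence_vma()
{
  asm volatile("sfence.vma zero, zero");
}
\end{verbatim}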
Each CPU has its own \texttt{satp} so that different CPUs can run different processes, each with a private address space described by its own page table.

Typically a kernel maps all of physical memory into its page table so that it can read and write any location in physical memory using load/store instructions. Since the page directories are in physical memory, the kernel can program the content of a PTE in a page directory by writing to the virtual address of the PTE using a standard store instruction.

A few notes about terms. Physical memory refers to storage cells in DRAM. A byte of physical memory has an address, called a physical address. Instructions use only virtual addresses, which the paging hardware translates to physical addresses, and then sends to the DRAM hardware to read or write storage. Unlike physical memory and virtual addresses, virtual memory isn't a physical object, but refers to the collection of abstractions and mechanisms the kernel provides to manage physical memory and virtual addresses.

\begin{figure}[h]
\centering
\includegraphics[scale=0.65]{fig/xv6_layout.pdf}
\caption{On the left, xv6's kernel address space. {\sf \small{RWX}} refer to PTE read, write, and execute permissions. On the right, the RISC-V physical address space that xv6 expects to see.}
\label{fig:xv6_layout}
\end{figure}

%%
\section{Kernel address space}
%%

Xv6 maintains one page table per process, describing each process's user address space, plus a single page table that describes the kernel's address space. The kernel configures the layout of its address space to give itself access to physical memory and various hardware resources at predictable virtual addresses. Figure~\ref{fig:xv6_layout} shows how this layout maps kernel virtual addresses to physical addresses. The file \fileref{kernel/memlayout.h} declares the constants for xv6's kernel memory layout.

QEMU simulates a computer that includes RAM (physical memory) starting at physical address \texttt{0x80000000} and continuing through at least \texttt{0x88000000}, which xv6 calls \texttt{PHYSTOP}. The QEMU simulation also includes I/O devices such as a disk interface. QEMU exposes the device interfaces to software as \indextext{memory-mapped} control registers that sit below \texttt{0x80000000} in the physical address space. The kernel can interact with the devices by reading/writing these special physical addresses; such reads and writes communicate with the device hardware rather than with RAM. Chapter~\ref{CH:TRAP} explains how xv6 interacts with devices.

The kernel gets at RAM and memory-mapped device registers using ``direct mapping;'' that is, mapping the resources at virtual addresses that are equal to the physical address. For example, the kernel itself is located at \lstinline{KERNBASE=0x80000000} in both the virtual address space and in physical memory. Direct mapping simplifies kernel code that reads or writes physical memory. For example, when \lstinline{fork} allocates user memory for the child process, the allocator returns the physical address of that memory; \lstinline{fork} uses that address directly as a virtual address when it is copying the parent's user memory to the child.

There are a couple of kernel virtual addresses that aren't direct-mapped:
\begin{itemize}

\item The trampoline page. It is mapped at the top of the virtual address space; user page tables have this same mapping.
Chapter~\ref{CH:TRAP} discusses the role of the trampoline page, but we see here an interesting use case of page tables; a physical page (holding the trampoline code) is mapped twice in the virtual address space of the kernel: once at the top of the virtual address space and once with a direct mapping.

\item The kernel stack pages. Each process has its own kernel stack, which is mapped high so that below it xv6 can leave an unmapped \indextext{guard page}. The guard page's PTE is invalid (i.e., \lstinline{PTE_V} is not set), so that if the kernel overflows a kernel stack, it will likely cause an exception and the kernel will panic. Without a guard page an overflowing stack would overwrite other kernel memory, resulting in incorrect operation. A panic crash is preferable.

\end{itemize}

While the kernel uses its stacks via the high-memory mappings, they are also accessible to the kernel through a direct-mapped address. An alternate design might have just the direct mapping, and use the stacks at the direct-mapped address. In that arrangement, however, providing guard pages would involve unmapping virtual addresses that would otherwise refer to physical memory, which would then be hard to use.

The kernel maps the pages for the trampoline and the kernel text with the permissions \lstinline{PTE_R} and \lstinline{PTE_X}. The kernel reads and executes instructions from these pages. The kernel maps the other pages with the permissions \lstinline{PTE_R} and \lstinline{PTE_W}, so that it can read and write the memory in those pages. The mappings for the guard pages are invalid.

%%
\section{Code: creating an address space}
%%

Most of the xv6 code for manipulating address spaces and page tables resides in {\tt vm.c} \lineref{kernel/vm.c:1}. The central data structure is {\tt pagetable\_t}, which is really a pointer to a RISC-V root page-table page; a {\tt pagetable\_t} may be either the kernel page table, or one of the per-process page tables. The central functions are {\tt walk}, which finds the PTE for a virtual address, and {\tt mappages}, which installs PTEs for new mappings. Functions starting with {\tt kvm} manipulate the kernel page table; functions starting with {\tt uvm} manipulate a user page table; other functions are used for both. {\tt copyout} and {\tt copyin} copy data to and from user virtual addresses provided as system call arguments; they are in {\tt vm.c} because they need to explicitly translate those addresses in order to find the corresponding physical memory.

Early in the boot sequence, \indexcode{main} calls \indexcode{kvminit} \lineref{kernel/vm.c:/^kvminit/} to create the kernel's page table using \indexcode{kvmmake} \lineref{kernel/vm.c:/^kvmmake/}. This call occurs before xv6 has enabled paging on the RISC-V, so addresses refer directly to physical memory. \lstinline{kvmmake} first allocates a page of physical memory to hold the root page-table page. Then it calls \indexcode{kvmmap} to install the translations that the kernel needs. The translations include the kernel's instructions and data, physical memory up to \indexcode{PHYSTOP}, and memory ranges which are actually devices. \indexcode{proc_mapstacks} \lineref{kernel/proc.c:/^proc_mapstacks/} allocates a kernel stack for each process. It calls \lstinline{kvmmap} to map each stack at the virtual address generated by \lstinline{KSTACK}, which leaves room for the invalid stack-guard pages.
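To make this concrete, the sequence of mappings that \lstinline{kvmmake} installs can be sketched as follows. The device addresses (\lstinline{UART0}, \lstinline{VIRTIO0}, \lstinline{PLIC}) come from \fileref{kernel/memlayout.h}, and the exact arguments in the real \fileref{kernel/vm.c} may differ slightly from this sketch.

\begin{verbatim}
// a sketch of kvmmake's mappings (see kernel/vm.c for the real code).
pagetable_t kpgtbl = (pagetable_t) kalloc();
memset(kpgtbl, 0, PGSIZE);

// device registers, mapped at their physical addresses.
kvmmap(kpgtbl, UART0, UART0, PGSIZE, PTE_R | PTE_W);
kvmmap(kpgtbl, VIRTIO0, VIRTIO0, PGSIZE, PTE_R | PTE_W);
kvmmap(kpgtbl, PLIC, PLIC, 0x400000, PTE_R | PTE_W);

// the kernel's instructions: readable and executable, but not writable.
kvmmap(kpgtbl, KERNBASE, KERNBASE, (uint64)etext-KERNBASE, PTE_R | PTE_X);

// the rest of RAM, up to PHYSTOP: readable and writable.
kvmmap(kpgtbl, (uint64)etext, (uint64)etext, PHYSTOP-(uint64)etext, PTE_R | PTE_W);

// the trampoline page, mapped a second time at the top of the address space.
kvmmap(kpgtbl, TRAMPOLINE, (uint64)trampoline, PGSIZE, PTE_R | PTE_X);

// finally, allocate and map a kernel stack (plus guard page) per process.
proc_mapstacks(kpgtbl);
\end{verbatim}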
\indexcode{kvmmap} \lineref{kernel/vm.c:/^kvmmap/} calls \indexcode{mappages} \lineref{kernel/vm.c:/^mappages/}, which installs mappings into a page table for a range of virtual addresses to a corresponding range of physical addresses. It does this separately for each virtual address in the range, at page intervals. For each virtual address to be mapped, \lstinline{mappages} calls \indexcode{walk} to find the address of the PTE for that address. It then initializes the PTE to hold the relevant physical page number, the desired permissions (\lstinline{PTE_W}, \lstinline{PTE_X}, and/or \lstinline{PTE_R}), and \lstinline{PTE_V} to mark the PTE as valid \lineref{kernel/vm.c:/perm...PTE_V/}.

\indexcode{walk} \lineref{kernel/vm.c:/^walk/} mimics the RISC-V paging hardware as it looks up the PTE for a virtual address (see Figure~\ref{fig:riscv_pagetable}). \lstinline{walk} descends the 3-level page table 9 bits at a time. It uses each level's 9 bits of virtual address to find the PTE of either the next-level page table or the final page \lineref{kernel/vm.c:/pte.=..pagetable/}. If the PTE isn't valid, then the required page hasn't yet been allocated; if the \lstinline{alloc} argument is set, \lstinline{walk} allocates a new page-table page and puts its physical address in the PTE. It returns the address of the PTE in the lowest layer in the tree \lineref{kernel/vm.c:/return..pagetable/}. (A simplified sketch of this loop appears at the end of this section.)

The above code depends on physical memory being direct-mapped into the kernel virtual address space. For example, as \lstinline{walk} descends levels of the page table, it pulls the (physical) address of the next-level-down page table from a PTE \lineref{kernel/vm.c:/pagetable.=..pa.*E2P/}, and then uses that address as a virtual address to fetch the PTE at the next level down \lineref{kernel/vm.c:/t..pte.=..paget/}.

\indexcode{main} calls \indexcode{kvminithart} \lineref{kernel/vm.c:/^kvminithart/} to install the kernel page table. It writes the physical address of the root page-table page into the register \texttt{satp}. After this the CPU will translate addresses using the kernel page table. Since the kernel uses an identity mapping, the (now virtual) address of the next instruction will map to the right physical memory address.

Each RISC-V CPU caches page table entries in a \indextext{Translation Look-aside Buffer (TLB)}, and when xv6 changes a page table, it must tell the CPU to invalidate corresponding cached TLB entries. If it didn't, then at some point later the TLB might use an old cached mapping, pointing to a physical page that in the meantime has been allocated to another process, and as a result, a process might be able to scribble on some other process's memory. The RISC-V has an instruction \indexcode{sfence.vma} that flushes the current CPU's TLB. Xv6 executes {\tt sfence.vma} in {\tt kvminithart} after reloading the \texttt{satp} register, and in the trampoline code that switches to a user page table before returning to user space \lineref{kernel/trampoline.S:/sfence.vma/}. It is also necessary to issue \texttt{sfence.vma} before changing \texttt{satp}, in order to wait for completion of all outstanding loads and stores. This wait ensures that preceding updates to the page table have completed, and ensures that preceding loads and stores use the old page table, not the new one.

To avoid flushing the complete TLB, RISC-V CPUs may support address space identifiers (ASIDs)~\cite{riscv:priv}. The kernel can then flush just the TLB entries for a particular address space. Xv6 does not use this feature.
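Here is the simplified sketch of \lstinline{walk}'s central loop mentioned above. It relies on the \lstinline{PX}, \lstinline{PTE2PA}, and \lstinline{PA2PTE} macros, and omits the check (present in the real \fileref{kernel/vm.c}) that \lstinline{va} is below \lstinline{MAXVA}.

\begin{verbatim}
// Return the address of the PTE in the lowest level of page table
// pagetable that corresponds to virtual address va. If alloc != 0,
// create any needed page-table pages. (Sketch of kernel/vm.c's walk.)
pte_t *
walk(pagetable_t pagetable, uint64 va, int alloc)
{
  for(int level = 2; level > 0; level--){
    pte_t *pte = &pagetable[PX(level, va)];
    if(*pte & PTE_V){
      // valid PTE: follow it to the next-level page-table page.
      pagetable = (pagetable_t)PTE2PA(*pte);
    } else {
      // no next-level page yet; allocate one if the caller asked us to.
      if(!alloc || (pagetable = (pagetable_t)kalloc()) == 0)
        return 0;
      memset(pagetable, 0, PGSIZE);
      *pte = PA2PTE(pagetable) | PTE_V;
    }
  }
  return &pagetable[PX(0, va)];
}
\end{verbatim}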
\section{Physical memory allocation}

The kernel must allocate and free physical memory at run-time for page tables, user memory, kernel stacks, and pipe buffers.

Xv6 uses the physical memory between the end of the kernel and \indexcode{PHYSTOP} for run-time allocation. It allocates and frees whole 4096-byte pages at a time. It keeps track of which pages are free by threading a linked list through the pages themselves. Allocation consists of removing a page from the linked list; freeing consists of adding the freed page to the list.

%%
\section{Code: Physical memory allocator}
%%

The allocator resides in {\tt kalloc.c} \lineref{kernel/kalloc.c:1}. The allocator's data structure is a \textit{free list} of physical memory pages that are available for allocation. Each free page's list element is a \indexcode{struct run} \lineref{kernel/kalloc.c:/^struct.run/}. Where does the allocator get the memory to hold that data structure? It stores each free page's \lstinline{run} structure in the free page itself, since there's nothing else stored there. The free list is protected by a spin lock \linerefs{kernel/kalloc.c:/^struct.{/,/}/}. The list and the lock are wrapped in a struct to make clear that the lock protects the fields in the struct. For now, ignore the lock and the calls to \lstinline{acquire} and \lstinline{release}; Chapter~\ref{CH:LOCK} will examine locking in detail.

The function \indexcode{main} calls \indexcode{kinit} to initialize the allocator \lineref{kernel/kalloc.c:/^kinit/}. \lstinline{kinit} initializes the free list to hold every page between the end of the kernel and {\tt PHYSTOP}. Xv6 ought to determine how much physical memory is available by parsing configuration information provided by the hardware. Instead xv6 assumes that the machine has 128 megabytes of RAM. \lstinline{kinit} calls \indexcode{freerange} to add memory to the free list via per-page calls to \indexcode{kfree}. A PTE can only refer to a physical address that is aligned on a 4096-byte boundary (is a multiple of 4096), so \lstinline{freerange} uses \indexcode{PGROUNDUP} to ensure that it frees only aligned physical addresses. The allocator starts with no memory; these calls to \lstinline{kfree} give it some to manage.

The allocator sometimes treats addresses as integers in order to perform arithmetic on them (e.g., traversing all pages in \lstinline{freerange}), and sometimes uses addresses as pointers to read and write memory (e.g., manipulating the \lstinline{run} structure stored in each page); this dual use of addresses is the main reason that the allocator code is full of C type casts. \index{type cast} The other reason is that freeing and allocation inherently change the type of the memory.

The function \lstinline{kfree} \lineref{kernel/kalloc.c:/^kfree/} begins by setting every byte in the memory being freed to the value 1. This will cause code that uses memory after freeing it (uses ``dangling references'') to read garbage instead of the old valid contents; hopefully that will cause such code to break faster. Then \lstinline{kfree} prepends the page to the free list: it casts \lstinline{pa} to a pointer to \lstinline{struct} \lstinline{run}, records the old start of the free list in \lstinline{r->next}, and sets the free list equal to \lstinline{r}. \indexcode{kalloc} removes and returns the first element in the free list.

\section{Process address space}

Each process has a separate page table, and when xv6 switches between processes, it also changes page tables.
Figure~\ref{fig:processlayout} shows a process's address space in more detail than Figure~\ref{fig:as}. A process's user memory starts at virtual address zero and can grow up to \texttt{MAXVA} \lineref{kernel/riscv.h:/MAXVA/}, allowing a process to address in principle 256 Gigabytes of memory.

A process's address space consists of pages that contain the text of the program (which xv6 maps with the permissions \lstinline{PTE_R}, \lstinline{PTE_X}, and \lstinline{PTE_U}), pages that contain the pre-initialized data of the program, a page for the stack, and pages for the heap. Xv6 maps the data, stack, and heap with the permissions \lstinline{PTE_R}, \lstinline{PTE_W}, and \lstinline{PTE_U}.

Using permissions within a user address space is a common technique to harden a user process. If the text were mapped with \lstinline{PTE_W}, then a process could accidentally modify its own program; for example, a programming error may cause the program to write to a null pointer, modifying instructions at address 0, and then continue running, perhaps creating more havoc. To detect such errors immediately, xv6 maps the text without \lstinline{PTE_W}; if a program accidentally attempts to store to address 0, the hardware will refuse to execute the store and will raise a page fault (see Section~\ref{sec:pagefaults}). The kernel then kills the process and prints out an informative message so that the developer can track down the problem.

Similarly, by mapping data without \lstinline{PTE_X}, a user program cannot accidentally jump to an address in the program's data and start executing at that address.

In the real world, hardening a process by setting permissions carefully also aids in defending against security attacks. An adversary may feed carefully-constructed input to a program (e.g., a Web server) that triggers a bug in the program in the hope of turning that bug into an exploit~\cite{aleph:smashing}. Setting permissions carefully and other techniques, such as randomizing the layout of the user address space, make such attacks harder.

The stack is a single page, and is shown with the initial contents as created by exec. Strings containing the command-line arguments, as well as an array of pointers to them, are at the very top of the stack. Just under that are values that allow a program to start at \lstinline{main} as if the function \lstinline{main(argc}, \lstinline{argv)} had just been called.

To detect a user stack overflowing the allocated stack memory, xv6 places an inaccessible guard page right below the stack by clearing the \lstinline{PTE_U} flag. If the user stack overflows and the process tries to use an address below the stack, the hardware will generate a page-fault exception because the guard page is inaccessible to a program running in user mode. A real-world operating system might instead automatically allocate more memory for the user stack when it overflows.

When a process asks xv6 for more user memory, xv6 grows the process's heap. Xv6 first uses {\tt kalloc} to allocate physical pages. It then adds PTEs to the process's page table that point to the new physical pages. Xv6 sets the \lstinline{PTE_W}, \lstinline{PTE_R}, \lstinline{PTE_U}, and \lstinline{PTE_V} flags in these PTEs. Most processes do not use the entire user address space; xv6 leaves \lstinline{PTE_V} clear in unused PTEs.

We see here a few nice examples of use of page tables.
First, different processes' page tables translate user addresses to different pages of physical memory, so that each process has private user memory. Second, each process sees its memory as having contiguous virtual addresses starting at zero, while the process's physical memory can be non-contiguous. Third, the kernel maps a page with trampoline code at the top of the user address space (without \lstinline{PTE_U}), thus a single page of physical memory shows up in all address spaces, but can be used only by the kernel.

\begin{figure}[t]
\center
\includegraphics[scale=0.5]{fig/processlayout.pdf}
\caption{A process's user address space, with its initial stack.}
\label{fig:processlayout}
\end{figure}

\section{Code: sbrk}

\lstinline{sbrk} is the system call for a process to shrink or grow its memory. The system call is implemented by the function \lstinline{growproc} \lineref{kernel/proc.c:/^growproc/}. \lstinline{growproc} calls \lstinline{uvmalloc} or \lstinline{uvmdealloc}, depending on whether \lstinline{n} is positive or negative. \lstinline{uvmalloc} \lineref{kernel/vm.c:/^uvmalloc/} allocates physical memory with {\tt kalloc}, and adds PTEs to the user page table with {\tt mappages}. \lstinline{uvmdealloc} calls {\tt uvmunmap} \lineref{kernel/vm.c:/^uvmunmap/}, which uses {\tt walk} to find PTEs and {\tt kfree} to free the physical memory they refer to.

Xv6 uses a process's page table not just to tell the hardware how to map user virtual addresses, but also as the only record of which physical memory pages are allocated to that process. That is the reason why freeing user memory (in {\tt uvmunmap}) requires examination of the user page table.

%%
\section{Code: exec}
%%

\lstinline{exec} is a system call that replaces a process's user address space with data read from a file, called a binary or executable file. A binary is typically the output of the compiler and linker, and holds machine instructions and program data. \lstinline{exec} \lineref{kernel/exec.c:/^exec/} opens the named binary \lstinline{path} using \indexcode{namei} \lineref{kernel/exec.c:/namei/}, which is explained in Chapter~\ref{CH:FS}. Then, it reads the ELF header. Xv6 binaries are formatted in the widely-used \indextext{ELF format}, defined in \fileref{kernel/elf.h}. An ELF binary consists of an ELF header, \indexcode{struct elfhdr} \lineref{kernel/elf.h:/^struct.elfhdr/}, followed by a sequence of program section headers, \lstinline{struct proghdr} \lineref{kernel/elf.h:/^struct.proghdr/}. Each \lstinline{proghdr} describes a section of the application that must be loaded into memory; xv6 programs have two program section headers: one for instructions and one for data.

The first step is a quick check that the file probably contains an ELF binary. An ELF binary starts with the four-byte ``magic number'' \lstinline{0x7F}, \lstinline{`E'}, \lstinline{`L'}, \lstinline{`F'}, or \indexcode{ELF_MAGIC} \lineref{kernel/elf.h:/ELF_MAGIC/}. If the ELF header has the right magic number, \lstinline{exec} assumes that the binary is well-formed.

\lstinline{exec} allocates a new page table with no user mappings with \indexcode{proc_pagetable} \lineref{kernel/exec.c:/proc_pagetable/}, allocates memory for each ELF segment with \indexcode{uvmalloc} \lineref{kernel/exec.c:/uvmalloc/}, and loads each segment into memory with \indexcode{loadseg} \lineref{kernel/exec.c:/loadseg/}.
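For reference, the program section header that \lstinline{exec} reads has roughly the following fields; this is a sketch of \lstinline{struct proghdr}, and \fileref{kernel/elf.h} has the authoritative definition.

\begin{verbatim}
// ELF program section header (sketch of kernel/elf.h's struct proghdr).
struct proghdr {
  uint32 type;    // kind of segment; exec loads only loadable segments
  uint32 flags;   // read/write/execute permission bits
  uint64 off;     // offset of the segment's content within the file
  uint64 vaddr;   // virtual address at which to load the segment
  uint64 paddr;   // physical address (unused by xv6)
  uint64 filesz;  // number of bytes to read from the file
  uint64 memsz;   // bytes of memory; any excess over filesz is zero-filled
  uint64 align;
};
\end{verbatim}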
\lstinline{loadseg} uses \indexcode{walkaddr} to find the physical address of the allocated memory at which to write each page of the ELF segment, and \indexcode{readi} to read from the file.

The program section header for \indexcode{/init}, the first user program created with \lstinline{exec}, looks like this:

\begin{footnotesize}
\begin{verbatim}
# objdump -p user/_init

user/_init:     file format elf64-little

Program Header:
0x70000003 off    0x0000000000006bb0 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**0
         filesz 0x000000000000004a memsz 0x0000000000000000 flags r--
    LOAD off    0x0000000000001000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**12
         filesz 0x0000000000001000 memsz 0x0000000000001000 flags r-x
    LOAD off    0x0000000000002000 vaddr 0x0000000000001000 paddr 0x0000000000001000 align 2**12
         filesz 0x0000000000000010 memsz 0x0000000000000030 flags rw-
   STACK off    0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**4
         filesz 0x0000000000000000 memsz 0x0000000000000000 flags rw-
\end{verbatim}
\end{footnotesize}

We see that the text segment should be loaded at virtual address 0x0 in memory (without write permissions) from content at offset 0x1000 in the file. We also see that the data should be loaded at address 0x1000, which is at a page boundary, and without executable permissions.

A program section header's \lstinline{filesz} may be less than the \lstinline{memsz}, indicating that the gap between them should be filled with zeroes (for C global variables) rather than read from the file. For \lstinline{/init}, the data \lstinline{filesz} is 0x10 bytes and \lstinline{memsz} is 0x30 bytes, and thus \indexcode{uvmalloc} allocates enough physical memory to hold 0x30 bytes, but reads only 0x10 bytes from the file \lstinline{/init}.

Now \indexcode{exec} allocates and initializes the user stack. It allocates just one stack page. \lstinline{exec} copies the argument strings to the top of the stack one at a time, recording the pointers to them in \indexcode{ustack}. It places a null pointer at the end of what will be the \indexcode{argv} list passed to \lstinline{main}. The first three entries in \lstinline{ustack} are the fake return program counter, \indexcode{argc}, and \lstinline{argv} pointer.

\lstinline{exec} places an inaccessible page just below the stack page, so that programs that try to use more than one page will fault. This inaccessible page also allows \lstinline{exec} to deal with arguments that are too large; in that situation, the \indexcode{copyout} \lineref{kernel/vm.c:/^copyout/} function that \lstinline{exec} uses to copy arguments to the stack will notice that the destination page is not accessible, and will return -1.

During the preparation of the new memory image, if \lstinline{exec} detects an error like an invalid program segment, it jumps to the label \lstinline{bad}, frees the new image, and returns -1. \lstinline{exec} must wait to free the old image until it is sure that the system call will succeed: if the old image is gone, the system call cannot return -1 to it. The only error cases in \lstinline{exec} happen during the creation of the image. Once the image is complete, \lstinline{exec} can commit to the new page table \lineref{kernel/exec.c:/pagetable.=.pagetable/} and free the old one \lineref{kernel/exec.c:/proc_freepagetable/}.

\lstinline{exec} loads bytes from the ELF file into memory at addresses specified by the ELF file. Users or processes can place whatever addresses they want into an ELF file.
Thus \lstinline{exec} is risky, because the addresses in the ELF file may refer to the kernel, accidentally or on purpose. The consequences for an unwary kernel could range from a crash to a malicious subversion of the kernel's isolation mechanisms (i.e., a security exploit). Xv6 performs a number of checks to avoid these risks. For example \lstinline{if(ph.vaddr + ph.memsz < ph.vaddr)} checks whether the sum overflows a 64-bit integer. The danger is that a user could construct an ELF binary with a \lstinline{ph.vaddr} that points to a user-chosen address, and \lstinline{ph.memsz} large enough that the sum overflows to 0x1000, which will look like a valid value. In an older version of xv6 in which the user address space also contained the kernel (but not readable/writable in user mode), the user could choose an address that corresponded to kernel memory and would thus copy data from the ELF binary into the kernel. In the RISC-V version of xv6 this cannot happen, because the kernel has its own separate page table; \lstinline{loadseg} loads into the process's page table, not into the kernel's page table.

It is easy for a kernel developer to omit a crucial check, and real-world kernels have a long history of missing checks whose absence can be exploited by user programs to obtain kernel privileges. It is likely that xv6 doesn't do a complete job of validating user-level data supplied to the kernel, which a malicious user program might be able to exploit to circumvent xv6's isolation.

%%
\section{Real world}
%%

Like most operating systems, xv6 uses the paging hardware for memory protection and mapping. Most operating systems make far more sophisticated use of paging than xv6 by combining paging and page-fault exceptions, which we will discuss in Chapter~\ref{CH:TRAP}.

Xv6 is simplified by the kernel's use of a direct map between virtual and physical addresses, and by its assumption that there is physical RAM at address 0x80000000, where the kernel expects to be loaded. This works with QEMU, but on real hardware it turns out to be a bad idea; real hardware places RAM and devices at unpredictable physical addresses, so that (for example) there might be no RAM at 0x80000000, where xv6 expects to be able to store the kernel. More serious kernel designs exploit the page table to turn arbitrary hardware physical memory layouts into predictable kernel virtual address layouts.

RISC-V supports protection at the level of physical addresses, but xv6 doesn't use that feature.

On machines with lots of memory it might make sense to use RISC-V's support for ``super pages.'' Small pages make sense when physical memory is small, to allow allocation and page-out to disk with fine granularity. For example, if a program uses only 8 kilobytes of memory, giving it a whole 2-megabyte super-page of physical memory is wasteful. Larger pages make sense on machines with lots of RAM, and may reduce overhead for page-table manipulation.

The xv6 kernel's lack of a {\tt malloc}-like allocator that can provide memory for small objects prevents the kernel from using sophisticated data structures that would require dynamic allocation. A more elaborate kernel would likely allocate many different sizes of small blocks, rather than (as in xv6) just 4096-byte blocks; a real kernel allocator would need to handle small allocations as well as large ones.

Memory allocation is a perennial hot topic, the basic problems being efficient use of limited memory and preparing for unknown future requests~\cite{knuth}.
Today people care more about speed than space efficiency.

%%
\section{Exercises}
%%

\begin{enumerate}

\item Parse RISC-V's device tree to find the amount of physical memory the computer has.

\item Write a user program that grows its address space by one byte by calling \lstinline{sbrk(1)}. Run the program and investigate the page table for the program before the call to \lstinline{sbrk} and after the call to \lstinline{sbrk}. How much space has the kernel allocated? What does the PTE for the new memory contain?

\item Modify xv6 to use super pages for the kernel.

\item Unix implementations of \lstinline{exec} traditionally include special handling for shell scripts. If the file to execute begins with the text \lstinline{#!}, then the first line is taken to be a program to run to interpret the file. For example, if \lstinline{exec} is called to run \lstinline{myprog} \lstinline{arg1} and \lstinline{myprog}'s first line is \lstinline{#!/interp}, then \lstinline{exec} runs \lstinline{/interp} with command line \lstinline{/interp} \lstinline{myprog} \lstinline{arg1}. Implement support for this convention in xv6.

\item Implement address space layout randomization for the kernel.

\end{enumerate}