Week 8 CS4973/CS6640 02/26 2025
https://naizhengtan.github.io/25spring/

Memory protection introduction
□ 1. Memory protection, the problem statement
□ 2. Segmentation (x86-32)
□ 3. PMP (RISC-V)
□ 4. Paging (brief)
□ 5. Meltdown & its consequences
---

1. Problem statement

Q: Motivation? Why do we need memory protection?
   * prevent a process from accessing the kernel's memory
   * prevent a process from accessing another process's memory
   * prevent a process from accidentally modifying its own memory (bugs)
   * what else?

- the fundamental problem: access control

      subj --[access]--> obj

  for us,

      CPU --[op]--> memory

  Concretely,
      subj: instructions + context (privilege levels and others)
      op:   r/w/x
      obj:  memory address

  Q: is this the only way to formulate memory protection?
  Q: why does "op" only have r/w/x?
     This is defined by the memory abstraction:
     -- data cells, identified by addresses
     -- ops: read/write
     -- instructions are part of the memory

  Now, memory protection is a yes/no question:

      allow [instruction+ctx] --[r/w/x]--> [a set of memory addresses]?

  Q: what are invalid accesses?
  -- invalid subj (e.g., a user program reading kernel memory)
  -- invalid op (e.g., executing 0xdeadbeef)
  -- invalid obj (e.g., writing to read-only memory)

- a straightforward solution: associate an ACL with each obj
  => for each memory obj, attach an ACL of which subj can do what op

  Multiple questions:
  Q1: what are memory objs? (memory granularity)
  Q2: who is the subj? (how to define subj)
  Q3: where to store the ACL?

2. x86 segmentation

- segfault example, in x86:

  """
  int main() {
      int *ptr = (int *)0xdeadbeef;
      *ptr = 1;
  }
  """

  You will see something like:
  "zsh: segmentation fault  ./segfault"

  Q: Why is it called a segmentation fault?

- x86 memory segmentation, a brief history

  * Segmentation was introduced on the Intel 8086 in 1978 as a way to
    allow programs to address more than 64KB of memory.
    // real mode
    // 16-bit addresses => 20-bit addresses
    // address = (base << 4) + offset

  * The Intel 80286 (1982) introduced a second version of segmentation
    that added support for memory protection.

  * x86-32 (starting from the 80386, 32-bit, 1985)

    How it works:
    * segment registers---CS, DS, SS, ES (plus FS and GS)---contain "selectors"
    * selectors point into the GDT/LDT (global/local descriptor table),
      which holds "descriptors"
    * a descriptor contains a base address, a limit, and access rights
    * finally, the CPU adds the requested address to the base address
      (and checks it against the limit)

    memory protection (checks):
    - check access rights
    - check privilege levels: max(CPL, RPL) <= DPL
      // in x86, lower is more privileged
      RPL: requested privilege level, in the segment selector
      DPL: descriptor privilege level, in the segment descriptor
      CPL: current privilege level

    [show how Linux uses segments]

  * The x86-64 architecture, introduced in 2003, largely dropped support
    for segmentation in 64-bit mode.

- How does segmentation answer the previous questions?

  Q1: what is the granularity of memory obj?
  A: defined by a segment's base and limit: a dynamically sized region of memory

  Q2: how to define the subj?
  A: defined by instructions plus the current privilege level (CPL)

  Q3: where to store the ACL?
  A: registers (selectors) and memory (descriptors in the LDT/GDT)

3. PMP (RISC-V physical memory protection)

* PMP configuration registers (pmpcfg0–pmpcfg15), 16 of them
* PMP address registers (pmpaddr0–pmpaddr63), 64 of them

  Mapping:
  * each pmpaddr register is paired with one 8-bit configuration field;
  * one 32-bit pmpcfg register packs four such fields, so 16 pmpcfg
    registers cover the 64 pmpaddr registers
  (a configuration sketch follows this section)

- How does PMP answer the previous questions?

  Q1: what is the granularity of memory obj?
  A: defined by the PMP address registers; dynamically defined regions

  Q2: how to define the subj?
  A: defined by instructions plus the privilege mode
     (M-mode bypasses PMP checks unless an entry is locked)

  Q3: where to store the ACL?
  A: PMP registers (pmpcfg0–pmpcfg15, pmpaddr0–pmpaddr63)
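To make pmpcfg/pmpaddr concrete, here is a minimal sketch of our own
(not from the lecture) of M-mode code granting read/write access to one
region, using the NAPOT encoding from the RISC-V privileged spec. It
assumes an RV32 target with GCC/Clang inline assembly; the function
names are made up.

"""
#include <stdint.h>

/* pmpcfg field layout (one byte per pmpaddr register) */
#define PMP_R     (1u << 0)  /* read permission */
#define PMP_W     (1u << 1)  /* write permission */
#define PMP_X     (1u << 2)  /* execute permission */
#define PMP_NAPOT (3u << 3)  /* A field: naturally aligned power-of-two */

static inline void write_pmpaddr0(uint32_t v) {
    asm volatile("csrw pmpaddr0, %0" : : "r"(v));
}
static inline void write_pmpcfg0(uint32_t v) {
    asm volatile("csrw pmpcfg0, %0" : : "r"(v));
}

/* NAPOT encoding: base must be size-aligned and size a power of two
 * >= 8 bytes; pmpaddr holds address bits [33:2], with the low
 * log2(size)-3 bits set to 1. */
static uint32_t napot_encode(uint32_t base, uint32_t size) {
    return (base >> 2) | ((size >> 3) - 1);
}

/* Let S/U-mode read and write [base, base+size); M-mode is unaffected
 * because the entry is not locked (L bit clear). */
void pmp_allow_rw(uint32_t base, uint32_t size) {
    write_pmpaddr0(napot_encode(base, size));
    /* entry 0 lives in the low byte of pmpcfg0 */
    write_pmpcfg0(PMP_NAPOT | PMP_R | PMP_W);
}
"""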
4. Paging (brief)

- super brief history
  * designed for swapping memory in and out (Atlas, 1962)
    [T. Kilburn, R. B. Payne, D. J. Howarth, The Atlas Supervisor,
     http://www.chilton-computing.org.uk/acl/technology/atlas/p019.htm]
  * memory protection was added later

- Paging is the fundamental source of isolation in almost all of today's systems.

- How does paging answer the previous questions?

  Q1: what is the granularity of memory obj?
  A: a page (typically 4KB), a fixed size

  Q2: how to define the subj?
  A: defined by (1) instructions, (2) privilege levels, and
     (3) the root of the page table (cr3 in x86; satp in RISC-V)

  Q3: where to store the ACL?
  A: memory (page table entries)

5. Meltdown & its consequences

* Meltdown allows malicious user code to read kernel memory, despite page protections.
  -- surprising and disturbing
  -- one of a number of recent "micro-architectural" attacks exploiting
     hidden implementation details of CPUs
  -- fixable, but people fear an open-ended supply of related surprises

* we will skip the technical details

* consequences:

  what OSes used to do: map the kernel into the address space of every
  user process, but isolate it using privilege levels
  (set the "user-cannot-read" bit on kernel page table entries)

  now, a bummer: user code can violate that isolation and read kernel memory

  a fix (simplified): run the OS in its own address space
  [something deep here;]

  consequence: performance loss
  [see the paper "An Analysis of Performance Evolution of Linux's Core
   Operations", SOSP'19,
   https://dl.acm.org/doi/10.1145/3341301.3359640]

===

Virtual memory
□ 1. Virtual memory intro
□ 2. Paging
□ 3. Page table
□ 4. Today's virtual memory
------

1. Virtual memory intro

- a very important idea in computer systems

- "to virtualize" means "to lie" or "to fool". We'll see how this is
  implemented in a few moments; for now, let's look at the benefits of
  being interposed on memory accesses.

- Q: Why virtual memory? What's wrong with using physical addresses?

- benefits:

  (a) programmability
      (i) huge contiguous memory: the program *thinks* it has lots of
          memory, organized in a contiguous address space
      (ii) multiplexing addresses: programs can use "easy-to-use"
          addresses like 0, 0x20000, whatever. The compiler and linker
          don't have to worry about where the program actually lives
          in physical memory when it executes.
          -- good for modularization

  (b) protection:
      (i) separate address spaces: processes cannot read or write each
          other's memory (see the demo after this list)
          -- this protection is at the heart of the isolation among
             processes that the OS provides
          -- prevents a bug in one process from corrupting another
             process (non-adversarial scenarios)
          -- we don't even want a process to observe another process's
             memory, say if that process holds secret or sensitive
             data (adversarial scenarios)
          -- the idea: if you cannot name something, you cannot use it.
             This is a deep idea.
          -- Question: can you think of another example of this naming
             idea? (answer: file descriptors)
      (ii) access control: prevent incorrect modifications to data
           (e.g., writing to read-only memory regions) and adversarial
           execution on the stack/heap (e.g., buffer overflows)

  (c) effective use of resources:
      (i) overcommitting memory: programmers don't have to worry that
          the sum of the memory consumed by all active processes is
          larger than physical memory.
      (ii) secure sharing: processes can share memory under controlled
           circumstances, but that physical memory may show up at very
           different virtual addresses
           -- that is, two processes can have different names for the
              same physical memory cells

  Q: can physical addresses achieve the above benefits? Why or why not?
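Here is a small demo of benefit (b)(i), assuming a Unix-like system:
after fork(), the parent and child print the same virtual address for
a local variable, yet each sees its own value, because the same VA
maps to different physical memory in the two processes.

"""
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int x = 0;
    pid_t pid = fork();         /* now two processes run this code */

    if (pid == 0) {             /* child */
        x = 1;
        printf("child:  &x = %p, x = %d\n", (void *)&x, x);
    } else {                    /* parent */
        wait(NULL);             /* let the child write and print first */
        x = 2;
        printf("parent: &x = %p, x = %d\n", (void *)&x, x);
    }
    return 0;
}
"""

Both lines print the same "&x", but the child sees x == 1 while the
parent sees x == 2: one name, two memory cells.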
- today's virtual memory has two main duties:
  (a) translation: VA (virtual address) => PA (physical address)
  (b) isolation & protection: access control

- for now, we mainly focus on the translation job

2. Paging

- the translation problem: VA => PA,
  and we hope this translation
  (i) runs fast,
  (ii) has small memory overhead, and
  (iii) can be updated quickly.

- For example:
  * Segmentation: base + VA => PA

- Q: what's wrong with segmentation? how would you implement this translation?

A. pages

- Basic concept: divide all of memory (physical and virtual) into
  *fixed-size* chunks.
  -- these chunks are called *PAGES*.
  -- they have a size called the PAGE SIZE.
     (different hardware architectures specify different sizes)
  -- in traditional x86, the PAGE SIZE is 4096B = 4KB = 2^{12} bytes.
     It is the same for us, RV32.

  Q: does "paging" fundamentally change the translation problem?
  A: not really; we now map page numbers instead of individual addresses

  -- it is proper and fitting to talk about pages having **NUMBERS**.
     -- page 0: [0, 4095]
     -- page 1: [4096, 8191]
     -- page 2: [8192, 12287]
     -- page 3: [12288, 16383]
     ...
     -- page 2^{20}-1: [2^{32}-4096, 2^{32}-1]
  -- unfortunately, it is also proper and fitting to talk about _both_
     virtual and physical pages having numbers.
  -- sometimes we will try to be clear with terms like:
     VPN: virtual page number
     PPN: physical page number

B. per-process translation

Q: who owns the translation?
- If a CPU can hold only one translation, is it useful?
- For a multi-core CPU, if each core can hold only one translation, is it useful?
- Today, each process (or program) has a separate mapping
  -- and each page is separately mapped
  -- we will allow the OS to gain control on certain operations:
     -- read-only pages trap to the OS on write (store)
     -- invalid pages trap to the OS on read (load) or write (store)
     -- the OS can change the mapping and resume the application

3. Page table

Q: if you were re-inventing virtual memory, what data structure would
   you use for the translation?
   Recall our requirements for the translation:
   (i) runs fast, (ii) has small memory overhead, (iii) can be updated quickly.

A. Problem statement

   a page table conceptually implements a map from VPN --> PPN

   a page table is conceptually an index. the address is broken up into bits:

      [.............|........]
      [     VPN     | offset ]
            |            |
            v            |
          TABLE          |
            |            |
            v            v
      [     PPN     | offset ]  = physical address

   the top bits index into the page table; the contents at that index
   are the PPN. the bottom bits are the offset and are not changed by
   the mapping.

   physical address = PPN + offset
   (note: "+" here means "concatenate": for example, 123 "+" 456 => 123456)

   the result is that each page table entry expresses a mapping for a
   contiguous group of addresses.

B. a naive proposal (assume 32-bit addresses and 4KB pages):

   there is in the sky a 2^{20}-entry array that maps each virtual page
   to a *physical* page:

      table[VPN] => PPN

   EXAMPLE: if the OS wants a program to be able to use address
   0x402000 to refer to physical address 0x3000, then the OS
   conceptually adds an entry:

      table[0x402] = 0x3

   (virtual page number 1026 maps to physical page number 3;
    in decimal: table[1026] = 3)

   NOTE: the top 20 bits do the indirection. the bottom 12 bits just
   select where on the page the access takes place.
   -- the bottom bits are called the _offset_.
   -- so now all we have to do is create this mapping
   -- why is this hard? why not just create the mapping?
   -- Question: how large is this table?
   -- answer: you need, per process, roughly 4MB
      (2^{20} entries * 4 bytes per entry)
      [why 4 bytes per entry? in practice, it's convenient to have the
       entry size be the same as a data type on the machine]
      (a sketch of this naive design follows)
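A minimal sketch, in C, of the naive proposal; the names `table` and
`translate` are ours, for illustration only:

"""
#include <stdint.h>

#define PAGE_SHIFT 12                    /* 4KB pages */
#define NUM_PAGES  (1u << 20)            /* 2^20 virtual pages */

static uint32_t table[NUM_PAGES];        /* ~4MB per process! */

uint32_t translate(uint32_t va) {
    uint32_t vpn    = va >> PAGE_SHIFT;  /* top 20 bits */
    uint32_t offset = va & 0xfff;        /* bottom 12 bits */
    uint32_t ppn    = table[vpn];        /* the one-step lookup */
    return (ppn << PAGE_SHIFT) | offset; /* PPN "+" offset */
}
"""

With the example above, table[0x402] = 0x3 makes translate(0x402000)
return 0x3000.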
   -- too much! let's deal with this...

   [draw a black-box MMU: inputs are VAs; outputs are PAs]

   -- Question: if you were the MMU designer, how would you design the table?

C. what we use today: the page table

   PT: a radix tree
   [see a radix tree figure]

   Q: Why radix trees? Could you use other trees, like binary trees?

   * PT design space:
     -- offset
     -- page size
     -- address length
     -- addressable memory unit
     -- depth of the PT
     -- PTE size

4. Today's virtual memory

- What we have today:
  [virtual memory -> paging -> page table -> RV32]

  introducing the alternatives:
  * VA but not paging: segmentation
  * paging but not page table: hashing-based translation
  * page table but not RISC-V: x86 and ARM

- how is this translation implemented?
  -- a software(OS)-hardware(MMU) co-design
  -- in modern systems, hardware does the translation; this hardware is
     configured by the OS.
  -- this hardware is called the MMU, for memory management unit, and
     is part of the CPU
  -- why doesn't the OS just do the translation itself? similar to
     asking why we don't execute programs by running them on an
     emulation of a processor (too slow)

- things to remember in what follows:
  -- the OS is going to be setting up data structures that the hardware sees
  -- these data structures are *per-process* PTs

===

Virtual memory implementation
□ 1. RV32 page table intro
□ 2. (manually) walking the page table
---

1. Intro

[show overview fig]

* a toy example: VA 0x80001fec

  split it into the L1/L2 indexes and the offset:
  0x80001fec =>
  - L1 index: 1000 0000 00 => 0x200 => 2*16*16 = 512
  - L2 index: 0x1 => 1
  - offset:   0xfec

* detailed info:
  (1) satp [handout]
  (2) PTE [handout]
      Q: what if a PTE has W but not R?
      [a reserved encoding; treat it as undefined behavior]
  (3) VA [handout]

2. Walking the page table

* a real example: take a look at helloworld.c

  """
  int main(int argc, char **argv) {
      m_uint32 data = 0xdeadbeef;
      printf("%p\n", &data);
      asm("ecall");
  }
  """

* run it; it prints 0x80001fec, and the ecall then traps // error

  Q: what do you think this prints?
     gdb> p/x *0x80001fec
     [A: 0]

  Q: Why? We're in M-mode; we only see physical memory.

* simulate the CPU: a manual page walk

  Goal: find which physical address holds the data 0xdeadbeef.
  Method: use gdb to simulate the page walk.

  Steps:
  (1) split the VA into the L1/L2 indexes and the offset
  (2) get the root of the page table
  (3) calc the L1 page
  (4) calc the L2 page
  (5) calc the physical address

* now, let's do it:

  (1) split the VA into the L1/L2 indexes and the offset
      0x80001fec =>
        L1 index: 0x200 (512)
        L2 index: 0x1
        offset:   0xfec

  (2) get the root of the page table
      gdb> p/x $satp
      // 0x80080048

  (3) find the L1 page:
      Q: how to interpret 0x80080048?
      [the most significant bit is the MODE field; the low 22 bits are
       the PPN of the page-table root]

      Q: what is the L1 page?
      gdb> p/x 0x80048 << 12
      // 0x80048000

      Q: is this physical or virtual? [physical]

      Q: what is the L1 PTE address?
      gdb> p/x (0x80048 << 12) + 0x200*4
      // 0x200 is the L1 index in the VA; each PTE is 4 bytes

      Q: what is the L1 PTE content?
      gdb> p/x *((0x80048 << 12) + 0x200*4)
      // get 0x20014401 (the L1 PTE)

      Q: what does the "0x1" at the end mean? [the valid bit]

  (4) find the L2 page:
      Q: what is the L2 page?
      gdb> p/x (0x20014401 & ~(0x3ff)) >> 10 << 12
      // 0x80051000

      Q: what is the L2 PTE address?
      gdb> p/x 0x80051000 + 0x1*4
      // 0x1 is the L2 index in the VA

      Q: what is the L2 PTE content?
      gdb> p/x *(0x80051000 + 0x1*4)
      // 0x200158df

  (5) find the physical address
      Q: what does the "0xdf" at the end of 0x200158df mean?
      [the PTE flag bits (D/A/G/U/X/W/R/V); check out the PTE fig]

      Q: what is the data page?
      gdb> p/x (0x200158df & ~(0x3ff)) << 2
      // 0x80056000

      Q: what is the PA?
      gdb> p/x ((0x200158df & ~(0x3ff)) << 2) + 0xfec
      // 0x80056fec
      // dereference this address and you will see 0xdeadbeef (the data)
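The five steps above can be captured in one function. Below is a
sketch of our own (not from the lecture), assuming M-mode C code that
can dereference physical addresses directly, just as gdb did; the
names sv32_walk and read_satp are made up.

"""
#include <stdint.h>

#define PTE_V 0x1u                        /* valid bit */

static inline uint32_t read_satp(void) {
    uint32_t v;
    asm volatile("csrr %0, satp" : "=r"(v));
    return v;
}

/* Translate a VA by walking the two-level Sv32 page table.
 * Returns 0 if a PTE is invalid. */
uint32_t sv32_walk(uint32_t va) {
    uint32_t l1  = (va >> 22) & 0x3ff;    /* top 10 bits: L1 index  */
    uint32_t l2  = (va >> 12) & 0x3ff;    /* next 10 bits: L2 index */
    uint32_t off = va & 0xfff;            /* low 12 bits: offset    */

    /* satp[21:0] is the PPN of the page-table root */
    uint32_t *root = (uint32_t *)((read_satp() & 0x3fffff) << 12);

    uint32_t pte = root[l1];              /* L1 PTE */
    if (!(pte & PTE_V)) return 0;

    /* PTE[31:10] holds a PPN; "<< 2" is the same as ">> 10 << 12" */
    uint32_t *l2pt = (uint32_t *)((pte & ~0x3ffu) << 2);

    pte = l2pt[l2];                       /* L2 (leaf) PTE */
    if (!(pte & PTE_V)) return 0;

    return ((pte & ~0x3ffu) << 2) | off;  /* data page "+" offset */
}
"""

On the example above, sv32_walk(0x80001fec) reproduces 0x80056fec.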