Week 11.a
CS 5600 03/28 2022

On the board
------------
1. Last time

2. Page faults
    - intro
    - usage
    - costs
    - thrashing

3. mmap

-------------------------------------------------------

Admin:

--Lab2 cheating detection results
    --confess before Tue midnight

---------

1. Last time

--TLB
--Virtual memory overview [show slides]
    --where does the OS live?
    --Meltdown and Spectre
        --post-Meltdown kernel (mitigation: KAISER)

notes from Aurojit Panda about kernel-being-mapped-into-each-process:

    [AP: This answer is complicated
    (https://www.usenix.org/system/files/login/articles/login_winter18_03_gruss.pdf),
    but see below for an attempt to explain.

    * In the post-Meltdown KAISER/KPTI/KVA/XNU Double Map world (all
      names for similar mitigations), each process has two (logical)
      page tables:

        - One, the user-mode page table, used when the process is
          executing user-mode code, unmaps most (but not all) of the
          kernel; this includes some of the kernel stack and a few other
          things. The aim of all mitigations has been to minimize the
          number of kernel pages in the user-mode page table, but
          different tradeoffs are selected for how much this is
          minimized.

        - The second, the kernel-mode page table, has exactly the same
          layout as what [is mentioned above], i.e., the kernel is in
          all address spaces.

      On entry to the kernel, the OS switches as rapidly as possible
      from the user-mode page table to the kernel-mode one. It switches
      back before returning. Having a kernel-mode page table per process
      is in part to minimize how much one needs to change in the kernel,
      so maybe one can argue that this is not the "best" possible
      solution, but it is pretty good.]

2. Page faults

A. Intro and mechanics

We've discussed these a bit. Let's go into a bit more detail...

Concept: a reference is illegal, either because it's not mapped in the
page tables or because there is a protection violation. This requires
the OS to get involved. This mechanism turns out to be hugely powerful,
as we will see.

Mechanics

--what happens on the x86? [show slides]

--the processor constructs a trap frame and transfers execution to an
  interrupt or trap handler [see handout week11.a]

                 [  trap frame  ]
        %rsp --> [  error code  ]

  %rip now points to the code that handles the trap

  [how did the processor know what to load into %rip? Answer: the kernel
  needs to set up an interrupt descriptor table (IDT), by which the CPU
  knows where to go; btw, the page fault exception is #14.]

  Question: why isn't %cr3 stored in the trap frame?
  [answer: though we trap to the kernel, the address space is unchanged.
  Recall "where does the OS live?".]

--error code (meaning when the bit is set / clear):

    [ ................................ U/S | W/R | P ]

    U/S: fault happened in user mode / supervisor mode
    W/R: access was a write / access was a read
    P:   protection violation / page not present

--on a page fault, %cr2 holds the faulting virtual address

  Question: why does the OS need the faulting virtual address?
  [answer: to walk the page table and figure out how to fix the fault]

--intent: when a page fault happens, the kernel sets up the process's
  page table entries properly, or kills the process
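To make the dispatch concrete, here is a minimal sketch (not any real
kernel's code) of a page-fault trap handler that reads %cr2 and decodes
the error code. The bit values match the x86 error code above; all other
names (struct proc, struct vma, curproc, and the helper functions) are
hypothetical stand-ins.

    #include <stdint.h>

    #define PF_PRESENT 0x1   /* 1 = protection violation, 0 = page not present */
    #define PF_WRITE   0x2   /* 1 = faulting access was a write */
    #define PF_USER    0x4   /* 1 = fault occurred in user mode */

    struct proc;                                   /* opaque process descriptor */
    struct vma;                                    /* a mapped region */
    extern struct proc *curproc;

    uint64_t read_cr2(void);                       /* faulting virtual address */
    struct vma *find_vma(struct proc *p, uint64_t va);
    int  vma_is_writable(struct vma *v);
    void handle_cow_write(struct proc *p, uint64_t va);     /* copy-on-write; see B. */
    void page_in_from_disk(struct proc *p, struct vma *v, uint64_t va);
    void kill_process(struct proc *p);

    void page_fault_handler(uint32_t err)
    {
        uint64_t va = read_cr2();                  /* which address faulted? */
        struct vma *v = find_vma(curproc, va);

        if (v == NULL) {
            kill_process(curproc);                 /* address is simply illegal */
        } else if ((err & PF_WRITE) && !vma_is_writable(v)) {
            kill_process(curproc);                 /* write to a read-only region */
        } else if (err & PF_PRESENT) {
            handle_cow_write(curproc, va);         /* page mapped read-only on purpose */
        } else {
            page_in_from_disk(curproc, v, va);     /* valid but not resident */
        }
        /* on return from the trap, the hardware retries the faulting
           instruction */
    }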
B. Uses of page faults

--Best example: overcommitting physical memory (the classical use of
  "virtual memory")

    --your program thinks it has, say, 64 GB of memory, but your
      hardware has only 16 GB of memory

    --the way this worked (and works) is that the disk is used to store
      memory pages

    --advantage: the address space looks huge

    --disadvantage: accesses to "paged" memory (as pages that live on
      the disk are known) are sllooooowwwww

    --Rough implementation:

        --on a page fault, the kernel reads in the faulting page

        --QUESTION: what is listed in the page structures? how does the
          kernel know whether the address is invalid, in memory, paged
          out, etc.?

        --the kernel may need to send a page to disk
          (under what conditions? answer: two conditions must hold for
          the kernel to HAVE to write to disk:)
            (1) the kernel is out of memory
            (2) the page that it selects to write out is dirty

    --Computers have lots of memory, so it is less common to hear the
      sound of swapping these days. You would need multiple large memory
      consumers running on the same computer.

    --in fact, today you sometimes see the opposite: you may want to run
      a file system in memory (ramfs) for performance.

--Many other uses

    a) store memory pages across the network! (Distributed Shared Memory)

        --basic idea: on a page fault, the page fault handler went and
          retrieved the needed page from some other machine

        --charming idea, but impractical. Why? too expensive.

        --classic trade-off: transparency vs. efficiency
            --the general case: low-level APIs expose more information
              (less transparency) but have better performance

    b) copy-on-write

        --when creating a copy of another process, don't copy its
          memory. just copy its page tables, and mark the pages as
          read-only

        --COW is how fork is implemented

        --QUESTION: do you need to mark the parent's pages as read-only
          as well?
          [answer: of course; otherwise the child would see data changes
          made by the parent.]

        --program semantics aren't violated when programs do reads

        --when a write happens, a page fault results. at that point, the
          kernel allocates a new page, copies the memory over, marks the
          page writable, and restarts the user program so the write can
          proceed (see the sketch at the end of this section)

        --so memory is copied only when there is a fault as a result of
          a write

        --this idea is all over the place; used in fork(), mmap(), etc.

    [skipped]
    c) accounting

        --good way to sample what percentage of the memory pages are
          written to in any time slice: mark a fraction of them not
          present, see how often you get faults

    d) performance tricks

        --remember the JVM trick I mentioned
            --use a page fault to get rid of an if-else branch

        --the paper "Virtual Memory Primitives for User Programs", by
          Andrew W. Appel and Kai Li, Proc. ASPLOS, 1991, explored the
          possible uses of virtual memory 3 decades ago:

            --high-level idea: by giving user-level programs the
              opportunity to do interesting things on page faults, you
              can build interesting functionality

            [interesting applications below; introduce some given time]

        [briefly mentioned]
        ** Concurrent GC

            --GC motivation:
                --manually calling malloc and free is tedious and
                  error-prone (recall your memory bugs in your labs)
                --can we ask the program/runtime/library/OS to fix this?
                --Yes, garbage collection (GC)!

            --many GC algorithms

            --one classic GC algorithm: copying
                --two memory regions: one is in use; the other is free
                --when doing GC, copy the useful objects to the other
                  region
                --swap the regions

            --challenge: can we run the garbage collector and the
              program concurrently?
                --what if the program reaches the old memory region? it
                  may see stale data
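Here is a hedged sketch of the kernel-side handling of a write fault on
a copy-on-write page, as described in b) above. Every type and helper
here (struct proc, phys_page_for, refcount_of, alloc_page, map_page,
PTE_W) is a hypothetical stand-in, not the API of any real kernel.

    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 4096
    #define PTE_W     0x2                 /* "writable" permission bit */

    struct proc;                          /* opaque process descriptor */

    void *phys_page_for(struct proc *p, uint64_t va);  /* page backing va */
    int   refcount_of(void *page);        /* how many page tables share it */
    void  refcount_dec(void *page);
    void *alloc_page(void);
    void  map_page(struct proc *p, uint64_t va, void *page, int perms);

    void handle_cow_write(struct proc *p, uint64_t va)
    {
        void *old = phys_page_for(p, va); /* currently mapped read-only */

        if (refcount_of(old) == 1) {
            /* last sharer: no copy needed, just re-enable writes */
            map_page(p, va, old, PTE_W);
            return;
        }
        void *copy = alloc_page();        /* private copy for this process */
        memcpy(copy, old, PAGE_SIZE);
        refcount_dec(old);
        map_page(p, va, copy, PTE_W);
        /* returning from the fault restarts the faulting write, which now
           hits the private, writable copy */
    }

Note the cheap case: if only one process still maps the page, the fault
handler can simply flip the permission bit instead of copying.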
[skipped]
C. Costs

--What does paging from the disk cost?

--let's look at average memory access time (AMAT)

--AMAT = (1-p) * (memory access time) + p * (page fault time),
  where p is the probability of a page fault.

    memory access time ~ 100 ns
    SSD access time    ~ 1 ms = 10^6 ns

--QUESTION: what does p need to be to ensure that paging hurts
  performance by less than 10%?

    1.1 * t_M = (1-p) * t_M + p * t_D
    p = 0.1 * t_M / (t_D - t_M) ~ 10^1 ns / 10^6 ns = 10^{-5}

    so only one access out of 100,000 can be a page fault!!

--basically, page faults are super-expensive
  (good thing the machine can do other things during a page fault)

  This concept is much larger than OSes: need to pay attention to the
  slow case if it's really slow and common enough to matter.

D. Page replacement policies

--the fundamental problem/question (also known as the cache eviction
  problem):

    --some entity holds a cache of entries and gets a cache miss. The
      entity now needs to decide which entry to throw away. How does it
      decide?

    --make sure you understand why page faults that result from
      "page-not-present in memory" are a particular kind of cache miss

    --(the answer is that in the world of virtual memory, the pages
      resident in memory are basically a cache for the backing store on
      the disk; make sure you see why this claim, about virtual memory
      vis-a-vis the disk, is true.)

    --the system needs to decide which entry to throw away, which calls
      for a *replacement policy*.

    --let's cover some policies....

Specific policies

    * FIFO: throw out the oldest page
      (results in every page spending the same number of references in
      memory. not a good idea: pages are not accessed uniformly.)

    * LRU: throw out the least recently used page
      (this is often a good idea, but it depends on the future looking
      like the past. what if we chuck a page from our cache and then are
      about to use it?)

      --implementing LRU

        --reasonable to do in application programs like Web servers that
          cache pages (or dedicated Web caches). [use a queue to track
          the least recently accessed entries and a hash map to
          implement the (k,v) lookup]

        --in the OS, exact LRU does not sound great: it would double
          memory traffic (after every reference, we'd have to move some
          structure to the head of some list)

        --and in hardware, it's way too much work to timestamp each
          reference and keep the list ordered (remember that the TLB
          would also have to implement this)

        --how can we approximate LRU? --> another algorithm:

    * CLOCK [draw clock on board]

        --arrange the slots in a circle. a hand sweeps around, clearing
          a bit; the bit is set when the page is accessed. evict a page
          if the hand points to it when its bit is clear.
          (see the sketch at the end of this section)

        --approximates LRU ... because we're evicting pages that haven't
          been used in a while ... though of course we may not be
          evicting the *least* recently used one (why not?)

--Summary:

    --LRU is usually a good approximation to the optimal policy

        * optimal policy: throw away the entry that won't be used for
          the longest time. this is optimal.

    --Implementing LRU in hardware or at the OS/hardware interface is a
      pain

    --So implement CLOCK ... a decent approximation to LRU, which is in
      turn a good approximation to OPT *assuming that the past is a good
      predictor of the future* (this assumption does not always hold!)
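A minimal, self-contained sketch of CLOCK eviction follows. In a real OS
the "referenced" flag would be the hardware-maintained accessed bit in
the page table entry; here it is just a field in a made-up frame table,
and NFRAMES is an arbitrary size chosen for illustration.

    #include <stdbool.h>

    #define NFRAMES 1024

    struct frame {
        int  page;           /* which page currently occupies this frame */
        bool referenced;     /* set on access, cleared by the sweeping hand */
    };

    static struct frame frames[NFRAMES];
    static int hand = 0;     /* index the clock hand currently points at */

    /* choose a victim frame to evict */
    int clock_evict(void)
    {
        for (;;) {
            if (!frames[hand].referenced) {
                int victim = hand;               /* not used since last sweep */
                hand = (hand + 1) % NFRAMES;
                return victim;
            }
            frames[hand].referenced = false;     /* give it a second chance */
            hand = (hand + 1) % NFRAMES;
        }
    }

The "second chance" line is the whole trick: a page survives one sweep
of the hand after its last access, so recently used pages tend to stay.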
E. Thrashing

[The points below apply to any caching system, but for the sake of
concreteness, let's assume that we're talking about page replacement in
particular.]

What is thrashing?

    Processes require more memory than the system has.

    Specifically, each time a page is brought in, another page, whose
    contents will soon be referenced, is thrown out.

    Example:

        --one program touches 50 pages (each equally likely); we only
          have 40 physical page frames

        --If we have enough physical pages: 100 ns/ref

        --If we have too few physical pages (40 frames), assume every
          5th reference leads to a page fault

        Question: if the SSD latency is 1 ms, how many times slowdown do
        we have?

        --4 refs x 100 ns, plus 1 page fault x 1 ms for the SSD I/O

        --this gets us 5 refs per (1 ms + 400 ns) ~ 0.2 ms/ref
          = 2,000x slowdown!!!

        --What we wanted: virtual memory the size of the disk with the
          access time of physical memory

        --What we have here: memory with an access time roughly that of
          the SSD (0.2 ms/mem_ref, compared to 1 ms/SSD_access)

    As stated earlier, this concept is much larger than OSes: need to
    pay attention to the slow case if it's really slow and common enough
    to matter.

Reasons/cases:

    (1) process doesn't reuse memory (or has no temporal locality)

    (2) process reuses memory, but the memory that is absorbing most of
        the accesses doesn't fit

    (3) individually, all processes fit, but together they are too much
        for the system

What do we do?

    --well, for the first two reasons above, there's nothing you can do,
      other than restructuring your computation or buying more memory
      (e.g., expensive hardware that keeps an entire customer database
      in RAM)

    --in the third case, the system can and must shed load. how? two
      approaches:

        a. working set
        b. page fault frequency

    a. working set

        --only run a set of processes s.t. the union of their working
          sets fits in memory

        --definition of working set (short version): the pages a process
          has touched over some trailing window of time

    b. page fault frequency

        --track the metric (# page faults / instructions executed)

        --if that metric rises above a threshold, and there is not
          enough memory on the system, swap out the process

[Acknowledgments: Mike Walfish, David Mazieres, Mike Dahlin]