Week 9.b. CS 5600 11/05 2021

On the board
------------

1. Last time
2. page faults
    - intro
    - uses
    - costs
    - page replacement policies
3. mmap

-------------------------------------------------------

1. Last time

    -- x86-64 page table walk
    -- TLBs
    -- where does the OS live?
    -- Question: where are the page tables? do they exist in the virtual
       address space of a process?
       [answer: yes. The process's user-level code cannot see them, but the
       kernel maps them into kernel space twice: one copy in the "physical
       memory" section (see handout week9.a, panel 7), and the other in its
       own data structure.]

2. Page faults

    A. intro and mechanics

    We've discussed these a bit. Let's go into a bit more detail...

    Concept: a reference is illegal, either because it's not mapped in the
    page tables or because there is a protection violation. This requires
    the OS to get involved. The mechanism turns out to be hugely powerful,
    as we will see.

    Mechanics

    --what happens on the x86?

        --the processor constructs a trap frame and transfers execution to
          an interrupt or trap handler

                        [ trap frame ]
            %rsp -->    [ error code ]

          %rip now points to code to handle the trap
          [how did the processor know what to load into %rip? Answer: the
          kernel sets up an interrupt descriptor table (IDT), by which the
          CPU knows where to go; btw, the page fault exception is #14.]

        --error code: [see handout week9.a]

            [ ................................ U/S | W/R | P ]

            U/S: user mode fault / supervisor mode fault
            W/R: access was write / access was read
            P:   protection violation / not-present page

        --%cr2 holds the faulting virtual address

          Question: why do we need the faulting virtual address?
          [answer: the kernel needs to know which VA went wrong so that it
          can fix the problem by, for example, allocating a physical page
          and mapping it to the virtual page containing that VA. (Of
          course, the kernel can do other things instead, like kill the
          process.)]

    --intent: when a page fault happens, the kernel either sets up the
      process's page table entries properly or kills the process

    B. Uses of page faults

    --Best example: overcommitting physical memory (the classical use of
      "virtual memory")

        --your program thinks it has, say, 64 GB of memory, but your
          hardware has only 16 GB of memory
        --the way that this worked is that the disk was (is) used to store
          memory pages
        --Rough implementation (a sketch appears at the end of this
          section):
            --for pages in memory: set their PTEs with P=1 (P is the
              present bit in the PTE; see handout week9.a)
            --for pages on disk (not in memory): set their PTEs with P=0
            --accessing a page that is not in memory causes a page fault
            --on a page fault, the kernel reads the faulting page from the
              disk into memory

        [Another point that I haven't talked about in class:
            --QUESTION: what is listed in the page structures? how does
              the kernel know whether the address is invalid, in memory,
              paged out, etc.?
            --the kernel may need to send a page to disk (under what
              conditions? answer: two conditions must hold for the kernel
              to HAVE to write to disk:
                (1) the kernel is out of memory, and
                (2) the page that it selects to write out is dirty)
        ]

        --advantage: the address space looks huge
        --disadvantage: accesses to "paged" memory (as pages that live on
          the disk are known) are sllooooowwwww:
            --computers have lots of memory these days, so it's less
              common to hear the sound of swapping. You would need
              multiple large memory consumers running on the same
              computer.
            --in fact, today you sometimes see the opposite: you may want
              to run a file system in memory (ramfs) for performance.

    --Many other uses

        --Distributed shared memory

            --store memory pages across the network!
            --basic idea: on a page fault, the page fault handler goes
              and retrieves the needed page from some other machine
            --charming idea, but impractical. Why? too expensive.
            --classic trade-off: transparency vs. efficiency
            --a general pattern: low-level APIs expose more information
              (less transparency) but offer better performance

        --copy-on-write

            --when creating a copy of another process, don't copy its
              memory. just copy its page tables, and mark the pages
              read-only
            --QUESTION: do you need to mark the parent's pages as
              read-only as well?
              [answer: of course; otherwise the child would see the
              parent's subsequent writes.]
            --program semantics aren't violated when programs do reads
            --when a write happens, a page fault results. at that point,
              the kernel allocates a new page, copies the memory over,
              fixes up the page tables, and restarts the user program so
              that the write can proceed
            --thus, memory is copied only when there is a fault as a
              result of a write
            --this idea is all over the place; used in fork(), mmap(),
              etc. (see the second sketch at the end of this section)

        --accounting

            --good way to sample what percentage of the memory pages are
              written to in any time slice: mark a fraction of them not
              present, and see how often you get faults

        --performance tricks

            --remember the JVM trick I mentioned: use a page fault to get
              rid of an if-else branch
            --if you are interested in this, check out the paper "Virtual
              Memory Primitives for User Programs", by Andrew W. Appel and
              Kai Li, Proc. ASPLOS, 1991.

    --high-level idea: by giving the kernel (or even a user-level program)
      the opportunity to do interesting things on page faults, you can
      build interesting functionality
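    [Aside, not from the handout: the fault-and-fix cycle above can be
    demonstrated from user space, in the spirit of the Appel & Li paper
    cited above. The sketch below protects a page with mprotect(), catches
    the resulting signal, and repairs the mapping in the handler;
    info->si_addr plays the role of %cr2. Caveat: POSIX does not promise
    that mprotect() is async-signal-safe, so treat this as demo-quality
    code; it works on common Linux and macOS setups.]

        #include <signal.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        static char *page;
        static long page_size;

        static void on_fault(int sig, siginfo_t *info, void *ctx) {
            (void)sig; (void)ctx;
            char *addr = (char *)info->si_addr;   /* analogue of %cr2 */
            if (addr >= page && addr < page + page_size) {
                /* "set up the page entry properly": make the page
                   accessible, return, and the faulting store retries */
                mprotect(page, page_size, PROT_READ | PROT_WRITE);
            } else {
                _exit(1);  /* analogue of the kernel killing the process */
            }
        }

        int main(void) {
            page_size = sysconf(_SC_PAGESIZE);
            page = mmap(NULL, page_size, PROT_NONE,  /* like P=0 */
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            struct sigaction sa;
            memset(&sa, 0, sizeof sa);
            sa.sa_flags = SA_SIGINFO;
            sa.sa_sigaction = on_fault;
            sigaction(SIGSEGV, &sa, NULL);
            sigaction(SIGBUS, &sa, NULL);  /* macOS may deliver SIGBUS */

            page[0] = 'x';   /* faults; handler "maps" the page; retries */
            printf("after the fault: %c\n", page[0]);
            return 0;
        }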
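    [Second aside: a tiny demonstration of copy-on-write semantics via
    fork(). The kernel-side page copying is invisible here; what is
    observable is that the parent's post-fork write does not show up in
    the child. The sleep()-based ordering is crude but keeps the sketch
    short.]

        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/wait.h>
        #include <unistd.h>

        int value = 1;  /* on a page shared copy-on-write after fork() */

        int main(void) {
            pid_t pid = fork();
            if (pid == 0) {
                sleep(1);              /* let the parent write first */
                /* the parent's write faulted and went to a private
                   copy, so the child still sees the original value */
                printf("child sees  %d\n", value);   /* prints 1 */
                exit(0);
            }
            value = 2;  /* write -> page fault -> kernel copies the page */
            printf("parent sees %d\n", value);       /* prints 2 */
            wait(NULL);
            return 0;
        }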
    C. costs

    --What does paging from the disk cost?

        --let's look at average memory access time (AMAT):

            AMAT = (1-p) * (memory access time) + p * (page fault time),

          where p is the probability of a page fault.

            memory access time ~ 100 ns
            SSD access time    ~ 1 ms = 10^6 ns

        --QUESTION: what does p need to be to ensure that paging hurts
          performance by less than 10%? (writing t_M for the memory
          access time and t_D for the SSD access time)

            1.1 * t_M = (1-p) * t_M + p * t_D
            p = 0.1 * t_M / (t_D - t_M) ~ 10^1 ns / 10^6 ns = 10^{-5}

          so only one access out of 100,000 can be a page fault!!

        --basically, page faults are super-expensive (good thing the
          machine can do other things during a page fault)

        Concept is much larger than OSes: you need to pay attention to
        the slow case if it's really slow and common enough to matter.

    D. Page replacement policies

    --the fundamental problem/question (also known as the cache eviction
      problem):

        --some entity holds a cache of entries and gets a cache miss.
          The entity now needs to decide which entry to throw away. How
          does it decide?

    --make sure you understand why page faults that result from
      "page-not-present in memory" are a particular kind of cache miss

        --(the answer is that in the world of virtual memory, the pages
          resident in memory are basically a cache of the backing store
          on the disk; make sure you see why this claim, about virtual
          memory vis-a-vis the disk, is true.)

    --the system needs to decide which entry to throw away, which calls
      for a *replacement policy*.

    --let's cover some policies....

    Specific policies

        * FIFO: throw out the oldest page (results in every page
          spending the same number of references in memory. not a good
          idea: pages are not accessed uniformly.)

        * OPT: throw away the entry that won't be used for the longest
          time. this is optimal.

        * LRU: throw out the least recently used page (this is often a
          good idea, but it depends on the future looking like the past.
          what if we chuck a page from our cache just before we are
          about to use it?)

    --implementing LRU

        --usually with a doubly linked list plus a hash table (see the
          first sketch below, before the summary)
        --reasonable to do in application programs like Web servers that
          cache pages (or dedicated Web caches)
        --in the OS, LRU itself does not sound great: it would double
          memory traffic (after every reference, you would have to move
          some structure to the head of some list)
        --and in hardware, it's way too much work to timestamp each
          reference and keep the list ordered
        --so how can we approximate LRU?

    --another algorithm:

        * CLOCK (see the second sketch below, before the summary)

            --arrange the physical page frames in a circle. a hand
              sweeps around, clearing a "referenced" bit in each frame
              it passes; the bit is set whenever the page is accessed.
              evict a page if the hand points to it while its bit is
              already clear.
            --approximates LRU ... because we're evicting pages that
              haven't been used in a while ... though of course we may
              not be evicting the *least* recently used one (why not?)
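    [Aside: a minimal sketch of the list-plus-table LRU structure
    mentioned above. Illustration only: for brevity, the "hash table" is
    a direct-indexed array over small integer page numbers; a real cache
    would hash arbitrary keys.]

        #include <stdio.h>
        #include <stdlib.h>

        #define CAPACITY 3
        #define MAX_KEY  100

        struct node {
            int key;
            struct node *prev, *next;
        };

        static struct node *head, *tail;      /* head = most recent */
        static struct node *table[MAX_KEY];   /* key -> node, or NULL */
        static int count;

        static void unlink_node(struct node *n) {
            if (n->prev) n->prev->next = n->next; else head = n->next;
            if (n->next) n->next->prev = n->prev; else tail = n->prev;
        }

        static void push_front(struct node *n) {
            n->prev = NULL; n->next = head;
            if (head) head->prev = n; else tail = n;
            head = n;
        }

        /* reference(key): a hit moves the node to the front; a miss
           evicts the tail (least recently used) if the cache is full */
        static void reference(int key) {
            struct node *n = table[key];
            if (n) {                      /* hit */
                unlink_node(n);
                push_front(n);
                return;
            }
            if (count == CAPACITY) {      /* miss + full: evict LRU */
                struct node *victim = tail;
                printf("evict %d\n", victim->key);
                unlink_node(victim);
                table[victim->key] = NULL;
                free(victim);
                count--;
            }
            n = malloc(sizeof *n);
            n->key = key;
            push_front(n);
            table[key] = n;
            count++;
        }

        int main(void) {
            int trace[] = {1, 2, 3, 1, 4, 2};   /* reference string */
            for (int i = 0; i < 6; i++)
                reference(trace[i]);
            /* with capacity 3: 4 evicts 2 (the LRU page), then the
               re-reference of 2 evicts 3 */
            return 0;
        }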
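    [Second aside: a sketch of CLOCK's sweep. Also illustration only: a
    real kernel reads and clears the referenced/accessed bit in the PTE;
    here it is modeled as a plain array.]

        #include <stdio.h>

        #define NFRAMES 4

        static int page_in[NFRAMES];     /* which page occupies a frame */
        static int referenced[NFRAMES];  /* set on access, cleared by hand */
        static int hand;

        /* called by "hardware" on every access to a resident page */
        static void touch(int frame) { referenced[frame] = 1; }

        /* pick a victim frame on a page fault */
        static int clock_evict(void) {
            for (;;) {
                if (referenced[hand] == 0) {   /* unused since last sweep */
                    int victim = hand;
                    hand = (hand + 1) % NFRAMES;
                    return victim;
                }
                referenced[hand] = 0;          /* give a second chance */
                hand = (hand + 1) % NFRAMES;
            }
        }

        int main(void) {
            for (int f = 0; f < NFRAMES; f++) page_in[f] = 10 + f;
            touch(0); touch(2);      /* pages in frames 0 and 2 were used */
            int victim = clock_evict();  /* clears frame 0, evicts frame 1 */
            printf("evict frame %d (page %d)\n", victim, page_in[victim]);
            return 0;
        }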
    --Summary:

        --optimal is known as OPT
        --LRU is usually a good approximation to optimal
        --implementing LRU in hardware or at the OS/hardware interface
          is a pain
        --so implement CLOCK ... a decent approximation to LRU, which is
          in turn a good approximation to OPT *assuming that the past is
          a good predictor of the future* (this assumption does not
          always hold!)

3. mmap

    --a syscall; a cool way to bring some ideas together.

    --recall some syscalls:

        fd = open(pathname, mode)
        write(fd, buf, sz)
        read(fd, buf, sz)

    --we've seen fds before, but what is an fd inside the kernel?

        --an index into a table maintained by the kernel on behalf of
          the process

    --syscall:

        void* mmap(void* addr, size_t len, int prot, int flags,
                   int fd, off_t offset);

    --means, roughly, "map the specified open file (fd) into a region of
      my virtual memory (close to addr, or at a kernel-selected place if
      addr is 0), and return a pointer to it" [see handout]

        NOTE: the "disk image" here is the file we've mmap()'ed, not the
        process's usual backing store. The idea is that mmap() lets the
        programmer "inject" pages from a regular file on disk into the
        process's backing store (which would otherwise be part of a swap
        file).

    --after this, loads and stores to addr[x] are equivalent to reading
      and writing the file at offset+x.

    --why is this cool?

        --example: mmap enables copying a file to stdout without
          transferring the data to user space (see handout, and the
          sketch at the very end of these notes)

        Question: mmap vs. normal read/write: which is faster? by how
        much?
        [Answer:
            --mmap is faster
            --"by how much" depends on the file size and the underlying
              OS and machine
            --for a ~1 GB file on my machine (macOS, M1 MacBook Air),
              mmap is 13.5x faster
        ]

        NOTE: the process never itself dereferences a pointer to the
        memory containing the file data.

        NOTE: this saves two sets of memory-to-memory copies
        (kernel-to-user, user-to-kernel), versus the "naive" solution of
        read()ing into a buffer in user space and then write()ing it
        out.

[Acknowledgments: Mike Walfish, David Mazieres, Mike Dahlin]
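[Aside, for reference: a minimal sketch of the copy-a-file-to-stdout idea
from section 3. This is a reconstruction, not necessarily the handout's
code; error handling and the zero-length-file case (mmap of length 0
fails) are omitted for brevity.]

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        if (argc != 2) {
            fprintf(stderr, "usage: %s file\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        /* map the whole file; a load from p[i] is a read of byte i */
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

        /* the process passes the pointer to the kernel but never
           dereferences it; no read() into a user-space buffer */
        write(STDOUT_FILENO, p, st.st_size);

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }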