Week 9.a.
CS 5600 11/02 2021

On the board
------------

1. x86-64 page table, continued

2. TLBs

3. Where does the OS live?

4. page fault

-------------------------------------------------------------

Admin:

    -if you're panicking over work/performance:

        --don't panic; there's still time left

        --workload intended to be heavy (that's life), and potentially
        cause you to recalibrate what it means to be challenged.
            -- good time to rethink your expectations of CS5600

        --do work on time management; some advice:
            -- Covey's four-quadrant TODO
            -- interrupts are bad for scheduling; they are equally bad for
               your time management (another type of scheduling)
            -- Worthwhile to watch Randy's time management lecture:
               https://www.youtube.com/watch?v=oTugjssqOT0&t=1934s&ab_channel=CarnegieMellonUniversity
               (The four-quadrant TODO appears at 21:00)

        --advice, in response to common anti-patterns:

            --"I work late at night and feel tired": Don't work when
            you're exhausted! (Wake up in the morning, stop working,
            whatever.) Working in a fog does more harm than good.

            --"I'm digging with a spoon because I don't have time to make
            a shovel": take the time to make the shovel! (Read the
            guides, internalize the tools, learn the syntax.)

            --"My job is to finish this lab and this exercise." No! Your
            job is to acquire the basics and figure out what's going on.
            If you have done that, the exercises do not take very long.

---------

1. x86-64 page table, continued

    x86 page table: translate a VA to a PA

    Practice:

    -- This is the standard x86 32-bit two-level page table structure
       (not x86-64; we use 32-bit for simplicity).

    -- The permission bits of page directory entries and page table
       entries are set to 0x7.
       (what does 0x7 mean? answer: page present, read-write, and
       user-mode; see handout week8.b. This means that the virtual
       addresses are valid, and that user programs can read (load) from
       and write (store) to the virtual address.)

    -- The memory pages are listed below. On the left side of the pages are
       their addresses.
       (For example, the address of the "top-left" memory block (4 bytes) is
       0xf0f02ffc, and its content is 0xf0f03007.)

    %cr3: 0xffff1000

               +------------+            +------------+            +------------+            +------------+
    0xf0f02ffc | 0xf00f3007 | 0xff005ffc | 0xbebeebee | 0xffff1ffc | 0xd5202007 | 0xffff5ffc | 0xdeadbeef |
               +------------+            +------------+            +------------+            +------------+
               |    ...     |            |    ...     |            |    ...     |            |    ...     |
               +------------+            +------------+            +------------+            +------------+
    0xf0f02800 | 0xff005007 | 0xff005800 | 0xf00f8000 | 0xffff1800 | 0xef005007 | 0xffff5800 | 0xff005000 |
               +------------+            +------------+            +------------+            +------------+
               |    ...     |            |    ...     |            |    ...     |            |    ...     |
               +------------+            +------------+            +------------+            +------------+
    0xf0f02000 | 0xffff5007 | 0xff005000 | 0xc5201000 | 0xffff1000 | 0xf0f02007 | 0xffff5000 | 0xc5202000 |
               +------------+            +------------+            +------------+            +------------+

    -- What's the output of the following C excerpt?

        int *ptr1 = (int *) 0x0;
        printf("%x\n", *ptr1);

        // this will be your HW5
        // int *ptr2 = (int *) 0x200ffc;
        // printf("%x %x\n", *ptr1, *ptr2);

       [Note: %x in printf means printing out the integer in hexadecimal
       format.]

       Answer: "c5202000", that is, the value 0xc5202000 (%x prints no
       "0x" prefix)
       [update 11/6: was "0xc5201000", which is wrong]

       In particular, here is the walk through the page tables:

         0x0 => [0][0][0] (10bit, 10bit, 12bit)
         [note: in x86-64, 0x0 will be organized as
          [9bit, 9bit, 9bit, 9bit, 12bit]]

         (%cr3) -> 0xffff1000 (L1 PT)
                     +--[index:0]-> 0xf0f02000 (L2 PT)
                                      +--[index:0]-> 0xffff5000 (data page)
                                                       + 0 (offset)
                                                       +--[PA]-> 0xffff5000

         The content of PA 0xffff5000 is "0xc5202000"

         Why "content"? Because the C code "*ptr1" means _dereferencing_
         the pointer "ptr1", namely fetching the memory content pointed
         to by "ptr1" (pointer = an address).

       --note: all addresses used during this walk (%cr3 and the page
         table entries) are physical addresses.

    page table entries (PTE): a bunch of bits, including
        dirty, accessed (set by hardware)
        present, U/S, R/W (set by OS)

    what will happen if the present bit is 0 but a program accesses the memory?
        [answer: page fault]

    what if the permission (U/S and R/W) is violated?
        [answer: again, page fault]

    Large pages:

        Can get 2MB (resp., 1GB) pages on x86: each L3 (resp., L2)
        page table entry now points to the page instead of another
        page table

        + page tables smaller, less page table walking
        - more wasted memory

        to enable this, set bit 7 (the PS bit)

        example: set the PS bit in an L3 table entry
            result is 2MB pages
            page walking is L1, L2, L3; no L4 page tables

2. TLB

    --so it looks like the CPU (specifically its MMU) has to go out to
    memory on every memory reference?

        --called "walking the page tables"

        --Question: to finish one memory access (e.g.,
        movq 0xbebeebee, %rax), how many physical pages does the CPU
        (or MMU) have to touch?
          [answer: 5; 4 for the L1/L2/L3/L4 page tables, and 1 for the
          data page]
          [Notice that registers, for example %rax, are *inside* the
          CPU. Registers are part of the CPU, not memory. (CPU and
          memory are two different chips in most PCs.)]

    --performance-wise, this is awful. to make this fast, we need a cache

    --TLB: translation lookaside buffer
        hardware that stores virtual address --> physical address
        mappings; the reason that all of this page table walking does
        not slow down the process too much

    --hardware managed? (x86, ARM.) hardware populates the TLB

    --software managed? (MIPS. OS's job is to load the TLB when the OS
    receives a "TLB miss". Not the same thing as a page fault.)
        [see today's handout]

    --questions:

        --does TLB miss imply page fault? (no!)

        --does page fault imply TLB miss? (no!)
          (imagine a page that is mapped read-only. a user-level
          process tries to write to it. the TLB knows about the
          mapping, so no TLB miss. But this is still a protection
          violation. To cut down on terminology, we will lump this
          kind of violation in with "page fault".)

    --x86:

        --Question: what happens to the TLB when %cr3 is loaded?
          does the kernel need to remove all the TLB entries?
          [answer: yes; called flushing the TLB]

        --can we flush individual entries in the TLB otherwise?
          [yes, INVLPG addr]

    -- TLB structures
        -- there are instruction TLBs, data TLBs, and shared TLBs
        -- also have 4KB page translations and large page (2MB)
           translations
        [see Intel Skylake:
         https://en.wikichip.org/wiki/intel/microarchitectures/cascade_lake]

3. Where does the OS live?

    First, background: kernel vs. application

    -- two CPU modes, many names
        -- "user mode" and "kernel/supervisor mode"
        -- "ring 3" and "ring 0"
        -- "restricted mode" and "privileged mode"

    -- How does the CPU tell the two modes apart?
       [answer: by two bits (called the CPL) in a register (the code
       segment register, CS). if CPL=0, then the code running is in
       "kernel mode"/"ring 0"; if CPL=3, then in "user mode"/"ring 3".
       Also, the CPL changes automatically when system call
       instructions (sysenter, sysexit) are executed.]

    -- What are the differences between the two modes?
        -- privileged instructions (for example, disabling interrupts,
           I/O instructions)
        -- read/write certain registers (like %cr3)
        -- memory access to pages with the U/S bit set to 0

       [if you want to know more about CPU modes, read:
        https://sites.google.com/site/masumzh/articles/x86-architecture-basics/x86-architecture-basics]

    Question: Where does the OS live?

    Option 1: In its own address space?

        -- Can't do this on most hardware (e.g., the syscall instruction
           won't switch address spaces)
        -- Also would make it harder to parse syscall arguments passed
           as pointers

    Option 2: the kernel is actually in the same address space as all
    processes (the choice of real systems)

        [see handout for picture]

        * not precisely true post-Meltdown, but close enough (in that
          some of the kernel is mapped into all user processes).
        [spectre: speculation + timing channel (simplified, in pseudocode)

          -- password is a secret and process 1 cannot read it

          -- process 1 code:
               if (read_first_letter(password) == 'C') {
                   touch valid memory Y
               }

          -- process 2 times its own access to memory Y
               -- if fast: the first letter of password is 'C'
               -- if slow: the first letter of password is not 'C'
        ]

        -- Use protection bits to prohibit user code from
           reading/writing the kernel

        -- Typically all kernel text and most kernel data are at the
           same VA in *every* address space (every process has virtual
           addresses that map to the physical memory that stores the
           kernel's instructions and data)

        -- In Linux, the kernel is mapped at the top of the address
           space, along with per-process data structures.

        -- Physical memory is also mapped up top, which gives the kernel
           a convenient way to access physical memory.

           NOTE: that means that physical memory that is in use is
           mapped in at least two places (once into a process's virtual
           address space and once into this upper region of the virtual
           space).
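
    [A sketch for self-study, not part of the handout: the two-level walk
    from the practice problem in section 1, combined with the present,
    R/W, and U/S checks that keep user code out of kernel pages, can be
    simulated in a few lines of C. Assumptions: the names phys_mem,
    phys_read32, entry_allows, and translate are hypothetical; the
    simulated memory holds only the three words needed for VA 0x0 (all
    other words read as 0, i.e. "not present"); the R/W bit is enforced
    in both modes, which is a simplification of real x86 behavior.]

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Low flag bits of a 32-bit x86 page directory / page table entry. */
#define PTE_P 0x1u /* present */
#define PTE_W 0x2u /* read/write: 1 = writable */
#define PTE_U 0x4u /* user/supervisor: 1 = user-accessible */

/* Simulated physical memory (hypothetical structure): only the words
 * from the practice diagram that the walk for VA 0x0 needs. */
struct phys_word { uint32_t pa; uint32_t val; };

static const struct phys_word phys_mem[] = {
    { 0xffff1000u, 0xf0f02007u }, /* L1 page table, entry 0 */
    { 0xf0f02000u, 0xffff5007u }, /* L2 page table, entry 0 */
    { 0xffff5000u, 0xc5202000u }, /* data page, offset 0    */
};

/* Read one 4-byte word of simulated physical memory.
 * Words not listed above read as 0 (so the entry is "not present"). */
uint32_t phys_read32(uint32_t pa)
{
    for (size_t i = 0; i < sizeof phys_mem / sizeof phys_mem[0]; i++)
        if (phys_mem[i].pa == pa)
            return phys_mem[i].val;
    return 0;
}

/* True if the entry permits the access; false models a page fault.
 * cpl is 0 (kernel) or 3 (user). Simplification: R/W applies to both. */
bool entry_allows(uint32_t entry, int cpl, bool is_write)
{
    if (!(entry & PTE_P))
        return false;                 /* page not present */
    if (cpl == 3 && !(entry & PTE_U))
        return false;                 /* user code, supervisor page */
    if (is_write && !(entry & PTE_W))
        return false;                 /* store to a read-only page */
    return true;
}

/* Walk the two-level table rooted at cr3 for virtual address va.
 * On success, store the physical address in *pa and return true;
 * return false where the hardware would raise a page fault. */
bool translate(uint32_t cr3, uint32_t va, int cpl, bool is_write,
               uint32_t *pa)
{
    uint32_t l1_index = (va >> 22) & 0x3ffu; /* top 10 bits  */
    uint32_t l2_index = (va >> 12) & 0x3ffu; /* next 10 bits */
    uint32_t offset   = va & 0xfffu;         /* low 12 bits  */

    /* Each entry is 4 bytes; the low 12 bits of an entry are flags. */
    uint32_t l1_entry = phys_read32((cr3 & ~0xfffu) + 4u * l1_index);
    if (!entry_allows(l1_entry, cpl, is_write))
        return false;

    uint32_t l2_entry = phys_read32((l1_entry & ~0xfffu) + 4u * l2_index);
    if (!entry_allows(l2_entry, cpl, is_write))
        return false;

    *pa = (l2_entry & ~0xfffu) + offset;
    return true;
}
```

    [Usage: translate(0xffff1000, 0x0, 3, false, &pa) returns true with
    pa == 0xffff5000, and phys_read32(pa) then returns 0xc5202000, which
    matches the walk in section 1. Any address whose entries are missing
    from phys_mem is reported as a fault. Note that every address the
    walk itself reads is physical, which is one reason a kernel that
    inspects page tables in software benefits from having physical
    memory mapped into its own virtual address space.]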