Week 10.b
CS 5600 03/15 2023

1. page table practice
2. TLBs
3. Where does the OS live?

-------------------------------------------------

Admin:
    - Lab4 is released

---------

0. Last time

--x86-64 virtual address translation
    --VA: 48 bits
    --PA: 52 bits
    --translation: 4-level page table

Q: page table pages form a tree. Where is the root of this tree?
[%cr3, or more precisely the physical page pointed to by %cr3]

1. page table practice

A. memory that different PTEs can address

 --Question: how much memory can one L1 page table entry address?
 --answer: each entry in the L1 page table corresponds to 512 GB of
   virtual address space ("corresponds to" means "selects the
   next-level page tables that actually govern the mapping").

   for the others:
   --each entry in the L2 page table corresponds to 1 GB of virtual
     address space
   --each entry in the L3 page table corresponds to 2 MB of virtual
     address space
   --each entry in the L4 page table corresponds to 1 page (4 KB) of
     virtual address space

 --Question: so how much virtual memory is each L4 page *table*
   responsible for translating? 4 KB? 2 MB? 1 GB?
   [answer: 2 MB, since an L4 table holds 512 entries of 4 KB each]

 --each page table itself consumes 4 KB of physical memory, i.e.,
   each one of these fits on a page

B. Allocating memory [from cs61, 2018]
   https://cs61.seas.harvard.edu/site/2018/Section4/

   What is the minimum number of physical pages required on x86-64
   to satisfy each of the following allocations? Draw an example
   page table mapping for each scenario (start from scratch each
   time).

   1 byte of memory
     = [5 phys pages: 4 page-table pages (one per level) + 1 data page]

   1 allocation of size 2^12 bytes of memory
     = [5 phys pages]

   2^9 allocations of size 2^12 bytes each
     = [512 + 4 = 516 phys pages: one L4 table exactly covers
        512 data pages]

   2^9 + 1 allocations of size 2^12 bytes each
     = [512 + 4 + (1 + 1) = 518 phys pages: the extra data page needs
        a second L4 table, which in turn needs a second L3 entry's
        worth of bookkeeping, hence (1 + 1)]

   2^18 + 1 allocations of size 2^12 bytes each
     = [1 (L1) + 1 (L2) + 2 (L3) + (2^9 + 1) (L4)
        + (2^18 + 1) (the memory itself)]

C.
page table walk

   x86 page table: translate a VA to a PA

   Practice:
   -- This is the standard x86 32-bit two-level page table structure
      (not x86-64; we use 32-bit for simplicity).
   -- The permission bits of page directory entries and page table
      entries are set to 0x7.
      (what does 0x7 mean? answer: page present, read-write, and
      user-mode; see handout week8.b. This means that the virtual
      addresses are valid, and that user programs can read (load)
      from and write (store) to them.)
   -- The memory pages are listed below. On the left side of the
      pages are their addresses. (For example, the address of the
      "top-left" memory word (4 bytes) is 0xf0f02ffc, and its
      content is 0xf00f3007.)

   %cr3: 0xffff1000

              +------------+              +------------+
   0xf0f02ffc | 0xf00f3007 |   0xff005ffc | 0xbebeebee |
              +------------+              +------------+
              |    ...     |              |    ...     |
              +------------+              +------------+
   0xf0f02800 | 0xff005007 |   0xff005800 | 0xf00f8000 |
              +------------+              +------------+
              |    ...     |              |    ...     |
              +------------+              +------------+
   0xf0f02000 | 0xffff5007 |   0xff005000 | 0xc5201000 |
              +------------+              +------------+

              +------------+              +------------+
   0xffff1ffc | 0xd5202007 |   0xffff5ffc | 0xdeadbeef |
              +------------+              +------------+
              |    ...     |              |    ...     |
              +------------+              +------------+
   0xffff1800 | 0xef005007 |   0xffff5800 | 0xff005000 |
              +------------+              +------------+
              |    ...     |              |    ...     |
              +------------+              +------------+
   0xffff1000 | 0xf0f02007 |   0xffff5000 | 0xc5202000 |
              +------------+              +------------+

   -- What's the output of the following C excerpt?

        int *ptr1 = (int *) 0x0;
        printf("%x\n", *ptr1);

        // this will be your homework
        // int *ptr2 = (int *) 0x200ffc;
        // printf("%x\n", *ptr2);

   [Note: %x in printf means printing out the integer in hexadecimal
   format.]
   Answer: "c5202000"

   In particular, here is the page table walk:

   0x0 => [0][0][0]  (10 bits, 10 bits, 12 bits)

   [note: in x86-64, 0x0 would instead be split as
    [9 bits, 9 bits, 9 bits, 9 bits, 12 bits]]

   (%cr3) -> 0xffff1000 (L1 PT)
              +--[index:0]-> 0xf0f02000 (L2 PT)
                             +--[index:0]-> 0xffff5000 (data page)
                                            + 0 (offset)
                                            +--[PA]-> 0xffff5000

   The content of PA 0xffff5000 is "0xc5202000".

   Why "content"? Because the C code "*ptr1" means _dereferencing_
   the pointer "ptr1", namely fetching the memory content pointed to
   by "ptr1" (a pointer is an address).

   --note: all addresses used during this walk are physical
     addresses.

2. TLB

--so it looks like the CPU (specifically its MMU) has to go out to
  memory on every memory reference?
    --called "walking the page tables"

--Question: to finish one memory access on x86-64 (e.g.,
  movq (0xbebeebee), %rax), how many physical pages does the CPU
  (or MMU) have to touch?
  [answer: 5 (assuming the instruction is already fetched):
   4 for the L1/L2/L3/L4 page tables, and 1 for the data page]

--performance-wise, this is awful. to make it fast, we need a cache

--TLB: translation lookaside buffer
    hardware that caches virtual address --> physical address
    translations; the reason that all of this page table walking
    does not slow the process down too much

--Who controls the TLB?
    --hardware managed? (x86, ARM.) hardware populates the TLB
    --software managed? (MIPS. the OS's job is to load the TLB when
      it receives a "TLB miss". Not the same thing as a page fault.)
--TLB is one type of cache

-- CPU caches [see today's handout]

   common parameters:
   * cache line size (usually 64 B on x86)
   * 2^s sets (s is the number of address bits used to select a set)
   * E-way (number of cache lines in each set)
     (for example, 8-way means that there are 8 cache lines in one
     set)

   given an address, split it into:

   | tag | index | offset |

   --the index picks the set
   --the offset chooses bytes within one cache line
   --the tag is compared to decide whether there is a cache hit

-- an example: a cache with
   * cache line size: 64 B (=> offset is 6 bits)
   * 64 sets (=> index is 6 bits)
   * 8-way
   (this is the L1 d-cache in the handout's end-to-end Core i7
   address translation)

   Assume the memory at address 0xffffff is cached. How do we locate
   where the data is?
   [go through cache read on handout]

-- TLB structures
   -- there are an instruction TLB, a data TLB, and a shared TLB
   -- there are also separate translations for 4 KB pages and large
      (2 MB) pages

   a data TLB that your computer might use:
     4 KB page: 64 entries; 4-way set associative
     [this is the handout's TLB]

   Question: if this TLB is full, how much memory's VA translation
   has been cached?
   [64 * 4 KB = 256 KB.
    this means that if your program's data fits in 256 KB, then
    after warming up, it will likely never encounter a data TLB
    miss!!]

   [TLB sizes (for those who are interested):
      instruction TLB:
        4 KB page: 128 entries; 8-way set associative
        2 MB page: 8 entries; fully associative
      data TLB:
        4 KB page: 64 entries; 4-way set associative
        2 MB page: 32 entries; 4-way set associative
        1 GB page: 4 entries; 4-way set associative
      shared TLB:
        4 KB + 2 MB page: 1536 entries; 12-way set associative
        1 GB page: 16 entries; 4-way set associative
    see also Intel Cascade Lake:
    https://en.wikichip.org/wiki/intel/microarchitectures/cascade_lake]

--x86:
   --Question: what happens to the TLB when %cr3 is loaded? does the
     kernel need to remove all the TLB entries?
     [answer: yes; the entries are discarded. this is called
      flushing the TLB]
   --can we flush individual entries in the TLB otherwise?
   [yes, INVLPG addr]

   --Question: should the TLB also cache the R/W and U/S bits from
     the PTE?
     [Yes! Otherwise, the CPU would be unable to enforce isolation
      and permissions on TLB hits.]

3. Where does the OS live?

First, kernel vs. application

-- two modes, many names
   -- "user mode" and "kernel/supervisor mode"
   -- "ring 3" and "ring 0"
   -- "restricted mode" and "privileged mode"

-- How does the CPU distinguish the two modes?
   [answer: by two bits (called the CPL) in a register (the code
    selector register, CS). if CPL=0, then the code running is in
    "kernel mode"/"ring 0"; if CPL=3, then it is in "user
    mode"/"ring 3". The CPL also changes automatically when system
    call instructions (sysenter, sysexit) execute.]

-- What are the differences between the two modes?
   -- memory access to pages with the U/S bit set to 0
   -- reading/writing privileged registers (like %cr3)
   -- privileged instructions (for example, disabling interrupts,
      I/O instructions)

   [if you want to know more about CPU modes, read:
    https://sites.google.com/site/masumzh/articles/x86-architecture-basics/x86-architecture-basics]

Question: Where does the OS live?

Option 1: in its own address space?
   -- would be super expensive: every syscall would require an
      address space switch
   -- on most hardware, the syscall instruction won't switch address
      spaces
   -- also would make it harder to parse syscall arguments passed as
      pointers
   -- WeensyOS uses it; see slides

Option 2: the kernel is actually in the same address space as all
processes (the choice of real systems) [see handout for picture]

   * not precisely true post-Meltdown, but close enough (in that
     some of the kernel is mapped into all user processes).

   -- use protection bits to prohibit user code from reading/writing
      the kernel
   -- typically all kernel text and most kernel data live at the
      same VA in *every* address space (every process has virtual
      addresses that map to the physical memory that stores the
      kernel's instructions and data)
   -- in Linux, the kernel is mapped at the top of the address
      space, along with per-process data structures.
   -- physical memory is also mapped up top, which gives the kernel
      a convenient way to access physical memory.

      NOTE: that means that physical memory that is in use is mapped
      in at least two places (once into a process's virtual address
      space and once into this upper region of the virtual space).

[Acknowledgments: Mike Walfish, Aurojit Panda, David Mazieres, Mike
Dahlin]