Week 10.b
CS 5600 03/15 2023

1. page table practice
2. TLBs
3. Where does the OS live?

-------------------------------------------------

Admin:
    - Lab4 is released

---------

0. Last time

--x86-64 virtual address translation
    --VA: 48 bits
    --PA: 52 bits
    --translation: 4-level page table

Q: page table pages form a tree. Where is the root of this tree?
[%cr3, or more precisely the physical page pointed to by %cr3]

1. page table practice

A. memory that different PTEs can address

 --Question: how much memory can one L1 page table entry address?
 --answer: each entry in the L1 page table corresponds to 512 GB of
   virtual address space ("corresponds to" means "selects the
   next-level page tables that actually govern the mapping").

   for the others:
   --each entry in the L2 page table corresponds to 1 GB of virtual
     address space
   --each entry in the L3 page table corresponds to 2 MB of virtual
     address space
   --each entry in the L4 page table corresponds to 1 page (4 KB) of
     virtual address space

 --Question: so how much virtual memory is each L4 page *table*
   responsible for translating? 4 KB? 2 MB? 1 GB?
   [answer: 2 MB, since an L4 table holds 512 entries of 4 KB each]

 --each page table itself consumes 4 KB of physical memory, i.e.,
   each one of these fits on a page

B. Allocating memory [from cs61, 2018]
   https://cs61.seas.harvard.edu/site/2018/Section4/

   What is the minimum number of physical pages required on x86-64
   to satisfy each of the following allocations? Draw an example
   page table mapping for each scenario (start from scratch each
   time).

   1 byte of memory
     = [5 phys pages: 4 page-table pages (one per level) + 1 data page]

   1 allocation of size 2^12 bytes of memory
     = [5 phys pages]

   2^9 allocations of size 2^12 bytes each
     = [512 + 4 = 516 phys pages: one L4 table exactly covers
        512 data pages]

   2^9 + 1 allocations of size 2^12 bytes each
     = [512 + 4 + (1 + 1) = 518 phys pages: the extra data page needs
        a second L4 table, which in turn needs a second L3 entry's
        worth of bookkeeping, hence (1 + 1)]

   2^18 + 1 allocations of size 2^12 bytes each
     = [1 (L1) + 1 (L2) + 2 (L3) + (2^9 + 1) (L4)
        + (2^18 + 1) (the memory itself)]

C.
page table walk

   x86 page table: translate a VA to a PA

   Practice:
   -- This is the standard x86 32-bit two-level page table structure
      (not x86-64; we use 32-bit for simplicity).
   -- The permission bits of page directory entries and page table
      entries are set to 0x7.
      (what does 0x7 mean? answer: page present, read-write, and
      user-mode; see handout week8.b. This means that the virtual
      addresses are valid, and that user programs can read (load)
      from and write (store) to them.)
   -- The memory pages are listed below. On the left side of the
      pages are their addresses. (For example, the address of the
      "top-left" memory word (4 bytes) is 0xf0f02ffc, and its
      content is 0xf00f3007.)

   %cr3: 0xffff1000

              +------------+              +------------+
   0xf0f02ffc | 0xf00f3007 |   0xff005ffc | 0xbebeebee |
              +------------+              +------------+
              |    ...     |              |    ...     |
              +------------+              +------------+
   0xf0f02800 | 0xff005007 |   0xff005800 | 0xf00f8000 |
              +------------+              +------------+
              |    ...     |              |    ...     |
              +------------+              +------------+
   0xf0f02000 | 0xffff5007 |   0xff005000 | 0xc5201000 |
              +------------+              +------------+

              +------------+              +------------+
   0xffff1ffc | 0xd5202007 |   0xffff5ffc | 0xdeadbeef |
              +------------+              +------------+
              |    ...     |              |    ...     |
              +------------+              +------------+
   0xffff1800 | 0xef005007 |   0xffff5800 | 0xff005000 |
              +------------+              +------------+
              |    ...     |              |    ...     |
              +------------+              +------------+
   0xffff1000 | 0xf0f02007 |   0xffff5000 | 0xc5202000 |
              +------------+              +------------+

   -- What's the output of the following C excerpt?

        int *ptr1 = (int *) 0x0;
        printf("%x\n", *ptr1);

        // this will be your homework
        // int *ptr2 = (int *) 0x200ffc;
        // printf("%x\n", *ptr2);

   [Note: %x in printf means printing out the integer in hexadecimal
   format.]
   Answer: "c5202000"

   In particular, here is the page table walk:

   0x0 => [0][0][0]  (10 bits, 10 bits, 12 bits)

   [note: in x86-64, 0x0 would instead be split as
    [9 bits, 9 bits, 9 bits, 9 bits, 12 bits]]

   (%cr3) -> 0xffff1000 (L1 PT)
              +--[index:0]-> 0xf0f02000 (L2 PT)
                             +--[index:0]-> 0xffff5000 (data page)
                                            + 0 (offset)
                                            +--[PA]-> 0xffff5000

   The content of PA 0xffff5000 is "0xc5202000".

   Why "content"? Because the C code "*ptr1" means _dereferencing_
   the pointer "ptr1", namely fetching the memory content pointed to
   by "ptr1" (a pointer is an address).

   --note: all addresses used during this walk are physical
     addresses.

2. TLB

--so it looks like the CPU (specifically its MMU) has to go out to
  memory on every memory reference?
    --called "walking the page tables"

--Question: to finish one memory access on x86-64 (e.g.,
  movq (0xbebeebee), %rax), how many physical pages does the CPU
  (or MMU) have to touch?
  [answer: 5 (assuming the instruction is already fetched):
   4 for the L1/L2/L3/L4 page tables, and 1 for the data page]

--performance-wise, this is awful. to make it fast, we need a cache

--TLB: translation lookaside buffer
    hardware that caches virtual address --> physical address
    translations; the reason that all of this page table walking
    does not slow the process down too much

--Who controls the TLB?
    --hardware managed? (x86, ARM.) hardware populates the TLB
    --software managed? (MIPS. the OS's job is to load the TLB when
      it receives a "TLB miss". Not the same thing as a page fault.)
--TLB is one type of cache

-- CPU caches [see today's handout]

   common parameters:
   * cache line size (usually 64 B on x86)
   * 2^s sets (s is the number of address bits used to select a set)
   * E-way (number of cache lines in each set)
     (for example, 8-way means that there are 8 cache lines in one
     set)

   given an address, split it into:

   | tag | index | offset |

   --the index picks the set
   --the offset chooses bytes within one cache line
   --the tag is compared to decide whether there is a cache hit

-- an example: a cache with
   * cache line size: 64 B (=> offset is 6 bits)
   * 64 sets (=> index is 6 bits)
   * 8-way
   (this is the L1 d-cache in the handout's end-to-end Core i7
   address translation)

   Assume the memory at address 0xffffff is cached. How do we locate
   where the data is?
   [go through cache read on handout]

-- TLB structures
   -- there are an instruction TLB, a data TLB, and a shared TLB
   -- there are also separate translations for 4 KB pages and large
      (2 MB) pages

   a data TLB that your computer might use:
     4 KB page: 64 entries; 4-way set associative
     [this is the handout's TLB]

   Question: if this TLB is full, how much memory's VA translation
   has been cached?
   [64 * 4 KB = 256 KB.
    this means that if your program's data fits in 256 KB, then
    after warming up, it will likely never encounter a data TLB
    miss!!]

   [TLB sizes (for those who are interested):
      instruction TLB:
        4 KB page: 128 entries; 8-way set associative
        2 MB page: 8 entries; fully associative
      data TLB:
        4 KB page: 64 entries; 4-way set associative
        2 MB page: 32 entries; 4-way set associative
        1 GB page: 4 entries; 4-way set associative
      shared TLB:
        4 KB + 2 MB page: 1536 entries; 12-way set associative
        1 GB page: 16 entries; 4-way set associative
    see also Intel Cascade Lake:
    https://en.wikichip.org/wiki/intel/microarchitectures/cascade_lake]

--x86:
   --Question: what happens to the TLB when %cr3 is loaded? does the
     kernel need to remove all the TLB entries?
     [answer: yes; the entries are discarded. this is called
      flushing the TLB]
   --can we flush individual entries in the TLB otherwise?
   [yes, INVLPG addr]

   --Question: should the TLB also cache the R/W and U/S bits from
     the PTE?
     [Yes! Otherwise, the CPU would be unable to enforce isolation
      and permissions on TLB hits.]

3. Where does the OS live?

First, kernel vs. application

-- two modes, many names
   -- "user mode" and "kernel/supervisor mode"
   -- "ring 3" and "ring 0"
   -- "restricted mode" and "privileged mode"

-- How does the CPU distinguish the two modes?
   [answer: by two bits (called the CPL) in a register (the code
    selector register, CS). if CPL=0, then the code running is in
    "kernel mode"/"ring 0"; if CPL=3, then it is in "user
    mode"/"ring 3". The CPL also changes automatically when system
    call instructions (sysenter, sysexit) execute.]

-- What are the differences between the two modes?
   -- memory access to pages with the U/S bit set to 0
   -- reading/writing privileged registers (like %cr3)
   -- privileged instructions (for example, disabling interrupts,
      I/O instructions)

   [if you want to know more about CPU modes, read:
    https://sites.google.com/site/masumzh/articles/x86-architecture-basics/x86-architecture-basics]

Question: Where does the OS live?

Option 1: in its own address space?
   -- would be super expensive: every syscall would require an
      address space switch
   -- on most hardware, the syscall instruction won't switch address
      spaces
   -- also would make it harder to parse syscall arguments passed as
      pointers
   -- WeensyOS uses it; see slides

Option 2: the kernel is actually in the same address space as all
processes (the choice of real systems) [see handout for picture]

   * not precisely true post-Meltdown, but close enough (in that
     some of the kernel is mapped into all user processes).

   -- use protection bits to prohibit user code from reading/writing
      the kernel
   -- typically all kernel text and most kernel data live at the
      same VA in *every* address space (every process has virtual
      addresses that map to the physical memory that stores the
      kernel's instructions and data)
   -- in Linux, the kernel is mapped at the top of the address
      space, along with per-process data structures.
   -- physical memory is also mapped up top, which gives the kernel
      a convenient way to access physical memory.

      NOTE: that means that physical memory that is in use is mapped
      in at least two places (once into a process's virtual address
      space and once into this upper region of the virtual space).

[Acknowledgments: Mike Walfish, Aurojit Panda, David Mazieres, Mike
Dahlin]