Week 10.b
CS 5600
03/23 2022

On the board
------------

1. TLBs
2. Where does the OS live?
3. Meltdown and Spectre

---------------------------------------------------------------------

Admin:

-midterm private challenge today

---------

0. Last time

  --x86-64 virtual address translation
    --VA: 48bits
    --PA: 52bits
    --translation: 4-level page table

  Q: page table pages form a tree. Where is the root of this tree?
  [%cr3 holds the physical address of the root page-table page]
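
  [A minimal sketch of how a 48-bit VA is carved up for the 4-level
  walk: four 9-bit table indices plus a 12-bit page offset. The struct,
  the function name, and the Intel-style field names (pml4/pdpt/pd/pt)
  are assumptions for illustration; the notes above only say
  "4-level page table".]

```c
#include <stdint.h>

/* Pieces of a 48-bit x86-64 virtual address under 4-level paging.
 * Names are illustrative (Intel-style), not taken from the notes. */
struct va_parts {
    unsigned pml4;    /* bits 47..39: index into the root table (found via %cr3) */
    unsigned pdpt;    /* bits 38..30: index into the second-level table */
    unsigned pd;      /* bits 29..21: index into the third-level table */
    unsigned pt;      /* bits 20..12: index into the last-level table */
    unsigned offset;  /* bits 11..0:  byte within the 4KB page */
};

/* Split a virtual address into its four 9-bit indices and 12-bit offset. */
struct va_parts split_va(uint64_t va)
{
    struct va_parts p;
    p.pml4   = (unsigned)((va >> 39) & 0x1ff);
    p.pdpt   = (unsigned)((va >> 30) & 0x1ff);
    p.pd     = (unsigned)((va >> 21) & 0x1ff);
    p.pt     = (unsigned)((va >> 12) & 0x1ff);
    p.offset = (unsigned)(va & 0xfff);
    return p;
}
```

  [Each 9-bit index selects one of 512 entries in a 4KB page-table
  page (512 entries * 8 bytes = 4KB).]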


1. TLB

    --so it looks like the CPU (specifically its MMU) has to go out
    to memory on every memory reference?
        --called "walking the page tables"

    --Question: to finish one memory access (e.g., movq (0xbebeebee), %rax),
    how many physical pages does the CPU (or MMU) have to touch?

      [answer: 5 (assuming the instruction is already fetched):
      4 for the L1/2/3/4 page tables, and 1 for the data page]

    --performance-wise, this is awful.
      to make this fast, we need a cache

    --TLB: translation lookaside buffer

    hardware that caches virtual page --> physical page translations;
    it is the reason that all of this page table walking does not slow
    down the process too much

    --Who controls the TLB?

        --hardware managed? (x86, ARM.) hardware populates TLB

        --software managed? (MIPS. OS's job is to load the TLB when
        the OS receives a "TLB miss". Not the same thing as a page
        fault.)

    --TLB is one type of cache

      ** Crash course on CPU caches
      [see today's handout]

       common parameters:
         * cache line size (usually, 64B for x86)
         * 2^s sets  (s is the number of address bits used to select a set)
         * E-way     (number of cache lines in each set)
           (for example, 8-way means that there are 8 cache lines in one set)

      given an address, split it into:
         | tag | index | offset |

      --the index picks the set
      --the offset picks the bytes within one cache line
      --the tag is compared against the tags stored in that set to
        decide whether this is a cache hit
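
      [The split above can be sketched in a few lines of C. The struct
      and function names are made up for illustration; b is log2 of the
      cache line size (6 for 64B lines) and s is the number of set-index
      bits. The sketch assumes b + s < 64.]

```c
#include <stdint.h>

/* The three fields of an address as seen by a set-associative cache. */
struct cache_addr {
    uint64_t tag;     /* compared against stored tags to detect a hit */
    uint64_t index;   /* selects one of the 2^s sets */
    uint64_t offset;  /* selects a byte within the cache line */
};

/* b = log2(cache line size in bytes), s = log2(number of sets).
 * Assumes b + s < 64 so the shifts are well defined. */
struct cache_addr split_cache_addr(uint64_t addr, unsigned b, unsigned s)
{
    struct cache_addr c;
    c.offset = addr & ((1ULL << b) - 1);
    c.index  = (addr >> b) & ((1ULL << s) - 1);
    c.tag    = addr >> (b + s);
    return c;
}
```

      [For example, with 64B lines and 64 sets (b = 6, s = 6), address
      0x12345678 splits into tag 0x12345, index 0x19, offset 0x38.]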


    -- TLB structures

       -- there are instruction TLBs, data TLBs, and shared TLBs
       -- there are separate entries for 4KB page translations and large page (2MB) translations

       data TLB that your computer might use:
          4 KB page: 64 entries; 4-way set associative
          [this is handout's TLB]

       Question: if the TLB is full, how much memory do the cached
       translations cover?
        [64*4KB = 256KB
         this means that if your program's data fits in 256KB, then after
         warming up, it will likely never encounter a data TLB miss!!
        ]
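
       [The arithmetic above, as a small sketch. The function names are
       hypothetical. Picking the set from the low bits of the virtual
       page number is an assumption made for illustration; real
       hardware may index its TLB differently.]

```c
#include <stdint.h>

/* How much memory a full TLB can translate without missing. */
uint64_t tlb_reach(uint64_t entries, uint64_t page_size)
{
    return entries * page_size;
}

/* Number of sets in a set-associative TLB. */
uint64_t tlb_sets(uint64_t entries, uint64_t ways)
{
    return entries / ways;
}

/* Which set a virtual address would land in, ASSUMING the set is
 * chosen by the low bits of the virtual page number. */
uint64_t tlb_set_index(uint64_t va, uint64_t page_size, uint64_t sets)
{
    return (va / page_size) % sets;
}
```

       [For the handout's data TLB (64 entries, 4-way, 4KB pages):
       tlb_reach(64, 4096) is 262144 bytes = 256KB, and
       tlb_sets(64, 4) is 16 sets.]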

     [ TLB Sizes (for those who are interested)

        instruction TLB:
          4 KB page: 128 entries; 8-way set associative
          2 MB page: 8 entries; fully associative

        data TLB:
          4 KB page: 64 entries; 4-way set associative
          2 MB page: 32 entries; 4-way set associative
          1 GB page: 4 entries; 4-way set associative

        shared TLB:
          4 KB + 2 MB page: 1536 entries; 12-way set associative
          1 GB page: 16 entries; 4-way set associative

       see also Intel Cascade Lake (a Skylake-based microarchitecture):
        https://en.wikichip.org/wiki/intel/microarchitectures/cascade_lake
    ]

    --questions about page faults vs. TLB misses:

      --recall page faults:
        --access invalid memory (P=0) 
        --or fail permission checks (write a RO page)

      --does TLB miss imply page fault? (no!)

      --does page fault imply TLB miss? (no!)
          (imagine a page that is mapped read-only. user-level
          process tries to write to it. TLB knows about the mapping,
          so no TLB miss. But this is still a protection violation.
          To cut down on terminology, we will lump this kind of
          violation in with "page fault".)
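
      [A toy model of the two answers above, showing that "TLB miss"
      and "page fault" are independent outcomes. Everything here (the
      struct, the function, the boolean inputs) is a hypothetical
      simplification, not how any real MMU is implemented.]

```c
#include <stdbool.h>

struct outcome {
    bool tlb_miss;    /* MMU had to walk the page tables */
    bool page_fault;  /* access was rejected (P=0 or permission) */
};

/* in_tlb:   translation is already cached in the TLB
 * present:  PTE has P=1 (only consulted on a TLB miss, since
 *           non-present translations are not cached)
 * writable: mapping allows writes
 * is_write: this access is a store */
struct outcome simulate_access(bool in_tlb, bool present,
                               bool writable, bool is_write)
{
    struct outcome o = { false, false };
    if (!in_tlb) {
        o.tlb_miss = true;          /* walk the page tables */
        if (!present) {
            o.page_fault = true;    /* nothing mapped here */
            return o;
        }
    }
    if (is_write && !writable)
        o.page_fault = true;        /* protection violation */
    return o;
}
```

      [A read of a mapped but uncached page gives a miss with no fault;
      a write to a cached read-only mapping gives a fault with no
      miss.]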

    --x86:

        --Question: what happens to the TLB when %cr3 is loaded?
          do all the TLB entries need to be removed?
          [answer: yes; called flushing the TLB]

        --can we flush individual entries in the TLB otherwise? 
          [yes, INVLPG addr]

        --Question: should the TLB also cache the R/W and U/S bits in the PTE?
          [Yes! Otherwise, the CPU is unable to enforce isolation and
          permissions.]
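
        [A sketch of the check that the cached bits make possible on a
        TLB hit. The bit positions (P = bit 0, R/W = bit 1, U/S = bit 2)
        are the x86 PTE layout; the macro and function names are
        hypothetical. This is simplified: it assumes kernel writes also
        respect R/W (CR0.WP set) and ignores other checks such as NX.]

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_P 0x1ULL  /* bit 0: present */
#define PTE_W 0x2ULL  /* bit 1: R/W, 1 = writable */
#define PTE_U 0x4ULL  /* bit 2: U/S, 1 = user-accessible */

/* Decide whether an access is allowed, using only the flag bits
 * that a TLB entry would have cached alongside the translation. */
bool access_allowed(uint64_t flags, bool is_write, bool from_user)
{
    if (!(flags & PTE_P))
        return false;               /* not mapped */
    if (from_user && !(flags & PTE_U))
        return false;               /* kernel-only page */
    if (is_write && !(flags & PTE_W))
        return false;               /* read-only page (assumes CR0.WP=1) */
    return true;
}
```

        [Without the cached U/S bit, a TLB hit on a kernel page would
        look the same to the hardware as a hit on a user page.]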


2. Where does the OS live?

    First, kernel vs. application

      -- two modes, many names
        -- "user mode" and "kernel/supervisor mode"
        -- "ring 3" and "ring 0"
        -- "restricted mode" and "privileged mode"

      -- How does the CPU distinguish the two modes?
        [answer: by two bits (called CPL) in a register (the code segment register, CS).
            if CPL=0, then the code running is in "kernel mode"/"ring 0";
            if CPL=3, then in "user mode"/"ring 3".

        Also, CPL changes automatically when system call instructions
        (e.g., sysenter/sysexit or syscall/sysret) are executed.]
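
        [A tiny sketch of reading CPL out of a CS value: it is the low
        two bits of the selector. The function names are hypothetical.
        The selector values mentioned afterwards (0x33 for user code,
        0x10 for kernel code) are what x86-64 Linux typically uses and
        are given only as examples.]

```c
#include <stdbool.h>
#include <stdint.h>

/* CPL is stored in the low two bits of the CS selector. */
unsigned cpl_of(uint16_t cs)
{
    return cs & 0x3;
}

/* CPL 0 is kernel mode ("ring 0"); CPL 3 is user mode ("ring 3"). */
bool is_kernel_mode(uint16_t cs)
{
    return cpl_of(cs) == 0;
}
```

        [For example, cpl_of(0x33) is 3 (user mode) and cpl_of(0x10)
        is 0 (kernel mode).]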


      -- What are the differences between the two modes?
        -- memory access to pages with U/S bit set to 0
        -- read/write registers (like %cr3)
        -- privileged instructions (for example, disabling interrupts, I/O instructions)

        [if you want to know more about CPU modes, read:
        https://sites.google.com/site/masumzh/articles/x86-architecture-basics/x86-architecture-basics]

    Question: Where does the OS live? 

      Option 1: In its own address space?

        -- Can't do this on most hardware (e.g., syscall instruction
        won’t switch address spaces)

        -- Also would make it harder to parse syscall arguments
        passed as pointers

      Option 2: kernel is actually in the same address space as
      all processes (choice of real systems)

      [see handout for picture]

      * not precisely true post-Meltdown, but close enough (in that
      some of the kernel is mapped into all user processes).

    -- Use protection bits to prohibit user code from reading/writing kernel memory

    -- Typically all kernel text, most data at same VA in *every*
    address space (every process has virtual addresses that map to the
    physical memory that stores the kernel's instructions and data)

    -- In Linux, the kernel is mapped at the top of the address space,
    along with per-process data structures.

    -- Physical memory also mapped up top, which gives the kernel a
    convenient way to access physical memory.

        NOTE: that means that physical memory that is in use is mapped
        in at least two places (once into a process's virtual address
        space and once into this upper region of the virtual space).
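
    [A sketch of why mapping physical memory "up top" is convenient:
    converting between a physical address and the kernel's virtual
    address for it becomes one addition or subtraction. The helper
    names and DIRECT_MAP_BASE are hypothetical; the base value is
    chosen only for illustration and is not any particular kernel's.
    KERNEL_HALF_START is where the upper half of the 48-bit canonical
    address space begins.]

```c
#include <stdbool.h>
#include <stdint.h>

/* Start of the upper canonical half with 48-bit virtual addresses. */
#define KERNEL_HALF_START 0xffff800000000000ULL

/* HYPOTHETICAL base of the region where all physical memory is mapped. */
#define DIRECT_MAP_BASE   0xffff900000000000ULL

bool in_kernel_half(uint64_t va)
{
    return va >= KERNEL_HALF_START;
}

/* Kernel virtual address through which physical address pa is reachable. */
uint64_t phys_to_virt(uint64_t pa)
{
    return DIRECT_MAP_BASE + pa;
}

/* Inverse: only valid for addresses inside the direct-map region. */
uint64_t virt_to_phys(uint64_t va)
{
    return va - DIRECT_MAP_BASE;
}
```

    [So a physical page in use by a process has (at least) two virtual
    names: the process's own VA in the lower half, and
    phys_to_virt(pa) in the upper region.]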


3. Meltdown and Spectre

   Handout's memory layout is nice, but...
     ...if the HW isolation is broken, nothing works...
     ...and unfortunately, HW (CPU) today is broken...

   We have Meltdown and Spectre (2018).
     see: https://meltdownattack.com/

   """
   Q: Am I affected by the vulnerability?
   A: Most certainly, yes.

   Q: Can I detect if someone has exploited Meltdown or Spectre against me?
   A: Probably not. The exploitation does not leave any traces in traditional log files.

   Q: What can be leaked?
   A: If your system is affected, our proof-of-concept exploit can read the memory
   content of your computer. This may include passwords and sensitive data stored
   on the system.

   Q: Has Meltdown or Spectre been abused in the wild?
   A: We don't know.
   """

   -- background

     -- side channel attack

        We have caches all over the place to accelerate memory accesses.

        Cache side-channel attacks exploit timing differences that are
        introduced by the caches.

        An attacker repeatedly flushes a targeted memory location from
        the cache.

        By measuring how long it takes to reload the data, the attacker
        can determine whether another process (the victim) loaded it
        into the cache in the meantime.

     -- speculative execution

        Motivation:

        given a piece of code:

        if (read_a_bool_from_memory) {
          foo()
        }

        reading from memory can be slow (hundreds of cycles).
        Before knowing the result, in principle, the CPU could do nothing but wait.

        In fact, the CPU will predict the branch and might speculatively run foo():
          if bool is false, CPU discards all the results of foo() => as if nothing happened
          if bool is true, hooray, CPU saves a lot of time!

        -- Speculative execution on modern CPUs can run several hundred
        instructions ahead.

   -- spectre: speculation + timing channel
      (simplified pseudocode)

      -- first, run code:

         if (x < array1_size) {
            y = array2[array1[x] * 4096];
         }

        ** where x is far larger than array1_size, chosen so that
           array1[x] lands on some secret
           (meaning the address (array1 + x) points to the secret)
        ** and all pages of array2 are uncached

      -- then test which page has been touched:

        for (int i=0; i<256; i++) {
            test how long to read array2[i << 12]
        }

        if the ith round is faster than the others,
          then we know the secret is "i"
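
        [The "which round is fastest" step, sketched on already-measured
        numbers. This only does the final comparison: it takes 256
        latencies (one per candidate byte value) and returns the index
        of the smallest. The function name and the latency array are
        hypothetical; how the latencies are measured is not shown.]

```c
#include <stdint.h>

/* latency[i] is the measured time to read array2[i * 4096].
 * The slot that was pulled into the cache reads much faster than
 * the other 255, so the index of the minimum is the guessed byte. */
int fastest_slot(const uint64_t latency[256])
{
    int best = 0;
    for (int i = 1; i < 256; i++) {
        if (latency[i] < latency[best])
            best = i;
    }
    return best;
}
```

        [For instance, if every slot takes about 300 cycles except slot
        0x41, which takes 40, the function returns 0x41.]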

    [skipped most of the pieces]
    -- meltdown: out-of-order execution + timing channel
      (simplified pseudocode)

       --out-of-order execution
         CPU doesn't have to run instructions in program order.
         It might run them out of order to speed up execution.

       --first, run code:

        byte = read_one_byte_from_kernel() // will raise an exception
        // the line below should never have been reached
        int x = array[byte << 12]

       --then, run:

        for (int i=0; i<256; i++) { // 256 = 2^8 (8bits = 1byte)
            test how long to read array[i << 12]
        }

    -- mitigation:
      [will talk about it next time]

[Acknowledgments: Mike Walfish, David Mazieres, Mike Dahlin]