Week 11.a
CS 5600 03/28 2022

On the board
------------
1. Last time

2. Page faults
    - intro
    - usage
    - costs
    - thrashing

3. mmap

-------------------------------------------------------

Admin:

--Lab2 cheating detection results
    --confess before Tue midnight

---------

1. Last time

--TLB
--Virtual memory overview [show slides]
    --where does the OS live?
    --Meltdown and Spectre
        --post-Meltdown kernel (mitigation: KAISER)

notes from Aurojit Panda about kernel-being-mapped-into-each-process:

    [AP: This answer is complicated
    (https://www.usenix.org/system/files/login/articles/login_winter18_03_gruss.pdf),
    but see below for an attempt to explain.

    * In the post-Meltdown KAISER/KPTI/KVA/XNU Double Map world (all
      names for similar mitigations), each process has two (logical)
      page tables:

        - One, the user-mode page table, used when the process is
          executing user-mode code, unmaps most (but not all) of the
          kernel; this includes some of the kernel stack and a few other
          things. The aim of all mitigations has been to minimize the
          number of kernel pages in the user-mode page table, but
          different tradeoffs are selected for how much this is
          minimized.

        - The second, the kernel-mode page table, has exactly the same
          layout as what [is mentioned above], i.e., the kernel is in
          all address spaces.

      On entry to the kernel, the OS switches as rapidly as possible
      from the user-mode page table to the kernel-mode one. It switches
      back before returning. Having a kernel-mode page table per process
      is in part to minimize how much one needs to change in the kernel,
      so maybe one can argue that this is not the "best" possible
      solution, but it is pretty good.]

2. Page faults

A. Intro and mechanics

We've discussed these a bit. Let's go into a bit more detail...

Concept: a reference is illegal, either because it's not mapped in the
page tables or because there is a protection violation. This requires
the OS to get involved. This mechanism turns out to be hugely powerful,
as we will see.

Mechanics

--what happens on the x86? [show slides]

--the processor constructs a trap frame and transfers execution to an
  interrupt or trap handler [see handout week11.a]

                 [  trap frame  ]
        %rsp --> [  error code  ]

  %rip now points to the code that handles the trap

  [how did the processor know what to load into %rip? Answer: the kernel
  needs to set up an interrupt descriptor table (IDT), by which the CPU
  knows where to go; btw, the page fault exception is #14.]

  Question: why isn't %cr3 stored in the trap frame?
  [answer: though we trap to the kernel, the address space is unchanged.
  Recall "where does the OS live?".]

--error code (meaning when the bit is set / clear):

    [ ................................ U/S | W/R | P ]

    U/S: fault happened in user mode / supervisor mode
    W/R: access was a write / access was a read
    P:   protection violation / page not present

--on a page fault, %cr2 holds the faulting virtual address

  Question: why does the OS need the faulting virtual address?
  [answer: to walk the page table and figure out how to fix the fault]

--intent: when a page fault happens, the kernel sets up the process's
  page table entries properly, or kills the process
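To make the dispatch concrete, here is a minimal sketch (not any real
kernel's code) of a page-fault trap handler that reads %cr2 and decodes
the error code. The bit values match the x86 error code above; all other
names (struct proc, struct vma, curproc, and the helper functions) are
hypothetical stand-ins.

    #include <stdint.h>

    #define PF_PRESENT 0x1   /* 1 = protection violation, 0 = page not present */
    #define PF_WRITE   0x2   /* 1 = faulting access was a write */
    #define PF_USER    0x4   /* 1 = fault occurred in user mode */

    struct proc;                                   /* opaque process descriptor */
    struct vma;                                    /* a mapped region */
    extern struct proc *curproc;

    uint64_t read_cr2(void);                       /* faulting virtual address */
    struct vma *find_vma(struct proc *p, uint64_t va);
    int  vma_is_writable(struct vma *v);
    void handle_cow_write(struct proc *p, uint64_t va);     /* copy-on-write; see B. */
    void page_in_from_disk(struct proc *p, struct vma *v, uint64_t va);
    void kill_process(struct proc *p);

    void page_fault_handler(uint32_t err)
    {
        uint64_t va = read_cr2();                  /* which address faulted? */
        struct vma *v = find_vma(curproc, va);

        if (v == NULL) {
            kill_process(curproc);                 /* address is simply illegal */
        } else if ((err & PF_WRITE) && !vma_is_writable(v)) {
            kill_process(curproc);                 /* write to a read-only region */
        } else if (err & PF_PRESENT) {
            handle_cow_write(curproc, va);         /* page mapped read-only on purpose */
        } else {
            page_in_from_disk(curproc, v, va);     /* valid but not resident */
        }
        /* on return from the trap, the hardware retries the faulting
           instruction */
    }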
B. Uses of page faults

--Best example: overcommitting physical memory (the classical use of
  "virtual memory")

    --your program thinks it has, say, 64 GB of memory, but your
      hardware has only 16 GB of memory

    --the way this worked (and works) is that the disk is used to store
      memory pages

    --advantage: the address space looks huge

    --disadvantage: accesses to "paged" memory (as pages that live on
      the disk are known) are sllooooowwwww

    --Rough implementation:

        --on a page fault, the kernel reads in the faulting page

        --QUESTION: what is listed in the page structures? how does the
          kernel know whether the address is invalid, in memory, paged
          out, etc.?

        --the kernel may need to send a page to disk
          (under what conditions? answer: two conditions must hold for
          the kernel to HAVE to write to disk:)
            (1) the kernel is out of memory
            (2) the page that it selects to write out is dirty

    --Computers have lots of memory, so it is less common to hear the
      sound of swapping these days. You would need multiple large memory
      consumers running on the same computer.

    --in fact, today you sometimes see the opposite: you may want to run
      a file system in memory (ramfs) for performance.

--Many other uses

    a) store memory pages across the network! (Distributed Shared Memory)

        --basic idea: on a page fault, the page fault handler went and
          retrieved the needed page from some other machine

        --charming idea, but impractical. Why? too expensive.

        --classic trade-off: transparency vs. efficiency
            --the general case: low-level APIs expose more information
              (less transparency) but have better performance

    b) copy-on-write

        --when creating a copy of another process, don't copy its
          memory. just copy its page tables, and mark the pages as
          read-only

        --COW is how fork is implemented

        --QUESTION: do you need to mark the parent's pages as read-only
          as well?
          [answer: of course; otherwise the child would see data changes
          made by the parent.]

        --program semantics aren't violated when programs do reads

        --when a write happens, a page fault results. at that point, the
          kernel allocates a new page, copies the memory over, marks the
          page writable, and restarts the user program so the write can
          proceed (see the sketch at the end of this section)

        --so memory is copied only when there is a fault as a result of
          a write

        --this idea is all over the place; used in fork(), mmap(), etc.

    [skipped]
    c) accounting

        --good way to sample what percentage of the memory pages are
          written to in any time slice: mark a fraction of them not
          present, see how often you get faults

    d) performance tricks

        --remember the JVM trick I mentioned
            --use a page fault to get rid of an if-else branch

        --the paper "Virtual Memory Primitives for User Programs", by
          Andrew W. Appel and Kai Li, Proc. ASPLOS, 1991, explored the
          possible uses of virtual memory 3 decades ago:

            --high-level idea: by giving user-level programs the
              opportunity to do interesting things on page faults, you
              can build interesting functionality

            [interesting applications below; introduce some given time]

        [briefly mentioned]
        ** Concurrent GC

            --GC motivation:
                --manually calling malloc and free is tedious and
                  error-prone (recall your memory bugs in your labs)
                --can we ask the program/runtime/library/OS to fix this?
                --Yes, garbage collection (GC)!

            --many GC algorithms

            --one classic GC algorithm: copying
                --two memory regions: one is in use; the other is free
                --when doing GC, copy the useful objects to the other
                  region
                --swap the regions

            --challenge: can we run the garbage collector and the
              program concurrently?
                --what if the program reaches the old memory region? it
                  may see stale data
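Here is a hedged sketch of the kernel-side handling of a write fault on
a copy-on-write page, as described in b) above. Every type and helper
here (struct proc, phys_page_for, refcount_of, alloc_page, map_page,
PTE_W) is a hypothetical stand-in, not the API of any real kernel.

    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 4096
    #define PTE_W     0x2                 /* "writable" permission bit */

    struct proc;                          /* opaque process descriptor */

    void *phys_page_for(struct proc *p, uint64_t va);  /* page backing va */
    int   refcount_of(void *page);        /* how many page tables share it */
    void  refcount_dec(void *page);
    void *alloc_page(void);
    void  map_page(struct proc *p, uint64_t va, void *page, int perms);

    void handle_cow_write(struct proc *p, uint64_t va)
    {
        void *old = phys_page_for(p, va); /* currently mapped read-only */

        if (refcount_of(old) == 1) {
            /* last sharer: no copy needed, just re-enable writes */
            map_page(p, va, old, PTE_W);
            return;
        }
        void *copy = alloc_page();        /* private copy for this process */
        memcpy(copy, old, PAGE_SIZE);
        refcount_dec(old);
        map_page(p, va, copy, PTE_W);
        /* returning from the fault restarts the faulting write, which now
           hits the private, writable copy */
    }

Note the cheap case: if only one process still maps the page, the fault
handler can simply flip the permission bit instead of copying.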
[skipped]
C. Costs

--What does paging from the disk cost?

--let's look at average memory access time (AMAT)

--AMAT = (1-p) * (memory access time) + p * (page fault time),
  where p is the probability of a page fault.

    memory access time ~ 100 ns
    SSD access time    ~ 1 ms = 10^6 ns

--QUESTION: what does p need to be to ensure that paging hurts
  performance by less than 10%?

    1.1 * t_M = (1-p) * t_M + p * t_D
    p = 0.1 * t_M / (t_D - t_M) ~ 10^1 ns / 10^6 ns = 10^{-5}

    so only one access out of 100,000 can be a page fault!!

--basically, page faults are super-expensive
  (good thing the machine can do other things during a page fault)

  This concept is much larger than OSes: need to pay attention to the
  slow case if it's really slow and common enough to matter.

D. Page replacement policies

--the fundamental problem/question (also known as the cache eviction
  problem):

    --some entity holds a cache of entries and gets a cache miss. The
      entity now needs to decide which entry to throw away. How does it
      decide?

    --make sure you understand why page faults that result from
      "page-not-present in memory" are a particular kind of cache miss

    --(the answer is that in the world of virtual memory, the pages
      resident in memory are basically a cache for the backing store on
      the disk; make sure you see why this claim, about virtual memory
      vis-a-vis the disk, is true.)

    --the system needs to decide which entry to throw away, which calls
      for a *replacement policy*.

    --let's cover some policies....

Specific policies

    * FIFO: throw out the oldest page
      (results in every page spending the same number of references in
      memory. not a good idea: pages are not accessed uniformly.)

    * LRU: throw out the least recently used page
      (this is often a good idea, but it depends on the future looking
      like the past. what if we chuck a page from our cache and then are
      about to use it?)

      --implementing LRU

        --reasonable to do in application programs like Web servers that
          cache pages (or dedicated Web caches). [use a queue to track
          the least recently accessed entries and a hash map to
          implement the (k,v) lookup]

        --in the OS, exact LRU does not sound great: it would double
          memory traffic (after every reference, we'd have to move some
          structure to the head of some list)

        --and in hardware, it's way too much work to timestamp each
          reference and keep the list ordered (remember that the TLB
          would also have to implement this)

        --how can we approximate LRU? --> another algorithm:

    * CLOCK [draw clock on board]

        --arrange the slots in a circle. a hand sweeps around, clearing
          a bit; the bit is set when the page is accessed. evict a page
          if the hand points to it when its bit is clear.
          (see the sketch at the end of this section)

        --approximates LRU ... because we're evicting pages that haven't
          been used in a while ... though of course we may not be
          evicting the *least* recently used one (why not?)

--Summary:

    --LRU is usually a good approximation to the optimal policy

        * optimal policy: throw away the entry that won't be used for
          the longest time. this is optimal.

    --Implementing LRU in hardware or at the OS/hardware interface is a
      pain

    --So implement CLOCK ... a decent approximation to LRU, which is in
      turn a good approximation to OPT *assuming that the past is a good
      predictor of the future* (this assumption does not always hold!)
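A minimal, self-contained sketch of CLOCK eviction follows. In a real OS
the "referenced" flag would be the hardware-maintained accessed bit in
the page table entry; here it is just a field in a made-up frame table,
and NFRAMES is an arbitrary size chosen for illustration.

    #include <stdbool.h>

    #define NFRAMES 1024

    struct frame {
        int  page;           /* which page currently occupies this frame */
        bool referenced;     /* set on access, cleared by the sweeping hand */
    };

    static struct frame frames[NFRAMES];
    static int hand = 0;     /* index the clock hand currently points at */

    /* choose a victim frame to evict */
    int clock_evict(void)
    {
        for (;;) {
            if (!frames[hand].referenced) {
                int victim = hand;               /* not used since last sweep */
                hand = (hand + 1) % NFRAMES;
                return victim;
            }
            frames[hand].referenced = false;     /* give it a second chance */
            hand = (hand + 1) % NFRAMES;
        }
    }

The "second chance" line is the whole trick: a page survives one sweep
of the hand after its last access, so recently used pages tend to stay.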
E. Thrashing

[The points below apply to any caching system, but for the sake of
concreteness, let's assume that we're talking about page replacement in
particular.]

What is thrashing?

    Processes require more memory than the system has.

    Specifically, each time a page is brought in, another page, whose
    contents will soon be referenced, is thrown out.

    Example:

        --one program touches 50 pages (each equally likely); we only
          have 40 physical page frames

        --If we have enough physical pages: 100 ns/ref

        --If we have too few physical pages (40 frames), assume every
          5th reference leads to a page fault

        Question: if the SSD latency is 1 ms, how many times slowdown do
        we have?

        --4 refs x 100 ns, plus 1 page fault x 1 ms for the SSD I/O

        --this gets us 5 refs per (1 ms + 400 ns) ~ 0.2 ms/ref
          = 2,000x slowdown!!!

        --What we wanted: virtual memory the size of the disk with the
          access time of physical memory

        --What we have here: memory with an access time roughly that of
          the SSD (0.2 ms/mem_ref, compared to 1 ms/SSD_access)

    As stated earlier, this concept is much larger than OSes: need to
    pay attention to the slow case if it's really slow and common enough
    to matter.

Reasons/cases:

    (1) process doesn't reuse memory (or has no temporal locality)

    (2) process reuses memory, but the memory that is absorbing most of
        the accesses doesn't fit

    (3) individually, all processes fit, but together they are too much
        for the system

What do we do?

    --well, for the first two reasons above, there's nothing you can do,
      other than restructuring your computation or buying more memory
      (e.g., expensive hardware that keeps an entire customer database
      in RAM)

    --in the third case, the system can and must shed load. how? two
      approaches:

        a. working set
        b. page fault frequency

    a. working set

        --only run a set of processes s.t. the union of their working
          sets fits in memory

        --definition of working set (short version): the pages a process
          has touched over some trailing window of time

    b. page fault frequency

        --track the metric (# page faults / instructions executed)

        --if that metric rises above a threshold, and there is not
          enough memory on the system, swap out the process

[Acknowledgments: Mike Walfish, David Mazieres, Mike Dahlin]