Week 9.b. CS 5600 11/05 2021

On the board
------------

1. Last time
2. page faults
    - intro
    - uses
    - costs
    - page replacement policies
3. mmap

-------------------------------------------------------

1. Last time

    -- x86-64 page table walk
    -- TLBs
    -- where does the OS live?
    -- Question: where are the page tables? do they exist in the virtual
       address space of a process?
       [answer: yes. The process's user-level code cannot see them, but the
       kernel maps them into kernel space twice: one copy in the "physical
       memory" section (see handout week9.a, panel 7), and the other in its
       own data structure.]

2. Page faults

    A. intro and mechanics

    We've discussed these a bit. Let's go into a bit more detail...

    Concept: a reference is illegal, either because it's not mapped in the
    page tables or because there is a protection violation. This requires
    the OS to get involved. The mechanism turns out to be hugely powerful,
    as we will see.

    Mechanics

    --what happens on the x86?

        --the processor constructs a trap frame and transfers execution to
          an interrupt or trap handler

                        [ trap frame ]
            %rsp -->    [ error code ]

          %rip now points to code to handle the trap
          [how did the processor know what to load into %rip? Answer: the
          kernel sets up an interrupt descriptor table (IDT), by which the
          CPU knows where to go; btw, the page fault exception is #14.]

        --error code: [see handout week9.a]

            [ ................................ U/S | W/R | P ]

            U/S: user mode fault / supervisor mode fault
            W/R: access was write / access was read
            P:   protection violation / not-present page

        --%cr2 holds the faulting virtual address

          Question: why do we need the faulting virtual address?
          [answer: the kernel needs to know which VA went wrong so that it
          can fix the problem by, for example, allocating a physical page
          and mapping it to the virtual page containing that VA. (Of
          course, the kernel can do other things instead, like kill the
          process.)]

    --intent: when a page fault happens, the kernel either sets up the
      process's page table entries properly or kills the process

    B. Uses of page faults

    --Best example: overcommitting physical memory (the classical use of
      "virtual memory")

        --your program thinks it has, say, 64 GB of memory, but your
          hardware has only 16 GB of memory
        --the way that this worked is that the disk was (is) used to store
          memory pages
        --Rough implementation (a sketch appears at the end of this
          section):
            --for pages in memory: set their PTEs with P=1 (P is the
              present bit in the PTE; see handout week9.a)
            --for pages on disk (not in memory): set their PTEs with P=0
            --accessing a page that is not in memory causes a page fault
            --on a page fault, the kernel reads the faulting page from the
              disk into memory

        [Another point that I haven't talked about in class:
            --QUESTION: what is listed in the page structures? how does
              the kernel know whether the address is invalid, in memory,
              paged out, etc.?
            --the kernel may need to send a page to disk (under what
              conditions? answer: two conditions must hold for the kernel
              to HAVE to write to disk:
                (1) the kernel is out of memory, and
                (2) the page that it selects to write out is dirty)
        ]

        --advantage: the address space looks huge
        --disadvantage: accesses to "paged" memory (as pages that live on
          the disk are known) are sllooooowwwww:
            --computers have lots of memory these days, so it's less
              common to hear the sound of swapping. You would need
              multiple large memory consumers running on the same
              computer.
            --in fact, today you sometimes see the opposite: you may want
              to run a file system in memory (ramfs) for performance.

    --Many other uses

        --Distributed shared memory

            --store memory pages across the network!
            --basic idea: on a page fault, the page fault handler goes
              and retrieves the needed page from some other machine
            --charming idea, but impractical. Why? too expensive.
            --classic trade-off: transparency vs. efficiency
            --a general pattern: low-level APIs expose more information
              (less transparency) but offer better performance

        --copy-on-write

            --when creating a copy of another process, don't copy its
              memory. just copy its page tables, and mark the pages
              read-only
            --QUESTION: do you need to mark the parent's pages as
              read-only as well?
              [answer: of course; otherwise the child would see the
              parent's subsequent writes.]
            --program semantics aren't violated when programs do reads
            --when a write happens, a page fault results. at that point,
              the kernel allocates a new page, copies the memory over,
              fixes up the page tables, and restarts the user program so
              that the write can proceed
            --thus, memory is copied only when there is a fault as a
              result of a write
            --this idea is all over the place; used in fork(), mmap(),
              etc. (see the second sketch at the end of this section)

        --accounting

            --good way to sample what percentage of the memory pages are
              written to in any time slice: mark a fraction of them not
              present, and see how often you get faults

        --performance tricks

            --remember the JVM trick I mentioned: use a page fault to get
              rid of an if-else branch
            --if you are interested in this, check out the paper "Virtual
              Memory Primitives for User Programs", by Andrew W. Appel and
              Kai Li, Proc. ASPLOS, 1991.

    --high-level idea: by giving the kernel (or even a user-level program)
      the opportunity to do interesting things on page faults, you can
      build interesting functionality
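    [Aside, not from the handout: the fault-and-fix cycle above can be
    demonstrated from user space, in the spirit of the Appel & Li paper
    cited above. The sketch below protects a page with mprotect(), catches
    the resulting signal, and repairs the mapping in the handler;
    info->si_addr plays the role of %cr2. Caveat: POSIX does not promise
    that mprotect() is async-signal-safe, so treat this as demo-quality
    code; it works on common Linux and macOS setups.]

        #include <signal.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        static char *page;
        static long page_size;

        static void on_fault(int sig, siginfo_t *info, void *ctx) {
            (void)sig; (void)ctx;
            char *addr = (char *)info->si_addr;   /* analogue of %cr2 */
            if (addr >= page && addr < page + page_size) {
                /* "set up the page entry properly": make the page
                   accessible, return, and the faulting store retries */
                mprotect(page, page_size, PROT_READ | PROT_WRITE);
            } else {
                _exit(1);  /* analogue of the kernel killing the process */
            }
        }

        int main(void) {
            page_size = sysconf(_SC_PAGESIZE);
            page = mmap(NULL, page_size, PROT_NONE,  /* like P=0 */
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            struct sigaction sa;
            memset(&sa, 0, sizeof sa);
            sa.sa_flags = SA_SIGINFO;
            sa.sa_sigaction = on_fault;
            sigaction(SIGSEGV, &sa, NULL);
            sigaction(SIGBUS, &sa, NULL);  /* macOS may deliver SIGBUS */

            page[0] = 'x';   /* faults; handler "maps" the page; retries */
            printf("after the fault: %c\n", page[0]);
            return 0;
        }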
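    [Second aside: a tiny demonstration of copy-on-write semantics via
    fork(). The kernel-side page copying is invisible here; what is
    observable is that the parent's post-fork write does not show up in
    the child. The sleep()-based ordering is crude but keeps the sketch
    short.]

        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/wait.h>
        #include <unistd.h>

        int value = 1;  /* on a page shared copy-on-write after fork() */

        int main(void) {
            pid_t pid = fork();
            if (pid == 0) {
                sleep(1);              /* let the parent write first */
                /* the parent's write faulted and went to a private
                   copy, so the child still sees the original value */
                printf("child sees  %d\n", value);   /* prints 1 */
                exit(0);
            }
            value = 2;  /* write -> page fault -> kernel copies the page */
            printf("parent sees %d\n", value);       /* prints 2 */
            wait(NULL);
            return 0;
        }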
    C. costs

    --What does paging from the disk cost?

        --let's look at average memory access time (AMAT):

            AMAT = (1-p) * (memory access time) + p * (page fault time),

          where p is the probability of a page fault.

            memory access time ~ 100 ns
            SSD access time    ~ 1 ms = 10^6 ns

        --QUESTION: what does p need to be to ensure that paging hurts
          performance by less than 10%? (writing t_M for the memory
          access time and t_D for the SSD access time)

            1.1 * t_M = (1-p) * t_M + p * t_D
            p = 0.1 * t_M / (t_D - t_M) ~ 10^1 ns / 10^6 ns = 10^{-5}

          so only one access out of 100,000 can be a page fault!!

        --basically, page faults are super-expensive (good thing the
          machine can do other things during a page fault)

        Concept is much larger than OSes: you need to pay attention to
        the slow case if it's really slow and common enough to matter.

    D. Page replacement policies

    --the fundamental problem/question (also known as the cache eviction
      problem):

        --some entity holds a cache of entries and gets a cache miss.
          The entity now needs to decide which entry to throw away. How
          does it decide?

    --make sure you understand why page faults that result from
      "page-not-present in memory" are a particular kind of cache miss

        --(the answer is that in the world of virtual memory, the pages
          resident in memory are basically a cache of the backing store
          on the disk; make sure you see why this claim, about virtual
          memory vis-a-vis the disk, is true.)

    --the system needs to decide which entry to throw away, which calls
      for a *replacement policy*.

    --let's cover some policies....

    Specific policies

        * FIFO: throw out the oldest page (results in every page
          spending the same number of references in memory. not a good
          idea: pages are not accessed uniformly.)

        * OPT: throw away the entry that won't be used for the longest
          time. this is optimal.

        * LRU: throw out the least recently used page (this is often a
          good idea, but it depends on the future looking like the past.
          what if we chuck a page from our cache just before we are
          about to use it?)

    --implementing LRU

        --usually with a doubly linked list plus a hash table (see the
          first sketch below, before the summary)
        --reasonable to do in application programs like Web servers that
          cache pages (or dedicated Web caches)
        --in the OS, LRU itself does not sound great: it would double
          memory traffic (after every reference, you would have to move
          some structure to the head of some list)
        --and in hardware, it's way too much work to timestamp each
          reference and keep the list ordered
        --so how can we approximate LRU?

    --another algorithm:

        * CLOCK (see the second sketch below, before the summary)

            --arrange the physical page frames in a circle. a hand
              sweeps around, clearing a "referenced" bit in each frame
              it passes; the bit is set whenever the page is accessed.
              evict a page if the hand points to it while its bit is
              already clear.
            --approximates LRU ... because we're evicting pages that
              haven't been used in a while ... though of course we may
              not be evicting the *least* recently used one (why not?)
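    [Aside: a minimal sketch of the list-plus-table LRU structure
    mentioned above. Illustration only: for brevity, the "hash table" is
    a direct-indexed array over small integer page numbers; a real cache
    would hash arbitrary keys.]

        #include <stdio.h>
        #include <stdlib.h>

        #define CAPACITY 3
        #define MAX_KEY  100

        struct node {
            int key;
            struct node *prev, *next;
        };

        static struct node *head, *tail;      /* head = most recent */
        static struct node *table[MAX_KEY];   /* key -> node, or NULL */
        static int count;

        static void unlink_node(struct node *n) {
            if (n->prev) n->prev->next = n->next; else head = n->next;
            if (n->next) n->next->prev = n->prev; else tail = n->prev;
        }

        static void push_front(struct node *n) {
            n->prev = NULL; n->next = head;
            if (head) head->prev = n; else tail = n;
            head = n;
        }

        /* reference(key): a hit moves the node to the front; a miss
           evicts the tail (least recently used) if the cache is full */
        static void reference(int key) {
            struct node *n = table[key];
            if (n) {                      /* hit */
                unlink_node(n);
                push_front(n);
                return;
            }
            if (count == CAPACITY) {      /* miss + full: evict LRU */
                struct node *victim = tail;
                printf("evict %d\n", victim->key);
                unlink_node(victim);
                table[victim->key] = NULL;
                free(victim);
                count--;
            }
            n = malloc(sizeof *n);
            n->key = key;
            push_front(n);
            table[key] = n;
            count++;
        }

        int main(void) {
            int trace[] = {1, 2, 3, 1, 4, 2};   /* reference string */
            for (int i = 0; i < 6; i++)
                reference(trace[i]);
            /* with capacity 3: 4 evicts 2 (the LRU page), then the
               re-reference of 2 evicts 3 */
            return 0;
        }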
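    [Second aside: a sketch of CLOCK's sweep. Also illustration only: a
    real kernel reads and clears the referenced/accessed bit in the PTE;
    here it is modeled as a plain array.]

        #include <stdio.h>

        #define NFRAMES 4

        static int page_in[NFRAMES];     /* which page occupies a frame */
        static int referenced[NFRAMES];  /* set on access, cleared by hand */
        static int hand;

        /* called by "hardware" on every access to a resident page */
        static void touch(int frame) { referenced[frame] = 1; }

        /* pick a victim frame on a page fault */
        static int clock_evict(void) {
            for (;;) {
                if (referenced[hand] == 0) {   /* unused since last sweep */
                    int victim = hand;
                    hand = (hand + 1) % NFRAMES;
                    return victim;
                }
                referenced[hand] = 0;          /* give a second chance */
                hand = (hand + 1) % NFRAMES;
            }
        }

        int main(void) {
            for (int f = 0; f < NFRAMES; f++) page_in[f] = 10 + f;
            touch(0); touch(2);      /* pages in frames 0 and 2 were used */
            int victim = clock_evict();  /* clears frame 0, evicts frame 1 */
            printf("evict frame %d (page %d)\n", victim, page_in[victim]);
            return 0;
        }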
    --Summary:

        --optimal is known as OPT
        --LRU is usually a good approximation to optimal
        --implementing LRU in hardware or at the OS/hardware interface
          is a pain
        --so implement CLOCK ... a decent approximation to LRU, which is
          in turn a good approximation to OPT *assuming that the past is
          a good predictor of the future* (this assumption does not
          always hold!)

3. mmap

    --a syscall; a cool way to bring some ideas together.

    --recall some syscalls:

        fd = open(pathname, mode)
        write(fd, buf, sz)
        read(fd, buf, sz)

    --we've seen fds before, but what is an fd inside the kernel?

        --an index into a table maintained by the kernel on behalf of
          the process

    --syscall:

        void* mmap(void* addr, size_t len, int prot, int flags,
                   int fd, off_t offset);

    --means, roughly, "map the specified open file (fd) into a region of
      my virtual memory (close to addr, or at a kernel-selected place if
      addr is 0), and return a pointer to it" [see handout]

        NOTE: the "disk image" here is the file we've mmap()'ed, not the
        process's usual backing store. The idea is that mmap() lets the
        programmer "inject" pages from a regular file on disk into the
        process's backing store (which would otherwise be part of a swap
        file).

    --after this, loads and stores to addr[x] are equivalent to reading
      and writing the file at offset+x.

    --why is this cool?

        --example: mmap enables copying a file to stdout without
          transferring the data to user space (see handout, and the
          sketch at the very end of these notes)

        Question: mmap vs. normal read/write: which is faster? by how
        much?
        [Answer:
            --mmap is faster
            --"by how much" depends on the file size and the underlying
              OS and machine
            --for a ~1 GB file on my machine (macOS, M1 MacBook Air),
              mmap is 13.5x faster
        ]

        NOTE: the process never itself dereferences a pointer to the
        memory containing the file data.

        NOTE: this saves two sets of memory-to-memory copies
        (kernel-to-user, user-to-kernel), versus the "naive" solution of
        read()ing into a buffer in user space and then write()ing it
        out.

[Acknowledgments: Mike Walfish, David Mazieres, Mike Dahlin]
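[Aside, for reference: a minimal sketch of the copy-a-file-to-stdout idea
from section 3. This is a reconstruction, not necessarily the handout's
code; error handling and the zero-length-file case (mmap of length 0
fails) are omitted for brevity.]

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        if (argc != 2) {
            fprintf(stderr, "usage: %s file\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        /* map the whole file; a load from p[i] is a read of byte i */
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

        /* the process passes the pointer to the kernel but never
           dereferences it; no read() into a user-space buffer */
        write(STDOUT_FILENO, p, st.st_size);

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }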