Week 8 CS4973/CS6640 02/26 2025
https://naizhengtan.github.io/25spring/

Memory protection introduction
□ 1. Memory protection, the problem statement
□ 2. Segmentation (x86-32)
□ 3. PMP (RISC-V)
□ 4. Paging (brief)
□ 5. Meltdown & its consequences
---

1. Problem statement

Q: Motivation? Why do we need memory protection?
   * prevent a process from accessing the kernel's memory
   * prevent a process from accessing another process's memory
   * prevent a process from accidentally modifying its own memory (bugs)
   * what else?

- the fundamental problem: access control

      subj --[access]--> obj

  for us,

      CPU --[op]--> memory

  Concretely,
      subj: instructions + context (privilege levels and others)
      op:   r/w/x
      obj:  memory address

  Q: is this the only way to formulate memory protection?
  Q: why does "op" only have r/w/x?
     This is defined by the memory abstraction:
     -- data cells, identified by addresses
     -- ops: read/write
     -- instructions are part of the memory

  Now, memory protection is a yes/no question:

      allow [instruction+ctx] --[r/w/x]--> [a set of memory addresses]?

  Q: what are invalid accesses?
  -- invalid subj (e.g., a user program reading kernel memory)
  -- invalid op (e.g., executing 0xdeadbeef)
  -- invalid obj (e.g., writing to read-only memory)

- a straightforward solution: associate an ACL with each obj
  => for each memory obj, attach an ACL of which subj can do what op

  Multiple questions:
  Q1: what are memory objs? (memory granularity)
  Q2: who is the subj? (how to define subj)
  Q3: where to store the ACL?

2. x86 segmentation

- segfault example, in x86:

  """
  int main() {
      int *ptr = (int *)0xdeadbeef;
      *ptr = 1;
  }
  """

  You will see something like:
  "zsh: segmentation fault  ./segfault"

  Q: Why is it called a segmentation fault?

- x86 memory segmentation, a brief history

  * Segmentation was introduced on the Intel 8086 in 1978 as a way to
    allow programs to address more than 64KB of memory.
    // real mode
    // 16-bit addresses => 20-bit addresses
    // address = (base << 4) + offset

  * The Intel 80286 (1982) introduced a second version of segmentation
    that added support for memory protection.

  * x86-32 (starting from the 80386, 32-bit, 1985)

    How it works:
    * segment registers---CS, DS, SS, ES (plus FS and GS)---contain "selectors"
    * selectors point into the GDT/LDT (global/local descriptor table),
      which holds "descriptors"
    * a descriptor contains a base address, a limit, and access rights
    * finally, the CPU adds the requested address to the base address
      (and checks it against the limit)

    memory protection (checks):
    - check access rights
    - check privilege levels: max(CPL, RPL) <= DPL
      // in x86, lower is more privileged
      RPL: requested privilege level, in the segment selector
      DPL: descriptor privilege level, in the segment descriptor
      CPL: current privilege level

    [show how Linux uses segments]

  * The x86-64 architecture, introduced in 2003, largely dropped support
    for segmentation in 64-bit mode.

- How does segmentation answer the previous questions?

  Q1: what is the granularity of memory obj?
  A: defined by a segment's base and limit: a dynamically sized region of memory

  Q2: how to define the subj?
  A: defined by instructions plus the current privilege level (CPL)

  Q3: where to store the ACL?
  A: registers (selectors) and memory (descriptors in the LDT/GDT)

3. PMP (RISC-V physical memory protection)

* PMP configuration registers (pmpcfg0–pmpcfg15), 16 of them
* PMP address registers (pmpaddr0–pmpaddr63), 64 of them

  Mapping:
  * each pmpaddr register is paired with one 8-bit configuration field;
  * one 32-bit pmpcfg register packs four such fields, so 16 pmpcfg
    registers cover the 64 pmpaddr registers
  (a configuration sketch follows this section)

- How does PMP answer the previous questions?

  Q1: what is the granularity of memory obj?
  A: defined by the PMP address registers; dynamically defined regions

  Q2: how to define the subj?
  A: defined by instructions plus the privilege mode
     (M-mode bypasses PMP checks unless an entry is locked)

  Q3: where to store the ACL?
  A: PMP registers (pmpcfg0–pmpcfg15, pmpaddr0–pmpaddr63)
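To make pmpcfg/pmpaddr concrete, here is a minimal sketch of our own
(not from the lecture) of M-mode code granting read/write access to one
region, using the NAPOT encoding from the RISC-V privileged spec. It
assumes an RV32 target with GCC/Clang inline assembly; the function
names are made up.

"""
#include <stdint.h>

/* pmpcfg field layout (one byte per pmpaddr register) */
#define PMP_R     (1u << 0)  /* read permission */
#define PMP_W     (1u << 1)  /* write permission */
#define PMP_X     (1u << 2)  /* execute permission */
#define PMP_NAPOT (3u << 3)  /* A field: naturally aligned power-of-two */

static inline void write_pmpaddr0(uint32_t v) {
    asm volatile("csrw pmpaddr0, %0" : : "r"(v));
}
static inline void write_pmpcfg0(uint32_t v) {
    asm volatile("csrw pmpcfg0, %0" : : "r"(v));
}

/* NAPOT encoding: base must be size-aligned and size a power of two
 * >= 8 bytes; pmpaddr holds address bits [33:2], with the low
 * log2(size)-3 bits set to 1. */
static uint32_t napot_encode(uint32_t base, uint32_t size) {
    return (base >> 2) | ((size >> 3) - 1);
}

/* Let S/U-mode read and write [base, base+size); M-mode is unaffected
 * because the entry is not locked (L bit clear). */
void pmp_allow_rw(uint32_t base, uint32_t size) {
    write_pmpaddr0(napot_encode(base, size));
    /* entry 0 lives in the low byte of pmpcfg0 */
    write_pmpcfg0(PMP_NAPOT | PMP_R | PMP_W);
}
"""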
4. Paging (brief)

- super brief history
  * designed for swapping memory in and out (Atlas, 1962)
    [T. Kilburn, R. B. Payne, D. J. Howarth, The Atlas Supervisor,
     http://www.chilton-computing.org.uk/acl/technology/atlas/p019.htm]
  * memory protection was added later

- Paging is the fundamental source of isolation in almost all of today's systems.

- How does paging answer the previous questions?

  Q1: what is the granularity of memory obj?
  A: a page (typically 4KB), a fixed size

  Q2: how to define the subj?
  A: defined by (1) instructions, (2) privilege levels, and
     (3) the root of the page table (cr3 in x86; satp in RISC-V)

  Q3: where to store the ACL?
  A: memory (page table entries)

5. Meltdown & its consequences

* Meltdown allows malicious user code to read kernel memory, despite page protections.
  -- surprising and disturbing
  -- one of a number of recent "micro-architectural" attacks exploiting
     hidden implementation details of CPUs
  -- fixable, but people fear an open-ended supply of related surprises

* we will skip the technical details

* consequences:

  what OSes used to do: map the kernel into the address space of every
  user process, but isolate it using privilege levels
  (set the "user-cannot-read" bit on kernel page table entries)

  now, a bummer: user code can violate that isolation and read kernel memory

  a fix (simplified): run the OS in its own address space
  [something deep here;]

  consequence: performance loss
  [see the paper "An Analysis of Performance Evolution of Linux's Core
   Operations", SOSP'19,
   https://dl.acm.org/doi/10.1145/3341301.3359640]

===

Virtual memory
□ 1. Virtual memory intro
□ 2. Paging
□ 3. Page table
□ 4. Today's virtual memory
------

1. Virtual memory intro

- a very important idea in computer systems

- "to virtualize" means "to lie" or "to fool". We'll see how this is
  implemented in a few moments; for now, let's look at the benefits of
  being interposed on memory accesses.

- Q: Why virtual memory? What's wrong with using physical addresses?

- benefits:

  (a) programmability
      (i) huge contiguous memory: the program *thinks* it has lots of
          memory, organized in a contiguous address space
      (ii) multiplexing addresses: programs can use "easy-to-use"
          addresses like 0, 0x20000, whatever. The compiler and linker
          don't have to worry about where the program actually lives
          in physical memory when it executes.
          -- good for modularization

  (b) protection:
      (i) separate address spaces: processes cannot read or write each
          other's memory (see the demo after this list)
          -- this protection is at the heart of the isolation among
             processes that the OS provides
          -- prevents a bug in one process from corrupting another
             process (non-adversarial scenarios)
          -- we don't even want a process to observe another process's
             memory, say if that process holds secret or sensitive
             data (adversarial scenarios)
          -- the idea: if you cannot name something, you cannot use it.
             This is a deep idea.
          -- Question: can you think of another example of this naming
             idea? (answer: file descriptors)
      (ii) access control: prevent incorrect modifications to data
           (e.g., writing to read-only memory regions) and adversarial
           execution on the stack/heap (e.g., buffer overflows)

  (c) effective use of resources:
      (i) overcommitting memory: programmers don't have to worry that
          the sum of the memory consumed by all active processes is
          larger than physical memory.
      (ii) secure sharing: processes can share memory under controlled
           circumstances, but that physical memory may show up at very
           different virtual addresses
           -- that is, two processes can have different names for the
              same physical memory cells

  Q: can physical addresses achieve the above benefits? Why or why not?
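Here is a small demo of benefit (b)(i), assuming a Unix-like system:
after fork(), the parent and child print the same virtual address for
a local variable, yet each sees its own value, because the same VA
maps to different physical memory in the two processes.

"""
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int x = 0;
    pid_t pid = fork();         /* now two processes run this code */

    if (pid == 0) {             /* child */
        x = 1;
        printf("child:  &x = %p, x = %d\n", (void *)&x, x);
    } else {                    /* parent */
        wait(NULL);             /* let the child write and print first */
        x = 2;
        printf("parent: &x = %p, x = %d\n", (void *)&x, x);
    }
    return 0;
}
"""

Both lines print the same "&x", but the child sees x == 1 while the
parent sees x == 2: one name, two memory cells.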
- today's virtual memory has two main duties:
  (a) translation: VA (virtual address) => PA (physical address)
  (b) isolation & protection: access control

- for now, we mainly focus on the translation job

2. Paging

- the translation problem: VA => PA,
  and we hope this translation
  (i) runs fast,
  (ii) has small memory overhead, and
  (iii) can be updated quickly.

- For example:
  * Segmentation: base + VA => PA

- Q: what's wrong with segmentation? how would you implement this translation?

A. pages

- Basic concept: divide all of memory (physical and virtual) into
  *fixed-size* chunks.
  -- these chunks are called *PAGES*.
  -- they have a size called the PAGE SIZE.
     (different hardware architectures specify different sizes)
  -- in traditional x86, the PAGE SIZE is 4096B = 4KB = 2^{12} bytes.
     It is the same for us, RV32.

  Q: does "paging" fundamentally change the translation problem?
  A: not really; we now map page numbers instead of individual addresses

  -- it is proper and fitting to talk about pages having **NUMBERS**.
     -- page 0: [0, 4095]
     -- page 1: [4096, 8191]
     -- page 2: [8192, 12287]
     -- page 3: [12288, 16383]
     ...
     -- page 2^{20}-1: [2^{32}-4096, 2^{32}-1]
  -- unfortunately, it is also proper and fitting to talk about _both_
     virtual and physical pages having numbers.
  -- sometimes we will try to be clear with terms like:
     VPN: virtual page number
     PPN: physical page number

B. per-process translation

Q: who owns the translation?
- If a CPU can hold only one translation, is it useful?
- For a multi-core CPU, if each core can hold only one translation, is it useful?
- Today, each process (or program) has a separate mapping
  -- and each page is separately mapped
  -- we will allow the OS to gain control on certain operations:
     -- read-only pages trap to the OS on write (store)
     -- invalid pages trap to the OS on read (load) or write (store)
     -- the OS can change the mapping and resume the application

3. Page table

Q: if you were re-inventing virtual memory, what data structure would
   you use for the translation?
   Recall our requirements for the translation:
   (i) runs fast, (ii) has small memory overhead, (iii) can be updated quickly.

A. Problem statement

   a page table conceptually implements a map from VPN --> PPN

   a page table is conceptually an index. the address is broken up into bits:

      [.............|........]
      [     VPN     | offset ]
            |            |
            v            |
          TABLE          |
            |            |
            v            v
      [     PPN     | offset ]  = physical address

   the top bits index into the page table; the contents at that index
   are the PPN. the bottom bits are the offset and are not changed by
   the mapping.

   physical address = PPN + offset
   (note: "+" here means "concatenate": for example, 123 "+" 456 => 123456)

   the result is that each page table entry expresses a mapping for a
   contiguous group of addresses.

B. a naive proposal (assume 32-bit addresses and 4KB pages):

   there is in the sky a 2^{20}-entry array that maps each virtual page
   to a *physical* page:

      table[VPN] => PPN

   EXAMPLE: if the OS wants a program to be able to use address
   0x402000 to refer to physical address 0x3000, then the OS
   conceptually adds an entry:

      table[0x402] = 0x3

   (virtual page number 1026 maps to physical page number 3;
    in decimal: table[1026] = 3)

   NOTE: the top 20 bits do the indirection. the bottom 12 bits just
   select where on the page the access takes place.
   -- the bottom bits are called the _offset_.
   -- so now all we have to do is create this mapping
   -- why is this hard? why not just create the mapping?
   -- Question: how large is this table?
   -- answer: you need, per process, roughly 4MB
      (2^{20} entries * 4 bytes per entry)
      [why 4 bytes per entry? in practice, it's convenient to have the
       entry size be the same as a data type on the machine]
      (a sketch of this naive design follows)
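A minimal sketch, in C, of the naive proposal; the names `table` and
`translate` are ours, for illustration only:

"""
#include <stdint.h>

#define PAGE_SHIFT 12                    /* 4KB pages */
#define NUM_PAGES  (1u << 20)            /* 2^20 virtual pages */

static uint32_t table[NUM_PAGES];        /* ~4MB per process! */

uint32_t translate(uint32_t va) {
    uint32_t vpn    = va >> PAGE_SHIFT;  /* top 20 bits */
    uint32_t offset = va & 0xfff;        /* bottom 12 bits */
    uint32_t ppn    = table[vpn];        /* the one-step lookup */
    return (ppn << PAGE_SHIFT) | offset; /* PPN "+" offset */
}
"""

With the example above, table[0x402] = 0x3 makes translate(0x402000)
return 0x3000.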
   -- too much! let's deal with this...

   [draw a black-box MMU: inputs are VAs; outputs are PAs]

   -- Question: if you were the MMU designer, how would you design the table?

C. what we use today: the page table

   PT: a radix tree
   [see a radix tree figure]

   Q: Why radix trees? Could you use other trees, like binary trees?

   * PT design space:
     -- offset
     -- page size
     -- address length
     -- addressable memory unit
     -- depth of the PT
     -- PTE size

4. Today's virtual memory

- What we have today:
  [virtual memory -> paging -> page table -> RV32]

  introducing the alternatives:
  * VA but not paging: segmentation
  * paging but not page table: hashing-based translation
  * page table but not RISC-V: x86 and ARM

- how is this translation implemented?
  -- a software(OS)-hardware(MMU) co-design
  -- in modern systems, hardware does the translation; this hardware is
     configured by the OS.
  -- this hardware is called the MMU, for memory management unit, and
     is part of the CPU
  -- why doesn't the OS just do the translation itself? similar to
     asking why we don't execute programs by running them on an
     emulation of a processor (too slow)

- things to remember in what follows:
  -- the OS is going to be setting up data structures that the hardware sees
  -- these data structures are *per-process* PTs

===

Virtual memory implementation
□ 1. RV32 page table intro
□ 2. (manually) walking the page table
---

1. Intro

[show overview fig]

* a toy example: VA 0x80001fec

  split it into the L1/L2 indexes and the offset:
  0x80001fec =>
  - L1 index: 1000 0000 00 => 0x200 => 2*16*16 = 512
  - L2 index: 0x1 => 1
  - offset:   0xfec

* detailed info:
  (1) satp [handout]
  (2) PTE [handout]
      Q: what if a PTE has W but not R?
      [a reserved encoding; treat it as undefined behavior]
  (3) VA [handout]

2. Walking the page table

* a real example: take a look at helloworld.c

  """
  int main(int argc, char **argv) {
      m_uint32 data = 0xdeadbeef;
      printf("%p\n", &data);
      asm("ecall");
  }
  """

* run it; it prints 0x80001fec, and the ecall then traps // error

  Q: what do you think this prints?
     gdb> p/x *0x80001fec
     [A: 0]

  Q: Why? We're in M-mode; we only see physical memory.

* simulate the CPU: a manual page walk

  Goal: find which physical address holds the data 0xdeadbeef.
  Method: use gdb to simulate the page walk.

  Steps:
  (1) split the VA into the L1/L2 indexes and the offset
  (2) get the root of the page table
  (3) calc the L1 page
  (4) calc the L2 page
  (5) calc the physical address

* now, let's do it:

  (1) split the VA into the L1/L2 indexes and the offset
      0x80001fec =>
        L1 index: 0x200 (512)
        L2 index: 0x1
        offset:   0xfec

  (2) get the root of the page table
      gdb> p/x $satp
      // 0x80080048

  (3) find the L1 page:
      Q: how to interpret 0x80080048?
      [the most significant bit is the MODE field; the low 22 bits are
       the PPN of the page-table root]

      Q: what is the L1 page?
      gdb> p/x 0x80048 << 12
      // 0x80048000

      Q: is this physical or virtual? [physical]

      Q: what is the L1 PTE address?
      gdb> p/x (0x80048 << 12) + 0x200*4
      // 0x200 is the L1 index in the VA; each PTE is 4 bytes

      Q: what is the L1 PTE content?
      gdb> p/x *((0x80048 << 12) + 0x200*4)
      // get 0x20014401 (the L1 PTE)

      Q: what does the "0x1" at the end mean? [the valid bit]

  (4) find the L2 page:
      Q: what is the L2 page?
      gdb> p/x (0x20014401 & ~(0x3ff)) >> 10 << 12
      // 0x80051000

      Q: what is the L2 PTE address?
      gdb> p/x 0x80051000 + 0x1*4
      // 0x1 is the L2 index in the VA

      Q: what is the L2 PTE content?
      gdb> p/x *(0x80051000 + 0x1*4)
      // 0x200158df

  (5) find the physical address
      Q: what does the "0xdf" at the end of 0x200158df mean?
      [the PTE flag bits (D/A/G/U/X/W/R/V); check out the PTE fig]

      Q: what is the data page?
      gdb> p/x (0x200158df & ~(0x3ff)) << 2
      // 0x80056000

      Q: what is the PA?
      gdb> p/x ((0x200158df & ~(0x3ff)) << 2) + 0xfec
      // 0x80056fec
      // dereference this address and you will see 0xdeadbeef (the data)
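The five steps above can be captured in one function. Below is a
sketch of our own (not from the lecture), assuming M-mode C code that
can dereference physical addresses directly, just as gdb did; the
names sv32_walk and read_satp are made up.

"""
#include <stdint.h>

#define PTE_V 0x1u                        /* valid bit */

static inline uint32_t read_satp(void) {
    uint32_t v;
    asm volatile("csrr %0, satp" : "=r"(v));
    return v;
}

/* Translate a VA by walking the two-level Sv32 page table.
 * Returns 0 if a PTE is invalid. */
uint32_t sv32_walk(uint32_t va) {
    uint32_t l1  = (va >> 22) & 0x3ff;    /* top 10 bits: L1 index  */
    uint32_t l2  = (va >> 12) & 0x3ff;    /* next 10 bits: L2 index */
    uint32_t off = va & 0xfff;            /* low 12 bits: offset    */

    /* satp[21:0] is the PPN of the page-table root */
    uint32_t *root = (uint32_t *)((read_satp() & 0x3fffff) << 12);

    uint32_t pte = root[l1];              /* L1 PTE */
    if (!(pte & PTE_V)) return 0;

    /* PTE[31:10] holds a PPN; "<< 2" is the same as ">> 10 << 12" */
    uint32_t *l2pt = (uint32_t *)((pte & ~0x3ffu) << 2);

    pte = l2pt[l2];                       /* L2 (leaf) PTE */
    if (!(pte & PTE_V)) return 0;

    return ((pte & ~0x3ffu) << 2) | off;  /* data page "+" offset */
}
"""

On the example above, sv32_walk(0x80001fec) reproduces 0x80056fec.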