Week 12.b CS 5600 04/06 2022

On the board
------------
1. Last time
2. SSD
3. Intro to file systems

-------------------------------------------

Admin:

-Lab4
  --CS5600 file system
  --will be released this Sunday (04/10)
  --due in two weeks

---------

1. Last time -- I/O, disk continued

C. Common #s and performance

  --capacity: high 100s of GB; now TBs are common
  --platters: 8
  --number of cylinders: tens of thousands or more
  --sectors per track: ~1000
  --RPM: 10000
  --transfer rate: 50-150 MB/s

  Question: can you guess how long a disk's "mean time between
  failures" is?

  --mean time between failures: ~1 million hours (~100 years)
    (for disks in data centers, it's vastly less; for a provider like
    Google, even if they had very reliable disks, they'd still need an
    automated way to handle failures, because failures would be common
    (imagine 10 million disks: *some* will be on the fritz at any given
    moment). So what they do is buy cheap disks, even somewhat
    defective ones, which lets them save on hardware costs. They get
    away with it because they needed software and systems --
    replication and other fault-tolerance schemes -- to handle
    failures *anyway*.)

D. How the driver interfaces to the disk

  --Sectors
    [see again the handout's bootloader code from last time]
    --the disk interface presents a linear array of **sectors**
      --traditionally 512 bytes (moving to 4KB)
    --the disk maps logical sector #s to physical sectors
      --Zoning: puts more sectors on longer tracks
      --Track skewing: sector 0's position varies by track, but let the
        disk worry about it. Why? (for speed when doing sequential
        access)
      --Sparing: flawed sectors are remapped elsewhere
    --all of this is invisible to the OS. Stated more precisely, the OS
      does not know the logical-to-physical sector mapping.
      --in the old days (before 1990ish): the OS specified a platter,
        track, and sector (CHS: cylinder-head-sector); but who knows
        where it really is?
      --nowadays, the OS sees the disk as an array of sectors (LBA:
        logical block addressing); normally each sector is 512B.
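To make the CHS-vs-LBA contrast concrete, here is a minimal sketch of
the standard CHS-to-LBA conversion. The geometry constants below are
made up for illustration; a real disk hides its true geometry (zoning,
skewing, sparing) behind the LBA interface anyway:

```python
# Sketch: converting an old-style CHS (cylinder-head-sector) address to
# a flat LBA (logical block address). Geometry numbers are assumptions
# for illustration, not real values.

HEADS_PER_CYLINDER = 16   # assumed: platter surfaces per cylinder
SECTORS_PER_TRACK  = 63   # assumed: sectors on each track

def chs_to_lba(c, h, s):
    # CHS sector numbers traditionally start at 1, not 0.
    return (c * HEADS_PER_CYLINDER + h) * SECTORS_PER_TRACK + (s - 1)

# The very first sector of the disk:
assert chs_to_lba(0, 0, 1) == 0
# One full track later (same cylinder, next head):
assert chs_to_lba(0, 1, 1) == 63
```

This is the direction of the indirection: the OS just hands the disk a
single sector number, and the disk decides where those bytes really
live.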
  --Question: how many bits do we need to address a 1TB disk?
    (note: we will simplify here, assuming 1TB = 2^40 B; in reality, in
    the context of storage, 1TB = 1,000,000,000,000 B, or 1 trillion
    bytes)
  [answer:
    1 sector is 512B = 2^9 B; the entire disk has 1TB / 512B =
    2^40 / 2^9 = 2^31 sectors; to address each sector, we need at
    least 31 bits.
    In fact: "The current 48-bit LBA scheme was introduced in 2003 with
    the ATA-6 standard,[4] raising the addressing limit to
    2^48 × 512 bytes, which is exactly 128 PiB or approximately 144 PB."
    (from wiki: https://en.wikipedia.org/wiki/Logical_block_addressing)
  ]

E. Technology and systems trends

  --unfortunately, while seeks and rotational delay are getting a
    little faster, they have not kept up with the huge growth elsewhere
    in computers.
  --transfer bandwidth has grown about 10x per decade
  --the thing that is growing fast is disk density (bytes_stored/$);
    that's because density is less subject to the mechanical
    limitations
    --to improve density, you need to get the head close to the surface
    --[aside: what happens if the head contacts the surface? it's
      called a "head crash": it scrapes off the magnetic material ...
      and, with it, the data.]
  --disk accesses are a huge system bottleneck, and it's getting worse.
    So what to do?
    --the bandwidth increase lets the system (pre-)fetch large chunks
      for about the same cost as a small chunk.
    --so trade latency for bandwidth if you can get lots of related
      stuff at roughly the same time. How to do that?
    --by clustering the related stuff together on the disk: you can
      grab huge chunks of data without incurring a big cost, since you
      already paid for the seek + rotation.
  --in fact, local network latency is much smaller than disk latency
    now.
    [Diego Ongaro, Stephen M. Rumble, Ryan Stutsman, John Ousterhout,
    and Mendel Rosenblum. Fast Crash Recovery in RAMCloud.
    SOSP'11]
  --the saving grace for big systems is that memory size is increasing
    faster than typical workload size
    --result: more and more of the workload fits in the file cache,
      which in turn means that the profile of traffic to the disk has
      changed: it's now mostly writes and new data.
    --which means logging and journaling become viable (more on this
      over the next few classes)

2. SSD: solid state drives

[see handout week11.a]

  --hardware organization
    --semiconductor-based flash memory
    --stores data electrically, instead of magnetically
    --a flash bank contains blocks
      --blocks (or erase blocks) are of size 128KB or 256KB
    --a block contains pages
      --pages are 4KB to 16KB

  --operations
    --read: a page
    --erase: a block, resetting all bits to 1
    --program: a page, setting some bits to 0
      (you cannot program a page twice without erasing)
    --(logical) write: a combination of erase and program operations

  --Question: can you imagine how to update a page A in a single-block
    flash? (which of course is a little bit too small...)
  [answer:
    1. copy the other pages in the block to other places (where?
       anywhere: memory or disk)
    2. erase the entire block
    3. program page A with the wanted contents
    4. copy the other pages back to their positions
  ]
  --this echoes "writes are more expensive than reads", which appears
    in many places. (probably something deeper about it.)

  --performance
    --read: tens of us
    --erase: several ms
    --program: hundreds of us

  --a bummer: wear-out -- a block can bear about 10,000 to 100,000
    erases, then it becomes unusable

  --FTL: flash translation layer
    --read/write logical blocks -->FTL--> read/erase/program physical
      blocks
      (note: the "blocks" in logical blocks and physical blocks are
      different things: logical blocks as in the device interface,
      physical blocks as in the flash hardware)
    --Question: if you were the FTL, how would you mitigate wear-out?
    [answer: spread the erases/programs evenly across the blocks.]

  --a log-structured FTL
    --idea:
      --on a write, append the write to the next free page (called
        logging).
      --on a read, find the data by keeping a map from logical data to
        physical pages.

    ** an example:
    --given: a flash bank with three blocks; each has two pages.
    --there are three writes to pages:
        write(logic_page_1)  [short as LP1]
        write(logic_page_10) [short as LP10]
        write(logic_page_99) [short as LP99]
    --what will happen:

               +-----------------------------+
        blocks | block 0 | block 1 | block 2 |
               +---------+---------+---------+
        pages  | P1 | P2 | P3 | P4 | P5 | P6 |
               +----+----+----+----+----+----+
        data   |LP1 |LP10|LP99|    |    |    |
               +----+----+----+----+----+----+

        mapping: LP1 => P1, LP10 => P2, LP99 => P3

    Question: what will happen if the following op is
    write(logic_page_1')?
    [answer:
               +-----------------------------+
        blocks | block 0 | block 1 | block 2 |
               +---------+---------+---------+
        pages  | P1 | P2 | P3 | P4 | P5 | P6 |
               +----+----+----+----+----+----+
        data   |LP1 |LP10|LP99|LP1'|    |    |
               +----+----+----+----+----+----+

        mapping: LP1 => P4, LP10 => P2, LP99 => P3
    ]
    --notice: P1 now contains old (invalid) data and is useless.
    --hence, SSDs require *garbage collection* (GC):
      --say we want to GC block 0
      --move the still-useful pages in that block (i.e., P2[LP10]) to
        other free places
      --erase block 0 (now both P1 and P2 can be used again)

  --complicated internals, hence sometimes unpredictable latency
    --predicting latency? see:
      [Hao, Mingzhe, Levent Toksoz, Nanqinqin Li, Edward Edberg Halim,
      Henry Hoffmann, and Haryadi S. Gunawi. "LinnOS: Predictability on
      unpredictable flash storage with a light neural network.",
      OSDI'20]

3. Intro to file systems

  --what does a FS do?
    1. provides persistence (data doesn't go away ... ever)
    2. gives a way to "name" a set of bytes on the disk (files)
    3. gives a way to map from human-friendly names to those "names"
       (directories)

  --a few quick notes about disks in the context of FS design
    --the disk/SSD is the first thing we've seen that (a) doesn't go
      away; and (b) we can modify (BIOS ROM, hardware configuration,
      etc. don't go away, but we weren't able to modify those things).
      two implications here:
        (i) we're going to have to put all of our important state on
            the disk
        (ii) we have to live with what we put on the disk!
             scribble randomly on memory --> reboot and hope it doesn't
             happen again.
             scribble randomly on the disk --> now what? (answer: in
             many cases, we're hosed.)

  --where are FSes implemented?
    --you can implement them on disk, over the network, in memory, in
      NVRAM (non-volatile RAM), on tape, with paper (!!!!)
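The two naming layers that a FS provides (items 2 and 3 above) can be
illustrated with a toy in-memory sketch. All names here are invented
for illustration, and of course a real FS keeps this state on disk --
that is exactly what gives it persistence:

```python
# Toy illustration of the FS naming layers (in memory only, so no
# persistence): files are sets of bytes "named" by a number (think
# inode #); directories map human-friendly names to those numbers.

files = {}        # file "name" (a number) -> bytes on "disk"
directory = {}    # human-friendly name -> file "name"
next_fileno = 0

def create(human_name, data):
    global next_fileno
    fileno = next_fileno        # the FS-internal "name" for the bytes
    next_fileno += 1
    files[fileno] = data
    directory[human_name] = fileno   # the directory's whole job
    return fileno

def read(human_name):
    # two lookups: human name -> "name" -> bytes
    return files[directory[human_name]]

create("notes.txt", b"file systems!")
assert read("notes.txt") == b"file systems!"
```

The point of the indirection: programs deal in human-friendly names,
while the FS is free to place (and move) the underlying bytes wherever
it likes, referring to them only by its internal "names".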