Week 13.a CS 5600 04/03 2023

1. SSD continued
2. Intro to fs
3. Files

--------------------------------------

Admin:
- Lab3 grader and regrading
- Lab5 will be released on Wed (04/05), due in two weeks
  -- you will get grading scripts with Lab5
- the schedule got updated

---------

1. SSD continued

--operations
  --read: a page
  --erase: a block, resetting all bits to 1
  --program: a page, setting some bits to 0
    (you cannot program a page twice without erasing)

--performance
  --read: tens of us
  --erase: several ms
  --program: hundreds of us

--a bummer: wear-out
  -- a block can bear about 10,000 to 100,000 erases, then becomes unusable

QUESTION: what if one needs to frequently update a page?

--FTL: flash translation layer
  --read/write logical blocks -->FTL--> read/erase/program physical blocks
    (note, the "blocks" in logical blocks and physical blocks are
     different: logical blocks as in the device interface, physical
     blocks as in the flash hardware)

--Question: if you were the FTL, how would you mitigate wear-out?
  [answer: evenly spread the erases/programs across blocks.]

--a log-structured FTL
  --idea:
    --when writing, append the write to the next free page (called logging).
    --when reading, keep track of where the data are by mapping logical
      data to physical pages.

** an example:
  --Given: a flash bank has two blocks; each has two pages.
  --there are three writes to pages:
      write(logic_page_1)
      write(logic_page_10)
      write(logic_page_99)
  --what will happen:

             +-------------------+
     blocks  | block 0 | block 1 |
             +---------+---------+
     pages   | P1 | P2 | P3 | P4 |
             +----+----+----+----+
     data    |LP1 |LP10|LP99|    |
             +----+----+----+----+

     mapping: LP1 => P1, LP10 => P2, LP99 => P3

  Question: what will happen if the following op is write(logic_page_1')?
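The append-and-remap idea above can be sketched in a few lines of Python. This is a toy model for the 2-block/2-page example only (the class name `LogFTL` and its methods are made up for illustration, not a real driver API); physical pages are numbered 0-3 here, corresponding to P1-P4 in the diagram.

```python
# Toy log-structured FTL for a 2-block x 2-page flash bank (illustrative only).
class LogFTL:
    def __init__(self, num_blocks=2, pages_per_block=2):
        self.num_pages = num_blocks * pages_per_block
        self.next_free = 0    # next physical page to program (logging)
        self.mapping = {}     # logical page -> physical page

    def write(self, logical_page):
        if self.next_free >= self.num_pages:
            raise RuntimeError("bank full: need garbage collection")
        # remap; the old physical page (if any) now holds invalid data
        self.mapping[logical_page] = self.next_free
        self.next_free += 1

    def read(self, logical_page):
        return self.mapping[logical_page]   # physical page with latest data

ftl = LogFTL()
for lp in (1, 10, 99):
    ftl.write(lp)
ftl.write(1)            # update LP1: appended to the next free page, not P1
print(ftl.read(1))      # -> 3 (i.e., P4 in the diagram, 0-indexed here)
print(ftl.read(10))     # -> 1 (LP10 still in P2)
```

Note that updating LP1 never touches P1; the write simply goes to the log head and the map entry moves, which is exactly why GC is needed later.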
  [answer:
             +-------------------+
     blocks  | block 0 | block 1 |
             +---------+---------+
     pages   | P1 | P2 | P3 | P4 |
             +----+----+----+----+
     data    |LP1 |LP10|LP99|LP1'|
             +----+----+----+----+

     mapping: LP1 => P4, LP10 => P2, LP99 => P3
  ]
  --Notice: P1 now contains old (invalid) data and is useless.
  --hence, SSDs require *garbage collection* (GC)
    -- say we want to GC block 0
    -- move the live pages (i.e., P2[LP10]) in that block to other free pages
    -- erase block 0 (now both P1 and P2 can be used again)

--complicated internals, hence sometimes unpredictable latency
  --predicting latency? see below
    [Hao, Mingzhe, Levent Toksoz, Nanqinqin Li, Edward Edberg Halim,
     Henry Hoffmann, and Haryadi S. Gunawi. "LinnOS: Predictability on
     unpredictable flash storage with a light neural network.", OSDI'20]

2. Intro to file systems

--what does a FS do?
  1. provide persistence (don't go away ... ever)
  2. give a way to "name" a set of bytes on the disk (files)
  3. give a way to map from human-friendly names to "names" (directories)

--a few quick notes about disks in the context of FS design
  --disk/SSD are the first thing we've seen that (a) doesn't go away; and
    (b) we can modify (BIOS ROM, hardware configuration, etc. don't go
    away, but we weren't able to modify those things).
    two implications here:
    (i) we're going to have to put all of our important state on the disk
    (ii) we have to live with what we put on the disk!
         scribble randomly on memory --> reboot and hope it doesn't
         happen again.
         scribble randomly on the disk --> now what?
         (answer: in many cases, we're hosed.)

--where are FSes implemented?
  --can implement them on disk, over network, in memory, in NVRAM
    (non-volatile RAM), on tape, with paper (!!!!)
  -- Lab5 simulates a disk that abstracts an array of blocks

3. Files

--what is a file?
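The two GC steps above (copy live pages out, then erase the block) can be sketched as a toy function. All names here are hypothetical; for the copy step to have somewhere to go, this run assumes a simpler state than the full diagram: only LP1 and LP10 were written, then LP1 was updated, so physical page 3 (P4) is still free.

```python
# Toy GC for the 2-block flash bank: pages 0-1 are block 0, pages 2-3 block 1.
PAGES_PER_BLOCK = 2

def gc_block(mapping, free_pages, victim_block):
    """Copy live pages out of victim_block; return its pages, now erasable."""
    victim = range(victim_block * PAGES_PER_BLOCK,
                   (victim_block + 1) * PAGES_PER_BLOCK)
    for lp, pp in list(mapping.items()):
        if pp in victim:                     # live page inside the victim block:
            mapping[lp] = free_pages.pop(0)  # copy it to a free page elsewhere
    return list(victim)                      # after erase, all usable again

# assumed state: LP1 was updated into P3, so P1 holds stale LP1, P4 is free
mapping = {1: 2, 10: 1}
freed = gc_block(mapping, free_pages=[3], victim_block=0)
print(mapping)   # {1: 2, 10: 3} -- LP10 copied out to P4
print(freed)     # [0, 1]        -- block 0 can now be erased
```

The stale copy of LP1 in page 0 needs no copying: nothing in the map points to it, so the erase simply reclaims it.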
  --answer from user's view: a bunch of named bytes on the disk
  --answer from FS's view: collection of disk blocks
  --big job of a FS: map name and offset to disk blocks

                     FS
      {file,offset} ----> disk address

  [also called file mapping. see an introduction in S2 of this paper
   (which btw is a cool paper about persistent memory):
     Neal, Ian, Gefei Zuo, Eric Shiple, Tanvir Ahmed Khan, Youngjin Kwon,
     Simon Peter, and Baris Kasikci. "Rethinking File Mapping for
     Persistent Memory.", FAST'21]

--operations are create(file), delete(file), read(), write()

-- note that we have seen translation/indirection before: (introducing inode)

                    page table
      per-process: VA ----------> PA

                       SOMETHING
      per-file: offset ---------> disk block address

  In fs, this SOMETHING is called an inode.

  then, how do we get the inode?

                 directory
      file name ------------> inode

  [Analogies:
      per-process memory <-> file
      multiple processes <-> multiple files
      page table         <-> inode
      memory chip        <-> disk/SSD]

- File design parameters:
  * small files (most files are small) vs. large files (much of the disk
    is allocated to large files)
  * access patterns: sequential access vs. random access vs. keyed access
  * disk utilization (metadata overhead and fragmentation)

- access patterns we could imagine supporting:
  (i) Sequential:
      --file data processed in sequential order
      --by far the most common mode
      --example: editor writes out new file, compiler reads in file, etc.
  (ii) Random access:
      --address any block in the file directly without linear scanning
      --examples: large data set, demand paging, databases
  (iii) Keyed access:
      --search for the block with particular values
      --examples: associative database, index
      --this thing is everywhere in the field of databases and search
        engines, but...
      --...usually not provided by a FS in the OS

Question: if you were a fs designer, which structure would you use for
different parameters?
  [answer: depends on what workloads the fs is going to meet]

candidate designs........
  A. contiguous
  B. linked files
  C. indexed files

A.
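The two levels of translation above (directory: name -> inode, inode: offset -> disk block) can be sketched together. This is a toy model, not a real FS: the inode is just a per-file array of block numbers, and the file name, i-number, and block addresses are all made up.

```python
# Sketch of the FS's big job: {file, offset} -> disk block address.
BLOCK_SIZE = 512

inodes = {                 # i-number -> per-file array of disk block addrs
    7: [120, 98, 400],     # a 3-block file (hypothetical block numbers)
}

directory = {"notes.txt": 7}   # human-friendly name -> i-number

def lookup(name, offset):
    inode = inodes[directory[name]]            # directory: name -> inode
    block_addr = inode[offset // BLOCK_SIZE]   # inode: offset -> disk block
    return block_addr, offset % BLOCK_SIZE     # block + offset within it

print(lookup("notes.txt", 600))   # -> (98, 88): byte 600 lives 88 bytes
                                  #    into the file's 2nd block
```

Note the exact parallel to paging: `directory` plays the role of the process table (which page table?), and the inode's block array plays the role of the page table (which frame?).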
contiguous allocation ("extent based")
  --when creating a file, make the user pre-specify its length, and
    allocate all the space at once
  --file metadata contains location and size
  --example: IBM OS/360

      [ a1 a2 a3 <5 free> b1 b2 ]

      what if a file c needs 7 sectors?!

  +: simple
  +: fast access, both sequential and random
  -: fragmentation

  [somewhat regained interest due to PM:
     Li, Ruibin, Xiang Ren, Xu Zhao, Siwei He, Michael Stumm, and Ding
     Yuan. "ctFS: Replacing file indexing with hardware memory
     translation through contiguous file allocation for persistent
     memory." FAST'22]

B. linked files ("linked list based")
  --keep a linked list of free blocks
  --metadata: pointer to the file's first block
  --each block holds a pointer to the next one
  +: no more fragmentation
  +: sequential access easy (and probably mostly fast, assuming decent
     free space management, since the pointers will point close by)
  -: random access is a disaster
  -: pointers take up room in blocks
     (overhead: sizeof(pointer) / sizeof(block));
     messes up alignment of data

C. indexed files

  [DRAW PICTURE]

  --each file has an array holding all of its block pointers
    --like a page table, so similar issues crop up
  --allocate this array on file creation
  --allocate blocks on demand (using free list or bitmap)
  +: sequential and random access are both easy
  -: need to somehow store the array
     --large possible file size --> lots of unused entries in the block array
     --large actual block size --> huge contiguous disk chunk needed
     --solve the problem the same way we did for page tables.
     --okay, so now we're not wasting disk blocks, but what's the problem?
       (answer: equivalent issues as for page-table walking: here, it's
        extra disk accesses to look up the blocks)

--this motivates the classic Unix file system

--Unix inode: [draw on board]
    permissions
    times for file access, file modification, and inode-change
    link count (# directories containing the file)
    ptr 1 --> data block
    ptr 2 --> data block
    ptr 3 --> data block
    .....
    ptr 11 --> indirect block --> [ptr, ptr, ptr, ptr, ptr, ...]
    ptr 12 --> indirect block
    ptr 13 --> double indirect block
    ptr 14 --> triple indirect block

  This is just a tree.
  [deja vu? that's right, we talked about similar things when we argued
   for different designs of page tables.]

  Question: why is this tree intentionally imbalanced?
  (Answer: to optimize for short files. each level of this tree requires
   a disk seek...)

  Pluses/minuses:
  +: simple, easy to build, fast access to small files
  +: maximum file length can be enormous, with multiple levels of indirection
  -: worst-case # of accesses pretty bad
  -: worst-case overhead (such as an 11-block file) pretty bad
  -: because you allocate blocks by taking them off an unordered free
     list, metadata and data get strewn across the disk

* Notes about inodes:
  --stored in a fixed-size array
  --size of the array is fixed when the disk is initialized; can't be changed
    [why? easier for the fs to find inodes, and fewer disk accesses
     (better performance!)]
  --multiple inodes in a disk block (which is not the case for your Lab5)
  --Question: how many files can the following toy-fs have?
      sizeof(inode) = 128B
      sizeof(block) = 512B
      toy-fs uses 1000 blocks to store inodes
    [answer: the number of inodes (hence files) is
     (512B / 128B) * 1000 = 4000]
  -- use "$ df -i ~" to see how many inodes you can use.
  --inodes live in a known location; originally at one side of the disk,
    now they live in pieces across the disk (helps keep metadata close
    to data)
  --the index of an inode in the inode array is called an ***i-number***
  --internally, the OS refers to files by i-number
  --when a file is opened, its inode is brought into memory
  --written back when modified and the file is closed or time elapses
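The imbalanced-tree math above can be made concrete: given a file offset, how many levels of indirection does the lookup cross? The constants below (512B blocks, 4B pointers, 10 direct pointers) are assumptions for illustration; real FSes pick different numbers, but the shape of the calculation is the same.

```python
# Which pointer class of a classic Unix-style inode covers a file offset?
BLOCK = 512
PTRS_PER_BLOCK = BLOCK // 4   # 128 4-byte pointers fit in an indirect block
NDIRECT = 10                  # assumed number of direct pointers

def level_for(offset):
    """Levels of indirection to reach the data block (0 = direct pointer)."""
    blk = offset // BLOCK
    if blk < NDIRECT:
        return 0              # direct: one disk access for the data
    blk -= NDIRECT
    if blk < PTRS_PER_BLOCK:
        return 1              # single indirect
    blk -= PTRS_PER_BLOCK
    if blk < PTRS_PER_BLOCK ** 2:
        return 2              # double indirect
    return 3                  # triple indirect

print(level_for(4 * 1024))         # -> 0: a small file stays all-direct
print(level_for(1 * 1024 * 1024))  # -> 2: ~1MB offsets need double indirection
```

Each level returned is an extra disk read before the data block itself, which is exactly why the tree is lopsided toward cheap access for short files.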