Week 13.b
CS 5600
04/13 2022

On the board
------------

1. Last time
2. fs5600 interface
3. Crash recovery
    --intro
    --ad-hoc
    --copy on write

------------------------------------------------------------
Admin

 --no class next Monday

 --mmap HW
   --some report faster in seconds
   --those report speedup: mmap is faster ranging from 5x--1000x

 --many people start working on lab4; great; encourage you to start early

-------

1. Last time

   --files
    --Unix inode
      [a pretty cool recent work: ctFS, where they reuse VM for fs.
        Li, R., Ren, X., Zhao, X., He, S., Stumm, M. and Yuan, D.
        ctFS: Replacing file indexing with hardware memory translation through
        contiguous file allocation for persistent memory. FAST'22
      ]
    --fs5600 inode

   --directories
     --special file with <name, inode number> mapping
     --links: hard links & soft links

     Question: if I run
         $ touch /tmp/a; ln /tmp/a /tmp/b; ln -s /tmp/a /tmp/sb; rm /tmp/a

       What is the output of this cmd ?
         $ cat /tmp/b
       What is the output of this cmd?
         $ cat /tmp/sb
       and why?
         [see note from last time]

     --link cycles
       Question: can I create link cycles?  
          Yes.
       Question: can I create hardlink only cycles?  
          No. (by stop creating hardlink to dirs)
       Question: what can go wrong with hardlink cycles?

   --fs5600 design
     --layout, superblock, bitmap, root dir, inode, dir entry, mode, ...
     --see Lab4 instructions


2. fs5600 interface:

    ** path walk

       int inum = path2inum(char *pat);

       --how? for example, "/a/b/c/file"
       [answer:
         1. split path to tokens: ["a", "b", "c", "file"]
         2. starting from root dir (block 2)
         3. find "a" in "/" (block 2)
         4. get "a"'s inode
         5. get "b" in "/a/"
         6. ...
       ]

    ** fs_read - read data from a file

      --how? for example, "read("/a/file", buf, len, offset)" (pseudocode)
      [answer:
        1. path walk to find the file's inode
        2. find offset's block
           (all data blocks compose a linear space)
        3. read len bytes
      ]

    ** fs_write - write to a file

      --how? for example, "write("/a/file", buf, len, offset)" (pseudocode)
        [answer:
          1. path walk to find the file's inode
          2. allocate blocks if needed (len+offset > size)
          3. find offset's block
          4. write len bytes to the file
        ]

        Question: can you think of any metadata to update?
        [answer:
          - mtime
          - size
        ]

    ** fs_create - create a new (empty) file

      --how? for example, "create("/a/file", mode)" (pseudocode)
       [answer:
            1. path walk to get the inode of the parent folder ("/a/")
            2. allocate a block as "file"'s inode
            3. add "file" to "/a/"
       ]

      --Question: how many "block_write()" do you think will happen in a "fs_create"?
      [answer: at least 4 times:
        - 1 to bitmap (for allocating new blocks)
        - 1 to parent dir inode (metadata update, mtime)
        - 1 to parent dir data block (adding "file")
        - 1 to the "file"'s inode

        - and likely 1 data block to file's inode
      ]

     ** mkdir("/dir1/", 0644)

       --Question: what does "0644" mean?
       [answer: "rw-r--r--": owner can RW; group can R; others can R]

       [draw fs5600 layout and inode for "/"]

       --how it works?

        1. path walk to get the inode of the parent folder ("/")
        2. allocate an inode for "dir1" (say block#10)
        3. allocate a block for data and init the block (say block#11)
        4. add direntry "dir1" to parent dir's data (say block#3)
        5. update parent inode (for example, mtime)

      --Question: how many "block_write()" do you think will happen in this process?

      [answer: 5 times:
        - block#1:  bitmap, for allocating new blocks
        - block#10: create "dir1" inode
        - block#11: init data block
        - block#3:  add direntry "dir1" to the parent dir ("/")
        - block#2:  update metadata of the parent dir
      ]

      --in fact these writes can happen in different order
        --depends on your implementation
        --OS buffer-cache
        --underlying storage (the hardware)


3. Crash recovery
    --intro
    --ad-hoc
    --copy-on-write
    --journaling

    --There are a lot of data structures used to implement the file
    system: bitmap of free blocks, directories, inodes, indirect blocks,
    data blocks, etc.

        --We want these data structures to be *consistent*: we want
        invariants to hold

        --We also want to ensure that data on the disk remains consistent.

        --Thorny issue: *crashes* or power failures.

    --Making the problem worse is:
       (a) write-back caching and (b) non-ordered disk writes.

       --(a) means the OS delays writing back modified disk blocks.

       --(b) means that the modified disk blocks can go to the disk in an
         unspecified order.

    --Example: the above mkdir("/dir1", 0644)

      There are five writes:
        1. block#1:  bitmap, for allocating new blocks
        2. block#10: create "dir1" inode
        3. block#11: init data block
        4. block#3:  add direntry "dir1" to the parent dir ("/")
        5. block#2:  update metadata of the parent dir

        [note: writing to one block is guaranteed to be atomic by hardware.]

        crash.

        restart.

        uh-oh.

    --Question: assume synchronous writes, what the consequences of crash
      when happening in-between:

      1 and 2?  [losing track of two blocks]

      2 and 3?  [unreachable "dir1", garbage in "dir1"]

      3 and 4?  [unreachable "dir1"]

      4 and 5?  [inconsistent metadata in "/"]


    --Solution: the system requires a notion of atomicity

        --How to think about this stuff: imagine that a crash can happen
        at any time. (The only thing that happens truly atomically is a
        write of one or a few 512-byte disk sector.) So you want to 
        arrange for the world to look sane, regardless of where a 
        crash happens.

            --> A challenge here is that metadata and data is spread across
            several disk blocks (and hence several sectors), so increasing
            size of atomic unit is not sufficient.

            --> Your leverage, as file system designer, is that you can
            arrange for some disk writes to happen *synchronously*
            (meaning that the system won't do anything until these disk
            writes complete), and you can impose some ordering on the
            actual writes to the disk.

        --So we need to arrange for higher-level operations ("add data
        to file") to _look_ atomic: an update either occurs or it
        doesn't.

        --Potentially useful analogy: during our concurrency unit, we
        had to worry about arbitrary interleavings (which we then tamed
        with concurrency primitives). Here, we have to worry that a
        crash can happen at any time (and we will tame this with
        abstractions like transactions). The response in both cases is a
        notion of atomicity.

    --We will mention three approaches to crash recovery in file
    systems:

        A. Ad-hoc (OSTEP calls this "fsck")
        B. copy-on-write approaches 
        C. Journaling (also known as write-ahead logging)


    A. Ad-hoc

        --Goal: metadata consistency, not data consistency (rationale:
        too expensive to provide data consistency; cannot live without
        metadata consistency.)

        --Approach: arrange to send file system updates to the disk in
        such a way that, if there is a crash, **fsck** can clean up
        inconsistencies

    --example: mkdir("/dir1", 0644)
        1. block#1:  bitmap, for allocating new blocks
        2. block#10: create "dir1" inode
        3. block#11: init data block
        4. block#3:  add direntry "dir1" to the parent dir ("/")
        5. block#2:  update metadata of the parent dir


    here are the fixes for this case:

      1 and 2?  => recycle blocks

      2 and 3?  => re-init dir1
                   [dangerous when there is seemingly correct info
                    a fix: checksum]

      3 and 4?  => send to "/lost+found/"

      4 and 5?  => ? [likely ignore...]

    some other example cases:

      inode not marked allocated in bitmap --> only writes were to
      unallocated, unreachable blocks; the result is that the write
      "disappears"

      inode allocated, data blocks not marked allocated in bitmap -->
      fsck must update bitmap


    Disadvantages to this ad-hoc approach:

        (a) fsck's guarantees are unclear (hence ad-hoc)

        (b) need to get ad-hoc reasoning exactly right
            (sometimes based on fs implementations)

        (c) poor performance (synchronous writes of metadata) 

        --multiple updates to same block require that they be
        issued separately. for example, imagine two updates to
        same directory block. requires first complete before
        doing the second (otherwise, not synchronous)

        --more generally, cost of crash recoverability is
        enormous. (a job like "untar" could be 10-20x slower)

        (d) slow recovery: fsck must scan entire disk

        --recovery gets slower as disks get bigger. if fsck
        takes one minute, what happens when disk gets 10 times
        bigger?

            --essentially, fsck has to scan the entire disk

    B. Copy-on-write approaches

        -- Goal: provide both metadata and data consistency, by using
        more space. Rationale: disks have gotten larger, space is not
        at a premium.

        -- Used by filesystems like ZFS, btrfs and APFS.
           [For more details read The Zettabyte File System by
           Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee and
           Mark Shellenbaum. 
           https://www.cs.hmc.edu/~rhodes/courses/cs134/sp19/readings/zfs.pdf]

        -- Approach: never modify a block, instead always make a new
        copy. In detail:

            * The filesystem has a root block, which we refer to as the
            Uberblock (copying terminology from ZFS). The uberblock is
            the **only** block in the filesystem that is ever _modified_
            (as opposed to being fully written, which the rest of the
            blocks are).

            * An abstract example: update a leaf block

              [draw a tree with checksum]

              - remember: _never modify, only copy_. so the file
              system allocates a new block, and writes the new version
              of the data to the new block

             - but that in turn necessitates writing a new version of
             the inode (to point to the new version of the block)

             - and that in turn _changes the inode number_, which
             means that parents and any directories hard-linking to
             the file have to change
                 (for this to work, the inode has to store the
                 list of hard links.)

             - and that in turn means that _those_ directories'
             inodes have to change

             - and so on up to the uberblock.

             - the change is _committed_ -- in the crucial sense that
             after a crash the new version will be visible -- when
             and only when the uberblock is modified on disk.

        [continue next time]