Week 12.a
CS 5600, 11/23/2021

On the board
------------
1. Last time
2. Crash recovery
   --intro
   --ad-hoc
   --copy on write

--------------------------------------------------------------

Admin
--lab3 challenge grading

-------

1. Last time

--Unix inode
--directory
  --a special file with a name-to-inode mapping
  --links: hard links & soft links
--fs5600 design
  --layout, superblock, bitmap, root dir, inode, dir entry, mode, ...
  --see Lab4 instructions:
    https://naizhengtan.github.io/21fall/labtutorials/lab4/
--interfaces:
  --read, write, create, ...

--What does "rwx" mean for a dir?
  [try these examples:

     $ cd /tmp/
     $ mkdir r_dir w_dir x_dir
     $ ls -ld *_dir          // see the initial permissions

     $ chmod u-wx r_dir      // now "r_dir" has permission "r--"
     $ ls r_dir              // try these cmds; what're the results?
     $ cd r_dir
     $ touch r_dir/file

     $ chmod u-rw x_dir      // now "x_dir" has permission "--x"
     $ ls x_dir              // try these cmds; what're the results?
     $ cd x_dir
     $ touch x_dir/file

     $ chmod u-rx w_dir      // now "w_dir" has permission "-w-"
     $ ls w_dir              // try these cmds; what're the results?
     $ cd w_dir
     $ touch w_dir/file      // note: this will fail; why?
                             // [hint: add "x" to "w_dir" ("-wx")
                             //  to see if "touch" succeeds]
  ]

--mkdir("/dir1/", 0644)

  --Question: what does "0644" mean?
    [answer: "rw-r--r--": owner can RW; group can R; others can R]

  [draw fs5600 layout and inode for "/"]

  --how does it work?
    1. get the parent directory's inode ("/")
    2. allocate an inode for "dir1" (say block#10)
    3. allocate a block for data (say block#11) and init the block
    4. add direntry "dir1" to the parent dir's data (say block#3)
    5. update the parent inode's metadata (for example, mtime)

  --Question: how many "block_write()" calls do you think will happen in
    this process?
    [answer: 5 times:
       - block#1:  bitmap, for allocating the new blocks
       - block#10: create "dir1" inode
       - block#11: init data block
       - block#3:  add direntry "dir1" to the parent dir ("/")
       - block#2:  update metadata of the parent dir
    ]

  --in fact, these writes can happen in different orders depending on:
    --your implementation (you may choose to swap some of the writes)
    --the OS buffer cache
    --the underlying storage (the hardware)

2. Crash recovery

--There are a lot of data structures used to implement the file system:
  bitmap of free blocks, directories, inodes, indirect blocks, data
  blocks, etc.

--We want these data structures to be *consistent*: we want invariants
  to hold.

--We also want to ensure that data on the disk remains consistent.

--Thorny issue: *crashes* or power failures.

--Example: the above mkdir("/dir1", 0644)

  There are five writes:
    1. block#1:  bitmap, for allocating the new blocks
    2. block#10: create "dir1" inode
    3. block#11: init data block
    4. block#3:  add direntry "dir1" to the parent dir ("/")
    5. block#2:  update metadata of the parent dir
  [note: writing one block is guaranteed to be atomic by the hardware.]

  crash. restart. uh-oh.

--Question: assuming synchronous writes, what are the consequences of a
  crash that happens between:
    1 and 2?  [we lose track of two blocks]
    2 and 3?  [unreachable "dir1", garbage in "dir1"]
    3 and 4?  [unreachable "dir1"]
    4 and 5?  [inconsistent metadata in "/"]
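--To make the write sequence and these crash windows concrete, here is a
  minimal sketch in C. It is *not* the actual fs5600 code: the in-memory
  "disk", the buffer layout, and the stand-in helper below are invented
  for illustration. Only the five writes and the block numbers follow the
  example above, and each block write is assumed to be atomic.

    /* sketch: the five block writes of mkdir("/dir1", 0644),
       with the crash windows marked */

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BLOCK_SIZE 4096
    #define NUM_BLOCKS 1024

    static uint8_t disk[NUM_BLOCKS][BLOCK_SIZE];   /* stand-in for the device */

    /* stand-in for the lab's block_write(): one call == one atomic block write */
    static void block_write(const void *buf, int blknum)
    {
        memcpy(disk[blknum], buf, BLOCK_SIZE);
    }

    int main(void)
    {
        uint8_t bitmap[BLOCK_SIZE]     = {0};  /* block#1:  free-block bitmap        */
        uint8_t root_inode[BLOCK_SIZE] = {0};  /* block#2:  inode of "/"             */
        uint8_t root_data[BLOCK_SIZE]  = {0};  /* block#3:  direntries of "/"        */
        uint8_t dir1_inode[BLOCK_SIZE] = {0};  /* block#10: new inode for "dir1"     */
        uint8_t dir1_data[BLOCK_SIZE]  = {0};  /* block#11: new data block for "dir1"*/

        /* ... fill in the in-memory copies: mark #10 and #11 used in the
           bitmap, build the "dir1" inode, put "." and ".." in dir1_data,
           append the "dir1" direntry to root_data, bump the root inode's
           mtime ... (details omitted) */

        block_write(bitmap,     1);   /* write 1: allocate blocks #10, #11          */
                                      /* crash here: we lose track of two blocks    */
        block_write(dir1_inode, 10);  /* write 2: the new inode                     */
                                      /* crash here: "dir1" unreachable, garbage in it */
        block_write(dir1_data,  11);  /* write 3: "." and ".." entries              */
                                      /* crash here: "dir1" unreachable             */
        block_write(root_data,  3);   /* write 4: direntry "dir1" in "/"            */
                                      /* crash here: inconsistent metadata in "/"   */
        block_write(root_inode, 2);   /* write 5: update mtime etc. of "/"          */

        printf("mkdir done: 5 block writes\n");
        return 0;
    }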
[skipped in class:

  --Solution: the system requires a notion of atomicity

  --How to think about this stuff: imagine that a crash can happen at
    any time. (The only thing that happens truly atomically is a write
    of one or a few 512-byte disk sectors.) So you want to arrange for
    the world to look sane, regardless of where a crash happens.

      --> A challenge here is that metadata and data are spread across
          several disk blocks (and hence several sectors), so increasing
          the size of the atomic unit is not sufficient.

      --> Your leverage, as the file system designer, is that you can
          arrange for some disk writes to happen *synchronously* (meaning
          that the system won't do anything until those disk writes
          complete), and you can impose some ordering on the actual
          writes to the disk.

  --So we need to arrange for higher-level operations ("add data to
    file") to _look_ atomic: an update either occurs or it doesn't.

  --Potentially useful analogy: during our concurrency unit, we had to
    worry about arbitrary interleavings (which we then tamed with
    concurrency primitives). Here, we have to worry that a crash can
    happen at any time (and we will tame this with abstractions like
    transactions). The response in both cases is a notion of atomicity.
]

--We will mention three approaches to crash recovery in file systems:
    A. Ad-hoc (OSTEP calls this "fsck")
    B. Copy-on-write approaches
    C. Journaling (also known as write-ahead logging)

A. Ad-hoc

  --Goal: metadata consistency, not data consistency
    (rationale: too expensive to provide data consistency; cannot live
     without metadata consistency.)

  --Approach: arrange to send file system updates to the disk in such a
    way that, if there is a crash, **fsck** can clean up inconsistencies

  --example: mkdir("/dir1", 0644)
      1. block#1:  bitmap, for allocating the new blocks
      2. block#10: create "dir1" inode
      3. block#11: init data block
      4. block#3:  add direntry "dir1" to the parent dir ("/")
      5. block#2:  update metadata of the parent dir

    here are the fixes for this case (what can fsck do?):
      crash between 1 and 2?  => fsck can recycle the blocks
                    2 and 3?  => re-init dir1
                                 [dangerous when there is seemingly
                                  correct info; a fix: checksums]
                    3 and 4?  => send "dir1" to "/lost+found/"
                    4 and 5?  => ? [likely ignore...]

    some other example cases:

      inode not marked allocated in bitmap
        --> the only writes were to unallocated, unreachable blocks;
            the result is that the write "disappears"

      inode allocated, but its data blocks not marked allocated in bitmap
        --> fsck must update the bitmap

  --Disadvantages of this ad-hoc approach:

    (a) fsck's guarantees are unclear (hence "ad-hoc")

    (b) need to get the ad-hoc reasoning exactly right (and it sometimes
        depends on the details of the fs implementation)

    (c) poor performance (synchronous writes of metadata)
        --multiple updates to the same block must be issued separately.
          for example, imagine two updates to the same directory block:
          the first must complete before the second is issued
          (otherwise, the writes are not synchronous)
        --more generally, the cost of crash recoverability is enormous.
          (a job like "untar" could be 10-20x slower)

    (d) slow recovery: fsck must scan the entire disk
        --recovery gets slower as disks get bigger. if fsck takes one
          minute today, what happens when the disk gets 10 times bigger?
        --essentially, fsck has to read the entire disk
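  --To make the flavor of these repairs concrete, here is a toy sketch in
    C of an fsck-like pass that recomputes what the block bitmap should
    say from the inode table and repairs any discrepancies. This is not
    the real fsck and not the fs5600 on-disk format: the "struct inode",
    the array sizes, and the block numbers below are invented purely for
    illustration.

    /* toy fsck-style pass: derive the "correct" bitmap from the inodes,
       then fix the on-disk bitmap where it disagrees */

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_BLOCKS     64
    #define NUM_INODES     16
    #define PTRS_PER_INODE 4

    struct inode {
        bool in_use;                 /* is this inode allocated?       */
        int  ptrs[PTRS_PER_INODE];   /* block numbers; -1 if unused    */
    };

    /* toy "on-disk" state */
    static struct inode inodes[NUM_INODES];
    static bool bitmap[NUM_BLOCKS];  /* what the disk *says* is allocated */

    static void fsck_fix_bitmap(void)
    {
        bool should_be[NUM_BLOCKS] = {false};

        /* pass 1: every block referenced by an allocated inode must be
           marked allocated in the bitmap */
        for (int i = 0; i < NUM_INODES; i++) {
            if (!inodes[i].in_use)
                continue;
            for (int j = 0; j < PTRS_PER_INODE; j++) {
                int b = inodes[i].ptrs[j];
                if (b >= 0 && b < NUM_BLOCKS)
                    should_be[b] = true;
            }
        }

        /* pass 2: repair discrepancies */
        for (int b = 0; b < NUM_BLOCKS; b++) {
            if (should_be[b] && !bitmap[b]) {
                printf("block %d in use but not marked: marking it\n", b);
                bitmap[b] = true;        /* "fsck must update the bitmap" */
            } else if (!should_be[b] && bitmap[b]) {
                printf("block %d marked but unreferenced: recycling it\n", b);
                bitmap[b] = false;       /* e.g., crash between writes 1 and 2 */
            }
        }
    }

    int main(void)
    {
        /* simulate a crash between writes 1 and 2 of the mkdir example:
           the bitmap says blocks 10 and 11 are used, but no inode points
           to them yet */
        bitmap[10] = bitmap[11] = true;

        /* also simulate the reverse case: inode 3 is allocated and points
           to block 20, but block 20 was never marked in the bitmap */
        inodes[3].in_use  = true;
        inodes[3].ptrs[0] = 20;
        inodes[3].ptrs[1] = inodes[3].ptrs[2] = inodes[3].ptrs[3] = -1;

        fsck_fix_bitmap();
        return 0;
    }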
B. Copy-on-write approaches

  -- Goal: provide both metadata and data consistency, by using more
     space.
     Rationale: disks have gotten larger; space is not at a premium.

  -- Used by filesystems like ZFS, btrfs, and APFS.
     [For more details, read "The Zettabyte File System" by Jeff Bonwick,
      Matt Ahrens, Val Henson, Mark Maybee, and Mark Shellenbaum:
      https://www.cs.hmc.edu/~rhodes/courses/cs134/sp19/readings/zfs.pdf]

  -- Approach: never modify a block; instead, always make a new copy.
     In detail:

     * The filesystem has a root block, which we refer to as the
       Uberblock (copying terminology from ZFS). The uberblock is the
       **only** block in the filesystem that is ever _modified_ (as
       opposed to being written out in full, which is what happens to
       all the other blocks).

     * An abstract example: update a leaf block
       [draw a tree with checksums]
       - remember: _never modify, only copy_. so the file system
         allocates a new block and writes the new version of the data
         to that new block
       - but that in turn necessitates writing a new version of the
         inode (to point to the new version of the block)
       - and that in turn _changes the inode number_, which means that
         parents and any directories hard-linking to the file have to
         change (for this to work, the inode has to store the list of
         hard links)
       - and that in turn means that _those_ directories' inodes have
         to change
       - and so on, up to the uberblock.
       - the change is _committed_ -- in the crucial sense that after a
         crash the new version will be visible -- when and only when
         the uberblock is modified on disk.

     * A concrete example: a modification to a file in an existing
       block [see handout]

     * Note that the same thing happens when a user appends to a file,
       creating another block (and thus changing the inode, and so on).

     * And the same thing happens when creating a file (because the
       directory inode has to change).

  -- Note that to enable this picture, the uberblock is designed to fit
     in a single sector, in order to allow **atomic updates**.

  -- Benefits:

     * Most changes can be committed in **any order**.
       - The only requirement is that all changes be committed before
         the uberblock is updated.
       - The ability to reorder writes in this manner has performance
         benefits.

     * The on-disk structure and data are **always** consistent. There
       is no need to run fsck or do recovery after a crash.

     * Most of these filesystems also make use of checksums to handle
       cases where data is corrupted for other reasons.

     * The filesystem incorporates versioning similar to Git and other
       version control tools you may have used.
       - This requires not throwing away the old versions of the blocks
         after writing the new ones.

  -- Disadvantages:

     * Significant write amplification: any write requires changes to
       several disk blocks.

     * Significant space overhead: the filesystem needs enough space to
       copy metadata blocks in order to make any change.

       --Question: when a COW fs is almost full, is it a good idea to
         delete files?
         [answer: no! think of deleting a file that lives in a directory
          nested 10 levels deep: finishing the delete requires copying
          all 10 directory inodes along the path...which may itself run
          out of disk space.]

     * Generally necessitates a garbage collection daemon in order to
       reclaim blocks from old versions of the filesystem.

[next time]

C. Journaling