Week 14.a
CS 5600
04/10 2023

1. last time
2. fs5600 interfaces (continued)
3. crash recovery
--------------------------------------

1. last time

  -- directories: briefly mention "hierarchical fs is dead"

  -- fs5600
    -- disk is an array of pages (4KB)
    -- first three pages have special meanings: superblock, bitmap, and root-inode
    -- fs5600 inode is 4KB; its data structure
    -- dir vs. file
    -- interfaces

2. fs5600 interfaces (continued)

   [continued from last time]

    ** fs_write - write to a file

      --how? for example, "write("/a/file", buf, len, offset)" (pseudocode)
        [answer:
          1. path walk to find the file's inode
          2. allocate blocks if needed (len+offset > size)
          3. find offset's block
          4. write len bytes to the file
        ]

        Question: can you think of any metadata to update?
        [answer:
          - mtime
          - size
        ]
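
        [a minimal sketch of the steps above, in Python rather than the
        lab's C. Inode, ToyDisk, and this fs_write signature are
        hypothetical simplifications, not fs5600's real structures; the
        path walk (step 1) is assumed to have already produced the inode.]

```python
import time
from dataclasses import dataclass, field

BLOCK_SIZE = 4096  # fs5600 page size


@dataclass
class Inode:  # hypothetical, simplified inode
    size: int = 0
    mtime: float = 0.0
    ptrs: list = field(default_factory=list)  # block numbers of data blocks


class ToyDisk:  # hypothetical in-memory "disk"
    def __init__(self, nblocks):
        self.blocks = [bytearray(BLOCK_SIZE) for _ in range(nblocks)]
        # blocks 0-2 are the superblock, bitmap, and root inode
        self.free = list(range(3, nblocks))

    def alloc(self):
        return self.free.pop(0)


def fs_write(disk, inode, buf, offset):
    """Write buf at byte offset; returns the number of bytes written."""
    end = offset + len(buf)
    # step 2: allocate blocks if the write extends past what is allocated
    while len(inode.ptrs) * BLOCK_SIZE < end:
        inode.ptrs.append(disk.alloc())
    # steps 3-4: find offset's block, then copy block by block
    pos = offset
    while pos < end:
        idx, off = divmod(pos, BLOCK_SIZE)
        n = min(BLOCK_SIZE - off, end - pos)
        blk = disk.blocks[inode.ptrs[idx]]
        blk[off:off + n] = buf[pos - offset:pos - offset + n]
        pos += n
    # metadata updates: size and mtime
    inode.size = max(inode.size, end)
    inode.mtime = time.time()
    return len(buf)
```

        [note that a write of 5000 bytes at offset 100 spans two 4KB
        blocks, so step 3 has to be repeated per block.]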

    ** fs_create - create a new (empty) file

      --how? for example, "create("/a/file", mode)" (pseudocode)
       [answer:
            1. path walk to get the inode of the parent folder ("/a/")
            2. allocate a block as "file"'s inode
            3. add "file" to "/a/"
       ]

      --Question: how many "block_write()" do you think will happen in a "fs_create"?
      [answer: at least 4 times:
        - 1 to bitmap (for allocating new blocks)
        - 1 to parent dir inode (metadata update, mtime)
        - 1 to parent dir data block (adding "file")
        - 1 to the "file"'s inode

        - and likely 1 more for a data block belonging to the new file
      ]


    ** mkdir("/dir1/", 0644)

     --Question: what does "0644" mean?
     [answer: "rw-r--r--": owner can RW; group can R; others can R]
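
     [a small sketch of how the nine permission bits decode. mode_to_string
     is a hypothetical helper, not part of fs5600; note that C writes the
     octal literal as 0644 while Python writes it as 0o644.]

```python
def mode_to_string(mode):
    """Render the low 9 permission bits the way `ls -l` does."""
    out = []
    for shift in (6, 3, 0):  # owner, group, others
        bits = (mode >> shift) & 0o7
        out.append("r" if bits & 4 else "-")
        out.append("w" if bits & 2 else "-")
        out.append("x" if bits & 1 else "-")
    return "".join(out)


print(mode_to_string(0o644))  # rw-r--r--
print(mode_to_string(0o755))  # rwxr-xr-x
```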

     [draw fs5600 layout and inode for "/"]

     --how does it work?

       1. path walk to get the inode of the parent folder ("/")
       2. allocate an inode for "dir1" (say block#10)
       3. allocate a block for data and init the block (say block#11)
       4. add direntry "dir1" to parent dir's data (say block#3)
       5. update parent inode (for example, mtime)

      --Question: how many "block_write()" do you think will happen in this process?

      [answer: 5 times:
        - block#1:  bitmap, for allocating new blocks
        - block#10: create "dir1" inode
        - block#11: init data block
        - block#3:  add direntry "dir1" to the parent dir ("/")
        - block#2:  update metadata of the parent dir
      ]

      --in fact these writes can happen in a different order
        --depends on your implementation
        --OS buffer-cache
        --underlying storage (the hardware)


3. Crash recovery
  --intro
  --ad-hoc
  --copy-on-write
  --journaling

  --There are a lot of data structures used to implement the file
    system: bitmap of free blocks, directories, inodes, indirect blocks,
    data blocks, etc.

      --We want these data structures to be *consistent*: we want
      invariants to hold

      --We also want to ensure that data on the disk remains consistent.

      --Thorny issue: *crashes* or power failures.

  --Making the problem worse is:
     (a) write-back caching and (b) non-ordered disk writes.

     --(a) means the OS delays writing back modified disk blocks.

     --(b) means that the modified disk blocks can go to the disk in an
       unspecified order.

  --Example: the above mkdir("/dir1", 0644)

    There are five writes:
      1. block#1:  bitmap, for allocating new blocks
      2. block#10: create "dir1" inode
      3. block#11: init data block
      4. block#3:  add direntry "dir1" to the parent dir ("/")
      5. block#2:  update metadata of the parent dir

      [note: writing to one block is guaranteed to be atomic by hardware.]

      crash.

      restart.

      uh-oh.

  --Question: assuming synchronous writes, what are the consequences of
    a crash that happens between:

    1 and 2?  [losing track of two blocks]

    2 and 3?  [unreachable "dir1", garbage in "dir1"]

    3 and 4?  [unreachable "dir1"]

    4 and 5?  [inconsistent metadata in "/"]
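
    [a sketch that replays the question mechanically: cut the five
    writes off after k of them and check a few invariants. The checker
    diagnose() and its three rules are assumptions made for this
    example, not an exhaustive list of file system invariants.]

```python
# The five mkdir("/dir1") writes, in the order listed above.
WRITES = [
    (1, "bitmap"),       # mark blocks 10 and 11 allocated
    (10, "dir1 inode"),
    (11, "dir1 data"),
    (3, "root data"),    # add direntry "dir1"
    (2, "root inode"),   # update mtime
]


def crash_after(k):
    """Return the set of block numbers that reached disk before the crash."""
    return {blkno for blkno, _ in WRITES[:k]}


def diagnose(on_disk):
    """Check simple invariants (hypothetical checker)."""
    problems = []
    allocated = 1 in on_disk
    linked = 3 in on_disk
    if allocated and not linked:
        problems.append("blocks 10 and 11 allocated but unreachable")
    if 10 in on_disk and 11 not in on_disk:
        problems.append("dir1 inode points at an uninitialized data block")
    if linked and 2 not in on_disk:
        problems.append("root inode metadata (mtime) is stale")
    return problems


for k in range(len(WRITES) + 1):
    print(k, diagnose(crash_after(k)))
```

    [only k=0 (nothing written) and k=5 (everything written) come back
    clean; every crash point in between leaves some inconsistency.]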


  --Solution: the system requires a notion of atomicity

      --How to think about this stuff: imagine that a crash can happen
      at any time. (The only thing that happens truly atomically is a
      write of one or a few 512-byte disk sectors.) So you want to
      arrange for the world to look sane, regardless of where a
      crash happens.

          --> A challenge here is that metadata and data are spread across
          several disk blocks (and hence several sectors), so increasing
          the size of the atomic unit is not sufficient.

          --> Your leverage, as file system designer, is that you can
          arrange for some disk writes to happen *synchronously*
          (meaning that the system won't do anything until these disk
          writes complete), and you can impose some ordering on the
          actual writes to the disk.

      --So we need to arrange for higher-level operations ("add data
      to file") to _look_ atomic: an update either occurs or it
      doesn't.

      --Potentially useful analogy: during our concurrency unit, we
      had to worry about arbitrary interleavings (which we then tamed
      with concurrency primitives). Here, we have to worry that a
      crash can happen at any time (and we will tame this with
      abstractions like transactions). The response in both cases is a
      notion of atomicity.

  --We will mention three approaches to crash recovery in file
  systems:

      A. Ad-hoc (OSTEP calls this "fsck")
      B. copy-on-write approaches 
      C. Journaling (also known as write-ahead logging)


  A. Ad-hoc

      --Goal: metadata consistency, not data consistency (rationale:
      too expensive to provide data consistency; cannot live without
      metadata consistency.)

      --Approach: arrange to send file system updates to the disk in
      such a way that, if there is a crash, **fsck** can clean up
      inconsistencies

  --example: mkdir("/dir1", 0644)
      1. block#1:  bitmap, for allocating new blocks
      2. block#10: create "dir1" inode
      3. block#11: init data block
      4. block#3:  add direntry "dir1" to the parent dir ("/")
      5. block#2:  update metadata of the parent dir


  here are the fixes for this case:

    1 and 2?  => recycle blocks

    2 and 3?  => re-init dir1
                 [dangerous when there is seemingly correct info
                  a fix: checksum]

    3 and 4?  => send to "/lost+found/"

    4 and 5?  => ? [likely ignore...]

  some other example cases:

    inode not marked allocated in bitmap --> only writes were to
    unallocated, unreachable blocks; the result is that the write
    "disappears"

    inode allocated, data blocks not marked allocated in bitmap -->
    fsck must update bitmap
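
  [a toy version of the bitmap-related fixes above, as a sketch. The
  in-memory layout (a set for the bitmap, dicts for inodes and
  directory data) and the function names are assumptions for this
  example. For simplicity it recycles every unreachable block; a real
  fsck would instead move a valid orphaned inode to "/lost+found/".]

```python
RESERVED = {0, 1, 2}  # superblock, bitmap, root inode


def reachable_blocks(inodes, dirdata, root=2):
    """Walk the directory tree from the root inode; return every block in use."""
    seen, stack = set(), [root]
    while stack:
        ino = stack.pop()
        if ino in seen or ino not in inodes:
            continue
        seen.add(ino)
        for blk in inodes[ino]["ptrs"]:
            seen.add(blk)
            if inodes[ino]["type"] == "dir":
                # directory data maps names to inode block numbers
                stack.extend(dirdata.get(blk, {}).values())
    return seen


def fsck(bitmap, inodes, dirdata):
    """Return a repaired bitmap plus a list of the actions taken."""
    used = reachable_blocks(inodes, dirdata) | RESERVED
    actions = []
    for blk in sorted(bitmap - used):
        actions.append(f"recycle leaked block {blk}")
    for blk in sorted(used - bitmap):
        actions.append(f"mark in-use block {blk} as allocated")
    return used, actions


# Crash between writes 1 and 2: bitmap updated, nothing else written.
fixed, actions = fsck({0, 1, 2, 3, 10, 11},
                      {2: {"type": "dir", "ptrs": [3]}},
                      {3: {}})
print(actions)  # ['recycle leaked block 10', 'recycle leaked block 11']
```

  [note that reachable_blocks has to visit every inode and directory
  block, which is why recovery time grows with the size of the file
  system.]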


  Disadvantages to this ad-hoc approach:

      (a) fsck's guarantees are unclear (hence ad-hoc)

      (b) need to get ad-hoc reasoning exactly right
          (sometimes based on fs implementations)

      (c) poor performance (synchronous writes of metadata) 

      --multiple updates to the same block must be issued
      separately. for example, imagine two updates to the same
      directory block: the first must complete before the second
      is issued (otherwise, the writes are not synchronous)

      --more generally, cost of crash recoverability is
      enormous. (a job like "untar" could be 10-20x slower)

      (d) slow recovery: fsck must scan entire disk

      --recovery gets slower as disks get bigger. if fsck
      takes one minute, what happens when disk gets 10 times
      bigger?

          --essentially, fsck has to scan the entire disk

  B. Copy-on-write approaches

      -- Goal: provide both metadata and data consistency, by using
      more space. Rationale: disks have gotten larger, space is not
      at a premium.

      -- Used by filesystems like ZFS, btrfs and APFS.
         [For more details read The Zettabyte File System by
         Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee and
         Mark Shellenbaum. 
         https://www.cs.hmc.edu/~rhodes/courses/cs134/sp19/readings/zfs.pdf]

      -- Approach: never modify a block, instead always make a new
      copy. In detail:

          * The filesystem has a root block, which we refer to as the
          Uberblock (copying terminology from ZFS). The uberblock is
          the **only** block in the filesystem that is ever _modified_
          (as opposed to being fully written, which the rest of the
          blocks are).

          * An abstract example: update a leaf block

            [draw a tree with checksum]

            - remember: _never modify, only copy_. so the file
            system allocates a new block, and writes the new version
            of the data to the new block

           - but that in turn necessitates writing a new version of
           the inode (to point to the new version of the block)

           - and that in turn _changes the inode number_, which
           means that parents and any directories hard-linking to
           the file have to change
               (for this to work, the inode has to store the
               list of hard links.)

           - and that in turn means that _those_ directories'
           inodes have to change

           - and so on up to the uberblock.

           - the change is _committed_ -- in the crucial sense that
           after a crash the new version will be visible -- when
           and only when the uberblock is modified on disk.


          * A concrete example: a modification to a file in an
          existing block
           [see handout]

            (a note: handout figures are for demonstration purposes. If you read
            the above ZFS paper, you will find that the "tree" is about disk
            blocks, which is an abstraction below the notions of files and
            directories.)

          * Note that the same thing happens when a user appends to a
          file, creating another block (and thus changing the inode,
          and so on). 

          * And the same thing happens when creating a file (because
          the directory inode has to change)

     -- Note that to enable this picture, the uberblock is designed to
     fit in a sector, in order to allow **atomic updates**.
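
     [a sketch of the abstract example above (update a leaf block).
     CowStore and its methods are hypothetical and only loosely follow
     ZFS terminology; this is not ZFS's on-disk format. Blocks are
     addressed by a counter that never reuses an address, so nothing
     is overwritten except the uberblock pointer.]

```python
class CowStore:
    """Toy copy-on-write block store (hypothetical)."""

    def __init__(self):
        self.blocks = {}       # address -> block contents, never overwritten
        self.next_addr = 0
        self.uberblock = None  # address of the current root

    def write_new(self, content):
        addr = self.next_addr  # always a fresh address
        self.next_addr += 1
        self.blocks[addr] = content
        return addr

    def read_path(self, path):
        """Follow names from the root down to a leaf's contents."""
        node = self.blocks[self.uberblock]
        for name in path:
            node = self.blocks[node[name]]
        return node

    def update(self, path, data):
        """Write a new leaf, copy every ancestor, return the new root address."""
        nodes = [self.blocks[self.uberblock]]  # existing nodes, root first
        for name in path[:-1]:
            nodes.append(self.blocks[nodes[-1][name]])
        child = self.write_new(data)  # new leaf
        for node, name in zip(reversed(nodes), reversed(path)):
            child = self.write_new({**node, name: child})  # new copy of parent
        return child

    def commit(self, root_addr):
        self.uberblock = root_addr  # the single in-place, atomic write


s = CowStore()
leaf = s.write_new("v1")
d = s.write_new({"file": leaf})
s.commit(s.write_new({"a": d}))

new_root = s.update(["a", "file"], "v2")
print(s.read_path(["a", "file"]))  # v1: a crash here leaves the old tree intact
s.commit(new_root)
print(s.read_path(["a", "file"]))  # v2
```

     [the three new blocks written by update() can reach the disk in
     any order; the change becomes visible only at commit().]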

      -- Benefits:

       * Most changes can be committed in **any order**.
           * The only requirement is that all changes be committed before the
           uberblock is updated.
           * The ability to reorder writes in this manner has performance benefits.

       * On-disk structure and data are **always** consistent. There is no need
         to run fsck or any other recovery after a crash.
           * Most of these filesystems also make use of checksums to handle cases
           where data is corrupted for other reasons.

       * Filesystem incorporates versioning similar to Git and other version control
         tools you may have used.
           * This requires not throwing away the old versions of
           the blocks after writing the new ones.

      -- Disadvantages:

       * Significant write amplification: any writes require changes to several
       disk blocks.

       * Significant space overheads: the filesystem needs enough space to copy
       metadata blocks in order to make any changes. 

      --Question: When a COW fs is almost full, is it a good idea to delete files?

        [answer: no! consider deleting a file that sits 10 directories deep:
        the delete must copy all 10 directory inodes before it completes,
        which may itself run out of disk space.]

       * Generally necessitates the use of a garbage collection daemon in order to
       reclaim blocks from old versions of the file-system.

  C. Journaling

      -- Copy on write showed that crash consistency is achievable when
      modifications **do not** modify (or destroy) the current copy. 

      Golden rule of atomicity, per Saltzer-Kaashoek:
      "never modify the only copy"

      -- Problem is that copy-on-write carries significant write and space overheads.
      Want to do better without violating the golden rule of atomicity.

      -- Going to do so by borrowing ideas from how transactions are implemented in databases.

      -- Core idea: Treat file system operations as transactions. Concretely, this means that
         after a crash, failure recovery ensures that:
          * Committed file system operations are reflected in on-disk data structures.
          * Uncommitted file system operations are not visible after crash recovery.

      -- Core mechanism: Record enough information to finish applying committed operations
         (*redo operations*) and/or roll back uncommitted operations (*undo operations*).
         This information is stored in a redo log or undo log. We discuss this in detail next.

  --concept: commit point---the point at which there's no turning back.

      --actions always look like this:
      --first step
      ....            [can back out, leaving no trace]
      --commit point
      .....           [completion is inevitable]
      --last step

      --Question: what's commit point when buying a house?

      --Question: what's the commit point in the copy-on-write
        protocol above?

        [answer: the uberblock is updated.]

      -- Redo logging
          * Used by Ext3 and Ext4 on Linux, going to discuss in that context.

          * Log is a fixed length ring buffer placed at the beginning of the disk
            (see handout).

          * Basic operations

              Step 1: planning
              filesystem computes what would change due to an operation. For instance,
              creating a new file involves changes to directory inodes, appending to a file 
              involves changes to the file's inode and data blocks.

              Step 2: begin txn
              the file system computes where in the log it can write this transaction,
              and writes a transaction begin record there (TxnBegin in the handout). This 
              record contains a transaction ID, which needs to be unique. The file system 
              **does not** need to wait for this write to finish and can immediately proceed to
              the next step.

              Step 3: journal write
              the file system writes a record or records detailing all the changes it computed in 
              step 1 to the log. The file system **must** now wait for these log changes and
              the TxnBegin record (step 2) to finish being written to disk.

              Step 4: commit txn
              once the TxnBegin record, and all the log records from step 3 have been
              written, the system writes a transaction end record (TxnEnd in the handout). 
              This record contains the same transaction ID as was written in Step 2, and the 
              transaction is considered committed once the TxnEnd record has been successfully written to disk.

              Step 5: checkpointing
              Once the TxnEnd record has been written, the filesystem asynchronously
              performs the actual file system changes; this process is called **checkpointing**. 
              While the system is free to perform checkpointing whenever it is convenient, 
              the checkpoint rate dictates the size of the log that the system must reserve.

          --Question: which step is the commit point?
              [answer: step 4; why? see recovery below]

          --Now, let's revisit crashes during these five steps:
            convince yourself that we're fine no matter when the fs crashes.

          * Crash recovery: During crash recovery, the filesystem needs to read through the logs,
            determine the set of **committed** operations, and then apply them. Observe that:
            -- The filesystem can determine whether a transaction is committed or not by comparing 
               transaction IDs in TxnBegin and TxnEnd records.
            -- It is safe to apply the same redo log multiple times. 

            Operationally, when the system is recovering from a crash, the system 
            does the following:

              Step 1: The file system starts scanning from the beginning of the log. 
              Step 2: Every time it finds a TxnBegin entry, it searches for a 
                  corresponding TxnEnd entry.
              Step 3: If matching TxnBegin and TxnEnd entries are found -- indicating that
                  the transaction is committed -- the file system applies (checkpoints) the
                  changes.
              Step 4: Recovery is completed once the entire log is scanned.

              Note: for redo logs, filesystems generally begin scanning the log from the
              **start of the log**.
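
              [a sketch of the write path and recovery described above. The
              record layout (tuples tagged "TxnBegin", "Write", "TxnEnd"), the
              Journal class, and recover() are assumptions for this example,
              not the Ext3/Ext4 journal format. The log is a plain list rather
              than a ring buffer, and checkpointing is left out.]

```python
class Journal:
    """Toy redo log whose records survive a crash (hypothetical format)."""

    def __init__(self):
        self.log = []  # stands in for the on-disk journal
        self.next_tid = 1

    def begin(self):
        tid = self.next_tid
        self.next_tid += 1
        self.log.append(("TxnBegin", tid))
        return tid

    def write(self, tid, blkno, content):
        self.log.append(("Write", tid, blkno, content))

    def commit(self, tid):
        self.log.append(("TxnEnd", tid))  # the commit point


def recover(log, disk):
    """Redo every committed transaction, scanning from the start of the log."""
    begun = {rec[1] for rec in log if rec[0] == "TxnBegin"}
    ended = {rec[1] for rec in log if rec[0] == "TxnEnd"}
    committed = begun & ended  # matching TxnBegin and TxnEnd
    for rec in log:
        if rec[0] == "Write" and rec[1] in committed:
            _, _, blkno, content = rec
            disk[blkno] = content  # idempotent: safe to apply again
    return disk


j = Journal()
t1 = j.begin()
j.write(t1, 10, "dir1 inode")
j.write(t1, 3, "root data + dir1")
j.commit(t1)
t2 = j.begin()
j.write(t2, 12, "half-finished")  # crash before TxnEnd is written

disk = recover(j.log, {})
print(disk)  # {10: 'dir1 inode', 3: 'root data + dir1'}
```

              [running recover() a second time over the same log leaves the
              disk unchanged, which is what makes a crash during recovery
              itself harmless.]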

          * What to log? 
          Observe that logging can double the amount of data written to disk.
          To improve performance, Ext3 and Ext4 allow users to choose what to log.
              * Default is to log only metadata. The idea here is that many people
                are willing to accept data loss/corruption after a crash, but
                keeping metadata consistent is important. This is because if metadata is
                inconsistent the FS may become unusable, as the data
                structures no longer have integrity.
              * Can change settings to force data to be logged, along with metadata.
                This incurs additional overheads, but prevents data loss on crash.

      -- Undo logging
         [next time]