Week 11.b CS3650 03/20 2024
https://naizhengtan.github.io/24spring/

1. fs updates
2. Crash recovery
   --intro
   --ad-hoc
   --copy-on-write

------
* Lab3 challenge

1. fs updates

--your lab4 is read-only
--one missing part: managing the blocks
  --bitmap (say block#1)
  --one bit represents a block: 0 - free; 1 - used

** mkdir("/dir1/", 0644)

--Question: what does "0644" mean?
 [answer: "rw-r--r--": owner can RW; group can R; others can R]

[draw fs3650 layout and inode for "/"]

--how does it work?
  1. path walk to get the inode of the parent folder ("/")
  2. allocate an inode for "dir1" (say block#10)
  3. allocate a block for data and init the block (say block#11)
  4. add direntry "dir1" to parent dir's data (say block#3)
  5. update parent inode (for example, mtime)

--Question: how many block writes will happen in this process?
 [answer: 5 times:
   - block#1: bitmap, for allocating new blocks
   - block#10: create "dir1" inode
   - block#11: init data block
   - block#3: add direntry "dir1" to the parent dir ("/")
   - block#2: update metadata of the parent dir
  (a C sketch of these five writes appears at the end of the
  crash-recovery intro below)]

--in fact, these writes can happen in different orders, depending on:
  --your implementation
  --the OS buffer cache
  --the underlying storage (the hardware)

2. Crash recovery

--There are a lot of data structures used to implement the file system:
  bitmap of free blocks, directories, inodes, indirect blocks, data
  blocks, etc.

--We want these data structures to be *consistent*: we want invariants
  to hold.

--We also want to ensure that the data on the disk remains consistent.

--Thorny issue: *crashes* or power failures.

--Making the problem worse are: (a) write-back caching and (b) non-ordered
  disk writes.
  --(a) means the OS delays writing back modified disk blocks.
  --(b) means that the modified disk blocks can go to the disk in an
    unspecified order.

--Example: the above mkdir("/dir1", 0644)

  There are five writes:
    1. block#1: bitmap, for allocating new blocks
    2. block#10: create "dir1" inode
    3. block#11: init data block
    4. block#3: add direntry "dir1" to the parent dir ("/")
    5. block#2: update metadata of the parent dir

  [note: writing to one block is guaranteed to be atomic by hardware.]

  crash. restart. uh-oh.

--Question: assuming synchronous writes, what are the consequences of a
  crash that happens in between:
    1 and 2? [losing track of two blocks]
    2 and 3? [unreachable "dir1", garbage in "dir1"]
    3 and 4? [unreachable "dir1"]
    4 and 5? [inconsistent metadata in "/"]

--Solution: the system requires a notion of atomicity.

--How to think about this stuff: imagine that a crash can happen at any
  time. (The only thing that happens truly atomically is a write of one or
  a few 512-byte disk sectors.) So you want to arrange for the world to
  look sane, regardless of where a crash happens.

  --> A challenge here is that metadata and data are spread across several
      disk blocks (and hence several sectors), so increasing the size of
      the atomic unit is not sufficient.

  --> Your leverage, as file system designer, is that you can arrange for
      some disk writes to happen *synchronously* (meaning that the system
      won't do anything until these disk writes complete), and you can
      impose some ordering on the actual writes to the disk.

--So we need to arrange for higher-level operations ("add data to file")
  to _look_ atomic: an update either occurs or it doesn't.

--Potentially useful analogy: during our concurrency unit, we had to worry
  about arbitrary interleavings (which we then tamed with concurrency
  primitives). Here, we have to worry that a crash can happen at any time
  (and we will tame this with abstractions like transactions). The
  response in both cases is a notion of atomicity.
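--Below is a minimal C sketch of the five writes, with the crash windows
  marked. This is NOT lab4's actual code: the block numbers follow the
  lecture example, but the struct layouts and the read_block()/
  write_block() helpers (backed here by a toy in-memory "disk") are made
  up for illustration.

    #include <stdint.h>
    #include <string.h>
    #include <time.h>

    #define BLOCK_SIZE   4096
    #define NBLOCKS      1024
    #define BITMAP_BLOCK 1                /* block#1: free-block bitmap */

    static uint8_t disk[NBLOCKS][BLOCK_SIZE];   /* toy "disk" */

    static void read_block(uint32_t b, void *buf, size_t len)
    {
        memcpy(buf, disk[b], len);
    }

    static void write_block(uint32_t b, const void *buf, size_t len)
    {
        memcpy(disk[b], buf, len);        /* one block write is atomic */
    }

    struct inode {
        uint32_t mode, nlink, size;
        uint32_t data_block;        /* one direct pointer, for brevity */
        int64_t  mtime;
    };

    struct dirent {
        uint32_t inum;
        char     name[28];
    };

    void mkdir_dir1(void)
    {
        uint8_t buf[BLOCK_SIZE];

        /* step 0: path walk found "/": inode at block#2, data at block#3 */

        /* write 1 -- block#1: bitmap, allocate block#10 and block#11 */
        read_block(BITMAP_BLOCK, buf, BLOCK_SIZE);
        buf[10 / 8] |= 1 << (10 % 8);
        buf[11 / 8] |= 1 << (11 % 8);
        write_block(BITMAP_BLOCK, buf, BLOCK_SIZE);
        /* CRASH HERE => two blocks marked used but referenced by no one */

        /* write 2 -- block#10: the inode for "dir1" */
        struct inode child = { .mode = 040644,    /* dir, rw-r--r-- */
                               .nlink = 2, .size = BLOCK_SIZE,
                               .data_block = 11, .mtime = time(NULL) };
        write_block(10, &child, sizeof child);
        /* CRASH HERE => unreachable "dir1", garbage in its data block */

        /* write 3 -- block#11: init dir1's data with "." and ".." */
        struct dirent init[2] = { { 10, "." }, { 2, ".." } };
        write_block(11, init, sizeof init);
        /* CRASH HERE => unreachable (but internally consistent) "dir1" */

        /* write 4 -- block#3: add direntry "dir1" to "/"'s data */
        read_block(3, buf, BLOCK_SIZE);
        struct dirent *ents = (struct dirent *)buf;
        for (size_t i = 0; i < BLOCK_SIZE / sizeof *ents; i++)
            if (ents[i].inum == 0) {              /* first free slot */
                ents[i] = (struct dirent){ 10, "dir1" };
                break;
            }
        write_block(3, buf, BLOCK_SIZE);
        /* CRASH HERE => inconsistent metadata in "/" (e.g., stale mtime) */

        /* write 5 -- block#2: update "/"'s inode metadata */
        struct inode parent;
        read_block(2, &parent, sizeof parent);
        parent.mtime = time(NULL);
        write_block(2, &parent, sizeof parent);
    }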
--We will mention three approaches to crash recovery in file systems:
    A. Ad-hoc (OSTEP calls this "fsck")
    B. Copy-on-write approaches
    C. Journaling (also known as write-ahead logging)

A. Ad-hoc

--Goal: metadata consistency, not data consistency
  (rationale: too expensive to provide data consistency; cannot live
  without metadata consistency.)

--Approach: arrange to send file system updates to the disk in such a way
  that, if there is a crash, **fsck** can clean up inconsistencies

--example: mkdir("/dir1", 0644)
    1. block#1: bitmap, for allocating new blocks
    2. block#10: create "dir1" inode
    3. block#11: init data block
    4. block#3: add direntry "dir1" to the parent dir ("/")
    5. block#2: update metadata of the parent dir

  here are fsck's fixes for a crash between:
    1 and 2? => recycle the blocks
    2 and 3? => re-init dir1
    3 and 4? => send to "/lost+found/"
    4 and 5? => ? [likely ignore...]

  some other example cases:

    inode not marked allocated in bitmap
      --> the only writes were to unallocated, unreachable blocks;
          the result is that the write "disappears"

    inode allocated, data blocks not marked allocated in bitmap
      --> fsck must update the bitmap

--Disadvantages of this ad-hoc approach:

  (a) fsck's guarantees are unclear (hence "ad-hoc")

  (b) need to get the ad-hoc reasoning exactly right (it sometimes depends
      on the fs implementation)

  (c) poor performance (synchronous writes of metadata)
      --multiple updates to the same block must be issued separately.
        for example, imagine two updates to the same directory block:
        the first must complete before the second is issued (otherwise,
        the writes are not synchronous)
      --more generally, the cost of crash recoverability is enormous
        (a job like "untar" could be 10-20x slower)

  (d) slow recovery: fsck must scan the entire disk
      --recovery gets slower as disks get bigger: if fsck takes one
        minute today, what happens when the disk gets 10 times bigger?

B. Copy-on-write approaches

--Goal: provide both metadata and data consistency, by using more space.
  Rationale: disks have gotten larger; space is not at a premium.

--Used by file systems like ZFS, btrfs, and APFS.
  [For more details, read "The Zettabyte File System" by Jeff Bonwick,
  Matt Ahrens, Val Henson, Mark Maybee, and Mark Shellenbaum.
  https://www.cs.hmc.edu/~rhodes/courses/cs134/sp19/readings/zfs.pdf]

--Approach: never modify a block; instead, always make a new copy.
  In detail:

  * The file system has a root block, which we refer to as the uberblock
    (copying terminology from ZFS). The uberblock is the **only** block in
    the file system that is ever _modified_ (as opposed to being fully
    written, which the rest of the blocks are).

  * An abstract example: update a leaf block
    - remember: _never modify, only copy_. so the file system allocates a
      new block and writes the new version of the data to the new block
    - but that in turn necessitates writing a new version of the inode
      (to point to the new version of the block)
    - and that in turn _changes the inode number_, which means that the
      parents and any directories hard-linking to the file have to change
      (for this to work, the inode has to store the list of hard links.)
    - and that in turn means that _those_ directories' inodes have to
      change
    - and so on, up to the uberblock.
    - the change is _committed_ -- in the crucial sense that after a crash
      the new version will be visible -- when and only when the uberblock
      is modified on disk.
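--To make the abstract example concrete in code, here is a minimal C
  sketch of the copy-on-write rule: never modify a block; copy the whole
  path from the leaf up to the root, then commit with a single write of
  the uberblock. The "tree" is simplified to a chain of blocks, and all
  names and structs are hypothetical (not ZFS's actual layout).

    #include <stdlib.h>
    #include <string.h>

    #define BLOCK_SIZE 4096

    struct block {
        struct block *child;        /* NULL for a leaf block */
        char data[BLOCK_SIZE];
    };

    struct uberblock {              /* fits in one sector, so updating it
                                     * is atomic; the ONLY thing that is
                                     * ever modified in place */
        struct block *root;
    };

    /* never modify, only copy: allocate and fully write a new version of
     * every block from the leaf up to (but excluding) the uberblock;
     * error handling omitted */
    static struct block *cow_path(const struct block *old,
                                  const char *newdata, size_t len)
    {
        struct block *copy = malloc(sizeof *copy);
        *copy = *old;               /* start from the old version */
        if (old->child == NULL)
            memcpy(copy->data, newdata, len);    /* new leaf contents */
        else
            copy->child = cow_path(old->child, newdata, len);
        return copy;
    }

    void commit(struct uberblock *ub, const char *newdata, size_t len)
    {
        struct block *newroot = cow_path(ub->root, newdata, len);
        /* on a real disk: flush all new blocks here, in ANY order... */
        ub->root = newroot;         /* ...then this one write commits; a
                                     * crash before it leaves the old
                                     * version fully intact */
    }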
  * A concrete example: a modification to a file in an existing block
    [see handout]
    (a note: the handout figures are for demonstration purposes. If you
    read the ZFS paper above, you will find that the "tree" is about disk
    blocks, which is an abstraction below the notions of files and
    directories.)

  * Note that the same thing happens when a user appends to a file,
    creating another block (and thus changing the inode, and so on).

  * And the same thing happens when creating a file (because the directory
    inode has to change).

--Note that, to enable this picture, the uberblock is designed to fit in
  a sector, in order to allow **atomic updates**.

--Benefits:

  * Most changes can be committed in **any order**.
    - The only requirement is that all changes be committed before the
      uberblock is updated.
    - The ability to reorder writes in this manner has performance
      benefits.

  * The on-disk structure and data are **always** consistent. There is no
    need to run fsck or any recovery after a crash.

  * The file system incorporates versioning, similar to Git and other
    version control tools you may have used.
    - This requires not throwing away the old versions of the blocks
      after writing the new ones.

--Disadvantages:

  * Significant write amplification: any write requires changes to
    several disk blocks.

  * Significant space overheads: the file system needs enough space to
    copy metadata blocks in order to make any changes.

    --Question: when a COW fs is almost full, is it a good idea to delete
      files?
     [answer: no! think of deleting a file that lives in a directory 10
      levels deep: finishing the delete requires copying all 10 directory
      inodes... which may itself run out of disk space.]

  * Generally necessitates a garbage-collection daemon to reclaim blocks
    from old versions of the file system (a sketch follows below).
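--Here is a toy mark-and-sweep pass for the garbage-collection daemon
  just mentioned: blocks reachable from the current uberblock are live;
  blocks that belong only to old versions go back to the free bitmap. The
  structures (children[][], uberblock_root) are hypothetical placeholders,
  continuing the earlier sketches.

    #include <stdbool.h>
    #include <stdint.h>

    #define NBLOCKS 1024
    #define NPTRS   16

    /* block b points to blocks children[b][0..NPTRS-1]; 0 = no pointer;
     * filled in by the fs when it loads the block tree (hypothetical) */
    static uint32_t children[NBLOCKS][NPTRS];
    static uint32_t uberblock_root;
    static bool     live[NBLOCKS];

    static void mark(uint32_t b)
    {
        if (b == 0 || live[b])
            return;
        live[b] = true;
        for (int i = 0; i < NPTRS; i++)
            mark(children[b][i]);
    }

    /* clear the bitmap bit of every block the current version doesn't use */
    void gc_sweep(uint8_t *bitmap)
    {
        mark(uberblock_root);
        for (uint32_t b = 1; b < NBLOCKS; b++)
            if (!live[b])
                bitmap[b / 8] &= ~(1 << (b % 8));
    }

  (a real COW fs that keeps snapshots would mark from every retained
  uberblock, not just the current one.)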