Week 12.a
CS 5600, 11/23/2021

On the board
------------
1. Last time
2. Crash recovery
   --intro
   --ad-hoc
   --copy on write

--------------------------------------------------------------

Admin
--lab3 challenge grading

-------

1. Last time

--Unix inode
--directory
  --a special file with a name-to-inode mapping
  --links: hard links & soft links
--fs5600 design
  --layout, superblock, bitmap, root dir, inode, dir entry, mode, ...
  --see Lab4 instructions:
    https://naizhengtan.github.io/21fall/labtutorials/lab4/
--interfaces:
  --read, write, create, ...

--What does "rwx" mean for a dir?
  [try these examples:

     $ cd /tmp/
     $ mkdir r_dir w_dir x_dir
     $ ls -ld *_dir          // see the initial permissions

     $ chmod u-wx r_dir      // now "r_dir" has permission "r--"
     $ ls r_dir              // try these cmds; what're the results?
     $ cd r_dir
     $ touch r_dir/file

     $ chmod u-rw x_dir      // now "x_dir" has permission "--x"
     $ ls x_dir              // try these cmds; what're the results?
     $ cd x_dir
     $ touch x_dir/file

     $ chmod u-rx w_dir      // now "w_dir" has permission "-w-"
     $ ls w_dir              // try these cmds; what're the results?
     $ cd w_dir
     $ touch w_dir/file      // note: this will fail; why?
                             // [hint: add "x" to "w_dir" ("-wx")
                             //  to see if "touch" succeeds]
  ]

--mkdir("/dir1/", 0644)

  --Question: what does "0644" mean?
    [answer: "rw-r--r--": owner can RW; group can R; others can R]

  [draw fs5600 layout and inode for "/"]

  --how does it work?
    1. get the parent directory's inode ("/")
    2. allocate an inode for "dir1" (say block#10)
    3. allocate a block for data (say block#11) and init the block
    4. add direntry "dir1" to the parent dir's data (say block#3)
    5. update the parent inode's metadata (for example, mtime)

  --Question: how many "block_write()" calls do you think will happen in
    this process?
    [answer: 5 times:
       - block#1:  bitmap, for allocating the new blocks
       - block#10: create "dir1" inode
       - block#11: init data block
       - block#3:  add direntry "dir1" to the parent dir ("/")
       - block#2:  update metadata of the parent dir
    ]

  --in fact, these writes can happen in different orders depending on:
    --your implementation (you may choose to swap some of the writes)
    --the OS buffer cache
    --the underlying storage (the hardware)

2. Crash recovery

--There are a lot of data structures used to implement the file system:
  bitmap of free blocks, directories, inodes, indirect blocks, data
  blocks, etc.

--We want these data structures to be *consistent*: we want invariants
  to hold.

--We also want to ensure that data on the disk remains consistent.

--Thorny issue: *crashes* or power failures.

--Example: the above mkdir("/dir1", 0644)

  There are five writes:
    1. block#1:  bitmap, for allocating the new blocks
    2. block#10: create "dir1" inode
    3. block#11: init data block
    4. block#3:  add direntry "dir1" to the parent dir ("/")
    5. block#2:  update metadata of the parent dir
  [note: writing one block is guaranteed to be atomic by the hardware.]

  crash. restart. uh-oh.

--Question: assuming synchronous writes, what are the consequences of a
  crash that happens between:
    1 and 2?  [we lose track of two blocks]
    2 and 3?  [unreachable "dir1", garbage in "dir1"]
    3 and 4?  [unreachable "dir1"]
    4 and 5?  [inconsistent metadata in "/"]
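--To make the write sequence and these crash windows concrete, here is a
  minimal sketch in C. It is *not* the actual fs5600 code: the in-memory
  "disk", the buffer layout, and the stand-in helper below are invented
  for illustration. Only the five writes and the block numbers follow the
  example above, and each block write is assumed to be atomic.

    /* sketch: the five block writes of mkdir("/dir1", 0644),
       with the crash windows marked */

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BLOCK_SIZE 4096
    #define NUM_BLOCKS 1024

    static uint8_t disk[NUM_BLOCKS][BLOCK_SIZE];   /* stand-in for the device */

    /* stand-in for the lab's block_write(): one call == one atomic block write */
    static void block_write(const void *buf, int blknum)
    {
        memcpy(disk[blknum], buf, BLOCK_SIZE);
    }

    int main(void)
    {
        uint8_t bitmap[BLOCK_SIZE]     = {0};  /* block#1:  free-block bitmap        */
        uint8_t root_inode[BLOCK_SIZE] = {0};  /* block#2:  inode of "/"             */
        uint8_t root_data[BLOCK_SIZE]  = {0};  /* block#3:  direntries of "/"        */
        uint8_t dir1_inode[BLOCK_SIZE] = {0};  /* block#10: new inode for "dir1"     */
        uint8_t dir1_data[BLOCK_SIZE]  = {0};  /* block#11: new data block for "dir1"*/

        /* ... fill in the in-memory copies: mark #10 and #11 used in the
           bitmap, build the "dir1" inode, put "." and ".." in dir1_data,
           append the "dir1" direntry to root_data, bump the root inode's
           mtime ... (details omitted) */

        block_write(bitmap,     1);   /* write 1: allocate blocks #10, #11          */
                                      /* crash here: we lose track of two blocks    */
        block_write(dir1_inode, 10);  /* write 2: the new inode                     */
                                      /* crash here: "dir1" unreachable, garbage in it */
        block_write(dir1_data,  11);  /* write 3: "." and ".." entries              */
                                      /* crash here: "dir1" unreachable             */
        block_write(root_data,  3);   /* write 4: direntry "dir1" in "/"            */
                                      /* crash here: inconsistent metadata in "/"   */
        block_write(root_inode, 2);   /* write 5: update mtime etc. of "/"          */

        printf("mkdir done: 5 block writes\n");
        return 0;
    }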
[skipped in class:

  --Solution: the system requires a notion of atomicity

  --How to think about this stuff: imagine that a crash can happen at
    any time. (The only thing that happens truly atomically is a write
    of one or a few 512-byte disk sectors.) So you want to arrange for
    the world to look sane, regardless of where a crash happens.

      --> A challenge here is that metadata and data are spread across
          several disk blocks (and hence several sectors), so increasing
          the size of the atomic unit is not sufficient.

      --> Your leverage, as the file system designer, is that you can
          arrange for some disk writes to happen *synchronously* (meaning
          that the system won't do anything until those disk writes
          complete), and you can impose some ordering on the actual
          writes to the disk.

  --So we need to arrange for higher-level operations ("add data to
    file") to _look_ atomic: an update either occurs or it doesn't.

  --Potentially useful analogy: during our concurrency unit, we had to
    worry about arbitrary interleavings (which we then tamed with
    concurrency primitives). Here, we have to worry that a crash can
    happen at any time (and we will tame this with abstractions like
    transactions). The response in both cases is a notion of atomicity.
]

--We will mention three approaches to crash recovery in file systems:
    A. Ad-hoc (OSTEP calls this "fsck")
    B. Copy-on-write approaches
    C. Journaling (also known as write-ahead logging)

A. Ad-hoc

  --Goal: metadata consistency, not data consistency
    (rationale: too expensive to provide data consistency; cannot live
     without metadata consistency.)

  --Approach: arrange to send file system updates to the disk in such a
    way that, if there is a crash, **fsck** can clean up inconsistencies

  --example: mkdir("/dir1", 0644)
      1. block#1:  bitmap, for allocating the new blocks
      2. block#10: create "dir1" inode
      3. block#11: init data block
      4. block#3:  add direntry "dir1" to the parent dir ("/")
      5. block#2:  update metadata of the parent dir

    here are the fixes for this case (what can fsck do?):
      crash between 1 and 2?  => fsck can recycle the blocks
                    2 and 3?  => re-init dir1
                                 [dangerous when there is seemingly
                                  correct info; a fix: checksums]
                    3 and 4?  => send "dir1" to "/lost+found/"
                    4 and 5?  => ? [likely ignore...]

    some other example cases:

      inode not marked allocated in bitmap
        --> the only writes were to unallocated, unreachable blocks;
            the result is that the write "disappears"

      inode allocated, but its data blocks not marked allocated in bitmap
        --> fsck must update the bitmap

  --Disadvantages of this ad-hoc approach:

    (a) fsck's guarantees are unclear (hence "ad-hoc")

    (b) need to get the ad-hoc reasoning exactly right (and it sometimes
        depends on the details of the fs implementation)

    (c) poor performance (synchronous writes of metadata)
        --multiple updates to the same block must be issued separately.
          for example, imagine two updates to the same directory block:
          the first must complete before the second is issued
          (otherwise, the writes are not synchronous)
        --more generally, the cost of crash recoverability is enormous.
          (a job like "untar" could be 10-20x slower)

    (d) slow recovery: fsck must scan the entire disk
        --recovery gets slower as disks get bigger. if fsck takes one
          minute today, what happens when the disk gets 10 times bigger?
        --essentially, fsck has to read the entire disk
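  --To make the flavor of these repairs concrete, here is a toy sketch in
    C of an fsck-like pass that recomputes what the block bitmap should
    say from the inode table and repairs any discrepancies. This is not
    the real fsck and not the fs5600 on-disk format: the "struct inode",
    the array sizes, and the block numbers below are invented purely for
    illustration.

    /* toy fsck-style pass: derive the "correct" bitmap from the inodes,
       then fix the on-disk bitmap where it disagrees */

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_BLOCKS     64
    #define NUM_INODES     16
    #define PTRS_PER_INODE 4

    struct inode {
        bool in_use;                 /* is this inode allocated?       */
        int  ptrs[PTRS_PER_INODE];   /* block numbers; -1 if unused    */
    };

    /* toy "on-disk" state */
    static struct inode inodes[NUM_INODES];
    static bool bitmap[NUM_BLOCKS];  /* what the disk *says* is allocated */

    static void fsck_fix_bitmap(void)
    {
        bool should_be[NUM_BLOCKS] = {false};

        /* pass 1: every block referenced by an allocated inode must be
           marked allocated in the bitmap */
        for (int i = 0; i < NUM_INODES; i++) {
            if (!inodes[i].in_use)
                continue;
            for (int j = 0; j < PTRS_PER_INODE; j++) {
                int b = inodes[i].ptrs[j];
                if (b >= 0 && b < NUM_BLOCKS)
                    should_be[b] = true;
            }
        }

        /* pass 2: repair discrepancies */
        for (int b = 0; b < NUM_BLOCKS; b++) {
            if (should_be[b] && !bitmap[b]) {
                printf("block %d in use but not marked: marking it\n", b);
                bitmap[b] = true;        /* "fsck must update the bitmap" */
            } else if (!should_be[b] && bitmap[b]) {
                printf("block %d marked but unreferenced: recycling it\n", b);
                bitmap[b] = false;       /* e.g., crash between writes 1 and 2 */
            }
        }
    }

    int main(void)
    {
        /* simulate a crash between writes 1 and 2 of the mkdir example:
           the bitmap says blocks 10 and 11 are used, but no inode points
           to them yet */
        bitmap[10] = bitmap[11] = true;

        /* also simulate the reverse case: inode 3 is allocated and points
           to block 20, but block 20 was never marked in the bitmap */
        inodes[3].in_use  = true;
        inodes[3].ptrs[0] = 20;
        inodes[3].ptrs[1] = inodes[3].ptrs[2] = inodes[3].ptrs[3] = -1;

        fsck_fix_bitmap();
        return 0;
    }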
B. Copy-on-write approaches

  -- Goal: provide both metadata and data consistency, by using more
     space.
     Rationale: disks have gotten larger; space is not at a premium.

  -- Used by filesystems like ZFS, btrfs, and APFS.
     [For more details, read "The Zettabyte File System" by Jeff Bonwick,
      Matt Ahrens, Val Henson, Mark Maybee, and Mark Shellenbaum:
      https://www.cs.hmc.edu/~rhodes/courses/cs134/sp19/readings/zfs.pdf]

  -- Approach: never modify a block; instead, always make a new copy.
     In detail:

     * The filesystem has a root block, which we refer to as the
       Uberblock (copying terminology from ZFS). The uberblock is the
       **only** block in the filesystem that is ever _modified_ (as
       opposed to being written out in full, which is what happens to
       all the other blocks).

     * An abstract example: update a leaf block
       [draw a tree with checksums]
       - remember: _never modify, only copy_. so the file system
         allocates a new block and writes the new version of the data
         to that new block
       - but that in turn necessitates writing a new version of the
         inode (to point to the new version of the block)
       - and that in turn _changes the inode number_, which means that
         parents and any directories hard-linking to the file have to
         change (for this to work, the inode has to store the list of
         hard links)
       - and that in turn means that _those_ directories' inodes have
         to change
       - and so on, up to the uberblock.
       - the change is _committed_ -- in the crucial sense that after a
         crash the new version will be visible -- when and only when
         the uberblock is modified on disk.

     * A concrete example: a modification to a file in an existing
       block [see handout]

     * Note that the same thing happens when a user appends to a file,
       creating another block (and thus changing the inode, and so on).

     * And the same thing happens when creating a file (because the
       directory inode has to change).

  -- Note that to enable this picture, the uberblock is designed to fit
     in a single sector, in order to allow **atomic updates**.

  -- Benefits:

     * Most changes can be committed in **any order**.
       - The only requirement is that all changes be committed before
         the uberblock is updated.
       - The ability to reorder writes in this manner has performance
         benefits.

     * The on-disk structure and data are **always** consistent. There
       is no need to run fsck or do recovery after a crash.

     * Most of these filesystems also make use of checksums to handle
       cases where data is corrupted for other reasons.

     * The filesystem incorporates versioning similar to Git and other
       version control tools you may have used.
       - This requires not throwing away the old versions of the blocks
         after writing the new ones.

  -- Disadvantages:

     * Significant write amplification: any write requires changes to
       several disk blocks.

     * Significant space overhead: the filesystem needs enough space to
       copy metadata blocks in order to make any change.

       --Question: when a COW fs is almost full, is it a good idea to
         delete files?
         [answer: no! think of deleting a file that lives in a directory
          nested 10 levels deep: finishing the delete requires copying
          all 10 directory inodes along the path...which may itself run
          out of disk space.]

     * Generally necessitates a garbage collection daemon in order to
       reclaim blocks from old versions of the filesystem.

[next time]

C. Journaling