Week 13.b CS 5600 04/13 2022 On the board ------------ 1. Last time 2. fs5600 interface 3. Crash recovery --intro --ad-hoc --copy on write ------------------------------------------------------------ Admin --no class next Monday --mmap HW --some report faster in seconds --those report speedup: mmap is faster ranging from 5x--1000x --many people start working on lab4; great; encourage you to start early ------- 1. Last time --files --Unix inode [a pretty cool recent work: ctFS, where they reuse VM for fs. Li, R., Ren, X., Zhao, X., He, S., Stumm, M. and Yuan, D. ctFS: Replacing file indexing with hardware memory translation through contiguous file allocation for persistent memory. FAST'22 ] --fs5600 inode --directories --special file with mapping --links: hard links & soft links Question: if I run $ touch /tmp/a; ln /tmp/a /tmp/b; ln -s /tmp/a /tmp/sb; rm /tmp/a What is the output of this cmd ? $ cat /tmp/b What is the output of this cmd? $ cat /tmp/sb and why? [see note from last time] --link cycles Question: can I create link cycles? Yes. Question: can I create hardlink only cycles? No. (by stop creating hardlink to dirs) Question: what can go wrong with hardlink cycles? --fs5600 design --layout, superblock, bitmap, root dir, inode, dir entry, mode, ... --see Lab4 instructions 2. fs5600 interface: ** path walk int inum = path2inum(char *pat); --how? for example, "/a/b/c/file" [answer: 1. split path to tokens: ["a", "b", "c", "file"] 2. starting from root dir (block 2) 3. find "a" in "/" (block 2) 4. get "a"'s inode 5. get "b" in "/a/" 6. ... ] ** fs_read - read data from a file --how? for example, "read("/a/file", buf, len, offset)" (pseudocode) [answer: 1. path walk to find the file's inode 2. find offset's block (all data blocks compose a linear space) 3. read len bytes ] ** fs_write - write to a file --how? for example, "write("/a/file", buf, len, offset)" (pseudocode) [answer: 1. path walk to find the file's inode 2. allocate blocks if needed (len+offset > size) 3. find offset's block 4. write len bytes to the file ] Question: can you think of any metadata to update? [answer: - mtime - size ] ** fs_create - create a new (empty) file --how? for example, "create("/a/file", mode)" (pseudocode) [answer: 1. path walk to get the inode of the parent folder ("/a/") 2. allocate a block as "file"'s inode 3. add "file" to "/a/" ] --Question: how many "block_write()" do you think will happen in a "fs_create"? [answer: at least 4 times: - 1 to bitmap (for allocating new blocks) - 1 to parent dir inode (metadata update, mtime) - 1 to parent dir data block (adding "file") - 1 to the "file"'s inode - and likely 1 data block to file's inode ] ** mkdir("/dir1/", 0644) --Question: what does "0644" mean? [answer: "rw-r--r--": owner can RW; group can R; others can R] [draw fs5600 layout and inode for "/"] --how it works? 1. path walk to get the inode of the parent folder ("/") 2. allocate an inode for "dir1" (say block#10) 3. allocate a block for data and init the block (say block#11) 4. add direntry "dir1" to parent dir's data (say block#3) 5. update parent inode (for example, mtime) --Question: how many "block_write()" do you think will happen in this process? [answer: 5 times: - block#1: bitmap, for allocating new blocks - block#10: create "dir1" inode - block#11: init data block - block#3: add direntry "dir1" to the parent dir ("/") - block#2: update metadata of the parent dir ] --in fact these writes can happen in different order --depends on your implementation --OS buffer-cache --underlying storage (the hardware) 3. Crash recovery --intro --ad-hoc --copy-on-write --journaling --There are a lot of data structures used to implement the file system: bitmap of free blocks, directories, inodes, indirect blocks, data blocks, etc. --We want these data structures to be *consistent*: we want invariants to hold --We also want to ensure that data on the disk remains consistent. --Thorny issue: *crashes* or power failures. --Making the problem worse is: (a) write-back caching and (b) non-ordered disk writes. --(a) means the OS delays writing back modified disk blocks. --(b) means that the modified disk blocks can go to the disk in an unspecified order. --Example: the above mkdir("/dir1", 0644) There are five writes: 1. block#1: bitmap, for allocating new blocks 2. block#10: create "dir1" inode 3. block#11: init data block 4. block#3: add direntry "dir1" to the parent dir ("/") 5. block#2: update metadata of the parent dir [note: writing to one block is guaranteed to be atomic by hardware.] crash. restart. uh-oh. --Question: assume synchronous writes, what the consequences of crash when happening in-between: 1 and 2? [losing track of two blocks] 2 and 3? [unreachable "dir1", garbage in "dir1"] 3 and 4? [unreachable "dir1"] 4 and 5? [inconsistent metadata in "/"] --Solution: the system requires a notion of atomicity --How to think about this stuff: imagine that a crash can happen at any time. (The only thing that happens truly atomically is a write of one or a few 512-byte disk sector.) So you want to arrange for the world to look sane, regardless of where a crash happens. --> A challenge here is that metadata and data is spread across several disk blocks (and hence several sectors), so increasing size of atomic unit is not sufficient. --> Your leverage, as file system designer, is that you can arrange for some disk writes to happen *synchronously* (meaning that the system won't do anything until these disk writes complete), and you can impose some ordering on the actual writes to the disk. --So we need to arrange for higher-level operations ("add data to file") to _look_ atomic: an update either occurs or it doesn't. --Potentially useful analogy: during our concurrency unit, we had to worry about arbitrary interleavings (which we then tamed with concurrency primitives). Here, we have to worry that a crash can happen at any time (and we will tame this with abstractions like transactions). The response in both cases is a notion of atomicity. --We will mention three approaches to crash recovery in file systems: A. Ad-hoc (OSTEP calls this "fsck") B. copy-on-write approaches C. Journaling (also known as write-ahead logging) A. Ad-hoc --Goal: metadata consistency, not data consistency (rationale: too expensive to provide data consistency; cannot live without metadata consistency.) --Approach: arrange to send file system updates to the disk in such a way that, if there is a crash, **fsck** can clean up inconsistencies --example: mkdir("/dir1", 0644) 1. block#1: bitmap, for allocating new blocks 2. block#10: create "dir1" inode 3. block#11: init data block 4. block#3: add direntry "dir1" to the parent dir ("/") 5. block#2: update metadata of the parent dir here are the fixes for this case: 1 and 2? => recycle blocks 2 and 3? => re-init dir1 [dangerous when there is seemingly correct info a fix: checksum] 3 and 4? => send to "/lost+found/" 4 and 5? => ? [likely ignore...] some other example cases: inode not marked allocated in bitmap --> only writes were to unallocated, unreachable blocks; the result is that the write "disappears" inode allocated, data blocks not marked allocated in bitmap --> fsck must update bitmap Disadvantages to this ad-hoc approach: (a) fsck's guarantees are unclear (hence ad-hoc) (b) need to get ad-hoc reasoning exactly right (sometimes based on fs implementations) (c) poor performance (synchronous writes of metadata) --multiple updates to same block require that they be issued separately. for example, imagine two updates to same directory block. requires first complete before doing the second (otherwise, not synchronous) --more generally, cost of crash recoverability is enormous. (a job like "untar" could be 10-20x slower) (d) slow recovery: fsck must scan entire disk --recovery gets slower as disks get bigger. if fsck takes one minute, what happens when disk gets 10 times bigger? --essentially, fsck has to scan the entire disk B. Copy-on-write approaches -- Goal: provide both metadata and data consistency, by using more space. Rationale: disks have gotten larger, space is not at a premium. -- Used by filesystems like ZFS, btrfs and APFS. [For more details read The Zettabyte File System by Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee and Mark Shellenbaum. https://www.cs.hmc.edu/~rhodes/courses/cs134/sp19/readings/zfs.pdf] -- Approach: never modify a block, instead always make a new copy. In detail: * The filesystem has a root block, which we refer to as the Uberblock (copying terminology from ZFS). The uberblock is the **only** block in the filesystem that is ever _modified_ (as opposed to being fully written, which the rest of the blocks are). * An abstract example: update a leaf block [draw a tree with checksum] - remember: _never modify, only copy_. so the file system allocates a new block, and writes the new version of the data to the new block - but that in turn necessitates writing a new version of the inode (to point to the new version of the block) - and that in turn _changes the inode number_, which means that parents and any directories hard-linking to the file have to change (for this to work, the inode has to store the list of hard links.) - and that in turn means that _those_ directories' inodes have to change - and so on up to the uberblock. - the change is _committed_ -- in the crucial sense that after a crash the new version will be visible -- when and only when the uberblock is modified on disk. [continue next time]