Week 13.a CS 5600 04/03 2023

1. SSD continued
2. Intro to fs
3. Files

--------------------------------------

Admin:
- Lab3 grader and regrading
- Lab5 will be released on Wed (04/05), due in two weeks
  -- you will get grading scripts with Lab5
- the schedule got updated

---------

1. SSD continued

--operations
  --read: a page
  --erase: a block, resetting all bits to 1
  --program: a page, setting some bits to 0
    (you cannot program a page twice without erasing)

--performance
  --read: tens of us
  --erase: several ms
  --program: hundreds of us

--a bummer: wear-out
  -- a block can bear about 10,000 to 100,000 erases, then becomes unusable

QUESTION: what if one needs to frequently update a page?

--FTL: flash translation layer
  --read/write logical blocks -->FTL--> read/erase/program physical blocks
    (note, the "blocks" in logical blocks and physical blocks are
     different: logical blocks as in the device interface, physical
     blocks as in the flash hardware)

--Question: if you were the FTL, how would you mitigate wear-out?
  [answer: evenly spread the erases/programs across blocks.]

--a log-structured FTL
  --idea:
    --when writing, append the write to the next free page (called logging).
    --when reading, keep track of where the data are by mapping logical
      data to physical pages.

** an example:
  --Given: a flash bank has two blocks; each has two pages.
  --there are three writes to pages:
      write(logic_page_1)
      write(logic_page_10)
      write(logic_page_99)
  --what will happen:

             +-------------------+
     blocks  | block 0 | block 1 |
             +---------+---------+
     pages   | P1 | P2 | P3 | P4 |
             +----+----+----+----+
     data    |LP1 |LP10|LP99|    |
             +----+----+----+----+

     mapping: LP1 => P1, LP10 => P2, LP99 => P3

  Question: what will happen if the following op is write(logic_page_1')?
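The append-and-remap idea above can be sketched in a few lines of Python. This is a toy model for the 2-block/2-page example only (the class name `LogFTL` and its methods are made up for illustration, not a real driver API); physical pages are numbered 0-3 here, corresponding to P1-P4 in the diagram.

```python
# Toy log-structured FTL for a 2-block x 2-page flash bank (illustrative only).
class LogFTL:
    def __init__(self, num_blocks=2, pages_per_block=2):
        self.num_pages = num_blocks * pages_per_block
        self.next_free = 0    # next physical page to program (logging)
        self.mapping = {}     # logical page -> physical page

    def write(self, logical_page):
        if self.next_free >= self.num_pages:
            raise RuntimeError("bank full: need garbage collection")
        # remap; the old physical page (if any) now holds invalid data
        self.mapping[logical_page] = self.next_free
        self.next_free += 1

    def read(self, logical_page):
        return self.mapping[logical_page]   # physical page with latest data

ftl = LogFTL()
for lp in (1, 10, 99):
    ftl.write(lp)
ftl.write(1)            # update LP1: appended to the next free page, not P1
print(ftl.read(1))      # -> 3 (i.e., P4 in the diagram, 0-indexed here)
print(ftl.read(10))     # -> 1 (LP10 still in P2)
```

Note that updating LP1 never touches P1; the write simply goes to the log head and the map entry moves, which is exactly why GC is needed later.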
  [answer:
             +-------------------+
     blocks  | block 0 | block 1 |
             +---------+---------+
     pages   | P1 | P2 | P3 | P4 |
             +----+----+----+----+
     data    |LP1 |LP10|LP99|LP1'|
             +----+----+----+----+

     mapping: LP1 => P4, LP10 => P2, LP99 => P3
  ]
  --Notice: P1 now contains old (invalid) data and is useless.
  --hence, SSDs require *garbage collection* (GC)
    -- say we want to GC block 0
    -- move the live pages (i.e., P2[LP10]) in that block to other free pages
    -- erase block 0 (now both P1 and P2 can be used again)

--complicated internals, hence sometimes unpredictable latency
  --predicting latency? see below
    [Hao, Mingzhe, Levent Toksoz, Nanqinqin Li, Edward Edberg Halim,
     Henry Hoffmann, and Haryadi S. Gunawi. "LinnOS: Predictability on
     unpredictable flash storage with a light neural network.", OSDI'20]

2. Intro to file systems

--what does a FS do?
  1. provide persistence (don't go away ... ever)
  2. give a way to "name" a set of bytes on the disk (files)
  3. give a way to map from human-friendly names to "names" (directories)

--a few quick notes about disks in the context of FS design
  --disk/SSD are the first thing we've seen that (a) doesn't go away; and
    (b) we can modify (BIOS ROM, hardware configuration, etc. don't go
    away, but we weren't able to modify those things).
    two implications here:
    (i) we're going to have to put all of our important state on the disk
    (ii) we have to live with what we put on the disk!
         scribble randomly on memory --> reboot and hope it doesn't
         happen again.
         scribble randomly on the disk --> now what?
         (answer: in many cases, we're hosed.)

--where are FSes implemented?
  --can implement them on disk, over network, in memory, in NVRAM
    (non-volatile RAM), on tape, with paper (!!!!)
  -- Lab5 simulates a disk that abstracts an array of blocks

3. Files

--what is a file?
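The two GC steps above (copy live pages out, then erase the block) can be sketched as a toy function. All names here are hypothetical; for the copy step to have somewhere to go, this run assumes a simpler state than the full diagram: only LP1 and LP10 were written, then LP1 was updated, so physical page 3 (P4) is still free.

```python
# Toy GC for the 2-block flash bank: pages 0-1 are block 0, pages 2-3 block 1.
PAGES_PER_BLOCK = 2

def gc_block(mapping, free_pages, victim_block):
    """Copy live pages out of victim_block; return its pages, now erasable."""
    victim = range(victim_block * PAGES_PER_BLOCK,
                   (victim_block + 1) * PAGES_PER_BLOCK)
    for lp, pp in list(mapping.items()):
        if pp in victim:                     # live page inside the victim block:
            mapping[lp] = free_pages.pop(0)  # copy it to a free page elsewhere
    return list(victim)                      # after erase, all usable again

# assumed state: LP1 was updated into P3, so P1 holds stale LP1, P4 is free
mapping = {1: 2, 10: 1}
freed = gc_block(mapping, free_pages=[3], victim_block=0)
print(mapping)   # {1: 2, 10: 3} -- LP10 copied out to P4
print(freed)     # [0, 1]        -- block 0 can now be erased
```

The stale copy of LP1 in page 0 needs no copying: nothing in the map points to it, so the erase simply reclaims it.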
  --answer from user's view: a bunch of named bytes on the disk
  --answer from FS's view: collection of disk blocks
  --big job of a FS: map name and offset to disk blocks

                     FS
      {file,offset} ----> disk address

  [also called file mapping. see an introduction in S2 of this paper
   (which btw is a cool paper about persistent memory):
     Neal, Ian, Gefei Zuo, Eric Shiple, Tanvir Ahmed Khan, Youngjin Kwon,
     Simon Peter, and Baris Kasikci. "Rethinking File Mapping for
     Persistent Memory.", FAST'21]

--operations are create(file), delete(file), read(), write()

-- note that we have seen translation/indirection before: (introducing inode)

                    page table
      per-process: VA ----------> PA

                       SOMETHING
      per-file: offset ---------> disk block address

  In fs, this SOMETHING is called an inode.

  then, how do we get the inode?

                 directory
      file name ------------> inode

  [Analogies:
      per-process memory <-> file
      multiple processes <-> multiple files
      page table         <-> inode
      memory chip        <-> disk/SSD]

- File design parameters:
  * small files (most files are small) vs. large files (much of the disk
    is allocated to large files)
  * access patterns: sequential access vs. random access vs. keyed access
  * disk utilization (metadata overhead and fragmentation)

- access patterns we could imagine supporting:
  (i) Sequential:
      --file data processed in sequential order
      --by far the most common mode
      --example: editor writes out new file, compiler reads in file, etc.
  (ii) Random access:
      --address any block in the file directly without linear scanning
      --examples: large data set, demand paging, databases
  (iii) Keyed access:
      --search for the block with particular values
      --examples: associative database, index
      --this thing is everywhere in the field of databases and search
        engines, but...
      --...usually not provided by a FS in the OS

Question: if you were a fs designer, which structure would you use for
different parameters?
  [answer: depends on what workloads the fs is going to meet]

candidate designs........
  A. contiguous
  B. linked files
  C. indexed files

A.
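The two levels of translation above (directory: name -> inode, inode: offset -> disk block) can be sketched together. This is a toy model, not a real FS: the inode is just a per-file array of block numbers, and the file name, i-number, and block addresses are all made up.

```python
# Sketch of the FS's big job: {file, offset} -> disk block address.
BLOCK_SIZE = 512

inodes = {                 # i-number -> per-file array of disk block addrs
    7: [120, 98, 400],     # a 3-block file (hypothetical block numbers)
}

directory = {"notes.txt": 7}   # human-friendly name -> i-number

def lookup(name, offset):
    inode = inodes[directory[name]]            # directory: name -> inode
    block_addr = inode[offset // BLOCK_SIZE]   # inode: offset -> disk block
    return block_addr, offset % BLOCK_SIZE     # block + offset within it

print(lookup("notes.txt", 600))   # -> (98, 88): byte 600 lives 88 bytes
                                  #    into the file's 2nd block
```

Note the exact parallel to paging: `directory` plays the role of the process table (which page table?), and the inode's block array plays the role of the page table (which frame?).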
contiguous allocation ("extent based")
  --when creating a file, make the user pre-specify its length, and
    allocate all the space at once
  --file metadata contains location and size
  --example: IBM OS/360

      [ a1 a2 a3 <5 free> b1 b2 ]

      what if a file c needs 7 sectors?!

  +: simple
  +: fast access, both sequential and random
  -: fragmentation

  [somewhat regained interest due to PM:
     Li, Ruibin, Xiang Ren, Xu Zhao, Siwei He, Michael Stumm, and Ding
     Yuan. "ctFS: Replacing file indexing with hardware memory
     translation through contiguous file allocation for persistent
     memory." FAST'22]

B. linked files ("linked list based")
  --keep a linked list of free blocks
  --metadata: pointer to the file's first block
  --each block holds a pointer to the next one
  +: no more fragmentation
  +: sequential access easy (and probably mostly fast, assuming decent
     free space management, since the pointers will point close by)
  -: random access is a disaster
  -: pointers take up room in blocks
     (overhead: sizeof(pointer) / sizeof(block));
     messes up alignment of data

C. indexed files

  [DRAW PICTURE]

  --each file has an array holding all of its block pointers
    --like a page table, so similar issues crop up
  --allocate this array on file creation
  --allocate blocks on demand (using free list or bitmap)
  +: sequential and random access are both easy
  -: need to somehow store the array
     --large possible file size --> lots of unused entries in the block array
     --large actual block size --> huge contiguous disk chunk needed
     --solve the problem the same way we did for page tables.
     --okay, so now we're not wasting disk blocks, but what's the problem?
       (answer: equivalent issues as for page-table walking: here, it's
        extra disk accesses to look up the blocks)

--this motivates the classic Unix file system

--Unix inode: [draw on board]
    permissions
    times for file access, file modification, and inode-change
    link count (# directories containing the file)
    ptr 1 --> data block
    ptr 2 --> data block
    ptr 3 --> data block
    .....
    ptr 11 --> indirect block --> [ptr, ptr, ptr, ptr, ptr, ...]
    ptr 12 --> indirect block
    ptr 13 --> double indirect block
    ptr 14 --> triple indirect block

  This is just a tree.
  [deja vu? that's right, we talked about similar things when we argued
   for different designs of page tables.]

  Question: why is this tree intentionally imbalanced?
  (Answer: to optimize for short files. each level of this tree requires
   a disk seek...)

  Pluses/minuses:
  +: simple, easy to build, fast access to small files
  +: maximum file length can be enormous, with multiple levels of indirection
  -: worst-case # of accesses pretty bad
  -: worst-case overhead (such as an 11-block file) pretty bad
  -: because you allocate blocks by taking them off an unordered free
     list, metadata and data get strewn across the disk

* Notes about inodes:
  --stored in a fixed-size array
  --size of the array is fixed when the disk is initialized; can't be changed
    [why? easier for the fs to find inodes, and fewer disk accesses
     (better performance!)]
  --multiple inodes in a disk block (which is not the case for your Lab5)
  --Question: how many files can the following toy-fs have?
      sizeof(inode) = 128B
      sizeof(block) = 512B
      toy-fs uses 1000 blocks to store inodes
    [answer: the number of inodes (hence files) is
     (512B / 128B) * 1000 = 4000]
  -- use "$ df -i ~" to see how many inodes you can use.
  --inodes live in a known location; originally at one side of the disk,
    now they live in pieces across the disk (helps keep metadata close
    to data)
  --the index of an inode in the inode array is called an ***i-number***
  --internally, the OS refers to files by i-number
  --when a file is opened, its inode is brought into memory
  --written back when modified and the file is closed or time elapses
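The imbalanced-tree math above can be made concrete: given a file offset, how many levels of indirection does the lookup cross? The constants below (512B blocks, 4B pointers, 10 direct pointers) are assumptions for illustration; real FSes pick different numbers, but the shape of the calculation is the same.

```python
# Which pointer class of a classic Unix-style inode covers a file offset?
BLOCK = 512
PTRS_PER_BLOCK = BLOCK // 4   # 128 4-byte pointers fit in an indirect block
NDIRECT = 10                  # assumed number of direct pointers

def level_for(offset):
    """Levels of indirection to reach the data block (0 = direct pointer)."""
    blk = offset // BLOCK
    if blk < NDIRECT:
        return 0              # direct: one disk access for the data
    blk -= NDIRECT
    if blk < PTRS_PER_BLOCK:
        return 1              # single indirect
    blk -= PTRS_PER_BLOCK
    if blk < PTRS_PER_BLOCK ** 2:
        return 2              # double indirect
    return 3                  # triple indirect

print(level_for(4 * 1024))         # -> 0: a small file stays all-direct
print(level_for(1 * 1024 * 1024))  # -> 2: ~1MB offsets need double indirection
```

Each level returned is an extra disk read before the data block itself, which is exactly why the tree is lopsided toward cheap access for short files.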