Week 11.b CS6640 11/16 2023
https://naizhengtan.github.io/23fall/

1. intro to fs
2. Unix files
3. what about now?
   -- PMem introduction
   -- modern(?) file mappings

---

Last time: page fault

* memory-mapped files
* idea: allow access to files using load and store;
  can easily read and write parts of a file
  (e.g., don't have to change the offset using the lseek system call)
* Unix systems have a system call for m-mapped files:

      void *mmap(void *addr, size_t length, int prot, int flags,
                 int fd, off_t offset);

* the kernel pages in pages of a file on demand;
  when memory is full, it pages out pages of the file that are not
  frequently used

1. Intro to file systems

Q: what does a FS do?
   1. provide persistence (don't go away ... ever)
   2. give a way to "name" a set of bytes on the disk (files)
   3. give a way to map from human-friendly names to "names" (directories)

--persistence comes from hardware:

  * classic persistent storage: SSD/disk/tape
    (yes, we are still using tapes)

  --disk/SSD are the first things we've seen that (a) don't go away
    and (b) we can modify
    (BIOS ROM, hardware configuration, etc. don't go away, but we
    weren't able to modify those things)

    two implications here:
    (i)  we're going to have to put all of our important state on the disk
    (ii) we have to live with what we put on the disk!
         scribble randomly on memory --> reboot and hope it doesn't
           happen again
         scribble randomly on the disk --> now what?
           (answer: in many cases, we're hosed)

  * new persistent storage:
    --glass: Project Silica:
      https://www.microsoft.com/en-us/research/project/project-silica/
    --persistent memory (see below)

--disk/SSD abstraction: an array of blocks

--a note about abstracting machines:
  -- CPU: scheduler
  -- memory: virtual memory
  -- disk: file system

2. Files

--what is a file?
  --answer from the user's view: a bunch of named bytes on the disk
  --answer from the FS's view: a collection of disk blocks

Q: what does a FS do with a file?
--the big job of a FS: map a name and offset to disk blocks

      {file, offset} --[FS]--> disk address

  this is called file mapping.

--the problem: "file mapping"

      (file, offset) -> disk addr

  up to "70% of the time [is] spent on file mapping", according to
  "the rethinking paper" (see below)

--Unix inode: how Unix does file mapping
  [draw on board]

      permissions
      times for file access, file modification, and inode-change
      link count (# directories containing file)
      ptr 1  --> data block
      ptr 2  --> data block
      ptr 3  --> data block
      .....
      ptr 11 --> indirect block:  ptr --> data block
                                  ptr --> data block
                                  ptr --> data block
                                  ...
      ptr 12 --> indirect block
      ptr 13 --> double indirect block
      ptr 14 --> triple indirect block

  this is just an imbalanced tree (see the lookup sketch below).

--Unix fs: (file, offset) -[inode]-> disk addr

  Why is the inode designed this way in Unix?
  "the fs is a disk data structure":
  -- granularity: a block
  -- reading/writing a block is expensive
  -- sequential access > random access
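To make the "imbalanced tree" concrete, here is a minimal lookup
sketch, not real kernel code: it assumes 10 direct pointers, 4KB
blocks with 4-byte pointers, a single indirect and a single
double-indirect pointer (the second indirect and the triple indirect,
ptrs 12 and 14, are omitted), and a hypothetical read_block() helper.

    #include <stdint.h>

    #define NDIRECT        10      /* # of direct pointers (assumed)     */
    #define PTRS_PER_BLOCK 1024    /* 4KB block / 4-byte pointers        */

    /* hypothetical in-memory inode; permissions, times, and link
       count are omitted */
    struct inode {
        uint32_t direct[NDIRECT];  /* ptrs --> data blocks               */
        uint32_t indirect;         /* ptr  --> indirect block            */
        uint32_t dbl_indirect;     /* ptr  --> double indirect block     */
    };

    /* assumed helper: read one block of pointers off the disk */
    extern void read_block(uint32_t disk_addr, uint32_t *buf);

    /* file mapping: logical block number -> disk addr,
       walking the imbalanced tree */
    uint32_t bmap(const struct inode *ip, uint32_t lb)
    {
        uint32_t ptrs[PTRS_PER_BLOCK];

        if (lb < NDIRECT)                    /* small file: no extra read */
            return ip->direct[lb];

        lb -= NDIRECT;
        if (lb < PTRS_PER_BLOCK) {           /* one extra disk read       */
            read_block(ip->indirect, ptrs);
            return ptrs[lb];
        }

        lb -= PTRS_PER_BLOCK;
        if (lb < PTRS_PER_BLOCK * PTRS_PER_BLOCK) {  /* two extra reads   */
            read_block(ip->dbl_indirect, ptrs);
            read_block(ptrs[lb / PTRS_PER_BLOCK], ptrs);
            return ptrs[lb % PTRS_PER_BLOCK];
        }

        return 0;   /* would continue into the triple indirect (omitted) */
    }

note the imbalance: the first NDIRECT blocks cost no extra disk reads,
while blocks deep in a large file cost one extra read per level of
indirection.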
[skipped]
Q: How does the Unix fs read/write files?

                app
               /   \
              /     \
           mmap    read/write
         ---------------------
              \     /
   [kernel]    \   /
             page cache
                 |
               disk

--Unix-style file mapping worked well in the past, when we had
  single-core CPUs, slow I/O, sequential access > random access,
  block-granularity reads/writes, ...

  ...but do these assumptions still hold today?

Q: assume that memory is persistent; do we still need a fs?
   what would that fs look like?

3. what about now?

A) PMem introduction

   persistent memory (PMem), also called non-volatile memory

   storage hierarchy: [see also handout]

       |  registers          \
       |  L1-L3 caches        \
       |  DRAM                 \
       |  [PMem]                \
       |  SSD/disk               \
       |  network fs (NFS)        \
       (top: fast/small; bottom: slow/large)

   features:
   -- fastest persistent storage
   -- largest byte-addressable storage
   -- a memory DIMM
   -- (potentially) managed by the MMU

   the only implementation---Intel Optane PMem
   it has multiple modes, but the one we care about is called
   "App Direct Mode", where PMem is persistent.

   performance characteristics (compared with DRAM):
   -- latency
   -- throughput
   -- concurrency
   -- access size

B) Rethinking File Mapping for Persistent Memory

   [Neal, Ian, Gefei Zuo, Eric Shiple, Tanvir Ahmed Khan, Youngjin Kwon,
    Simon Peter, and Baris Kasikci. "Rethinking File Mapping for
    Persistent Memory." FAST '21]

   * Q: does file mapping have to be persistent? pros? cons?
     -- on persistent storage
     -- in DRAM
     -- on storage with a cache

   * high-level view:
     reads:  retrieve the file mapping
     writes: update the file mapping + the block allocator

   * FS design parameters:
     - small files (most files are small) vs. large files (much of the
       disk is allocated to large files)
     - sequential vs. random accesses
     - prefetching
     - disk utilization (metadata overhead and fragmentation)

C) design choices

   1. per-file extent tree

      extent: a contiguous allocation;
      causes fragmentation when appending, deleting, or resizing
      binary search to find the extent: O(log N), where N is the
      number of extents

      Q: pros and cons?
      pros: + sequential access
            + disk utilization
            + prefetching
      cons: - large files

   2. per-file radix tree

      much like the page table in virtual memory:
      a tree; at each level, an index within the "addr" (the logical
      block id) chooses the next node

      Q: pros and cons?
      pros: + performance stability: one memory access per level
      cons: - disk utilization (compared with the others)

   3. global cuckoo hash table

      * cuckoo hashing; how it works:
        * two hash functions
        * invariant: you can always find the value in one of the two
          hash locations
        * read: key -[cuckoo]-> location
        * insert: on a conflicting insert, kick an existing key out;
          redo the insertion for that key; repeat

      Q: what do the keys look like?
         "(file_num, logical_block)"

      Q: what will happen on "read(8KB-of-data)"?
      Q: what will happen on "append(4KB-of-data)"?
         an insertion into the hash + allocating a block
         (<= can be expensive)

      Q: pros and cons?
      pros: + disk utilization
      cons: - sequential reads
            - unpredictable and long update latency

   4. global (traditional) hash table: HashFS

      design goal: better insertion
      (what's wrong with cuckoo hashing's insertion? it needs to talk
      to a "block allocator")

      co-design: file mapping + block allocator
      a trick: use the table index as the block offset
      (see the sketch at the end of these notes)

      Q: how does "read(file_num=1, logical_block=21)" work?
      Q: how does "append(file_num=1, logical_block=22)" work?

      Q: pros and cons?
      pros: + fast retrieval (compute a hash function)
            + disk utilization
      cons: - expensive hash collisions
            - prefetching

   [skipped]
   * DISCUSSION: per-file mapping vs. global mapping

     Q: Why not per-file hash mapping? What's wrong with it?
     A: we have no idea of the file size ahead of time
        (how big should each file's table be?)

     Q: Why not global extent trees or radix trees? What's wrong
        with them?
     A: they pull all files into the same namespace (we'd have to
        know the number of files ahead of time)

     per-file:
       + locality
       + concurrency
       - resizing
           => multiple levels of indirection
           => fragmentation

     Q: but what forces per-file designs to resize?
        the real question: can a fs plan ahead of time?

     global:
       + no resizing => no fragmentation
       + mapping size
       - locality (can be mitigated by caching)
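Back to design choice 4: to see why HashFS's insertion no longer
needs a separate block allocator, here is a minimal sketch of the
"index as offset" trick. The linear-probing collision policy, the
hash function, and the "file_num == 0 means free" convention are
illustrative assumptions, not the paper's exact implementation; the
key idea is that slot i of the global table corresponds to physical
block i of the data region.

    #include <stdint.h>

    #define NSLOTS (1u << 20)   /* table size = # of data blocks (assumed) */

    /* one mapping entry; slot index i <=> physical block i */
    struct slot {
        uint32_t file_num;      /* 0 = free slot (assumed convention)      */
        uint32_t logical_block;
    };

    static struct slot slots[NSLOTS];  /* lives in PMem in the real design */

    /* illustrative hash over the (file_num, logical_block) key */
    static uint32_t hash(uint32_t file_num, uint32_t lb)
    {
        uint64_t k = ((uint64_t)file_num << 32) | lb;
        k ^= k >> 33; k *= 0xff51afd7ed558ccdULL; k ^= k >> 33;
        return (uint32_t)(k % NSLOTS);
    }

    /* read(file_num=1, logical_block=21): hash, probe until the key
       matches; the matching slot's INDEX is the disk address, so no
       further lookup is needed */
    int lookup(uint32_t file_num, uint32_t lb, uint32_t *pblock)
    {
        uint32_t i = hash(file_num, lb);
        for (uint32_t n = 0; n < NSLOTS; n++, i = (i + 1) % NSLOTS) {
            if (slots[i].file_num == 0)
                return -1;                   /* hit a hole: not mapped     */
            if (slots[i].file_num == file_num &&
                slots[i].logical_block == lb) {
                *pblock = i;                 /* index as offset            */
                return 0;
            }
        }
        return -1;
    }

    /* append(file_num=1, logical_block=22): claiming a free slot both
       inserts the mapping AND allocates physical block i */
    int append(uint32_t file_num, uint32_t lb, uint32_t *pblock)
    {
        uint32_t i = hash(file_num, lb);
        for (uint32_t n = 0; n < NSLOTS; n++, i = (i + 1) % NSLOTS) {
            if (slots[i].file_num == 0) {
                slots[i].file_num = file_num;
                slots[i].logical_block = lb;
                *pblock = i;
                return 0;
            }
        }
        return -1;                           /* table (= disk) is full     */
    }

this also makes the listed cons visible: a long probe chain makes
collisions expensive, and consecutive logical blocks land at unrelated
slot indices (i.e., unrelated physical blocks), which hurts prefetching.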