Week 2.b CS7670 09/14 2022
https://naizhengtan.github.io/22fall/

1. Pmem introduction
2. Rethinking file mapping
---

0. Admin
  - ask if encountering challenges in Lab1
  - introduce some of the conferences: FAST, SOSP, OSDI, EuroSys, VLDB, SIGMOD
  - revisit Unix fs: a disk data structure

1. (5min) Pmem introduction

  [this is a brief introduction; we will cover Pmem in later modules]

  Persistent memory (Pmem), also called non-volatile memory (NVM)

  draw the storage hierarchy: [see also handout]

      | registers          \
      | L1-L3 caches         \
      | DRAM                  \
      | [Pmem]                 \
      | SSD/disk                \
      | network fs (NFS)         \

  features:
    -- fastest persistent storage
    -- largest byte-addressable storage
    -- memory DIMM form factor
    -- (potentially) managed by the MMU

  architecture [see handout]

  the only implementation---Intel Optane PMem
    it has multiple modes, but the one we care about is "App Direct Mode",
    where Pmem is persistent.

  nuance of a "persistent write" (Intel Optane):
    1. store + clwb: write back (without evicting) the cache lines
    2. ntstore: non-temporal stores; write directly to memory, bypassing the caches
    [see handout]

  performance characteristics (compared with DRAM):
    -- latency
    -- throughput
    -- concurrency
    -- access size

2. Rethinking File Mapping for Persistent Memory

A) the problem

  * the problem: "file mapping"
      (file, offset) -> disk addr

  * they claim: "70% of the time spent on file mapping"

  Q: in Unix fs, how do we find the mapping "(/tmp/hello.c, 5121)" -> disk addr?
  A: (if for the first time)
     1. find the root inode,
     2. find the "tmp" inode (a directory),
     3. find the "hello.c" inode (a regular file),
     4. fetch the pointer block (pointed to by the indirect pointer),
     5. fetch the data block (pointed to by the first pointer in the pointer block)

  [skipped]
  * Unix fs: (file, offset) -[inode]-> disk addr
    Why is the inode designed the way it is in Unix?
    "fs is a disk data structure":
      -- granularity: block
      -- reading/writing a block is expensive
      -- sequential access > random access

  Q: the paper says, "eliminating memory copies and bypassing the kernel" (first paragraph).
     How do we understand this on Unix fs?

               app
              /   \
             /     \
          mmap    read/write
        ---------------------
            \       /
            [kernel]
             \     /
           page cache
               |
              disk

  * DISCUSSION: does file mapping have to be persistent? pros? cons?
    mainly three ways (S2.2):
      -- on persistent storage
      -- in DRAM
      -- on storage with a cache

  [skipped]
  * JARGON: "shadow paging"
    "Shadow paging is a copy-on-write technique for avoiding in-place updates of pages.
     Instead, when a page is to be modified, a shadow page is allocated. Since the shadow
     page has no references (from other pages on disk), it can be modified liberally,
     without concern for consistency constraints, etc."---wiki
    it solves a different problem: crash consistency

  * high-level view:
      reads:  retrieve the file mapping
      writes: update the file mapping + block allocator

  [skipped]
  * DISCUSSION: the paper says (above S2.1), "PM file systems generally map files at
    block granularity of at least 4KB in order to constrain the amount of metadata
    required to track file system space."
    How do we understand this? And is this still true?

B) challenges and non-challenges

  Challenges:
    -- concurrency (a new problem)
    -- fragmentation
    -- locality
    -- mapping size

  Q: how does Unix fs tackle the latter three challenges?
    -- fragmentation: tree-like inode
    -- locality: caching; inode (metadata + pointers)
    -- mapping size: fine as-is

  * concurrency
    a relatively new problem.
    Q: think of Unix fs; what will happen when multiple concurrent fs ops hit one file?
       case 1: multiple reads?
       case 2: reads and writes to the same location? (a small sketch follows)
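    [a minimal user-level sketch of case 2, not from the paper: two POSIX threads touch
     the same 4KB block of one file; the file name "/tmp/hello.c" and the block size are
     only illustrative, and error handling is omitted. Whether the reader observes all-old
     bytes, all-new bytes, or a mix depends on the file system's internal (per-inode)
     locking---which is exactly what the "correctness spec" question below asks.
     compile with: cc -pthread sketch.c]

      /* sketch: concurrent read and write to the same file block (case 2). */
      #include <fcntl.h>
      #include <pthread.h>
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>

      static int fd;

      static void *writer(void *arg) {
          char buf[4096];
          memset(buf, 'B', sizeof(buf));
          pwrite(fd, buf, sizeof(buf), 0);            /* overwrite block 0 with 'B's */
          return NULL;
      }

      static void *reader(void *arg) {
          char buf[4096];
          ssize_t n = pread(fd, buf, sizeof(buf), 0); /* concurrent read of block 0 */
          if (n > 0)
              printf("first byte: %c, last byte: %c\n", buf[0], buf[n - 1]);
          return NULL;
      }

      int main(void) {
          fd = open("/tmp/hello.c", O_RDWR | O_CREAT, 0644);
          char init[4096];
          memset(init, 'A', sizeof(init));
          pwrite(fd, init, sizeof(init), 0);          /* block 0 starts as all 'A's */

          pthread_t t1, t2;
          pthread_create(&t1, NULL, writer, NULL);
          pthread_create(&t2, NULL, reader, NULL);
          pthread_join(t1, NULL);
          pthread_join(t2, NULL);
          close(fd);
          return 0;
      }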
    Q: what about dir operations?
       Example: Linux has "int rename(const char *oldpath, const char *newpath);"
         T1: rename("/tmp/a/b", "/tmp/b");
         T2: rename("/tmp/a", "/tmp/c");
       Think of Unix fs; what may the final results be?

    [skipped]
    Q: what is the "correctness" spec of concurrent fs ops?
       [introduce sequential consistency]

  * fragmentation
    "a file's data is spread across non-contiguous physical locations on a storage device"
    Q: why is fragmentation a problem (an unwanted phenomenon)?
       Sequential reads and writes are still preferred (they will be faster).

  * locality
    "Accesses with locality are typically accelerated by caching prior accesses and
     prefetching adjacent ones."
    Q: is locality really the problem?
       the true problem is where the mapping info is stored: in PMem, DRAM, or the CPU cache?

  * mapping size
    Q: again, is mapping size the true problem?
       if you can use 99% of the space for the mapping, what will happen?
       [toy solution: a gigantic hash table that barely has collisions (an O(1) access);
        see the sketch at the end of these notes]
       assume we have an infinitely large CPU cache; do people really care about the size
       of the metadata?
       [no, we cache the gigantic hash table in the CPU cache; done]

    [my opinion: for "fragmentation", "locality", and "mapping size", the true underlying
     problem is caching, or more specifically: where does the mapping info live in the
     hardware (CPU cache vs. DRAM vs. PMem)?]

  Non-challenges:
    -- page caching (discussed earlier)
    -- crash consistency (an orthogonal problem/challenge)

C) four design choices
   [next time]
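  [appendix to B): a toy sketch of the "gigantic hash table" mapping mentioned under
   "mapping size". This is only my illustration, not the paper's design; the table
   capacity, the hash function, and the assumption that the table stays sparse (and
   cache-resident) and never fills up are all made up for the example. The key is
   (inode number, logical block number) and the value is a physical block number, so
   the earlier lookup (/tmp/hello.c, offset 5121) becomes
   map_lookup(ino, 5121/4096 = 1, &pblk) and is O(1) in expectation.]

      /* sketch: a toy O(1) file-mapping table, keyed by (inode, logical block). */
      #include <stdint.h>

      #define NSLOTS (1u << 20)          /* toy capacity: ~1M mappings */

      struct slot {
          uint64_t ino;                  /* inode number          */
          uint64_t lblk;                 /* logical block in file */
          uint64_t pblk;                 /* physical block addr   */
          int used;
      };

      static struct slot table[NSLOTS];  /* assumed "gigantic": never full */

      static uint64_t hash(uint64_t ino, uint64_t lblk) {
          uint64_t x = ino * 0x9e3779b97f4a7c15ULL ^ lblk;
          return x % NSLOTS;
      }

      /* insert mapping (ino, lblk) -> pblk; linear probing on (rare) collisions */
      static void map_insert(uint64_t ino, uint64_t lblk, uint64_t pblk) {
          uint64_t i = hash(ino, lblk);
          while (table[i].used && !(table[i].ino == ino && table[i].lblk == lblk))
              i = (i + 1) % NSLOTS;
          table[i].ino = ino; table[i].lblk = lblk;
          table[i].pblk = pblk; table[i].used = 1;
      }

      /* lookup: expected O(1) when the table "barely has collisions" */
      static int map_lookup(uint64_t ino, uint64_t lblk, uint64_t *pblk) {
          uint64_t i = hash(ino, lblk);
          while (table[i].used) {
              if (table[i].ino == ino && table[i].lblk == lblk) {
                  *pblk = table[i].pblk;
                  return 1;
              }
              i = (i + 1) % NSLOTS;
          }
          return 0;                      /* hole or unmapped block */
      }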