Week 6.b
CS7670 10/12 2022
https://naizhengtan.github.io/22fall/

1. what we've learned
2. hierarchical file systems are dead
3. learned fs proposal

---

[Brent's and Will's presentations]

1. What we've learned so far

  A) background: file system (Lab1)

     two main functionalities:
       -- named data (file)
       -- user-friendly namespace (dirs)

     which has two core problems:
       -- file mapping: f(file, offset) -> storage block id
       -- locating files: g(path_string) -> file [or inode number]

  B) background: neural network and ML systems (Lab2)

     * backward propagation
     * autograd engine
     * dataflow graph (TensorFlow)

  C) an NN4Sys: learned index

     In general, NN4Sys has two main advantages:
       (1) succinct approximate data structures,
       (2) discovering sophisticated heuristics

     Learned index falls into the first category.
     A learned index can be smaller and faster than traditional
     data structures like a B-Tree.

  D) the following topic, learned scheduling, falls into the second category.

2. Hierarchical file systems are dead

  file systems vs. databases:

    Q: Consider you have 1,000 books at home.
       How will you organize them so that you can easily find the book you want?

    Q: Consider you are a librarian with 1,000 books.
       How would you organize books so that when a guest comes,
       you can find the book for them?

    Q: Is there any difference in how you organize the books?

  * hFSD argues, "situation, however, has evolved"

    i) storage size grows
       -- 1992: 300MB disk
       -- 2009: 300GB disk (17 years later)
       -- 2022: 20TB disk & 8TB SSD (13 years later)
       -- end of 2022: 200TB SSD
       -- 2030: 1000TB SSD (8 years later)
       [https://www.techradar.com/news/1000tb-ssds-could-become-mainstream-by-2030-as-samsung-plans-1000-layer-nand]

    ii) larger space => file size wasn't growing that fast
        => more files => harder to manage
        [Q: does logic flow?]
    iii) "Google is a verb"
         users care about what they want instead of where it lives
         [a sharp observation]

    iv) "database themselves tend to be too heavy-weight a solution"
        -- not ideally optimized for a given application
        -- prevent independent evolution of the data
        -- painful to install and manage
        [are these true? what about SQLite?]

    my opinion: interface is the core problem

  * SQL vs. FS interface
    [skipped]

  * basis of a modern fs:
    -- backwards compatibility
    -- separate naming from access (*)
    -- data agnostic
    -- direct access to data

  * hFAD design
    [see Fig1]
    -- based on an object-based storage device
    -- naming (via index) and accessing (via obj store)

    Q: how about updates?
       Deleting a file needs to update multiple indexes at the same time.
       Again, an index is a data structure that accelerates data retrieval,
       at the cost of more expensive updates and more space.
    [skipped]

  * hFAD implementation [S3.4]
    Q: how to support efficient "insert" and "truncate"?
       [B-Tree: key(file offset) -> val(disk addr, length)]
    [skipped]

  * open questions
    Q: anything you think is interesting?
    [skipped]

  DISCUSSION: search-based fs vs. hierarchical fs

3. learned fs proposal

  [trimmed]
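
[sketch for Section 2, "hFAD design"; not the paper's implementation]

  The hFAD design separates naming (via index) from accessing (via obj store),
  and a delete has to update multiple indexes. Below is a minimal in-memory
  Python sketch of that split, meant only to make the update cost concrete.
  Everything in it is an assumption for illustration: the class names
  (ObjectStore, TaggedFS), the method names, and the tag names
  (POSIX, USER, FULLTEXT) are hypothetical, and a real system would keep
  the indexes on disk rather than in dicts.

```python
from collections import defaultdict


class ObjectStore:
    """Access layer: object id -> bytes (stand-in for an object-based storage device)."""

    def __init__(self):
        self._objects = {}
        self._next_oid = 0

    def create(self, data=b""):
        oid = self._next_oid
        self._next_oid += 1
        self._objects[oid] = bytearray(data)
        return oid

    def read(self, oid, offset=0, length=None):
        data = self._objects[oid]
        end = len(data) if length is None else offset + length
        return bytes(data[offset:end])

    def delete(self, oid):
        del self._objects[oid]


class TaggedFS:
    """Naming layer (hypothetical): (tag, value) pairs map to sets of object ids."""

    def __init__(self):
        self.store = ObjectStore()
        self.index = defaultdict(set)  # (tag, value) -> {oid, ...}
        self.names = defaultdict(set)  # oid -> {(tag, value), ...}, used by delete

    def create(self, data, names):
        oid = self.store.create(data)
        for name in names:
            self.index[name].add(oid)
            self.names[oid].add(name)
        return oid

    def find(self, *names):
        """Return the ids of objects that match ALL given (tag, value) pairs."""
        sets = [self.index.get(name, set()) for name in names]
        return set.intersection(*sets) if sets else set()

    def delete(self, oid):
        # One logical delete touches every index entry that names the object.
        for name in self.names.pop(oid, set()):
            self.index[name].discard(oid)
            if not self.index[name]:
                del self.index[name]
        self.store.delete(oid)


# Usage: name an object three ways, search for it, then delete it.
fs = TaggedFS()
notes = fs.create(b"lecture notes",
                  [("POSIX", "/home/u/notes.txt"), ("USER", "u"), ("FULLTEXT", "lecture")])
photo = fs.create(b"photo", [("POSIX", "/home/u/cat.jpg"), ("USER", "u")])

assert fs.find(("USER", "u"), ("FULLTEXT", "lecture")) == {notes}
assert fs.store.read(notes, 0, 7) == b"lecture"

fs.delete(notes)  # removes three index entries plus the object itself
assert fs.find(("USER", "u")) == {photo}
```

  Note the reverse map (oid -> names): without it, delete would have to scan
  every index. This is the "more expensive updates and more space" cost of
  indexing that the Q above points at.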