Week 6.b
CS7670
10/12 2022
https://naizhengtan.github.io/22fall/

1. what we've learned
2. hierarchical file systems are dead
3. learned fs proposal
---


[Brent's and Will's presentations]


1. What we've learned so far

 A) background: file system
    (Lab1)

    two main functionalities:
     -- named data (file)
     -- user-friendly namespace (dirs)

    which has two core problems:
     -- file mapping:
           f(file, offset) -> storage block id
     -- locating files:
           g(path_string) -> file [or inode number]
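
    A minimal sketch of the two functions, assuming a toy in-memory
    file system with direct block pointers only. The names (Inode,
    file_mapping, locate, BLOCK_SIZE) are made up for illustration
    and are not the Lab1 interface.

```python
# Toy in-memory file system; every name here is hypothetical.
BLOCK_SIZE = 4096

class Inode:
    def __init__(self, ino, is_dir=False):
        self.ino = ino
        self.is_dir = is_dir
        self.blocks = []    # file: i-th entry is the storage block id of file block i
        self.entries = {}   # directory: name -> inode number

def file_mapping(inode, offset):
    """f(file, offset) -> storage block id (direct pointers only)."""
    index = offset // BLOCK_SIZE
    if index >= len(inode.blocks):
        raise ValueError("offset beyond end of file")
    return inode.blocks[index]

def locate(inodes, root_ino, path):
    """g(path_string) -> inode number, resolving one directory per component."""
    ino = root_ino
    for name in path.split("/"):
        if not name:            # skip the empty pieces produced by leading or doubled '/'
            continue
        node = inodes[ino]
        if not node.is_dir or name not in node.entries:
            raise FileNotFoundError(path)
        ino = node.entries[name]
    return ino

# Build /home/notes.txt with two data blocks.
root, home, notes = Inode(1, is_dir=True), Inode(2, is_dir=True), Inode(3)
root.entries["home"] = 2
home.entries["notes.txt"] = 3
notes.blocks = [120, 87]
inodes = {1: root, 2: home, 3: notes}
```

    Note that locate() costs one directory lookup per path component,
    while file_mapping() is a single array index in this sketch.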

 B) background: neural network and ML systems
    (Lab2)

    * backward propagation

    * autograd engine

    * dataflow graph (TensorFlow)


 C) an NN4Sys: learned index

    In general, NN4Sys has two main advantages:
      (1) succinct approximation data structure,
      (2) discovering sophisticated heuristics

    Learned index falls into the first category.
    A learned index can be smaller and faster than traditional data structures
    like the B-Tree (the original results are for read-only, in-memory workloads).
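
    A minimal sketch of the idea, assuming a single linear model over
    sorted, unique integer keys (the learned index paper uses a
    hierarchy of models; this one-model version and the LearnedIndex
    name are simplifications for illustration). The model predicts a
    position, and a recorded maximum error bounds the final search.

```python
import bisect

class LearnedIndex:
    """One linear model plus an error bound, standing in for a B-Tree."""

    def __init__(self, keys):
        self.keys = sorted(keys)          # assumed non-empty and unique
        n = len(self.keys)
        # Least-squares fit of position ~ a * key + b.
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(self.keys))
        self.a = cov / var if var else 0.0
        self.b = mean_p - self.a * mean_k
        # Largest prediction error over the stored keys.
        self.err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        return int(round(self.a * key + self.b))

    def lookup(self, key):
        """Return the position of key in the sorted array, or None if absent."""
        guess = self._predict(key)
        lo = max(0, guess - self.err)
        hi = min(len(self.keys), guess + self.err + 1)
        # Binary search only inside the window guaranteed by the error bound.
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None

keys = [3, 8, 15, 16, 23, 42, 57, 91]
index = LearnedIndex(keys)
```

    The "index" here is just three numbers (a, b, err) on top of the
    sorted array, which is where the space saving comes from; lookup
    cost depends on how tight err is, i.e. on how well the model fits
    the key distribution.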

  D) the following topic, learned scheduling, falls into the second category.


2. Hierarchical file systems are dead

  file systems vs. databases:

    Q: Consider you have 1,000 books at home. How will you organize them so that
       you can easily find the book you want?

    Q: Consider you are a librarian with 1,000 books. How would you organize books
       so that when a guest comes, you can find the book for them?

    Q: Is there any difference in how you organize the books?

  * hFSD argues, "situation, however, has evolved"

    i) storage size grows
      -- 1992: 300MB disk
      -- 2009: 300GB disk (17 years later)
      -- 2022: 20TB disk & 8TB SSD (13 years later)
         -- end of 2022: 200TB SSD
      -- 2030: 1000TB SSD (8 years later)
         [https://www.techradar.com/news/1000tb-ssds-could-become-mainstream-by-2030-as-samsung-plans-1000-layer-nand]

    ii) larger space => 
          file size wasn't growing that fast => 
            more files => 
              harder to manage
        [Q: does the logic follow?]

    iii) "Google is a verb"
      users ask for what they want instead of where it lives
      [a sharp observation]

    iv) "database themselves tend to be too heavy-weight a solution"
        -- not ideally optimized for a given application
        -- prevent independent evolution of the data
        -- painful to install and manage
        [are these true? what about SQLite?]

        my opinion: interface is the core problem
        * SQL vs. FS interface

   [skipped]
   * basis of a modern fs:
     -- backwards compatibility
     -- separate naming from access (*)
     -- data agnostic
     -- direct access to data

   * hFAD design
     [see Fig1]
     -- based on an object-based storage device
     -- naming (via index) and accessing (via obj store)

     Q: how about updates? deleting a file needs to update multiple
        indexes at the same time.

        Again, an index is a data structure that accelerates data retrieval,
        at the cost of more expensive updates and more space.
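
     A minimal sketch of the split between naming and access, assuming
     each index maps a (tag, value) pair to a set of object ids. The
     TaggedStore class and the tag names are hypothetical, not the
     paper's interface. It makes the update question concrete: a delete
     has to touch every index the object appears in.

```python
from collections import defaultdict

class TaggedStore:
    """Naming via per-tag indexes, access via a flat object store."""

    def __init__(self):
        self.objects = {}     # object id -> bytes (the "object store")
        # tag -> value -> set of object ids (one index per tag)
        self.indexes = defaultdict(lambda: defaultdict(set))
        self.tags = {}        # object id -> its (tag, value) pairs, needed on delete
        self.next_oid = 1

    def create(self, data, tags):
        oid = self.next_oid
        self.next_oid += 1
        self.objects[oid] = data
        self.tags[oid] = set(tags.items())
        for tag, value in tags.items():
            self.indexes[tag][value].add(oid)      # one index update per tag
        return oid

    def find(self, **query):
        """Return ids of objects matching all given tag=value pairs."""
        sets = [self.indexes[t].get(v, set()) for t, v in query.items()]
        return set.intersection(*sets) if sets else set(self.objects)

    def read(self, oid):
        return self.objects[oid]                   # access needs no name at all

    def delete(self, oid):
        """Remove the object; return how many index entries had to be updated."""
        touched = 0
        for tag, value in self.tags.pop(oid):
            self.indexes[tag][value].discard(oid)
            touched += 1
        del self.objects[oid]
        return touched

store = TaggedStore()
photo = store.create(b"jpeg bytes", {"user": "alice", "type": "photo",
                                     "path": "/home/alice/a.jpg"})
doc = store.create(b"text bytes", {"user": "alice", "type": "doc"})
```

     In this sketch, deleting the photo updates three indexes. A real
     system would also need those updates to be atomic with respect to
     crashes and concurrent searches, which the sketch ignores.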

   [skipped]
   * hFAD implementation
     [S3.4]

     Q: how to support efficient "insert" and "truncate"?
       [B-Tree:
         key(file offset) -> val(disk addr, length)
       ]
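
     A naive sketch of such an extent map, assuming a sorted Python
     list stands in for the B-Tree (ExtentMap and its methods are
     hypothetical, not the paper's implementation). Insert and truncate
     move no data bytes, only extent records, but note that insert must
     re-key every extent after the insertion point, which is the cost
     the question above is pointing at.

```python
import bisect

class ExtentMap:
    """Sorted list of (file offset, disk addr, length), standing in for a B-Tree."""

    def __init__(self):
        self.extents = []

    def size(self):
        if not self.extents:
            return 0
        off, _, length = self.extents[-1]
        return off + length

    def lookup(self, offset):
        """Translate a file offset into a disk address."""
        # Find the last extent whose starting offset is <= offset.
        i = bisect.bisect_right(self.extents, (offset, float("inf"), 0)) - 1
        if i < 0:
            raise ValueError("offset not mapped")
        off, addr, length = self.extents[i]
        if offset >= off + length:
            raise ValueError("offset not mapped")
        return addr + (offset - off)

    def insert(self, offset, addr, length):
        """Insert `length` bytes stored at `addr`; assumes 0 <= offset <= size()."""
        out, placed = [], False
        for off, a, n in self.extents:
            if off + n <= offset:                 # entirely before: unchanged
                out.append((off, a, n))
            elif off >= offset:                   # entirely after: shift its key
                if not placed:
                    out.append((offset, addr, length))
                    placed = True
                out.append((off + length, a, n))
            else:                                 # straddles the point: split in two
                head = offset - off
                out.append((off, a, head))
                out.append((offset, addr, length))
                placed = True
                out.append((offset + length, a + head, n - head))
        if not placed:                            # appending at end of file
            out.append((offset, addr, length))
        self.extents = out

    def truncate(self, new_size):
        """Drop everything at or beyond new_size, trimming the last extent."""
        out = []
        for off, a, n in self.extents:
            if off >= new_size:
                break
            out.append((off, a, min(n, new_size - off)))
        self.extents = out

m = ExtentMap()
m.insert(0, 1000, 100)    # 100-byte file stored at disk address 1000
m.insert(40, 5000, 10)    # insert 10 bytes in the middle; no data is copied
```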

   [skipped]
   * open questions

     Q: anything you think is interesting?


   [skipped]
   DISCUSSION: search-based fs vs. hierarchical fs


3. learned fs proposal
   [trimmed]