Week 5.a CS7670 10/03 2022
https://naizhengtan.github.io/22fall/

0. Admin
1. Range indexes
2. Learned index, RMI

---

[Write down first 10 lines of Algo 1]

0. Admin

* Lab2 released
  -- DL basics
  -- learned index
  -- next Wednesday, Brent, Will, and Han report Lab2 challenges

* citations:
    TF: 19,104
    PyTorch: 19,777
    learned index: 710
  [ask: how do you like the paper?]

* Q: why was this paper written this way?
  30 pages, "defensive writing": detailed experiment parameters,
  backed by detailed performance numbers.

1. Range Indexes

* will cover Range Indexes only today

* the problem: given a sorted array of data, construct a function "f":
    f(key) -> pos

* baselines:
  -- linear scan
  -- binary search

* Background: B-Tree
  "The B-tree generalizes the binary search tree, allowing for nodes
   with more than two children." (from Wikipedia)

  [draw a B-tree]
           +--------+
           | 7 | 16 |
           +--------+
          /      \       \
         /        \       \
       |/_        _\|      \ ...
  +-----------+    +----+
  | 1 | 2 | 6 |    |9|12|  ...
  +-----------+    +----+

2. Learned index, RMI

* read the last sentence of the abstract:
  "More importantly though, we believe that the idea of replacing core
   components of a data management system through learned models has far
   reaching implications for future systems designs and that this work
   just provides a glimpse of what might be possible."
  [a pioneer paper indeed]

* Q: what's the major argument in S2.1? Are you happy with the argument?
  -- use arithmetic instructions to replace branches; makes sense
     [Anecdote: in a memory allocator, getting rid of one branch
      yielded a 0.5% speedup.]
  -- GPU? TPU? latency is a problem
  -- "executing if-statements of CPUs essentially has stagnated"
     [5]: "Moore Law is Dead but GPU will get 1000X faster by 2025."

  DISCUSSION: Moore's Law, dead or alive?
  [core question: is single-thread performance still relevant?]
  Jensen Huang vs. Pat Gelsinger
  Facts:
    a) 8GHz CPUs are soon available
       (https://www.gizchina.com/2022/09/12/intel-13th-gen-raptor-lake-to-destroy-amd-with-8ghz-clock-speed/)
    b) 1.5GB L3 caches are available

* an interesting and sharp observation: a range index models the CDF
  of the stored keys

* S2.2, "indexing literally requires learning a data distribution."
  Q: Is this true? Do indexes only capture distributions?
  Only partially true:
  -- writes: "A database index is a data structure that improves the
     speed of data retrieval operations on a database table at the cost
     of additional writes and storage space to maintain the index data
     structure." (from Wikipedia)
  -- reads over unsorted data
  [skipped]

* S2.3, why doesn't the naive learned index work? There are three
  points. [skipped]

* S3, RMI; they built the Learning Index Framework (LIF),
  recursive-model indexes (RMI), and standard-error-based search
  strategies.
  Q: which part, do you think, is the most valuable?

* Observation: last-mile accuracy is hard; divide-and-conquer

  [draw an RMI]

* RMI inference
  Q: what's the output of the first-stage model?
  Q: what's the input of the second-stage model?

* RMI training
  1) loss function of stage 0:
       L_0 = \sum (f_0(x) - y)^2
  2) loss function of stage l: given input x,
       L_l = \sum (f_l^{eq}(x) - y)^2,
       where eq = floor(M_l * f_{l-1}(x) / N)
     Q: what is M_l? f_{l-1}(x)? N?
     [there is an easier way to describe this:]

        f_{l-1}(x)'s prediction
                 |
                 v
        y in |---|---|---|---|...   (M_l ranges)

     Q: how to understand "f_{l-1}(x) recursively executing ..."?
     Q: how to understand "RMI do not have to be trees"?
  3) training data
     Q: will a model in stage l see all xs?
     A: No. The previous stage's models decide the training set of
        each model in the next stage.
  4) went through the training algorithm
     Q: "but also allows representing the entire index as a sparse
        matrix-multiplication for a TPU/GPU"?
     [see: https://github.com/stanleybak/vnncomp2021/blob/main/benchmarks/nn4sys/imgs/network.jpg]

* RMI inference again
  write down the inference algorithm?
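As a warm-up, here is a minimal runnable sketch of a two-stage RMI lookup in Python. All names (`data`, `root`, `stage2`, `err`, `lookup`) are illustrative, not from the paper's code, and the stage models are exact linear functions for simplicity; a real RMI would train them and measure `err` as the worst-case prediction error.

```python
from bisect import bisect_left

# Hypothetical two-stage RMI over a sorted array `data` of N keys.
# Each "model" is just a function key -> predicted position; a real RMI
# would train these (linear regression, small NNs, ...). Here the models
# are exact, so the search bound err is 0.

data = list(range(0, 2000, 2))          # sorted keys: 0, 2, 4, ..., 1998
N = len(data)                           # 1000 positions

def root(key):                          # stage-1 model
    return key / 2

stage2 = [lambda key: key / 2] * 10     # stage-2 models (all identical here)
err = 0                                 # max |prediction - true position|

def lookup(key):
    # stage 1: predict a position, scale it to choose a stage-2 model
    # (this mirrors eq = floor(M_l * f_{l-1}(x) / N) from the notes)
    j = min(int(root(key) * len(stage2) / N), len(stage2) - 1)
    # stage 2: the chosen model predicts the final position
    pred = int(stage2[j](key))
    # last mile: bounded binary search in [pred - err, pred + err]
    lo, hi = max(0, pred - err), min(N, pred + err + 1)
    i = bisect_left(data, key, lo, hi)
    return i if i < N and data[i] == key else None

print(lookup(500))   # -> 250
```

Note the design choice: stage 1 never names a stage-2 model directly; its position prediction is rescaled into a model id, which is exactly why each stage can be trained against positions y.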
  inference(index[][], key):
      j = 1
      for i <- 1 to (M-1) do:
          # scale the predicted position (in [0, N)) to a model id in
          # stage i+1, matching eq = floor(M_l * f_{l-1}(x) / N)
          j = floor(index[i][j](key) * stages[i+1] / N) + 1
      pred_y = index[M][j](key)
      return binary_search(key, pred_y - err, pred_y + err)

  Q: why do the authors train L_0 with y (final positions), instead of
     with the indexes of the next-stage models?
  [Try it in your Lab2!]
  [skipped]

  DISCUSSION: RMI, depth vs. width
    i.e., training time vs. inference time
    i.e., number of models vs. runtime performance
  [skipped]

* S3.4, range query (non-existing keys)
  problem: range_query(key1, key2) -> {all keys in [key1, key2]}
  challenge: what if key1/key2 do not exist in the database?
             (The learned index hasn't seen them.)
  mitigations:
  -- monotonic models
  -- global search
  another solution: verified NNs
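The monotonic-model mitigation can be made concrete: a monotonic model preserves key order, so even for an unseen key its prediction lands near the correct insertion point, and a bounded local search (with a global search as fallback) recovers the true lower bound. A hedged sketch in Python; `model`, `lower_bound`, and `range_query` are hypothetical names, and the crude linear fit stands in for a trained monotonic model:

```python
from bisect import bisect_left

# Hypothetical sketch: range queries over a learned index whose model is
# monotonic. Because the model preserves key order, it gives a sensible
# position even for key1/key2 it has never seen.

data = [1, 3, 7, 8, 15, 22, 40, 41, 90, 100]   # sorted keys
N = len(data)

def model(key):
    # stand-in for a trained monotonic model: a crude linear fit
    return key * (N - 1) / data[-1]

# worst-case prediction error over the stored keys
err = max(abs(int(model(k)) - i) for i, k in enumerate(data))

def lower_bound(key):
    # first position with data[pos] >= key, even if key is absent
    pred = min(max(int(model(key)), 0), N - 1)
    lo, hi = max(0, pred - err), min(N, pred + err + 1)
    pos = bisect_left(data, key, lo, hi)
    # if the err-window missed, fall back to a global search
    # (the second mitigation in the notes)
    if (pos == lo and lo > 0 and data[lo - 1] >= key) or (pos == hi and hi < N):
        pos = bisect_left(data, key)
    return pos

def range_query(key1, key2):
    # all stored keys in [key1, key2], inclusive
    start = lower_bound(key1)
    end = lower_bound(key2)
    if end < N and data[end] == key2:
        end += 1
    return data[start:end]

print(range_query(5, 41))   # -> [7, 8, 15, 22, 40, 41]
```

Neither endpoint of the example query (5 and 41's left neighbor 40 aside) needs to be a stored key: the monotone prediction plus the local/global search brackets the boundary either way.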