Week 5.b CS7670 10/05 2022
https://naizhengtan.github.io/22fall/

1. Last time
2. this paper
3. three learned indexes
4. experiments

---

1. Last time

 * RMI
   [draw one on board; say models are NNs]

 * range query (non-existing keys) problem:
     range_query(key1, key2) -> {all keys in [key1, key2]}
   challenge: what if key1/key2 do not exist in the database?
   (The learned index hasn't seen them.)

 * two desired properties:
   Q: RMI(x) <= RMI(x+1)? (monotonicity)
   Q: if x \not\in dataset, is |RMI(x) - y| <= \epsilon, where y is the true
      position? (bounded errors for non-existing keys)

 * mitigations:
   -- monotonic models
   -- global search

 * another solution: verified NNs

2. benchmarking learned indexes

   Q: what's the point of this paper?
   [read the second and fourth paragraphs; VLDB'20]
   -- vs. traditional indexes
   -- vs. newer learned indexes

   understanding learned indexes:
   (1) a Pareto analysis
   (2) why learned indexes work
   (3) whether learned indexes work under a "real-world" setup

   limitations:
   (1) read-only
   (2) tight loop
   Q: anything else? (embedded in the problem)

 A) problem statement:

      I: Z -> (Z^+ x Z^+)

    Q: what is the lower bound of a key x?
       LB(x), the lower bound: the first element >= x

  * spec: \forall x, I(x) -> (lo, hi), D_{lo} <= LB(x) <= D_{hi}
    Notice this spec covers non-existing keys.

  * assumptions:
    -- data is sorted
    -- "We assume that data is stored in a way supporting fast random access
       (e.g., an array)."
       Q: why do they need this assumption? what if it does not hold?
       [binary search won't work]

    DISCUSSION: where does data with fast random access live?
                And what does that imply?
    [The data is part of the index for disk-based databases.]

  * the third limitation: only works for in-memory databases
    otherwise, the _true size_ of the index should include the in-memory data.
    WHY? [think of all data on disk: will a B-Tree work? will a learned index work?]

 B) approximating the CDF

    [draw Fig2]
    They claim I_A(x) = ... (see paper), and the errors are bounded by 0.16*|D|.
    Do you agree?

    Q: how to understand "some adjustments may be required for lookups of
       absent keys"? (S2.1, under Figure 3)

    See Fig2; what's the error for key=13?
    LB(key=13) = ? [0.5]
    Q: what is A(13)?

    Now, calculate A(x) = a*x + b based on (100, 1.0) and (0, 0.1)
    [in fact, 99 and 1]:
      a = (1.0 - 0.1) / 100 = 0.009
      b = 0.1
    So, A(13) = 0.009 * 13 + 0.1 = 0.217,
    and the error is 0.5 - 0.217 = 0.283 [much larger than 0.16].

    In fact, this is not the A they use in the paper, because our A gives
    A(12) = 0.208 [not 0.24 as in the paper]; but from the single point
    (12, 0.24), we cannot infer their A.

3. Three (non-NN) learned indexes

 A) RMI
    top-down

    Q: RMI's data structures?
       models[][]: M layers; each layer has 1 or more models

  * RMI inference
    write down the inference algorithm:

      inference(models[][], x, data):
          j = 1
          for i <- 1 to (M-1) do:
              // scale the predicted position to a model index in layer i+1
              j = models[i][j](x) * size(models[i+1]) / size(data)
          pred_y = models[M][j](x)
          return binary_search(data, x, pred_y - err, pred_y + err)

    DISCUSSION: RMI, depth vs. width
      i.e., training time vs. inference time
      i.e., number of models vs. runtime performance

    Q: Notice that they don't use NNs anymore. Why?
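   [To make the control flow concrete, below is a minimal runnable sketch of
    a two-layer RMI lookup in Python. It assumes linear models that predict a
    position in [0, len(data)); LinearModel, rmi_lookup, and the fixed err
    bound are illustrative choices for this sketch, not the paper's code.]

     from bisect import bisect_left

     class LinearModel:
         """y = a*x + b, predicting a position in the sorted array."""
         def __init__(self, a, b):
             self.a, self.b = a, b
         def __call__(self, x):
             return self.a * x + self.b

     def rmi_lookup(root, leaves, data, x, err):
         # layer 1: the root predicts a position; scale it to a leaf index
         pos = root(x)
         j = min(max(int(pos * len(leaves) / len(data)), 0), len(leaves) - 1)
         # layer 2: the chosen leaf predicts the final position
         pred = int(leaves[j](x))
         # last mile: binary search within [pred - err, pred + err]
         lo = max(pred - err, 0)
         hi = min(pred + err + 1, len(data))
         i = bisect_left(data, x, lo, hi)
         return i if i < hi and data[i] == x else None

     # toy usage: keys 0, 3, 6, ..., 999; one crude linear CDF fit reused
     data = list(range(0, 1000, 3))
     root = LinearModel(len(data) / 1000, 0)
     leaves = [LinearModel(len(data) / 1000, 0)] * 4
     assert rmi_lookup(root, leaves, data, 300, err=32) == 100  # data[100] == 300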
 B) RS (radix spline index)
    bottom-up

    "radix" works well for fixed-size numbers (like 32-bit integers)

    Q: what are the data structures in RS?
    -- radix table (radixTable): an array (elements are ints, indexing into
       the linear spline)
    -- linear spline (splines): an array (elements are [key, pos] pairs)

  * RS inference:

      inference(radixTable[], splines[], x, data):
          rbits = first r bits of the key x
          start = splines[radixTable[rbits]]     // what is this?
          end = splines[radixTable[rbits+1]]     // what is this?
          return binary_search(data, x, start, end)

  * two parameters:
    -- "error bound": decides the length of the linear spline
    -- "r-bit prefix": decides the size of the radix table

    DISCUSSION: the trade-off of "r"
      "r" = 1: binary search over (nearly) the whole linear spline
      "r" = key width in bits (log_2 of the key space):
            size(radix table) = 2^r >= size(dataset)

 C) PGM (piecewise geometric model indexes)
    bottom-up

    Q: PGM's data structures?
       multiple arrays[]: elements are (key, linear model F_i)
       multiple functions F_i[]: one function for each layer

  * inference (assuming 2 levels of PGM):

      inference(arrays[], F_0, F_1[], x):   // data == arrays[1]
          pos = F_0(x)
          f_i = binary_search(arrays[0], x, pos - err, pos + err)
          pos = F_1[f_i](x)
          return binary_search(arrays[1], x, pos - err, pos + err)

    Q: RS vs. PGM sound similar. What's the core difference?
       For the first step,
         RS: generating spline points
         PGM: producing pieces
       Will they partition the key space the same way?

    Q: assume the partitions are the same (in fact, they are not);
       what are the other differences?

    DISCUSSION: NN-based learned indexes vs. ML learned indexes
    [skipped]

4. experiments

 * setup
   "Experiments are conducted on a machine with 256 GB of RAM and an
    Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz."
   20 cores, 40 threads, 27.5 MB cache

   Q: what's the size of the dataset?
      e.g., wiki: "200 million unsigned 64-bit integer keys"
      200M * 8B ~= 1.6GB

   Q: what is the _true size_ (data in memory) of these indexes?

 * Table 2: comparing with hash tables.
   Q: is this fair?

   DISCUSSION: when do people need a sorted dataset?
               why does the paper focus only on point lookups?

 * Pareto analysis
   Fig7: the trade-off between size and lookup time
   Q: can you imagine the figure if the x-axis were not log-scale?
   [read footnote 2 on page 6]

 * log_2 error (Fig12)
   Q: compare PGM and RMI:
      given the same lookup time, which has smaller error?
      given the same error, which has shorter lookup time?
   Q: does the trend make sense?
      (the higher the error, the higher the lookup time)
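   [A back-of-envelope Python sketch of why that trend holds, assuming the
    last-mile step is a binary search over a window of 2*err candidate
    positions, so its cost grows with log_2 of the error; last_mile_steps is
    an illustrative name, not from the paper:]

     import math

     def last_mile_steps(err):
         # a binary search over [pred - err, pred + err] inspects about
         # log2(2 * err) + 1 positions in the worst case
         return math.ceil(math.log2(2 * err)) + 1

     for err in (8, 64, 512, 4096):
         print(err, last_mile_steps(err))   # 5, 8, 11, 14 steps

   [So every doubling of the error adds roughly one more comparison to the
    last-mile search.]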