Week 5.b CS7670 10/05 2022
https://naizhengtan.github.io/22fall/

1. Last time
2. this paper
3. three learned indexes
4. experiments

---

1. Last time

 * RMI
   [draw one on board; say models are NNs]

 * range query (non-existing keys) problem:
     range_query(key1, key2) -> {all keys in [key1, key2]}
   challenge: what if key1/key2 do not exist in the database?
   (The learned index hasn't seen them.)

 * two desired properties:
   Q: RMI(x) <= RMI(x+1)? (monotonicity)
   Q: if x \not\in dataset, is |RMI(x) - y| <= \epsilon, where y is the true
      position? (bounded errors for non-existing keys)

 * mitigations:
   -- monotonic models
   -- global search

 * another solution: verified NNs

2. benchmarking learned indexes

   Q: what's the point of this paper?
   [read the second and fourth paragraphs; VLDB'20]
   -- vs. traditional indexes
   -- vs. newer learned indexes

   understanding learned indexes:
   (1) a Pareto analysis
   (2) why learned indexes work
   (3) whether learned indexes work under a "real-world" setup

   limitations:
   (1) read-only
   (2) tight loop
   Q: anything else? (embedded in the problem)

 A) problem statement:

      I: Z -> (Z^+ x Z^+)

    Q: what is the lower bound of a key x?
       LB(x), the lower bound: the first element >= x

  * spec: \forall x, I(x) -> (lo, hi), D_{lo} <= LB(x) <= D_{hi}
    Notice this spec covers non-existing keys.

  * assumptions:
    -- data is sorted
    -- "We assume that data is stored in a way supporting fast random access
       (e.g., an array)."
       Q: why do they need this assumption? what if it does not hold?
       [binary search won't work]

    DISCUSSION: where does data with fast random access live?
                And what does that imply?
    [The data is part of the index for disk-based databases.]

  * the third limitation: only works for in-memory databases
    otherwise, the _true size_ of the index should include the in-memory data.
    WHY? [think of all data on disk: will a B-Tree work? will a learned index work?]

 B) approximating the CDF

    [draw Fig2]
    They claim I_A(x) = ... (see paper), and the errors are bounded by 0.16*|D|.
    Do you agree?

    Q: how to understand "some adjustments may be required for lookups of
       absent keys"? (S2.1, under Figure 3)

    See Fig2; what's the error for key=13?
    LB(key=13) = ? [0.5]
    Q: what is A(13)?

    Now, calculate A(x) = a*x + b based on (100, 1.0) and (0, 0.1)
    [in fact, 99 and 1]:
      a = (1.0 - 0.1) / 100 = 0.009
      b = 0.1
    So, A(13) = 0.009 * 13 + 0.1 = 0.217,
    and the error is 0.5 - 0.217 = 0.283 [much larger than 0.16].

    In fact, this is not the A they use in the paper, because our A gives
    A(12) = 0.208 [not 0.24 as in the paper]; but from the single point
    (12, 0.24), we cannot infer their A.

3. Three (non-NN) learned indexes

 A) RMI
    top-down

    Q: RMI's data structures?
       models[][]: M layers; each layer has 1 or more models

  * RMI inference
    write down the inference algorithm:

      inference(models[][], x, data):
          j = 1
          for i <- 1 to (M-1) do:
              // scale the predicted position to a model index in layer i+1
              j = models[i][j](x) * size(models[i+1]) / size(data)
          pred_y = models[M][j](x)
          return binary_search(data, x, pred_y - err, pred_y + err)

    DISCUSSION: RMI, depth vs. width
      i.e., training time vs. inference time
      i.e., number of models vs. runtime performance

    Q: Notice that they don't use NNs anymore. Why?
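   [To make the control flow concrete, below is a minimal runnable sketch of
    a two-layer RMI lookup in Python. It assumes linear models that predict a
    position in [0, len(data)); LinearModel, rmi_lookup, and the fixed err
    bound are illustrative choices for this sketch, not the paper's code.]

     from bisect import bisect_left

     class LinearModel:
         """y = a*x + b, predicting a position in the sorted array."""
         def __init__(self, a, b):
             self.a, self.b = a, b
         def __call__(self, x):
             return self.a * x + self.b

     def rmi_lookup(root, leaves, data, x, err):
         # layer 1: the root predicts a position; scale it to a leaf index
         pos = root(x)
         j = min(max(int(pos * len(leaves) / len(data)), 0), len(leaves) - 1)
         # layer 2: the chosen leaf predicts the final position
         pred = int(leaves[j](x))
         # last mile: binary search within [pred - err, pred + err]
         lo = max(pred - err, 0)
         hi = min(pred + err + 1, len(data))
         i = bisect_left(data, x, lo, hi)
         return i if i < hi and data[i] == x else None

     # toy usage: keys 0, 3, 6, ..., 999; one crude linear CDF fit reused
     data = list(range(0, 1000, 3))
     root = LinearModel(len(data) / 1000, 0)
     leaves = [LinearModel(len(data) / 1000, 0)] * 4
     assert rmi_lookup(root, leaves, data, 300, err=32) == 100  # data[100] == 300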
 B) RS (radix spline index)
    bottom-up

    "radix" works well for fixed-size numbers (like 32-bit integers)

    Q: what are the data structures in RS?
    -- radix table (radixTable): an array (elements are ints, indexing into
       the linear spline)
    -- linear spline (splines): an array (elements are [key, pos] pairs)

  * RS inference:

      inference(radixTable[], splines[], x, data):
          rbits = first r bits of the key x
          start = splines[radixTable[rbits]]     // what is this?
          end = splines[radixTable[rbits+1]]     // what is this?
          return binary_search(data, x, start, end)

  * two parameters:
    -- "error bound": decides the length of the linear spline
    -- "r-bit prefix": decides the size of the radix table

    DISCUSSION: the trade-off of "r"
      "r" = 1: binary search over (nearly) the whole linear spline
      "r" = key width in bits (log_2 of the key space):
            size(radix table) = 2^r >= size(dataset)

 C) PGM (piecewise geometric model indexes)
    bottom-up

    Q: PGM's data structures?
       multiple arrays[]: elements are (key, linear model F_i)
       multiple functions F_i[]: one function for each layer

  * inference (assuming 2 levels of PGM):

      inference(arrays[], F_0, F_1[], x):   // data == arrays[1]
          pos = F_0(x)
          f_i = binary_search(arrays[0], x, pos - err, pos + err)
          pos = F_1[f_i](x)
          return binary_search(arrays[1], x, pos - err, pos + err)

    Q: RS vs. PGM sound similar. What's the core difference?
       For the first step,
         RS: generating spline points
         PGM: producing pieces
       Will they partition the key space the same way?

    Q: assume the partitions are the same (in fact, they are not);
       what are the other differences?

    DISCUSSION: NN-based learned indexes vs. ML learned indexes
    [skipped]

4. experiments

 * setup
   "Experiments are conducted on a machine with 256 GB of RAM and an
    Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz."
   20 cores, 40 threads, 27.5 MB cache

   Q: what's the size of the dataset?
      e.g., wiki: "200 million unsigned 64-bit integer keys"
      200M * 8B ~= 1.6GB

   Q: what is the _true size_ (data in memory) of these indexes?

 * Table 2: comparing with hash tables.
   Q: is this fair?

   DISCUSSION: when do people need a sorted dataset?
               why does the paper focus only on point lookups?

 * Pareto analysis
   Fig7: the trade-off between size and lookup time
   Q: can you imagine the figure if the x-axis were not log-scale?
   [read footnote 2 on page 6]

 * log_2 error (Fig12)
   Q: compare PGM and RMI:
      given the same lookup time, which has smaller error?
      given the same error, which has shorter lookup time?
   Q: does the trend make sense?
      (the higher the error, the higher the lookup time)
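   [A back-of-envelope Python sketch of why that trend holds, assuming the
    last-mile step is a binary search over a window of 2*err candidate
    positions, so its cost grows with log_2 of the error; last_mile_steps is
    an illustrative name, not from the paper:]

     import math

     def last_mile_steps(err):
         # a binary search over [pred - err, pred + err] inspects about
         # log2(2 * err) + 1 positions in the worst case
         return math.ceil(math.log2(2 * err)) + 1

     for err in (8, 64, 512, 4096):
         print(err, last_mile_steps(err))   # 5, 8, 11, 14 steps

   [So every doubling of the error adds roughly one more comparison to the
    last-mile search.]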