Week 5.a CS7670 10/03 2022
https://naizhengtan.github.io/22fall/

0. Admin
1. Range indexes
2. Learned index, RMI

---

[Write down first 10 lines of Algo 1]

0. Admin

* Lab2 released
  -- DL basics
  -- learned index
  -- next Wednesday, Brent, Will, and Han report Lab2 challenges

* citations:
    TF: 19,104
    PyTorch: 19,777
    learned index: 710
  [ask: how do you like the paper?]

* Q: why was this paper written this way?
  30 pages, "defensive writing": detailed experiment parameters,
  backed by detailed performance numbers.

1. Range Indexes

* will cover Range Indexes only today

* the problem: given a sorted array of data, construct a function "f":
    f(key) -> pos

* baselines:
  -- linear scan
  -- binary search

* Background: B-Tree
  "The B-tree generalizes the binary search tree, allowing for nodes
   with more than two children." (from Wikipedia)

  [draw a B-tree]
           +--------+
           | 7 | 16 |
           +--------+
          /      \       \
         /        \       \
       |/_        _\|      \ ...
  +-----------+    +----+
  | 1 | 2 | 6 |    |9|12|  ...
  +-----------+    +----+

2. Learned index, RMI

* read the last sentence of the abstract:
  "More importantly though, we believe that the idea of replacing core
   components of a data management system through learned models has far
   reaching implications for future systems designs and that this work
   just provides a glimpse of what might be possible."
  [a pioneer paper indeed]

* Q: what's the major argument in S2.1? Are you happy with the argument?
  -- use arithmetic instructions to replace branches; makes sense
     [Anecdote: in a memory allocator, getting rid of one branch
      yielded a 0.5% speedup.]
  -- GPU? TPU? latency is a problem
  -- "executing if-statements of CPUs essentially has stagnated"
     [5]: "Moore Law is Dead but GPU will get 1000X faster by 2025."

  DISCUSSION: Moore's Law, dead or alive?
  [core question: is single-thread performance still relevant?]
  Jensen Huang vs. Pat Gelsinger
  Facts:
    a) 8GHz CPUs are soon available
       (https://www.gizchina.com/2022/09/12/intel-13th-gen-raptor-lake-to-destroy-amd-with-8ghz-clock-speed/)
    b) 1.5GB L3 caches are available

* an interesting and sharp observation: a range index models the CDF
  of the stored keys

* S2.2, "indexing literally requires learning a data distribution."
  Q: Is this true? Do indexes only capture distributions?
  Only partially true:
  -- writes: "A database index is a data structure that improves the
     speed of data retrieval operations on a database table at the cost
     of additional writes and storage space to maintain the index data
     structure." (from Wikipedia)
  -- reads over unsorted data
  [skipped]

* S2.3, why doesn't the naive learned index work? There are three
  points. [skipped]

* S3, RMI; they built the Learning Index Framework (LIF),
  recursive-model indexes (RMI), and standard-error-based search
  strategies.
  Q: which part, do you think, is the most valuable?

* Observation: last-mile accuracy is hard; divide-and-conquer

  [draw an RMI]

* RMI inference
  Q: what's the output of the first-stage model?
  Q: what's the input of the second-stage model?

* RMI training
  1) loss function of stage 0:
       L_0 = \sum (f_0(x) - y)^2
  2) loss function of stage l: given input x,
       L_l = \sum (f_l^{eq}(x) - y)^2,
       where eq = floor(M_l * f_{l-1}(x) / N)
     Q: what is M_l? f_{l-1}(x)? N?
     [there is an easier way to describe this:]

        f_{l-1}(x)'s prediction
                 |
                 v
        y in |---|---|---|---|...   (M_l ranges)

     Q: how to understand "f_{l-1}(x) recursively executing ..."?
     Q: how to understand "RMI do not have to be trees"?
  3) training data
     Q: will a model in stage l see all xs?
     A: No. The previous stage's models decide the training set of
        each model in the next stage.
  4) went through the training algorithm
     Q: "but also allows representing the entire index as a sparse
        matrix-multiplication for a TPU/GPU"?
     [see: https://github.com/stanleybak/vnncomp2021/blob/main/benchmarks/nn4sys/imgs/network.jpg]

* RMI inference again
  write down the inference algorithm?
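As a warm-up, here is a minimal runnable sketch of a two-stage RMI lookup in Python. All names (`data`, `root`, `stage2`, `err`, `lookup`) are illustrative, not from the paper's code, and the stage models are exact linear functions for simplicity; a real RMI would train them and measure `err` as the worst-case prediction error.

```python
from bisect import bisect_left

# Hypothetical two-stage RMI over a sorted array `data` of N keys.
# Each "model" is just a function key -> predicted position; a real RMI
# would train these (linear regression, small NNs, ...). Here the models
# are exact, so the search bound err is 0.

data = list(range(0, 2000, 2))          # sorted keys: 0, 2, 4, ..., 1998
N = len(data)                           # 1000 positions

def root(key):                          # stage-1 model
    return key / 2

stage2 = [lambda key: key / 2] * 10     # stage-2 models (all identical here)
err = 0                                 # max |prediction - true position|

def lookup(key):
    # stage 1: predict a position, scale it to choose a stage-2 model
    # (this mirrors eq = floor(M_l * f_{l-1}(x) / N) from the notes)
    j = min(int(root(key) * len(stage2) / N), len(stage2) - 1)
    # stage 2: the chosen model predicts the final position
    pred = int(stage2[j](key))
    # last mile: bounded binary search in [pred - err, pred + err]
    lo, hi = max(0, pred - err), min(N, pred + err + 1)
    i = bisect_left(data, key, lo, hi)
    return i if i < N and data[i] == key else None

print(lookup(500))   # -> 250
```

Note the design choice: stage 1 never names a stage-2 model directly; its position prediction is rescaled into a model id, which is exactly why each stage can be trained against positions y.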
  inference(index[][], key):
      j = 1
      for i <- 1 to (M-1) do:
          # scale the predicted position (in [0, N)) to a model id in
          # stage i+1, matching eq = floor(M_l * f_{l-1}(x) / N)
          j = floor(index[i][j](key) * stages[i+1] / N) + 1
      pred_y = index[M][j](key)
      return binary_search(key, pred_y - err, pred_y + err)

  Q: why do the authors train L_0 with y (final positions), instead of
     with the indexes of the next-stage models?
  [Try it in your Lab2!]
  [skipped]

  DISCUSSION: RMI, depth vs. width
    i.e., training time vs. inference time
    i.e., number of models vs. runtime performance
  [skipped]

* S3.4, range query (non-existing keys)
  problem: range_query(key1, key2) -> {all keys in [key1, key2]}
  challenge: what if key1/key2 do not exist in the database?
             (The learned index hasn't seen them.)
  mitigations:
  -- monotonic models
  -- global search
  another solution: verified NNs
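The monotonic-model mitigation can be made concrete: a monotonic model preserves key order, so even for an unseen key its prediction lands near the correct insertion point, and a bounded local search (with a global search as fallback) recovers the true lower bound. A hedged sketch in Python; `model`, `lower_bound`, and `range_query` are hypothetical names, and the crude linear fit stands in for a trained monotonic model:

```python
from bisect import bisect_left

# Hypothetical sketch: range queries over a learned index whose model is
# monotonic. Because the model preserves key order, it gives a sensible
# position even for key1/key2 it has never seen.

data = [1, 3, 7, 8, 15, 22, 40, 41, 90, 100]   # sorted keys
N = len(data)

def model(key):
    # stand-in for a trained monotonic model: a crude linear fit
    return key * (N - 1) / data[-1]

# worst-case prediction error over the stored keys
err = max(abs(int(model(k)) - i) for i, k in enumerate(data))

def lower_bound(key):
    # first position with data[pos] >= key, even if key is absent
    pred = min(max(int(model(key)), 0), N - 1)
    lo, hi = max(0, pred - err), min(N, pred + err + 1)
    pos = bisect_left(data, key, lo, hi)
    # if the err-window missed, fall back to a global search
    # (the second mitigation in the notes)
    if (pos == lo and lo > 0 and data[lo - 1] >= key) or (pos == hi and hi < N):
        pos = bisect_left(data, key)
    return pos

def range_query(key1, key2):
    # all stored keys in [key1, key2], inclusive
    start = lower_bound(key1)
    end = lower_bound(key2)
    if end < N and data[end] == key2:
        end += 1
    return data[start:end]

print(range_query(5, 41))   # -> [7, 8, 15, 22, 40, 41]
```

Neither endpoint of the example query (5 and 41's left neighbor 40 aside) needs to be a stored key: the monotone prediction plus the local/global search brackets the boundary either way.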