Week 9.a CS7670 10/31 2022
https://naizhengtan.github.io/22fall/

1. backgrounds: LSTM & AMD
2. this paper

----

* Q: How's your task going?

* Commit your code to the github repo:
  (https://github.com/NEU-CS7670-labs/cs7670-22fall)

* Q: How do you like the paper?
  Give me one reason why you like it and one why you don't.

1. backgrounds

* CPU hardware prefetcher [see handout]

* AMD NN prefetcher

* LSTM: Long Short-Term Memory
  -- LSTM is an RNN with memory cells.
  -- each memory cell contains an internal state and a number of
     multiplicative gates that determine whether
     (i) a given input should impact the internal state (the input gate),
     (ii) the internal state should be flushed to 0 (the forget gate), and
     (iii) the internal state of a given neuron should be allowed to impact
           the cell's output (the output gate).
     [if interested, read: https://d2l.ai/chapter_recurrent-modern/lstm.html]
     [see Sketch 1 at the end of these notes]

  -- inference:

          y_t         y_t+1       y_t+2
           ^            ^           ^
           |            |           |
        +------+     +------+    +------+
        |      | ->  |      | -> |      |->
        | cell |     | cell |    | cell |
        |      | ->  |      | -> |      |->
        +------+     +------+    +------+
           ^            ^           ^
           |            |           |
          x_t         x_t+1       x_t+2

2. this paper

* Q: a claim, "Since memory accesses have an underlying grammar similar to ..."
  WHY? Do you agree with this claim?

* Q: what are the challenges? and why?
  (1) model size
  (2) large traces
  (3) real-time inference
  (4) retraining the model online
  Aren't (1) and (3) the same thing?

* Q: How does this work address these challenges?
  for (1), compression (or, really, input/output encoding)
  for (2), ?
  for (3), encoding => small networks
  for (4), ?

* core idea: a naive compression
  Q: what is that?

* [read Fig1]
  given a series of inputs X, for lag=k, the autocorrelation coefficient r_k
  is calculated as: [write this on the board]

     r_k = \sum_i (X_i - X_{avg}) (X_{i+k} - X_{avg}) / \sum_i (X_i - X_{avg})^2

  Q: what is X_i here?
  [see Sketch 2 at the end of these notes]

* Q: what is the problem?

* borrowed from "Learning Memory Access Patterns"
  [https://arxiv.org/pdf/1803.02329.pdf]

  A) Training data set: [read S3.2]
     run programs => memory trace
     addresses (in trace) =[ordinal encoding]=> integers
     integers => deltas

     Example:
       address set: {0x123, 0x233, 0x234, 0x456}
       trace [0x123, 0x234, 0x233, 0x456] => [1,3,2,4] => [0,2,-1,2]

     [see Sketch 3 at the end of these notes]

  B) model's inputs/outputs:
     Q: what are the inputs/outputs of the model? [read S4.2]

       last three predictions (?) -->[LSTM]--> one of 50K deltas

     -- calculate the numbers for Table1
     [see Sketch 4 at the end of these notes]

* DISCUSSION: memory prefetcher vs. block prefetcher
  Q: which is harder? and why?
  Q: do they share the same abstract problem?
  Q: do they share the same challenges?
  Q: which component implements a memory prefetcher? a block prefetcher?
  Q: how about concurrency?

* Q: the evaluation can be improved
  -- better captions
  -- the evaluation leaves out inference time
  -- what're the problems of Table1?
     -- missing major points
  -- what're the problems of Fig3?
  -- what're the problems of Fig4?
  -- what're the problems of Fig5?
     -- use vector figures instead of raster figures
  -- what're the problems of Fig6?
     -- explain (or remind readers of) the experiment setup
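----

Code sketches referenced above (illustrative only; the names, sizes, and layer
choices below are placeholders, not taken from the paper):

* Sketch 1: a single LSTM cell step plus an unrolled inference loop in plain
  Python/numpy, to make the three gates in the backgrounds section concrete.
  The weight containers W, U, b and the function names are placeholders.

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def lstm_cell_step(x, h_prev, c_prev, W, U, b):
          """One LSTM time step. W, U, b are dicts keyed by gate name:
          'i' (input), 'f' (forget), 'o' (output), 'g' (candidate update)."""
          i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])  # input gate: should this input impact the internal state?
          f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])  # forget gate: should the internal state be flushed toward 0?
          o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])  # output gate: should the internal state impact the output?
          g = np.tanh(W['g'] @ x + U['g'] @ h_prev + b['g'])  # candidate update
          c = f * c_prev + i * g                              # new internal (cell) state
          h = o * np.tanh(c)                                  # cell output / hidden state (the y_t in the diagram)
          return h, c

      def lstm_inference(xs, h0, c0, W, U, b):
          """Unrolled inference, mirroring the cell -> cell -> cell diagram."""
          h, c, ys = h0, c0, []
          for x in xs:                  # x_t, x_{t+1}, x_{t+2}, ...
              h, c = lstm_cell_step(x, h, c, W, U, b)
              ys.append(h)              # y_t, y_{t+1}, y_{t+2}, ...
          return ys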
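* Sketch 2: a small helper computing the autocorrelation coefficient r_k
  exactly as written in the Fig1 discussion above; the function name is a
  placeholder.

      import numpy as np

      def autocorr(X, k):
          """r_k = sum_i (X_i - X_avg)(X_{i+k} - X_avg) / sum_i (X_i - X_avg)^2,
          for 1 <= k < len(X)."""
          X = np.asarray(X, dtype=float)
          d = X - X.mean()
          return float(np.sum(d[:-k] * d[k:]) / np.sum(d * d))

      # For a periodic series, r_k at the period length approaches 1 as the
      # series grows, e.g.:
      #   autocorr([0, 2, -1, 2] * 100, k=4)   # ~0.99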
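* Sketch 3: the ordinal-then-delta encoding from A), reproducing the example
  above. Assigning delta 0 to the first access (which has no predecessor) is
  an assumption here; the paper may handle the first element differently.

      def ordinal_encode(trace):
          """Map each distinct address to its 1-based rank in the sorted address set."""
          rank = {addr: i + 1 for i, addr in enumerate(sorted(set(trace)))}
          return [rank[a] for a in trace]

      def to_deltas(ordinals):
          """Successive differences; the first access has no predecessor, so its delta is 0."""
          return [0] + [b - a for a, b in zip(ordinals, ordinals[1:])]

      trace = [0x123, 0x234, 0x233, 0x456]
      print(ordinal_encode(trace))             # [1, 3, 2, 4]
      print(to_deltas(ordinal_encode(trace)))  # [0, 2, -1, 2]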
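* Sketch 4: the rough input/output shape from B) as a PyTorch module: a short
  history of delta IDs in, logits over a ~50K-delta vocabulary out. The
  embedding size, hidden size, and single-layer LSTM are assumptions for
  illustration, not the paper's architecture.

      import torch
      import torch.nn as nn

      class DeltaLSTM(nn.Module):
          """History of recent delta IDs -> logits over a fixed delta vocabulary."""
          def __init__(self, vocab_size=50_000, embed_dim=64, hidden_dim=128):
              super().__init__()
              self.embed = nn.Embedding(vocab_size, embed_dim)  # each delta is treated as a class ID
              self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
              self.head = nn.Linear(hidden_dim, vocab_size)     # predict the next delta's class

          def forward(self, delta_ids):                         # delta_ids: (batch, history_len)
              out, _ = self.lstm(self.embed(delta_ids))
              return self.head(out[:, -1, :])                   # logits for the next delta

      model = DeltaLSTM()
      history = torch.randint(0, 50_000, (1, 3))                # e.g., the last three deltas
      logits = model(history)                                   # shape: (1, 50000)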