Week 9.a CS7670 10/31 2022
https://naizhengtan.github.io/22fall/

1. backgrounds: LSTM & AMD
2. this paper

----

* Q: How's your task going?

* Commit your code to the github repo:
  (https://github.com/NEU-CS7670-labs/cs7670-22fall)

* Q: How do you like the paper?
  Give me one reason why you like it and one why you don't.

1. backgrounds

* CPU hardware prefetcher [see handout]

* AMD NN prefetcher

* LSTM: Long Short-Term Memory
  -- LSTM is an RNN with memory cells.
  -- each memory cell contains an internal state and a number of
     multiplicative gates that determine whether
     (i) a given input should impact the internal state (the input gate),
     (ii) the internal state should be flushed to 0 (the forget gate), and
     (iii) the internal state of a given neuron should be allowed to impact
           the cell's output (the output gate).
     [if interested, read: https://d2l.ai/chapter_recurrent-modern/lstm.html]
     [see Sketch 1 at the end of these notes]

  -- inference:

          y_t         y_t+1       y_t+2
           ^            ^           ^
           |            |           |
        +------+     +------+    +------+
        |      | ->  |      | -> |      |->
        | cell |     | cell |    | cell |
        |      | ->  |      | -> |      |->
        +------+     +------+    +------+
           ^            ^           ^
           |            |           |
          x_t         x_t+1       x_t+2

2. this paper

* Q: a claim, "Since memory accesses have an underlying grammar similar to ..."
  WHY? Do you agree with this claim?

* Q: what are the challenges? and why?
  (1) model size
  (2) large traces
  (3) real-time inference
  (4) retraining the model online
  Aren't (1) and (3) the same thing?

* Q: How does this work address these challenges?
  for (1), compression (or, really, input/output encoding)
  for (2), ?
  for (3), encoding => small networks
  for (4), ?

* core idea: a naive compression
  Q: what is that?

* [read Fig1]
  given a series of inputs X, for lag=k, the autocorrelation coefficient r_k
  is calculated as: [write this on the board]

     r_k = \sum_i (X_i - X_{avg}) (X_{i+k} - X_{avg}) / \sum_i (X_i - X_{avg})^2

  Q: what is X_i here?
  [see Sketch 2 at the end of these notes]

* Q: what is the problem?

* borrowed from "Learning Memory Access Patterns"
  [https://arxiv.org/pdf/1803.02329.pdf]

  A) Training data set: [read S3.2]
     run programs => memory trace
     addresses (in trace) =[ordinal encoding]=> integers
     integers => deltas

     Example:
       address set: {0x123, 0x233, 0x234, 0x456}
       trace [0x123, 0x234, 0x233, 0x456] => [1,3,2,4] => [0,2,-1,2]

     [see Sketch 3 at the end of these notes]

  B) model's inputs/outputs:
     Q: what are the inputs/outputs of the model? [read S4.2]

       last three predictions (?) -->[LSTM]--> one of 50K deltas

     -- calculate the numbers for Table1
     [see Sketch 4 at the end of these notes]

* DISCUSSION: memory prefetcher vs. block prefetcher
  Q: which is harder? and why?
  Q: do they share the same abstract problem?
  Q: do they share the same challenges?
  Q: which component implements a memory prefetcher? a block prefetcher?
  Q: how about concurrency?

* Q: the evaluation can be improved
  -- better captions
  -- the evaluation leaves out inference time
  -- what're the problems of Table1?
     -- missing major points
  -- what're the problems of Fig3?
  -- what're the problems of Fig4?
  -- what're the problems of Fig5?
     -- use vector figures instead of raster figures
  -- what're the problems of Fig6?
     -- explain (or remind readers of) the experiment setup
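----

Code sketches referenced above (illustrative only; the names, sizes, and layer
choices below are placeholders, not taken from the paper):

* Sketch 1: a single LSTM cell step plus an unrolled inference loop in plain
  Python/numpy, to make the three gates in the backgrounds section concrete.
  The weight containers W, U, b and the function names are placeholders.

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def lstm_cell_step(x, h_prev, c_prev, W, U, b):
          """One LSTM time step. W, U, b are dicts keyed by gate name:
          'i' (input), 'f' (forget), 'o' (output), 'g' (candidate update)."""
          i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])  # input gate: should this input impact the internal state?
          f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])  # forget gate: should the internal state be flushed toward 0?
          o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])  # output gate: should the internal state impact the output?
          g = np.tanh(W['g'] @ x + U['g'] @ h_prev + b['g'])  # candidate update
          c = f * c_prev + i * g                              # new internal (cell) state
          h = o * np.tanh(c)                                  # cell output / hidden state (the y_t in the diagram)
          return h, c

      def lstm_inference(xs, h0, c0, W, U, b):
          """Unrolled inference, mirroring the cell -> cell -> cell diagram."""
          h, c, ys = h0, c0, []
          for x in xs:                  # x_t, x_{t+1}, x_{t+2}, ...
              h, c = lstm_cell_step(x, h, c, W, U, b)
              ys.append(h)              # y_t, y_{t+1}, y_{t+2}, ...
          return ys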
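* Sketch 2: a small helper computing the autocorrelation coefficient r_k
  exactly as written in the Fig1 discussion above; the function name is a
  placeholder.

      import numpy as np

      def autocorr(X, k):
          """r_k = sum_i (X_i - X_avg)(X_{i+k} - X_avg) / sum_i (X_i - X_avg)^2,
          for 1 <= k < len(X)."""
          X = np.asarray(X, dtype=float)
          d = X - X.mean()
          return float(np.sum(d[:-k] * d[k:]) / np.sum(d * d))

      # For a periodic series, r_k at the period length approaches 1 as the
      # series grows, e.g.:
      #   autocorr([0, 2, -1, 2] * 100, k=4)   # ~0.99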
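* Sketch 3: the ordinal-then-delta encoding from A), reproducing the example
  above. Assigning delta 0 to the first access (which has no predecessor) is
  an assumption here; the paper may handle the first element differently.

      def ordinal_encode(trace):
          """Map each distinct address to its 1-based rank in the sorted address set."""
          rank = {addr: i + 1 for i, addr in enumerate(sorted(set(trace)))}
          return [rank[a] for a in trace]

      def to_deltas(ordinals):
          """Successive differences; the first access has no predecessor, so its delta is 0."""
          return [0] + [b - a for a, b in zip(ordinals, ordinals[1:])]

      trace = [0x123, 0x234, 0x233, 0x456]
      print(ordinal_encode(trace))             # [1, 3, 2, 4]
      print(to_deltas(ordinal_encode(trace)))  # [0, 2, -1, 2]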
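* Sketch 4: the rough input/output shape from B) as a PyTorch module: a short
  history of delta IDs in, logits over a ~50K-delta vocabulary out. The
  embedding size, hidden size, and single-layer LSTM are assumptions for
  illustration, not the paper's architecture.

      import torch
      import torch.nn as nn

      class DeltaLSTM(nn.Module):
          """History of recent delta IDs -> logits over a fixed delta vocabulary."""
          def __init__(self, vocab_size=50_000, embed_dim=64, hidden_dim=128):
              super().__init__()
              self.embed = nn.Embedding(vocab_size, embed_dim)  # each delta is treated as a class ID
              self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
              self.head = nn.Linear(hidden_dim, vocab_size)     # predict the next delta's class

          def forward(self, delta_ids):                         # delta_ids: (batch, history_len)
              out, _ = self.lstm(self.embed(delta_ids))
              return self.head(out[:, -1, :])                   # logits for the next delta

      model = DeltaLSTM()
      history = torch.randint(0, 50_000, (1, 3))                # e.g., the last three deltas
      logits = model(history)                                   # shape: (1, 50000)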