Week 4.b CS7670 09/28 2022
https://naizhengtan.github.io/22fall/

1. distributed (ML) systems
2. TensorFlow 1.0
3. a glance at PyTorch
-----

[Ask what they think of the TF paper]
[say disclaimer]

1. distributed ML systems

* classical distributed computation
  -- SQL queries and other types of queries: searching for employees who spend >40 hrs a week
  -- batch processing: counting words, PageRank
  -- graph analytics: shortest path
  -- ...

  Q: As for ML...
     ...if a model has 1 trillion (10^12) parameters and is shared by many distributed workers (processes), how will you train it?
     [GPT-3, 175 billion parameters, 800GB]

* Recall the DL training loop:

      l = loss(net(x), y)
      l.backward()
      with torch.no_grad():          # plain SGD update, applied in place
          for p in net.parameters():
              p += -0.01 * p.grad

* Q: isn't ML training just a simple form of computation?
     Why not use traditional distributed systems, like Spark and Hadoop, for ML?

  Batch dataflow systems focus on tolerating failures
  => fast re-execution
  => immutable data
  => expensive to update ML models

* a brief and inaccurate history:
  -- PS (OSDI 2014)
  -- TF 1.0 (OSDI 2016)
  -- PyTorch (NeurIPS 2019)
     [see reviewer comments here: https://proceedings.neurips.cc/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Reviews.html]

* PS <-> TF 1.0: ? [see TF paper]
* TF 1.0 <-> PyTorch and TF 2.0: static graph vs. dynamic graph

* background: parameter server (PS)

  Some interesting observations:
  -- mutable state is crucial for ML
  -- flexible consistency models (async training) [Is this true?]

  PS is an architecture: [see handout, fig1]
  it has a server group that maintains global parameters,
  and multiple worker groups that do computation and use the server group to sync parameters.

  PS interface: Push & Pull

  core idea: stateless workers + stateful PS
  [see handout, fig2]

  [skipped]

  DISCUSSION: community => MLSys

2. TensorFlow: A System for Large-Scale Machine Learning

* background question: what are the design goals of TensorFlow?

* intro: the third paragraph is dense
  [read the first two and the last sentences]

* Q: why is PS inadequate?
  -- new operators (PS was written in C++)
  -- RNN, RL, GAN
  -- new optimization algorithms (such as Adam)

* Q: how would you implement Adam on a PS?
  [see handout, fig3]
  pull and push \theta_{t-1}, m_{t-1}, v_{t-1}
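  Here is a minimal sketch of one worker-side Adam step against a key-value PS, assuming a hypothetical client object `ps` with pull(keys) and push(keys, values) methods (the names mirror the paper's Push & Pull interface, not a real library) and typical Adam hyperparameter defaults. The point: the worker must round-trip the optimizer state m and v in addition to \theta, or the server itself must be extended to run Adam's update rule.

      # sketch only: one Adam step on a worker, via a hypothetical PS client `ps`
      import numpy as np

      beta1, beta2, lr, eps = 0.9, 0.999, 1e-3, 1e-8

      def adam_step_on_worker(ps, grad, t):
          # pull the parameters AND the optimizer state -- 3x the traffic of plain SGD
          theta, m, v = ps.pull(["theta", "m", "v"])
          m = beta1 * m + (1 - beta1) * grad
          v = beta2 * v + (1 - beta2) * grad ** 2
          m_hat = m / (1 - beta1 ** t)          # bias correction, t = step count
          v_hat = v / (1 - beta2 ** t)
          theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
          # push everything back; with asynchronous workers, m and v race as well
          ps.push(["theta", "m", "v"], [theta, m, v])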
* How does TF address these drawbacks?
  "dataflow-based programming abstraction"
  + primitive operators (benefits autograd; we will see this in PyTorch later)
  + deferred execution (later proved to be...)

    DISCUSSION: define-and-run vs. define-by-run
    (See S2 of "DyNet: The Dynamic Neural Network Toolkit", https://arxiv.org/pdf/1701.03980.pdf;
     a small code contrast between the two styles appears at the end of these notes)

  + common abstraction for heterogeneous accelerators (later: DL compilation)

* core abstraction: dataflow graph

  -- dataflow graph
     vertices: operations
     edges: tensors
     [skipped]

  -- tensor: an n-dimensional array of int/float/string (!)
     [cheng: what can we do with string?]

  -- Q: Why do they want tensors to be dense?
     S2.1 says, "This decision ensures that the lowest levels of the system have simple implementations for memory allocation and serialization, thus reducing the framework overhead. Tensors also enable other optimizations for memory management and communication, such as RDMA and direct GPU-to-GPU transfer."

* TF differs from traditional dataflow systems in two ways:
  1. supporting multiple concurrent executions on overlapping subgraphs of the overall graph
  2. having mutable state on each vertex that can be shared between different executions of the graph

* benefits over PS:
  -- can run arbitrary subgraphs on the machines that host shared model parameters
  -- can experiment with different optimization algorithms, consistency schemes, and parallelization strategies (flexibility)

* "Partial and concurrent execution is responsible for much of TF's flexibility."
  and "distributed executions":
  -- data parallelism
  -- model parallelism
  -- pipeline parallelism

* TensorFlow's fault tolerance: checkpointing

  DISCUSSION: customization vs. standardization (Makimoto's wave)

3. a glance at PyTorch (2019)

* four trends:
  (1) array-based programming
  (2) automatic differentiation
  (3) open-source Python ecosystem
  (4) hardware accelerators

* four design principles
  [see handout, fig4]
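  To connect trend (2) with the earlier define-and-run vs. define-by-run discussion, here is a minimal sketch computing y = x^2 and dy/dx in both styles. It assumes a TensorFlow 2.x install (where the 1.0-style session API lives under tf.compat.v1) and a standard PyTorch install.

      import tensorflow as tf
      import torch

      # define-and-run (TF 1.0 style): build a static dataflow graph, then execute it
      tf.compat.v1.disable_eager_execution()
      x = tf.compat.v1.placeholder(tf.float32, shape=())   # symbolic node, no value yet
      y = x * x
      dy_dx = tf.compat.v1.gradients(y, x)[0]               # the gradient is itself a graph node
      with tf.compat.v1.Session() as sess:
          print(sess.run(dy_dx, feed_dict={x: 3.0}))        # 6.0; values exist only at run time

      # define-by-run (PyTorch): the graph is recorded while ordinary Python executes
      x = torch.tensor(3.0, requires_grad=True)
      y = x * x                                             # runs eagerly, y is a concrete value
      y.backward()                                          # autograd walks the recorded operations
      print(x.grad)                                         # tensor(6.)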