Week 02a CS7670 09/09/2025
https://naizhengtan.github.io/25fall/

□ 0. from last time
□ 1. Research problem
□ 2. Research topics
□ 3. Training parallelization
□ 4. Training correctness & anomalies (brief)
□ 5. Discussion & your opinion?

----

0. recap

* too ML? will be more systemish...
  ...you can either focus on models or on systems

* interesting survey results (see details at the end)

  1. fame of open-source models
     DeepSeek (95.5%) > Qwen (63.6%) > BLOOM (50%) > BitNet (13.6%)
     (see the corresponding papers in the References section)
     * Llama would be on par with DeepSeek (if I have to guess)
     Q: what other models do people know of?

  2. most-widely-known term vs. least-known term
     DeepSeek vs. BitNet

  3. top-5 technical terms
     - loss function (90.9%)
     - back-propagation (90.9%)
     - dropout (81.8%)
     - KV Cache (81.8%)
     - regularization (77.3%)

     Q: What is a loss function? What are we really doing while training LLMs?
        [ask students]
        -- training is solving an optimization problem
        -- an LLM is a compression of the Internet
        -- so, how good is such a compression? [see figs]

     Q: What is dropout?
        -- [explain in figs]
        -- dropout attacks
        -- differs from classic systems: nondeterminism

  4. bottom-5 technical terms
     - speculative decoding (27.3%)
     - RoPE (18.2%)
     - SwiGLU (18.2%)
     - NCCL (18.2%)
     - PD-separation (18.2%)

* logistics

  a. communication
     us-to-you:
     -- homepage, announcements: check these regularly
     -- your NEU email (seldom)
     you-to-us:
     -- HotCRP: where you review papers and hold discussions
     -- my email, for admin/sensitive matters

  b. components of the course:
     -- my lectures
     -- guest lectures
     -- a final writeup: opinion (can team up)

  c.
lectures

     [take a look at the schedule]

     * my lectures
       -- paper-oriented; you must submit your review before the class
       -- attendance: no roll call, but...I will randomly pick students to answer questions (lottery)
       -- notes will be published, but they will be hard to understand if you miss the lecture
       -- asking questions in class is encouraged

     * guest lectures:
       -- attending is required
       -- asking questions is required (more later, in the policy)

  d. final writeup:
     -- an opinion paper
     -- students who share the same opinion are encouraged to team up

  e. final grade
     -- policy: participation (80%)
        * paper reviews (10 papers): 40%
        * attending in person: 20% (-2% for each absence)
        * questions to guest speakers (2 questions): 20%
     -- writeup (20%)
        * evaluated by quality

1. Research problem:

- before LLMs, training was "easy" (from a systems perspective)...
  ...by easy, I mean that (a) graduate students could manage it and (b) with modest resources...
  ...which is not possible now.

- Q: Assume infinite memory and infinitely fast GPUs; how would you train an LLM?
  [ask students]

- Q: what changed? [ask students]
  * not much changed in the procedure
  * despite algorithmic improvements, those sit below the abstraction ML engineers work with

- root cause: scaling laws
  * Q: what was ML research like before? [ask students]
  * OpenAI and DeepMind (and many others)
  * Aside: test-time scaling & bit-scaling (Furu's point)

- Core research problem: train the model...
  ...efficiently (fewer GPUs and less time) and...
  ...reliably (recovering from failures and without "correctness" problems)

  (A) efficient training
      * Q: what will harm training efficiency? [ask students]

  (B) reliable training
      * Q: what will harm training reliability? [ask students]

2.
Research topics

- Training parallelization [the most organized topic; more about this today]
  (one guest lecture)

- Training correctness and performance anomalies
  * Fault tolerance (one guest lecture)
  * Anomaly detection (one guest lecture)
  * Formal verification (one guest lecture)

- Post-training methods
  * RLHF and RLVR (one guest lecture)
  * Alignment and safety fine-tuning
  * Adaptive fine-tuning (e.g., LoRA)
  * Compression and distillation

- Real-world training experiences

- Others
  * Communication optimization (one guest lecture)
  * Memory optimization
  * Resource management (scheduling, allocation, fairness, etc.)
  * Non-traditional training paradigms
  * Heterogeneous training
  * Elastic training
  * Asynchronous training
  * Federated training
  * Green training
  * Training under constraints (old GPUs, low connectivity, etc.)

3. Training parallelization

- Research problem: the model is too big to fit into one GPU
  Q: what is too big?
     model parameters, gradients, activations, optimizer state

- Early days (MLSys): FlexFlow and Tofu
  - [anecdote: Minjie's story, 2017]

- Data parallelism
  - ZeRO: stages 1--3
  - PyTorch FSDP (= ZeRO stage 3)
    [one of the authors, Chien-Chin Huang, is our guest speaker]

  Q: difference between FSDP and Tensor parallelism?
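  One way to see the difference is a toy NumPy sketch of my own (2 simulated workers, no real communication; this is not how either system is implemented). FSDP/ZeRO-3 shards each weight across workers and all-gathers the full weight just-in-time for the layer's compute, which every worker then runs on its own batch; tensor parallelism keeps the weight permanently split, and each worker computes a partial output for the same batch:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))   # one layer's weight ("too big for one GPU" in spirit)
x = rng.standard_normal((2, 4))   # an input batch

# --- FSDP / ZeRO-3 style: shard parameters, all-gather before compute ---
# Each of the 2 workers permanently stores only half of W's rows.
shards = np.split(W, 2, axis=0)           # worker i holds shards[i]
W_full = np.concatenate(shards, axis=0)   # "all-gather": rebuild full W just-in-time
y_fsdp = x @ W_full.T                     # every worker computes on its OWN batch
# (after the layer runs, W_full is freed; only the local shard stays resident)

# --- Tensor parallelism (column-split style) ---
# Each worker permanently holds half of W and computes a PARTIAL output
# for the SAME batch; concatenating the partials gives the full output.
parts = [x @ shard.T for shard in shards]   # worker i computes its output slice
y_tp = np.concatenate(parts, axis=1)        # combine partial outputs

assert np.allclose(y_fsdp, y_tp)  # same math, different sharding of the compute
```

  Roughly: FSDP communicates parameters (all-gather) and gradients (reduce-scatter) but runs the whole model on every worker; tensor parallelism communicates activations and never materializes the full weight anywhere.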
  [ask students]

- Tensor parallelism
  - Megatron-LM

- Pipeline parallelism
  * PipeDream: Fast and Efficient Pipeline Parallel DNN Training
    - https://arxiv.org/abs/1806.03377
    - from Microsoft, CMU, Stanford
    - debut: Fri, 8 Jun 2018
    - published at SOSP'19, October 27 [first talk]
  * GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
    - https://arxiv.org/abs/1811.06965
    - from Google
    - debut: Fri, 16 Nov 2018
    - published at NeurIPS 2019, Dec 8th
  - [anecdote: me at ByteDance, 2020]

- New parallelism techniques
  - expert parallelism (GShard, Switch Transformer)
  - sequence parallelism (Megatron-LM v2)
  - context parallelism (DeepSpeed-Ulysses, RingAttention)

- Combining all of the above
  - Alpa
  - nnScaler
  - 4D parallelism

[will continue from here next time]

Survey results
---
DeepSeek                    95.5%
loss function               90.9%
back-propagation            90.9%
dropout                     81.8%
KV Cache                    81.8%
regularization              77.3%
LoRA                        77.3%
RAG                         77.3%
SGD                         72.7%
vLLM                        72.7%
Adam (not a person's name)  63.6%
Qwen                        63.6%
MoE                         59.1%
FlashAttention              59.1%
Beam Search                 54.5%
BLOOM                       50%
Top-p sampling              50%
QLoRA                       36.4%
Mamba                       36.4%
3D parallelism              36.4%
Grouped-Query Attention     31.8%
pre-norm                    31.8%
ZeRO                        31.8%
PEFT                        27.3%
speculative decoding        27.3%
RoPE                        18.2%
SwiGLU                      18.2%
NCCL                        18.2%
PD-separation               18.2%
BitNet                      13.6%

References
----
- [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://arxiv.org/pdf/2501.12948)
- [DeepSeek-V3 Technical Report](https://arxiv.org/pdf/2412.19437)
- [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/pdf/2402.03300)
- [Qwen2.5 Technical Report](https://arxiv.org/pdf/2412.15115)
- [Qwen3 Technical Report](https://arxiv.org/pdf/2505.09388)
- [BLOOM: A 176B-Parameter Open-Access Multilingual Language Model](https://arxiv.org/pdf/2211.05100)
- [BitNet b1.58 2B4T Technical Report](https://arxiv.org/pdf/2504.12285)
- [Dropout Attacks](https://naizhengtan.github.io/doc/papers/dropout24yuan.pdf)
- [The First Law of Complexodynamics](https://scottaaronson.blog/?p=762)
- FlexFlow, [Beyond data and model parallelism for deep neural networks](https://cs.stanford.edu/~zhihao/papers/sysml19a.pdf)
- Tofu, [Supporting very large models using automatic dataflow graph partitioning](https://dl.acm.org/doi/abs/10.1145/3302424.3303953)
- [Scaling Laws for Neural Language Models](https://arxiv.org/pdf/2001.08361) (OpenAI, 2020)
- [Training Compute-Optimal Large Language Models](https://arxiv.org/pdf/2203.15556) (DeepMind, 2022, Chinchilla)
- [Understanding Stragglers in Large Model Training Using What-if Analysis](https://www.usenix.org/system/files/osdi25-lin-jinkun.pdf)
- [Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints](https://dl.acm.org/doi/pdf/10.1145/3600006.3613145)
- [Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks](https://www.usenix.org/system/files/osdi25-jiang.pdf)
- [TrainVerify: Equivalence-Based Verification for Distributed LLM Training](https://www.arxiv.org/pdf/2506.15961)
- [TTrace: Lightweight Error Checking and Diagnosis for Distributed Training](https://www.arxiv.org/abs/2506.09280)
- [Verify Distributed Deep Learning Model Implementation Refinement with Iterative Relation Inference](https://www.arxiv.org/abs/2508.09505)
- [Verifying Semantic Equivalence of Large Models with Equality Saturation](https://changlousys.github.io/paper/aerify-euromlsys25.pdf)
- [Training language models to follow instructions with human feedback](https://arxiv.org/pdf/2203.02155)
- [PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel](https://arxiv.org/pdf/2304.11277)
- [nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training](https://www.usenix.org/system/files/osdi24-lin-zhiqi.pdf)
- [WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training](https://www.usenix.org/system/files/osdi25-wang-zheng.pdf) (OSDI'25)
- [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/pdf/1910.02054)
- [Ring Attention with Blockwise Transformers for Near-Infinite Context](https://arxiv.org/pdf/2310.01889)
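
Addendum (to the pipeline-parallelism discussion above): the pipeline "bubble" can be made concrete with a tiny tick-by-tick simulation of a GPipe-style forward schedule. This is my own illustration under simplifying assumptions (forward pass only, every stage takes one tick per micro-batch), not code from GPipe:

```python
# GPipe-style micro-batch pipelining: stage s processes micro-batch m at
# tick s + m, so S stages finish M micro-batches in S + M - 1 ticks
# instead of S * M sequential steps.
def schedule(num_stages, num_microbatches):
    ticks = num_stages + num_microbatches - 1
    grid = []  # grid[t] = list of (stage, microbatch) pairs active at tick t
    for t in range(ticks):
        active = [(s, t - s) for s in range(num_stages)
                  if 0 <= t - s < num_microbatches]
        grid.append(active)
    return grid

sched = schedule(num_stages=4, num_microbatches=8)
print(len(sched))                      # 11 ticks vs 4*8 = 32 sequential steps
busy = sum(len(a) for a in sched)      # 32 stage-ticks of useful work
bubble = 1 - busy / (4 * len(sched))   # idle fraction = (S-1)/(S+M-1) = 3/11
print(round(bubble, 3))                # 0.273
```

More micro-batches shrink the bubble, which is why GPipe splits each mini-batch; PipeDream instead interleaves forward and backward work (1F1B) to keep stages busy.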