Week 02a CS7670 09/09/2025
https://naizhengtan.github.io/25fall/

□ 0. from last time
□ 1. Research problem
□ 2. Research topics
□ 3. Training parallelization
□ 4. Training correctness & anomalies (brief)
□ 5. Discussion & your opinion?

----

0. recap

* too ML? will be more systemish...
  ...you can either focus on models or on systems

* interesting survey results (see details at the end)

  1. fame of open-source models
     DeepSeek (95.5%) > Qwen (63.6%) > BLOOM (50%) > BitNet (13.6%)
     (see the corresponding papers in the References section)
     * Llama would be on par with DeepSeek (if I have to guess)
     Q: what other models do people know of?

  2. most-widely-known term vs. least-known term
     DeepSeek vs. BitNet

  3. top-5 technical terms
     - loss function (90.9%)
     - back-propagation (90.9%)
     - dropout (81.8%)
     - KV Cache (81.8%)
     - regularization (77.3%)

     Q: What is a loss function? What are we really doing while training LLMs?
        [ask students]
        -- training is solving an optimization problem
        -- an LLM is a compression of the Internet
        -- so, how good is such a compression? [see figs]

     Q: What is dropout?
        -- [explain in figs]
        -- dropout attacks
        -- differs from classic systems: nondeterminism

  4. bottom-5 technical terms
     - speculative decoding (27.3%)
     - RoPE (18.2%)
     - SwiGLU (18.2%)
     - NCCL (18.2%)
     - PD-separation (18.2%)

* logistics

  a. communication
     us-to-you:
     -- homepage, announcements: check these regularly
     -- your NEU email (seldom)
     you-to-us:
     -- HotCRP: where you review papers and hold discussions
     -- my email, for admin/sensitive matters

  b. components of the course:
     -- my lectures
     -- guest lectures
     -- a final writeup: opinion (can team up)

  c.
lectures

     [take a look at the schedule]

     * my lectures
       -- paper-oriented; you must submit your review before the class
       -- attendance: no roll call, but...I will randomly pick students to answer questions (lottery)
       -- notes will be published, but they will be hard to understand if you miss the lecture
       -- asking questions in class is encouraged

     * guest lectures:
       -- attending is required
       -- asking questions is required (more later, in the policy)

  d. final writeup:
     -- an opinion paper
     -- students who share the same opinion are encouraged to team up

  e. final grade
     -- policy: participation (80%)
        * paper reviews (10 papers): 40%
        * attending in person: 20% (-2% for each absence)
        * questions to guest speakers (2 questions): 20%
     -- writeup (20%)
        * evaluated by quality

1. Research problem:

- before LLMs, training was "easy" (from a systems perspective)...
  ...by easy, I mean that (a) graduate students could manage it and (b) with modest resources...
  ...which is not possible now.

- Q: Assume infinite memory and infinitely fast GPUs; how would you train an LLM?
  [ask students]

- Q: what changed? [ask students]
  * not much changed in the procedure
  * despite algorithmic improvements, those sit below the abstraction ML engineers work with

- root cause: scaling laws
  * Q: what was ML research like before? [ask students]
  * OpenAI and DeepMind (and many others)
  * Aside: test-time scaling & bit-scaling (Furu's point)

- Core research problem: train the model...
  ...efficiently (fewer GPUs and less time) and...
  ...reliably (recovering from failures and without "correctness" problems)

  (A) efficient training
      * Q: what will harm training efficiency? [ask students]

  (B) reliable training
      * Q: what will harm training reliability? [ask students]

2.
Research topics

- Training parallelization [the most organized topic; more about this today]
  (one guest lecture)

- Training correctness and performance anomalies
  * Fault tolerance (one guest lecture)
  * Anomaly detection (one guest lecture)
  * Formal verification (one guest lecture)

- Post-training methods
  * RLHF and RLVR (one guest lecture)
  * Alignment and safety fine-tuning
  * Adaptive fine-tuning (e.g., LoRA)
  * Compression and distillation

- Real-world training experiences

- Others
  * Communication optimization (one guest lecture)
  * Memory optimization
  * Resource management (scheduling, allocation, fairness, etc.)
  * Non-traditional training paradigms
  * Heterogeneous training
  * Elastic training
  * Asynchronous training
  * Federated training
  * Green training
  * Training under constraints (old GPUs, low connectivity, etc.)

3. Training parallelization

- Research problem: the model is too big to fit into one GPU
  Q: what is too big?
     model parameters, gradients, activations, optimizer state

- Early days (MLSys): FlexFlow and Tofu
  - [anecdote: Minjie's story, 2017]

- Data parallelism
  - ZeRO: stages 1--3
  - PyTorch FSDP (= ZeRO stage 3)
    [one of the authors, Chien-Chin Huang, is our guest speaker]

  Q: difference between FSDP and Tensor parallelism?
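  One way to see the difference is a toy NumPy sketch of my own (2 simulated workers, no real communication; this is not how either system is implemented). FSDP/ZeRO-3 shards each weight across workers and all-gathers the full weight just-in-time for the layer's compute, which every worker then runs on its own batch; tensor parallelism keeps the weight permanently split, and each worker computes a partial output for the same batch:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))   # one layer's weight ("too big for one GPU" in spirit)
x = rng.standard_normal((2, 4))   # an input batch

# --- FSDP / ZeRO-3 style: shard parameters, all-gather before compute ---
# Each of the 2 workers permanently stores only half of W's rows.
shards = np.split(W, 2, axis=0)           # worker i holds shards[i]
W_full = np.concatenate(shards, axis=0)   # "all-gather": rebuild full W just-in-time
y_fsdp = x @ W_full.T                     # every worker computes on its OWN batch
# (after the layer runs, W_full is freed; only the local shard stays resident)

# --- Tensor parallelism (column-split style) ---
# Each worker permanently holds half of W and computes a PARTIAL output
# for the SAME batch; concatenating the partials gives the full output.
parts = [x @ shard.T for shard in shards]   # worker i computes its output slice
y_tp = np.concatenate(parts, axis=1)        # combine partial outputs

assert np.allclose(y_fsdp, y_tp)  # same math, different sharding of the compute
```

  Roughly: FSDP communicates parameters (all-gather) and gradients (reduce-scatter) but runs the whole model on every worker; tensor parallelism communicates activations and never materializes the full weight anywhere.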
  [ask students]

- Tensor parallelism
  - Megatron-LM

- Pipeline parallelism
  * PipeDream: Fast and Efficient Pipeline Parallel DNN Training
    - https://arxiv.org/abs/1806.03377
    - from Microsoft, CMU, Stanford
    - debut: Fri, 8 Jun 2018
    - published at SOSP'19, October 27 [first talk]
  * GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
    - https://arxiv.org/abs/1811.06965
    - from Google
    - debut: Fri, 16 Nov 2018
    - published at NeurIPS 2019, Dec 8th
  - [anecdote: me at ByteDance, 2020]

- New parallelism techniques
  - expert parallelism (GShard, Switch Transformer)
  - sequence parallelism (Megatron-LM v2)
  - context parallelism (DeepSpeed-Ulysses, RingAttention)

- Combining all of the above
  - Alpa
  - nnScaler
  - 4D parallelism

[will continue from here next time]

Survey results
---
DeepSeek                    95.5%
loss function               90.9%
back-propagation            90.9%
dropout                     81.8%
KV Cache                    81.8%
regularization              77.3%
LoRA                        77.3%
RAG                         77.3%
SGD                         72.7%
vLLM                        72.7%
Adam (not a person's name)  63.6%
Qwen                        63.6%
MoE                         59.1%
FlashAttention              59.1%
Beam Search                 54.5%
BLOOM                       50%
Top-p sampling              50%
QLoRA                       36.4%
Mamba                       36.4%
3D parallelism              36.4%
Grouped-Query Attention     31.8%
pre-norm                    31.8%
ZeRO                        31.8%
PEFT                        27.3%
speculative decoding        27.3%
RoPE                        18.2%
SwiGLU                      18.2%
NCCL                        18.2%
PD-separation               18.2%
BitNet                      13.6%

References
----
- [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://arxiv.org/pdf/2501.12948)
- [DeepSeek-V3 Technical Report](https://arxiv.org/pdf/2412.19437)
- [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/pdf/2402.03300)
- [Qwen2.5 Technical Report](https://arxiv.org/pdf/2412.15115)
- [Qwen3 Technical Report](https://arxiv.org/pdf/2505.09388)
- [BLOOM: A 176B-Parameter Open-Access Multilingual Language Model](https://arxiv.org/pdf/2211.05100)
- [BitNet b1.58 2B4T Technical Report](https://arxiv.org/pdf/2504.12285)
- [Dropout Attacks](https://naizhengtan.github.io/doc/papers/dropout24yuan.pdf)
- [The First Law of Complexodynamics](https://scottaaronson.blog/?p=762)
- FlexFlow, [Beyond data and model parallelism for deep neural networks](https://cs.stanford.edu/~zhihao/papers/sysml19a.pdf)
- Tofu, [Supporting very large models using automatic dataflow graph partitioning](https://dl.acm.org/doi/abs/10.1145/3302424.3303953)
- [Scaling Laws for Neural Language Models](https://arxiv.org/pdf/2001.08361) (OpenAI, 2020)
- [Training Compute-Optimal Large Language Models](https://arxiv.org/pdf/2203.15556) (DeepMind, 2022, Chinchilla)
- [Understanding Stragglers in Large Model Training Using What-if Analysis](https://www.usenix.org/system/files/osdi25-lin-jinkun.pdf)
- [Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints](https://dl.acm.org/doi/pdf/10.1145/3600006.3613145)
- [Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks](https://www.usenix.org/system/files/osdi25-jiang.pdf)
- [TrainVerify: Equivalence-Based Verification for Distributed LLM Training](https://www.arxiv.org/pdf/2506.15961)
- [TTrace: Lightweight Error Checking and Diagnosis for Distributed Training](https://www.arxiv.org/abs/2506.09280)
- [Verify Distributed Deep Learning Model Implementation Refinement with Iterative Relation Inference](https://www.arxiv.org/abs/2508.09505)
- [Verifying Semantic Equivalence of Large Models with Equality Saturation](https://changlousys.github.io/paper/aerify-euromlsys25.pdf)
- [Training language models to follow instructions with human feedback](https://arxiv.org/pdf/2203.02155)
- [PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel](https://arxiv.org/pdf/2304.11277)
- [nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training](https://www.usenix.org/system/files/osdi24-lin-zhiqi.pdf)
- [WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training](https://www.usenix.org/system/files/osdi25-wang-zheng.pdf) (OSDI'25)
- [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/pdf/1910.02054)
- [Ring Attention with Blockwise Transformers for Near-Infinite Context](https://arxiv.org/pdf/2310.01889)
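
Addendum (to the pipeline-parallelism discussion above): the pipeline "bubble" can be made concrete with a tiny tick-by-tick simulation of a GPipe-style forward schedule. This is my own illustration under simplifying assumptions (forward pass only, every stage takes one tick per micro-batch), not code from GPipe:

```python
# GPipe-style micro-batch pipelining: stage s processes micro-batch m at
# tick s + m, so S stages finish M micro-batches in S + M - 1 ticks
# instead of S * M sequential steps.
def schedule(num_stages, num_microbatches):
    ticks = num_stages + num_microbatches - 1
    grid = []  # grid[t] = list of (stage, microbatch) pairs active at tick t
    for t in range(ticks):
        active = [(s, t - s) for s in range(num_stages)
                  if 0 <= t - s < num_microbatches]
        grid.append(active)
    return grid

sched = schedule(num_stages=4, num_microbatches=8)
print(len(sched))                      # 11 ticks vs 4*8 = 32 sequential steps
busy = sum(len(a) for a in sched)      # 32 stage-ticks of useful work
bubble = 1 - busy / (4 * len(sched))   # idle fraction = (S-1)/(S+M-1) = 3/11
print(round(bubble, 3))                # 0.273
```

More micro-batches shrink the bubble, which is why GPipe splits each mini-batch; PipeDream instead interleaves forward and backward work (1F1B) to keep stages busy.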