Schedule
- This schedule is subject to change over the course of the semester.
- Readings are to be completed before class.
Guest speaker schedule
Week 1
- Friday (09/05)
- Lecture Introduction
- Attention Is All You Need
Week 2
- Tue (09/09)
- Friday (09/12)
- Lecture Training II
Week 3
- Tue (09/16)
- Lecture Serving I
- vLLM: Efficient Memory Management for Large Language Model Serving with PagedAttention
- Friday (09/19)
- Lecture Serving II
Week 4
- Tue (09/23)
- Friday (09/26)
- Guest lecture Quanlu Zhang (Infinigence)
- RLinf: Reinforcement Learning Infrastructure for Agentic AI
Week 5
- Tue (09/30)
- Lecture Optimization I
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- Friday (10/03)
- Lecture Optimization II
Week 6
- Tue (10/07)
- Paper reading Training
- Understanding Stragglers in Large Model Training Using What-if Analysis (OSDI’25)
- Friday (10/10)
- Guest lecture Jinkun Lin (Cornell)
- Understanding Stragglers in Large Model Training Using What-if Analysis
Week 7
- Tue (10/14)
- Paper reading Fault tolerance
- GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints (SOSP’23)
- Friday (10/17)
- Guest lecture Zhuang Wang (AWS)
- GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints
Week 8
- Tue (10/21)
- Friday (10/24)
- Guest lecture Yuhan Liu (UChicago)
- A Case for the KV Cache Layer: Enabling the Next Phase of Fast Distributed LLM Serving
Week 9
- Tue (10/28)
- Paper reading LLM Verification
- TrainVerify: Equivalence-Based Verification for Distributed LLM Training (SOSP’25)
- Friday (10/31)
- Guest lecture Yunchi Lu (UMich)
- TrainVerify: Equivalence-Based Verification for Distributed LLM Training
Week 10
- Tue (11/04)
- Paper reading Communication
- An Extensible Software Transport Layer for GPU Networking
- Friday (11/07)
- Guest lecture Yang Zhou (UC Davis)
- UCCL: An Extensible Software Transport Layer for GPU Networking
Week 11 (Veterans Day)
- Tue (11/11)
- Veterans Day, no class
- Friday (11/14)
- Guest lecture Yanghua Peng (ByteDance)
- Large-Scale Multimodal LLM Training in Production
Week 12
- Tue (11/18)
- Paper reading Distributed training
- TorchTitan: One-stop PyTorch native solution for production ready LLM pretraining (ICLR’25)
- Friday (11/21)
- Guest lecture Chien-Chin Huang (Meta)
- TorchTitan: a PyTorch Native Platform for Training Foundation Models
Week 13
- Tue (11/25)
- Guest lecture Kaichao You (UCB)
- vLLM: Easy, Fast, and Cheap LLM Serving for Everyone
- Friday (11/28)
- Fall break, no class