Week 10a
CS7670
11/04 2025
https://naizhengtan.github.io/25fall/

□ 1. Introduction
□ 2. Datacenter network
□ 3. RDMA
□ 4. LLM networking
□ 5. UCCL

----

1. introduction

* the field of networking

    USENIX ATC --> OSDI,     FAST,  NSDI
                   +->SOSP          +->SIGCOMM   (ACM version)

* Fundamental problem: data movement
  -- moving data from one machine to another
     (sometimes, stretching the definition a bit, communication
      within a machine as well)

* two major types of networking setups:
  -- Internet or WAN (a lot of unknowns)
     +--> decentralized networking like Bitcoin
  -- Datacenter network (a lot of knowns)
     +--> ML networking
  -- of course, others like wireless and security

* Impossibility: consensus
  ("Two Generals Paradox", Jim Gray in 1978 in
   "Notes on Data Base Operating Systems")

  * if time:
    * red and blue eyes
    * Shared Knowledge vs. Common Knowledge

* Networking 101: the world we're living in

  * to address the data movement problem, one way is packet switching

  * what does a packet look like? say, visiting www.google.com:

      +------------------------------------+
      |   +------------------------------+ |
      | e |    +-----------------------+ | |
      | t |    |     +---------------+ | | |
      | h |    |     |      +------+ | | | |
      | e | ip | tcp | http | data | | | | |
      | r |    |     |      +------+ | | | |
      | n |    |     +---------------+ | | |
      | e |    +-----------------------+ | |
      | t +------------------------------+ |
      +------------------------------------+

  * the connection between packet switching and the layered model

      http     <-> application
      tcp      <-> transport
      ip       <-> routing
      ethernet <-> link&physical

  * Where are the layers implemented?

    * Application layer: implemented in apps (browser, Skype,
      Facebook, etc.).

    * TCP/UDP (transport layer): implemented in your operating
      system’s kernel.

    * IP (routing layer): the end hosts need to participate in the
      routing layer because they insert the source and destination
      addresses. Unlike the top two layers, routers and gateways also
      need to implement and understand the IP protocol because they
      forward data from one network to another.
    * Link&physical layers: implemented in hardware on end hosts and
      routers

  * Fundamental challenges (from a client's perspective):
    -- reliability => packet loss/duplication/reordering
    -- efficiency => congestion control and load balancing


2. Datacenter network (DCN)

* overview
  -- unlike the Internet (and the assumptions networking was originally
     built on), the datacenter environment is highly predictable and
     controllable

* datacenter network topology
  -- Fat-Tree/Clos architecture
     [draw Spine/Leaf/ToR]

* specialized protocols for almost every area. For example, in routing:
  -- ECMP (Equal-Cost Multi-Path)
     [as opposed to BGP on the Internet]

* multipath transport protocols

  In conventional TCP/IP, a single connection (defined by the 4-tuple:
  source IP, source port, destination IP, destination port) binds to one
  path through the network. All packets for that connection follow that
  same path. If the path becomes congested or fails, the connection
  stalls or terminates.

  Multipath transport generalizes this model. It allows endpoints to
  establish multiple subflows (each over potentially distinct interfaces
  or routes) that together form one logical connection. The transport
  layer then coordinates the subflows, providing reliability, ordering,
  and congestion control across all paths.

  Examples:
  -- MPTCP, Multipath TCP
  -- RDMA Multipath (RoCEv2 extensions)


3. RDMA

* What is RDMA?

  Remote Direct Memory Access (RDMA) allows one machine to read or write
  directly to another machine’s memory without involving the remote CPU.
  This mechanism minimizes latency and CPU overhead, which makes it
  critical for high-performance systems such as distributed databases,
  storage, and deep learning clusters.

  The programming interface for RDMA is expressed in terms of verbs: the
  low-level API exposed by RDMA NICs (RNICs) such as Mellanox or Intel
  adapters. “Verbs” are essentially operations you post to the RNIC to
  perform communication or memory actions.
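
  To make "operations you post to the RNIC" concrete, here is a toy
  Python model. It is only a sketch: ToyRNIC, WorkRequest, post_send,
  process, and poll_cq are hypothetical names chosen for illustration,
  not the real verbs API, and the "remote memory" is just a local
  bytearray. The point is the shape of the interaction: the application
  enqueues work requests, the NIC executes them against remote memory
  with no remote-side code running, and the application later polls a
  completion queue.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WorkRequest:
    wr_id: int              # caller-chosen tag, echoed back in the completion
    opcode: str             # "WRITE" or "READ" (toy subset of operations)
    remote_offset: int      # where in the remote memory region to operate
    payload: bytes = b""    # data to place remotely (WRITE only)
    length: int = 0         # number of bytes to fetch (READ only)


class ToyRNIC:
    """Hypothetical stand-in for an RNIC; not a real verbs implementation."""

    def __init__(self, remote_memory: bytearray):
        self.remote_memory = remote_memory   # models the peer's registered memory
        self.send_queue = deque()            # posted but not yet executed requests
        self.completion_queue = deque()      # finished requests, awaiting polling

    def post_send(self, wr: WorkRequest) -> None:
        # Posting only enqueues; the "hardware" does the work later.
        self.send_queue.append(wr)

    def process(self) -> None:
        # Models the NIC draining its queue; no remote CPU code is involved.
        while self.send_queue:
            wr = self.send_queue.popleft()
            if wr.opcode == "WRITE":
                end = wr.remote_offset + len(wr.payload)
                if end > len(self.remote_memory):
                    raise IndexError("write past end of remote region")
                self.remote_memory[wr.remote_offset:end] = wr.payload
                result = b""
            elif wr.opcode == "READ":
                end = wr.remote_offset + wr.length
                result = bytes(self.remote_memory[wr.remote_offset:end])
            else:
                raise ValueError(f"unsupported opcode: {wr.opcode}")
            self.completion_queue.append((wr.wr_id, result))

    def poll_cq(self):
        # Returns the oldest completion, or None if nothing has finished.
        return self.completion_queue.popleft() if self.completion_queue else None


remote = bytearray(16)
nic = ToyRNIC(remote)
nic.post_send(WorkRequest(wr_id=1, opcode="WRITE", remote_offset=4, payload=b"hello"))
nic.post_send(WorkRequest(wr_id=2, opcode="READ", remote_offset=4, length=5))
nic.process()
first = nic.poll_cq()
second = nic.poll_cq()
print(first, second)
```

  Note the split between posting and completion: nothing is known about
  a request's outcome until its entry shows up in the completion queue.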

* history of RDMA

  * InfiniBand: Standardizing RDMA

    In 1999–2000, a consortium of major vendors (IBM, Intel, HP, Dell,
    Compaq, Microsoft, Sun, and Cisco) founded the InfiniBand Trade
    Association (IBTA). InfiniBand standardized:
    -- Zero-copy data transfer
    -- User-level networking (bypassing the kernel)
    -- Remote direct memory access

  * Evolution Beyond HPC: RoCE

    Although InfiniBand dominated HPC clusters, data centers still
    relied on Ethernet. So RoCE (RDMA over Converged Ethernet) emerged:
    -- Introduced by Mellanox (now NVIDIA) around 2010.
    -- Carries RDMA packets over Ethernet frames directly (no TCP/IP
       stack). This describes the original RoCE; RoCEv2 wraps the
       packets in UDP/IP so they can be routed.

* RDMA verbs

  (a) Administrative verbs
      Used to create and configure resources (PDs, CQs, QPs, MRs).

  (b) Work Request verbs
      Actual data operations, such as:
      - Send / Receive
      - RDMA Read / RDMA Write
      - Atomic operations (Compare-and-Swap, Fetch-and-Add)

* RDMA write with immediate

  RDMA Write with Immediate is a variant of the RDMA Write verb that
  sends a small immediate value (typically 32 bits) along with the write
  operation. This immediate value does not reside in the written memory
  region; instead, it is delivered to the remote completion queue (CQ)
  as part of a completion event.

* configurations

  In RDMA, reliability and connection type determine how data is
  delivered between peers. RDMA defines several Queue Pair (QP) types,
  corresponding to different transport modes:

  | Transport Type        | Abbreviation | Reliability | Connection Setup    |
  | --------------------- | ------------ | ----------- | ------------------- |
  | Reliable Connection   | RC           | Reliable    | Connection-oriented |
  | Unreliable Connection | UC           | Unreliable  | Connection-oriented |
  | Unreliable Datagram   | UD           | Unreliable  | Connectionless      |

  * RC => The hardware guarantees reliable, in-order delivery. The RNIC
    handles retransmissions, sequencing, and acknowledgment.
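
  The difference between a plain RDMA Write and RDMA Write with
  Immediate, as described above, can be sketched with a toy model. This
  is an illustration only: ToyReceiver, rdma_write, and
  rdma_write_with_imm are hypothetical names, not real verbs calls, and
  the model omits details of real hardware (for example, that the
  receiver must have a receive request posted for the immediate to
  consume).

```python
from collections import deque


class ToyReceiver:
    """Hypothetical remote side: a memory region plus a completion queue."""

    def __init__(self, size: int):
        self.memory = bytearray(size)   # models the registered memory region
        self.recv_cq = deque()          # models the remote completion queue


def rdma_write(receiver: ToyReceiver, offset: int, data: bytes) -> None:
    # Plain write: bytes land in remote memory and the remote side is
    # not notified in any way.
    receiver.memory[offset:offset + len(data)] = data


def rdma_write_with_imm(receiver: ToyReceiver, offset: int,
                        data: bytes, imm: int) -> None:
    # The immediate is a 32-bit value carried alongside the write.
    if not 0 <= imm < 2**32:
        raise ValueError("immediate must fit in 32 bits")
    receiver.memory[offset:offset + len(data)] = data
    # The immediate is NOT stored in the written region; it shows up
    # only in the completion entry on the remote side.
    receiver.recv_cq.append({"imm_data": imm, "byte_len": len(data)})


rx = ToyReceiver(8)
rdma_write(rx, 0, b"abcd")
completions_after_plain_write = len(rx.recv_cq)
rdma_write_with_imm(rx, 4, b"efgh", imm=7)
completion = rx.recv_cq[0]
print(completions_after_plain_write, completion, bytes(rx.memory))
```

  One common use of this pattern is to let the immediate carry a small
  tag (say, a chunk index) so the receiver learns which data arrived
  from the completion alone, without scanning memory.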

  | Feature           | RC (Reliable Connection)                  | UC (Unreliable Connection) | UD (Unreliable Datagram) |
  | ----------------- | ----------------------------------------- | -------------------------- | ------------------------ |
  | Reliability       | Reliable                                  | Unreliable                 | Unreliable               |
  | Connection        | Connection-oriented                       | Connection-oriented        | Connectionless           |
  | Delivery ordering | In-order                                  | Unordered                  | Unordered                |
  | Retransmission    | Automatic                                 | None                       | None                     |
  | Verbs supported   | All (Send/Recv, RDMA Read/Write, Atomics) | Send/Recv, RDMA Write      | Send/Recv only           |
  | Message size      | Arbitrary                                 | Arbitrary                  | <= MTU                   |


4. LLM networking

* what characterizes LLM networking versus other workloads?
  -- throughput (training)
  -- latency (serving)
  -- highly regular traffic (and known ahead of time)

* collective communication primitives
  -- all-reduce
  -- all-gather
  -- reduce-scatter
  -- all-to-all
  -- others

* problem 1: MoE serving
  -- experts are hosted on different nodes
  -- some experts are much hotter than others
  -- replicate the hot experts

* problem 2: MLT, semi-reliable transmission
  "MLT [96] customizes loss recovery behaviors for ML training to allow
   semi-reliable transmission based on the gradient importance from
   applications"

* problem 3: Inefficient loss recovery
  -- handle packet loss

* problem 4: Heterogeneous NICs


5. UCCL

* yet another case of breaking abstractions for performance

* good part: it still tries to unify things

* the valuable part of the interface design is in the appendix

* two highlighted challenges:
  (1) How to decouple the data and control paths for existing RDMA NICs?
  (2) How can we achieve hardware-level performance with the control
      path running on the CPU?
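

Aside (refers back to section 4): the collective communication
primitives listed there can be pinned down with reference semantics in
a few lines of Python. This is a sketch of what each primitive computes,
not of how any library implements it; "ranks" are modeled as a list of
equal-length Python lists, the reduction is assumed to be a sum, and the
function names are chosen here for illustration.

```python
def all_reduce(bufs):
    # Every rank ends up with the element-wise sum across all ranks.
    total = [sum(col) for col in zip(*bufs)]
    return [list(total) for _ in bufs]


def all_gather(bufs):
    # Every rank ends up with the concatenation of all ranks' buffers.
    gathered = [x for buf in bufs for x in buf]
    return [list(gathered) for _ in bufs]


def reduce_scatter(bufs):
    # Element-wise sum, then rank r keeps only the r-th chunk of the result.
    # Assumes the buffer length is divisible by the number of ranks.
    n = len(bufs)
    total = [sum(col) for col in zip(*bufs)]
    chunk = len(total) // n
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]


def all_to_all(bufs):
    # Rank s sends its d-th chunk to rank d; each rank concatenates what
    # it receives in source-rank order (a transpose of chunks).
    n = len(bufs)
    chunk = len(bufs[0]) // n
    return [
        [x for src in bufs for x in src[dst * chunk:(dst + 1) * chunk]]
        for dst in range(n)
    ]


ranks = [[1, 2], [10, 20]]
print(all_reduce(ranks))       # [[11, 22], [11, 22]]
print(all_gather(ranks))       # [[1, 2, 10, 20], [1, 2, 10, 20]]
print(reduce_scatter(ranks))   # [[11], [22]]
print(all_to_all(ranks))       # [[1, 10], [2, 20]]
```

In this model an all-reduce equals a reduce-scatter followed by an
all-gather, which is one reason those two primitives appear on the list
alongside all-reduce.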