Opinion
This page contains student opinion articles on various topics related to Large Language Model Systems.
Training Tiny Language Models: An Underexplored Regime Where Conventional Wisdom is Incomplete
Authors: Arya Wu, Xilin Wang, Gavin Yang
Description: This paper argues that established training heuristics for large language models (derived from models with hundreds of millions to billions of parameters) don’t transfer well to small-scale LLMs below 100M parameters. The authors demonstrate through experiments with 30M-75M parameter models that learning-rate extrapolations, batch-size scaling rules, and Chinchilla-optimal training durations break down at smaller scales, often non-monotonically. Key findings include: (1) the critical batch size is larger than expected for tiny models, (2) square-root learning-rate scaling provides marginal benefit at best, and (3) linear warmup scaling consistently improves training. The paper also explores optimal training duration, showing that extended training keeps improving tiny models with no saturation observed, but that training to the Chinchilla-optimal token budget is not compute-efficient at this scale.
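As context for the heuristics being stress-tested, here is a minimal sketch of two of them, the square-root learning-rate rule and the Chinchilla-style tokens-per-parameter budget; the constants and function names are illustrative assumptions, not taken from the paper.

```python
import math

# Two of the conventional heuristics the paper re-examines at tiny scale.
# Constants and names here are illustrative assumptions, not from the paper.

def sqrt_scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Square-root learning-rate scaling: lr grows with sqrt(batch / base_batch).
    The paper reports this gives marginal benefit at best below ~100M parameters."""
    return base_lr * math.sqrt(batch / base_batch)

def chinchilla_tokens(n_params: int, tokens_per_param: float = 20.0) -> int:
    """Chinchilla-style rule of thumb of roughly 20 training tokens per parameter."""
    return int(tokens_per_param * n_params)

# Extrapolating the heuristics down to a hypothetical 50M-parameter model.
print(sqrt_scaled_lr(3e-4, base_batch=256, batch=1024))  # 0.0006
print(chinchilla_tokens(50_000_000))                     # 1,000,000,000 tokens
```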
Regarding Project 1
Software, Not Silicon: Why Better Algorithms Beat Bigger GPUs
Authors: Sanchit Ahuja and Harshit Garg
Description: This article examines the evolution of vLLM’s performance across versions 0.2.0 to 0.11.0 and demonstrates that software innovations alone achieved a nearly 2x performance improvement without any hardware upgrades. Using the stabilityai/stablelm-tuned-alpha-7b model on a single A100 GPU, the authors tracked five metrics across versions: average latency per token dropped from 0.29s to 0.13s, average latency per output token fell from 1.56s to 0.78s, and throughput rose from 6.82 req/s to 13.58 req/s. Through changelog analysis, they attribute these gains to algorithmic innovations in memory management (PagedAttention), scheduling improvements, and kernel optimizations. Interestingly, performance actually regressed from v0.4.0 through v0.6.x before recovering in v0.7.0, illustrating that algorithmic improvements require careful system integration. The authors argue that sustainable AI scaling requires treating algorithmic innovation as a first-class citizen alongside silicon advancement, and that the cost dynamics favor software optimization over hardware scaling.
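For readers who want to reproduce this kind of measurement, a minimal sketch of a latency/throughput probe using vLLM’s offline LLM API follows; the prompt set, request count, and sampling settings are assumptions, not the authors’ benchmark harness.

```python
import time
from vllm import LLM, SamplingParams

# Hypothetical probe in the spirit of the article's methodology; prompts,
# request count, and metric definitions are illustrative assumptions.
llm = LLM(model="stabilityai/stablelm-tuned-alpha-7b")
params = SamplingParams(temperature=0.0, max_tokens=128)
prompts = ["Summarize the benefits of paged KV-cache memory."] * 64

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

total_output_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"throughput:               {len(prompts) / elapsed:.2f} req/s")
print(f"latency per output token: {elapsed / total_output_tokens:.4f} s/token")
```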
Regarding Project 2
Batch-invariance: Our Testing and Observations on Current Community Efforts
Authors: Anyu Yang, Junbeom In, Xiaocong Zhang
Description: This article examines the batch invariance feature in vLLM, an LLM inference engine. Batch invariance means that inference produces identical results for the same input regardless of which batch it is served in, which is a building block of overall determinism. The authors replicated and tested the software setup of the initial vLLM batch invariance pull request and found that none of the small models they tested (TinyLlama-1.1B, Qwen2.5-1.5B, Qwen3-0.6B, Qwen3-1.7B, Qwen3-4B) were batch invariant on their GPUs (RTX 3060 and RTX 4060). They argue that current community efforts to achieve batch invariance are unreliable due to naive and insufficient testing methodology. The paper also includes a proof-of-concept batch-invariant matrix multiplication kernel implemented in Triton.
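The property under test can be illustrated without vLLM at all: the toy PyTorch check below runs the same row through a matmul alone and inside a larger batch, then compares the results bitwise, which is the comparison batch invariance requires to succeed. The shapes and dtype are arbitrary choices, and this is not the authors’ test harness.

```python
import torch

# Run the same input alone and inside a larger batch, then compare bitwise.
torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

W = torch.randn(4096, 4096, device=device, dtype=torch.float16)
x = torch.randn(1, 4096, device=device, dtype=torch.float16)
filler = torch.randn(31, 4096, device=device, dtype=torch.float16)

alone = x @ W                         # batch size 1
batched = torch.cat([x, filler]) @ W  # same row, batch size 32

print("bitwise identical across batch sizes:", torch.equal(alone[0], batched[0]))
# On many GPUs this prints False: the BLAS library may pick different kernels
# or reduction orders for different shapes, which batch invariance forbids.
```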
Regarding Project 3
Research as an Imitation Game: When Good Researchers Copy and Great Researchers Steal
Author: Hanhui Wang
Description: An opinion piece arguing that high-impact research doesn’t come only from building entirely new mechanisms, but also from recognizing structural equivalence between disparate domains and transplanting solutions across boundaries of time and discipline. The author distinguishes between “copying” (surface-level adaptation that preserves form while ignoring context) and “stealing” (beginning with a new problem, revealing its abstract structure, and adapting old ideas with an understanding of why they work). Two case studies illustrate this: (1) vLLM’s PagedAttention, which “stole” the paging mechanism from 1960s operating systems to solve GPU memory management for LLM serving, and (2) video generation research, which borrowed from LLM research the core intuition that generation can serve as a forcing function for reasoning. The piece emphasizes that in an era of increasingly complex AI systems, the ability to bridge isolated islands of knowledge is becoming a critical differentiator.
LLMs as Experimental Proxies for AGI: Why Behavioral Equivalence Matters More Than Mechanism
Author: Franc O
Description: This opinion piece argues that Large Language Models demonstrate unprecedented generalization capabilities that make them valuable proxies for studying artificial general intelligence (AGI). The author contends that if a system exhibits the behavioral signatures of general intelligence across sufficiently many domains, it becomes a valid experimental substrate for AGI research, regardless of its underlying mechanism. Key evidence includes emergent capabilities that appear discontinuously with scale (chain-of-thought reasoning, cross-modal understanding, tool use), functional decomposition showing that LLMs implement core cognitive operations through alternative pathways (abstract reasoning, causal inference, meta-learning), and the predictable relationship between model size, data quantity, and capability emergence (Chinchilla scaling laws). The paper argues that LLMs provide instrumental value as imperfect proxies, offering a computational laboratory for testing theories of intelligence and for studying compositional generalization, few-shot learning dynamics, and alignment challenges.
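For reference, the Chinchilla scaling relationship the author leans on is usually written as a parametric loss in model size N and data size D (Hoffmann et al., 2022), with the constants fitted empirically at large scale:

```latex
% Parametric loss underlying the Chinchilla scaling laws; E, A, B, \alpha, \beta
% are empirically fitted constants (with \alpha and \beta reported around 0.3).
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```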
With state, LLMs are increasingly becoming databases - so we need to incorporate catalogs to engineer accountability
Author: Arunit Baidya
Description: This article argues that modern LLM systems have evolved from stateless token predictors into stateful systems that manage state through mechanisms like durable logs, retrieval-based augmentation, and user session histories. The author draws parallels between LLM systems and database systems, noting that LLM systems now store, retrieve, update, index, and manage information across multiple data stores. The key argument is that LLM systems need a “catalog” (similar to database catalog tables in PostgreSQL) to track dependencies between components, artifacts, and their versions over time. This catalog would enable accountability by allowing developers to trace, through a graph representation of dependencies, which artifacts, embedding indexes, and system states influenced a particular response. While acknowledging implementation challenges around compute costs, memory overhead, and security, the author argues that catalogs are necessary given the growing complexity of and reliance on LLM systems, enabling proactive identification of recurring failure patterns rather than reactive debugging.
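A hypothetical sketch of what such a catalog could look like as a data structure is shown below: a small dependency graph keyed by versioned artifacts, with a lineage query that walks the graph to recover everything that influenced a given response. All names and fields are illustrative, not from the article.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Artifact:
    name: str        # e.g. "embedding_index", "base_model"
    version: str     # e.g. "2024-11-01", "v0.7.0"

@dataclass
class Catalog:
    # downstream artifact -> set of upstream artifacts it depends on
    edges: dict[Artifact, set[Artifact]] = field(default_factory=dict)

    def record_dependency(self, downstream: Artifact, upstream: Artifact) -> None:
        self.edges.setdefault(downstream, set()).add(upstream)

    def lineage(self, node: Artifact) -> set[Artifact]:
        """Everything that transitively influenced `node` -- the basis for
        tracing which system state produced a particular response."""
        seen: set[Artifact] = set()
        stack = [node]
        while stack:
            for parent in self.edges.get(stack.pop(), set()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

catalog = Catalog()
response = Artifact("response_0042", "1")
index = Artifact("embedding_index", "2024-11-01")
catalog.record_dependency(response, index)
catalog.record_dependency(response, Artifact("base_model", "qwen3-8b-int8"))
catalog.record_dependency(index, Artifact("doc_store", "snapshot-17"))
print(catalog.lineage(response))
```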
Speed at the Cost of Safety: A Hidden Trade-off in LLM Engines
Author: Shuyi Lin
Description: This article argues that aggressive approximation techniques used to accelerate LLM serving—including quantization, pruning, speculative decoding, and KV cache compression—come with a hidden cost: compromised safety. The author demonstrates through evaluations using Jailbreak Oracle (JO), a tree-search-based evaluation tool, that approximated models exhibit drastically higher vulnerability to unsafe prompts. In a case study comparing Qwen3-8B against its quantized variant, the quantized model showed a catastrophic 6x increase in safety vulnerability scores (from 1400 to 9000), failing under simple structural noise where the original required sophisticated semantic manipulation. The article also questions the claimed efficiency benefits, highlighting the batching bottleneck, speculative decoding paradox, and KV cache limitations that diminish real-world speedups. The author explains that safety mechanisms rely on “sparse” or “long-tail” neurons that are the first casualties of approximation since they’re statistically classified as “unimportant” for average perplexity. The article concludes that the community must pivot from “lossy” approximations toward robust hardware-software co-design to ensure speed does not compromise AI system safety.
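A rough sketch of the kind of comparison described (full-precision versus quantized variant on adversarial prompts) might look like the following; it uses a generic keyword-based refusal check rather than the Jailbreak Oracle tool, and the model ID, prompt placeholder, and scoring are assumptions for illustration only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Placeholder adversarial prompts; a real evaluation would use a red-teaming suite.
MODEL_ID = "Qwen/Qwen3-8B"
adversarial_prompts = ["<adversarial prompt from a red-teaming suite>"]

def refusal_rate(model, tokenizer, prompts) -> float:
    """Crude proxy metric: fraction of prompts the model refuses outright."""
    refusals = 0
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
        text = tokenizer.decode(out[0], skip_special_tokens=True).lower()
        refusals += any(k in text for k in ("i can't", "i cannot", "i won't"))
    return refusals / len(prompts)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
full = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
quant = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=BitsAndBytesConfig(load_in_4bit=True), device_map="auto"
)

print("full-precision refusal rate:", refusal_rate(full, tokenizer, adversarial_prompts))
print("quantized refusal rate:     ", refusal_rate(quant, tokenizer, adversarial_prompts))
```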
Asynchronous RL Training is Structurally Necessary for LLM Post-Training at Scale
Author: Zikai Wang
Description: This article argues that asynchronous reinforcement learning architectures are fundamentally necessary for efficient LLM post-training at scale, particularly for agentic workloads. The author demonstrates that synchronous RLHF training suffers from a critical straggler problem: the generation phase consumes 70-80% of training time, and response lengths follow a long-tailed distribution, forcing all GPUs to wait for the slowest worker. This creates catastrophic resource waste where a single long response blocks the entire cluster. While recent synchronous optimizations like Seer’s context-aware scheduling and VeRL’s over-sampling achieve impressive speedups for text generation, these techniques fail for agentic workloads involving tool calls, API latencies, and code execution, where variance comes from unpredictable external sources rather than the model’s generation process. Asynchronous architectures solve this by decoupling generation from training—rollout workers stream trajectories continuously while training proceeds on available data, achieving 2-10× speedups without waiting for stragglers. The article concludes that the straggler problem is structural and cannot be engineered away through better hardware or scheduling; as LLM post-training moves toward agentic, tool-integrated workloads, asynchronous training will become the default rather than the exception.
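The decoupling argument can be made concrete with a toy producer/consumer sketch: rollout workers push trajectories into a shared queue as they finish, and the trainer consumes whatever is available instead of waiting at a global barrier. Everything here (timings, worker counts, names) is illustrative, not a real RLHF stack.

```python
import queue
import random
import threading
import time

# Shared buffer between rollout workers (producers) and the trainer (consumer).
trajectory_queue: "queue.Queue[tuple[int, float]]" = queue.Queue(maxsize=256)

def rollout_worker(worker_id: int) -> None:
    """Simulate long-tailed generation / tool-call latency, then stream the result."""
    while True:
        rollout_time = random.expovariate(1.0)      # long-tailed duration
        time.sleep(min(rollout_time, 3.0))          # simulate generation + tool calls
        trajectory_queue.put((worker_id, rollout_time))

def trainer(steps: int = 10, batch_size: int = 8) -> None:
    """Train on whatever trajectories have arrived; no global barrier on stragglers."""
    for step in range(steps):
        batch = [trajectory_queue.get() for _ in range(batch_size)]
        time.sleep(0.2)                             # simulate an optimizer step
        print(f"step {step}: trained on {len(batch)} trajectories")

for i in range(16):
    threading.Thread(target=rollout_worker, args=(i,), daemon=True).start()
trainer()
```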