Opinion

This page contains student opinion articles on various topics related to Large Language Model Systems.


Training Tiny Language Models: An Underexplored Regime Where Conventional Wisdom is Incomplete

Authors: Arya Wu, Xilin Wang, Gavin Yang

Description: This paper argues that established training heuristics for large language models (derived from models with hundreds of millions to billions of parameters) do not transfer well to small-scale LLMs below 100M parameters. Through experiments with 30M-75M parameter models, the authors show that learning-rate extrapolations, batch-size scaling rules, and Chinchilla-optimal training durations break down non-monotonically at smaller scales. Key findings include: (1) the critical batch size is larger than expected for tiny models, (2) square-root learning-rate scaling provides marginal benefit at best, and (3) linear warmup scaling consistently improves training. The paper also explores optimal training duration, showing that extended training keeps improving tiny models with no saturation observed, but that Chinchilla-optimal training is not compute-efficient at this scale.
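
To make the heuristics above concrete, here is a minimal Python sketch of how such rules are typically applied when transferring hyperparameters from a reference configuration; the reference values and the exact form of the warmup rule are illustrative assumptions, not taken from the paper.

```python
# Illustrative reference configuration; these values are assumptions, not from the paper.
REF_BATCH_SIZE = 256
REF_LR = 3e-4
REF_WARMUP_STEPS = 2_000

def sqrt_lr_scaling(batch_size: int) -> float:
    """Square-root rule: scale the learning rate with sqrt(batch / reference batch).
    The paper finds this gives marginal benefit at best for tiny models."""
    return REF_LR * (batch_size / REF_BATCH_SIZE) ** 0.5

def linear_warmup_scaling(batch_size: int) -> int:
    """Linear rule: scale warmup length proportionally to batch size
    (one common reading of 'linear warmup scaling', which the paper finds helps)."""
    return int(REF_WARMUP_STEPS * batch_size / REF_BATCH_SIZE)

def chinchilla_optimal_tokens(n_params: int) -> int:
    """Chinchilla heuristic of roughly 20 training tokens per parameter,
    which the paper argues is not compute-efficient below ~100M parameters."""
    return 20 * n_params

if __name__ == "__main__":
    for bs in (64, 256, 1024):
        print(bs, round(sqrt_lr_scaling(bs), 6), linear_warmup_scaling(bs))
    print(chinchilla_optimal_tokens(30_000_000))  # ~600M tokens for a 30M-parameter model
```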

Read the full article

Regarding Project 1


Software, Not Silicon: Why Better Algorithms Beat Bigger GPUs

Authors: Sanchit Ahuja and Harshit Garg

Description: This article examines the evolution of vLLM’s performance across versions 0.2.0 to 0.11.0 and demonstrates that software innovations alone achieved a nearly 2x performance improvement without any hardware upgrades. Using the stabilityai/stablelm-tuned-alpha-7b model on a single A100 GPU, the authors tracked five metrics across versions; among them, average latency per token dropped from 0.29s to 0.13s, average latency per output token fell from 1.56s to 0.78s, and throughput rose from 6.82 req/s to 13.58 req/s. Through changelog analysis, they identify that these gains stem from algorithmic innovations in memory management (PagedAttention), scheduling improvements, and kernel optimizations. Interestingly, performance actually regressed from v0.4.0 through v0.6.x before recovering in v0.7.0, illustrating that algorithmic improvements require careful system integration. The authors argue that sustainable AI scaling requires treating algorithmic innovation as a first-class citizen alongside silicon advancement, and that cost dynamics favor software optimization over hardware scaling.
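
For readers who want to reproduce this kind of measurement, here is a minimal sketch of a benchmark loop in the spirit of the article, using vLLM’s offline API; the prompts, request count, and metric computation are illustrative assumptions rather than the authors’ actual harness.

```python
# Minimal benchmark sketch (illustrative; not the authors' harness).
import time

from vllm import LLM, SamplingParams

llm = LLM(model="stabilityai/stablelm-tuned-alpha-7b")  # model used in the article
sampling = SamplingParams(temperature=0.0, max_tokens=128)
prompts = ["Explain paged attention in one paragraph."] * 64  # illustrative workload

start = time.perf_counter()
outputs = llm.generate(prompts, sampling)
elapsed = time.perf_counter() - start

# Rough batch-level metrics: requests per second and seconds per generated token.
total_output_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"throughput: {len(prompts) / elapsed:.2f} req/s")
print(f"avg latency per output token: {elapsed / total_output_tokens:.4f} s")
```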

Read the full article

Regarding Project 2


Batch-invariance: Our Testing and Observations on Current Community Efforts

Authors: Anyu Yang, Junbeom In, Xiaocong Zhang

Description: This article examines the batch invariance feature in vLLM, an LLM inference engine. Batch invariance means that LLM inference produces identical results for the same input regardless of how requests are batched together, a building block of overall determinism. The authors replicated and tested the software setup of the initial vLLM batch invariance pull request and found that all small models they tested (TinyLlama-1.1B, Qwen2.5-1.5B, Qwen3-0.6B, Qwen3-1.7B, Qwen3-4B) were NOT batch invariant on their GPUs (RTX 3060 and RTX 4060). They argue that current community efforts to achieve batch invariance are unreliable due to naive and insufficient testing methodology. The paper also includes a proof-of-concept batch-invariant matrix multiplication kernel implemented in Triton.
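
As an illustration of what such a test looks like, here is a minimal sketch of a batch-invariance check in the spirit of the authors’ methodology; the prompts and batch composition are illustrative assumptions, and the authors’ actual harness and Triton kernel are in the article.

```python
# Minimal batch-invariance check (illustrative; see the article for the authors' harness).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B")  # one of the small models the authors tested
greedy = SamplingParams(temperature=0.0, max_tokens=64)

probe = "Summarize the idea of batch invariance in one sentence."
fillers = [f"Write a haiku about GPU number {i}." for i in range(7)]

# Run the probe prompt alone (batch of 1) and inside a larger batch.
solo = llm.generate([probe], greedy)[0].outputs[0].token_ids
batched = llm.generate([probe] + fillers, greedy)[0].outputs[0].token_ids

# Batch invariance requires the greedy continuation to be identical in both runs.
print("batch invariant:", list(solo) == list(batched))
```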

Read the full article

Regarding Project 3


Research as an Imitation Game: When Good Researchers Copy and Great Researchers Steal

Author: Hanhui Wang

Description: An opinion piece arguing that high-impact research comes not only from building entirely new mechanisms, but also from recognizing structural equivalence between disparate domains and transplanting solutions across the boundaries of time and field. The author distinguishes between “copying” (surface-level adaptation that preserves form while ignoring context) and “stealing” (beginning with a new problem, revealing its abstract structure, and adapting old ideas with an understanding of why they work). Two case studies illustrate this: (1) vLLM’s PagedAttention, which “stole” the paging mechanism from 1960s operating systems to solve GPU memory management for LLM serving, and (2) video generation research borrowing the core intuition from LLM research that generation can serve as a forcing function for reasoning. The piece emphasizes that in an era of increasingly complex AI systems, the ability to bridge isolated islands of knowledge is becoming a critical differentiator.

Read the full article


LLMs as Experimental Proxies for AGI: Why Behavioral Equivalence Matters More Than Mechanism

Author: Franc O

Description: This opinion piece argues that Large Language Models demonstrate unprecedented generalization capabilities that make them valuable proxies for studying artificial general intelligence (AGI). The author contends that if a system exhibits the behavioral signatures of general intelligence across sufficient domains, it becomes a valid experimental substrate for AGI research, regardless of its underlying mechanism. Key evidence includes emergent capabilities that appear discontinuously with scale (chain-of-thought reasoning, cross-modal understanding, tool use), functional decomposition showing LLMs implement core cognitive operations through alternative pathways (abstract reasoning, causal inference, meta-learning), and the predictable relationship between model size, data quantity, and capability emergence (Chinchilla scaling laws). The paper argues that LLMs provide instrumental value as imperfect proxies, offering a computational laboratory for testing theories of intelligence, studying compositional generalization, few-shot learning dynamics, and alignment challenges.

Read the full article


With state, LLMs are increasingly becoming databases, so we need to incorporate catalogs to engineer accountability

Author: Arunit Baidya

Description: This article argues that modern LLM systems have evolved from stateless token predictors to stateful systems that manage state through mechanisms like durable logs, retrieval-based augmentation, and user session histories. The author draws parallels between LLM systems and database systems, noting that LLM systems now store, retrieve, update, index and manage information across multiple data stores. The key argument is that LLM systems need a “catalog” (similar to database catalog tables in PostgreSQL) to track dependencies between components, artifacts, and their versions over time. This catalog would enable accountability by allowing developers to trace which artifacts, embedding indexes, and system states influenced a particular response through a graph representation of dependencies. While acknowledging implementation challenges around compute costs, memory overhead, and security concerns, the author argues that catalogs are necessary given the growing complexity and reliance on LLM systems, enabling proactive identification of recurring failure patterns rather than reactive debugging.
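
To give a sense of what such a catalog might look like, here is a minimal sketch of a dependency graph over versioned artifacts that can be traced backwards from a response; all names and fields are illustrative assumptions, since the article does not prescribe a schema.

```python
# Minimal catalog sketch: versioned artifacts plus a traceable dependency graph.
# All names and fields are illustrative; the article does not prescribe a schema.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Artifact:
    name: str     # e.g. "embedding-index", "prompt-template", "session-log"
    version: str  # e.g. a date or a content hash

@dataclass
class Catalog:
    # Maps each produced artifact to the artifacts it was derived from.
    deps: dict[Artifact, set[Artifact]] = field(default_factory=dict)

    def record(self, produced: Artifact, used: list[Artifact]) -> None:
        self.deps.setdefault(produced, set()).update(used)

    def trace(self, produced: Artifact) -> set[Artifact]:
        """Transitively collect every artifact that influenced `produced`."""
        seen: set[Artifact] = set()
        stack = [produced]
        while stack:
            for dep in self.deps.get(stack.pop(), set()):
                if dep not in seen:
                    seen.add(dep)
                    stack.append(dep)
        return seen

catalog = Catalog()
snapshot = Artifact("kb-snapshot", "2024-11-03")
index = Artifact("embedding-index", "v7")
response = Artifact("response-1234", "run-1")
catalog.record(index, [snapshot])
catalog.record(response, [index])
print(catalog.trace(response))  # both the index and the snapshot behind it
```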

Read the full article


Speed at the Cost of Safety: A Hidden Trade-off in LLM Engines

Author: Shuyi Lin

Description: This article argues that aggressive approximation techniques used to accelerate LLM serving—including quantization, pruning, speculative decoding, and KV cache compression—come with a hidden cost: compromised safety. The author demonstrates through evaluations using Jailbreak Oracle (JO), a tree-search-based evaluation tool, that approximated models exhibit drastically higher vulnerability to unsafe prompts. In a case study comparing Qwen3-8B against its quantized variant, the quantized model showed a catastrophic 6x increase in safety vulnerability scores (from 1400 to 9000), failing under simple structural noise where the original required sophisticated semantic manipulation. The article also questions the claimed efficiency benefits, highlighting the batching bottleneck, speculative decoding paradox, and KV cache limitations that diminish real-world speedups. The author explains that safety mechanisms rely on “sparse” or “long-tail” neurons that are the first casualties of approximation since they’re statistically classified as “unimportant” for average perplexity. The article concludes that the community must pivot from “lossy” approximations toward robust hardware-software co-design to ensure speed does not compromise AI system safety.
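
The “long-tail neuron” argument can be illustrated with a small numerical sketch (not from the article): under coarse symmetric quantization, small-magnitude weights round to zero, while the large weights that dominate average perplexity survive nearly unchanged.

```python
# Illustrative numpy sketch of the mechanism described above (not from the article).
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal(10_000)
weights[:20] *= 0.01  # pretend these tiny weights belong to rare safety-relevant neurons

def quantize_dequantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Naive symmetric round-to-nearest quantization followed by dequantization."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

wq = quantize_dequantize(weights, bits=4)
print("long-tail weights collapsed to zero:", float(np.mean(wq[:20] == 0)))  # nearly all
print("bulk weights collapsed to zero:     ", float(np.mean(wq[20:] == 0)))  # far fewer
```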

Read the full article


Asynchronous RL Training is Structurally Necessary for LLM Post-Training at Scale

Author: Zikai Wang

Description: This article argues that asynchronous reinforcement learning architectures are fundamentally necessary for efficient LLM post-training at scale, particularly for agentic workloads. The author demonstrates that synchronous RLHF training suffers from a critical straggler problem: the generation phase consumes 70-80% of training time, and response lengths follow a long-tailed distribution, forcing all GPUs to wait for the slowest worker. This creates catastrophic resource waste where a single long response blocks the entire cluster. While recent synchronous optimizations like Seer’s context-aware scheduling and VeRL’s over-sampling achieve impressive speedups for text generation, these techniques fail for agentic workloads involving tool calls, API latencies, and code execution, where variance comes from unpredictable external sources rather than the model’s generation process. Asynchronous architectures solve this by decoupling generation from training—rollout workers stream trajectories continuously while training proceeds on available data, achieving 2-10× speedups without waiting for stragglers. The article concludes that the straggler problem is structural and cannot be engineered away through better hardware or scheduling; as LLM post-training moves toward agentic, tool-integrated workloads, asynchronous training will become the default rather than the exception.
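
The straggler arithmetic is easy to see in a toy simulation; the sketch below uses illustrative numbers (not the article’s measurements) to show how long-tailed rollout times gate a synchronous step on the slowest worker, while an idealized asynchronous design keeps workers busy.

```python
# Toy straggler simulation (illustrative numbers, not the article's measurements).
import random

random.seed(0)
NUM_WORKERS, NUM_STEPS = 64, 100

def rollout_time() -> float:
    # Long-tailed response lengths imply long-tailed generation times.
    return min(random.paretovariate(2.0), 50.0)

sync_total, busy_total = 0.0, 0.0
for _ in range(NUM_STEPS):
    times = [rollout_time() for _ in range(NUM_WORKERS)]
    sync_total += max(times)  # synchronous: every worker waits for the slowest rollout
    busy_total += sum(times)  # useful generation work actually performed

async_total = busy_total / NUM_WORKERS  # idealized async: workers stream rollouts back-to-back
print(f"synchronous wall-clock: {sync_total:.0f} (utilization {busy_total / (sync_total * NUM_WORKERS):.0%})")
print(f"asynchronous wall-clock: {async_total:.0f} (speedup {sync_total / async_total:.1f}x)")
```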

Read the full article