Project 2: vLLM Performance Evolution

What is vLLM?

vLLM is an open-source framework for Large Language Model (LLM) serving. It is designed to make inference both fast and memory-efficient, and it is now widely adopted in practice. vLLM was introduced in the PagedAttention paper (SOSP'23).
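
For context, here is a minimal sketch of offline inference with vLLM's Python API. The model name is only an example, and exact defaults may differ slightly across versions:

```python
# Minimal offline-inference sketch using vLLM's Python API.
# Assumes `pip install vllm` and a GPU; the model name is only an example.
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "In one sentence, explain what an operating system does:",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# The LLM class manages model weights, KV-cache memory (PagedAttention), and batching.
llm = LLM(model="facebook/opt-125m")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}")
    print(f"Completion: {output.outputs[0].text!r}")
```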

Performance evolution of computer systems

The "performance" of a sophisticated system is usually a myth: it is rarely a single number, and instead reflects many interacting components, workloads, and assumptions. Understanding how performance evolves over time and across versions can help identify which optimizations truly matter, expose regressions that would otherwise remain hidden, and guide future design choices with empirical evidence.

For example, below is a figure from a study of Linux performance evolution (if you're interested, read the paper An Analysis of Performance Evolution of Linux's Core Operations, SOSP'19).

Figure 1(a) from An Analysis of Performance Evolution of Linux's Core Operations (SOSP'19)

How to read the figure:

  • The X-axis shows the evolution of Linux versions.
  • The Y-axis lists different microbenchmarks.
  • Each cell reports relative performance compared to Linux kernel version 4.0. Greener indicates higher performance.
  • Curious about what happened starting from version 4.14? See the paper for details.

Why study vLLM performance evolution

Since its debut on 09/02/2023, vLLM has undergone rapid and substantial evolution. The improvements reflect both system-level innovations and model-side optimizations tailored for LLMs. Studying this trajectory is essential for (at least) three reasons:

  1. Relevance of techniques: understanding which optimizations remain impactful in today’s serving environments versus those made obsolete by newer designs.
  2. Attribution of performance gains: identifying which changes contributed most to throughput, latency, or efficiency improvements.
  3. Trade-off analysis: examining how different approaches balance performance, resource usage, and flexibility.

By tracing vLLM’s performance evolution, researchers and practitioners gain insight into the principles that drive state-of-the-art LLM serving, while also learning from the successes and limitations of prior optimizations.

Goals

  • Develop hands-on experience with LLM serving
  • Build a deep understanding of LLM serving optimizations
  • Design and implement a benchmarking methodology across vLLM versions
  • Produce a performance evolution plot for vLLM (analogous to the Linux example above; see the plotting sketch after this list)
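
A vLLM evolution plot could take the same form as the Linux figure: benchmarks on one axis, versions on the other, and each cell showing performance relative to a baseline version. Below is a minimal matplotlib sketch; the version list, benchmark names, and numbers are placeholders purely to illustrate the plot shape, not real measurements:

```python
# Sketch of a Linux-style performance-evolution heatmap for vLLM.
# All versions, benchmark names, and values below are placeholders;
# real values would come from your own measurements.
import numpy as np
import matplotlib.pyplot as plt

versions = ["v0.2.0", "v0.3.0", "v0.4.0", "v0.5.0"]   # x-axis: vLLM releases
benchmarks = ["throughput", "TTFT", "ITL"]            # y-axis: benchmarks/metrics

# rows = benchmarks, cols = versions; values are speedups relative to the first version
relative = np.array([
    [1.00, 1.15, 1.40, 1.60],   # throughput (higher is better)
    [1.00, 1.05, 1.20, 1.30],   # time-to-first-token speedup
    [1.00, 1.10, 1.25, 1.45],   # inter-token-latency speedup
])

fig, ax = plt.subplots(figsize=(6, 3))
im = ax.imshow(relative, cmap="RdYlGn", vmin=0.5, vmax=2.0, aspect="auto")
ax.set_xticks(range(len(versions)), labels=versions)
ax.set_yticks(range(len(benchmarks)), labels=benchmarks)
for i in range(len(benchmarks)):
    for j in range(len(versions)):
        ax.text(j, i, f"{relative[i, j]:.2f}x", ha="center", va="center", fontsize=8)
fig.colorbar(im, ax=ax, label=f"relative to {versions[0]}")
ax.set_title("vLLM performance evolution (placeholder data)")
fig.tight_layout()
fig.savefig("vllm_evolution.png", dpi=200)
```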

Who should care?

  • people interested in LLM serving
  • researchers working on optimizing LLM serving
  • researchers benchmarking LLM runtime performance
  • people interested in the history of scientific ideas and the evolution of research practices

Project milestones

  1. form a team (ideally 2–4 members)
  2. set up project infrastructure: machines, tools, task tracking, and shared documentation
  3. select and prepare stable vLLM versions for evaluation
  4. decide on a benchmarking methodology (study prior literature and best practices; see the sketch after this list)
  5. run experiments and analyze the data
  6. write up your analysis and opinions
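
For milestones 4 and 5, one possible (and deliberately simplified) methodology is to start each vLLM version's OpenAI-compatible server and drive it with a fixed request set, recording per-request latency and aggregate throughput. The sketch below assumes a server is already running at localhost:8000 (e.g. started with `vllm serve <model>` on recent versions, or the `vllm.entrypoints.openai.api_server` module on older ones); the endpoint path and response fields follow the standard OpenAI completions schema, which vLLM implements, but details may vary across versions:

```python
# Simplified single-client latency/throughput probe against a running
# vLLM OpenAI-compatible server (assumed at http://localhost:8000).
# A real benchmark should also sweep request rates and use concurrent clients.
import statistics
import time

import requests

BASE_URL = "http://localhost:8000/v1/completions"   # assumption: default port and path
MODEL = "facebook/opt-125m"                          # example model; must match the server

prompts = [f"Write a short sentence about topic {i}." for i in range(20)]

latencies, completion_tokens = [], 0
t_start = time.perf_counter()
for prompt in prompts:
    t0 = time.perf_counter()
    resp = requests.post(BASE_URL, json={
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": 64,
        "temperature": 0.0,
    }, timeout=120)
    resp.raise_for_status()
    latencies.append(time.perf_counter() - t0)
    # "usage" follows the OpenAI completions schema (prompt/completion/total tokens).
    completion_tokens += resp.json()["usage"]["completion_tokens"]
wall = time.perf_counter() - t_start

p95 = sorted(latencies)[int(0.95 * len(latencies)) - 1]
print(f"requests: {len(prompts)}, wall time: {wall:.2f} s")
print(f"mean latency: {statistics.mean(latencies) * 1000:.1f} ms, p95: {p95 * 1000:.1f} ms")
print(f"output throughput: {completion_tokens / wall:.1f} tokens/s")
```

Repeating this probe (and richer variants of it) against each selected vLLM version, with the same model, hardware, and request set, yields the per-version numbers needed for the evolution plot.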