Project 3: Batch-invariant GPU Kernels

Nondeterminism at the kernel level is confusing

As we have learned, Transformers are deterministic: given a fixed sequence of tokens, they compute the logits for the next token deterministically. If the model, the prompt, and the decoding strategy are fixed (and the decoding strategy is itself deterministic or uses a fixed random seed), the outputs should be deterministic as well. In practice, however, this is often not the case, which is confusing.

Here is an example using vLLM. We repeat the same prompt, “Are LLM systems cool?”, 2 times and 16 times, and send the repetitions as two separate batches to a Qwen3-0.6B model running with vLLM on an RTX 4090. The outputs within the batch of size 2 are identical (left), whereas the outputs within the batch of size 16 differ from one another (right), and many also differ from those in the batch of size 2. The figure below shows the results.
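One way to run this comparison is a small harness that replays the prompt at several batch sizes and tallies the distinct completions. The sketch below is model-agnostic and hypothetical: `generate` stands in for whatever backend you use (for instance, a thin wrapper around vLLM's `LLM.generate` with greedy sampling), and `fake_generate` is an invented stand-in so the harness runs without a GPU:

```python
from collections import Counter

def compare_batches(generate, prompt, sizes=(2, 16)):
    """Replay `prompt` at each batch size and tally distinct completions.

    `generate` is any callable that maps a list of prompts to a list of
    completion strings. A deterministic, batch-invariant backend should
    yield exactly one distinct completion at every batch size.
    """
    return {n: Counter(generate([prompt] * n)) for n in sizes}

# Toy backend that misbehaves at large batch sizes, mimicking the
# observation above; replace it with a real model call.
def fake_generate(prompts):
    return ["yes" if len(prompts) < 8 or i % 3 else "yes!"
            for i, _ in enumerate(prompts)]

report = compare_batches(fake_generate, "Are LLM systems cool?")
print(report[2])   # one distinct completion
print(report[16])  # several distinct completions
```

With a real backend, the size of each `Counter` directly measures how much the outputs diverge within a batch.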

[Figure: lab 3 diff of the outputs across the two batch sizes]

Detailed results are here.

The root cause

Thinking Machines, a startup, published a blog post, Defeating Nondeterminism in LLM Inference, that explains why outputs differ across batch sizes:

[…] the primary reason [for why batch sizes affect the outputs] nearly all LLM inference endpoints are nondeterministic is that the load (and thus batch-size) nondeterministically varies!

Moreover, they implemented a set of batch-invariant GPU kernels.
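The mechanism is worth spelling out: floating-point addition is not associative, and GPU kernels typically change their reduction split (and hence the order of additions) depending on how much work is in the batch. A minimal demonstration in NumPy, using float32 as most inference kernels do:

```python
import numpy as np

x = np.array([1e8, 1.0, -1e8, 1.0], dtype=np.float32)

# Sequential left-to-right reduction: 1e8 + 1.0 rounds back to 1e8
# (float32 spacing near 1e8 is 8), so the first 1.0 is lost:
# 1e8 -> 1e8 -> 0.0 -> 1.0
seq = np.float32(0.0)
for v in x:
    seq = np.float32(seq + v)

# Pairwise (tree) reduction, as a kernel with a different split might do:
# (1e8 + 1.0) + (-1e8 + 1.0) = 1e8 + (-1e8) = 0.0
pair = np.float32(np.float32(x[0] + x[1]) + np.float32(x[2] + x[3]))

print(seq, pair)  # 1.0 0.0
```

Once a different batch size changes the reduction order, per-token logits shift by a few ULPs; greedy decoding can then pick a different token, and the divergence compounds over the rest of the sequence.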

This project: replicate batch-invariant kernels and explain within-batch differences

In this project, students will replicate a set of batch-invariant GPU kernels in Triton and optimize their performance. Along the way, they will analyze in detail the sources of nondeterminism in existing kernels and propose methods to mitigate them.
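To make "batch-invariant" concrete, here is a hypothetical reduction sketched in plain NumPy. `adaptive_sum` mimics a kernel that picks its split count from available parallelism (which depends on the batch size), while `fixed_split_sum` always reduces each row in the same order; only the latter gives bitwise-identical results regardless of batch size. The split heuristic is invented for illustration, not Thinking Machines' actual strategy:

```python
import numpy as np

def split_sum(v, num_splits):
    # Reduce `v` in `num_splits` contiguous chunks, then combine the
    # partial sums sequentially (all in float32, like a GPU kernel).
    chunk = len(v) // num_splits
    partials = [np.float32(v[i:i + chunk].sum())
                for i in range(0, len(v), chunk)]
    acc = np.float32(0.0)
    for p in partials:
        acc = np.float32(acc + p)
    return acc

def adaptive_sum(v, batch_size, num_sms=32):
    # NOT batch-invariant: with few rows, use more splits per row to
    # fill the GPU; with many rows, fewer. The reduction order (and
    # thus the rounding) depends on the batch size.
    return split_sum(v, min(len(v), max(1, num_sms // batch_size)))

def fixed_split_sum(v):
    # Batch-invariant: the reduction order depends only on the row
    # itself, never on how many rows share the batch.
    return split_sum(v, 4)

x = np.array([1e8, 1.0, -1e8, 1.0], dtype=np.float32)
print(adaptive_sum(x, batch_size=2), adaptive_sum(x, batch_size=16))  # 1.0 0.0
print(fixed_split_sum(x), fixed_split_sum(x))                         # 1.0 1.0
```

The real kernels apply the same principle to the operations that matter in inference; per the blog post, that means RMSNorm, matmul (e.g. fixing the split-K strategy), and attention.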

The goals of this project are:

  • Develop hands-on experience with Triton
  • Build a deep understanding of GPU kernel optimizations
  • Design and implement a set of batch-invariant kernels
  • Produce a performance evaluation of batch-invariant kernels

Who should care?

  • those interested in GPU kernel implementation
  • those doing research on GPU kernel optimization

Project milestones

  1. form a team (ideally 2–4 members)
  2. set up project infrastructure: machines, tools, task tracking, and shared documentation
  3. learn Triton
  4. re-implement Thinking Machines’ batch-invariant kernels
  5. optimize the kernels
  6. run performance experiments and analyze the data
  7. write an opinion piece on the results