Correctness Without a Reference
Existential equivalence: a definition of correctness for AI
For as long as we have verified software, a specification has been something you can check. Given an output, you can decide whether it is correct. Sort a list: is the result ordered, and a permutation of the input? Two questions, both decidable. Most of what I have worked on stands on this: my system Cobra verifies whether a database served its transactions serializably; our Orochi project verifies whether an untrusted server returned what the real computation would have. Different systems, one bedrock: correctness is a property you can check.
Modern AI removes the bedrock. Ask a model to summarize four hundred pages, draft the brief, or run the agent that books your travel. Look at the output and try to decide—correct, or not? Nothing answers. The specification did not get harder to check. It stopped being a checkable object at all.
The “canonical solution” is to find a reference. Pick a trusted implementation and call the output correct when it matches—re-execute the computation, diff against a gold version. It is, for example, what GPU kernel developers do, and how deep-learning compilers are tested. But equality is probably the wrong relation. Sum a million floats in a different order and the last bits change; the answer is still correct, yet a reference check rejects it. And if the reference itself is wrong, you certify its bug as truth. We do not want equality to a reference. We want equivalence—and we have to say which differences are allowed.
Physics settled this first
Sit in a sealed box with no windows and drop a ball. It falls. Are you on Earth, or in a rocket accelerating at one gravity? No measurement inside can tell. Einstein refused to ask which one is “real”: if no observation distinguishes them, they are the same. Sameness is not absolute. It is sameness with respect to what you can observe.
We have already seen this—observational equivalence—in different areas. The physics only makes it vivid: fix what you observe, and you fix what counts as correct.
The definition for “AI correctness”
So change the question. Not “did the machine produce the right answer?” but: could a legitimate run of the machine have produced what I see?
correct ⇔ there exists a legitimate run e with observe(e) = the outcome.
An output is correct if some legitimate execution explains it. To call it wrong is the harder claim: that no legitimate run could have produced it. The definition turns on three choices. What makes a run legitimate—this is the specification, under its true name. What makes two outcomes equal. And what we observe. The legitimate runs are exactly the differences that do not matter: reorder the float sum, fine; return 7 where every legitimate run yields 3, not fine.
If this feels familiar, it should. It echoes several established concepts in computer science. Take serializability as an example. A database history is correct precisely when there exists a serial order of its transactions consistent with the reads and writes we observed, and database checkers like Cobra spend their whole effort searching for that one legitimate order. Serializability was existential equivalence all along. We had not named the pattern.
Why this is useful
The pattern is what excites me, because it does not break as the systems get wilder. You climb from a chip to an agent by turning two dials—loosen what counts as legitimate, coarsen what you observe—and the sentence still means something. On a GPU kernel, the legitimate runs are the valid summation orders and you observe the numbers, up to ε. In LLM inference, they are the samplings the decoding policy allows and you observe the text (see an example in our Jailbreak Oracle Problem). In an agent, they are the allowed interleavings of tool calls and the world, and you observe the side effects—the flight booked, the calendar updated.
Same definition, three altitudes. Naming it pays off twice. The specification becomes a single object, the legitimacy predicate; every dispute about whether an AI is “correct” turns into a concrete dispute about which runs we are willing to call legitimate. And it shows where the work is hard, just by counting. Exhibiting one legitimate run is enough to accept; ruling out all of them is what it takes to reject.
This says only that the machine ran legitimately—not that its output was wise. Whether a result is helpful, or harmless, is a different question, and a later post.
Conclusion
Existential equivalence is one definition that travels: an output is correct when some legitimate run explains what we observe—whether that run is a summation order, a sampling path, or an agent’s trace. It does not make specifying AI easy. It tells us where the difficulty lives: in the legitimacy predicate, and in what we choose to observe. That is a smaller, sharper question than “is the AI right,” and a better place to start.
Every correctness question is a windowless box. The only decision that matters is what you let yourself observe. Make it honestly, and correctness stops being a verdict you pronounce. It becomes an explanation the world can offer—or cannot.
References
- Cheng Tan, Changgeng Zhao, Shuai Mu, and Michael Walfish. Cobra: Making Transactional Key-Value Stores Verifiably Serializable. OSDI 2020.
- Cheng Tan, Lingfan Yu, Joshua B. Leners, and Michael Walfish. The Efficient Server Audit Problem, Deduplicated Re-execution, and the Web. SOSP 2017.
- Christos H. Papadimitriou. The Serializability of Concurrent Database Updates. Journal of the ACM 26(4), 1979.
- Jiawei Liu, Jinkun Lin, Fabian Ruffy, Cheng Tan, Jinyang Li, Aurojit Panda, and Lingming Zhang. NNSmith: Generating Diverse and Valid Test Cases for Deep Learning Compilers. ASPLOS 2023.
- Shuyi Lin, Anshuman Suri, Alina Oprea, and Cheng Tan. Toward Principled LLM Safety Testing: Solving the Jailbreak Oracle Problem. MLSys 2026.