This article explores the issue of non-determinism in large language model (LLM) inference, a common problem for developers and users. It delves into a recent research paper by Thinking Machines Lab, founded by former OpenAI CTO Mira Murati, which sheds light on the root cause and proposes solutions.
Background: Thinking Machines Lab
Thinking Machines Lab has garnered significant attention since its inception.
- The company completed a \$2 billion seed round at a \$12 billion valuation.
- The investment came before the release of any product, a rare occurrence in the AI industry.
- Investors included prominent firms such as a16z, NVIDIA, AMD, and Cisco.
- Such heavy investment in a company with no product yet is widely attributed to the strength of its research team and the valuable insights coming out of its research.
The Problem: Inconsistent Outputs from LLMs
Why do LLMs sometimes produce different outputs even when given the same input and fixed random seeds? This inconsistency can be frustrating and presents a significant challenge for real-world applications. Even when using open-source inference libraries like vLLM and SGLang on your own hardware, the issue persists.
Common Misconceptions
Many attribute this non-determinism to GPU concurrency or floating-point arithmetic errors. While these factors play a role, they are not the primary cause, according to Thinking Machines Lab.
The True Culprit: Batch Invariance Loss
The core issue lies in the lack of batch invariance during LLM inference.
- LLM inference servers handle requests in batches to optimize resource utilization.
- The size of these batches varies with server load.
- The research found that the same input can produce different outputs depending on the batch size it is processed in.
- The "non-associativity" of floating-point arithmetic is what allows a change in batch size to change the numerical result.
Floating-Point Non-Associativity
Floating-point numbers trade exactness for dynamic range, and as a result floating-point addition is not associative.
- For example, (a + b) + c may not equal a + (b + c) when a, b, and c are floating-point values (see the snippet after this list).
- This happens because a float carries limited precision relative to its magnitude: when a small value is added to a much larger one, the small value's low-order bits are rounded away, effectively "swallowed" by the larger number.
- These tiny rounding differences are magnified across the enormous number of additions performed during LLM inference.
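The effect is easy to reproduce in any language with IEEE 754 floats; the values below are chosen only to make the rounding visible:

```python
a, b, c = 1e16, -1e16, 1.0

# (a + b) cancels exactly to 0.0, so adding c afterwards keeps the 1.0.
print((a + b) + c)   # 1.0

# (b + c) rounds back to -1e16 because 1.0 is below the precision available
# at that magnitude, so the 1.0 is lost before the cancellation happens.
print(a + (b + c))   # 0.0
```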
Impact on Model Operations
The non-associativity of floating-point math can significantly affect calculations within LLMs. Specifically, this becomes an issue in operations such as:
- Matrix multiplication
- RMSNorm (Root Mean Square Normalization)
- Attention mechanisms

All of these operations involve long chains of floating-point additions and multiplications.
When servers process requests with different batch sizes, the reduction order of these operations changes. Different reduction orders can lead to different results.
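A toy NumPy sketch makes the point: summing the same values with different chunkings (standing in for different batch-size-dependent reduction strategies) typically yields slightly different float32 totals, while repeating the same chunking is bit-for-bit stable. The chunk sizes here are arbitrary illustrations, not what any real kernel uses.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

def chunked_sum(v, chunk):
    # Sum each chunk, then accumulate the partial sums. Changing the chunk
    # size changes the reduction tree, much as a kernel's reduction strategy
    # changes when the batch size changes.
    total = np.float32(0.0)
    for i in range(0, len(v), chunk):
        total += v[i:i + chunk].sum(dtype=np.float32)
    return total

print(chunked_sum(x, 1024) == chunked_sum(x, 1024))  # True: same order, same bits
print(chunked_sum(x, 1024) == chunked_sum(x, 4096))  # usually False: different order
```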
The GPU Concurrency Misunderstanding
The research clarifies that GPU concurrency itself is not the problem. Even on GPUs, fixing the reduction order ensures consistent results across multiple runs. It is the varying reduction strategies, dependent on batch size, that cause the issue.
The Solution: Batch-Invariant Kernels
The solution involves implementing "batch-invariant kernels" for core LLM operations, ensuring a consistent reduction order regardless of batch size.
RMSNorm Implementation
The goal is to force the RMSNorm calculation to always use a data-parallel approach, with each batch element's reduction completed within a single core. This eliminates variations in the reduction order.
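A pure-Python sketch of the data-parallel idea (not Thinking Machines Lab's actual kernel): each batch element is normalized by its own self-contained, fixed-order reduction, so the presence or absence of other rows in the batch cannot change its result.

```python
import numpy as np

def rmsnorm_row(x_row, weight, eps=1e-6):
    # Reduce this row's squared values in a fixed left-to-right order.
    acc = np.float32(0.0)
    for v in x_row.astype(np.float32):
        acc += v * v
    rms = np.sqrt(acc / np.float32(x_row.shape[0]) + np.float32(eps))
    return (x_row.astype(np.float32) / rms) * weight

def batch_invariant_rmsnorm(x, weight):
    # One row ("batch element") per worker: each reduction stays inside a
    # single row, mirroring the one-core-per-element strategy described above.
    return np.stack([rmsnorm_row(row, weight) for row in x])
```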
Matrix Multiplication Implementation
The matrix-multiplication kernels must not adjust their strategy based on batch size. To accomplish that:
- Disable "Split-K" strategies, which would otherwise change the reduction order depending on the batch size (a toy sketch of the fixed K-loop follows this list).
- Fix the tensor-core instruction size so the internal reduction order stays consistent.
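As a toy illustration (NumPy on the CPU, not a GPU kernel): keeping the loop over the K dimension identical for every batch size is exactly the property that a Split-K kernel gives up.

```python
import numpy as np

def fixed_order_matmul(a, b, k_tile=128):
    """Tiled matmul whose accumulation over K always proceeds tile by tile
    in the same order, no matter how many rows (batch elements) `a` has."""
    m, k = a.shape
    _, n = b.shape
    out = np.zeros((m, n), dtype=np.float32)
    # This K loop is the same for batch size 1 and batch size 1024. A Split-K
    # kernel would instead divide this loop across workers when the batch is
    # small, changing the order in which partial products are added together.
    for k0 in range(0, k, k_tile):
        out += a[:, k0:k0 + k_tile].astype(np.float32) @ b[k0:k0 + k_tile, :].astype(np.float32)
    return out
```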
Attention Mechanism Implementation
The attention mechanism is the most complex operation to make batch-invariant (a toy sketch of the fixed-split idea follows the list below).
- Fix the split size of the KV (key-value) cache, rather than splitting the cache into a fixed number of pieces.
- Consistently update the KV cache layout before the attention kernel runs, so that data-layout differences cannot creep in.
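A toy sketch of the fixed-split-size idea for the KV cache (the chunk size is an arbitrary illustration): with a fixed chunk size, the boundaries over the earliest tokens never move as the sequence grows, whereas a fixed number of splits shifts every boundary and, with it, the reduction order.

```python
def kv_chunks_fixed_size(seq_len, chunk_size=256):
    # Fixed split size: the chunk covering tokens [0, 256) is the same whether
    # the cache holds 300 or 30,000 tokens, so the order in which attention
    # reduces over the cache stays stable.
    return [(s, min(s + chunk_size, seq_len)) for s in range(0, seq_len, chunk_size)]

def kv_chunks_fixed_count(seq_len, num_chunks=4):
    # Fixed number of splits: every boundary shifts as seq_len grows, which
    # changes the reduction order from one decode step to the next.
    size = -(-seq_len // num_chunks)  # ceiling division
    return [(s, min(s + size, seq_len)) for s in range(0, seq_len, size)]

print(kv_chunks_fixed_size(600))   # [(0, 256), (256, 512), (512, 600)]
print(kv_chunks_fixed_count(600))  # [(0, 150), (150, 300), (300, 450), (450, 600)]
```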
Experimental Results
Thinking Machines Lab demonstrated the effectiveness of their solution using the Qwen/Qwen3-235B-A22B-Instruct-2507 model.
- With standard kernels, 1,000 samplings produced 80 different results.
- With batch-invariant kernels, the results were 100% consistent across all 1,000 samplings.
While there may be some performance costs, the deterministic results make it worthwhile for critical applications.
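A minimal sketch of this kind of determinism check using vLLM's offline API; the prompt, sample count, and decoding settings are illustrative rather than the paper's exact configuration, and a 235B model requires a correspondingly large multi-GPU setup.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-235B-A22B-Instruct-2507")
params = SamplingParams(temperature=0.0, max_tokens=200)

# Submit the same prompt many times; on a batch-variant server the
# completions can differ even at temperature 0.
prompt = "Explain why floating-point addition is not associative."
outputs = llm.generate([prompt] * 1000, params)
completions = [o.outputs[0].text for o in outputs]
print("unique completions:", len(set(completions)))  # 1 means fully deterministic
```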
Implications and Significance
This research has significant implications:
- It provides a principled solution for the reproducibility and reliability of LLM outputs.
- It enables on-policy reinforcement learning for LLMs by ensuring numerically consistent behavior between training and inference.
Conclusion
Thinking Machines Lab's work addresses a critical issue in LLM inference, paving the way for more reliable and consistent AI systems. The focus on determinism highlights a new axis of competition in LLM technology.