
Fixing LLM Uncertainty: Batch Invariance Explained

Summary

Quick Abstract

Tired of inconsistent outputs from your large language models (LLMs), even with fixed random seeds? This summary dives into a study by Thinking Machines Lab, founded by former OpenAI CTO Mira Murati, which reveals the hidden cause of LLM non-determinism. We'll explore the problem, the real culprit, and the proposed solutions for repeatable and reliable AI.

Quick Takeaways:

  • LLM output inconsistencies stem from a lack of batch invariance, not from GPU concurrency or floating-point rounding alone.

  • Different batch sizes during inference change the reduction order of floating-point operations; because floating-point arithmetic is non-associative, different orders produce different results.

  • Thinking Machines Lab proposes "batch-invariant kernels" for operations like RMSNorm, matrix multiplication, and attention mechanisms, ensuring consistent output regardless of batch size.

  • Implementing these kernels can unlock reliable on-policy reinforcement learning for LLMs.

The core issue isn't GPU parallelism, but the changing order of calculations as batch sizes vary. The lab's solution enforces consistent reduction strategies within key LLM operations, achieving fully deterministic outputs. Some performance cost remains for now, but optimized "batch-invariant kernels" offer a path to robust, reproducible AI outputs, which is especially critical in fields like finance, healthcare, and research.

This article explores the issue of non-determinism in large language model (LLM) inference, a common problem for developers and users. It delves into a recent research paper by Thinking Machines Lab, founded by former OpenAI CTO Mira Murati, which sheds light on the root cause and proposes solutions.

Background: Thinking Machines Lab

Thinking Machines Lab has garnered significant attention since its inception.

  • The company completed a $2 billion seed round at a $12 billion valuation.

  • This investment occurred before the release of any products, a rare occurrence in the AI industry.

  • Investors included prominent firms like A16z, NVIDIA, AMD, and Cisco.

  • Such heavy investment in a company with no product yet is widely attributed to the strength of its research team and the valuable insights coming out of its research.

The Problem: Inconsistent Outputs from LLMs

Why do LLMs sometimes produce different outputs even when given the same input and fixed random seeds? This inconsistency can be frustrating and presents a significant challenge for real-world applications. Even when using open-source inference libraries like vLLM and SGLang on your own hardware, the issue persists.

Common Misconceptions

Many attribute this non-determinism to GPU concurrency or floating-point arithmetic errors. While these factors play a role, they are not the primary cause, according to Thinking Machines Lab.

The True Culprit: Lack of Batch Invariance

The core issue lies in the lack of batch invariance during LLM inference.

  • LLM inference servers handle requests in batches to optimize resource utilization.

  • The size of these batches varies depending on the server load.

  • The research found that the same input can produce different outputs depending on the batch size it is processed in.

  • The "non-associativity" of floating-point arithmetic is what turns this batch-size dependence into visible differences in output.

Floating-Point Non-Associativity

Floating-point numbers, which trade limited precision for a wide dynamic range, have arithmetic that is not associative.

  • For example, (a + b) + c may not equal a + (b + c) with floating-point numbers (see the sketch after this list).

  • This happens because floating-point numbers cover a wide range of magnitudes with limited precision: when a small value is added to a much larger one, its low-order digits can be "swallowed" and lost.

  • This becomes magnified during the core operations of LLM inference.
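
A quick, runnable illustration (plain Python, IEEE-754 doubles; the values are chosen only to make the effect visible):

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)   # 0.6000000000000001
print(a + (b + c))   # 0.6

# A small value can be "swallowed" entirely by a much larger one.
big, tiny = 1e16, 1.0
print((big + tiny) - big)   # 0.0 -- tiny was lost when added to big
print((big - big) + tiny)   # 1.0 -- same numbers, different order
```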

Impact on Model Operations

The non-associativity of floating-point math can significantly affect calculations within LLMs. Specifically, this becomes an issue in operations such as:

  • Matrix multiplication

  • RMSNorm (Root Mean Square Normalization)

  • Attention mechanisms

All of these operations involve long chains of floating-point additions and multiplications, implemented as reductions.

When servers process requests with different batch sizes, the reduction order of these operations changes. Different reduction orders can lead to different results.
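
As a rough sketch of that effect, the NumPy snippet below (arbitrary seed and chunk sizes, chosen purely for illustration) sums the same float32 values under two different chunkings, standing in for two kernel reduction strategies; the totals typically differ in the last bits:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1 << 16).astype(np.float32)

def chunked_sum(values: np.ndarray, chunk: int) -> np.float32:
    # Reduce fixed-size chunks first, then reduce the partial sums --
    # a stand-in for a kernel whose reduction tree depends on its tiling.
    partials = [values[i:i + chunk].sum() for i in range(0, len(values), chunk)]
    return np.float32(sum(partials))

s64, s1024 = chunked_sum(x, 64), chunked_sum(x, 1024)
print(s64, s1024)     # same inputs, different reduction order
print(s64 == s1024)   # typically False at float32 precision
```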

The GPU Concurrency Misunderstanding

The research clarifies that GPU concurrency itself is not the problem. Even on GPUs, fixing the reduction order ensures consistent results across multiple runs. It is the varying reduction strategies, dependent on batch size, that cause the issue.

The Solution: Batch-Invariant Kernels

The solution involves implementing "batch-invariant kernels" for core LLM operations, ensuring a consistent reduction order regardless of batch size.

RMSNorm Implementation

The goal is to force the RMSNorm calculation to always use a data-parallel approach, with each batch element's reduction completed within a single core. This eliminates variations in the reduction order.
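
A minimal PyTorch sketch of that structure (not the lab's actual kernel, which is written at a lower level where the reduction order can truly be pinned down): the mean-square reduction runs over the hidden dimension of one row at a time, so a row's output cannot depend on how many other rows share the batch.

```python
import torch

def rmsnorm_rowwise(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # x: [batch, hidden], weight: [hidden].
    # The mean-square reduction is computed per row, over the hidden
    # dimension only, so a given row's result is the same whether the
    # batch holds 1 row or 1000.
    mean_sq = x.float().pow(2).mean(dim=-1, keepdim=True)  # per-row reduction
    return (x.float() * torch.rsqrt(mean_sq + eps) * weight.float()).to(x.dtype)
```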

Matrix Multiplication Implementation

The matrix-multiplication kernel must not change its strategy based on batch size. To accomplish that:

  • Disable "Split-K" strategies, which change the reduction order as the problem shape, and hence the batch size, changes (see the sketch after this list).

  • Fix the tensor core instruction size to ensure consistency in the internal reduction order.
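
To see why Split-K matters, the PyTorch sketch below (illustrative shapes, with a manual two-way split standing in for a Split-K schedule) reduces the inner K dimension in one pass and in two halves; the two results are mathematically identical but generally differ in the last bits.

```python
import torch

torch.manual_seed(0)
A = torch.randn(64, 4096, dtype=torch.float32)
B = torch.randn(4096, 64, dtype=torch.float32)

# One pass over the full K (inner) dimension.
single_pass = A @ B

# "Split-K"-style schedule: reduce each half of K separately, then add the
# partial products. Mathematically identical, numerically not.
split_k = A[:, :2048] @ B[:2048, :] + A[:, 2048:] @ B[2048:, :]

print(torch.equal(single_pass, split_k))           # often False
print((single_pass - split_k).abs().max().item())  # small but nonzero gap
```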

Attention Mechanism Implementation

This is the most complex to solve.

  • Fix the size of each KV (Key-Value) cache split, rather than fixing the number of splits and letting the split size vary with sequence length (see the sketch after this list).

  • Consistently update the KV cache layout before attention kernels to avoid data layout differences.
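
A toy sketch of the fixed-split-size idea (the helper names and the 256/4 values are hypothetical, purely illustrative): with a fixed chunk size, a given KV position always falls in the same chunk, so its contribution is reduced at the same point regardless of sequence length; with a fixed number of chunks, the boundaries shift whenever the length changes.

```python
def splits_fixed_size(seq_len: int, split_size: int = 256):
    # Fixed-size chunks: boundaries never move as seq_len grows, so the
    # reduction order for any given KV position stays the same.
    return [(s, min(s + split_size, seq_len)) for s in range(0, seq_len, split_size)]

def splits_fixed_count(seq_len: int, num_splits: int = 4):
    # Fixed number of chunks: boundaries shift whenever seq_len changes,
    # so the same KV entries get grouped differently across requests.
    size = -(-seq_len // num_splits)  # ceiling division
    return [(s, min(s + size, seq_len)) for s in range(0, seq_len, size)]

print(splits_fixed_size(1000))   # chunk edges at 256, 512, 768, ...
print(splits_fixed_count(1000))  # chunk edges depend on the total length
print(splits_fixed_count(1024))  # ...and move when the length changes
```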

Experimental Results

Thinking Machines Lab demonstrated the effectiveness of their solution using the Qwen/Qwen3-235B-A22B-Instruct-2507 model.

  • With standard kernels, 1000 samplings produced 80 different results.

  • With batch-invariant kernels, the results were 100% consistent across 1000 samplings.

While there may be some performance costs, the deterministic results make it worthwhile for critical applications.

Implications and Significance

This research has significant implications:

  • It provides a scientific solution for the reproducibility and reliability of LLMs.

  • It enables on-policy reinforcement learning for LLMs by ensuring consistent behavior between training and inference.

Conclusion

Thinking Machines Lab's work addresses a critical issue in LLM inference, paving the way for more reliable and consistent AI systems. The focus on determinism highlights a new area for competition in LLM technology.
