This article explores the issue of non-determinism in large language model (LLM) inference, a common problem for developers and users. It delves into a recent research paper by Thinking Machines Lab, founded by former OpenAI CTO Mira Murati, which sheds light on the root cause and proposes solutions.
Background: Thinking Machines Lab
Thinking Machines Lab has garnered significant attention since its inception.
- The company completed a \$2 billion seed round at a \$12 billion valuation.
- The investment came before the release of any product, a rare occurrence in the AI industry.
- Investors included prominent firms such as a16z, NVIDIA, AMD, and Cisco.
- Such heavy investment in a company with no product yet is widely attributed to the strength of its research team and the valuable insights coming out of its research.
The Problem: Inconsistent Outputs from LLMs
Why do LLMs sometimes produce different outputs even when given the same input and fixed random seeds? This inconsistency can be frustrating and presents a significant challenge for real-world applications. Even when using open-source inference libraries like vLLM and SGLang on your own hardware, the issue persists.
Common Misconceptions
Many attribute this non-determinism to GPU concurrency or floating-point arithmetic errors. While these factors play a role, they are not the primary cause, according to Thinking Machines Lab.
The True Culprit: Batch Invariance Loss
The core issue lies in the lack of batch invariance during LLM inference.
- LLM inference servers handle requests in batches to optimize resource utilization.
- The size of these batches varies with server load.
- The research found that the same input can produce different outputs depending on the batch size it is processed in.
- The "non-associativity" of floating-point arithmetic is what allows a change in batch size to change the numerical result.
Floating-Point Non-Associativity
Floating-point numbers trade exactness for dynamic range, and as a result floating-point addition is not associative.
- For example, (a + b) + c may not equal a + (b + c) when a, b, and c are floating-point values (see the snippet after this list).
- This happens because a float carries limited precision relative to its magnitude: when a small value is added to a much larger one, the small value's low-order bits are rounded away, effectively "swallowed" by the larger number.
- These tiny rounding differences are magnified across the enormous number of additions performed during LLM inference.
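The effect is easy to reproduce in any language with IEEE 754 floats; the values below are chosen only to make the rounding visible:

```python
a, b, c = 1e16, -1e16, 1.0

# (a + b) cancels exactly to 0.0, so adding c afterwards keeps the 1.0.
print((a + b) + c)   # 1.0

# (b + c) rounds back to -1e16 because 1.0 is below the precision available
# at that magnitude, so the 1.0 is lost before the cancellation happens.
print(a + (b + c))   # 0.0
```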
Impact on Model Operations
The non-associativity of floating-point math can significantly affect calculations within LLMs. Specifically, this becomes an issue in operations such as:
- Matrix multiplication
- RMSNorm (Root Mean Square Normalization)
- Attention mechanisms

All of these operations involve long chains of floating-point additions and multiplications.
When servers process requests with different batch sizes, the reduction order of these operations changes. Different reduction orders can lead to different results.
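A toy NumPy sketch makes the point: summing the same values with different chunkings (standing in for different batch-size-dependent reduction strategies) typically yields slightly different float32 totals, while repeating the same chunking is bit-for-bit stable. The chunk sizes here are arbitrary illustrations, not what any real kernel uses.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

def chunked_sum(v, chunk):
    # Sum each chunk, then accumulate the partial sums. Changing the chunk
    # size changes the reduction tree, much as a kernel's reduction strategy
    # changes when the batch size changes.
    total = np.float32(0.0)
    for i in range(0, len(v), chunk):
        total += v[i:i + chunk].sum(dtype=np.float32)
    return total

print(chunked_sum(x, 1024) == chunked_sum(x, 1024))  # True: same order, same bits
print(chunked_sum(x, 1024) == chunked_sum(x, 4096))  # usually False: different order
```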
The GPU Concurrency Misunderstanding
The research clarifies that GPU concurrency itself is not the problem. Even on GPUs, fixing the reduction order ensures consistent results across multiple runs. It is the varying reduction strategies, dependent on batch size, that cause the issue.
The Solution: Batch-Invariant Kernels
The solution involves implementing "batch-invariant kernels" for core LLM operations, ensuring a consistent reduction order regardless of batch size.
RMSNorm Implementation
The goal is to force the RMSNorm calculation to always use a data-parallel approach, with each batch element's reduction completed within a single core. This eliminates variations in the reduction order.
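A pure-Python sketch of the data-parallel idea (not Thinking Machines Lab's actual kernel): each batch element is normalized by its own self-contained, fixed-order reduction, so the presence or absence of other rows in the batch cannot change its result.

```python
import numpy as np

def rmsnorm_row(x_row, weight, eps=1e-6):
    # Reduce this row's squared values in a fixed left-to-right order.
    acc = np.float32(0.0)
    for v in x_row.astype(np.float32):
        acc += v * v
    rms = np.sqrt(acc / np.float32(x_row.shape[0]) + np.float32(eps))
    return (x_row.astype(np.float32) / rms) * weight

def batch_invariant_rmsnorm(x, weight):
    # One row ("batch element") per worker: each reduction stays inside a
    # single row, mirroring the one-core-per-element strategy described above.
    return np.stack([rmsnorm_row(row, weight) for row in x])
```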
Matrix Multiplication Implementation
The matrix-multiplication kernels must not adjust their strategy based on batch size. To accomplish that:
- Disable "Split-K" strategies, which would otherwise change the reduction order depending on the batch size (a toy sketch of the fixed K-loop follows this list).
- Fix the tensor-core instruction size so the internal reduction order stays consistent.
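As a toy illustration (NumPy on the CPU, not a GPU kernel): keeping the loop over the K dimension identical for every batch size is exactly the property that a Split-K kernel gives up.

```python
import numpy as np

def fixed_order_matmul(a, b, k_tile=128):
    """Tiled matmul whose accumulation over K always proceeds tile by tile
    in the same order, no matter how many rows (batch elements) `a` has."""
    m, k = a.shape
    _, n = b.shape
    out = np.zeros((m, n), dtype=np.float32)
    # This K loop is the same for batch size 1 and batch size 1024. A Split-K
    # kernel would instead divide this loop across workers when the batch is
    # small, changing the order in which partial products are added together.
    for k0 in range(0, k, k_tile):
        out += a[:, k0:k0 + k_tile].astype(np.float32) @ b[k0:k0 + k_tile, :].astype(np.float32)
    return out
```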
Attention Mechanism Implementation
The attention mechanism is the most complex operation to make batch-invariant (a toy sketch of the fixed-split idea follows the list below).
- Fix the split size of the KV (key-value) cache, rather than splitting the cache into a fixed number of pieces.
- Consistently update the KV cache layout before the attention kernel runs, so that data-layout differences cannot creep in.
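A toy sketch of the fixed-split-size idea for the KV cache (the chunk size is an arbitrary illustration): with a fixed chunk size, the boundaries over the earliest tokens never move as the sequence grows, whereas a fixed number of splits shifts every boundary and, with it, the reduction order.

```python
def kv_chunks_fixed_size(seq_len, chunk_size=256):
    # Fixed split size: the chunk covering tokens [0, 256) is the same whether
    # the cache holds 300 or 30,000 tokens, so the order in which attention
    # reduces over the cache stays stable.
    return [(s, min(s + chunk_size, seq_len)) for s in range(0, seq_len, chunk_size)]

def kv_chunks_fixed_count(seq_len, num_chunks=4):
    # Fixed number of splits: every boundary shifts as seq_len grows, which
    # changes the reduction order from one decode step to the next.
    size = -(-seq_len // num_chunks)  # ceiling division
    return [(s, min(s + size, seq_len)) for s in range(0, seq_len, size)]

print(kv_chunks_fixed_size(600))   # [(0, 256), (256, 512), (512, 600)]
print(kv_chunks_fixed_count(600))  # [(0, 150), (150, 300), (300, 450), (450, 600)]
```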
Experimental Results
Thinking Machines Lab demonstrated the effectiveness of their solution using the Qwen/Qwen3-235B-A22B-Instruct-2507 model.
- With standard kernels, 1,000 samplings produced 80 different results.
- With batch-invariant kernels, the results were 100% consistent across all 1,000 samplings.
While there may be some performance costs, the deterministic results make it worthwhile for critical applications.
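A minimal sketch of this kind of determinism check using vLLM's offline API; the prompt, sample count, and decoding settings are illustrative rather than the paper's exact configuration, and a 235B model requires a correspondingly large multi-GPU setup.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-235B-A22B-Instruct-2507")
params = SamplingParams(temperature=0.0, max_tokens=200)

# Submit the same prompt many times; on a batch-variant server the
# completions can differ even at temperature 0.
prompt = "Explain why floating-point addition is not associative."
outputs = llm.generate([prompt] * 1000, params)
completions = [o.outputs[0].text for o in outputs]
print("unique completions:", len(set(completions)))  # 1 means fully deterministic
```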
Implications and Significance
This research has significant implications:
- It provides a principled solution for the reproducibility and reliability of LLM outputs.
- It enables on-policy reinforcement learning for LLMs by ensuring numerically consistent behavior between training and inference.
Conclusion
Thinking Machines Lab's work addresses a critical issue in LLM inference, paving the way for more reliable and consistent AI systems. The focus on determinism highlights a new axis of competition in LLM technology.