Video thumbnail for 【AI】Why doesn't everyone use AMD? | SemiAnalysis six-month deep-dive benchmark | Real-world performance | MI300X | MI325X | H200+TRT-LLM dominates | TCO | Rental market

AMD vs NVIDIA: Why AMD GPUs Still Struggle in AI? (SemiAnalysis Deep Dive)

Summary

Quick Abstract

Delve into the AI compute war between Nvidia and AMD GPUs! A new, in-depth report dissects why Nvidia dominates data centers despite AMD's seemingly superior hardware. We explore technical performance, total cost of ownership, and the crucial role of software ecosystems and rental-market dynamics in shaping market share. Discover the hidden complexities behind AI server performance and the factors driving enterprise purchasing decisions.

Quick Takeaways:

  • AMD's MI300X shows TCO advantages, especially with large, dense models.

  • Nvidia's CUDA ecosystem and software optimization give it a crucial edge, particularly in low-latency scenarios.

  • AMD struggles with software ecosystem development and limited rental market availability.

  • H200 often outperforms MI300X in low-latency tasks.

  • Memory bandwidth impacts performance with different workloads.

  • Market segmentation reflects supply chain and business model differences.

  • AMD's future depends on faster software optimization and ecosystem development to rival Nvidia's dominance.

Introduction

Hello everyone, this is Zui Jia Paitang, and I'm Dafei. In today's AI compute arms race, Nvidia's GPUs have become the de facto standard for data centers worldwide, while AMD's products always seem to sit outside the mainstream. Even though AMD has released heavyweight products like the MI300X and MI325X in recent years, surpassing Nvidia on some spec-sheet parameters, market reception has been lukewarm. A semiconductor research firm, SemiAnalysis, spent six months on an extensive in-depth report that may unravel the mystery.

The Test Setup

Goal and Methodology

For a long time, the market has held that AMD's AI servers deliver better inference performance per dollar of total cost of ownership (TCO). To verify this, the SemiAnalysis team conducted a six-month marathon of testing. Their aim was to compare the real-world performance of AMD and Nvidia inference solutions in a production-like environment. The core methodology was to move beyond the limits of traditional offline benchmarks and focus on the dynamic trade-off between online load and end-to-end latency, simulating real user-facing request patterns.

Model Selection

The testing team selected representative models of both dense and sparse mixture-of-experts (MoE) architectures. For the dense architecture, they chose Llama3 70B at FP16 precision and Llama3 405B at FP8. For the sparse MoE architecture, they used the DeepSeek V3 670B model at FP8 precision. DeepSeek V3's model structure is believed to be close to that of OpenAI's models, which gives its test results important reference value.

Input and Output Token Length Combinations

To reflect real inference workloads and their performance characteristics, the team ran baseline tests with three input/output token-length combinations. For compute-heavy prefill tasks, they used 4K input and 1K output. For balanced dialogue tasks, 1K input and 1K output. For memory-bandwidth-sensitive decode tasks, 1K input and 4K output. Together these three scenarios cover a range of business needs.
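The three scenarios above can be sketched as a small configuration table. This is an illustrative sketch only: the scenario names and the `decode_fraction` heuristic (the share of generated tokens, a rough proxy for how memory-bandwidth-bound a workload is) are my own framing, not taken from the report.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    input_tokens: int
    output_tokens: int

# The three input/output combinations described above (4K ~ 4096, 1K ~ 1024)
SCENARIOS = [
    Scenario("prefill-heavy (e.g. summarization)", 4096, 1024),
    Scenario("balanced chat/translation", 1024, 1024),
    Scenario("decode-heavy (bandwidth-sensitive)", 1024, 4096),
]

def decode_fraction(s: Scenario) -> float:
    """Share of generated (decode) tokens in the total token budget."""
    return s.output_tokens / (s.input_tokens + s.output_tokens)

for s in SCENARIOS:
    print(f"{s.name}: decode share = {decode_fraction(s):.0%}")
```

The higher the decode share, the more time is spent in the memory-bandwidth-bound generation phase, which is why the 1K/4K scenario favors high-bandwidth parts.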

Inference Engine Choice

The choice of inference engine was also a challenge. vLLM became the main test framework for the Llama3 series thanks to its broad compatibility. TensorRT-LLM showcased Nvidia's deep optimization on its own hardware. And SGLang became the first choice for DeepSeek V3 because of its efficiency with very large models. The team also evaluated the impact of batch size and tensor-parallel (TP) configuration.

Hardware Specifications Comparison

In terms of hardware specifications, AMD's MI300X has 192GB of HBM capacity and 5.3 TB/s of memory bandwidth, giving a single 8-GPU node a theoretical aggregate of 42.4 TB/s. The MI325X has 256GB of HBM and 6 TB/s of bandwidth. Nvidia's H200 has 141GB of HBM and 4.8 TB/s of bandwidth, while the Blackwell-architecture B200 boasts a staggering 8 TB/s per GPU and 64 TB/s of theoretical single-node bandwidth. This bandwidth gap may help explain AMD's drop in market share in Q1 2025 after Nvidia's new product launch.
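The "single-node theoretical bandwidth" figures are just the per-GPU bandwidth scaled by the eight GPUs in a standard node. A minimal sketch of that arithmetic, using the per-GPU numbers quoted above:

```python
# Per-GPU HBM bandwidth in TB/s, from the spec comparison above.
per_gpu_bw = {"MI300X": 5.3, "MI325X": 6.0, "H200": 4.8, "B200": 8.0}

# Both vendors ship 8-GPU baseboards, so the node aggregate is 8x per-GPU.
GPUS_PER_NODE = 8

for gpu, bw in per_gpu_bw.items():
    print(f"{gpu}: {GPUS_PER_NODE * bw:.1f} TB/s theoretical per node")
```

This reproduces the 42.4 TB/s (MI300X) and 64 TB/s (B200) node figures cited in the report summary.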

Test Results

Llama3 70B FP16 Test

In the Llama3 70B FP16 test, different scenarios produced different winners. In the 1K-input / 1K-output balanced tasks typical of chat and translation, H100 and H200 running vLLM took an easy lead at low latency, but as batch size grew, MI325X's high-bandwidth advantage emerged. In the 1K-input / 4K-output memory-sensitive task, H100 lagged due to bandwidth limits, while MI325X was relatively stable but drew the most power at high latency. H200 paired with TensorRT-LLM showed an across-the-board advantage.

Llama3 405B FP8 Test

In the Llama3 405B FP8 large dense-model test, AMD's hardware advantage was more apparent. In the 1K-input / 1K-output scenario, at end-to-end latencies under 40 seconds, MI325X and MI300X beat all of Nvidia's configurations, though H200 with TensorRT-LLM maintained high performance. In the 1K-input / 4K-output memory-bound scenario, MI325X outperformed H200 running vLLM, but H200 backed by TensorRT-LLM still led.

DeepSeek V3 670B FP8 Test

In the DeepSeek V3 670B FP8 test, H100 could not run the model at all due to single-node memory limits. In low-latency, high-interaction chat scenarios, H200 won almost across the board; MI325X could compete only in a narrow range. But in high-latency tasks, MI325X's performance per dollar was 20% to 30% higher than H200's.

Overall, MI300X lacks competitiveness against H200 in most scenarios, especially low-latency ones, but AMD shows unique advantages in specific areas such as very large models.

Total Cost of Ownership (TCO)

From a long-term perspective, AMD holds an advantage in TCO. The all-in cost of an MI300X is $1.34 per hour, lower than the H100's $1.58 and the H200's $1.63. In the ultra-low-latency Llama3 70B task, MI325X and MI300X have a better cost per million tokens. But as acceptable latency increases, Nvidia's economies of scale and software optimization flip the cost-efficiency ranking back in its favor. Moreover, the MI325X's price increase outpaces its performance gain, so it does not translate into a cost advantage.
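The "cost per million tokens" metric behind these comparisons is simple: divide the all-in hourly cost by hourly token throughput. A minimal sketch of the arithmetic, using the hourly TCO figures quoted above; the throughput numbers are made-up placeholders purely to illustrate the calculation, not measurements from the report.

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Convert an all-in hourly GPU cost into $ per million tokens served."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000

# Hourly TCO figures from the report; throughputs below are hypothetical.
configs = [("MI300X", 1.34, 500.0), ("H100", 1.58, 500.0), ("H200", 1.63, 600.0)]

for name, hourly, tps in configs:
    print(f"{name}: ${cost_per_million_tokens(hourly, tps):.3f} per 1M tokens")
```

The takeaway: a cheaper hourly rate only wins if throughput at the target latency is comparable, which is exactly why software optimization (and thus achieved tokens/s) can reverse AMD's raw TCO edge.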

Software Ecosystem and Market Segmentation

Software Ecosystem Gap

AMD's real challenge lies in building its software ecosystem. Nvidia's CUDA ecosystem has been accumulating for decades, with over 2 million developers and tens of thousands of applications. AMD's ROCm platform lags CUDA in CI coverage, documentation quality, and kernels. Even Nvidia's TensorRT-LLM developer experience is still being polished, yet AMD's SGLang market coverage is less than 10% of Nvidia's, and the technical barrier remains high for ordinary users.

Market Segmentation

The rental market forms another barrier for AMD. More than 100 emerging cloud service providers offer medium-term rentals of Nvidia GPUs, driving rental prices down. For AMD, only a handful of suppliers offer short-term rentals of MI300X and MI325X, and at high prices. This makes it unlikely that small and medium-sized enterprises will choose AMD for short-term needs.

Future Outlook

SemiAnalysis believes AMD's release cadence lags Nvidia's Blackwell generation. Volume shipments of the MI325X trail the H200 by a quarter, and the MI355X will not arrive until the end of 2025. But AMD still has a chance: the MI355X, with 288GB of HBM and 8 TB/s of bandwidth, could reshape part of the competitive landscape in 2026 if software optimization keeps pace.

This two-way rivalry drives technological progress. Nvidia's ecosystem lock-in and AMD's cost advantage hold each other in balance. The custom demands of hyperscalers and the flexible purchasing models of small and medium-sized enterprises coexist, pointing to an even more fragmented compute market ahead. Developers and users alike need to watch closely how the hardware and software ecosystems intertwine.

About the Author

The author, Dylan, is remarkably persistent in comparing Nvidia and AMD. He has previously written two articles analyzing AMD GPU problems and was once interviewed by Suma. Now he has written another. Do you agree with the SemiAnalysis article's assessment of AMD? Feel free to leave a comment below. Thanks for watching this video, and see you next time.
