
96GB RTX Pro 6000 vs RTX 5090: Is it Worth it for AI?

Quick Abstract

Is the new RTX Pro 6000 worth the hefty price tag for AI tasks? This video benchmarks NVIDIA's high-end card, comparing its tokens-per-second performance against the RTX 5090 and Apple's M4 Max. It explores various model sizes and quantization levels, including a massive 35,000-token prompt, to uncover the card's true potential.

Quick Takeaways:

  • The RTX Pro 6000 boasts 96GB VRAM, enabling larger models and complex tasks.

  • Achieved impressive speeds: 215 tokens/second with Gemma 3, and 20 tokens/second on a 70B parameter model with full layer offload.

  • Surprisingly, the RTX 5090 often outperformed the Pro 6000 on smaller, quantized models.

  • FP16 and F32 models ran at similar, or slightly better, speeds on the Pro 6000 than on the RTX 5090.

  • High power consumption (up to 600W) and noticeable coil whine are factors to consider.

  • The card offers better value per dollar compared to the M3 Ultra Mac Studio for specific AI workflows.

Ultimately, the RTX Pro 6000 shines with large models, but its performance edge isn't consistent across all tasks, raising questions about its overall value.

RTX Pro 6000 vs. RTX 5090: A Deep Dive into AI Performance

This article compares the new RTX Pro 6000 graphics card with other high-end cards, especially the RTX 5090, focusing on AI model performance, tokens per second, and value for money. The testing environment is a custom-built AI machine running Linux. All cards were purchased independently.

Initial Impressions and Setup

The RTX Pro 6000 is a substantial card with 96 GB of VRAM. The aim is to assess its token-generation speed on AI models and determine whether its high price is justified. Initial setup involves connecting the power cable and a display (the card has four DisplayPort outputs in total).

First Test: A Model the RTX 5090 Struggled With

The first test used a model that had previously performed poorly on the RTX 5090 due to VRAM limitations. With all 80 layers offloaded to the GPU, the model loaded fully into the RTX Pro 6000's VRAM, occupying 43 GB. On a short prompt, the card achieved 31.89 tokens per second.
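For anyone reproducing this outside LM Studio, the same full-offload setup can be sketched with llama-cpp-python, the Python bindings for llama.cpp (the engine LM Studio uses for GGUF models). The model path below is hypothetical; `n_gpu_layers` corresponds to LM Studio's layer-offload slider.

```python
from llama_cpp import Llama

# Hypothetical GGUF path; any 80-layer Q4 model behaves the same way.
llm = Llama(
    model_path="models/llama-70b-instruct.Q4_K_M.gguf",
    n_gpu_layers=80,  # offload all 80 layers; -1 also means "everything"
    n_ctx=4096,
)

out = llm("Say hello in five words.", max_tokens=16)
print(out["choices"][0]["text"])
```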

Comparison with RTX 5090

The RTX Pro 6000 brings a significant CUDA-core bump, 24,064 cores versus the RTX 5090's 21,760. With the Gemma 3 model, the RTX Pro 6000 achieved 215 tokens per second, while the RTX 5090 managed about 93.
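Tokens per second in these tests is simply generated tokens divided by generation time. A rough way to measure it yourself, again assuming llama-cpp-python and a hypothetical model path:

```python
import time
from llama_cpp import Llama

# Hypothetical path; swap in whatever GGUF you want to benchmark.
llm = Llama(model_path="models/gemma-3.Q4_K_M.gguf", n_gpu_layers=-1, verbose=False)

start = time.perf_counter()
first = None
n_tokens = 0
for _ in llm("Explain VRAM in one paragraph.", max_tokens=256, stream=True):
    if first is None:
        first = time.perf_counter()  # prompt processing ends here
    n_tokens += 1

# Counting from the first token excludes prompt processing, which is
# how LM Studio-style generation tokens/second is usually reported.
print(f"{n_tokens / (time.perf_counter() - first):.1f} tokens/s")
```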

Testing with Large Models (70B Parameters)

The RTX Pro 6000's large VRAM is ideal for large language models. Different quantization levels (Q4 and Q8) were explored, impacting model size and potentially output quality. Offloading layers to the GPU is crucial for optimal performance.

  • Q4: Around 40GB in size.

  • Q8: Around 70GB in size.

Initially, with only 66 of the 80 layers offloaded, performance was poor at 3 tokens per second; the layers left on the CPU bottlenecked every token. Offloading all 80 layers raised the speed to 20 tokens per second.
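Those file sizes follow directly from parameter count times bits per weight. A back-of-the-envelope sketch (the bits-per-weight values are assumed ballpark figures, since real quants mix tensor types) also shows why the FP16 and F32 tests later use smaller models:

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate model file size in GB: parameters x bits per weight."""
    return params_billion * bits_per_weight / 8

# Assumed typical effective bits per weight for each format.
for fmt, bits in [("Q4", 4.8), ("Q8", 8.5), ("FP16", 16.0), ("F32", 32.0)]:
    print(f"70B @ {fmt}: ~{model_size_gb(70, bits):.0f} GB")

print(f"32B @ FP16: ~{model_size_gb(32, 16):.0f} GB")  # fits in 96 GB VRAM
print(f"7B @ F32:  ~{model_size_gb(7, 32):.0f} GB")    # the later Mistral test
```

At 70B parameters, even FP16 (~140 GB) overflows the card, which is why the full-precision tests drop to 32B and 7B models.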

Long Prompt Testing

A 205-token prompt yielded approximately 18 tokens per second. The video then cut to a sponsored segment for ChatLLM Teams, a dashboard that bundles access to top LLMs.

Performance with FP16 and F32 Models

The RTX Pro 6000 was also tested with 16-bit (FP16) and 32-bit (F32) floating-point models. The Qwen Coder 32B Instruct FP16 model produced 23 tokens per second, and a longer prompt yielded a similar speed. The Mistral 7B F32 model hit 51 tokens per second. Running the same models on an M4 Max machine for comparison, the RTX Pro 6000 was significantly faster, for example 22 tokens per second versus 7.63 on the M4 Max.

Pushing the Limits: A 35,000 Token Prompt

A generated 35,000-token prompt initially failed to load because of LM Studio's context-length limits. It was eventually tested with Qwen 2.5 Coder 32B Instruct Q8, a model that supports a larger context, with the context length set to 40,000. The model loaded successfully, using 74.9 GB of VRAM, though processing was slow.

  • Token Generation Speed: 17 tokens per second.

  • Time to First Token: 29.9 seconds.
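The fix for the failed load is raising the runtime's context window. Here is a hedged sketch of the same experiment with llama-cpp-python, where `n_ctx` plays the role of LM Studio's context-length setting (file names are hypothetical):

```python
import time
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen2.5-coder-32b-instruct.Q8_0.gguf",  # hypothetical name
    n_ctx=40_000,     # room for the 35,000-token prompt plus the reply
    n_gpu_layers=-1,  # keep every layer on the GPU; the KV cache and context
                      # buffers are why VRAM use climbs well past the model file
)

long_prompt = open("prompt_35k.txt").read()  # hypothetical 35,000-token input

start = time.perf_counter()
stream = llm(long_prompt, max_tokens=512, stream=True)
next(stream)  # blocks until prompt processing finishes
print(f"time to first token: {time.perf_counter() - start:.1f} s")
```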

Head-to-Head Comparison

A comparison of the RTX Pro 6000 with other cards revealed some interesting findings.

  • The RTX Pro 6000 was the only card capable of running a 70-billion parameter model.

  • The RTX 5090 was surprisingly faster than the RTX Pro 6000 in several tests, particularly with smaller, quantized models (Q4).

  • FP16 and F32 models performed similarly, or slightly better, on the RTX Pro 6000.

  • These findings were unexpected given the RTX Pro 6000's greater CUDA core count.

Value Analysis

The fastest speed observed on the RTX Pro 6000 was 215 tokens per second, versus 100 on the M3 Ultra.

  • M3 Ultra Mac Studio: $10,000

  • RTX Pro 6000: $7,500 - $11,000

Even at full price, the RTX Pro 6000 offers almost two times the value (tokens per second per dollar) compared to the Mac Studio. The large VRAM is a significant advantage, despite its cost.
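That value claim is straightforward arithmetic on the figures quoted above; a quick check:

```python
# Tokens/second and prices as quoted above.
systems = {
    "M3 Ultra Mac Studio":       (100, 10_000),
    "RTX Pro 6000 (low price)":  (215,  7_500),
    "RTX Pro 6000 (full price)": (215, 11_000),
}

for name, (tps, price) in systems.items():
    print(f"{name}: {tps / price * 1000:.1f} tokens/s per $1,000")
# Mac Studio: 10.0; RTX Pro 6000: 28.7 at $7,500, 19.5 at $11,000,
# i.e. almost twice the Mac Studio's throughput per dollar even at full price.
```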

Conclusion

Despite the higher cost, the RTX Pro 6000 offers a significant performance boost for large AI models due to its 96GB VRAM. The RTX 5090, however, proved to be surprisingly competitive, and even faster in some tests with smaller, quantized models.
