Exploring Machine Learning on Apple Silicon: Clustering and Large Language Models
This article explores the capabilities of Apple Silicon machines, particularly Mac Studios, for running large language models (LLMs). It compares these machines to high-end GPUs, discusses clustering techniques, and delves into the challenges and potential of distributed machine learning on Apple hardware.
The Memory Bottleneck in Machine Learning
Machine learning models, especially LLMs, demand significant memory. High-end GPUs have traditionally been favored for their parallel processing power, but options like Nvidia's H100 and the consumer-grade RTX 5090 come with high prices, high power draw, and comparatively limited VRAM (e.g., 32GB on the RTX 5090). They can also be hard to find in stock.
Apple Silicon's Advantage: Unified Memory and Power Efficiency
Apple Silicon machines offer a compelling alternative. While not as fast as Nvidia GPUs in raw processing power, they are readily available, highly power-efficient, and offer large amounts of unified memory. Because the CPU and GPU share a single memory pool, the GPU can address far more memory than a typical discrete card, making it possible to run a 70-billion-parameter model on a machine with 128GB of RAM.
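For a sense of what this looks like in practice, here is a minimal sketch using the mlx-lm package. The model name is an assumption for illustration; any ~70B 4-bit quantized model that fits in 128GB of unified memory behaves similarly, and exact API details can vary between mlx-lm versions.

```python
# Minimal sketch (pip install mlx-lm). The model name below is an
# assumption; substitute any ~70B 4-bit quantized model you have access to.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3-70B-Instruct-4bit")
text = generate(model, tokenizer, prompt="Explain unified memory.", max_tokens=64)
print(text)
```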
Clustering with Mac Studios: Experimenting with EXO and MLX Distributed
The article investigates clustering multiple Mac Studios to further expand memory capacity. It explores two primary methods:
- EXO: A clustering solution designed for simplicity and automatic operation.
- MLX Distributed: Apple's framework for distributed machine learning on Apple Silicon, offering optimized performance.
The initial tests involved a cluster of four M4 Max Mac Studios, each with 128 GB of memory. The goal was to run models exceeding the memory capacity of a single machine.
Setting Up an MLX Distributed Cluster
The following steps are crucial for setting up an MLX Distributed cluster:
- SSH Configuration: Enable SSH on all machines and allow passwordless login between them.
- Consistent Python Environments: Use a tool like Conda to create identical Python environments on every machine, ensuring consistent package versions and dependencies.
- Host File Configuration: Create a `hosts.json` file that defines the hostnames of each machine in the cluster (see the sketch after this list).
- Network Setup: Establish a local network connection between the machines, using either Ethernet or Thunderbolt.
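As an illustration, the host file can be generated with a short Python script. The hostnames are placeholders, and the `{"ssh": ...}` entry format follows the pattern used in MLX's distributed examples; verify the exact schema against your installed MLX version.

```python
import json

# Placeholder hostnames; each machine must accept passwordless SSH.
hosts = [
    {"ssh": "studio-1.local"},
    {"ssh": "studio-2.local"},
    {"ssh": "studio-3.local"},
    {"ssh": "studio-4.local"},
]

# Write the host file that the MLX launcher reads.
with open("hosts.json", "w") as f:
    json.dump(hosts, f, indent=2)
```

The distributed job is then typically started with MLX's launcher, e.g. `mlx.launch --hostfile hosts.json my_script.py` (the flag names here are an assumption to check against your MLX version's documentation).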
Ethernet vs. Thunderbolt for Networking
The article compares Ethernet and Thunderbolt for inter-machine communication:
- Ethernet (10 Gigabit): Offers a straightforward setup and good stability. Speed tests showed transfer speeds around 9.4 Gbit/s.
- Thunderbolt Bridge: Requires manual IP address configuration but provides significantly faster transfers (around 65 Gbit/s). However, MLX Distributed doesn't fully leverage this speed advantage, because it assumes model weights are already loaded on each machine rather than streamed over the network.
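A quick way to reproduce this kind of link benchmark is iperf3, which the sketch below drives from Python. The hostname is a placeholder; iperf3 must be installed on both machines (e.g., via Homebrew), with `iperf3 -s` running on the peer.

```python
import subprocess

# Run a 5-second throughput test against the peer machine.
# Prerequisite: `iperf3 -s` is already running on studio-2.local.
result = subprocess.run(
    ["iperf3", "-c", "studio-2.local", "-t", "5"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # summary reports throughput in bits per second
```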
Performance Evaluation: Scaling Challenges and Model Size Limitations
Experiments with a smaller model (DeepSeek Coder V2 Lite Instruct, 4-bit) revealed interesting scaling behavior. While a single machine achieved 173 tokens per second, adding a second machine slowed generation to 107 tokens per second, and growing the cluster to four machines dropped it further to 79 tokens per second. Since the model already fits on one machine, the per-token communication between machines only adds overhead.
The limitations become more pronounced with very large models. The DeepSeek R1 4-bit model (420 GB) couldn't run on two machines with 128 GB of RAM each, since their combined 256 GB falls well short of the model's footprint. Even with four machines (512 GB combined), the process struggled to utilize the GPUs effectively, reaching only 15 tokens per second.
Introducing the Mac Studio with 512 GB Unified Memory
The article highlights the benefits of Apple's Mac Studio with 512GB of unified memory. It can run the DeepSeek R1 model comfortably, achieving 19 tokens per second. This demonstrates the value of having a large memory pool in a single machine for demanding tasks.
Comparing the M4 Max and M3 Ultra
The M4 Max and M3 Ultra chips were compared running smaller models, with the model fully offloaded to the GPU. Surprisingly, the M4 Max was faster in these particular tests.
The Challenge of Uneven Memory Distribution
The article then explores running larger models across a mixed cluster: one Mac Studio with 512GB of memory and four others with 128GB each. The MLX framework attempts to split a single large model across the devices.
The primary challenge involved uneven memory distribution, specifically with the DeepSeek V3 model (750 GB). MLX Distributed's dynamic load distribution aims to allocate parts of the model based on machine capabilities. However, initial attempts to leverage the 512 GB machine more effectively were unsuccessful.
The author discovered that MLX Distributed uses MPI (Message Passing Interface) under the hood, and that MPI's rank assignment does not follow the order in which machines are listed in the host file. As a result, the larger-memory machine was not receiving a proportionally larger share of the model, as the sketch below illustrates.
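To make this concrete, here is an illustrative sketch of rank-proportional layer sharding on top of MLX's distributed API. The `mx.distributed.init()`, `rank()`, and `size()` calls are real MLX primitives; the memory table and the proportional split are hypothetical, not MLX Distributed's built-in behavior.

```python
import mlx.core as mx

# Initialize the communication group (MPI-backed in MLX Distributed).
group = mx.distributed.init()
rank, size = group.rank(), group.size()

# Hypothetical budgets: assume rank 0 is the 512 GB machine and the
# rest have 128 GB each. MPI assigns ranks independently of the
# hosts.json order, which is exactly why this assumption can fail.
memory_gb = [512] + [128] * (size - 1)
total = sum(memory_gb)

num_layers = 61  # e.g., DeepSeek V3's transformer depth
share = round(num_layers * memory_gb[rank] / total)  # rounding is illustrative
print(f"rank {rank} of {size}: ~{share} of {num_layers} layers")
```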
Final Thoughts and Recommendations
The article concludes with several key takeaways:
- For smaller models (30-70 billion parameters), a single machine with 128GB of RAM is sufficient.
- Clustering becomes necessary for models exceeding the memory capacity of a single machine.
- MLX Distributed shows promise but still faces challenges in efficiently utilizing clusters with heterogeneous memory configurations.
- Ethernet offers simplicity and stability, while Thunderbolt provides faster data transfer (though not always fully utilized by MLX Distributed).
- Until improved memory distribution strategies are developed, using identical machines for clustering is recommended.
The author acknowledges the complexities and ongoing development in this field, expressing hope for future advancements that will unlock the full potential of distributed machine learning on Apple Silicon.