Large language models (LLMs) are powerful, but their high hardware requirements make them inaccessible to many. This article explores the innovative techniques researchers are developing to reduce the hardware needed to run these models, focusing on quantization and the groundbreaking BitNet architecture.
The Problem: Hardware Costs and Model Size
Running state-of-the-art open-source models like DeepSeek-V3 can be prohibitively expensive, requiring hardware costing upwards of $400,000. While researchers have created smaller models or distilled larger ones to reduce hardware needs, even these scaled-down versions can require expensive GPUs costing around $20,000. That often leaves users stuck with smaller, less capable models, which can be frustrating to work with.
Quantization: Reducing Precision for Efficiency
One approach to reducing hardware demands is quantization. This technique reduces the number of bits used to store each model weight, shrinking the model's memory footprint.
How Quantization Works
In a typical model, weights are stored in FP16 (16 bits, or 2 bytes, per weight), so a 7 billion parameter model needs roughly 14 GB just for its weights. To keep inference fast, these weights are loaded into GPU VRAM, which is a problem for users with less VRAM (e.g., 8 GB). Offloading (shuttling weights between system RAM and VRAM on demand) is an option, but it slows generation down significantly.
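To make those numbers concrete, here is a quick back-of-the-envelope calculation (weights only; activations and the KV cache add to the real footprint):

```python
# Rough weight-memory math for a 7-billion-parameter model at different precisions.
PARAMS = 7_000_000_000

for name, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    gigabytes = PARAMS * bits / 8 / 1e9
    print(f"{name:5s} -> ~{gigabytes:4.1f} GB")

# FP16  -> ~14.0 GB   (does not fit in an 8 GB GPU without offloading)
# FP8   ->  ~7.0 GB
# INT4  ->  ~3.5 GB
```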
Quantization addresses this by using fewer bits per weight (e.g., FP8, INT4). This reduces memory usage but also decreases precision. For values around 1, the smallest step FP16 can represent is about 0.001; in FP8 it is about 0.125, and in INT4 (before rescaling by a scale factor) it is 1. With fewer representable values, weights must be rounded, which can hurt prediction accuracy. To mitigate this, quantized models are typically calibrated on a small dataset, and sometimes fine-tuned after quantization.
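As a rough illustration, here is a minimal sketch of symmetric per-tensor INT4 quantization in Python. Real schemes (per-channel scales, GPTQ, AWQ, calibration data) are more elaborate, but the core round-and-rescale step looks like this; the function names are our own, not from any particular library:

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric per-tensor INT4 quantization: map floats onto integers in [-7, 7]."""
    scale = np.max(np.abs(w)) / 7.0 + 1e-12   # one shared scale for the whole tensor
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights; the rounding error is the lost precision."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32) * 0.1
q, scale = quantize_int4(w)
print("max rounding error:", np.max(np.abs(w - dequantize(q, scale))))
```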
Benefits of Quantization
Despite the potential loss of precision, quantization offers significant benefits:
- Reduced memory usage: cutting memory usage by at least half is possible.
- Minimal performance drop: the performance difference between FP16 and FP8 is often small.
- Overall improvement: running a quantized larger model can be better than using a smaller, full-precision model.
BitNet: A Revolutionary Approach
BitNet is a novel architecture designed to drastically reduce hardware requirements by using extremely low-precision weights.
BitNet's Core Concept
The initial BitNet paper proposed using only 1-bit weights, each representing either 1 or -1. This drastically reduces storage requirements and removes the multiplications from matrix multiplication, replacing them with simple additions and subtractions. While technically challenging, training a model from scratch with this 1-bit setup is more stable than quantizing a fully trained model down to one bit.
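As a toy illustration (not an actual BitNet kernel), the sketch below shows why 1-bit weights remove the multiplications: with weights restricted to +1 and -1, each output element is just a signed sum of the inputs:

```python
import numpy as np

def binary_linear(x: np.ndarray, w_sign: np.ndarray) -> np.ndarray:
    """
    Matrix 'multiplication' when every weight is +1 or -1.
    Each output is a sum of inputs with some signs flipped, so no
    floating-point multiplies are needed, only additions and subtractions.
    """
    out = np.zeros((x.shape[0], w_sign.shape[1]), dtype=x.dtype)
    for j in range(w_sign.shape[1]):
        plus = x[:, w_sign[:, j] == 1].sum(axis=1)    # add where the weight is +1
        minus = x[:, w_sign[:, j] == -1].sum(axis=1)  # subtract where it is -1
        out[:, j] = plus - minus
    return out

x = np.random.randn(2, 8).astype(np.float32)
w = np.random.choice([-1, 1], size=(8, 4)).astype(np.int8)
assert np.allclose(binary_linear(x, w), x @ w, atol=1e-5)  # matches a real matmul
```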
BitNet b1.58: Introducing Sparsity
BitNet b1.58 added a third weight state, zero, alongside 1 and -1. The zero state introduces sparsity, letting the model effectively switch off individual connections between neurons, which improves quality while keeping the simplified, multiplication-free computation.
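The b1.58 paper describes an "absmean" scheme for mapping full-precision weights onto {-1, 0, +1}; the snippet below is a rough, simplified sketch of that idea (names are our own), showing where the zeros, and hence the sparsity, come from:

```python
import numpy as np

def absmean_ternary(w: np.ndarray, eps: float = 1e-5):
    """
    Ternarize weights to {-1, 0, +1}: scale by the mean absolute value,
    then round and clip. Weights near zero round to 0, which is where
    the sparsity comes from.
    """
    gamma = np.mean(np.abs(w)) + eps
    w_ternary = np.clip(np.round(w / gamma), -1, 1).astype(np.int8)
    return w_ternary, gamma

w = np.random.randn(6, 6).astype(np.float32) * 0.02
w_t, gamma = absmean_ternary(w)
print("fraction of zeroed connections:", np.mean(w_t == 0))
```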
Performance and Memory Savings
BitNet b1.58 offers substantial memory savings and speed improvements over LLaMA models of the same size. For example:
- At 1.3 billion parameters, BitNet b1.58 uses nearly three times less memory and is roughly 1.7 times faster than the equivalent LLaMA model.
- At 3 billion parameters, it uses about 3.5 times less memory and is 2.7 times faster.
- Scaled up to 70 billion parameters, BitNet b1.58 requires 7.16 times less memory than LLaMA 70B.
Remarkably, a 70 billion parameter BitNet b1.58 model is more efficient in terms of generation speed, memory usage, and energy consumption than a 13 billion parameter full-precision LLM. The "b1.58" in the name refers to the information content per weight: with three possible states, each weight carries log2(3) ≈ 1.58 bits.
BitNet a4.8: Optimizing Activations and KV Cache
Further research led to BitNet a4.8, which focuses on reducing the precision of activations as well. It uses 4-bit activations for the inputs to attention and the feedforward network, while keeping 8-bit precision for intermediate states where accuracy matters most. It also supports a 3-bit KV cache, which shrinks the memory needed for long contexts and makes large context windows far cheaper.
BitNet a4.8 is also a sparse model, activating only about 55% of its parameters for each input.
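As a simplified sketch of the activation side, the snippet below applies per-token absmax quantization to 4 bits. This is only one plausible way to do it; the actual BitNet a4.8 recipe combines hybrid quantization and sparsification and is more involved:

```python
import numpy as np

def quantize_activations_int4(x: np.ndarray):
    """
    Per-token absmax quantization of activations to 4 bits (integers in [-8, 7]).
    Each token row gets its own scale, computed from its largest absolute value.
    """
    scale = np.max(np.abs(x), axis=-1, keepdims=True) / 7.0 + 1e-8
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

x = np.random.randn(2, 16).astype(np.float32)   # a small batch of token activations
q, scale = quantize_activations_int4(x)
x_hat = q.astype(np.float32) * scale            # dequantize to inspect the error
print("max activation error:", np.max(np.abs(x - x_hat)))
```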
Training Costs and Scaling
BitNet is very energy efficient, which makes it attractive for training as well as inference. It's estimated that training a BitNet model can cost significantly less than training a traditional transformer of comparable size. This efficiency, combined with its memory advantages, makes BitNet a promising architecture for scaling LLMs.
The Future of BitNet
BitNet's performance is promising. Further advancements are anticipated, particularly in hardware optimized for ternary operations and in long-context performance. The BitNet b1.58 2B4T model is currently available on Hugging Face, and ongoing research continues to tackle the technical challenges of making ternary operations work well in transformers.
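For readers who want to try it, here is a hedged sketch of loading the released checkpoint through the standard Hugging Face transformers API. The repository id and version requirements are assumptions, so check the model card; the claimed efficiency gains also depend on running with a BitNet-aware runtime such as bitnet.cpp rather than plain PyTorch:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id; verify the exact name on the Hugging Face model card.
# The model may also require a recent transformers release to load correctly.
model_id = "microsoft/bitnet-b1.58-2B-4T"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Ternary weights let a model", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```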