
1-Bit LLMs Explained: How Extreme Quantization Works

Summary

Quick Abstract

Delve into the fascinating world of one-bit Large Language Models (LLMs)! Despite the catchy name, are they truly one-bit? This summary explores Microsoft's BitNet, a leader in this area, and discusses how it achieves impressive performance with drastically reduced memory usage, opening doors for local LLM execution and enhanced privacy. Learn about the architecture, training techniques, and potential impact of quantized LLMs.

Quick Takeaways:

  • "One-bit" LLMs like BitNet are actually closer to 1.58 bits or ternary weights.

  • Quantization significantly reduces memory footprint, enabling faster inference and smaller checkpoints.

  • BitNet utilizes quantization-aware training (QAT) to maintain accuracy with low-precision weights.

  • Element-wise lookup tables (ELUT) are employed for efficient storage of ternary weights.

  • Despite current limitations, scaling laws suggest that larger one-bit LLMs could be very powerful.

  • Companies like Microsoft and Google are pushing low-bit LLMs, envisioning local AI on consumer devices.

This approach lets inference run faster and use less memory while maintaining strong performance. We will also cover how these models are stored and what kind of results they achieve.

Understanding One-Bit LLMs: A Deep Dive

Microsoft's "one-bit LLM" is a catchy term, but it's more of a metaphor than a literal description. While inspired by the binary nature of neurons, these models, like the BitNet paper, often utilize more than one bit. Let's explore what one-bit LLMs are, how they function, and their potential impact.

The Inspiration: Biological Neurons and Binary Values

The concept of one-bit LLMs draws inspiration from the way biological neurons operate. Bill Gates famously stated that neurons function in a binary fashion, either firing or not firing. One-bit LLMs, however, use binary weight values of -1 and +1 rather than 0 and 1, because -1 is more expressive than zero: a weight of -1 can negate a feature's contribution, whereas zero simply discards it.

Some models use ternary values (-1, 0, and +1). The inclusion of zero allows the model to ignore irrelevant features. However, ternary values require more than one bit for storage, specifically about 1.58 bits. The term "one-bit LLM" is often used loosely to refer to both binary and ternary variants.
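The 1.58 figure is simply the information content of a three-valued symbol; a quick back-of-the-envelope calculation:

```latex
\text{bits per ternary weight} = \log_2 3 \approx 1.585
```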

Motivation: Speed, Memory, and Accessibility

The primary motivation behind pursuing one-bit LLMs is to achieve faster inference with reduced memory usage. Quantizing model weights from 16-bit or 8-bit floating-point to a single bit significantly reduces memory requirements. This, in turn, leads to faster inference, especially on hardware limited by memory bandwidth. Furthermore, smaller checkpoints make these models easier to download and distribute. BitNet, for example, runs inference significantly faster on both GPUs and CPUs compared to full-precision models with similar parameter counts.
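As a rough, illustrative calculation (the exact savings depend on which tensors stay in higher precision), consider the weight storage of a 2-billion-parameter model:

```latex
16\text{-bit}: \quad 2 \times 10^{9} \times 16 \text{ bits} = 4 \text{ GB}
\qquad
\approx 1.58\text{-bit}: \quad 2 \times 10^{9} \times 1.58 \text{ bits} \approx 0.4 \text{ GB}
```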

Historical Context: From CNNs to LLMs

The idea of binary models isn't entirely new. Quantized convolutional neural networks (CNNs) emerged around 2015 for image classification. Examples include BinaryConnect, BinaryNet, and XNOR-Net. Later, quantized BERT models like BinaryBERT and TernaryBERT appeared. With the rise of powerful LLMs, Microsoft introduced BitNet in 2023. The original BitNet used truly one-bit binary weights; a follow-up paper upgraded it to ternary weights. A two-billion-parameter version of the ternary BitNet was recently open-sourced.

How One-Bit LLMs Work: The Bit Linear Layer

BitNet, at a high level, uses the transformer architecture with a key modification: the introduction of a custom "Bit Linear" layer.

  • Standard Transformer Blocks: These blocks contain self-attention mechanisms and feed-forward networks. These components usually rely on linear layers, which involve matrix multiplication.

  • Bit Linear Layer: The conventional linear layers are swapped out for Bit Linear layers. Within these layers, activations are 8-bit integers and weights are ternary. Crucially, only the weights within Bit Linear are ternary; the rest of the network operates in full precision, including the attention computation and the token embedding matrix.
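A minimal sketch of the two quantizers this layer relies on, loosely following the absmean (weights) and absmax (activations) schemes described in the BitNet papers; the function names and exact details here are illustrative, not the official implementation:

```python
import torch

def quantize_weights_ternary(w: torch.Tensor):
    """Map full-precision weights to {-1, 0, +1} plus a per-tensor scale (absmean-style scheme)."""
    scale = w.abs().mean().clamp(min=1e-5)      # per-tensor scale
    w_q = (w / scale).round().clamp(-1, 1)      # ternary values -1, 0, +1
    return w_q, scale

def quantize_activations_int8(x: torch.Tensor):
    """Map activations to 8-bit integer values with a per-token absmax scale."""
    scale = x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5) / 127.0
    x_q = (x / scale).round().clamp(-128, 127)
    return x_q, scale
```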

Quantization and Dequantization

The Bit Linear layer acts as a black box, taking full-precision activations and outputting a transformed version. This involves a quantization step upon entry (converting floats to 8-bit integers) and a dequantization step upon exit (converting integers back to float16). The matrix multiplication within Bit Linear is optimized for ternary weights, reducing computation to additions and subtractions.
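Putting the pieces together, here is a hedged sketch of what such a forward pass might look like, reusing the illustrative quantizers above (the layer-norm step is explained in the next subsection). Real kernels avoid the generic matmul and exploit the fact that ternary weights only require additions and subtractions:

```python
def bit_linear_forward(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """x: (batch, d_in) float activations; w: (d_out, d_in) full-precision master weights."""
    x = torch.nn.functional.layer_norm(x, x.shape[-1:])   # normalize before quantizing
    x_q, x_scale = quantize_activations_int8(x)           # int8 activations + per-token scale
    w_q, w_scale = quantize_weights_ternary(w)            # ternary weights + per-tensor scale
    y = x_q @ w_q.t()                                      # only adds/subtracts are really needed here
    return y * x_scale * w_scale                           # dequantize back to float
```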

Layer Normalization

A layer normalization step is applied before the activation is quantized, so that the mean is zero and the variance is one. This simplifies the quantization step and makes the whole process resilient to outliers, since the normalized activations all live on a comparable scale.
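For reference, the standard layer-normalization formula (shown here in its simplest form, without the optional learned gain and bias) for an activation vector x of dimension d:

```latex
\mathrm{LN}(x)_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad
\mu = \frac{1}{d}\sum_{j=1}^{d} x_j, \qquad
\sigma^2 = \frac{1}{d}\sum_{j=1}^{d} (x_j - \mu)^2
```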

Quantization-Aware Training (QAT)

One-bit LLMs are not simple to implement; they require quantization-aware training (QAT).

  • The Problem with Post-Training Quantization (PTQ): While PTQ (quantizing a trained model) works for higher bit widths (e.g., int8), it struggles with one or two bits.

  • QAT as a Solution: During QAT, a master copy of the weights is kept in full precision throughout training. This copy is used in the backward pass to compute gradients, update optimizer states, and nudge the weights in the right direction. The forward pass, however, uses quantized weights to make predictions, so the model is exposed during training to the same quantization it will see at inference time. The quantization is simulated on the fly, and the quantized weights are discarded immediately after the forward pass.

The quantization step involves rounding, which is not differentiable, so the straight-through estimator (STE) is used to approximate the gradients. STE essentially bypasses the rounding in the backward pass by treating it as the identity function, i.e., setting its derivative to 1.
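Building on the illustrative quantizer sketched earlier, a common way to express the straight-through estimator in an autograd framework such as PyTorch is the detach trick; this is a hedged sketch, not the official BitNet training code:

```python
def ste_quantize(w: torch.Tensor) -> torch.Tensor:
    """Forward: ternary-quantized weights. Backward: gradients flow as if quantization were the identity."""
    w_q, scale = quantize_weights_ternary(w)
    w_hat = w_q * scale                      # quantized weights on the original scale
    return w + (w_hat - w).detach()          # value equals w_hat, but d(output)/d(w) == 1
```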

QAT is often applied as a fine-tuning stage after full-precision pre-training: roughly 90% of the training budget goes to ordinary full-precision pre-training, and only the remaining ~10% is spent on QAT fine-tuning.

Storage and Bit Packing

Storing ternary weights efficiently requires more than just allocating one bit per weight.

  • Naive Approach: Storing each two-bit number in its own 8-bit integer is inefficient.

  • Bit Packing: Packing multiple two-bit weights into a single byte saves space but requires custom logic for unpacking and operations.

  • Elementwise Lookup Table (ELUT): This method groups weights and encodes each group as a single unit. For example, a group of three ternary weights has 27 possible combinations, which fit in five bits, for an average of about 1.67 bits per weight. BitNet adopts TL2 (groups of three weights stored in five bits) and TL1 (groups of two weights stored in four bits); a toy version of this encoding is sketched below.
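A toy version of the grouping idea, assuming the weights are shifted from {-1, 0, +1} to {0, 1, 2} and read as a base-3 number between 0 and 26 (which fits in five bits). The actual TL1/TL2 layouts in bitnet.cpp are more involved; these helpers are purely illustrative:

```python
def pack_three_ternary(w0: int, w1: int, w2: int) -> int:
    """Encode three weights from {-1, 0, +1} as a single index in [0, 26] (fits in 5 bits)."""
    return (w0 + 1) * 9 + (w1 + 1) * 3 + (w2 + 1)

def unpack_three_ternary(idx: int) -> tuple[int, int, int]:
    """Inverse of pack_three_ternary."""
    return idx // 9 - 1, (idx // 3) % 3 - 1, idx % 3 - 1

# 5 bits for 3 weights -> 5/3 ≈ 1.67 bits per weight on average
assert pack_three_ternary(*unpack_three_ternary(13)) == 13
```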

Efficient Computation with Ternary Weights

Matrix multiplication with ternary weights can be optimized by precomputing lookup tables. For example, when activations are grouped in threes, the 27 possible partial dot products (one per ternary weight pattern) are cached for each group; during the matrix multiplication, each stored weight-group index simply looks up its partial result, and the partial results are accumulated. Microsoft has released bitnet.cpp, a runtime for efficient inference with ternary LLMs.
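A hedged sketch of the lookup-table idea for a single output neuron; the name lut_dot is illustrative, and real kernels such as those in bitnet.cpp operate on packed integer data and vectorize heavily:

```python
from itertools import product

def lut_dot(activations: list[float], packed_weights: list[int]) -> float:
    """Dot product where the weights are stored as one index in [0, 26] per group of three ternary values.

    activations: length divisible by 3; packed_weights: one packed index per activation group.
    """
    total = 0.0
    for group_idx, w_idx in enumerate(packed_weights):
        a = activations[3 * group_idx : 3 * group_idx + 3]
        # Precompute the 27 possible partial sums for this activation group
        # (enumeration order matches the base-3 packing above) ...
        table = [sum(t * x for t, x in zip(pattern, a)) for pattern in product((-1, 0, 1), repeat=3)]
        # ... then each weight group is just a table lookup.
        total += table[w_idx]
    return total
```

In practice, the 27-entry table for each activation group is computed once and reused across every row of the weight matrix, which is where the savings over a naive multiply-accumulate come from.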

Performance and the Future

While informal "vibe checks" of the two-billion-parameter model show somewhat limited performance, the fair comparison is against other open-weight models in the 1-2 billion parameter range. In those comparisons, BitNet performs well, often landing just behind the leader while taking up less memory. Scaling laws appear to hold for low-bit models, suggesting that a larger, currently unreleased 70-billion-parameter BitNet would perform considerably better.

Other LLM providers may soon join the race to develop and improve one-bit LLMs. Google's Gemma already has a four-bit version, and the ecosystem has good incentives to keep pushing the research.
