Understanding Retrieval Augmented Generation (RAG)
This article explains Retrieval Augmented Generation (RAG), a technique used to enhance the performance of Large Language Models (LLMs) by providing them with relevant context. It addresses the problem of "model hallucination" and limitations in processing large documents.
The Problem: Model Hallucination and Large Documents
LLMs can sometimes generate incorrect or nonsensical answers, a phenomenon known as "model hallucination," especially when they lack the necessary information. Simply providing the entire document to the LLM isn't always effective: large documents can overwhelm the model, making it difficult to pinpoint the key information needed to answer the question accurately. With too much input, the model is easily sidetracked by irrelevant details.
RAG: A Solution
RAG aims to address these issues by providing the LLM with only the most relevant parts of a document. Instead of sending the entire document, it retrieves and sends only the sections that are most pertinent to the user's question.
Embedding Models: Finding Relevance
To determine the relevance of text, RAG utilizes embedding models.
- Unlike LLMs, embedding models output a fixed-length array (also known as a vector) for any given text input.
- This array is a compressed, lossy representation of the text's meaning.
- The key idea is that texts with similar meanings will have vectors that are close to each other in a high-dimensional space.
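"Closeness in meaning" is usually measured with cosine similarity between vectors. The sketch below uses tiny, invented 4-dimensional vectors purely for illustration; a real embedding model would produce vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the magnitudes.
    # Result is near 1.0 when the vectors point in the same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings (values invented for this example).
v_cat = [0.9, 0.1, 0.0, 0.2]
v_kitten = [0.85, 0.15, 0.05, 0.25]
v_car = [0.1, 0.9, 0.3, 0.0]

print(cosine_similarity(v_cat, v_kitten))  # close to 1.0: similar meaning
print(cosine_similarity(v_cat, v_car))     # noticeably lower: different meaning
```

Other distance measures (Euclidean distance, dot product) are also common; the choice usually depends on how the embedding model was trained.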
Vector Space and Distance
Imagine a coordinate system where each dimension corresponds to a value in the vector output by the embedding model. Each piece of text is represented as a point in this space. Texts with similar meanings will be located closer to each other. The distance between the vectors of two pieces of text indicates their semantic similarity.
When a user asks a question, it is also converted into a vector using the same embedding model. The program then calculates the distance between the question's vector and the vectors of all the text fragments in the document. The text fragments with the smallest distances are considered the most relevant and are sent to the LLM along with the question.
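That ranking step can be sketched in a few lines. The vectors below are invented stand-ins for real embedding-model output; in practice both the chunks and the question would be run through the same embedding model.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two points in the vector space.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Invented vectors standing in for real embedding output.
chunk_vectors = {
    "The Eiffel Tower is in Paris.":          [0.9, 0.1, 0.1],
    "Photosynthesis occurs in chloroplasts.": [0.1, 0.9, 0.2],
    "Paris is the capital of France.":        [0.8, 0.2, 0.1],
}
question_vector = [0.88, 0.12, 0.1]  # e.g. "Where is the Eiffel Tower?"

# Rank chunks by distance to the question; the closest are the most relevant.
ranked = sorted(chunk_vectors, key=lambda t: euclidean(chunk_vectors[t], question_vector))
print(ranked[0])  # the chunk nearest to the question
```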
RAG Architecture: A Step-by-Step Breakdown
Here's how RAG works:
- Document Chunking: The document is divided into smaller pieces, or "chunks". This can be done by character count, paragraph, or sentence, or with more sophisticated methods.
- Embedding: Each chunk is processed by an embedding model, generating a vector for each chunk.
- Vector Database Storage: The vectors and their corresponding text chunks are stored in a vector database. Vector databases are designed to efficiently find the vectors closest to a given query vector.
- Query Embedding: When a user asks a question, the question is also converted into a vector using the same embedding model.
- Retrieval: The vector database is queried with the question's vector, and the database retrieves the k nearest neighbor vectors (the vectors closest to the question's vector).
- Augmented Generation: The text chunks corresponding to the retrieved vectors are sent to the LLM along with the user's question. The LLM then generates an answer based on this context.
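The six steps above can be condensed into a toy end-to-end sketch. The "embedding model" here is just a bag-of-words count over a tiny fixed vocabulary, and the "vector database" is a plain list; both are stand-ins so the pipeline's shape is visible. The final LLM call is omitted.

```python
import math
from collections import Counter

# Stand-in for a real embedding model: word counts over a tiny vocabulary.
VOCAB = ["paris", "tower", "capital", "chloroplast", "plant", "energy"]

def toy_embed(text):
    counts = Counter(word.strip(".,?").lower() for word in text.split())
    return [counts[w] for w in VOCAB]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# 1. Chunking: here, one chunk per sentence.
chunks = [
    "The tower is in Paris.",
    "Paris is the capital of France.",
    "Chloroplasts give a plant its energy.",
]

# 2-3. Embedding and storage: (vector, chunk) pairs as a toy vector database.
index = [(toy_embed(c), c) for c in chunks]

def retrieve(question, k=2):
    # 4. Query embedding, then 5. retrieval of the k nearest neighbours.
    qv = toy_embed(question)
    ranked = sorted(index, key=lambda pair: distance(pair[0], qv))
    return [chunk for _, chunk in ranked[:k]]

# 6. Augmented generation: the retrieved chunks are prepended to the prompt
# that would be sent to the LLM.
question = "Where is the tower?"
context = retrieve(question)
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question
print(prompt)
```

A production system would swap `toy_embed` for a real embedding model and the list for a vector database with an approximate-nearest-neighbour index, but the data flow is the same.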
Limitations of RAG
While RAG improves LLM performance, it's not without limitations:
- Chunking Challenges: Determining the optimal way to chunk a document is difficult, as different document structures may require different methods. In some cases, key information is split across chunks, leading to loss of context.
- Lack of Global Perspective: RAG struggles with questions that require understanding the entire document rather than specific sections. For example, counting the occurrences of a word across the entire document is hard to achieve with RAG, as every sentence may have equal, weak relevance to the question.
Future Improvements
Ongoing research is focused on improving RAG, including more sophisticated chunking strategies and methods for incorporating a more global perspective. Some approaches use LLMs themselves during chunking to identify optimal split points, or rewrite chunks uniformly (for example, replacing ambiguous terms with their referents) so each chunk carries more of its own context.
RAG: A Useful Compromise
RAG represents a practical compromise, working within the limitations of current LLMs to provide more accurate and relevant responses. While it's not a perfect solution, it addresses key challenges and helps to mitigate the risk of model hallucination, pending future architectures that transcend these compromises. RAG is a form of compression that filters for the most important components of a document.