Implementing a RAG Architecture from Scratch
This article demonstrates how to implement a Retrieval-Augmented Generation (RAG) architecture from scratch. RAG helps Large Language Models (LLMs) maintain focus and avoid generating nonsensical responses when the source material is too long to handle in a single context. We will walk through the process step by step: chunking a document, embedding the chunks, and storing and querying them in a vector database.
Understanding RAG
The core idea behind RAG is to provide the LLM with relevant context extracted from a knowledge base. This involves:
- Chunking: Breaking down a document into smaller, manageable segments.
- Embedding: Converting each chunk into a vector representation that captures its semantic meaning.
- Vector Database: Storing these embeddings in a database optimized for similarity searches.
- Retrieval: When a user asks a question, finding the chunks in the vector database that are semantically similar to the question.
- Augmentation: Sending these retrieved chunks, along with the original question, to the LLM to generate a more informed and accurate response.
Chunking the Article
The first step is to break the source document into smaller chunks. For this demonstration, we'll use an article titled "Regarding Linghu Chong's Reincarnation as a Slime and Offering a Beautiful Explosion to the World". The article's structure, with paragraphs separated by two carriage returns, simplifies the chunking process.
- We create a chunk.py file.
- The read_data function reads the article into a single string.
- The get_chunk function splits the article into a list of strings, using two carriage returns as the delimiter.
A modification is made to handle chapter titles: if a paragraph begins with a hash sign (#), it is merged with the following body text to avoid creating overly short, isolated chunks. Alternative chunking methods, such as LangChain's RecursiveCharacterTextSplitter, are also available.
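Putting this together, chunk.py might look roughly like the sketch below. Splitting on two carriage returns is implemented here as a split on two consecutive newlines; the file path and the exact title-merging logic are assumptions rather than the author's verbatim code.

```python
# chunk.py -- a minimal sketch of the chunking step (assumed implementation).

def read_data(path: str) -> str:
    """Read the whole article into a single string."""
    with open(path, encoding="utf-8") as f:
        return f.read()


def get_chunk(text: str) -> list[str]:
    """Split on blank lines (two consecutive newlines) and merge a chapter
    title (a paragraph starting with '#') into the body paragraph that
    follows it, so titles never become tiny isolated chunks."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    title = ""
    for p in paragraphs:
        if p.startswith("#"):
            title = p                     # hold the title for the next paragraph
            continue
        chunks.append(f"{title}\n{p}" if title else p)
        title = ""
    return chunks


if __name__ == "__main__":
    chunks = get_chunk(read_data("article.txt"))   # "article.txt" is a placeholder path
    print(f"Produced {len(chunks)} chunks")
```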
Embedding the Chunks
Next, each chunk needs to be converted into an embedding vector and stored in a vector database.
- The embed.py file is created.
- ChromaDB is used as the vector database due to its simplicity.
- Google's embedding model gemini-embedding-exp-03-07 is chosen for generating embeddings.
- The necessary dependencies are installed, and the GOOGLE_API_KEY environment variable is configured.
The embed function takes text as input and returns its embedding. Google's embedding model distinguishes between storage and query embeddings:
- Storage embeddings are used for storing document chunks in the database.
- Query embeddings are used for embedding the user's question.
This distinction allows the model to better capture semantic relationships between questions and relevant document chunks. The task_type parameter in the Gemini API is set to RETRIEVAL_DOCUMENT for storage and RETRIEVAL_QUERY for querying.
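Assuming the google-genai Python client is the SDK in use, the embed function could be sketched as follows. Only the model name and the two task types come from the text above; the store flag and the surrounding code are assumptions that mirror the store=True / store=False usage referenced later.

```python
# embed.py -- a sketch of the embed() helper, assuming the google-genai SDK.
import os

from google import genai
from google.genai import types

# Requires `pip install google-genai` and the GOOGLE_API_KEY environment variable.
client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])


def embed(text: str, store: bool) -> list[float]:
    """Return the embedding vector for `text`.

    store=True  -> RETRIEVAL_DOCUMENT (chunks going into the database)
    store=False -> RETRIEVAL_QUERY    (the user's question)
    """
    result = client.models.embed_content(
        model="gemini-embedding-exp-03-07",
        contents=text,
        config=types.EmbedContentConfig(
            task_type="RETRIEVAL_DOCUMENT" if store else "RETRIEVAL_QUERY"
        ),
    )
    return result.embeddings[0].values
```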
Creating the Vector Database
An instance of ChromaDB is created to store the embeddings. The database is persisted in the chroma.db folder, and a collection is created with the Pinyin of Linghu Chong as its name. The create_db function embeds each chunk using the embed function with store=True and stores the results in the vector database via ChromaDB's upsert method. Each entry requires a string-type ID; in this example, the chunk's index is used.
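Under these assumptions, create_db might look roughly like this; the collection name linghuchong (the Pinyin mentioned above) and the one-chunk-at-a-time loop are illustrative choices.

```python
# A sketch of create_db(), assuming ChromaDB's persistent client.
import chromadb

from chunk import get_chunk, read_data   # helpers sketched earlier
from embed import embed


def create_db(chunks: list[str]) -> None:
    client = chromadb.PersistentClient(path="chroma.db")   # stored in the chroma.db folder
    collection = client.get_or_create_collection(name="linghuchong")
    for i, chunk in enumerate(chunks):
        collection.upsert(
            ids=[str(i)],                           # IDs must be strings; the chunk index is used
            embeddings=[embed(chunk, store=True)],  # storage embedding
            documents=[chunk],                      # keep the raw text alongside the vector
        )


if __name__ == "__main__":
    create_db(get_chunk(read_data("article.txt")))  # placeholder path
```

Upserting one entry at a time keeps the sketch simple; ChromaDB's upsert also accepts whole batches of IDs, embeddings, and documents, which is noticeably faster for larger documents.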
Querying the Vector Database
With the database populated, the query_db function allows us to retrieve relevant chunks based on a user's question. The function takes a question as input, embeds it using the embed function with store=False, and then uses the resulting embedding to query the vector database for the most similar chunks. The n_results parameter is set to 5 to retrieve the top 5 most relevant records.
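A matching sketch of query_db, reading back from the same chroma.db folder under the same assumptions:

```python
# A sketch of query_db(), assuming the same ChromaDB collection as above.
import chromadb

from embed import embed   # the embed() helper sketched earlier


def query_db(question: str, n_results: int = 5) -> list[str]:
    client = chromadb.PersistentClient(path="chroma.db")
    collection = client.get_or_create_collection(name="linghuchong")
    result = collection.query(
        query_embeddings=[embed(question, store=False)],  # query embedding
        n_results=n_results,                              # top-5 by default
    )
    return result["documents"][0]   # chunk texts for the single query embedding
```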
Integrating with the Large Language Model
Finally, the retrieved text chunks are combined with the original question to build a prompt for the LLM. The prompt is then sent to the LLM (in this case, Gemini 2.5 Flash) to generate the final answer, using the provided context to produce a more informed and accurate response. The complete code is provided in the video description for experimentation.
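For completeness, here is a hedged sketch of this augmentation step, again assuming the google-genai SDK; the module name query, the prompt template, and the model id gemini-2.5-flash are illustrative choices rather than the author's exact code.

```python
# A sketch of the final retrieve-and-generate step (assumed implementation).
import os

from google import genai

from query import query_db   # hypothetical module holding query_db() from the previous step

client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])


def answer(question: str) -> str:
    context = "\n\n".join(query_db(question))   # top-5 retrieved chunks
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt,
    )
    return response.text


if __name__ == "__main__":
    print(answer("Who is Linghu Chong?"))   # example question, purely illustrative
```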