Unlocking RAG’s Potential: Mastering Advanced Techniques – Part 1

Introduction

Large Language Models (LLMs) can be inconsistent. At times they provide accurate answers to questions, while at other times they generate random facts drawn from their training data. The output can be grammatically correct but factually nonsensical. The term hallucination captures the idea that the generated content appears coherent but lacks accuracy or relevance. One way to address this challenge is Retrieval-Augmented Generation (RAG), which improves the quality of language model responses by incorporating external knowledge sources. By grounding the model in verifiable facts, RAG produces more accurate and reliable answers. The advantages are therefore threefold:

  1. Cost-effective domain adaptation: RAG offers a more economical approach than fine-tuning on domain-specific data.
  2. Access to current, reliable facts: RAG ensures that the model can access up-to-date and trustworthy information.
  3. Transparency and verification: Users can cross-reference the model’s answers with its sources, allowing claims to be fact-checked and trusted.

The following is an example of a real scenario in which the user asks a question about ProCogia and the LLM hallucinates due to a lack of contextual knowledge. In this case, RAG proves to be an effective approach for supplying external context to the LLM when producing the final output.

With the evolution of numerous large language model architectures, RAG has emerged as a significant breakthrough. Several advanced RAG architectures have proven successful at injecting external knowledge bases into LLMs. This article delves into a selection of these architectures, namely the following:

  1. Naive RAG
  2. Hybrid
  3. Parent Document Retriever / Small2Big
  4. HyDE
  5. Compressors and Filtering
  6. Self-Querying Retrieval

For demonstration purposes, the gpt-3.5-turbo model has been used as the LLM, and the source data consists of website content from https://procogia.com/. The following code snippet sets up a RAG framework using LangChain.
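
The snippet below is a minimal setup sketch, assuming the langchain, langchain-openai, langchain-community, and chromadb packages and an OPENAI_API_KEY in the environment; exact import paths vary slightly across LangChain versions, and the chunk sizes are illustrative choices.

```python
# Minimal RAG setup sketch (assumptions: langchain 0.1.x-style packages,
# OPENAI_API_KEY set in the environment).
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Load the source data: website content from procogia.com.
loader = WebBaseLoader("https://procogia.com/")
documents = loader.load()

# Split the pages into smaller passages for indexing.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(documents)

# The LLM and the embedding (retrieval) model used throughout the article.
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
embeddings = OpenAIEmbeddings()
```

The later snippets in this article reuse the `documents`, `chunks`, `llm`, and `embeddings` objects defined here.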

Naive RAG

The standard RAG architecture consists of two main components: indexing and retrieval. In the indexing phase, the knowledge base is segmented into smaller passages, and a retrieval model (i.e., embedding model) is employed to generate embeddings (i.e., vectors), which are stored in a vector database (e.g., Chroma, FAISS, Pinecone).

Following indexing, during the retrieval phase when a user submits a query, an embedding of the query is generated using the same retrieval model. Utilizing similarity search, the most similar context is identified from the vector store and forwarded to the LLM. This process provides the LLM with external knowledge to produce the final output, which is then transmitted back to the user. The competitive advantage of employing this architecture lies in its ability to deliver more accurate and cost-effective knowledge injection compared to fine-tuning LLMs [1].
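
Building on the setup snippet above, a minimal Naive RAG sketch could look as follows; Chroma, the value of k, and the example question are illustrative choices rather than requirements.

```python
# Naive RAG sketch: index the chunks in a vector store, retrieve the
# most similar passages for a query, and pass them to the LLM.
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

# Indexing: embed the chunks and store the resulting vectors.
vectorstore = Chroma.from_documents(chunks, embeddings)

# Retrieval: fetch the top-k most similar chunks and let the LLM answer.
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

print(qa_chain.invoke({"query": "What services does ProCogia offer?"})["result"])
```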

Hybrid

Unlike conventional RAG, hybrid search employs two retrievers instead of one, merging the strengths of dense and sparse retrieval. The dense retriever is very similar to the retriever model shown in Naive RAG, while the sparse retriever [2] relies on token-based document searching. Sparse retrievers may struggle to capture more nuanced semantic relationships between documents, as they rely primarily on surface-level keyword matching. This is where dense retrievers, which employ neural network models to understand semantic similarity, complement them. Hybrid search can therefore be an effective technique for capturing valuable context for the LLM.

The indexing process in this technique closely resembles that of Naive RAG. However, instead of a single index, we now need two: a token-based index for the sparse retriever and a vector store for the dense retriever. In the retrieval phase, the response scores from the dense and sparse retrievers are combined using a weighted average, followed by reranking, and the top reranked results are passed back to the LLM as context. EnsembleRetriever [3] in LangChain implements hybrid search, as sketched below.
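
This hedged sketch pairs a BM25 sparse retriever (which assumes the rank-bm25 package is installed) with the dense Chroma retriever from the previous example; the 50/50 weights and example question are illustrative. Note that EnsembleRetriever fuses the two result lists with weighted reciprocal rank fusion.

```python
# Hybrid search sketch, building on the setup and Naive RAG snippets above.
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Sparse, token-based retriever over the same chunks.
sparse_retriever = BM25Retriever.from_documents(chunks)
sparse_retriever.k = 4

# Dense retriever backed by the Chroma vector store built earlier.
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Fuse and rerank the two result lists (weighted reciprocal rank fusion).
hybrid_retriever = EnsembleRetriever(
    retrievers=[sparse_retriever, dense_retriever],
    weights=[0.5, 0.5],
)

hybrid_docs = hybrid_retriever.invoke("Who are ProCogia's partners?")
```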

Parent Document Retriever (Small2Big)

Chunking documents into passages is a critical step in document indexing. Smaller chunks are favored because they enable the embedding model to capture their meaning more effectively than larger chunks, which risk losing meaning in their embeddings. This limitation arises because the embedding model translates chunks into fixed-dimensional vectors, thereby restricting the semantics captured within these vectors. Consequently, embedding vectors from smaller chunks are likely to retain more information, making vector search on smaller chunks more effective during retrieval.

Conversely, larger chunks can offer more context to the LLM during the retrieval phase. The ParentDocumentRetriever [4] addresses this dilemma by segmenting and storing small chunks of data. During retrieval, it initially retrieves these small chunks but then refers to the parent IDs associated with them, ultimately returning the larger documents from which they originated. As a result, the LLM receives a broader context as a knowledge base, leading to improved responses.
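
A minimal sketch using LangChain's ParentDocumentRetriever follows; the chunk sizes, the in-memory document store, and the example question are illustrative assumptions.

```python
# Small2Big sketch: index small child chunks for accurate vector search,
# but return the larger parent chunks they came from as context.
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

child_vectorstore = Chroma(collection_name="children", embedding_function=embeddings)
parent_store = InMemoryStore()  # maps parent IDs to the full parent documents

small2big_retriever = ParentDocumentRetriever(
    vectorstore=child_vectorstore,
    docstore=parent_store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
small2big_retriever.add_documents(documents)

# Vector search runs over the small chunks, but the parents are returned.
parent_docs = small2big_retriever.invoke("What industries does ProCogia serve?")
```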

HyDE

HyDE [5], short for Hypothetical Document Embeddings, presents a new approach to improving search accuracy. It is particularly useful for questions that are unclear or lacking in detail, which is common in real-life scenarios where users often ask vague questions. HyDE handles each question by creating a pseudo response, essentially an intermediate answer generated by an LLM. This pseudo response is then passed to the retriever model to search for context within the original document chunks, and the LLM generates the final response from the context returned by the retriever model.

LangChain implements HyDE in the HypotheticalDocumentEmbedder [6] class: the user's query is sent to the LLM to generate an intermediate answer, which is then passed to the retriever model to search for similar context among the indexed documents, resulting in better context for vague questions.
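
As a hedged sketch building on the setup snippet, the HyDE embedder can wrap the base embeddings as shown below; the built-in "web_search" prompt and the example question are illustrative choices.

```python
# HyDE sketch: the LLM writes a hypothetical answer to the query, and
# that pseudo answer (not the raw query) is embedded for the search.
from langchain.chains import HypotheticalDocumentEmbedder
from langchain_community.vectorstores import Chroma

hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=llm,
    base_embeddings=embeddings,
    prompt_key="web_search",  # built-in prompt for generating the pseudo answer
)

# Document embeddings are delegated to the base model; only queries are expanded.
hyde_vectorstore = Chroma.from_documents(chunks, hyde_embeddings, collection_name="hyde")
hyde_retriever = hyde_vectorstore.as_retriever()

context_docs = hyde_retriever.invoke("Does ProCogia offer dashboarding services?")
```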

Compressors and Filtering

A major issue with RAG is that not all information retrieved from the document store is relevant. Sometimes the retriever model returns a large chunk containing only a small portion of useful content. This method addresses such situations by improving the quality of the retrieved context: compressors and filters [7] can be applied to the retrieved context to condense it and keep only the relevant parts before it is passed to the LLM to generate the final response. However, one drawback of incorporating compression and filtering techniques is increased inference time. Some of the strategies involved include:

  • ContextualCompressionRetriever: Extracting only the content relevant to the query.
  • EmbeddingsFilter: Filtering out results that are not closely related to the query.
  • EmbeddingsRedundantFilter: Removing redundant content after a split or transformation.

Additionally, an LLM chain can be used to decide which retrieved passages are shown to the final LLM. This can involve a model prompted or fine-tuned to determine whether to display the information, categorizing it as either “YES” to show or “NO” to hide.
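
A minimal sketch of such a pipeline follows, building on the earlier snippets; the similarity threshold and the choice of compressors are illustrative, and LangChain's LLMChainFilter could be swapped in for the YES/NO style document filtering described above.

```python
# Contextual compression sketch: wrap the base retriever with a pipeline
# that drops redundant chunks, filters out chunks with low similarity to
# the query, and finally lets an LLM extract only the relevant content.
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import (
    DocumentCompressorPipeline,
    EmbeddingsFilter,
    LLMChainExtractor,
)
from langchain_community.document_transformers import EmbeddingsRedundantFilter

redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)
relevance_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.75)
extractor = LLMChainExtractor.from_llm(llm)  # keeps only query-relevant content

pipeline = DocumentCompressorPipeline(
    transformers=[redundant_filter, relevance_filter, extractor]
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=pipeline,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 8}),
)

compressed_docs = compression_retriever.invoke("Where is ProCogia headquartered?")
```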

Self-Querying Retrieval

The self-querying retriever [8] has the ability to query its own database. When it encounters a natural language question, it transforms it into a structured query. This is done by an LLM-based query constructor and translator, which receives the user's question, translates it into a structured query with filters, and passes it to the retriever model. The structured query serves as a crucial link between the retriever model and the LLM. However, for this process to be effective, the underlying data must adhere to a structured format and include relevant metadata.

Consequently, the structured output from the query constructor and translator allows the retriever to do more than compare the user's question with the stored documents: it can also extract conditions on the documents' metadata from the question and use them to filter the results. This context is then passed to the LLM, enabling it to generate a final response that aligns closely with the conditions in the user's original query.

Below is an example using structured data on different programming languages, their primary use, and their release date.
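
The following is a hedged sketch of a self-querying retriever over that kind of structured data; the documents, metadata fields, and example question are invented for illustration, and the optional lark package is assumed to be installed for query parsing.

```python
# Self-querying retrieval sketch over a small illustrative dataset of
# programming languages with metadata (use and release year).
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

docs = [
    Document(page_content="Python is widely used for data science and scripting.",
             metadata={"use": "data science", "released": 1991}),
    Document(page_content="JavaScript powers interactive web front ends.",
             metadata={"use": "web development", "released": 1995}),
    Document(page_content="Rust targets safe, fast systems programming.",
             metadata={"use": "systems programming", "released": 2015}),
]

metadata_field_info = [
    AttributeInfo(name="use", description="Primary use of the language", type="string"),
    AttributeInfo(name="released", description="Year the language was released", type="integer"),
]

lang_store = Chroma.from_documents(docs, embeddings, collection_name="languages")

self_query_retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=lang_store,
    document_contents="Short descriptions of programming languages",
    metadata_field_info=metadata_field_info,
)

# The LLM turns the question into a structured query with a metadata
# filter (e.g. released > 2000) before the vector search runs.
results = self_query_retriever.invoke(
    "Which languages released after 2000 are used for systems programming?"
)
```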

Conclusion

In conclusion, there is no universal technique that fits every solution. The methods presented in this article can be combined to create a resilient framework; however, the choice of techniques depends on the nature of the queries encountered, the characteristics of the source data, and resource constraints.

References

  1. Ovadia, O., Brief, M., Mishaeli, M., & Elisha, O. (2023). Fine-tuning or retrieval? Comparing knowledge injection in LLMs.
  2. Chu, L. (2023, July 2). Understanding term-based retrieval methods in information retrieval.
  3. Ensemble Retriever | 🦜️🔗 LangChain. (n.d.).
  4. Parent Document Retriever | 🦜️🔗 LangChain. (n.d.).
  5. Gao, L., Ma, X., Lin, J., & Callan, J. (2022). Precise zero-shot dense retrieval without relevance labels.
  6. chains.hyde.base.HypotheticalDocumentEmbedder — 🦜🔗 LangChain 0.2.1. (n.d.).
  7. Contextual Compression | 🦜️🔗 LangChain. (n.d.).
  8. Self-querying | 🦜️🔗 LangChain. (n.d.).
