RAG (Retrieval Augmented Generation)
Explore the evolution of Large Language Models (LLMs) and how Retrieval Augmented Generation (RAG) techniques are addressing their limitations, making AI responses more accurate and reliable. Discover advanced RAG methods like Chain of Note, Corrective RAG, RAG Fusion, and Self-RAG for enhanced AI applications.
RAG and LLMs
As the popularity of Large Language Models (LLMs) like ChatGPT has skyrocketed, so have expectations of their capabilities. Many users began treating LLMs as alternatives to traditional search engines, hoping to get instant, accurate information. However, this surge in usage also highlighted some significant limitations. Beyond concerns about copyright, privacy, security, and weak mathematical reasoning, two primary issues became apparent:
- LLMs often fail to provide up-to-date information.
- LLMs sometimes produce factually inaccurate responses.
The Core Challenge of LLMs
Users seek knowledge and wisdom from LLMs, but these models are essentially advanced predictors of the next word in a sequence. The real challenge lies in making LLMs:
- Respond with the most current information.
- Avoid delivering factually incorrect answers.
- Handle and incorporate proprietary information.
What is Retrieval Augmented Generation (RAG)?
In 2023, Retrieval Augmented Generation (RAG) emerged as a popular technique to address these challenges within the realm of LLMs.
Retrieval Augmented Generation works as follows (a minimal code sketch of this loop appears after the list):
- A user writes a prompt or query.
- This query is sent to an orchestrator.
- The orchestrator sends a search query to a retriever.
- The retriever fetches relevant information from various knowledge sources and sends it back.
- The orchestrator augments the prompt with the retrieved context and sends it to the LLM.
- The LLM generates a response, which is then displayed to the user via the orchestrator.
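To make this flow concrete, here is a minimal sketch of the loop in Python. The retriever, llm_generate, and orchestrator functions and their return values are illustrative placeholders, not any particular framework's API.
# Minimal sketch of the RAG loop described above; every function is a placeholder
def retriever(search_query):
    # Fetch relevant passages from knowledge sources (vector store, database, the web, ...)
    return ["Passage 1 relevant to the query", "Passage 2 relevant to the query"]
def llm_generate(prompt):
    # Stand-in for a call to any LLM
    return f"Response generated from a prompt of {len(prompt)} characters"
def orchestrator(user_query):
    # Send a search query to the retriever
    passages = retriever(user_query)
    # Augment the prompt with the retrieved context
    augmented_prompt = "\n".join(passages) + "\n\n" + user_query
    # Send the augmented prompt to the LLM and return the response to the user
    return llm_generate(augmented_prompt)
print(orchestrator("What is Retrieval Augmented Generation?"))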
How Does RAG Help?
Unlimited Knowledge: The retriever in a RAG system can access external information sources. This means the LLM is not confined to its internal data. The external sources can include proprietary documents, databases, or even the internet, vastly expanding the model’s knowledge base.
Example: Imagine you ask an LLM about the latest research in quantum computing. Without RAG, the LLM might only provide information based on data it was trained on, which could be outdated. With RAG, the system retrieves the latest papers and articles on the topic, providing a much more current and accurate response.
from transformers import pipeline
# Retriever stub: a real system would query a search index, vector store, or the web
def retrieve_external_info(query):
    # Placeholder for the retrieval logic against external knowledge sources
    return "Retrieved context about the latest research in quantum computing."
# Initialize the LLM ('gpt2' is only a runnable placeholder; swap in a stronger model)
generator = pipeline('text-generation', model='gpt2')
# User query
query = "What is the latest research in quantum computing?"
# Retrieve external information
context = retrieve_external_info(query)
# Augment the prompt with the retrieved context
augmented_query = f"{context}\n\n{query}"
# Generate the response
response = generator(augmented_query, max_new_tokens=100)
print(response[0]['generated_text'])
Confidence in Responses: With the additional context provided by the retriever, LLM responses are more reliable and accurate. This context helps the LLM generate answers with greater confidence, reducing the risk of misinformation.
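One common way to encourage this grounding in practice is to make it explicit in the augmented prompt. The template below is only a sketch; the exact wording, and the reuse of the generator pipeline from the earlier example, are assumptions rather than a standard format.
# Illustrative grounded prompt template; the wording is an assumption, not a standard
GROUNDED_PROMPT = (
    "Answer the question using ONLY the context below. "
    "If the context is insufficient, say you do not know.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
prompt = GROUNDED_PROMPT.format(
    context="Retrieved context about the latest research in quantum computing.",
    question="What is the latest research in quantum computing?",
)
# Reuses the text-generation pipeline defined in the earlier example
response = generator(prompt, max_new_tokens=100)
print(response[0]['generated_text'])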
Emerging RAG Techniques
As the RAG approach evolves, new techniques have been developed to enhance its capabilities further:
1. Chain of Note (CoN)
This technique generates notes for the retrieved documents. By breaking the problem into smaller steps and producing a note at each step, the model arrives at a final answer that is more factually accurate and trustworthy.
Example: When asked about the historical events leading up to the fall of the Berlin Wall, a CoN-enhanced RAG system retrieves multiple documents, generating notes on each significant event. This step-by-step breakdown ensures a detailed and accurate historical account.
# Example function to generate notes for each retrieved document
def generate_notes(docs):
    notes = []
    for doc in docs:
        # Generate a note for each document (a stub here; an LLM call in practice)
        note = f"Note on document: {doc}"
        notes.append(note)
    return notes
# Example documents retrieved
docs = ["Document 1: Event A", "Document 2: Event B", "Document 3: Event C"]
# Generate notes for the retrieved documents
notes = generate_notes(docs)
# Combine the notes to form the augmented context
context = "\n".join(notes)
# User query
query = "What events led to the fall of the Berlin Wall?"
# Augment the prompt with the context
augmented_query = f"{context}\n\n{query}"
# Generate the response (reusing the generator pipeline from the earlier example)
response = generator(augmented_query, max_new_tokens=100)
print(response[0]['generated_text'])
2. Corrective RAG
In this method, if a retrieved answer is ambiguous, the query is sent to a search engine. The search results are then combined with the original RAG documents, and the LLM re-evaluates the query, considering both sources to produce a more accurate response.
Example: If an LLM is asked about the implications of a recent court ruling but the retrieved document is unclear, the Corrective RAG method sends the query to a search engine. It retrieves additional articles and legal opinions, re-evaluates the information, and provides a more comprehensive answer.
# Placeholder web-search helper; a real implementation would call a search API
def search_engine(query):
    return ["Search result: news coverage of the ruling", "Search result: legal opinion on the ruling"]
# Example function to perform corrective retrieval
def corrective_retrieve(query, initial_docs):
    # If the initial retrieval is ambiguous, fall back to a web search and combine results
    search_results = search_engine(query)
    return initial_docs + search_results
# User query
query = "What are the implications of the recent court ruling?"
# Example initial retrieved documents
initial_docs = ["Ambiguous document about the court ruling"]
# Perform corrective retrieval
combined_docs = corrective_retrieve(query, initial_docs)
# Augment the prompt with the combined context
context = "\n".join(combined_docs)
augmented_query = f"{context}\n\n{query}"
# Generate the response (reusing the generator pipeline from the earlier example)
response = generator(augmented_query, max_new_tokens=100)
print(response[0]['generated_text'])
3. RAG Fusion
This approach breaks a query into smaller sub-queries, each of which retrieves the most relevant documents from a vector database. The combined results are then re-ranked with the Reciprocal Rank Fusion algorithm. Compared with a single-query search, this surfaces more diverse perspectives and documents that any single phrasing of the query would miss.
Example: For a complex question about climate change’s impact on different ecosystems, RAG Fusion breaks the query into parts focusing on various ecosystems (e.g., oceans, forests, deserts). It retrieves and prioritizes the most relevant documents for each sub-query, resulting in a comprehensive and nuanced response.
# Example function to break a query into sub-queries (an LLM would do this in practice)
def break_into_sub_queries(query):
    return ["Climate change impact on oceans",
            "Climate change impact on forests",
            "Climate change impact on deserts"]
# Placeholder retriever returning a ranked document list for a sub-query
def retrieve_documents(sub_query):
    return [f"Top document for: {sub_query}", f"Second document for: {sub_query}"]
# Reciprocal Rank Fusion: score each document by summing 1 / (k + rank) across rankings
def reciprocal_rank_fusion(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
# Example function to perform RAG Fusion
def rag_fusion(query):
    sub_queries = break_into_sub_queries(query)
    # Retrieve a ranked list of documents for each sub-query
    rankings = [retrieve_documents(sub_query) for sub_query in sub_queries]
    # Prioritize documents using Reciprocal Rank Fusion
    return reciprocal_rank_fusion(rankings)
# User query
query = "How does climate change affect different ecosystems?"
# Perform RAG Fusion
fused_docs = rag_fusion(query)
# Augment the prompt with the fused context
context = "\n".join(fused_docs)
augmented_query = f"{context}\n\n{query}"
# Generate the response (reusing the generator pipeline from the earlier example)
response = generator(augmented_query, max_new_tokens=100)
print(response[0]['generated_text'])
4. Self-RAG
In Self-RAG, LLMs perform self-reflection for dynamic retrieval, critique, and generation. This technique allows the model to self-evaluate and improve its responses by iteratively refining its understanding and outputs.
Example: When an LLM is asked about future trends in artificial intelligence, Self-RAG enables the model to retrieve diverse sources, reflect on their content, critique its initial response, and generate a more insightful and forward-looking answer.
# Placeholder critique step; in practice the LLM critiques its own draft answer
def critique_response(response_text):
    return ["The answer lacks concrete evidence and recent sources."]
# Placeholder refinement step; in practice the LLM rewrites the query using the critiques
def refine_query_with_critiques(query, critiques):
    return f"{query} (addressing: {'; '.join(critiques)})"
# Example function for self-reflection: critique the draft, refine the query, re-retrieve
def self_reflect(query, initial_response):
    critiques = critique_response(initial_response)
    refined_query = refine_query_with_critiques(query, critiques)
    refined_docs = retrieve_documents(refined_query)  # reuses the retriever stub above
    return refined_docs
# User query
query = "What are the future trends in artificial intelligence?"
# Initial retrieval and draft response
initial_docs = retrieve_documents(query)
initial_response = generator("\n".join(initial_docs) + f"\n\n{query}", max_new_tokens=100)[0]['generated_text']
# Perform self-reflection
refined_docs = self_reflect(query, initial_response)
# Augment the prompt with the refined context
context = "\n".join(refined_docs)
augmented_query = f"{context}\n\n{query}"
# Generate the final response
final_response = generator(augmented_query, max_new_tokens=100)
print(final_response[0]['generated_text'])
Real-World Applications
The RAG technique, facilitated by frameworks like LangChain and LlamaIndex, is becoming increasingly prevalent in applications such as:
1. Question and Answer Systems
In Q&A systems, the RAG technique enables large language models to provide accurate and contextually relevant answers by retrieving information from external sources. This ensures that the responses are not only based on the model’s training data but also include the latest information available.
2. Conversational Agents
Conversational agents or chatbots use RAG to provide more dynamic and informed interactions. By accessing external databases or proprietary documents, chatbots can offer precise and up-to-date responses, enhancing user experience.
3. Recommendation Systems
In recommendation systems, RAG can enhance the quality and relevance of recommendations by incorporating real-time data and user preferences from various sources. This makes the recommendations more personalized and effective (a brief sketch appears after this list of applications).
4. Content Generation
Content creation tools leverage RAG to produce high-quality and contextually relevant content by accessing up-to-date information from the web or specific databases. This ensures that the generated content is both accurate and current.
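As a rough illustration of the recommendation use case above, the sketch below builds a retrieval query from a user's stated preferences and asks the LLM to recommend from the retrieved candidates. The helper names and data are hypothetical, and the generator pipeline from the earlier examples is reused.
# Hypothetical user profile; in a real system this would come from user activity data
user_preferences = ["science fiction novels", "space exploration documentaries"]
# Placeholder retriever over an item catalog; in practice a vector store with recency filters
def retrieve_candidate_items(preference_query):
    return [f"Candidate item related to: {preference_query}"]
# Build a retrieval query from the user's preferences
preference_query = "items similar to: " + ", ".join(user_preferences)
candidates = retrieve_candidate_items(preference_query)
# Ask the LLM for a recommendation grounded in the retrieved candidates
prompt = "\n".join(candidates) + "\n\nRecommend one item for this user and explain why."
recommendation = generator(prompt, max_new_tokens=100)
print(recommendation[0]['generated_text'])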
Conclusion
The rise of LLMs has brought about significant advancements and challenges. While they offer powerful predictive capabilities, ensuring their responses are up-to-date, accurate, and contextually aware is crucial. Techniques like RAG and its advanced variations are instrumental in overcoming these challenges, paving the way for more reliable and versatile AI applications.
For more insights, it is worth exploring detailed implementations of these techniques in frameworks such as LangChain and LlamaIndex.
By continually refining these techniques, we can unlock the full potential of LLMs, making them indispensable tools for the future of AI-driven knowledge and innovation.