
Why RAG Applications Fail in Production

It worked as a prototype; then it all fell apart in production.

Mandar Karhade, MD. PhD.
Published in Towards AI
7 min read · Mar 20, 2024

Retrieval-Augmented Generation (RAG) applications have emerged as powerful tools in the landscape of Large Language Models (LLMs), enhancing their capabilities by integrating external knowledge. Despite their promise, RAG applications often face challenges when transitioning from prototype to production environments. This article delves into the intricacies of RAG applications, exploring common pitfalls and strategic insights for successful deployment.

From Prototype to Production

Deploying RAG applications in a production setting is fraught with challenges. The complexity of integrating generative LLMs with retrieval mechanisms means that any number of elements can malfunction, leading to potential system failures. For instance, the scalability and robustness of the system are crucial; it must handle unpredictable loads and remain operational under high demand. Moreover, predicting user interactions with the system in a live environment is challenging, necessitating continuous monitoring and adaptation to maintain performance and reliability​.

Source: https://medium.com/@vipra_singh/building-llm-applications-retrieval-search-part-5-c83a7004037d

Types of RAG Models

Based on Retrieval Method: RAG models can be categorized by the retrieval method they use, such as BM25 (a traditional information-retrieval ranking function) or more advanced dense retrievers that leverage neural network-based embeddings to find relevant documents. The choice of retriever impacts how well the model can fetch pertinent information from a corpus.
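
For a concrete sense of the difference, here is a minimal sketch contrasting BM25 with a dense retriever on a toy corpus. It assumes the `rank_bm25` and `sentence-transformers` packages; the corpus and model name are only illustrative choices.

```python
# Contrast sparse (BM25) and dense retrieval over a toy corpus.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "Metformin is a first-line treatment for type 2 diabetes.",
    "Transformers use self-attention to model token interactions.",
    "BM25 ranks documents by term frequency and inverse document frequency.",
]
query = "How is type 2 diabetes treated?"

# Sparse retrieval: lexical overlap scored by BM25.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
bm25_scores = bm25.get_scores(query.lower().split())

# Dense retrieval: cosine similarity between neural embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice
doc_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
dense_scores = util.cos_sim(query_emb, doc_emb)[0].tolist()

for doc, sparse, dense in zip(corpus, bm25_scores, dense_scores):
    print(f"BM25={sparse:.2f}  dense={dense:.2f}  {doc}")
```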

Based on Generation Mechanism: The generative component of RAG usually employs transformer-based models like BART (as in the original RAG paper), GPT-2, or GPT-3. These models generate responses based on the retrieved documents, tailoring the output to be contextually relevant and detailed.

Sequential vs. Parallel Processing: Some RAG models process the retrieval and generation steps sequentially, where the system first retrieves all relevant documents and then generates a response based on them. Others employ a more interleaved approach, continuously retrieving and generating in an intertwined fashion.
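
The two control flows can be contrasted in a few lines; this is a conceptual sketch only, where `retrieve`, `generate`, and `is_complete` are hypothetical stand-ins for your own retriever, LLM call, and stopping check.

```python
def sequential_rag(query, retrieve, generate):
    """Retrieve once, then generate a single answer from the full context."""
    docs = retrieve(query)  # retrieve() returns a list of documents
    return generate(query, docs)

def iterative_rag(query, retrieve, generate, is_complete, max_rounds=3):
    """Interleave retrieval and generation; each partial answer seeds the next retrieval."""
    context, answer = [], ""
    for _ in range(max_rounds):
        context += retrieve(query if not answer else f"{query} {answer}")
        answer = generate(query, context)
        if is_complete(answer):  # caller-supplied check for a sufficiently grounded answer
            break
    return answer
```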

Fine-tuning Approaches: RAG models can be fine-tuned in different ways to adapt both the retrieval and generation processes to specific tasks. This includes fine-tuning on specific knowledge-intensive tasks, where the model learns to better align the retrieved information with the generated text to answer questions or provide explanations​

Configuration and Customization

RAG configurations allow for extensive customization to tailor the models to specific needs. Key configuration options include:

  • Number of Documents to Retrieve (n_docs): Defines how many documents the retriever should fetch, which can influence the breadth of information considered in generating responses.
  • Max Combined Length (max_combined_length): Limits the total length of the context used for generating a response, affecting the detail and scope of the generated text.
  • Retrieval Vector Size: Determines the size of the embeddings used for retrieval, impacting the granularity of semantic matching between queries and documents.
  • Retrieval Batch Size: Specifies how many retrieval queries are processed simultaneously, influencing retrieval speed and efficiency (see the configuration sketch below).
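
These parameter names mirror the ones exposed by the Hugging Face transformers RAG classes; the following is a minimal configuration sketch under that assumption, with an illustrative checkpoint and values.

```python
from transformers import RagConfig, RagSequenceForGeneration, RagTokenizer

config = RagConfig.from_pretrained(
    "facebook/rag-sequence-nq",
    n_docs=5,                   # how many documents the retriever returns
    max_combined_length=300,    # cap on combined query + retrieved-context length
    retrieval_vector_size=768,  # dimensionality of the retrieval embeddings
    retrieval_batch_size=8,     # retrieval queries processed per batch
)
tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", config=config)
# For end-to-end generation, a RagRetriever over your document index would also be wired in.
```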

Applications and Considerations

RAG models are particularly effective in applications requiring deep knowledge integration and contextual understanding, such as legal research, scientific literature review, and complex customer service queries. The integration of retrieval and generation processes allows RAG models to provide accurate, detailed, and contextually relevant responses that are grounded in external sources of information.

Key Areas of Failure and Mitigation Strategies

Retrieval Quality

Effective retrieval is foundational to RAG’s success. Ensuring that the system retrieves documents that are both relevant and diverse in response to queries is crucial; failures here lead to inaccurate or irrelevant responses, undermining the system’s utility and user trust. Retrieval is generally done with some sort of similarity metric, and the algorithm matters: cosine similarity gives a reasonable general-purpose match but will likely fall short in domain-specific applications. Especially in healthcare, be prepared to use multi-query retrievers, self-query retrievers, or even ensemble retrievers, as in the sketch below.
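
As one possible setup, here is a hedged sketch using LangChain-style retrievers (exact import paths and retriever methods vary across LangChain versions); the corpus, model choices, and weights are illustrative, not a recommendation.

```python
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.retrievers import EnsembleRetriever
from langchain.retrievers.multi_query import MultiQueryRetriever

docs = [
    "Patient education material on type 2 diabetes management.",
    "Clinical note describing a polyuria and polydipsia workup.",
]  # placeholder corpus

bm25 = BM25Retriever.from_texts(docs)                               # lexical matching
dense = FAISS.from_texts(docs, OpenAIEmbeddings()).as_retriever()   # embedding matching

# Ensemble: blend sparse and dense scores instead of trusting cosine similarity alone.
ensemble = EnsembleRetriever(retrievers=[bm25, dense], weights=[0.4, 0.6])

# Multi-query: have an LLM rephrase the question into several variants and merge results,
# which helps with domain-specific phrasing.
multi_query = MultiQueryRetriever.from_llm(
    retriever=ensemble, llm=ChatOpenAI(model="gpt-4o-mini")  # illustrative model choice
)
results = multi_query.invoke("patient presents with polyuria and polydipsia")
```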

Hallucinations

RAG systems sometimes generate information not grounded in the retrieved documents, a phenomenon known as hallucinations. These can severely impact the credibility and accuracy of the system, necessitating robust mechanisms to filter out noise and integrate information from multiple sources to provide coherent and accurate responses.
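
One cheap line of defense is to flag generated sentences that share almost no content words with the retrieved context. The heuristic below is only an illustrative sketch; production systems typically rely on an NLI model or an LLM judge instead.

```python
import re

def ungrounded_sentences(answer: str, retrieved_docs: list[str], min_overlap: float = 0.5):
    """Return sentences whose content words barely overlap with the retrieved context."""
    context_words = set(re.findall(r"[a-z]{4,}", " ".join(retrieved_docs).lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer):
        words = set(re.findall(r"[a-z]{4,}", sentence.lower()))
        if words and len(words & context_words) / len(words) < min_overlap:
            flagged.append(sentence)  # likely not supported by the retrieved documents
    return flagged

docs = ["Metformin is a first-line treatment for type 2 diabetes."]
answer = "Metformin is a first-line treatment. It was invented in 1820 by Napoleon."
print(ungrounded_sentences(answer, docs))  # flags the obviously fabricated sentence
```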

Privacy and Security Concerns

Privacy breaches and security vulnerabilities are significant risks, especially when handling sensitive information. RAG applications must be designed to prevent unauthorized disclosure of personal or confidential data and to resist manipulative attacks that could compromise the system’s integrity. This is a particular pain point in enterprise applications. In practice, it is not only about whether you have safeguarded your application, but about whether you have dotted every i and crossed every t: you will have to prove that you have done everything you could to protect enterprise data.
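
A common minimum step is to redact obvious PII before queries or documents ever reach an external LLM API. The patterns below are a deliberately simple illustration, not a substitute for a vetted PII-detection tool.

```python
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace matched PII spans with bracketed labels before the text leaves your boundary."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach John at john.doe@example.com or 555-867-5309, SSN 123-45-6789."))
```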

Malicious Use and Content Safety

Ensuring that RAG applications do not facilitate illegal activities or generate harmful content is essential. This includes implementing safeguards against creating or disseminating content that could be used for malicious purposes. This will not be a concern for every enterprise user and use case, since most enterprise deployments cater to a specific audience using a specific dataset; no enterprise will take the risk of exposing all of its information through a RAG.

Domain-specific Challenges

RAG applications tailored for specific domains must handle out-of-domain queries effectively, ensuring they provide relevant and accurate responses even when queries fall outside their primary knowledge base. More on this later in the success section, but it is a pain in the butt. In short, for your domain niche, you should consider using a domain-specific large model in conjunction with a generalized large model like OpenAI, Claude, or whatever you prefer.
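
One way to handle this is score-based routing: if the best retrieval score for a query falls below a threshold, hand the query to a general-purpose model instead of the domain pipeline. In the sketch below, `embed`, `domain_rag_answer`, and `general_llm_answer` are hypothetical stand-ins for your own components, and the threshold is illustrative.

```python
import numpy as np

def route(query, domain_doc_embeddings, embed, domain_rag_answer, general_llm_answer,
          threshold=0.35):
    """Route to the domain RAG pipeline only if the query looks covered by the domain corpus."""
    q = embed(query)  # 1-D vector; domain_doc_embeddings is a 2-D array, one row per document
    sims = domain_doc_embeddings @ q / (
        np.linalg.norm(domain_doc_embeddings, axis=1) * np.linalg.norm(q) + 1e-9
    )
    if sims.max() < threshold:          # best match is weak: treat as out-of-domain
        return general_llm_answer(query)
    return domain_rag_answer(query)     # query is covered by the domain corpus
```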

Completeness and Brand Integrity

The completeness of responses and the maintenance of brand integrity are vital for user satisfaction and trust. RAG systems should deliver comprehensive and contextually appropriate answers while avoiding content that could damage the brand’s reputation.

Technical and Operational Issues

Choices such as recursive retrieval, sentence-window retrieval, and the balancing act between self-hosted and API-based LLM deployments can significantly affect the performance and cost-effectiveness of RAG applications. Each of these elements requires careful consideration to optimize retrieval accuracy and system efficiency.
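
As an example of one of these techniques, here is a minimal sketch of sentence-window retrieval: index individual sentences for precise matching, but hand the generator each hit together with its neighbouring sentences. The `embed` function is a placeholder for whatever embedding model you use.

```python
import re
import numpy as np

def build_windows(document: str, window: int = 1):
    """Pair each sentence with a context window of its neighbours."""
    sentences = re.split(r"(?<=[.!?])\s+", document)
    return [
        (sent, " ".join(sentences[max(0, i - window): i + window + 1]))
        for i, sent in enumerate(sentences)
    ]

def retrieve_with_window(query, windows, embed, top_k=2):
    """Score the individual sentences, but return the expanded windows for generation."""
    sent_embs = np.array([embed(sent) for sent, _ in windows])
    q = embed(query)
    sims = sent_embs @ q / (np.linalg.norm(sent_embs, axis=1) * np.linalg.norm(q) + 1e-9)
    best = np.argsort(-sims)[:top_k]
    return [windows[i][1] for i in best]
```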

Strategies for Success

To mitigate these risks, RAG applications should undergo reasonably extensive planning; there is no better solution than anticipating the future. They should also undergo extensive testing across multiple scenarios, including retrieval quality, hallucination prevention, privacy protection, and security.
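
A lightweight way to make that testing repeatable is a scenario-based regression suite that pins known queries to expected sources and facts. The sketch below assumes a hypothetical `rag_pipeline` callable that returns an answer plus the IDs of the retrieved documents; the scenario contents and document IDs are illustrative.

```python
SCENARIOS = [
    {"query": "What is the first-line treatment for type 2 diabetes?",
     "must_retrieve": {"guideline_042"}, "must_mention": "metformin"},
    {"query": "Who won the 1998 World Cup?",        # out-of-domain probe
     "must_retrieve": set(), "must_mention": None},
]

def run_regression(rag_pipeline):
    """Run each pinned scenario and collect (query, reason) pairs for any failures."""
    failures = []
    for case in SCENARIOS:
        answer, retrieved_ids = rag_pipeline(case["query"])
        if not case["must_retrieve"] <= set(retrieved_ids):
            failures.append((case["query"], "missing expected source document"))
        if case["must_mention"] and case["must_mention"] not in answer.lower():
            failures.append((case["query"], "expected fact absent from answer"))
    return failures
```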

The real world is the Achilles’ heel of almost all production data products. Yes, using the same old PubMed data will give you a working pipeline, but when a RAG has to interact with real data from different journals, it will fail miserably. Monitoring and updating the system based on real-world usage and feedback are crucial for continuous improvement. It is important to build a RAG at a smaller scale using real data from various sources and then scale it up. In today’s world, compute and storage are cheap, so focus on information security and infrastructure: SSO integration, SOC 2 certification, and the like, so that once you build a RAG you can share it with clients with confidence.

Furthermore, selecting the right technical infrastructure, ensuring data quality, and implementing robust security measures are key to successful RAG deployment in production environments. Think about your future data pipelines: come up with “what if” scenarios and build your documentation and codebase accordingly. Nobody talks about it, but write your contracts so that clients are aware of the possible failure modes and the fail-safes in place.

If you are dealing with a specific domain, know that the model you use to create embeddings matters. I have seen a drive toward using smaller models for embeddings; however, if that model’s vocabulary does not cover the keywords of your domain, you are doomed. Yes, this means spending a bit more, or, if your pockets are lined well, spending a lot more to build your own LLM that maintains the vocabulary of your domain. Simple fine-tuning of the model will not fix this problem. And remember: the more you try to retrieve, the worse the performance.
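
A quick diagnostic, assuming the Hugging Face transformers tokenizers, is to check how badly a general-purpose tokenizer fragments your domain terms; heavy fragmentation is a warning sign that the corresponding embeddings may not separate your concepts well. The terms and model below are illustrative.

```python
from transformers import AutoTokenizer

domain_terms = ["pembrolizumab", "HbA1c", "electrocardiogram"]  # illustrative domain vocabulary
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder general-purpose model

for term in domain_terms:
    pieces = tokenizer.tokenize(term)
    print(f"{term!r} -> {pieces} ({len(pieces)} sub-tokens)")  # many sub-tokens = poor coverage
```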

The last one is brand integrity. Let’s call a spade a spade: a brand is a made-up identity, and we want to mimic it as closely as we can. Consider this task the last icing on the RAG cake. First, get the task completed, extract what you need, and generate flat text from the abstractions so you have an accuracy metric. Then, and only then, ask for it to be reworded into brand lingo.
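
As a sketch of that two-pass flow, using the OpenAI Python client (openai>=1.0) purely as an example backend: generate a plain, fact-only answer from the retrieved context first, run your accuracy checks on it, and only then reword it into brand voice. The model name and prompts are illustrative.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_then_brand(query: str, context: str, brand_guide: str) -> str:
    # Pass 1: flat, fact-only answer grounded in the retrieved context.
    flat = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user",
                   "content": f"Answer strictly from this context:\n{context}\n\nQuestion: {query}"}],
    ).choices[0].message.content
    # ...run accuracy and groundedness checks on `flat` here before rewording...
    # Pass 2: reword the approved answer into brand voice without changing facts.
    branded = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Reword the following answer in our brand voice.\n"
                              f"Brand guide: {brand_guide}\nDo not add or remove facts.\n\n{flat}"}],
    ).choices[0].message.content
    return branded
```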

Good luck! If your company needs help, find me on LinkedIn: https://www.linkedin.com/in/mandarkarhade/

If you have read it until this point — Thank you! You are a hero (and a Nerd ❤)! I try to keep my readers up to date with “interesting happenings in the AI world,” so please 🔔 clap | follow | Subscribe 🔔

Read my other work

Mandar Karhade, MD. PhD. · Life Sciences AI/ML/GenAI advisor · Towards AI