
Why RAG Applications Fail in Production

It worked as a prototype; then it all went down!

Mandar Karhade, MD, PhD
Published in Towards AI
7 min read · Mar 20, 2024

Retrieval-Augmented Generation (RAG) applications have emerged as powerful tools in the landscape of Large Language Models (LLMs), enhancing their capabilities by integrating external knowledge. Despite their promise, RAG applications often face challenges when transitioning from prototype to production environments. This article delves into the intricacies of RAG applications, exploring common pitfalls and strategic insights for successful deployment.

From Prototype to Production

Deploying RAG applications in a production setting is fraught with challenges. The complexity of integrating generative LLMs with retrieval mechanisms means that any number of elements can malfunction, leading to potential system failures. For instance, the scalability and robustness of the system are crucial; it must handle unpredictable loads and remain operational under high demand. Moreover, predicting user interactions with the system in a live environment is challenging, necessitating continuous monitoring and adaptation to maintain performance and reliability​.

Source: https://medium.com/@vipra_singh/building-llm-applications-retrieval-search-part-5-c83a7004037d

Types of RAG Models

Based on Retrieval Method: RAG models can be categorized by the retrieval method they use, such as BM25 (a traditional information-retrieval scoring function) or more advanced dense retrievers that leverage neural-network embeddings to find relevant documents. The choice of retriever impacts how well the model can fetch pertinent information from a corpus.
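To make the lexical side concrete, here is a minimal, from-scratch sketch of the Okapi BM25 scoring function mentioned above. The toy corpus and the tokenization-by-whitespace are illustrative only; production systems would use a tuned library implementation and a real tokenizer.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document in corpus_tokens against the query with Okapi BM25."""
    N = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / N
    df = Counter()                      # document frequency of each term
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf, dl, score = Counter(doc), len(doc), 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * dl / avgdl))
            score += idf * norm
        scores.append(score)
    return scores

corpus = [
    "retrieval augmented generation grounds answers in documents",
    "large language models generate fluent text",
    "bm25 is a classic lexical retrieval function",
]
docs = [text.split() for text in corpus]
scores = bm25_scores("lexical retrieval function".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)  # index of top-ranked doc
```

Dense retrievers replace the term-overlap scoring above with similarity between learned embedding vectors, which is why they can match paraphrases that BM25 misses.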

Based on Generation Mechanism: The generative component of RAG usually employs transformer-based models like BERT, GPT-2, or GPT-3. These models generate responses based on the documents retrieved, tailoring the output to be contextually relevant and detailed.

Sequential vs. Parallel Processing: Some RAG models process the retrieval and generation steps sequentially, where the system first retrieves all relevant documents and then generates a response based on them. In contrast, others might employ a more parallel approach, continuously retrieving and generating in a more intertwined fashion.
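The sequential variant can be sketched in a few lines. Everything here is a stand-in: the "retriever" is naive word overlap and the "generator" is a string template, standing in for a vector store and an LLM call respectively.

```python
# Sequential RAG: retrieve first, then generate conditioned on the hits.
CORPUS = {
    "doc1": "RAG grounds LLM answers in retrieved documents.",
    "doc2": "BM25 ranks documents by lexical overlap with the query.",
}

def retrieve(query, n_docs=1):
    """Rank documents by naive word overlap and return the top n_docs ids."""
    def overlap(text):
        return len(set(query.lower().split()) & set(text.lower().split()))
    ranked = sorted(CORPUS.items(), key=lambda kv: overlap(kv[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:n_docs]]

def generate(query, doc_ids):
    """Stand-in for an LLM call: stitch retrieved context into a prompt/answer."""
    context = " ".join(CORPUS[d] for d in doc_ids)
    return f"Answer to '{query}' based on: {context}"

hits = retrieve("how does BM25 rank documents")
answer = generate("how does BM25 rank documents", hits)
```

A parallel or interleaved design would instead call `retrieve` again mid-generation whenever the model needs more evidence, at the cost of extra latency per step.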

Fine-tuning Approaches: RAG models can be fine-tuned in different ways to adapt both the retrieval and generation processes to specific tasks. This includes fine-tuning on specific knowledge-intensive tasks, where the model learns to better align the retrieved information with the generated text to answer questions or provide explanations.

Configuration and Customization

RAG configurations allow for extensive customization to tailor the models to specific needs. Key configuration options include:

  • Number of Documents to Retrieve (n_docs): Defines how many documents the retriever should fetch, which can influence the breadth of information considered in generating responses.
  • Max Combined Length (max_combined_length): Limits the total length of the context used for generating a response, affecting the detail and scope of the generated text.
  • Retrieval Vector Size: Determines the size of the embeddings used for retrieval, impacting the granularity of semantic matching between queries and documents.
  • Retrieval Batch Size: Specifies how many retrieval queries are processed simultaneously, influencing retrieval speed and efficiency.
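An illustrative configuration tying these knobs together. The field names mirror Hugging Face's `RagConfig`, but the values are made up and should be tuned per application:

```python
# Hypothetical RAG configuration; values are illustrative defaults, not recommendations.
rag_config = {
    "n_docs": 5,                   # documents fetched per query
    "max_combined_length": 300,    # token budget for retrieved context
    "retrieval_vector_size": 768,  # embedding dimensionality
    "retrieval_batch_size": 8,     # queries embedded/retrieved at once
}

def validate(cfg):
    """Cheap sanity checks before wiring the config into a pipeline."""
    assert cfg["n_docs"] >= 1, "need at least one document"
    assert cfg["max_combined_length"] > 0, "context budget must be positive"
    assert cfg["retrieval_batch_size"] >= 1, "batch size must be positive"
    return cfg

validate(rag_config)
```

Raising `n_docs` widens the evidence pool but eats into `max_combined_length`, so the two should be tuned together.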

Applications and Considerations

RAG models are particularly effective in applications requiring deep knowledge integration and contextual understanding, such as legal research, scientific literature review, and complex customer service queries. The integration of retrieval and generation processes allows RAG models to provide accurate, detailed, and contextually relevant responses that are grounded in external sources of information.

Key Areas of Failure and Mitigation Strategies

Retrieval Quality

Effective retrieval is foundational to RAG’s success. Ensuring that the system retrieves documents that are both relevant and diverse in response to queries is crucial; failures here lead to inaccurate or irrelevant responses, undermining the system’s utility and user trust. Retrieval is generally done with some sort of similarity metric, and the algorithm matters! Cosine similarity over general-purpose embeddings gives a reasonable broad match but will likely fail in domain-specific applications. Especially in healthcare, be prepared to use multi-query retrievers, self-query retrievers, or even ensemble retrievers.
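One common way to build an ensemble retriever is reciprocal rank fusion (RRF): merge the rankings produced by, say, a lexical retriever and a dense retriever. A minimal sketch, with the two input rankings hard-coded as stand-ins for real retriever output:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: merge several ranked doc-id lists into one.
    Each document scores 1/(k + rank + 1) per list; higher total wins."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["d3", "d1", "d2"]   # e.g. BM25 order (illustrative)
dense   = ["d1", "d2", "d3"]   # e.g. embedding-similarity order (illustrative)
fused = rrf([lexical, dense])
```

Because RRF only consumes ranks, not raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.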


Hallucinations

RAG systems sometimes generate information not grounded in the retrieved documents, a phenomenon known as hallucination. Hallucinations can severely impact the credibility and accuracy of the system, necessitating robust mechanisms to filter out noise and integrate information from multiple sources into coherent, accurate responses.
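A crude first line of defense is a groundedness check: flag answer sentences whose content words mostly do not appear in any retrieved source. Real systems use an NLI model or an LLM judge for this; the lexical version below is only a sketch of the idea, with a made-up threshold.

```python
def grounded_check(answer, sources, threshold=0.5):
    """Return answer sentences whose content-word overlap with the
    retrieved sources falls below `threshold` (i.e. likely hallucinated)."""
    src_words = set()
    for s in sources:
        src_words |= set(s.lower().split())
    flagged = []
    for sent in answer.split("."):
        words = [w for w in sent.lower().split() if len(w) > 3]  # skip stopword-ish tokens
        if not words:
            continue
        support = sum(w in src_words for w in words) / len(words)
        if support < threshold:
            flagged.append(sent.strip())
    return flagged

sources = ["The Eiffel Tower is 330 metres tall"]
answer = "The Eiffel Tower is 330 metres tall. It was painted bright green last year."
flagged = grounded_check(answer, sources)  # second sentence has no support
```

Flagged sentences can then be dropped, re-generated, or escalated to a stronger verifier before the response reaches the user.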

Privacy and Security Concerns

Privacy breaches and security vulnerabilities are significant risks, especially when handling sensitive information. RAG applications must be designed to prevent unauthorized disclosure of personal or confidential data and to resist manipulative attacks that could compromise the system’s integrity. This is a particular pain point in enterprise applications. In practice, it is not only about whether you have safeguarded your application, but about whether you have dotted all the i’s and crossed all the t’s: you will have to prove that you have done everything you could to protect enterprise data.

Malicious Use and Content Safety

Ensuring that RAG applications do not facilitate illegal activities or generate harmful content is essential. This includes implementing safeguards against creating or disseminating content that can be used for malicious purposes. This will likely not be a concern for every enterprise user and use case, since those deployments are catered to a specific audience using specific data; no enterprise will take the risk of exposing all of its information through a RAG.

Domain-specific Challenges

RAG applications tailored for specific domains must handle out-of-domain queries effectively, ensuring they provide relevant and accurate responses even when queries fall outside their primary knowledge base. More on this later in the success section, but it’s a pain in the butt. In short, for the niche of your domain, you’d better consider using a domain-specific large model in conjunction with a generalized large model like OpenAI’s GPT, Claude, or whatever.
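One simple way to pair a domain model with a generalist is vocabulary-based routing: send a query to the domain model only when it touches domain terms. Everything here is a hypothetical stand-in; `DOMAIN_TERMS` and the two model callables do not correspond to any particular vendor's API.

```python
# Toy domain vocabulary; in practice this would come from a domain ontology.
DOMAIN_TERMS = {"dosage", "contraindication", "biopsy", "comorbidity"}

def route(query, domain_model, general_model):
    """Send vocabulary-matching queries to the domain model, the rest to the generalist."""
    if set(query.lower().split()) & DOMAIN_TERMS:
        return domain_model(query)
    return general_model(query)

# Lambdas stand in for real model calls.
picked = route("what is the dosage for metformin",
               domain_model=lambda q: "domain",
               general_model=lambda q: "general")
```

A production router would use a classifier or embedding similarity instead of exact word matching, but the shape of the dispatch is the same.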

Completeness and Brand Integrity

The completeness of responses and maintenance of brand integrity are vital for user satisfaction and trust. RAG systems should deliver comprehensive and contextually appropriate answers while avoiding content that could damage the brand’s reputation.

Technical and Operational Issues

Choices such as recursive retrieval, sentence-window retrieval, and the balancing act between self-hosted and API-based LLM deployments can significantly affect the performance and cost-effectiveness of RAG applications. Each of these elements requires careful consideration to optimize retrieval accuracy and system efficiency.

Strategies for Success

To mitigate these risks, RAG applications should undergo reasonably extensive planning; there is no better strategy than anticipating the future. They should also undergo extensive testing across multiple scenarios, including retrieval quality, hallucination prevention, privacy protection, and security.

The real world is the Achilles’ heel of almost all production data products. Yes, using the same old PubMed data will give you a working pipeline, but when the RAG has to interact with real data from different journals, it will fail miserably. Monitoring and updating the system based on real-world usage and feedback are crucial for continuous improvement. It is important to build a RAG at a smaller scale using real data from various sources and then scale it up. In today’s world, compute and storage are cheap, so focus on information security, infrastructure, and things like SSO integration and SOC 2 certification, so that once you build a RAG you will be able to share it with clients with confidence.

Furthermore, selecting the right technical infrastructure, ensuring data quality, and implementing robust security measures are key to successful RAG deployment in production environments. Think about your future data pipelines, come up with “what if” scenarios, and build your documentation and codebase accordingly. Nobody talks about it, but write your contracts so that your clients are aware of the possible failure modes and the fail-safes in place.

If you are dealing with a specific domain, know that the model you use to create embeddings matters. I have seen a drive toward using smaller models for embeddings; however, if that model’s vocabulary does not cover the keywords of your domain, you are doomed. Yes, this means spending a bit more, or, if your pockets are lined well, spending a lot more to build your own LLM that maintains your domain’s vocabulary. Simple fine-tuning of the model will not fix this problem. Remember: the more you try to retrieve, the worse the performance.
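A quick way to gauge the vocabulary problem is to measure how badly a tokenizer fragments your domain terms: a term that splits into many subword pieces is one the embedding model never saw whole, which tends to hurt retrieval. The greedy longest-match tokenizer and tiny vocabulary below are toy stand-ins for a real BPE/WordPiece tokenizer; with a real model you would run its own tokenizer over your term list.

```python
# Toy subword vocabulary standing in for a real tokenizer's vocab file.
VOCAB = {"card", "io", "my", "o", "pathy", "heart", "failure", "hyper", "tension"}

def tokenize(word, vocab):
    """Greedy longest-prefix-match segmentation, WordPiece-style."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown-character fallback
            i += 1
    return pieces

def fragmentation(terms, vocab):
    """Pieces per term: high counts signal poor vocabulary coverage."""
    return {t: len(tokenize(t, vocab)) for t in terms}

report = fragmentation(["cardiomyopathy", "hypertension", "heart"], VOCAB)
```

If your key clinical or legal terms routinely shatter into four or five pieces while common words stay whole, that is the signal that a general-purpose embedding model will struggle with your domain.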

The last one is brand integrity. Let’s call a spade a spade: a brand is a made-up identity, and we want to mimic it as closely as we can. Consider this task the last icing on the RAG cake. First, get the task completed, extract what you need, and generate plain text from the abstractions so you have an accuracy metric. Then, and only then, ask for it to be reworded into brand lingo.

Good luck! If your company needs help, find me on LinkedIn: https://www.linkedin.com/in/mandarkarhade/

If you have read it until this point — Thank you! You are a hero (and a Nerd ❤)! I try to keep my readers up to date with “interesting happenings in the AI world,” so please 🔔 clap | follow | Subscribe 🔔

Read my other work

Mandar Karhade, MD, PhD
Life Sciences AI/ML/GenAI advisor