
Ilya Rice: How I Won the Enterprise RAG Challenge

From Zero to SoTA in a Single Competition

In this guest blog post Ilya Rice describes the approach that helped him build the best RAG and win the Enterprise RAG Challenge. He took first place in both prize categories and on the SotA leaderboard. Source code.

Also posted at TimeToAct Austria and on Habr (RU).

What is the RAG Challenge about?

The task was to create a question-answering system based on annual company reports. Briefly, the process on the competition day was as follows:

  1. You're given 100 annual reports from randomly selected companies and 2.5 hours to parse them and build a database. The reports are PDFs, each up to 1000 pages long.
  2. Then, 100 random questions are generated (based on predefined templates), which your system must answer as quickly as possible.

All questions must have definitive answers, such as:

  • Yes/No;
  • Company name (or multiple company names in some cases);
  • Titles of leadership positions, launched products;
  • Numeric metrics: revenue, store count, etc.

Each answer must include references to pages containing evidence of the response, ensuring the system genuinely derived the answer rather than hallucinating.

Winning system architecture:

ilya-20250325101541.png

Apart from basic steps, the winning solution incorporates two routers and LLM reranking.

You can check out the questions and answers produced by my best-performing system here.

Now, I'll delve into every step involved in building the system, the bumps and bruises I experienced along the way, and the best practices discovered during this process.

Quick Guide to RAG

RAG (Retrieval-Augmented Generation) is a method that extends the capabilities of Large Language Models (LLMs) by integrating them with a knowledge base of any size.

The development pathway of a basic RAG system includes the following stages:

  1. Parsing: Preparing data for the knowledge base by collecting documents, converting them to text format, and cleaning out irrelevant noise.
  2. Ingestion: Creating and populating the knowledge base.
  3. Retrieval: Building a tool that finds and returns relevant data based on user queries, typically employing semantic search within a vector database.
  4. Answering: Enriching the user's prompt with retrieved data, sending it to the LLM, and returning the final answer.

1. Parsing

To start populating any database, PDF documents must first be converted to plain text. PDF parsing is an extremely non-trivial task filled with countless subtle difficulties:

  • preserving table structures;
  • retaining critical formatting elements (e.g., headings and bullet lists);
  • recognizing multi-column text;
  • handling charts, images, formulas, headers/footers, and so on.

Interesting PDF parsing issues I encountered (but didn't have time to solve):

  • Large tables are sometimes rotated by 90 degrees, causing parsers to produce garbled and unreadable text.

ilya-20250325101714.png

  • Charts composed partially of images and partially of text layers.
  • Some documents had font encoding issues: visually, the text looks fine, but attempting to copy or parse it results in a nonsensical set of characters.

ilya-20250325102000.png

Fun fact: I investigated this issue separately and discovered that the text could be decoded—it was a Caesar cipher with varying ASCII shifts per word. This raised numerous questions for me. If someone intentionally scrambled a publicly available company report to prevent copying, why? If the font broke during conversion, why precisely this way?

Choosing a Parser

I experimented with about two dozen PDF parsers:

  • niche parsers;
  • reputable ones;
  • cutting-edge ML-trained parsers;
  • proprietary parsers with API access.

I can confidently state that currently, no parser can handle all nuances and fully return PDF content as text without losing part of the important information along the way.

The best-performing parser for the RAG Challenge turned out to be the relatively well-known Docling. Interestingly, one of the competition organizers—IBM—is behind its development.

Parser Customization

Despite its excellent results, Docling lacked some essential capabilities. These features existed partially, but in separate configurations that couldn't be combined into one.

Therefore, I rolled up my sleeves, thoroughly examined the library's source code, and rewrote several methods to fit my needs, obtaining a JSON containing all necessary metadata after parsing. Using this JSON, I constructed a Markdown document with corrected formatting and near-perfect conversion of table structures from PDF into not just MD but also HTML format, which proved important later on.
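For reference, the default Docling flow (before any custom method overrides) looks roughly like the sketch below; the file path is a placeholder and GPU/accelerator options are omitted:

```python
from docling.document_converter import DocumentConverter

# Minimal, out-of-the-box Docling usage, without the custom overrides described above.
# "report.pdf" is a placeholder path.
converter = DocumentConverter()
result = converter.convert("report.pdf")

# Docling exposes a structured document object; exporting to Markdown is one option.
# The metadata JSON mentioned above was built by post-processing this object.
markdown_text = result.document.export_to_markdown()
print(markdown_text[:500])
```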

This library is quite fast, but still not fast enough to parse 15 thousand pages within 2.5 hours on a personal laptop. To solve this, I leveraged GPU acceleration for parsing and rented a virtual machine with a 4090 GPU for 70 cents an hour for the competition.

ilya-20250325102041.png Runpod turned out to be extremely convenient for short-term GPU rentals

Parsing all 100 documents took about 40 minutes, which, based on reports and comments from other participants, is an extremely high parsing speed.


At this stage, we have reports parsed into JSON format.

Can we now populate the database?

Not yet. First, we must clean the text of noise and preprocess the tables.

Text Cleaning and Table Preparation

Sometimes parts of the text get parsed incorrectly from PDFs and contain specific syntax, reducing readability and meaningfulness. I addressed this using a batch of a dozen regular expressions.

ilya-20250325102115.png Example of poorly parsed text

Documents with the aforementioned Caesar cipher were also detected via regex patterns. I tried to decode them, but even after restoration, they contained many artifacts. Therefore, I simply ran these documents entirely through OCR.

Table Serialization

In large tables, the metric name (horizontal header) is often positioned too far from vertical headers, weakening semantic coherence.

ilya-20250325102151.png There are 1,500 irrelevant tokens separating vertical and horizontal headers

This significantly reduces the chunk's relevance in vector search (let alone situations where the table doesn't fit entirely into one chunk). Additionally, LLMs struggle to match metric names with headers in large tables, possibly returning a wrong value.

Serialization of tables became the solution. Research on this topic is sparse, so I had to navigate this independently. You can google Row-wise Serialization, Attribute-Value Pairing, or read this research paper.

The essence of serialization is transforming a large table into a set of small, contextually independent strings.

After extensive experiments with prompts and Structured Output schemas, I found a solution that enabled even GPT-4o-mini to serialize huge tables almost losslessly. Initially, I fed tables to the LLM in Markdown format, but then switched to HTML format (this is where it proved useful!). Language models understand it much better, plus it allows describing tables with merged cells, subheadings, and other structural complexities.

To answer a question like, "What was the company's shareholders' equity in 2021?" it's sufficient to feed the LLM a single sentence rather than a large structure with lots of "noise."

During serialization, the whole table is converted into a set of such independent blocks:

  • subject_core_entity: Shareholders' equity
  • information_block: Shareholders' equity for the years from 2012/3 to 2022/3 are as follows: ¥637,422 million (2012/3), ¥535,422 million (2013/3), ¥679,160 million (2014/3), ¥782,556 million (2015/3), ¥540,951 million (2016/3), ¥571,983 million (2017/3), ¥511,242 million (2018/3), ¥525,064 million (2019/3), ¥513,335 million (2020/3), ¥577,782 million (2021/3), and ¥1,274,570 million (2022/3).

After obtaining a serialized version of the table, I placed it beneath the original table as a kind of textual annotation for each element.
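As a rough illustration, a Structured Output schema for such blocks could look like the sketch below; the field names follow the example above, but the exact schema lives in tables_serialization.py and may differ in detail:

```python
from typing import List
from pydantic import BaseModel, Field

class SerializedInformationBlock(BaseModel):
    """One self-contained statement extracted from a table."""

    subject_core_entity: str = Field(
        description="The main entity or metric the block is about, e.g. a row header"
    )
    information_block: str = Field(
        description="A standalone sentence combining the entity with its values and units"
    )

class SerializedTable(BaseModel):
    """The whole table rewritten as a set of independent blocks."""

    information_blocks: List[SerializedInformationBlock]
```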

You can view the serialization prompt and logic in the project's repository: tables_serialization.py

*Despite serialization's fantastic potential, the winning solution ultimately didn't use it. I'll explain why at the end of the article.

2. Ingestion

Reports have been converted from PDF to clean Markdown text. Now let's create databases from them.

Agreeing on terminology

In the realm of search systems (Google Search, full-text search, Elastic Search, vector search, etc.), a document is a single indexed element returned by the system as a query result. A document could be a sentence, paragraph, page, website, image—doesn't matter. But personally, this definition always confuses me due to the more common, everyday meaning: a document as a report, contract, or certificate.

Therefore, from here on, I'll use document in its everyday meaning.

The element stored in the database I'll call a chunk, since we simply store sliced pieces of text.

Chunking

According to the competition rules, we had to specify the pages containing relevant information. Enterprise systems use the same approach: references allow verifying that the model's answer isn't hallucinated.

This not only makes the system more transparent to users but also simplifies debugging during development.

The simplest option is to use a whole page of a document as a chunk, since pages rarely exceed a couple thousand tokens (although table serialization could expand a page up to five thousand).

But let's think again about the semantic coherence between the query and a chunk of document text. Usually, an informational piece sufficient for an answer is no larger than ten sentences.

Thus, logically, a target statement within a small paragraph will yield a higher similarity score than the same statement diluted within a whole page of weakly relevant text.

I split the text on each page into chunks of 300 tokens (approximately 15 sentences).

To slice the text, I used a recursive splitter with a custom MD dictionary. To avoid losing information cut between two chunks, I added a small text overlap (50 tokens).

If you're worried that overlap won't fully eliminate risks from poor slicing, you can Google "Semantic splitter." This is especially important if you plan to insert only the found chunks into the context.

However, the precision of slicing had almost no effect on my retrieval system.

Each chunk stores its ID and the parent page number in its metadata.
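A minimal sketch of this kind of page-level chunking, assuming LangChain's recursive splitter; the Markdown separator list here is illustrative, not the exact dictionary from the repo:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Token-based recursive splitter: ~300-token chunks with a 50-token overlap.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=300,
    chunk_overlap=50,
    separators=["\n## ", "\n### ", "\n\n", "\n", ". ", " "],  # illustrative MD dictionary
)

def page_to_chunks(page_text: str, page_number: int) -> list[dict]:
    """Split one page into chunks, keeping the parent page number in metadata."""
    return [
        {"id": f"p{page_number}_c{i}", "page": page_number, "text": chunk}
        for i, chunk in enumerate(splitter.split_text(page_text))
    ]
```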

Vectorization

Our collection of chunks is prepared; now let's create the vector database—or rather, databases. 100 databases, where 1 database = 1 document.

Because why mix information from all companies into one heap and later try to separate one company's revenue from another's? The target information for an answer is always strictly within a single document.

We only need to determine which database to query for a given question (more on that later).

To create, store, and search the vector databases, I used FAISS.

A bit about vector database formats

Databases were created with the IndexFlatIP method.

The advantage of Flat indices is that all vectors are stored "as-is," without compression or quantization. Searches use brute force, giving higher precision. The downside is that such searches are significantly more compute- and memory-intensive.

If your database has at least a hundred thousand elements, consider IVFFlat or HNSW. These formats are much faster (though they require a bit more resources when creating the database). But the increased speed comes at the cost of accuracy due to approximate nearest neighbor (ANN) search.

Separating the chunks of all documents into different indexes allowed me to use Flat databases.

IP (inner product) is used to calculate the relevance score through cosine similarity. Aside from IP, there's also L2, which calculates the relevance score via Euclidean distance. IP typically gives better relevance scoring.

To embed chunks and queries into vector representation, I used text-embedding-3-large.
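A bare-bones sketch of how one such per-document index can be built and queried (the question string and helper names are illustrative):

```python
import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed texts with text-embedding-3-large (OpenAI returns unit-normalized vectors)."""
    response = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([item.embedding for item in response.data], dtype="float32")

def build_index(chunk_texts: list[str]) -> faiss.IndexFlatIP:
    """One Flat inner-product index per document (inner product == cosine for normalized vectors)."""
    vectors = embed(chunk_texts)
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    return index

# Usage: query for the 30 best chunk candidates, to be reranked later.
# scores, chunk_ids = index.search(embed(["What was the company's revenue in 2021?"]), 30)
```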

3. Retrieval

After creating our databases, it's time to move on to the "R" (Retrieval) part of our RAG system.

A Retriever is a general search system that takes a query as input and returns relevant text containing the information necessary for an answer.

In the basic implementation, it is simply a query to a vector database, extracting the top_n results.

This is an especially critical part of the RAG system: if the LLM does not receive the necessary information in the context of a query, it cannot provide a correct answer—no matter how well you fine-tune your parsing or answer prompts.

Junk in → Junk out.

The quality of a retriever can be improved in many ways. Here are the methods I explored during the competition:

Hybrid search: vDB + BM25

Hybrid search combines semantic vector-based search with traditional keyword-based text search (BestMatch25). It theoretically improves retrieval accuracy by considering not only the meaning of the text but also precise keyword matches. Typically, results from both methods are merged and reranked by a combined score.

I didn't particularly like this approach: in its minimal implementation, it often reduced retrieval quality instead of improving it.

Generally, hybrid search is a good technique and can be refined further by modifying input queries. At its simplest, LLMs can rephrase questions to remove noise and increase keyword density.

If you've had positive experiences with hybrid search, especially regarding potential issues and solutions, please share in the comments.

In any case, I had more promising alternatives in mind and decided not to explore this direction further.

Cross-encoder reranking

Reranking the results of vector search using Cross-encoder models seemed promising. In short, Cross-encoders give a more precise similarity score but are slower.

Cross-encoders lie between embedding models (bi-encoders) and LLMs. Unlike comparing texts via their vector representations (which inherently lose some information), cross-encoders directly assess semantic similarity between two texts, giving more accurate scores.

However, pairwise comparisons of the query with every database element take too long.

Thus, cross-encoder reranking is suitable only for a small set of chunks already filtered by vector search.

At the last minute, I abandoned this method due to the scarcity of cross-encoder reranking models available via APIs. Neither OpenAI nor other large providers offered them, and I didn't want the hassle of managing another API balance.

But if you're interested in trying cross-encoder reranking, I recommend Jina Reranker. It performs well on benchmarks, and Jina offers a generous number of requests upon registration.

Ultimately, I opted for an even more attractive alternative: LLM reranking!

LLM reranking

Simple enough: pass the text and a question to the LLM and ask, "Is this text helpful for answering the question? How helpful? Rate its relevance from 0 to 1."

Until recently, this approach wasn't viable due to the high cost of powerful LLM models. But now we have fast, cheap, and smart enough LLMs available.

Like cross-encoder reranking, we apply this after initial filtering via vector search.

I developed a detailed prompt describing general guidelines and explicit relevance criteria in increments of 0.1:

  • 0 = Completely Irrelevant: The block has no connection or relation to the query.
  • 0.1 = Virtually Irrelevant: Only a very slight or vague connection to the query.
  • 0.2 = Very Slightly Relevant: Contains an extremely minimal or tangential connection.
  • ...

The LLM query is formatted as Structured Output with two fields: reasoning (allowing the model to explain its judgment) and relevance_score, allowing extraction directly from the JSON without additional parsing.

I further optimized the process by sending three pages at once in one request, prompting the LLM to return three scores simultaneously. This increased speed, reduced cost, and slightly improved scoring consistency, as adjacent blocks of text grounded the model's assessments.

The corrected relevance score was calculated using a weighted average:

vector_weight = 0.3, llm_weight = 0.7
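In code, the combination is essentially a one-liner; a sketch of how the adjusted score could be computed and applied (page dicts and their keys are illustrative):

```python
VECTOR_WEIGHT = 0.3
LLM_WEIGHT = 0.7

def combined_relevance(vector_score: float, llm_score: float) -> float:
    """Blend the cosine-similarity score with the LLM reranker score."""
    return VECTOR_WEIGHT * vector_score + LLM_WEIGHT * llm_score

def rerank(pages: list[dict], keep: int = 10) -> list[dict]:
    """Sort pages by the blended score and keep the best ones."""
    for page in pages:
        page["score"] = combined_relevance(page["vector_score"], page["llm_score"])
    return sorted(pages, key=lambda p: p["score"], reverse=True)[:keep]
```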

In theory, you could bypass vector search and pass every page through the LLM directly. Some participants did just that, successfully. However, I believe a cheaper, faster filter using embeddings is still necessary. For a 1000-page document (and some documents were this large), answering just one question would cost roughly 25 cents—too expensive.

And, after all, we're competing in a RAG challenge, aren't we?

Reranking via GPT-4o-mini cost me less than one cent per question! This approach delivered excellent quality, speed, and cost balance—exactly why I chose it.

Check out the reranking prompt here.

Parent Page Retrieval

Remember how I talked about splitting text into smaller chunks? Well, there's a small but important caveat here.

Yes, the core information needed to answer is usually concentrated in a small chunk — which is exactly why breaking the text into smaller pieces improves retrieval quality.

But the rest of the text on that page may still contain secondary — yet still important — details.

Because of this, after finding the top_n relevant chunks, I only use them as pointers to the full page, which then goes into the context. That's precisely why I recorded the page number in each chunk's metadata.

Assembled Retriever

ilya-20250325102248.png

Let's recap the final retriever steps (a compact sketch follows the list):

  1. Vectorize the query.
  2. Find the top 30 relevant chunks based on the query vector.
  3. Extract pages via chunk metadata (remember to deduplicate!).
  4. Pass pages through the LLM reranker.
  5. Adjust relevance scores for pages.
  6. Return the top 10 pages, prepend each page with its number, and merge them into a single string.
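Put together, the retriever looks roughly like this; embed is the helper from the vectorization step, and rerank_pages is a placeholder wrapping the LLM reranking prompt:

```python
def retrieve_context(question, index, chunks, pages, embed, rerank_pages, top_k=30, keep=10):
    """Chunk search -> parent pages -> LLM rerank -> merged context string.

    `pages` maps page number -> page text; `rerank_pages` returns dicts with
    "number", "text", and a blended "score" for each page.
    """
    # 1-2. Vectorize the query and pull the top candidate chunks.
    _, ids = index.search(embed([question]), top_k)

    # 3. Map chunks to their parent pages and deduplicate while keeping order.
    page_numbers = list(dict.fromkeys(chunks[i]["page"] for i in ids[0]))

    # 4-5. Let the LLM score each page and blend with the vector score.
    scored = rerank_pages(question, [(n, pages[n]) for n in page_numbers])

    # 6. Keep the best pages and merge them, prepending page numbers.
    top_pages = sorted(scored, key=lambda p: p["score"], reverse=True)[:keep]
    return "\n\n".join(f"[page {p['number']}]\n{p['text']}" for p in top_pages)
```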

Our retriever is now ready!

4. Augmentation

ilya-20250325102326.png

Our vector database is set up, and the retriever is ready. With the "R" (Retrieval) part of RAG behind us, we now approach the "A" (Augmentation) part, which is pretty straightforward, consisting mainly of f-strings and concatenations.

One interesting detail is how I structured prompt storage. After trying different approaches across multiple projects, I eventually settled on the following approach:

I store prompts in a dedicated prompts.py file, typically splitting prompts into logical blocks:

  • Core system instruction;
  • Pydantic schema defining the response format expected from the LLM;
  • Example question-answer pairs for creating one-shot/few-shot prompts;
  • Template for inserting the context and the query.

A small function combines these blocks into the final prompt configuration as needed. This method allows flexible testing of different prompt configurations (e.g., comparing the effectiveness of different examples for one-shot prompts).
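A stripped-down sketch of what such a modular prompts.py entry might look like; block contents and names are illustrative, and the real prompts live in the repository's prompts.py:

```python
# prompts.py (illustrative excerpt, not the actual competition prompts)

SHARED_SYSTEM_CORE = "You are answering questions about annual company reports..."

ANSWER_RULES_NUMBER = (
    "Return 'N/A' if the metric is reported in a different currency than the question. "
    "Normalize values reported in thousands or millions."
)

ONE_SHOT_EXAMPLE_NUMBER = (
    "Question: What was the company's revenue in 2021?\n"
    'Answer: {"step_by_step_analysis": "...", "final_answer": 1352000}'
)

CONTEXT_TEMPLATE = "Context:\n{context}\n\nQuestion:\n{question}"

def build_prompt(rules: str, example: str, core: str = SHARED_SYSTEM_CORE) -> str:
    """Combine shared and type-specific blocks into one system prompt."""
    return "\n\n".join([core, rules, example])
```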

Some instructions may repeat across multiple prompts. Previously, changing such instructions meant synchronizing updates across all prompts using them, easily leading to mistakes. The modular approach solved this issue. Now, I place recurring instructions into a shared block and reuse it across several prompts.

Additionally, modular blocks simplify handling when prompts become overly long.

All prompts can be viewed in the project repository: prompts.py

5. Generation

The third part of RAG, "G" (Generation), is the most labor-intensive. Achieving high quality here requires skillful implementation of several fundamental techniques.

Routing queries to the database

ilya-20250325102349.png

This is one of the simplest yet most useful parts of a RAG system.

Recall that each report has its own separate vector database. The question generator was designed so that the company's name always explicitly appears in the question.

We also have a list of all company names (provided along with the PDF reports at the start of the competition). Thus, extracting the company's name from a query doesn't even require an LLM: we simply iterate over the list, extract the name via re.search() from the question, and match it to the appropriate database.

In real-world scenarios, routing queries to databases is more complex than in our controlled, sterile conditions. Most likely, you'll have additional preliminary tasks: tagging databases or using an LLM to extract entities from the question to match them to a database.

But conceptually, the approach remains unchanged.

To summarize:

Found the name → matched to DB → search only in this DB. The search space shrinks 100-fold.
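A sketch of that routing step; the company list is assumed to come from the competition metadata, and the function name is illustrative:

```python
import re

def route_to_company(question: str, company_names: list[str]) -> str | None:
    """Return the company whose report database should serve this question."""
    for name in company_names:
        # Case-insensitive match of the company name inside the question text.
        if re.search(re.escape(name), question, flags=re.IGNORECASE):
            return name
    return None  # no match: fall back to a broader search or ask for clarification

# Usage: company = route_to_company(question, company_names); then search only
# that company's FAISS index.
```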

Routing queries to prompts

ilya-20250325102415.png

One requirement of the competition was the answer format. Each answer must be concise and strictly conform to the data type, as if storing it directly into the company's database.

Alongside each question, the expected type is given explicitly—int/float, bool, str, or list[str].

Each type involves 3–6 nuances to consider when responding.

For example, if a question asks for a metric value, the answer must be solely numeric, without comments, currency signs, etc. For monetary metrics, the currency in the report must match the currency in the question, and numbers must be normalized—reports often write something like "$1352 (in thousands)" and the system must reply with "1352000".

How do you ensure the LLM considers all these nuances simultaneously without making errors? Simply put: you can't. The more rules you give the LLM, the higher the chance it'll ignore them. Even eight rules are dangerously many for current LLMs. A model's cognitive capacity is limited, and additional rules distract it from the main task—answering the posed question.

This logically leads to the conclusion that we should minimize the number of rules per query. One approach is to break a single query into a sequence of simpler ones.

In our case, though, we can achieve an even simpler solution—since the expected response type is explicitly provided, we only supply the relevant instruction set to the prompt, depending on the answer type.

I wrote four prompt variations and chose the correct one with a simple if/else.
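In code this is little more than a lookup; a sketch equivalent to that if/else, with placeholder prompt texts and illustrative type keys:

```python
# Illustrative stand-ins for the four type-specific prompts described above.
NUMBER_PROMPT = "Answer with a single normalized number or 'N/A'..."
NAME_PROMPT = "Answer with a single name exactly as it appears in the context..."
NAMES_PROMPT = "Answer with a list of names or position titles..."
BOOLEAN_PROMPT = "Answer strictly True or False..."

def select_prompt(expected_type: str) -> str:
    """Pick the instruction set matching the question's declared answer type."""
    prompts_by_type = {
        "number": NUMBER_PROMPT,
        "name": NAME_PROMPT,
        "names": NAMES_PROMPT,
        "boolean": BOOLEAN_PROMPT,
    }
    if expected_type not in prompts_by_type:
        raise ValueError(f"Unknown answer type: {expected_type}")
    return prompts_by_type[expected_type]
```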

Routing compound queries

ilya-20250325102457.png

The competition included questions comparing metrics from multiple companies. Such questions didn't fit the paradigm of the other, simpler queries, as they required additional steps to answer.

Example question:

Who has higher revenue, Apple or Microsoft?

Let's think: how would a human approach this task?

First, they'd find each company's revenue separately, then compare them.

We embed the same behavior into our system.

We pass the initial comparison question to the LLM and ask it to create simpler sub-questions that extract the metric for each company individually.

In our example, the simpler sub-questions would be:

What is Apple's revenue? and What is Microsoft's revenue?

Now we can process these simpler queries through the standard pipeline for each company separately.

After gathering answers for each company, we pass them into the context to answer the original question.

This pattern applies to any complex queries. The key is recognizing them and identifying the necessary sub-steps.
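A sketch of that decompose-then-recombine flow; the three callables stand in for the paraphrasing prompt, the standard single-company pipeline, and the final comparison prompt described above:

```python
from typing import Callable

def answer_compound_question(
    question: str,
    companies: list[str],
    decompose: Callable[[str, list[str]], dict[str, str]],  # LLM: comparison -> per-company sub-questions
    answer_single: Callable[[str], str],                     # standard single-company pipeline
    compare: Callable[[str, dict[str, str]], str],           # LLM: final comparison over gathered answers
) -> str:
    """Decompose a comparative question, answer it per company, then compare."""
    sub_questions = decompose(question, companies)
    sub_answers = {company: answer_single(q) for company, q in sub_questions.items()}
    return compare(question, sub_answers)
```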

Chain of Thought

CoT significantly improves answer quality by making the model "think aloud" before providing the final response. Rather than giving an immediate answer, the LLM generates a sequence of intermediate reasoning steps leading to the solution.

Just like humans, LLMs handle complex problems better when breaking them down into smaller, simpler ones. CoT helps the model avoid missing crucial details, methodically process information, and reach correct conclusions. It's especially useful when the context includes "traps" that might lead the model astray.

You've undoubtedly heard the iconic phrase Think step by step. This was one of the earliest attempts to enhance answer quality through prompting. It practically gave rise to fancy "prompt engineering." However, for serious tasks, such generic instructions aren't sufficient.

LLMs can sometimes "fake" reasoning—for instance, giving an immediate answer and then retroactively justifying it, or hallucinating non-existent facts. This issue is particularly common among weaker models like GPT-4o-mini or Llama 8b.

To ensure CoT is beneficial, you must clearly guide the model on how to reason. Explain the reasoning steps and objectives, and provide examples.

Structuring reasoning steps explicitly according to the prompt instructions can significantly enhance rule-following.

In RAG, when reports lack the exact requested metric, the context may include similar but non-identical information. Models might attempt to twist the provided data to fit the requested metric. Because models are trained to be helpful, they are biased toward providing some answer rather than admitting ignorance.

Trying to counteract this bias by just asking the model to be skeptical does work — but not consistently and not in all cases. Instead, prompting the model explicitly to analyze information from different angles or perspectives helps much more.

I specifically instructed the model to focus on assessing the compatibility of metrics between the question and the context. This significantly reduced hallucinations.

Example reasoning with ambiguous context:

Question:

What is the Research and development equipment, at cost, for Ritter Pharmaceuticals, Inc.?

Model's reasoning steps:

  1. The question asks for 'Research and development equipment, at cost' for Ritter Pharmaceuticals, Inc. This indicates a specific value from the balance sheet, representing the original purchase price of equipment specifically used for R&D, without any accumulated depreciation.
  2. The context (page 35) shows 'Property and equipment, net' at $12,500. This is a net value (after depreciation), and it's a broader category, encompassing all property and equipment, not just R&D equipment.
  3. The context (page 37) also mentions 'Accumulated Depreciation' of $110,000 for 'Machinery and Equipment'. This represents the total depreciation, not the original cost, and, importantly, it doesn't specify that this equipment is exclusively for R&D.
  4. Neither of these metrics exactly matches the requested metric. 'Property and equipment, net' is too broad and represents the depreciated value. 'Accumulated Depreciation' only shows depreciation, not cost, and lacks R&D specificity.
  5. Since the context doesn't provide the original cost of only R&D equipment, and we cannot make assumptions, perform calculations, or combine information, the answer is 'N/A'.

Structured outputs

Structured Output (SO) is a way of forcing the model to respond in a strictly defined format. It's usually passed as a separate parameter to the API, such as a Pydantic or JSON schema.

This guarantees that the model always returns valid JSON strictly adhering to the provided schema.

Field descriptions can also be included in the response schema. These don't affect structure but are treated by the LLM as part of the prompt.

For example, here's a Pydantic schema for LLM reranking:

class RetrievalRankingSingleBlock(BaseModel):
    """Rank retrieved text block relevance to a query."""

    reasoning: str = Field(
        description=(
            "Analysis of the block, identifying key information and how it "
            "relates to the query"
        )
    )
    relevance_score: float = Field(
        description=(
            "Relevance score from 0 to 1, where 0 is Completely Irrelevant "
            "and 1 is Perfectly Relevant"
        )
    )

With this schema, the LLM always returns a JSON with two fields—the first a string, the second a number.
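With the OpenAI Python SDK, passing such a schema looks roughly like the sketch below; the model name, prompt contents, and function name are placeholders:

```python
from openai import OpenAI

client = OpenAI()

def llm_rerank_block(system_prompt: str, query: str, page_text: str) -> RetrievalRankingSingleBlock:
    """Score one text block's relevance using the schema defined above."""
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},  # the reranking instructions
            {"role": "user", "content": f"Query: {query}\n\nText block:\n{page_text}"},
        ],
        response_format=RetrievalRankingSingleBlock,  # Pydantic schema from above
    )
    return completion.choices[0].message.parsed
```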

CoT SO

The methods described above are ideally combined with each other.

During generation, the model has a dedicated field specifically for reasoning and a separate field for the final answer. This allows us to extract the answer without needing to parse it from lengthy reasoning steps.

Chain of Thought can be implemented within Structured Outputs in several ways. For example, you could use multiple JSON fields, each guiding the model to intermediate conclusions whose combination leads it to the correct final answer.

However, because the logic required for answering competition questions couldn't be described by a single predefined set of step-by-step instructions, I employed a more general approach, providing the model with a single reasoning field and defining the reasoning sequence directly within the prompt.

In my main schema for answering competition questions, there were just four fields:

  • step_by_step_analysis — preliminary reasoning (the Chain of Thought itself).
  • reasoning_summary — a condensed summary of the previous field (for easier tracking of the model's logic).
  • relevant_pages — report page numbers referenced by the answer.
  • final_answer — a concise answer formatted as required by the competition.

The first three fields were reused across all four prompts tailored for different answer types. The fourth field varied each time, specifying the answer type and describing particular nuances the model had to consider.

For example, ensuring that the final_answer field would always be a number or "N/A" was done like this:

final_answer: Union[float, int, Literal['N/A']]

SO Reparser

Not all LLMs support Structured Outputs, which guarantee full adherence to schemas.

If a model doesn't have a dedicated Structured Output feature, you can still present the output schema directly within the prompt. Models are usually smart enough to return valid JSON in most cases. However, a portion of answers will inevitably deviate from the schema, breaking the code. Smaller models, in particular, fail to conform about half the time.

To address this, I wrote a fallback method that validates the model's response against the schema using schema.model_validate(answer). If validation fails, the method sends the response back to the LLM, prompting it to conform to the schema.

This method brought schema compliance back up to 100%, even for the 8b model.
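A sketch of such a fallback loop; the repair callable wraps the reparser prompt linked below, and the retry count is illustrative:

```python
from typing import Callable
from pydantic import BaseModel, ValidationError

def parse_with_fallback(
    raw_answer: str,
    schema: type[BaseModel],
    repair_with_llm: Callable[[str], str],  # wraps the reparser prompt + an LLM call
    max_retries: int = 2,
) -> BaseModel:
    """Validate raw LLM output against the schema; ask the LLM to repair it on failure."""
    for _ in range(max_retries + 1):
        try:
            # model_validate_json handles a raw JSON string; model_validate takes a dict.
            return schema.model_validate_json(raw_answer)
        except ValidationError as error:
            # Hand the broken output, the schema, and the errors back to the model.
            raw_answer = repair_with_llm(
                f"Schema:\n{schema.model_json_schema()}\n\n"
                f"Validation errors:\n{error}\n\nBroken output:\n{raw_answer}"
            )
    raise ValueError("Could not coerce the LLM output to the expected schema")
```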

Here's the prompt itself.

One-shot Prompts

This is another common and fairly obvious approach: adding an example question-answer pair to the prompt improves response quality and consistency.

I added a "question → answer" pair to each prompt, writing the answer in the JSON format defined by Structured Outputs.

The example serves multiple purposes simultaneously:

  • Demonstrates an exemplary step-by-step reasoning process.
  • Further clarifies correct behavior in challenging cases (helping recalibrate the model's biases).
  • Illustrates the JSON structure that the model's answer should follow (particularly useful for models lacking native SO support).

I paid significant attention to crafting these example answers. The quality of examples in the prompt can either boost or diminish response quality, so each example must be perfectly consistent with the directives and nearly flawless overall. If an example answer contradicts instructions, the model becomes confused, which can negatively affect performance.

I meticulously refined the step-by-step reasoning field in the examples, manually adjusting the reasoning structure and wording of each phrase.

Instruction Refinement

This part is comparable in labor-intensity to the entire data preparation stage, due to endless iterative debugging, proofreading of answers, and manual analysis of the model's reasoning process.

Analyzing Questions

Before writing prompts, I thoroughly studied both the response requirements and the question generator.

The key to a good system with an LLM under the hood is understanding customer needs. Typically, this involves deep immersion into a professional domain and meticulous examination of questions. I'm convinced it's impossible to create a truly high-quality QA system for businesses unless you clearly understand the questions themselves and how to find answers (I'd be glad if someone could convince me otherwise).

This understanding is also required to clarify all the implicit meanings arising from user questions.

Let's consider the example question Who is the CEO of ACME inc?

In an ideal world, a report would always explicitly provide the answer, leaving no room for misinterpretation:

CEO responsibilities are held by John Doe

A RAG system would locate this sentence in the report, add it to the query context, and the user would receive an unambiguous answer: John Doe

However, we live in the real world, where tens of thousands of companies express information in unlimited variations, with numerous additional nuances.

This raises the question: what exactly can fall under the term "CEO"?

  • How literally should the system interpret the client's question?
  • Does the client want to know the name of the person holding a similar managerial role, or strictly that specific job title?
  • Is stepping slightly away from a literal interpretation acceptable? How far is too far?

Potentially, the following positions could be included:

  • Chief Executive Officer — obviously, just the abbreviation spelled out.
  • Managing Director (MD), President, Executive Director — slightly less obvious. Different countries use different titles for this role (MD in the UK and Europe, President in America and Japan, Executive Director in the UK, Asian countries, and non-profits).
  • Chief Operating Officer, Principal Executive Officer, General Manager, Administrative Officer, Representative Director — even less obvious. Depending on the country and company structure, there may not be a direct CEO equivalent; these roles, although closest to CEO, have varying levels of overlap in responsibilities and authority—from 90% down to 50%.

I'm unsure if there's an existing term for this, but personally, I refer to it as the "interpretation freedom threshold" issue.

When responses are free-form, the interpretation freedom threshold is resolved relatively easily. In ambiguous cases, the LLM tries to encompass all the implicit meanings from the user's query, adding several clarifications.

Here's a real example of a ChatGPT response:

Based on the provided context, Ethan Caldwell is the Managing Director, which is the closest equivalent to a CEO in this company. However, he has been formally suspended from active executive duties due to an ongoing regulatory investigation. While he retains the title, he is not currently involved in company operations, and leadership has been temporarily transferred to the senior management team under board supervision.

However, if the system architecture requires concise answers, as in the RAG Challenge, the model behaves unpredictably in these situations, relying on its internal "intuition."

Thus, the interpretation freedom threshold must be defined and calibrated in advance. But since it's not possible to define and quantify this threshold explicitly, all major edge cases must be identified, general query interpretation rules formulated, and ambiguities clarified with the customer.

Beyond interpretation issues, general dilemmas may also occur.

For example: Did ACME inc announce any changes to its dividend policy?

Should the system interpret the absence of information in the report as an indication that no changes have been announced?

Rinat (the competition organizer) can confirm—I bombarded him with dozens of similar questions and dilemmas during competition preparation :)

Prompt Creation

One week before the competition started, the question generator's code was made publicly available. I immediately generated a hundred questions and created a validation set from them.

Answering questions manually is quite tedious, but it helped me in two key areas:

  1. The validation set objectively measures the system's quality as I make improvements. By running the system on this set, I monitored how many questions it answered correctly and where it most commonly made mistakes. This feedback loop aids iterative improvements of prompts and other pipeline components.
  2. Manually analyzing questions highlighted non-obvious details and ambiguities in questions and reports. This allowed me to clarify response requirements with Rinat and unambiguously reflect these rules in the prompts.

I incorporated all these clarifications into prompts as directive sets.

Directive examples:

Answer type = Number

Return 'N/A' if metric provided is in a different currency than mentioned in the question. Return 'N/A' if metric is not directly stated in context EVEN IF it could be calculated from other metrics in the context. Pay special attention to any mentions in the context about whether metrics are reported in units, thousands, or millions, to adjust the number in final answer with no changes, three zeroes or six zeroes accordingly. Pay attention if the value is wrapped in parentheses; it means the value is negative.

Answer type = Names

If the question asks about positions (e.g., changes in positions), return ONLY position titles, WITHOUT names or any additional information. Appointments to new leadership positions also should be counted as changes in positions. If several changes related to a position with the same title are mentioned, return the title of such position only once. Position title always should be in singular form.

If the question asks about newly launched products, return ONLY the product names exactly as they are in the context. Candidates for new products or products in the testing phase are not counted as newly launched products.

The model easily followed certain directives, resisted others due to skewed biases, and struggled with some, causing errors.

For example, the model repeatedly stumbled when tracking measurement units (thousands, millions), forgetting to append the necessary zeroes to the final answer. So, I supplemented the directive with a brief example:

Example for numbers in thousands:

Value from context: 4970,5 (in thousands $)

Final answer: 4970500

Eventually, I developed prompts for each question format, plus several auxiliary prompts:

  • Final prompt for Number-type questions
  • Final prompt for Name-type questions
  • Final prompt for Names-type questions
  • Final prompt for Boolean-type questions
  • Final prompt for Comparative-type questions (to compare answers from multiple companies via multi-query routing)
  • Paraphrasing prompt for Comparative-type questions (to initially find metrics in reports)
  • LLM reranking prompt
  • SO Reparser prompt

Meticulous refinement of instructions, combined with one-shot examples and SO CoT, resulted in significant benefits. The final prompts entirely recalibrated unwanted biases in the system and greatly improved attentiveness to nuances, even for weaker models.

System Speed

Initially, the RAG Challenge rules were stricter, requiring the system to answer all 100 questions within 10 minutes to be eligible for a monetary prize. I took this requirement seriously and aimed to fully leverage OpenAI's Tokens Per Minute rate limits.

Even at Tier 2, the limits are generous—2 million tokens/minute for GPT-4o-mini and 450k tokens/minute for GPT-4o. I estimated the token consumption per question and processed questions in batches of 25. The system completed all 100 questions in just 2 minutes.
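A sketch of that kind of batched parallelism; answer_one stands in for the full per-question pipeline, and the batch size matches the one mentioned above:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

BATCH_SIZE = 25  # sized to stay comfortably under the tokens-per-minute limit

def answer_all(questions: list[str], answer_one: Callable[[str], dict]) -> list[dict]:
    """Process questions in parallel batches of 25."""
    answers: list[dict] = []
    for start in range(0, len(questions), BATCH_SIZE):
        batch = questions[start:start + BATCH_SIZE]
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            answers.extend(pool.map(answer_one, batch))
    return answers
```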

In the end, the time limit for submitting solutions was significantly extended — the other participants simply couldn't make it in time :)

System Quality

Having a validation set helped improve more than just prompts—it benefited the entire system.

I made all key features configurable, allowing me to measure their real-world impact and fine-tune hyperparameters. Here are some example config fields:

class RunConfig:
    use_serialized_tables: bool = False
    parent_document_retrieval: bool = False
    use_vector_dbs: bool = True
    use_bm25_db: bool = False
    llm_reranking: bool = False
    llm_reranking_sample_size: int = 30
    top_n_retrieval: int = 10
    api_provider: str = "openai"
    answering_model: str = "gpt-4o-mini-2024-07-18"

While testing configurations, I was surprised to find that table serialization—which I'd placed great hopes on—not only failed to improve the system but slightly decreased its effectiveness. Apparently, Docling parses tables from PDFs well enough, the retriever finds them effectively, and the LLM understands their structure sufficiently without extra assistance. And adding more text to the page merely reduces the signal-to-noise ratio.

I also prepared multiple configurations for the competition to quickly run various systems in all categories.

The final system performed excellently with both open-source and proprietary models: Llama 3.3 70b was only a couple of points behind OpenAI's o3-mini. Even the small Llama 8b outperformed 80% of the participants in the overall ranking.

6. Conclusion

Ultimately, winning the RAG Challenge wasn't about finding a single magical solution, but rather applying a systematic approach, thoughtfully combining and fine-tuning various methods, and deeply immersing myself in the task details. The key success factors were high-quality parsing, efficient retrieval, intelligent routing, and—most notably—LLM reranking and carefully crafted prompts, which enabled achieving excellent results even with compact models.

The main takeaway from this competition is simple: the magic of RAG lies in the details. The better you understand the task, the more precisely you can fine-tune each pipeline component, and the greater the benefits you get even from the simplest techniques.

I've shared all the system code as open-source. It includes instructions on deploying the system yourself and running any stage of the pipeline.

Ilya is always open to interesting ideas, projects, and collaborations. Feel free to reach out to him via Telegram or LinkedIn.

Published: March 25, 2025.
