这是用户在 2024-7-19 18:05 为 https://motherduck.com/blog/search-using-duckdb-part-3/ 保存的双语快照页面,由 沉浸式翻译 提供双语支持。了解如何保存?

Join Us At Small Data SF: Get $100 off with code 'EarlyBird' until Sunday, 7/21
加入我们的小数据 SF 活动:使用代码“EarlyBird”至周日 7/21 前享受 100 美元优惠
Get Tickets 获取门票

All-in-SQL Hybrid Search in DuckDB: Integrating Full Text and Embedding Methods
DuckDB 中的 ALL-IN-SQL 混合搜索:整合全文与嵌入方法

Subscribe to MotherDuck Blog
订阅 MotherDuck 博客

E-mail 电子邮件

This is the third of the current blog series exploring search in DuckDB. So far in this series we’ve covered quite some ground on vector embeddings based text search and building a knowledge base that you could search and query using vector embeddings. In embedding-based search, semantic understanding and similarities is key to ranking the documents; however there are situations where exact keyword matching is essential. This is especially true in areas like law, compliance, and medical document retrieval, where it is crucial to link specific legal codes or medical terms to the documents being searched. Lexical searches like Full Text Search are very effective in achieving this. For situations where queries are vague, document repositories cover different domains, and documents contain overlapping keywords with semantic differences a hybrid search approach works well. By leveraging both lexical matching using full text search and semantic understanding using vector search, hybrid search provides the flexibility to adapt to a vast loop up space, and still provide good relevance and accuracy on the documents retrieved.
这是当前博客系列中探讨 DuckDB 搜索的第三篇文章。在本系列中,我们已深入探讨了基于向量嵌入的文本搜索以及构建可利用向量嵌入进行搜索和查询的知识库。在基于嵌入的搜索中,语义理解和相似性是文档排序的关键;然而,在某些情况下,精确的关键词匹配至关重要。这在法律、合规和医疗文档检索等领域尤为明显,因为将特定法律条文或医学术语与所搜索的文档关联起来至关重要。全文搜索等词汇搜索在此方面非常有效。对于查询模糊、文档库涵盖不同领域且文档包含具有语义差异的重叠关键词的情况,混合搜索方法表现出色。通过结合使用全文搜索的词汇匹配和向量搜索的语义理解,混合搜索提供了适应广阔检索空间的灵活性,同时仍能保证检索文档的相关性和准确性。

In this blog we’ll explore Full Text Search and how to combine it with Embedding Search to bring about Hybrid Search. For hybrid search document ranking, that fuses the scores from both Full Text Search and Embedding Search, we will look into Reciprocal Ranked Fusion and Convex Combination the two common metrics, their formulas and their SQL implementation.
在本篇博客中,我们将探讨全文搜索(Full Text Search)及其与嵌入搜索(Embedding Search)的结合,以实现混合搜索(Hybrid Search)。针对融合全文搜索与嵌入搜索得分的混合搜索文档排序,我们将研究两种常见度量方法——倒数排名融合(Reciprocal Ranked Fusion)与凸组合(Convex Combination),包括它们的公式及其在 SQL 中的实现。

Note: we will be referring to each row in a table as a document, due to the richness in textual data that each row contains.
注意:由于每行所包含的丰富文本数据,我们将表格中的每一行称为文档。

How does Full Text Search Work?
全文搜索是如何工作的?

Full text search, that scans the entire text for specific word matches, is particularly beneficial in scenarios where exact keyword matching is necessary. It functions by comparing the keywords in a query with those in the text of document records, focusing mainly on exact matches. This contrasts with semantic search, which uses vector embeddings to identify semantic similarities which would capture synonyms and word relationships.
全文搜索,即扫描整个文本以查找特定单词匹配,在需要精确关键词匹配的场景中特别有益。它通过将查询中的关键词与文档记录的文本中的关键词进行比较,主要关注精确匹配来实现功能。这与使用向量嵌入来识别语义相似性的语义搜索形成对比,后者能够捕捉同义词和词语关系。

In DuckDB, the FTS extension provides the means to search through strings, and this is done so by creating an Inverted Index.
在 DuckDB 中,FTS 扩展提供了搜索字符串的手段,这一功能通过创建倒排索引来实现。

An Inverted Index creates a map of the keywords to the id of the document records that contains the respective keyword. This speeds up search operations by first identifying the keywords present in the query string, matching them to the inverted index and thereby locating the documents that have these keywords. Now that the relevant documents are located, the next step is to figure out how to rank them since only the top N results are relevant for any search. The FTS extension implements the Okapi BM25  scoring function which scores a document based on the keyword terms appearing in each document. The score captures the number of times a keyword occurs, length of the document in words, average length of the documents in the collection and the number of documents that contain the keyword. An exact representation of the formula can be found over here. Note that this score does not capture the proximity or the arrangement of the keywords in the document. That being said, upon calculating the score for each document, it is then ranked to select the top N results as the most relevant documents to the query string.
倒排索引创建了一个从关键词到包含该关键词的文档记录 ID 的映射。这通过首先识别查询字符串中存在的关键词,将它们与倒排索引匹配,从而定位包含这些关键词的文档,加快了搜索操作。现在相关文档已定位,下一步是确定如何对它们进行排序,因为任何搜索通常只关注前 N 个结果。全文搜索(FTS)扩展实现了 Okapi BM25 评分函数,该函数根据每个文档中出现的关键词项对文档进行评分。评分考虑了关键词出现的次数、文档的单词长度、集合中文档的平均长度以及包含该关键词的文档数量。公式的精确表示可以在此处找到。需要注意的是,此评分不捕捉关键词在文档中的接近度或排列方式。尽管如此,在计算每个文档的评分后,它们将被排序以选择前 N 个结果作为与查询字符串最相关的文档。

inverted_index

Figure 1: An illustration of how full text search creates the inverted index mapping keywords to documents and uses it for search. Consider the colored boxes as keywords.
图 1:展示了全文搜索如何创建关键词到文档的倒排索引映射,并利用它进行搜索。将彩色方框视为关键词。

Demo: Movies Dataset 演示:电影数据集

For this blog, we’ll be using the Kaggle Movies dataset [1], the same one we used in the embedding search blog.
对于这篇博客,我们将使用 Kaggle 电影数据集[1],即我们在嵌入搜索博客中使用过的同一数据集。

To load the dataset you have two options :
要加载数据集,您有两种选择:

  1. Using MotherDuck public datasets through shares. Any MotherDuck user has access to the sample_data database where the movies dataset is present under sample_data.kaggle.movies. Plus you leverage the power of the Cloud through MotherDuck
    通过共享使用 MotherDuck 公共数据集。任何 MotherDuck 用户都可以访问包含电影数据集的 sample_data 数据库,该数据集位于 sample_data.kaggle.movies 下。此外,您还可以通过 MotherDuck 利用云的力量。
  2. Read from a public AWS S3 bucket where we maintain our open datasets (see snippet below). The file is ~ 500 MB.
    从我们维护的公共 AWS S3 存储桶中读取开放数据集(参见下方代码片段)。该文件大小约为 500 MB。

Let’s first load in the dataset:
首先加载数据集:

create table movies as select title, overview
FROM 's3://us-prd-motherduck-open-datasets/movies/parquet/movies.parquet';

describe movies;
| column_name | column_type | null | key  | default | extra |
|-------------|-------------|------|------|---------|-------|
| title       | VARCHAR     | YES  | NULL | NULL    | NULL  |
| overview    | VARCHAR     | YES  | NULL | NULL    | NULL  |
|             |             |      |      |         |       |

Full text search (FTS) in DuckDB
DuckDB 中的全文搜索(FTS)

FTS is available in DuckDB as an extension, and is autoloaded when the pragma below is called. This extension adds two PRAGMA statements one each to create and drop the FTS index. We can now create the index using:
FTS 作为扩展在 DuckDB 中可用,并在调用下方 pragma 时自动加载。此扩展添加了两个 PRAGMA 语句,分别用于创建和删除 FTS 索引。我们现在可以使用以下命令创建索引:

pragma create_fts_index(
    input_table,
    input_id,
    * input_values,
    stemmer = 'porter',
    stopwords = 'english',
    ignore = '(\\.|[^a-z])+',
    strip_accents = 1,
    lower = 1,
    overwrite = 0
)

The statement builds the index for the given table input_table and the mapping of the keywords to the documents is done using the input_id. The input_id would typically be a document identifier that is unique. Next we can specify what columns in the input_table to build the index for. There are a few optional arguments you can use to suit your custom use case. By default all characters are converted to lowercase and escape sequences are ignored. You can change the default behavior by using the optional parameters. To ignore character patterns in the textual data, you can pass in a regular expression to the ignore argument, which defaults to '(\\.|[^a-z])+' to ignore all escaped characters and non-alphabetic lowercase characters. Accents in your text data could also be removed and converted to characters without accents (example á to a) by setting strip_accents = 1 which defaults to 1. Setting lower = 1 converts all text to lowercase and overwrite = 0 overwrites an existing index. The optional arguments stemmer and stopwords are discussed in detail below. The pragma applies the normalization in the following order:
该声明为指定的表 input_table 构建索引,并将关键词映射到文档的操作通过 input_id 完成。 The input_id 通常是一个唯一的文档标识符。接下来,我们可以指定要在 input_table 中的哪些列上构建索引。有几个可选参数可用于适应您的自定义用例。默认情况下,所有字符都转换为小写,转义序列被忽略。您可以通过使用可选参数来更改默认行为。要忽略文本数据中的字符模式,可以将正则表达式传递给 ignore 参数,该参数默认设置为 '(\\.|[^a-z])+' ,以忽略所有转义字符和非字母小写字符。您的文本数据中的重音也可以被移除并转换为无重音字符(例如,á转换为 a),通过设置 strip_accents = 1 ,其默认值为 1。设置 lower = 1 将所有文本转换为小写,而 overwrite = 0 会覆盖现有索引。下面将详细讨论可选参数 stemmerstopwords 。该指令按以下顺序应用规范化:

strip_accents -> lowercase -> ignore_regex -> stemmer -> stopwords

Stemmer 斯特默

Stemming is a process of simplifying words by removing common word endings from them. For example, the word running would be converted to run, cats to cat. This makes the search for keywords in their different forms much easier. DuckDB provides various stemmers, and defaults to stemmer = porter. There is also an option to disable this process of simplifying words by passing stemmer = none to the argument.
词干提取是一种通过去除常见词尾来简化单词的过程。例如,单词"running"会被转换为"run","cats"会变为"cat"。这使得在不同形式中搜索关键词变得更加容易。DuckDB 提供了多种词干提取器,默认使用 stemmer = porter 。还可以通过向参数传递 stemmer = none 来禁用这一简化单词的过程。

Stopwords 停用词

Stopwords are commonly used words in a language, and are often removed from the search context in keyword based search systems as they add very little value. In English, “a”, “is”, “the”, “are” are examples of some stopwords. The FTS extension defaults to using English stopwords, stopwords = ‘english’.
停用词是语言中常用的词汇,在基于关键词的搜索系统中通常会被移除,因为它们提供的价值非常有限。在英语中,“a”、“is”、“the”、“are”等都是停用词的例子。全文搜索(FTS)扩展默认使用英语停用词, stopwords = ‘english’

For our example dataset, let’s build the index for the columns title and overview using the title as the document identifier since this is a unique identifier for the rows in our table, while using the defaults for the optional arguments.
对于我们的示例数据集,让我们使用标题作为文档标识符来构建标题和概述列的索引,因为标题是表中行的唯一标识符,同时使用可选参数的默认值。

pragma create_fts_index(movies, title, title, overview)

Upon executing this, a few tables are created on the same database as the input_table but in another schema, which is usually the name of the input_table with a prefix as fts_main_<input_table>. These newly created tables hold the inverted index for the full text search.
执行此操作后,会在与 input_table 相同的数据库中创建几个表,但位于另一个模式下,该模式通常是 input_table 的名称,并带有 fts_main_<input_table> 前缀。这些新创建的表保存着全文搜索的倒排索引。

select database, schema, name, column_names FROM (show all tables)
| database | schema          | name      | column_names             | column_types              |
| -------- | --------------- | --------- | ------------------------ | ------------------------- |
| memory   | fts_main_movies | dict      | [termid, term, df]       | [BIGINT, VARCHAR, BIGINT] |
| memory   | fts_main_movies | docs      | [docid, name, len]       | [BIGINT, VARCHAR, BIGINT] |
| memory   | fts_main_movies | fields    | [fieldid, field]         | [BIGINT, VARCHAR]         |
| memory   | fts_main_movies | stats     | [num_docs, avgdl]        | [BIGINT, DOUBLE]          |
| memory   | fts_main_movies | stopwords | [sw]                     | [VARCHAR]                 |
| memory   | fts_main_movies | terms     | [docid, fieldid, termid] | [BIGINT, BIGINT, BIGINT]  |
| memory   | main            | movies    | [title, overview]        | [VARCHAR, VARCHAR]        |

Exploring some of the newly created tables, we can see the effects of the parameters we chose when creating the FTS index. Below are some notes to understand each created table:
探索一些新创建的表,我们可以看到在创建 FTS 索引时所选参数的效果。以下是理解每个创建表的要点:

| schema          | name      | column_names             |     | description                                                                                                             |
| --------------- | --------- | ------------------------ | --- | ----------------------------------------------------------------------------------------------------------------------- |
| fts_main_movies | dict      | [termid, term, df]       |     | stores a mapping of all the terms (keywords in the docs) to a termid                                                    |
| fts_main_movies | docs      | [docid, name, len]       |     | creates a map of the input_id stored here as name to an internal docid for the FTS index along with the document length |
| fts_main_movies | fields    | [fieldid, field]         |     | maps a fieldid to the table columns given in input_values that were indexed                                             |
| fts_main_movies | stats     | [num_docs, avgdl]        |     | stats used to calculate the similarity scores                                                                           |
| fts_main_movies | stopwords | [sw]                     |     | lists all the stopwords for this index, corresponding to the chosen option when creating the index                      |
| fts_main_movies | terms     | [docid, fieldid, termid] |     | maps the document docid to the field with fieldid (column) to the term with termid                                      |
| main            | movies    | [title, overview]        |     | table that is indexed for the full text search

A note for the curious ones: these tables can be queried, inspected and used in other queries downstream.
好奇者须知:这些表格可被查询、检查,并可在下游查询中使用。

Text search with a query
基于查询的文本搜索

When the PRAGMA create_fts_index is executed, a retrieval macro is created along with the index and associated with it. The macro looks as follows:
当执行 PRAGMA create_fts_index 时,会创建一个检索宏以及相应的索引并与之关联。该宏的格式如下:

match_bm25(
    input_id,
    query_string,
    fields := NULL,
    k := 1.2,
    b := 0.75,
    conjunctive := 0
)

This macro calculates the search score based on the Okapi bm25 as mentioned earlier, and takes the column names for the document identifier as input_id, with the input search string as query_string, the indexed column names as a string that contains the comma separated column names and NULL indicates to search across all the columns. The parameters k and b adjust the bm25 scoring and setting conjunctive = 1 ensures the search only retrieves documents that contain all the keywords in the query_string.
此宏根据前述的 Okapi bm25 计算搜索得分,并将文档标识符的列名设为 input_id ,输入搜索字符串设为 query_string ,索引列名设为一个包含逗号分隔列名的字符串, NULL 表示搜索所有列。参数 kb 调整 bm25 评分,设置 conjunctive = 1 确保搜索仅检索包含 query_string 中所有关键词的文档。

Using this macro, we can now search over our movies dataset for a query string:
使用此宏,我们现在可以针对电影数据集进行查询字符串的搜索:

adventure across the galaxy for the ultimate power struggle
穿越银河的冒险,争夺终极力量

and indicate the fields to limit the search as fields := ‘overview’.
并指明要限制搜索的字段,即 fields := ‘overview’。

with fts as (
    select *, fts_main_movies.match_bm25(
        title,
        'adventure across the galaxy for the ultimate power struggle',
         fields := 'overview'
    ) as score
    from movies     
)         
select title, overview, score
from fts
where score is not null
order by score desc
limit 5;

Which would return the top 5 result as:
这将返回前 5 个结果为:

| title               | overview                                                                 | score      |
| ------------------- | ------------------------------------------------------------------------ | ---------- |
| Mighty Morphin Pow… | Power up with six incredible teens who out-maneuver and defeat evil eve… | 5.73340016 |
| Threads of Destiny  | 94 years after The Battle of Yavin, the New Republic has been resurrect… | 5.70256148 |
| Stargate: The Ark … | SG-1 searches for an ancient weapon which could help them defeat the Or| 5.65603264 |
| The Final Master    | Determined to pass down his art, the Final Master of Wing Chun is caugh… | 5.54863581 |
| Star Trek           | The fate of the galaxy rests in the hands of bitter rivals. One, James … | 5.14211669 |

Now with both search modes, Full Text Search (this blog) and Embedding Search implemented in DuckDB, how do we combine them both for a hybrid search? Let’s get our dataset ready to start building our hybrid search. Taking the same movies dataset as above, we generate and add embedding vectors for both the title and the overview, and create the FTS index. At the end of this our movies table would contain:
现在,DuckDB 中已经实现了两种搜索模式:全文搜索(本博客)和嵌入式搜索。我们如何将它们结合起来进行混合搜索呢?让我们准备好数据集,开始构建我们的混合搜索。采用与上述相同的电影数据集,我们为标题和概述生成并添加嵌入向量,并创建 FTS 索引。最终,我们的电影表将包含:

| column_name         | column_type | null | key  | default | extra |
| ------------------- | ----------- | ---- | ---- | ------- | ----- |
| title               | VARCHAR     | YES  | NULL | NULL    | NULL  |
| overview            | VARCHAR     | YES  | NULL | NULL    | NULL  |
| title_embeddings    | DOUBLE[]    | YES  | NULL | NULL    | NULL  |
| overview_embeddings | DOUBLE[]    | YES  | NULL | NULL    | NULL  |

Fused Metric for Ranking 融合排序的度量标准

The crux of any search algorithm is to have a metric to rank the dataset for a given query, from which the top N values are retrieved when listed in descending order. For hybrid search we would require a metric that fuses scores from both search modes. Two of the most commonly used fusion metrics are: (i) convex combination and (ii) reciprocal ranked fusion.
任何搜索算法的核心在于拥有一个指标,用于对给定查询的数据集进行排序,从而在按降序排列时检索出前 N 个值。对于混合搜索,我们需要一个融合两种搜索模式分数的指标。最常用的两种融合指标是:(i) 凸组合和 (ii) 倒数排名融合。

Reciprocal Ranked Fusion (RRF)
互惠排序融合(RRF)

RRF sums the reciprocals of the ranks of documents, where the rank of the document is the row number when sorted in descending order using a ranking score. To adjust the importance of the low ranked documents a constant k is added to the rank. This gives the formula for RRF as:
RRF 计算文档排名的倒数之和,其中文档的排名是按排名得分降序排列时的行号。为了调整低排名文档的重要性,在排名上加上一个常数 k。这给出了 RRF 的公式为:

search_formula_1

Convex Combination (weighted normalized scores)
凸组合(加权标准化分数)

This method reportedly performs better than RRF when calibrated. It is expressed as a linear function that sums the normalized scores weighted against a parameter ⍺. The calibration of the parameter requires a good amount of annotated dataset, but with a lack of it we could use a default value of 0.8 as suggested in this paper [3] for in domain datasets. Which gives:
据报道,此方法在经过校准后,其性能优于 RRF。它被表示为一个线性函数,该函数将归一化分数与参数⍺加权求和。参数的校准需要大量标注数据集,但在缺乏这些数据的情况下,我们可以根据本论文[3]的建议,对领域内数据集使用默认值 0.8。由此得出:

search_formula_2

In this post, we will focus on using Convex Combination as it reportedly performs better, beyond the suggested value for ⍺, it is customizable for any use-case in hand, with a few annotated samples the parameter can be tuned.
在这篇文章中,我们将重点介绍使用凸组合方法,据报道它在性能上表现更优。不仅限于建议的⍺值,凸组合可根据手头的任何用例进行定制,通过几个标注样本即可调整参数。

wiht fts as (
    select 
        title, 
        overview, 
        fts_main_movies.match_bm25(
            title,
            'an adventure across the galaxy for the ultimate power struggle',
            fields := 'overview'
        ) as score
    from movies
),
embd as (
    select 
        title, 
        overview, 
        array_cosine_similarity(overview_embeddings, cast([0.343.., ...] as float[1536])) as score
    from movies
),
normalized_scores as (
    select 
        fts.title, 
        fts.overview, 
        fts.score as raw_fts_score, 
        embd.score as raw_embd_score,
        (fts.score / (select max(score) from fts)) as norm_fts_score,
        ((embd.score + 1) / (select max(score) + 1 from embd)) as norm_embd_score
    from 
        fts
    inner join
        embd 
    on fts.title = embd.title
)
select 
    title,
    raw_fts_score, 
    raw_embd_score, 
    norm_fts_score, 
    norm_embd_score, 
    -- (alpha * norm_embd_score + (1-alpha) * norm_fts_score)
    (0.8*norm_embd_score + 0.2*norm_fts_score) AS score_cc
from 
    normalized_scores
order by 
    score_cc desc
limit 5;

We get the following top 5 results:
我们得到以下前五名结果:

| title                                   | overview                                                                | norm_fts_score | norm_embd_score | score_cc   |
| --------------------------------------- | ----------------------------------------------------------------------- | -------------- | --------------- | ---------- |
| Threads of Destiny                      | 94 years after The Battle of Yavin, the New Republic has been resurrec… | 0.99462122     | 1.0             | 0.99892424 |
| Stargate: The Ark of Truth              | SG-1 searches for an ancient weapon which could help them defeat the O… | 0.98650582     | 0.97546876      | 0.97767617 |
| Star Trek                               | The fate of the galaxy rests in the hands of bitter rivals. One, James… | 0.89687036     | 0.98548452      | 0.96776169 |
| Mighty Morphin Power Rangers: The Movie | Power up with six incredible teens who out-maneuver and defeat evil ev… | 1.0            | 0.94973698      | 0.95978958 |
| Ratchet & Clank                         | Ratchet and Clank tells the story of two unlikely heroes as they strug… | 0.83376640     | 0.95988163      | 0.93465858 |

Comparing the hybrid results above with FTS, we notice that the top result differs after reranking and that the ranking is not determined solely by a single score. We also observe a new result in the top 5 that wasn’t present in the FTS-only search.

Conclusion

With more unstructured textual data being ingested into analytical database systems, it is increasingly important for these systems to handle text search operations efficiently. Advancements in DuckDB and the functions it provides out of the box make it an excellent analytical tool for both full text (keyword-based) and embedding-based searches. These search functionalities can be seamlessly implemented directly in SQL without resorting to other tools to construct such a query. Moreover, the use of Common Table Expressions (CTEs) enables the calculation of fused ranking metrics for the effective integration of hybrid search modes. Even with a long query at hand, CTEs make it easy to build these queries and debug them.

References

[1] Kaggle movies dataset, that has been cleaned to remove duplicates.

[2] https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf

[3] https://dl.acm.org/doi/pdf/10.1145/3596512

CONTENT 内容
  1. How does Full Text Search Work?
    全文搜索是如何工作的?
  2. Full text search in DuckDB
    DuckDB 中的全文搜索
  3. Text search with a query
    使用查询进行文本搜索
  4. Conclusion 结论

Subscribe to MotherDuck Blog
订阅 MOTHERDUCK 博客

E-mail 电子邮件