Training and Finetuning Sparse Embedding Models with Sentence Transformers v5
Sentence Transformers is a Python library for using and training embedding and reranker models for a wide range of applications, such as retrieval augmented generation, semantic search, semantic textual similarity, paraphrase mining, and more. The last few major versions have introduced significant improvements to training:
- v3.0: (improved) Sentence Transformer (Dense Embedding) model training
- v4.0: (improved) Cross Encoder (Reranker) model training
- v5.0: (new) Sparse Embedding model training
In this blogpost, I'll show you how to use it to finetune a sparse encoder/embedding model and explain why you might want to do so. This results in sparse-encoder/example-inference-free-splade-distilbert-base-uncased-nq, a cheap model that works especially well in hybrid search or retrieve and rerank scenarios.
Finetuning sparse embedding models involves several components: the model, datasets, loss functions, training arguments, evaluators, and the trainer class. I'll have a look at each of these components, accompanied by practical examples of how they can be used for finetuning strong sparse embedding models.
In addition to training your own models, you can choose from a wide range of pretrained sparse encoders available on the Hugging Face Hub. To help navigate this growing space, we’ve curated a SPLADE Models collection highlighting some of the most relevant models.
We list the most prominent ones along with their benchmark results in Pretrained Models in the documentation.
Table of Contents
- What are Sparse Embedding models?
- Why Finetune?
- Training Components
  - Model
  - Dataset
  - Loss Function
  - Training Arguments
  - Evaluator
  - Trainer
- Evaluation
- Training Tips
- Vector Database Integration
- Additional Resources
What are Sparse Embedding models?
The broader term "embedding models" refers to models that convert some input, usually text, into a vector representation (embedding) that captures the semantic meaning of the input. Unlike with the raw inputs, you can perform mathematical operations on these embeddings, resulting in similarity scores that can be used for various tasks, such as search, clustering, or classification.
With dense embedding models, i.e. the common variety, the embeddings are typically low-dimensional vectors (e.g., 384, 768, or 1024 dimensions) where most values are non-zero. Sparse embedding models, on the other hand, produce high-dimensional vectors (e.g., 30,000+ dimensions) where most values are zero. Usually, each active dimension (i.e. the dimension with a non-zero value) in a sparse embedding corresponds to a specific token in the model's vocabulary, allowing for interpretability.
Let's have a look at naver/splade-v3, a state-of-the-art sparse embedding model, as an example:
from sentence_transformers import SparseEncoder
# Download from the 🤗 Hub
model = SparseEncoder("naver/splade-v3")
# Run inference
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 30522)
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 32.4323, 5.8528, 0.0258],
# [ 5.8528, 26.6649, 0.0302],
# [ 0.0258, 0.0302, 24.0839]])
# Let's decode our embeddings to be able to interpret them
decoded = model.decode(embeddings, top_k=10)
for decoded, sentence in zip(decoded, sentences):
    print(f"Sentence: {sentence}")
    print(f"Decoded: {decoded}")
    print()
Sentence: The weather is lovely today.
Decoded: [('weather', 2.754288673400879), ('today', 2.610959529876709), ('lovely', 2.431990623474121), ('currently', 1.5520408153533936), ('beautiful', 1.5046082735061646), ('cool', 1.4664798974990845), ('pretty', 0.8986214995384216), ('yesterday', 0.8603134155273438), ('nice', 0.8322536945343018), ('summer', 0.7702118158340454)]
Sentence: It's so sunny outside!
Decoded: [('outside', 2.6939032077789307), ('sunny', 2.535827398300171), ('so', 2.0600898265838623), ('out', 1.5397940874099731), ('weather', 1.1198079586029053), ('very', 0.9873268604278564), ('cool', 0.9406591057777405), ('it', 0.9026399254798889), ('summer', 0.684999406337738), ('sun', 0.6520509123802185)]
Sentence: He drove to the stadium.
Decoded: [('stadium', 2.7872302532196045), ('drove', 1.8208855390548706), ('driving', 1.6665740013122559), ('drive', 1.5565159320831299), ('he', 1.4721972942352295), ('stadiums', 1.449463129043579), ('to', 1.0441515445709229), ('car', 0.7002660632133484), ('visit', 0.5118278861045837), ('football', 0.502326250076294)]
In this example, the embeddings are 30,522-dimensional vectors, where each dimension corresponds to a token in the model's vocabulary. The decode method returned the top 10 tokens with the highest values in the embedding, allowing us to interpret which tokens contribute most to the embedding.
We can even determine the intersection or overlap between embeddings, very useful for determining why two texts are deemed similar or dissimilar:
# Let's also compute the intersection/overlap of the first two embeddings
intersection_embedding = model.intersection(embeddings[0], embeddings[1])
decoded_intersection = model.decode(intersection_embedding)
print(decoded_intersection)
Decoded: [('weather', 3.0842742919921875), ('cool', 1.379457712173462), ('summer', 0.5275946259498596), ('comfort', 0.3239051103591919), ('sally', 0.22571465373039246), ('julian', 0.14787325263023376), ('nature', 0.08582140505313873), ('beauty', 0.0588383711874485), ('mood', 0.018594780936837196), ('nathan', 0.000752730411477387)]
Query and Document Expansion
A key component of neural sparse embedding models is query/document expansion. Unlike traditional lexical methods like BM25, which only match exact tokens, neural sparse models generally automatically expand the original text with semantically related terms:
- Traditional, Lexical (e.g. BM25): Only matches on exact tokens in the text
- Neural Sparse Models: Automatically expand with related terms
For example, in the code output above, the sentence "The weather is lovely today" is expanded to include terms like "beautiful", "cool", "pretty", and "nice" which weren't in the original text. Similarly, "It's so sunny outside!" is expanded to include "weather", "summer", and "sun".
This expansion allows neural sparse models to match semantically related content or synonyms even without exact token matches, handle misspellings, and overcome vocabulary mismatch problems. This is why neural sparse models like SPLADE often outperform traditional lexical search methods while maintaining the efficiency benefits of sparse representations.
However, expansion has its risks. For example, query expansion for "What is the weather on Tuesday?" will likely also expand to "monday", "wednesday", etc., which may not be desired.
Why Use Sparse Embedding Models?
In short, neural sparse embedding models fall in a valuable niche between traditional lexical methods like BM25 and dense embedding models like Sentence Transformers. They have the following advantages:
- Hybrid potential: Very effectively combined with dense models, which may struggle with searches where lexical matches are important
- Interpretability: You can see exactly which tokens contribute to a match
- Performance: Competitive or better than dense models in many retrieval tasks
Throughout this blogpost, I'll use "sparse embedding model" and "sparse encoder model" interchangeably.
Why Finetune?
The majority of (neural) sparse embedding models employ the aforementioned query/document expansion so that you can match texts with nearly identical meaning, even if they don't share any words. In short, the model has to recognize synonyms so those tokens can be placed in the final embedding.
Most out-of-the-box sparse embedding models will easily recognize that "supermarket", "food", and "market" are useful expansions of a text containing "grocery", but for example:
- "The patient complained of severe cephalalgia."
expands to:
'##lal', 'severe', '##pha', 'ce', '##gia', 'patient', 'complaint', 'patients', 'complained', 'warning', 'suffered', 'had', 'disease', 'complain', 'diagnosis', 'syndrome', 'mild', 'pain', 'hospital', 'injury'
whereas we wish for it to expand to "headache", the common word for "cephalalgia". The same problem appears in many domains, e.g. a model not recognizing that "Java" is a programming language, that "Audi" makes cars, or that "NVIDIA" is a company that makes graphics cards.
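You can check how a given checkpoint expands your own domain-specific text with the same encode/decode pattern shown earlier; here's a minimal sketch (the checkpoint is just an example, substitute your own):

```python
from sentence_transformers import SparseEncoder

# Any pretrained SPLADE-style model works here; swap in the checkpoint you care about
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

sentences = ["The patient complained of severe cephalalgia."]
embeddings = model.encode(sentences)

# Inspect the strongest activations to see which (expanded) tokens the model relies on
for decoded, sentence in zip(model.decode(embeddings, top_k=20), sentences):
    print(f"Sentence: {sentence}")
    print(f"Decoded: {decoded}")
```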
Through finetuning, the model can learn to focus exclusively on the domain and/or language that matters to you.
Training Components
Training Sentence Transformer models involves the following components:
- Model: The model to train or finetune, which can be a pre-trained Sparse Encoder model or a base model.
- Dataset: The data used for training and evaluation.
- Loss Function: A function that quantifies the model's performance and guides the optimization process.
- Training Arguments (optional): Parameters that influence training performance and tracking/debugging.
- Evaluator (optional): A tool for evaluating the model before, during, or after training.
- Trainer: Brings together the model, dataset, loss function, and other components for training.
Now, let's dive into each of these components in more detail.
Model
Sparse Encoder models consist of a sequence of Modules, Sparse Encoder specific Modules or Custom Modules, allowing for a lot of flexibility. If you want to further finetune a Sparse Encoder model (e.g. it has a modules.json file), then you don't have to worry about which modules are used:
from sentence_transformers import SparseEncoder
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
But if instead you want to train from another checkpoint, or from scratch, then these are the most common architectures you can use:
Splade
Splade models use the MLMTransformer followed by a SpladePooling module. The former loads a pretrained Masked Language Modeling transformer model (e.g. BERT, RoBERTa, DistilBERT, ModernBERT, etc.) and the latter pools the output of the MLMHead to produce a single sparse embedding of the size of the vocabulary.
from sentence_transformers import models, SparseEncoder
from sentence_transformers.sparse_encoder.models import MLMTransformer, SpladePooling
# Initialize MLM Transformer (use a fill-mask model)
mlm_transformer = MLMTransformer("google-bert/bert-base-uncased")
# Initialize SpladePooling module
splade_pooling = SpladePooling(pooling_strategy="max")
# Create the Splade model
model = SparseEncoder(modules=[mlm_transformer, splade_pooling])
This architecture is the default if you provide a fill-mask model architecture to SparseEncoder, so it's easier to use the shortcut:
from sentence_transformers import SparseEncoder
model = SparseEncoder("google-bert/bert-base-uncased")
# SparseEncoder(
# (0): MLMTransformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertForMaskedLM'})
# (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': None})
# )
Inference-free Splade
Inference-free Splade uses a Router module with different modules for queries and documents. Usually for this type of architecture, the documents part is a traditional Splade architecture (an MLMTransformer followed by a SpladePooling module) and the query part is a SparseStaticEmbedding module, which just returns a pre-computed score for every token in the query.
from sentence_transformers import SparseEncoder
from sentence_transformers.models import Router
from sentence_transformers.sparse_encoder.models import SparseStaticEmbedding, MLMTransformer, SpladePooling
# Initialize MLM Transformer for document encoding
doc_encoder = MLMTransformer("google-bert/bert-base-uncased")
# Create a router model with different paths for queries and documents
router = Router.for_query_document(
    query_modules=[SparseStaticEmbedding(tokenizer=doc_encoder.tokenizer, frozen=False)],
    # Document path: full MLM transformer + pooling
    document_modules=[doc_encoder, SpladePooling("max")],
)
# Create the inference-free model
model = SparseEncoder(modules=[router], similarity_fn_name="dot")
# SparseEncoder(
# (0): Router(
# (query_0_SparseStaticEmbedding): SparseStaticEmbedding ({'frozen': False}, dim:30522, tokenizer: BertTokenizerFast)
# (document_0_MLMTransformer): MLMTransformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertForMaskedLM'})
# (document_1_SpladePooling): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': None})
# )
# )
This architecture allows for fast query-time processing using the lightweight SparseStaticEmbedding approach, which can be trained and seen as a set of linear weights, while documents are processed with the full MLM transformer and SpladePooling.
Inference-free Splade is particularly useful for search applications where query latency is critical, as it shifts the computational complexity to the document indexing phase which can be done offline.
When training models with the Router module, you must use the router_mapping argument in the SparseEncoderTrainingArguments to map the training dataset columns to the correct route ("query" or "document"). For example, if your dataset(s) have ["question", "answer"] columns, then you can use the following mapping:
args = SparseEncoderTrainingArguments(
    ...,
    router_mapping={
        "question": "query",
        "answer": "document",
    }
)
Additionally, it is recommended to use a much higher learning rate for the SparseStaticEmbedding module than for the rest of the model. For this, you should use the learning_rate_mapping argument in the SparseEncoderTrainingArguments to map parameter patterns to their learning rates. For example, if you want to use a learning rate of 1e-3 for the SparseStaticEmbedding module and 2e-5 for the rest of the model, you can do this:
args = SparseEncoderTrainingArguments(
    ...,
    learning_rate=2e-5,
    learning_rate_mapping={
        r"SparseStaticEmbedding\.*": 1e-3,
    }
)
Contrastive Sparse Representation (CSR)
Contrastive Sparse Representation (CSR) models, introduced in Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation, apply a SparseAutoEncoder module on top of a dense Sentence Transformer model, which usually consists of a Transformer followed by a Pooling module. You can initialize one from scratch like so:
from sentence_transformers import models, SparseEncoder
from sentence_transformers.sparse_encoder.models import SparseAutoEncoder
# Initialize transformer (can be any dense encoder model)
transformer = models.Transformer("google-bert/bert-base-uncased")
# Initialize pooling
pooling = models.Pooling(transformer.get_word_embedding_dimension(), pooling_mode="mean")
# Initialize SparseAutoEncoder module
sparse_auto_encoder = SparseAutoEncoder(
    input_dim=transformer.get_word_embedding_dimension(),
    hidden_dim=4 * transformer.get_word_embedding_dimension(),
    k=256,  # Number of top values to keep
    k_aux=512,  # Number of top values for auxiliary loss
)
# Create the CSR model
model = SparseEncoder(modules=[transformer, pooling, sparse_auto_encoder])
Or if your base model is 1) a dense Sentence Transformer model or 2) a non-MLM Transformer model (those are loaded as Splade models by default), then this shortcut will automatically initialize the CSR model for you:
from sentence_transformers import SparseEncoder
model = SparseEncoder("mixedbread-ai/mxbai-embed-large-v1")
# SparseEncoder(
# (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertModel'})
# (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
# (2): SparseAutoEncoder({'input_dim': 1024, 'hidden_dim': 4096, 'k': 256, 'k_aux': 512, 'normalize': False, 'dead_threshold': 30})
# )
Unlike (Inference-free) Splade models, sparse embeddings by CSR models don't have the same size as the vocabulary of the base model. This means you can't directly interpret which words are activated in your embedding like you can with Splade models, where each dimension corresponds to a specific token in the vocabulary.
Beyond that, CSR models are most effective on dense encoder models that use high-dimensional representations (e.g. 1024-4096 dimensions).
Architecture Picker Guide
If you're unsure which architecture to use, here's a quick guide:
- Do you want to sparsify an existing Dense Embedding model? If yes, use CSR.
- Do you want your query inference to be instantaneous at the cost of slight performance? If yes, use Inference-free SPLADE.
- Otherwise, use SPLADE.
Dataset
The SparseEncoderTrainer uses datasets.Dataset or datasets.DatasetDict instances for training and evaluation. You can load data from the Hugging Face Datasets Hub or use local data in various formats such as CSV, JSON, Parquet, Arrow, or SQL.
Note: Lots of public datasets that work out of the box with Sentence Transformers have been tagged with sentence-transformers on the Hugging Face Hub, so you can easily find them on https://huggingface.co/datasets?other=sentence-transformers. Consider browsing through these to find ready-to-go datasets that might be useful for your tasks, domains, or languages.
Data on the Hugging Face Hub
You can use the load_dataset function to load data from datasets in the Hugging Face Hub:
from datasets import load_dataset
train_dataset = load_dataset("sentence-transformers/natural-questions", split="train")
print(train_dataset)
"""
Dataset({
features: ['query', 'answer'],
num_rows: 100231
})
"""
Some datasets, like nthakur/swim-ir-monolingual, have multiple subsets with different data formats. You need to specify the subset name along with the dataset name, e.g. dataset = load_dataset("nthakur/swim-ir-monolingual", "de", split="train").
Local Data (CSV, JSON, Parquet, Arrow, SQL)
You can also use load_dataset for loading local data in certain file formats:
from datasets import load_dataset
dataset = load_dataset("csv", data_files="my_file.csv")
# or
dataset = load_dataset("json", data_files="my_file.json")
Local Data that requires pre-processing
You can use datasets.Dataset.from_dict if your local data requires pre-processing. This allows you to initialize your dataset with a dictionary of lists:
from datasets import Dataset
queries = []
documents = []
# Open a file, perform preprocessing, filtering, cleaning, etc.
# and append to the lists
dataset = Dataset.from_dict({
    "query": queries,
    "document": documents,
})
Each key in the dictionary becomes a column in the resulting dataset.
Dataset Format
It's crucial to ensure that your dataset format matches your chosen loss function. This involves checking two things:
- If your loss function requires a Label (as indicated in the Loss Overview table), your dataset must have a column named "label" or "score".
- All columns other than "label" or "score" are considered Inputs (as indicated in the Loss Overview table). The number of these columns must match the number of valid inputs for your chosen loss function. The names of the columns don't matter, only their order matters.
For example, if your loss function accepts (anchor, positive, negative) triplets, then your first, second, and third dataset columns correspond with anchor, positive, and negative, respectively. This means that your first and second column must contain texts that should embed closely, and that your first and third column must contain texts that should embed far apart. That is why depending on your loss function, your dataset column order matters.
Consider a dataset with columns ["text1", "text2", "label"], where the "label" column contains floating point similarity scores. This dataset can be used with SparseCoSENTLoss, SparseAnglELoss, and SparseCosineSimilarityLoss because:
- The dataset has a "label" column, which is required by these loss functions.
- The dataset has 2 non-label columns, matching the number of inputs required by these loss functions.
If the columns in your dataset are not ordered correctly, use Dataset.select_columns to reorder them. Additionally, remove any extraneous columns (e.g., sample_id, metadata, source, type) using Dataset.remove_columns, as they will be treated as inputs otherwise.
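As a small illustration (the column names and values here are hypothetical):

```python
from datasets import Dataset

# Hypothetical dataset with an extra column and the wrong column order
dataset = Dataset.from_dict({
    "sample_id": [0, 1],
    "positive": ["doc A", "doc B"],
    "anchor": ["query A", "query B"],
})

# Keep only the input columns, in the order the loss expects: (anchor, positive)
dataset = dataset.select_columns(["anchor", "positive"])
print(dataset.column_names)  # ['anchor', 'positive']
```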
Loss Function
Loss functions measure how well a model performs on a given batch of data and guide the optimization process. The choice of loss function depends on your available data and target task. Refer to the Loss Overview for a comprehensive list of options.
To train a SparseEncoder, you either need a SpladeLoss or CSRLoss, depending on the architecture. These are wrapper losses that add sparsity regularization on top of a main loss function, which must be provided as a parameter. The only loss that can be used independently is SparseMSELoss, as it performs embedding-level distillation, ensuring sparsity by directly copying the teacher's sparse embedding.
Most loss functions can be initialized with just the SparseEncoder that you're training, alongside some optional parameters, e.g.:
from datasets import load_dataset
from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.losses import SpladeLoss, SparseMultipleNegativesRankingLoss
# Load a model to train/finetune
model = SparseEncoder("distilbert/distilbert-base-uncased")
# Initialize the SpladeLoss with a SparseMultipleNegativesRankingLoss
# This loss requires pairs of related texts or triplets
loss = SpladeLoss(
    model=model,
    loss=SparseMultipleNegativesRankingLoss(model=model),
    query_regularizer_weight=5e-5,  # Weight for query loss
    document_regularizer_weight=3e-5,
)
# Load an example training dataset that works with our loss function:
train_dataset = load_dataset("sentence-transformers/natural-questions", split="train")
print(train_dataset)
"""
Dataset({
features: ['query', 'answer'],
num_rows: 100231
})
"""
Training Arguments
The SparseEncoderTrainingArguments class allows you to specify parameters that influence training performance and tracking/debugging. While optional, experimenting with these arguments can help improve training efficiency and provide insights into the training process.
In the Sentence Transformers documentation, I've outlined some of the most useful training arguments. I would recommend reading it in Training Overview > Training Arguments.
Here's an example of how to initialize SparseEncoderTrainingArguments:
from sentence_transformers import SparseEncoderTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SparseEncoderTrainingArguments(
    # Required parameter:
    output_dir="models/splade-distilbert-base-uncased-nq",
    # Optional training parameters:
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,  # Set to False if your GPU can't handle FP16
    bf16=False,  # Set to True if your GPU supports BF16
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # Losses using "in-batch negatives" benefit from no duplicates
    # Optional tracking/debugging parameters:
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2,
    logging_steps=100,
    run_name="splade-distilbert-base-uncased-nq",  # Used in W&B if `wandb` is installed
)
Note that eval_strategy was introduced in transformers version 4.41.0. Prior versions should use evaluation_strategy instead.
Evaluator
You can provide the SparseEncoderTrainer with an eval_dataset to get the evaluation loss during training, but it may be useful to get more concrete metrics during training, too. For this, you can use evaluators to assess the model's performance with useful metrics before, during, or after training. You can use both an eval_dataset and an evaluator, one or the other, or neither. They evaluate based on the eval_strategy and eval_steps Training Arguments.
Here are the implemented Evaluators that come with Sentence Transformers for Sparse Encoder models:
| Evaluator | Required Data |
|---|---|
| SparseBinaryClassificationEvaluator | Pairs with class labels. |
| SparseEmbeddingSimilarityEvaluator | Pairs with similarity scores. |
| SparseInformationRetrievalEvaluator | Queries (qid => question), Corpus (cid => document), and relevant documents (qid => set[cid]). |
| SparseNanoBEIREvaluator | No data required. |
| SparseMSEEvaluator | Source sentences to embed with a teacher model and target sentences to embed with the student model. Can be the same texts. |
| SparseRerankingEvaluator | List of {'query': '...', 'positive': [...], 'negative': [...]} dictionaries. |
| SparseTranslationEvaluator | Pairs of sentences in two separate languages. |
| SparseTripletEvaluator | (anchor, positive, negative) pairs. |
Additionally, SequentialEvaluator should be used to combine multiple evaluators into one Evaluator that can be passed to the SparseEncoderTrainer.
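For example, here is a small sketch that chains a NanoBEIR retrieval evaluator with an STS evaluator (both also appear individually below):

```python
from datasets import load_dataset
from sentence_transformers.evaluation import SequentialEvaluator
from sentence_transformers.sparse_encoder.evaluation import (
    SparseEmbeddingSimilarityEvaluator,
    SparseNanoBEIREvaluator,
)

# A retrieval evaluator that needs no data, plus an STS evaluator built from a public dataset
retrieval_evaluator = SparseNanoBEIREvaluator(dataset_names=["msmarco"])
stsb = load_dataset("sentence-transformers/stsb", split="validation")
sts_evaluator = SparseEmbeddingSimilarityEvaluator(
    sentences1=stsb["sentence1"],
    sentences2=stsb["sentence2"],
    scores=stsb["score"],
    name="sts-dev",
)

# Chain them into one evaluator that can be passed to the SparseEncoderTrainer
dev_evaluator = SequentialEvaluator([retrieval_evaluator, sts_evaluator])
# results = dev_evaluator(model)
```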
Sometimes you don't have the required evaluation data to prepare one of these evaluators on your own, but you still want to track how well the model performs on some common benchmarks. In that case, you can use these evaluators with data from Hugging Face.
SparseNanoBEIREvaluator
from sentence_transformers.sparse_encoder.evaluation import SparseNanoBEIREvaluator
# Initialize the evaluator. Unlike most other evaluators, this one loads the relevant datasets
# directly from Hugging Face, so there's no mandatory arguments
dev_evaluator = SparseNanoBEIREvaluator()
# You can run evaluation like so:
# results = dev_evaluator(model)
SparseEmbeddingSimilarityEvaluator with STSb
Documentation:
- sentence-transformers/stsb
- sentence_transformers.sparse_encoder.evaluation.SparseEmbeddingSimilarityEvaluator
- sentence_transformers.SimilarityFunction
from datasets import load_dataset
from sentence_transformers.evaluation import SimilarityFunction
from sentence_transformers.sparse_encoder.evaluation import SparseEmbeddingSimilarityEvaluator
# Load the STSB dataset (https://huggingface.co/datasets/sentence-transformers/stsb)
eval_dataset = load_dataset("sentence-transformers/stsb", split="validation")
# Initialize the evaluator
dev_evaluator = SparseEmbeddingSimilarityEvaluator(
    sentences1=eval_dataset["sentence1"],
    sentences2=eval_dataset["sentence2"],
    scores=eval_dataset["score"],
    main_similarity=SimilarityFunction.COSINE,
    name="sts-dev",
)
# You can run evaluation like so:
# results = dev_evaluator(model)
SparseTripletEvaluator with AllNLI
Documentation:
- sentence-transformers/all-nli
- sentence_transformers.sparse_encoder.evaluation.SparseTripletEvaluator
- sentence_transformers.SimilarityFunction
from datasets import load_dataset
from sentence_transformers.evaluation import SimilarityFunction
from sentence_transformers.sparse_encoder.evaluation import SparseTripletEvaluator
# Load triplets from the AllNLI dataset (https://huggingface.co/datasets/sentence-transformers/all-nli)
max_samples = 1000
eval_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split=f"dev[:{max_samples}]")
# Initialize the evaluator
dev_evaluator = SparseTripletEvaluator(
    anchors=eval_dataset["anchor"],
    positives=eval_dataset["positive"],
    negatives=eval_dataset["negative"],
    main_distance_function=SimilarityFunction.DOT,
    name="all-nli-dev",
)
# You can run evaluation like so:
# results = dev_evaluator(model)
When evaluating frequently during training with a small eval_steps, consider using a tiny eval_dataset to minimize evaluation overhead. If you're concerned about the evaluation set size, a 90-1-9 train-eval-test split can provide a balance, reserving a reasonably sized test set for final evaluations. After training, you can assess your model's performance using trainer.evaluate(test_dataset) for test loss or initialize a testing evaluator with test_evaluator(model) for detailed test metrics.
If you evaluate after training, but before saving the model, your automatically generated model card will still include the test results.
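To make the 90-1-9 split from the tip above concrete, here is a small sketch using datasets (the dataset and seed are just examples):

```python
from datasets import load_dataset

dataset = load_dataset("sentence-transformers/natural-questions", split="train")

# First split off 10% (eval + test), then split that 10% into 1% eval and 9% test
split = dataset.train_test_split(test_size=0.10, seed=12)
train_dataset = split["train"]
eval_test = split["test"].train_test_split(test_size=0.90, seed=12)
eval_dataset = eval_test["train"]  # ~1% of the original data
test_dataset = eval_test["test"]   # ~9% of the original data
```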
When using Distributed Training, the evaluator only runs on the first device, unlike the training and evaluation datasets, which are shared across all devices.
Trainer
The SparseEncoderTrainer is where all previous components come together. We only have to specify the trainer with the model, training arguments (optional), training dataset, evaluation dataset (optional), loss function, and evaluator (optional), and we can start training. Let's have a look at a script where all of these components come together:
import logging
from datasets import load_dataset
from sentence_transformers import (
    SparseEncoder,
    SparseEncoderModelCardData,
    SparseEncoderTrainer,
    SparseEncoderTrainingArguments,
)
from sentence_transformers.models import Router
from sentence_transformers.sparse_encoder.evaluation import SparseNanoBEIREvaluator
from sentence_transformers.sparse_encoder.losses import SparseMultipleNegativesRankingLoss, SpladeLoss
from sentence_transformers.sparse_encoder.models import SparseStaticEmbedding, MLMTransformer, SpladePooling
from sentence_transformers.training_args import BatchSamplers
logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO)
# 1. Load a model to finetune with 2. (Optional) model card data
mlm_transformer = MLMTransformer("distilbert/distilbert-base-uncased", tokenizer_args={"model_max_length": 512})
splade_pooling = SpladePooling(
    pooling_strategy="max", word_embedding_dimension=mlm_transformer.get_sentence_embedding_dimension()
)
router = Router.for_query_document(
    query_modules=[SparseStaticEmbedding(tokenizer=mlm_transformer.tokenizer, frozen=False)],
    document_modules=[mlm_transformer, splade_pooling],
)
model = SparseEncoder(
    modules=[router],
    model_card_data=SparseEncoderModelCardData(
        language="en",
        license="apache-2.0",
        model_name="Inference-free SPLADE distilbert-base-uncased trained on Natural-Questions tuples",
    ),
)
# 3. Load a dataset to finetune on
full_dataset = load_dataset("sentence-transformers/natural-questions", split="train").select(range(100_000))
dataset_dict = full_dataset.train_test_split(test_size=1_000, seed=12)
train_dataset = dataset_dict["train"]
eval_dataset = dataset_dict["test"]
print(train_dataset)
print(train_dataset[0])
# 4. Define a loss function
loss = SpladeLoss(
    model=model,
    loss=SparseMultipleNegativesRankingLoss(model=model),
    query_regularizer_weight=0,
    document_regularizer_weight=3e-3,
)
# 5. (Optional) Specify training arguments
run_name = "inference-free-splade-distilbert-base-uncased-nq"
args = SparseEncoderTrainingArguments(
    # Required parameter:
    output_dir=f"models/{run_name}",
    # Optional training parameters:
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    learning_rate_mapping={r"SparseStaticEmbedding\.weight": 1e-3},  # Set a higher learning rate for the SparseStaticEmbedding module
    warmup_ratio=0.1,
    fp16=True,  # Set to False if you get an error that your GPU can't run on FP16
    bf16=False,  # Set to True if you have a GPU that supports BF16
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
    router_mapping={"query": "query", "answer": "document"},  # Map the column names to the routes
    # Optional tracking/debugging parameters:
    eval_strategy="steps",
    eval_steps=1000,
    save_strategy="steps",
    save_steps=1000,
    save_total_limit=2,
    logging_steps=200,
    run_name=run_name,  # Will be used in W&B if `wandb` is installed
)
# 6. (Optional) Create an evaluator & evaluate the base model
dev_evaluator = SparseNanoBEIREvaluator(dataset_names=["msmarco", "nfcorpus", "nq"], batch_size=16)
# 7. Create a trainer & train
trainer = SparseEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    evaluator=dev_evaluator,
)
trainer.train()
# 8. Evaluate the model performance again after training
dev_evaluator(model)
# 9. Save the trained model
model.save_pretrained(f"models/{run_name}/final")
# 10. (Optional) Push it to the Hugging Face Hub
model.push_to_hub(run_name)
In this example I'm finetuning from distilbert/distilbert-base-uncased, a base model that is not yet a Sparse Encoder model. This requires more training data than finetuning an existing Sparse Encoder model, like naver/splade-cocondenser-ensembledistil.
After running this script, the sparse-encoder/example-inference-free-splade-distilbert-base-uncased-nq model was uploaded for me. The model scores 0.5241 NDCG@10 on NanoMSMARCO, 0.3299 NDCG@10 on NanoNFCorpus and 0.5357 NDCG@10 on NanoNQ, which is a good result for an inference-free distilbert-based model trained on just 100k pairs from the Natural Questions dataset.
The model uses an average of 184 active dimensions in the sparse embeddings for the documents, compared to 7.7 active dimensions for the queries (i.e. the average number of tokens in the query). This corresponds to a sparsity of 99.39% and 99.97%, respectively.
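If you want to compute these statistics for your own texts, a rough sketch with plain PyTorch could look like this (the example documents are made up; convert_to_sparse_tensor=True mirrors the Qdrant example later in this post):

```python
from sentence_transformers import SparseEncoder

# The model trained above; any SparseEncoder works here
model = SparseEncoder("sparse-encoder/example-inference-free-splade-distilbert-base-uncased-nq")

docs = ["Cephalalgia is the medical term for a headache.", "He drove to the stadium."]
embeddings = model.encode_document(docs, convert_to_sparse_tensor=True).to_dense()

# Active dimensions = non-zero entries per embedding; sparsity = share of zero entries
active_dims = (embeddings != 0).sum(dim=1).float()
print(f"Average active dimensions: {active_dims.mean().item():.1f}")
print(f"Average sparsity: {(1 - active_dims / embeddings.shape[1]).mean().item():.2%}")
```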
All of this information is stored in the automatically generated model card, including the base model, language, license, evaluation results, training & evaluation dataset info, hyperparameters, training logs, and more. Without any effort, your uploaded models should contain all the information that your potential users would need to determine whether your model is suitable for them.
Callbacks
The Sentence Transformers trainer supports various transformers.TrainerCallback subclasses, including:
- WandbCallback for logging training metrics to W&B if wandb is installed
- TensorBoardCallback for logging training metrics to TensorBoard if tensorboard is accessible
- CodeCarbonCallback for tracking carbon emissions during training if codecarbon is installed
These are automatically used without you having to specify anything, as long as the required dependency is installed.
Refer to the Transformers Callbacks documentation for more information on these callbacks and how to create your own.
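As a small sketch of a custom callback (what it logs is purely illustrative; custom callbacks are passed to the trainer the same way as with the underlying transformers Trainer):

```python
from transformers import TrainerCallback

class PrintLogsCallback(TrainerCallback):
    """A minimal custom callback that prints every set of logged metrics."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs:
            print(f"Step {state.global_step}: {logs}")

# e.g. SparseEncoderTrainer(..., callbacks=[PrintLogsCallback()])
```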
Multi-Dataset Training
Top-performing models are often trained using multiple datasets simultaneously. The SparseEncoderTrainer simplifies this process by allowing you to train with multiple datasets without converting them to the same format. You can even apply different loss functions to each dataset. Here are the steps for multi-dataset training (a short sketch follows after the sampler options below):
- Use a dictionary of datasets.Dataset instances (or a datasets.DatasetDict) as the train_dataset and eval_dataset.
- (Optional) Use a dictionary of loss functions mapping dataset names to losses if you want to use different losses for different datasets.
Each training/evaluation batch will contain samples from only one of the datasets. The order in which batches are sampled from the multiple datasets is determined by the MultiDatasetBatchSamplers enum, which can be passed to the SparseEncoderTrainingArguments via multi_dataset_batch_sampler. The valid options are:
- MultiDatasetBatchSamplers.ROUND_ROBIN: Samples from each dataset in a round-robin fashion until one is exhausted. This strategy may not use all samples from each dataset, but it ensures equal sampling from each dataset.
- MultiDatasetBatchSamplers.PROPORTIONAL (default): Samples from each dataset proportionally to its size. This strategy ensures that all samples from each dataset are used, and larger datasets are sampled from more frequently.
Evaluation
Let's evaluate our newly trained inference-free SPLADE model using the NanoMSMARCO dataset, and see how it compares to dense retrieval approaches. We'll also explore hybrid retrieval methods that combine sparse and dense vectors, as well as reranking to further improve search quality.
After running a slightly modified version of our hybrid_search.py script, we get the following results for the NanoMSMARCO dataset, using these models:
- Sparse: sparse-encoder/example-inference-free-splade-distilbert-base-uncased-nq (the model we just trained)
- Dense: sentence-transformers/all-MiniLM-L6-v2
- Reranker: cross-encoder/ms-marco-MiniLM-L6-v2
| Sparse | Dense | Reranker | NDCG@10 | MRR@10 | MAP |
|---|---|---|---|---|---|
| x |   |   | 52.41 | 43.06 | 44.20 |
|   | x |   | 55.40 | 47.96 | 49.08 |
| x | x |   | 62.22 | 53.02 | 53.44 |
| x |   | x | 66.31 | 59.45 | 60.36 |
|   | x | x | 66.28 | 59.43 | 60.34 |
| x | x | x | 66.28 | 59.43 | 60.34 |
The Sparse and Dense rankings can be combined using Reciprocal Rank Fusion (RRF), which is a simple way to combine the results of multiple rankings. If a Reranker is applied, it will rerank the results of the prior retrieval step.
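As a sketch, RRF only needs the rank positions from each retriever; each document receives 1 / (k + rank) from every ranking it appears in (k = 60 is the commonly used constant):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Combine multiple ranked lists of document ids into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# e.g. fuse a sparse and a dense ranking (document ids are hypothetical)
fused = reciprocal_rank_fusion([["d3", "d1", "d2"], ["d1", "d3", "d4"]])
print(fused)
```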
The results indicate that for this dataset, combining Dense and Sparse rankings is very performant, resulting in 12.3% and 18.7% increases over the Dense and Sparse baselines, respectively. In short, combining Sparse and Dense retrieval methods is a very effective way to improve search performance.
Furthermore, applying a reranker on any of the rankings improved the performance to approximately 66.3 NDCG@10, showing that either Sparse, Dense, or Hybrid (Dense + Sparse) found the relevant documents in their top 100, which the reranker then ranked to the top 10. So, replacing a Dense -> Reranker pipeline with a Sparse -> Reranker pipeline might improve both latency and costs:
- Sparse embeddings can be cheaper to store, e.g. our model only uses ~180 active dimensions for MS MARCO documents instead of the common 1024 dimensions for dense models.
- Some Sparse Encoders allow for inference-free query processing, allowing for a near-instant first-stage retrieval, akin to lexical solutions like BM25.
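As a rough sketch of such a Sparse -> Reranker pipeline (the corpus and query are toy examples; the model names are the ones used in the comparison above):

```python
from sentence_transformers import CrossEncoder, SparseEncoder

sparse_model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")

corpus = [
    "Cephalalgia is the medical term for a headache.",
    "He drove to the stadium.",
    "The weather is lovely today.",
]
query = "what does cephalalgia mean"

# First stage: sparse retrieval of the top candidates
query_embedding = sparse_model.encode_query([query])
corpus_embeddings = sparse_model.encode_document(corpus)
scores = sparse_model.similarity(query_embedding, corpus_embeddings)[0]
top_indices = scores.argsort(descending=True)[:2].tolist()
candidates = [corpus[i] for i in top_indices]

# Second stage: rerank only the retrieved candidates with the CrossEncoder
for entry in reranker.rank(query, candidates):
    print(f"{entry['score']:.2f}  {candidates[entry['corpus_id']]}")
```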
Training Tips
Sparse Encoder models have a few quirks that you should be aware of when training them:
- Sparse Encoder models should not be evaluated solely using the evaluation scores, but also with the sparsity of the embeddings. After all, a low sparsity means that the model embeddings are expensive to store and slow to retrieve.
- The stronger Sparse Encoder models are trained almost exclusively with distillation from a stronger teacher model (e.g. a CrossEncoder model), instead of training directly from text pairs or triplets. See for example the SPLADE-v3 paper, which uses SparseDistillKLDivLoss and SparseMarginMSELoss for distillation. We don't cover this in detail in this blog as it requires more data preparation, but a distillation setup should be seriously considered (a rough sketch follows below).
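For reference, here is a very rough sketch of what setting up a margin-based distillation loss could look like. It assumes a dataset with (query, positive, negative) columns plus a "label" column holding the teacher's score margin (teacher_score(query, positive) - teacher_score(query, negative)), i.e. the format MarginMSE-style losses expect; the actual data preparation is not shown:

```python
from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.losses import SpladeLoss, SparseMarginMSELoss

model = SparseEncoder("distilbert/distilbert-base-uncased")

# Wrap the distillation loss in SpladeLoss, just like the ranking loss earlier
loss = SpladeLoss(
    model=model,
    loss=SparseMarginMSELoss(model=model),
    query_regularizer_weight=5e-5,
    document_regularizer_weight=3e-5,
)
```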
Vector Database Integration
After training sparse embedding models, the next crucial step is deploying them effectively in production environments. Vector databases provide the essential infrastructure for storing, indexing, and retrieving sparse embeddings at scale. Popular options include Qdrant, OpenSearch, Elasticsearch, and Seismic, among others.
For comprehensive examples covering vector databases mentioned above, refer to the semantic search with vector database documentation or below for the Qdrant example.
Qdrant Integration Example
Qdrant offers excellent support for sparse vectors with efficient storage and fast retrieval capabilities. Below is a comprehensive implementation example:
Prerequisites:
- Qdrant running locally (or accessible), see the Qdrant Quickstart for more details.
- Python Qdrant Client installed: pip install qdrant-client
This example demonstrates how to set up Qdrant for sparse vector search by showing how to efficiently encode and index documents with sparse encoders, formulating search queries with sparse vectors, and providing an interactive query interface. See below:
import time
from datasets import load_dataset
from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.search_engines import semantic_search_qdrant
# 1. Load the natural-questions dataset with 100K answers
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
num_docs = 10_000
corpus = dataset["answer"][:num_docs]
# 2. Come up with some queries
queries = dataset["query"][:2]
# 3. Load the model
sparse_model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
# 4. Encode the corpus
corpus_embeddings = sparse_model.encode_document(
    corpus, convert_to_sparse_tensor=True, batch_size=16, show_progress_bar=True
)
# Initially, we don't have a qdrant index yet
corpus_index = None
while True:
    # 5. Encode the queries using the full precision
    start_time = time.time()
    query_embeddings = sparse_model.encode_query(queries, convert_to_sparse_tensor=True)
    print(f"Encoding time: {time.time() - start_time:.6f} seconds")

    # 6. Perform semantic search using qdrant
    results, search_time, corpus_index = semantic_search_qdrant(
        query_embeddings,
        corpus_index=corpus_index,
        corpus_embeddings=corpus_embeddings if corpus_index is None else None,
        top_k=5,
        output_index=True,
    )

    # 7. Output the results
    print(f"Search time: {search_time:.6f} seconds")
    for query, result in zip(queries, results):
        print(f"Query: {query}")
        for entry in result:
            print(f"(Score: {entry['score']:.4f}) {corpus[entry['corpus_id']]}, corpus_id: {entry['corpus_id']}")
        print("")

    # 8. Prompt for more queries
    queries = [input("Please enter a question: ")]
Additional Resources
Training Examples
The following pages contain training examples with explanations as well as links to code. We recommend that you browse through these to familiarize yourself with the training loop:
- Model Distillation - Examples to make models smaller, faster and lighter.
- MS MARCO - Example training scripts for training on the MS MARCO information retrieval dataset.
- Retrievers - Example training scripts for training on generic information retrieval datasets.
- Natural Language Inference - Natural Language Inference (NLI) data can be quite helpful to pre-train and fine-tune models to create meaningful sparse embeddings.
- Quora Duplicate Questions - Quora Duplicate Questions is a large corpus with duplicate questions from the Quora community. The folder contains examples of how to train models for duplicate question mining and for semantic search.
- STS - The most basic method to train models is using Semantic Textual Similarity (STS) data. Here, we use sentence pairs and a score indicating the semantic similarity.
Documentation
Additionally, the following pages may be useful to learn more about Sentence Transformers:
- Installation
- Quickstart
- Usage
- Pretrained Models
- Training Overview (This blogpost is a distillation of the Training Overview documentation)
- Dataset Overview
- Loss Overview
- API Reference
And lastly, here are some advanced pages that might interest you: