Training and Finetuning Sparse Embedding Models with Sentence Transformers v5
Sentence Transformers is a Python library for using and training embedding and reranker models for a wide range of applications, such as retrieval augmented generation, semantic search, semantic textual similarity, paraphrase mining, and more. The last few major versions have introduced significant improvements to training:
- v3.0: (improved) Sentence Transformer (Dense Embedding) model training
- v4.0: (improved) Cross Encoder (Reranker) model training
- v5.0: (new) Sparse Embedding model training
In this blogpost, I'll show you how to use it to finetune a sparse encoder/embedding model and explain why you might want to do so. This results in sparse-encoder/example-inference-free-splade-distilbert-base-uncased-nq, a cheap model that works especially well in hybrid search or retrieve and rerank scenarios.
Finetuning sparse embedding models involves several components: the model, datasets, loss functions, training arguments, evaluators, and the trainer class. I'll have a look at each of these components, accompanied by practical examples of how they can be used for finetuning strong sparse embedding models.
In addition to training your own models, you can choose from a wide range of pretrained sparse encoders available on the Hugging Face Hub. To help navigate this growing space, we’ve curated a SPLADE Models collection highlighting some of the most relevant models.
We list the most prominent ones along with their benchmark results in Pretrained Models in the documentation.
Table of Contents
- What are Sparse Embedding models?
- Why Finetune?
- Training Components
  - Model
  - Dataset
  - Loss Function
  - Training Arguments
  - Evaluator
  - Trainer
- Evaluation
- Training Tips
- Vector Database Integration
- Additional Resources
What are Sparse Embedding models?
The broader term "embedding models" refers to models that convert some input, usually text, into a vector representation (embedding) that captures the semantic meaning of the input. Unlike with the raw inputs, you can perform mathematical operations on these embeddings, resulting in similarity scores that can be used for various tasks, such as search, clustering, or classification.
With dense embedding models, i.e. the common variety, the embeddings are typically low-dimensional vectors (e.g., 384, 768, or 1024 dimensions) where most values are non-zero. Sparse embedding models, on the other hand, produce high-dimensional vectors (e.g., 30,000+ dimensions) where most values are zero. Usually, each active dimension (i.e. the dimension with a non-zero value) in a sparse embedding corresponds to a specific token in the model's vocabulary, allowing for interpretability.
Let's have a look at naver/splade-v3, a state-of-the-art sparse embedding model, as an example:
from sentence_transformers import SparseEncoder
# Download from the 🤗 Hub
model = SparseEncoder("naver/splade-v3")
# Run inference
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 30522)
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 32.4323, 5.8528, 0.0258],
# [ 5.8528, 26.6649, 0.0302],
# [ 0.0258, 0.0302, 24.0839]])
# Let's decode our embeddings to be able to interpret them
decoded = model.decode(embeddings, top_k=10)
for decoded, sentence in zip(decoded, sentences):
    print(f"Sentence: {sentence}")
    print(f"Decoded: {decoded}")
    print()
Sentence: The weather is lovely today.
Decoded: [('weather', 2.754288673400879), ('today', 2.610959529876709), ('lovely', 2.431990623474121), ('currently', 1.5520408153533936), ('beautiful', 1.5046082735061646), ('cool', 1.4664798974990845), ('pretty', 0.8986214995384216), ('yesterday', 0.8603134155273438), ('nice', 0.8322536945343018), ('summer', 0.7702118158340454)]
Sentence: It's so sunny outside!
Decoded: [('outside', 2.6939032077789307), ('sunny', 2.535827398300171), ('so', 2.0600898265838623), ('out', 1.5397940874099731), ('weather', 1.1198079586029053), ('very', 0.9873268604278564), ('cool', 0.9406591057777405), ('it', 0.9026399254798889), ('summer', 0.684999406337738), ('sun', 0.6520509123802185)]
Sentence: He drove to the stadium.
Decoded: [('stadium', 2.7872302532196045), ('drove', 1.8208855390548706), ('driving', 1.6665740013122559), ('drive', 1.5565159320831299), ('he', 1.4721972942352295), ('stadiums', 1.449463129043579), ('to', 1.0441515445709229), ('car', 0.7002660632133484), ('visit', 0.5118278861045837), ('football', 0.502326250076294)]
In this example, the embeddings are 30,522-dimensional vectors, where each dimension corresponds to a token in the model's vocabulary. The decode method returned the top 10 tokens with the highest values in the embedding, allowing us to interpret which tokens contribute most to the embedding.
We can even determine the intersection or overlap between embeddings, very useful for determining why two texts are deemed similar or dissimilar:
# Let's also compute the intersection/overlap of the first two embeddings
intersection_embedding = model.intersection(embeddings[0], embeddings[1])
decoded_intersection = model.decode(intersection_embedding)
print(decoded_intersection)
Decoded: [('weather', 3.0842742919921875), ('cool', 1.379457712173462), ('summer', 0.5275946259498596), ('comfort', 0.3239051103591919), ('sally', 0.22571465373039246), ('julian', 0.14787325263023376), ('nature', 0.08582140505313873), ('beauty', 0.0588383711874485), ('mood', 0.018594780936837196), ('nathan', 0.000752730411477387)]
Query and Document Expansion
A key component of neural sparse embedding models is query/document expansion. Unlike traditional lexical methods like BM25, which only match exact tokens, neural sparse models generally automatically expand the original text with semantically related terms:
- Traditional, Lexical (e.g. BM25): Only matches on exact tokens in the text
- Neural Sparse Models: Automatically expand with related terms
For example, in the code output above, the sentence "The weather is lovely today" is expanded to include terms like "beautiful", "cool", "pretty", and "nice" which weren't in the original text. Similarly, "It's so sunny outside!" is expanded to include "weather", "summer", and "sun".
This expansion allows neural sparse models to match semantically related content or synonyms even without exact token matches, handle misspellings, and overcome vocabulary mismatch problems. This is why neural sparse models like SPLADE often outperform traditional lexical search methods while maintaining the efficiency benefits of sparse representations.
However, expansion has its risks. For example, query expansion for "What is the weather on Tuesday?" will likely also expand to "monday", "wednesday", etc., which may not be desired.
Why Use Sparse Embedding Models?
In short, neural sparse embedding models fall in a valuable niche between traditional lexical methods like BM25 and dense embedding models like Sentence Transformers. They have the following advantages:
- Hybrid potential: Very effectively combined with dense models, which may struggle with searches where lexical matches are important
- Interpretability: You can see exactly which tokens contribute to a match
- Performance: Competitive or better than dense models in many retrieval tasks
Throughout this blogpost, I'll use "sparse embedding model" and "sparse encoder model" interchangeably.
Why Finetune?
The majority of (neural) sparse embedding models employ the aforementioned query/document expansion so that you can match texts with nearly identical meaning, even if they don't share any words. In short, the model has to recognize synonyms so those tokens can be placed in the final embedding.
Most out-of-the-box sparse embedding models will easily recognize that "supermarket", "food", and "market" are useful expansions of a text containing "grocery", but for example:
- "The patient complained of severe cephalalgia."
expands to:
'##lal', 'severe', '##pha', 'ce', '##gia', 'patient', 'complaint', 'patients', 'complained', 'warning', 'suffered', 'had', 'disease', 'complain', 'diagnosis', 'syndrome', 'mild', 'pain', 'hospital', 'injury'
whereas we wish for it to expand to "headache", the common word for "cephalalgia". The same problem appears in many domains, e.g. a model not recognizing that "Java" is a programming language, that "Audi" makes cars, or that "NVIDIA" is a company that makes graphics cards.
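You can check how a given checkpoint expands your own domain-specific text with the same encode/decode pattern shown earlier; here's a minimal sketch (the checkpoint is just an example, substitute your own):

```python
from sentence_transformers import SparseEncoder

# Any pretrained SPLADE-style model works here; swap in the checkpoint you care about
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

sentences = ["The patient complained of severe cephalalgia."]
embeddings = model.encode(sentences)

# Inspect the strongest activations to see which (expanded) tokens the model relies on
for decoded, sentence in zip(model.decode(embeddings, top_k=20), sentences):
    print(f"Sentence: {sentence}")
    print(f"Decoded: {decoded}")
```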
Through finetuning, the model can learn to focus exclusively on the domain and/or language that matters to you.
Training Components
Training Sentence Transformer models involves the following components:
- Model: The model to train or finetune, which can be a pre-trained Sparse Encoder model or a base model.
- Dataset: The data used for training and evaluation.
- Loss Function: A function that quantifies the model's performance and guides the optimization process.
- Training Arguments (optional): Parameters that influence training performance and tracking/debugging.
- Evaluator (optional): A tool for evaluating the model before, during, or after training.
- Trainer: Brings together the model, dataset, loss function, and other components for training.
Now, let's dive into each of these components in more detail.
Model
Sparse Encoder models consist of a sequence of Modules, Sparse Encoder specific Modules or Custom Modules, allowing for a lot of flexibility. If you want to further finetune a Sparse Encoder model (e.g. it has a modules.json file), then you don't have to worry about which modules are used:
from sentence_transformers import SparseEncoder
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
But if instead you want to train from another checkpoint, or from scratch, then these are the most common architectures you can use:
Splade
Splade models use the MLMTransformer followed by a SpladePooling module. The former loads a pretrained Masked Language Modeling transformer model (e.g. BERT, RoBERTa, DistilBERT, ModernBERT, etc.) and the latter pools the output of the MLMHead to produce a single sparse embedding of the size of the vocabulary.
from sentence_transformers import models, SparseEncoder
from sentence_transformers.sparse_encoder.models import MLMTransformer, SpladePooling
# Initialize MLM Transformer (use a fill-mask model)
mlm_transformer = MLMTransformer("google-bert/bert-base-uncased")
# Initialize SpladePooling module
splade_pooling = SpladePooling(pooling_strategy="max")
# Create the Splade model
model = SparseEncoder(modules=[mlm_transformer, splade_pooling])
This architecture is the default if you provide a fill-mask model architecture to SparseEncoder, so it's easier to use the shortcut:
from sentence_transformers import SparseEncoder
model = SparseEncoder("google-bert/bert-base-uncased")
# SparseEncoder(
# (0): MLMTransformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertForMaskedLM'})
# (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': None})
# )
Inference-free Splade
Inference-free Splade uses a Router module with different modules for queries and documents. Usually for this type of architecture, the documents part is a traditional Splade architecture (an MLMTransformer followed by a SpladePooling module) and the query part is a SparseStaticEmbedding module, which just returns a pre-computed score for every token in the query.
from sentence_transformers import SparseEncoder
from sentence_transformers.models import Router
from sentence_transformers.sparse_encoder.models import SparseStaticEmbedding, MLMTransformer, SpladePooling
# Initialize MLM Transformer for document encoding
doc_encoder = MLMTransformer("google-bert/bert-base-uncased")
# Create a router model with different paths for queries and documents
router = Router.for_query_document(
    query_modules=[SparseStaticEmbedding(tokenizer=doc_encoder.tokenizer, frozen=False)],
    # Document path: full MLM transformer + pooling
    document_modules=[doc_encoder, SpladePooling("max")],
)
# Create the inference-free model
model = SparseEncoder(modules=[router], similarity_fn_name="dot")
# SparseEncoder(
# (0): Router(
# (query_0_SparseStaticEmbedding): SparseStaticEmbedding ({'frozen': False}, dim:30522, tokenizer: BertTokenizerFast)
# (document_0_MLMTransformer): MLMTransformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertForMaskedLM'})
# (document_1_SpladePooling): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': None})
# )
# )
This architecture allows for fast query-time processing using the lightweight SparseStaticEmbedding approach, which can be trained and seen as a set of linear weights, while documents are processed with the full MLM transformer and SpladePooling.
Inference-free Splade is particularly useful for search applications where query latency is critical, as it shifts the computational complexity to the document indexing phase which can be done offline.
When training models with the Router module, you must use the router_mapping argument in the SparseEncoderTrainingArguments to map the training dataset columns to the correct route ("query" or "document"). For example, if your dataset(s) have ["question", "answer"] columns, then you can use the following mapping:
args = SparseEncoderTrainingArguments(
    ...,
    router_mapping={
        "question": "query",
        "answer": "document",
    }
)
Additionally, it is recommended to use a much higher learning rate for the SparseStaticEmbedding module than for the rest of the model. For this, you should use the learning_rate_mapping argument in the SparseEncoderTrainingArguments to map parameter patterns to their learning rates. For example, if you want to use a learning rate of 1e-3 for the SparseStaticEmbedding module and 2e-5 for the rest of the model, you can do this:
args = SparseEncoderTrainingArguments(
    ...,
    learning_rate=2e-5,
    learning_rate_mapping={
        r"SparseStaticEmbedding\.*": 1e-3,
    }
)
Contrastive Sparse Representation (CSR)
Contrastive Sparse Representation (CSR) models, introduced in Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation, apply a SparseAutoEncoder module on top of a dense Sentence Transformer model, which usually consists of a Transformer followed by a Pooling module. You can initialize one from scratch like so:
from sentence_transformers import models, SparseEncoder
from sentence_transformers.sparse_encoder.models import SparseAutoEncoder
# Initialize transformer (can be any dense encoder model)
transformer = models.Transformer("google-bert/bert-base-uncased")
# Initialize pooling
pooling = models.Pooling(transformer.get_word_embedding_dimension(), pooling_mode="mean")
# Initialize SparseAutoEncoder module
sparse_auto_encoder = SparseAutoEncoder(
    input_dim=transformer.get_word_embedding_dimension(),
    hidden_dim=4 * transformer.get_word_embedding_dimension(),
    k=256,  # Number of top values to keep
    k_aux=512,  # Number of top values for auxiliary loss
)
# Create the CSR model
model = SparseEncoder(modules=[transformer, pooling, sparse_auto_encoder])
Or if your base model is 1) a dense Sentence Transformer model or 2) a non-MLM Transformer model (those are loaded as Splade models by default), then this shortcut will automatically initialize the CSR model for you:
from sentence_transformers import SparseEncoder
model = SparseEncoder("mixedbread-ai/mxbai-embed-large-v1")
# SparseEncoder(
# (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertModel'})
# (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
# (2): SparseAutoEncoder({'input_dim': 1024, 'hidden_dim': 4096, 'k': 256, 'k_aux': 512, 'normalize': False, 'dead_threshold': 30})
# )
Unlike (Inference-free) Splade models, sparse embeddings by CSR models don't have the same size as the vocabulary of the base model. This means you can't directly interpret which words are activated in your embedding like you can with Splade models, where each dimension corresponds to a specific token in the vocabulary.
Beyond that, CSR models are most effective on dense encoder models that use high-dimensional representations (e.g. 1024-4096 dimensions).
Architecture Picker Guide
If you're unsure which architecture to use, here's a quick guide:
- Do you want to sparsify an existing Dense Embedding model? If yes, use CSR.
- Do you want your query inference to be instantaneous at the cost of slight performance? If yes, use Inference-free SPLADE.
- Otherwise, use SPLADE.
Dataset
The SparseEncoderTrainer uses datasets.Dataset or datasets.DatasetDict instances for training and evaluation. You can load data from the Hugging Face Datasets Hub or use local data in various formats such as CSV, JSON, Parquet, Arrow, or SQL.
Note: Lots of public datasets that work out of the box with Sentence Transformers have been tagged with sentence-transformers on the Hugging Face Hub, so you can easily find them on https://huggingface.co/datasets?other=sentence-transformers. Consider browsing through these to find ready-to-go datasets that might be useful for your tasks, domains, or languages.
Data on the Hugging Face Hub
You can use the load_dataset function to load data from datasets in the Hugging Face Hub:
from datasets import load_dataset
train_dataset = load_dataset("sentence-transformers/natural-questions", split="train")
print(train_dataset)
"""
Dataset({
features: ['query', 'answer'],
num_rows: 100231
})
"""
Some datasets, like nthakur/swim-ir-monolingual, have multiple subsets with different data formats. You need to specify the subset name along with the dataset name, e.g. dataset = load_dataset("nthakur/swim-ir-monolingual", "de", split="train").
Local Data (CSV, JSON, Parquet, Arrow, SQL)
You can also use load_dataset for loading local data in certain file formats:
from datasets import load_dataset
dataset = load_dataset("csv", data_files="my_file.csv")
# or
dataset = load_dataset("json", data_files="my_file.json")
Local Data that requires pre-processing
You can use datasets.Dataset.from_dict if your local data requires pre-processing. This allows you to initialize your dataset with a dictionary of lists:
from datasets import Dataset
queries = []
documents = []
# Open a file, perform preprocessing, filtering, cleaning, etc.
# and append to the lists
dataset = Dataset.from_dict({
    "query": queries,
    "document": documents,
})
Each key in the dictionary becomes a column in the resulting dataset.
Dataset Format
It's crucial to ensure that your dataset format matches your chosen loss function. This involves checking two things:
- If your loss function requires a Label (as indicated in the Loss Overview table), your dataset must have a column named "label" or "score".
- All columns other than "label" or "score" are considered Inputs (as indicated in the Loss Overview table). The number of these columns must match the number of valid inputs for your chosen loss function. The names of the columns don't matter, only their order matters.
For example, if your loss function accepts (anchor, positive, negative) triplets, then your first, second, and third dataset columns correspond with anchor, positive, and negative, respectively. This means that your first and second column must contain texts that should embed closely, and that your first and third column must contain texts that should embed far apart. That is why depending on your loss function, your dataset column order matters.
Consider a dataset with columns ["text1", "text2", "label"], where the "label" column contains floating point similarity scores. This dataset can be used with SparseCoSENTLoss, SparseAnglELoss, and SparseCosineSimilarityLoss because:
- The dataset has a "label" column, which is required by these loss functions.
- The dataset has 2 non-label columns, matching the number of inputs required by these loss functions.
If the columns in your dataset are not ordered correctly, use Dataset.select_columns to reorder them. Additionally, remove any extraneous columns (e.g., sample_id, metadata, source, type) using Dataset.remove_columns, as they will be treated as inputs otherwise.
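As a small illustration (the column names and values here are hypothetical):

```python
from datasets import Dataset

# Hypothetical dataset with an extra column and the wrong column order
dataset = Dataset.from_dict({
    "sample_id": [0, 1],
    "positive": ["doc A", "doc B"],
    "anchor": ["query A", "query B"],
})

# Keep only the input columns, in the order the loss expects: (anchor, positive)
dataset = dataset.select_columns(["anchor", "positive"])
print(dataset.column_names)  # ['anchor', 'positive']
```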
Loss Function
Loss functions measure how well a model performs on a given batch of data and guide the optimization process. The choice of loss function depends on your available data and target task. Refer to the Loss Overview for a comprehensive list of options.
To train a SparseEncoder, you either need a SpladeLoss or CSRLoss, depending on the architecture. These are wrapper losses that add sparsity regularization on top of a main loss function, which must be provided as a parameter. The only loss that can be used independently is SparseMSELoss, as it performs embedding-level distillation, ensuring sparsity by directly copying the teacher's sparse embedding.
Most loss functions can be initialized with just the SparseEncoder that you're training, alongside some optional parameters, e.g.:
from datasets import load_dataset
from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.losses import SpladeLoss, SparseMultipleNegativesRankingLoss
# Load a model to train/finetune
model = SparseEncoder("distilbert/distilbert-base-uncased")
# Initialize the SpladeLoss with a SparseMultipleNegativesRankingLoss
# This loss requires pairs of related texts or triplets
loss = SpladeLoss(
    model=model,
    loss=SparseMultipleNegativesRankingLoss(model=model),
    query_regularizer_weight=5e-5,  # Weight for query loss
    document_regularizer_weight=3e-5,
)
# Load an example training dataset that works with our loss function:
train_dataset = load_dataset("sentence-transformers/natural-questions", split="train")
print(train_dataset)
"""
Dataset({
features: ['query', 'answer'],
num_rows: 100231
})
"""
Training Arguments
The SparseEncoderTrainingArguments class allows you to specify parameters that influence training performance and tracking/debugging. While optional, experimenting with these arguments can help improve training efficiency and provide insights into the training process.
In the Sentence Transformers documentation, I've outlined some of the most useful training arguments. I would recommend reading it in Training Overview > Training Arguments.
Here's an example of how to initialize SparseEncoderTrainingArguments:
from sentence_transformers import SparseEncoderTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SparseEncoderTrainingArguments(
    # Required parameter:
    output_dir="models/splade-distilbert-base-uncased-nq",
    # Optional training parameters:
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,  # Set to False if your GPU can't handle FP16
    bf16=False,  # Set to True if your GPU supports BF16
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # Losses using "in-batch negatives" benefit from no duplicates
    # Optional tracking/debugging parameters:
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2,
    logging_steps=100,
    run_name="splade-distilbert-base-uncased-nq",  # Used in W&B if `wandb` is installed
)
Note that eval_strategy was introduced in transformers version 4.41.0. Prior versions should use evaluation_strategy instead.
Evaluator
You can provide the SparseEncoderTrainer with an eval_dataset to get the evaluation loss during training, but it may be useful to get more concrete metrics during training, too. For this, you can use evaluators to assess the model's performance with useful metrics before, during, or after training. You can use both an eval_dataset and an evaluator, one or the other, or neither. They evaluate based on the eval_strategy and eval_steps Training Arguments.
Here are the implemented Evaluators that come with Sentence Transformers for Sparse Encoder models:
| Evaluator | Required Data |
|---|---|
| SparseBinaryClassificationEvaluator | Pairs with class labels. |
| SparseEmbeddingSimilarityEvaluator | Pairs with similarity scores. |
| SparseInformationRetrievalEvaluator | Queries (qid => question), Corpus (cid => document), and relevant documents (qid => set[cid]). |
| SparseNanoBEIREvaluator | No data required. |
| SparseMSEEvaluator | Source sentences to embed with a teacher model and target sentences to embed with the student model. Can be the same texts. |
| SparseRerankingEvaluator | List of {'query': '...', 'positive': [...], 'negative': [...]} dictionaries. |
| SparseTranslationEvaluator | Pairs of sentences in two separate languages. |
| SparseTripletEvaluator | (anchor, positive, negative) pairs. |
Additionally, SequentialEvaluator should be used to combine multiple evaluators into one Evaluator that can be passed to the SparseEncoderTrainer.
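For example, here is a small sketch that chains a NanoBEIR retrieval evaluator with an STS evaluator (both also appear individually below):

```python
from datasets import load_dataset
from sentence_transformers.evaluation import SequentialEvaluator
from sentence_transformers.sparse_encoder.evaluation import (
    SparseEmbeddingSimilarityEvaluator,
    SparseNanoBEIREvaluator,
)

# A retrieval evaluator that needs no data, plus an STS evaluator built from a public dataset
retrieval_evaluator = SparseNanoBEIREvaluator(dataset_names=["msmarco"])
stsb = load_dataset("sentence-transformers/stsb", split="validation")
sts_evaluator = SparseEmbeddingSimilarityEvaluator(
    sentences1=stsb["sentence1"],
    sentences2=stsb["sentence2"],
    scores=stsb["score"],
    name="sts-dev",
)

# Chain them into one evaluator that can be passed to the SparseEncoderTrainer
dev_evaluator = SequentialEvaluator([retrieval_evaluator, sts_evaluator])
# results = dev_evaluator(model)
```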
Sometimes you don't have the required evaluation data to prepare one of these evaluators on your own, but you still want to track how well the model performs on some common benchmarks. In that case, you can use these evaluators with data from Hugging Face.
SparseNanoBEIREvaluator
from sentence_transformers.sparse_encoder.evaluation import SparseNanoBEIREvaluator
# Initialize the evaluator. Unlike most other evaluators, this one loads the relevant datasets
# directly from Hugging Face, so there's no mandatory arguments
dev_evaluator = SparseNanoBEIREvaluator()
# You can run evaluation like so:
# results = dev_evaluator(model)
SparseEmbeddingSimilarityEvaluator with STSb
Documentation:
- sentence-transformers/stsb
- sentence_transformers.sparse_encoder.evaluation.SparseEmbeddingSimilarityEvaluator
- sentence_transformers.SimilarityFunction
from datasets import load_dataset
from sentence_transformers.evaluation import SimilarityFunction
from sentence_transformers.sparse_encoder.evaluation import SparseEmbeddingSimilarityEvaluator
# Load the STSB dataset (https://huggingface.co/datasets/sentence-transformers/stsb)
eval_dataset = load_dataset("sentence-transformers/stsb", split="validation")
# Initialize the evaluator
dev_evaluator = SparseEmbeddingSimilarityEvaluator(
    sentences1=eval_dataset["sentence1"],
    sentences2=eval_dataset["sentence2"],
    scores=eval_dataset["score"],
    main_similarity=SimilarityFunction.COSINE,
    name="sts-dev",
)
# You can run evaluation like so:
# results = dev_evaluator(model)
SparseTripletEvaluator with AllNLI
Documentation:
- sentence-transformers/all-nli
- sentence_transformers.sparse_encoder.evaluation.SparseTripletEvaluator
- sentence_transformers.SimilarityFunction
from datasets import load_dataset
from sentence_transformers.evaluation import SimilarityFunction
from sentence_transformers.sparse_encoder.evaluation import SparseTripletEvaluator
# Load triplets from the AllNLI dataset (https://huggingface.co/datasets/sentence-transformers/all-nli)
max_samples = 1000
eval_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split=f"dev[:{max_samples}]")
# Initialize the evaluator
dev_evaluator = SparseTripletEvaluator(
    anchors=eval_dataset["anchor"],
    positives=eval_dataset["positive"],
    negatives=eval_dataset["negative"],
    main_distance_function=SimilarityFunction.DOT,
    name="all-nli-dev",
)
# You can run evaluation like so:
# results = dev_evaluator(model)
When evaluating frequently during training with a small eval_steps, consider using a tiny eval_dataset to minimize evaluation overhead. If you're concerned about the evaluation set size, a 90-1-9 train-eval-test split can provide a balance, reserving a reasonably sized test set for final evaluations. After training, you can assess your model's performance using trainer.evaluate(test_dataset) for test loss or initialize a testing evaluator with test_evaluator(model) for detailed test metrics.
If you evaluate after training, but before saving the model, your automatically generated model card will still include the test results.
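To make the 90-1-9 split from the tip above concrete, here is a small sketch using datasets (the dataset and seed are just examples):

```python
from datasets import load_dataset

dataset = load_dataset("sentence-transformers/natural-questions", split="train")

# First split off 10% (eval + test), then split that 10% into 1% eval and 9% test
split = dataset.train_test_split(test_size=0.10, seed=12)
train_dataset = split["train"]
eval_test = split["test"].train_test_split(test_size=0.90, seed=12)
eval_dataset = eval_test["train"]  # ~1% of the original data
test_dataset = eval_test["test"]   # ~9% of the original data
```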
When using Distributed Training, the evaluator only runs on the first device, unlike the training and evaluation datasets, which are shared across all devices.
Trainer
The SparseEncoderTrainer is where all previous components come together. We only have to specify the trainer with the model, training arguments (optional), training dataset, evaluation dataset (optional), loss function, and evaluator (optional), and we can start training. Let's have a look at a script where all of these components come together:
import logging
from datasets import load_dataset
from sentence_transformers import (
    SparseEncoder,
    SparseEncoderModelCardData,
    SparseEncoderTrainer,
    SparseEncoderTrainingArguments,
)
from sentence_transformers.models import Router
from sentence_transformers.sparse_encoder.evaluation import SparseNanoBEIREvaluator
from sentence_transformers.sparse_encoder.losses import SparseMultipleNegativesRankingLoss, SpladeLoss
from sentence_transformers.sparse_encoder.models import SparseStaticEmbedding, MLMTransformer, SpladePooling
from sentence_transformers.training_args import BatchSamplers
logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO)
# 1. Load a model to finetune with 2. (Optional) model card data
mlm_transformer = MLMTransformer("distilbert/distilbert-base-uncased", tokenizer_args={"model_max_length": 512})
splade_pooling = SpladePooling(
    pooling_strategy="max", word_embedding_dimension=mlm_transformer.get_sentence_embedding_dimension()
)
router = Router.for_query_document(
    query_modules=[SparseStaticEmbedding(tokenizer=mlm_transformer.tokenizer, frozen=False)],
    document_modules=[mlm_transformer, splade_pooling],
)
model = SparseEncoder(
    modules=[router],
    model_card_data=SparseEncoderModelCardData(
        language="en",
        license="apache-2.0",
        model_name="Inference-free SPLADE distilbert-base-uncased trained on Natural-Questions tuples",
    ),
)
# 3. Load a dataset to finetune on
full_dataset = load_dataset("sentence-transformers/natural-questions", split="train").select(range(100_000))
dataset_dict = full_dataset.train_test_split(test_size=1_000, seed=12)
train_dataset = dataset_dict["train"]
eval_dataset = dataset_dict["test"]
print(train_dataset)
print(train_dataset[0])
# 4. Define a loss function
loss = SpladeLoss(
    model=model,
    loss=SparseMultipleNegativesRankingLoss(model=model),
    query_regularizer_weight=0,
    document_regularizer_weight=3e-3,
)
# 5. (Optional) Specify training arguments
run_name = "inference-free-splade-distilbert-base-uncased-nq"
args = SparseEncoderTrainingArguments(
    # Required parameter:
    output_dir=f"models/{run_name}",
    # Optional training parameters:
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    learning_rate_mapping={r"SparseStaticEmbedding\.weight": 1e-3},  # Set a higher learning rate for the SparseStaticEmbedding module
    warmup_ratio=0.1,
    fp16=True,  # Set to False if you get an error that your GPU can't run on FP16
    bf16=False,  # Set to True if you have a GPU that supports BF16
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
    router_mapping={"query": "query", "answer": "document"},  # Map the column names to the routes
    # Optional tracking/debugging parameters:
    eval_strategy="steps",
    eval_steps=1000,
    save_strategy="steps",
    save_steps=1000,
    save_total_limit=2,
    logging_steps=200,
    run_name=run_name,  # Will be used in W&B if `wandb` is installed
)
# 6. (Optional) Create an evaluator & evaluate the base model
dev_evaluator = SparseNanoBEIREvaluator(dataset_names=["msmarco", "nfcorpus", "nq"], batch_size=16)
# 7. Create a trainer & train
trainer = SparseEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    evaluator=dev_evaluator,
)
trainer.train()
# 8. Evaluate the model performance again after training
dev_evaluator(model)
# 9. Save the trained model
model.save_pretrained(f"models/{run_name}/final")
# 10. (Optional) Push it to the Hugging Face Hub
model.push_to_hub(run_name)
In this example I'm finetuning from distilbert/distilbert-base-uncased, a base model that is not yet a Sparse Encoder model. This requires more training data than finetuning an existing Sparse Encoder model, like naver/splade-cocondenser-ensembledistil.
After running this script, the sparse-encoder/example-inference-free-splade-distilbert-base-uncased-nq model was uploaded for me. The model scores 0.5241 NDCG@10 on NanoMSMARCO, 0.3299 NDCG@10 on NanoNFCorpus and 0.5357 NDCG@10 on NanoNQ, which is a good result for an inference-free distilbert-based model trained on just 100k pairs from the Natural Questions dataset.
The model uses an average of 184 active dimensions in the sparse embeddings for the documents, compared to 7.7 active dimensions for the queries (i.e. the average number of tokens in the query). This corresponds to a sparsity of 99.39% and 99.97%, respectively.
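If you want to compute these statistics for your own texts, a rough sketch with plain PyTorch could look like this (the example documents are made up; convert_to_sparse_tensor=True mirrors the Qdrant example later in this post):

```python
from sentence_transformers import SparseEncoder

# The model trained above; any SparseEncoder works here
model = SparseEncoder("sparse-encoder/example-inference-free-splade-distilbert-base-uncased-nq")

docs = ["Cephalalgia is the medical term for a headache.", "He drove to the stadium."]
embeddings = model.encode_document(docs, convert_to_sparse_tensor=True).to_dense()

# Active dimensions = non-zero entries per embedding; sparsity = share of zero entries
active_dims = (embeddings != 0).sum(dim=1).float()
print(f"Average active dimensions: {active_dims.mean().item():.1f}")
print(f"Average sparsity: {(1 - active_dims / embeddings.shape[1]).mean().item():.2%}")
```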
All of this information is stored in the automatically generated model card, including the base model, language, license, evaluation results, training & evaluation dataset info, hyperparameters, training logs, and more. Without any effort, your uploaded models should contain all the information that your potential users would need to determine whether your model is suitable for them.
Callbacks
The Sentence Transformers trainer supports various transformers.TrainerCallback subclasses, including:
- WandbCallback for logging training metrics to W&B if wandb is installed
- TensorBoardCallback for logging training metrics to TensorBoard if tensorboard is accessible
- CodeCarbonCallback for tracking carbon emissions during training if codecarbon is installed
These are automatically used without you having to specify anything, as long as the required dependency is installed.
Refer to the Transformers Callbacks documentation for more information on these callbacks and how to create your own.
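As a small sketch of a custom callback (what it logs is purely illustrative; custom callbacks are passed to the trainer the same way as with the underlying transformers Trainer):

```python
from transformers import TrainerCallback

class PrintLogsCallback(TrainerCallback):
    """A minimal custom callback that prints every set of logged metrics."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs:
            print(f"Step {state.global_step}: {logs}")

# e.g. SparseEncoderTrainer(..., callbacks=[PrintLogsCallback()])
```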
Multi-Dataset Training
Top-performing models are often trained using multiple datasets simultaneously. The SparseEncoderTrainer simplifies this process by allowing you to train with multiple datasets without converting them to the same format. You can even apply different loss functions to each dataset. Here are the steps for multi-dataset training (a short sketch follows after the sampler options below):
- Use a dictionary of datasets.Dataset instances (or a datasets.DatasetDict) as the train_dataset and eval_dataset.
- (Optional) Use a dictionary of loss functions mapping dataset names to losses if you want to use different losses for different datasets.
Each training/evaluation batch will contain samples from only one of the datasets. The order in which batches are sampled from the multiple datasets is determined by the MultiDatasetBatchSamplers enum, which can be passed to the SparseEncoderTrainingArguments via multi_dataset_batch_sampler. The valid options are:
- MultiDatasetBatchSamplers.ROUND_ROBIN: Samples from each dataset in a round-robin fashion until one is exhausted. This strategy may not use all samples from each dataset, but it ensures equal sampling from each dataset.
- MultiDatasetBatchSamplers.PROPORTIONAL (default): Samples from each dataset proportionally to its size. This strategy ensures that all samples from each dataset are used, and larger datasets are sampled from more frequently.
Evaluation
Let's evaluate our newly trained inference-free SPLADE model using the NanoMSMARCO dataset, and see how it compares to dense retrieval approaches. We'll also explore hybrid retrieval methods that combine sparse and dense vectors, as well as reranking to further improve search quality.
After running a slightly modified version of our hybrid_search.py script, we get the following results for the NanoMSMARCO dataset, using these models:
- Sparse: sparse-encoder/example-inference-free-splade-distilbert-base-uncased-nq (the model we just trained)
- Dense: sentence-transformers/all-MiniLM-L6-v2
- Reranker: cross-encoder/ms-marco-MiniLM-L6-v2
| Sparse | Dense | Reranker | NDCG@10 | MRR@10 | MAP |
|---|---|---|---|---|---|
| x |   |   | 52.41 | 43.06 | 44.20 |
|   | x |   | 55.40 | 47.96 | 49.08 |
| x | x |   | 62.22 | 53.02 | 53.44 |
| x |   | x | 66.31 | 59.45 | 60.36 |
|   | x | x | 66.28 | 59.43 | 60.34 |
| x | x | x | 66.28 | 59.43 | 60.34 |
The Sparse and Dense rankings can be combined using Reciprocal Rank Fusion (RRF), which is a simple way to combine the results of multiple rankings. If a Reranker is applied, it will rerank the results of the prior retrieval step.
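As a sketch, RRF only needs the rank positions from each retriever; each document receives 1 / (k + rank) from every ranking it appears in (k = 60 is the commonly used constant):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Combine multiple ranked lists of document ids into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# e.g. fuse a sparse and a dense ranking (document ids are hypothetical)
fused = reciprocal_rank_fusion([["d3", "d1", "d2"], ["d1", "d3", "d4"]])
print(fused)
```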
The results indicate that for this dataset, combining Dense and Sparse rankings is very performant, resulting in 12.3% and 18.7% increases over the Dense and Sparse baselines, respectively. In short, combining Sparse and Dense retrieval methods is a very effective way to improve search performance.
Furthermore, applying a reranker on any of the rankings improved the performance to approximately 66.3 NDCG@10, showing that either Sparse, Dense, or Hybrid (Dense + Sparse) found the relevant documents in their top 100, which the reranker then ranked to the top 10. So, replacing a Dense -> Reranker pipeline with a Sparse -> Reranker pipeline might improve both latency and costs:
- Sparse embeddings can be cheaper to store, e.g. our model only uses ~180 active dimensions for MS MARCO documents instead of the common 1024 dimensions for dense models.
- Some Sparse Encoders allow for inference-free query processing, allowing for a near-instant first-stage retrieval, akin to lexical solutions like BM25.
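As a rough sketch of such a Sparse -> Reranker pipeline (the corpus and query are toy examples; the model names are the ones used in the comparison above):

```python
from sentence_transformers import CrossEncoder, SparseEncoder

sparse_model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")

corpus = [
    "Cephalalgia is the medical term for a headache.",
    "He drove to the stadium.",
    "The weather is lovely today.",
]
query = "what does cephalalgia mean"

# First stage: sparse retrieval of the top candidates
query_embedding = sparse_model.encode_query([query])
corpus_embeddings = sparse_model.encode_document(corpus)
scores = sparse_model.similarity(query_embedding, corpus_embeddings)[0]
top_indices = scores.argsort(descending=True)[:2].tolist()
candidates = [corpus[i] for i in top_indices]

# Second stage: rerank only the retrieved candidates with the CrossEncoder
for entry in reranker.rank(query, candidates):
    print(f"{entry['score']:.2f}  {candidates[entry['corpus_id']]}")
```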
Training Tips
Sparse Encoder models have a few quirks that you should be aware of when training them:
- Sparse Encoder models should not be evaluated solely using the evaluation scores, but also with the sparsity of the embeddings. After all, a low sparsity means that the model embeddings are expensive to store and slow to retrieve.
- The stronger Sparse Encoder models are trained almost exclusively with distillation from a stronger teacher model (e.g. a CrossEncoder model), instead of training directly from text pairs or triplets. See for example the SPLADE-v3 paper, which uses SparseDistillKLDivLoss and SparseMarginMSELoss for distillation. We don't cover this in detail in this blog as it requires more data preparation, but a distillation setup should be seriously considered (a rough sketch follows below).
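For reference, here is a very rough sketch of what setting up a margin-based distillation loss could look like. It assumes a dataset with (query, positive, negative) columns plus a "label" column holding the teacher's score margin (teacher_score(query, positive) - teacher_score(query, negative)), i.e. the format MarginMSE-style losses expect; the actual data preparation is not shown:

```python
from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.losses import SpladeLoss, SparseMarginMSELoss

model = SparseEncoder("distilbert/distilbert-base-uncased")

# Wrap the distillation loss in SpladeLoss, just like the ranking loss earlier
loss = SpladeLoss(
    model=model,
    loss=SparseMarginMSELoss(model=model),
    query_regularizer_weight=5e-5,
    document_regularizer_weight=3e-5,
)
```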
Vector Database Integration
After training sparse embedding models, the next crucial step is deploying them effectively in production environments. Vector databases provide the essential infrastructure for storing, indexing, and retrieving sparse embeddings at scale. Popular options include Qdrant, OpenSearch, Elasticsearch, and Seismic, among others.
For comprehensive examples covering vector databases mentioned above, refer to the semantic search with vector database documentation or below for the Qdrant example.
Qdrant Integration Example
Qdrant offers excellent support for sparse vectors with efficient storage and fast retrieval capabilities. Below is a comprehensive implementation example:
Prerequisites:
- Qdrant running locally (or accessible), see the Qdrant Quickstart for more details.
- Python Qdrant Client installed: pip install qdrant-client
This example demonstrates how to set up Qdrant for sparse vector search by showing how to efficiently encode and index documents with sparse encoders, formulating search queries with sparse vectors, and providing an interactive query interface. See below:
import time
from datasets import load_dataset
from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.search_engines import semantic_search_qdrant
# 1. Load the natural-questions dataset with 100K answers
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
num_docs = 10_000
corpus = dataset["answer"][:num_docs]
# 2. Come up with some queries
queries = dataset["query"][:2]
# 3. Load the model
sparse_model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
# 4. Encode the corpus
corpus_embeddings = sparse_model.encode_document(
    corpus, convert_to_sparse_tensor=True, batch_size=16, show_progress_bar=True
)
# Initially, we don't have a qdrant index yet
corpus_index = None
while True:
    # 5. Encode the queries using the full precision
    start_time = time.time()
    query_embeddings = sparse_model.encode_query(queries, convert_to_sparse_tensor=True)
    print(f"Encoding time: {time.time() - start_time:.6f} seconds")

    # 6. Perform semantic search using qdrant
    results, search_time, corpus_index = semantic_search_qdrant(
        query_embeddings,
        corpus_index=corpus_index,
        corpus_embeddings=corpus_embeddings if corpus_index is None else None,
        top_k=5,
        output_index=True,
    )

    # 7. Output the results
    print(f"Search time: {search_time:.6f} seconds")
    for query, result in zip(queries, results):
        print(f"Query: {query}")
        for entry in result:
            print(f"(Score: {entry['score']:.4f}) {corpus[entry['corpus_id']]}, corpus_id: {entry['corpus_id']}")
        print("")

    # 8. Prompt for more queries
    queries = [input("Please enter a question: ")]
Additional Resources
Training Examples
The following pages contain training examples with explanations as well as links to code. We recommend that you browse through these to familiarize yourself with the training loop:
- Model Distillation - Examples to make models smaller, faster and lighter.
- MS MARCO - Example training scripts for training on the MS MARCO information retrieval dataset.
- Retrievers - Example training scripts for training on generic information retrieval datasets.
- Natural Language Inference - Natural Language Inference (NLI) data can be quite helpful to pre-train and fine-tune models to create meaningful sparse embeddings.
- Quora Duplicate Questions - Quora Duplicate Questions is a large corpus with duplicate questions from the Quora community. The folder contains examples of how to train models for duplicate question mining and for semantic search.
- STS - The most basic method to train models is using Semantic Textual Similarity (STS) data. Here, we use sentence pairs and a score indicating the semantic similarity.
Documentation
Additionally, the following pages may be useful to learn more about Sentence Transformers:
- Installation
- Quickstart
- Usage
- Pretrained Models
- Training Overview (This blogpost is a distillation of the Training Overview documentation)
- Dataset Overview
- Loss Overview
- API Reference
And lastly, here are some advanced pages that might interest you: