Choosing an Embedding Model
Most people use OpenAI's Ada 002 for text embeddings. That is because OpenAI built a good, easy-to-use embedding model long before anyone else did. However, that was a long time ago, and judging by the MTEB leaderboard, Ada is far from the best option for embedding text.
So what is the best embedding model? It depends on your data, on whether you need to optimize for accuracy or latency, and so on, as we will see in this article.
You might also be wondering what exactly an embedding model is. If you are here, you have most likely already realized that it is a key component of Retrieval Augmented Generation (RAG).
An embedding model identifies relevant information given a user query. It does this by capturing the "human meaning" behind the query and matching it against the "meaning" of a broader set of documents, web pages, videos, or other sources of information.
Embedding models translate human-readable text into machine-readable, searchable vectors.
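To make that concrete, here is a toy illustration (not part of the original walkthrough) of how "searchable" vectors are compared: vectors pointing in similar directions get a high cosine similarity score. The three-dimensional vectors below are made up for illustration; real embedding models output hundreds or thousands of dimensions.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cosine similarity = dot product of the two L2-normalized vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy 3-dimensional "embeddings" (made-up numbers)
query_vec = np.array([0.9, 0.1, 0.0])
doc_a = np.array([0.8, 0.2, 0.1])  # similar meaning -> similar direction
doc_b = np.array([0.0, 0.1, 0.9])  # different meaning -> different direction

print(cosine_similarity(query_vec, doc_a))  # ~0.98
print(cosine_similarity(query_vec, doc_b))  # ~0.01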
Today, many proprietary embedding models far outperform Ada, and there are even tiny open-source models, such as E5, that deliver comparable performance.
In this article, we will explore two of these models, the open-source E5 and Cohere's embed v3 model, and see how they compare to the incumbent Ada 002.
Video walkthrough for this chapter
The MTEB Leaderboard
The most popular place to find the latest performance benchmarks is the MTEB leaderboard hosted by Hugging Face. MTEB is a good starting point, but it warrants some caution and skepticism: the results are self-reported, and unfortunately many of them turn out to be inaccurate once you try the models on real data.
Many of these models (typically the open-source ones) appear to have been fine-tuned on the MTEB benchmarks themselves, producing inflated performance numbers. Nonetheless, the reported performance of some open-source models, such as E5, is accurate.
There are many columns in MTEB that we can ignore most of the time. For those of us using these models in the real world, the columns that matter most are:
Score: the scores we should pay attention to are the "Average" score and the "Retrieval Average" score. The two are highly correlated, so focusing on either one is fine.
Sequence length: this tells us how many tokens a model can consume and compress into a single embedding. Generally, we don't recommend stuffing more than a paragraph of text into a single embedding, so models supporting up to 512 tokens are usually plenty (a quick token-count check is sketched just after this list).
Model size: the size of a model indicates how easy it will be to run. All of the models near the top of MTEB are reasonably sized. One of the largest is instructor-xl (requiring 4.96GB of memory), which we can easily run on consumer hardware.
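As a quick sanity check on sequence length, we can count the tokens in a chunk before embedding it. This is a sketch rather than part of the original walkthrough; it assumes the Hugging Face tokenizer for e5-base-v2, which we install and use later in this article, and the example chunk text is hypothetical.

from transformers import AutoTokenizer

# load the tokenizer for the model we plan to use
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base-v2")

chunk = "Llama 2 is a collection of pretrained and fine-tuned LLMs..."  # hypothetical chunk
n_tokens = len(tokenizer(chunk)["input_ids"])
print(n_tokens)  # if this exceeds 512, the chunk would be truncated at embedding time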
Focusing on these columns gives us all the information we need to pick out models that might fit our needs. With that in mind, we chose three models to showcase in this article: two proprietary models, Ada 002 and embed-english-v3.0, and a small but high-performing open-source model, e5-base-v2.
Downloading Test Data
For our comparison, we need a dataset. We will use the prechunked AI ArXiv dataset from Hugging Face Datasets.
!pip install -qU datasets==2.14.6
from datasets import load_dataset

data = load_dataset(
    "jamescalam/ai-arxiv-chunked",
    split="train"
)
The dataset gives us roughly 42K chunks of text to embed, each roughly a paragraph or two in length.
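We can take a quick peek at what we just downloaded. The "chunk" field shown below is the one we embed later when building the index; other metadata fields may vary.

# peek at one record and confirm the dataset size
print(data[0]["chunk"][:300])
print(len(data))  # ~42K records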
Prerequisites
The prerequisites differ slightly for each model. OpenAI and Cohere keep their two proprietary models behind APIs, so their client libraries are very lightweight; we only need to install them and grab our respective API keys.
!pip install -qU \
cohere==4.34 \
openai==1.2.2
import os
import cohere
import openai

# initialize cohere
os.environ["COHERE_API_KEY"] = "your cohere api key"
co = cohere.Client()

# openai doesn't need to be initialized, but need to set api key
os.environ["OPENAI_API_KEY"] = "your openai api key"
E5 is a local model, so we need a little more code and a few more installs to run it, including fairly heavy libraries like PyTorch. The model itself is comparatively lightweight, so we don't need heavy GPU instances or a ton of memory. Ideally, we do want at least a GPU to run it on for faster performance, but this isn't a strict requirement, and we can manage with CPU only.
!pip install -qU \
torch==2.1.2 \
transformers==4.25.0
import torch
from transformers import AutoModel, AutoTokenizer

# use GPU if available, on mac can use MPS
device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "intfloat/e5-base-v2"

# initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).to(device)
model.eval()
Creating Embeddings
To create our embeddings, we will define an embed function for each model. We'll pass a list of strings into these embed functions and expect each to return a list of vector embeddings.
API Embeddings
Naturally, the code for our proprietary embedding models is more straightforward, so let's cover those two first:
# cohere embedding function
def embed(docs: list[str]) -> list[list[float]]:
    doc_embeds = co.embed(
        docs,
        input_type="search_document",
        model="embed-english-v3.0"
    )
    return doc_embeds.embeddings

# openai embedding function
def embed(docs: list[str]) -> list[list[float]]:
    res = openai.embeddings.create(
        input=docs,
        model="text-embedding-ada-002"
    )
    doc_embeds = [r.embedding for r in res.data]
    return doc_embeds
With Cohere and OpenAI, we're making a simple API call. There's little to note here other than the input_type parameter for the Cohere API. The input_type defines whether the current inputs are document vectors or query vectors. We define this to support improved performance for asymmetric semantic search — where we are querying with a smaller chunk of text (i.e., a search query) and attempting to retrieve larger chunks (i.e., a couple of sentences or paragraphs).
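For the query side, the only change with Cohere is switching input_type to "search_query"; Ada 002 has no equivalent parameter, so the same embed function serves both queries and documents. A minimal sketch of a Cohere query-embedding helper (the embed_query name is ours, not from the original code):

def embed_query(query: str) -> list[float]:
    # embed a single query with the query-side input type
    res = co.embed(
        [query],
        input_type="search_query",
        model="embed-english-v3.0"
    )
    return res.embeddings[0]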
Local Embeddings
E5 works similarly to the Cohere embedding model, with support for asymmetric search. However, the implementation is slightly different. Rather than specifying whether an input is a query or a document via a parameter, we prefix that information to the input text. For queries, we prefix "query: ", and for documents, we prefix "passage: " (another name for documents).
def embed(docs: list[str]) -> list[list[float]]:
    docs = [f"passage: {d}" for d in docs]
    # tokenize
    tokens = tokenizer(
        docs, padding=True, max_length=512, truncation=True, return_tensors="pt"
    ).to(device)
    with torch.no_grad():
        # process with model for token-level embeddings
        out = model(**tokens)
        # mask padding tokens
        last_hidden = out.last_hidden_state.masked_fill(
            ~tokens["attention_mask"][..., None].bool(), 0.0
        )
        # create mean pooled embeddings
        doc_embeds = last_hidden.sum(dim=1) / \
            tokens["attention_mask"].sum(dim=1)[..., None]
    return doc_embeds.cpu().numpy()
After specifying that these chunks are documents, we tokenize them to produce the tokens variable. Every transformer-based model requires a tokenization step. Tokenization is where we translate human-readable plain text into transformer-readable inputs, which is simply a list of integers like [0, 531, 81, 944, ...], where each integer represents a word or sub-word.
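As a quick illustration of that step (a sketch, not from the original code), we can tokenize a short string with the tokenizer loaded above and inspect both the integer IDs and the sub-word tokens they map to:

example = tokenizer("passage: red teaming for LLMs")
print(example["input_ids"])  # a list of integers (the exact IDs depend on the tokenizer)
print(tokenizer.convert_ids_to_tokens(example["input_ids"]))  # the corresponding sub-words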
Once we have our tokens, we feed them into our model with model(**tokens). From this, we get the model's token-level output embeddings (its last hidden states) in the out variable.
Some of our input tokens are padding tokens. These are used as placeholders to align the dimensions of the arrays/tensors that we feed through each model layer. Normally, we simply ignore the outputs produced by these tokens, but in some cases (e.g., with embedding models), we must calculate an average value across all token-level outputs. If we were to include the outputs produced by the padding tokens in this calculation, we would degrade embedding quality.
To avoid degrading embedding quality, we must mask (i.e., zero out) the outputs produced by padding tokens. That is what the out.last_hidden_state.masked_fill line is doing.
Finally, we're ready to calculate our single vector embedding, which we do by mean pooling. Mean pooling means taking the average across all of our token-level output vectors to produce a single vector, which we store in the doc_embeds variable.
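A tiny numeric illustration of why the masking matters (made-up numbers, not from the original): the zeroed-out padding row contributes nothing to the sum, and we divide by the number of real tokens rather than the total number of rows.

import numpy as np

# three token vectors: two real tokens plus one padding token (already zeroed out)
token_vectors = np.array([[1.0, 2.0],
                          [3.0, 4.0],
                          [0.0, 0.0]])
attention_mask = np.array([1, 1, 0])

# divide by the number of *real* tokens, not the total number of rows
pooled = token_vectors.sum(axis=0) / attention_mask.sum()
print(pooled)  # [2. 3.]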
From there, we return our doc_embeds after moving it from GPU to CPU (if we used a GPU) with .cpu() and converting the PyTorch tensor into a NumPy array with .numpy().
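For the query side with E5, the logic is identical except for the "query: " prefix. A minimal sketch (embed_query is our name for this hypothetical helper, mirroring the document-side function above):

def embed_query(query: str) -> list[float]:
    # the "query: " prefix marks this input as a query for E5's asymmetric search
    tokens = tokenizer(
        [f"query: {query}"], padding=True, max_length=512,
        truncation=True, return_tensors="pt"
    ).to(device)
    with torch.no_grad():
        out = model(**tokens)
        last_hidden = out.last_hidden_state.masked_fill(
            ~tokens["attention_mask"][..., None].bool(), 0.0
        )
        query_embed = last_hidden.sum(dim=1) / \
            tokens["attention_mask"].sum(dim=1)[..., None]
    return query_embed.cpu().numpy()[0].tolist()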
Building a Vector Index
Once we have defined our chosen embed function, we can build the vector index with the same logic for every model. We define a batch_size and iterate through our dataset, creating the embeddings and adding them to a local vector index called arr.
from tqdm.auto import tqdm
import numpy as np

chunks = data["chunk"]
batch_size = 256

for i in tqdm(range(0, len(chunks), batch_size)):
    i_end = min(len(chunks), i+batch_size)
    chunk_batch = chunks[i:i_end]
    # embed current batch
    embed_batch = embed(chunk_batch)
    # add to existing np array if exists (otherwise create)
    if i == 0:
        arr = embed_batch.copy()
    else:
        arr = np.concatenate([arr, embed_batch.copy()])
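The original notebooks handle querying separately; as a rough sketch of what retrieval over this local index might look like, we can embed a query (reusing embed for Ada 002, or a query-side helper like the embed_query sketches above for Cohere and E5) and rank chunks by cosine similarity. The search helper below is ours, not from the original code.

def search(query: str, top_k: int = 3) -> list[str]:
    xq = np.asarray(embed_query(query))  # query vector
    # cosine similarity between the query and every stored chunk vector
    sims = (arr @ xq) / (np.linalg.norm(arr, axis=1) * np.linalg.norm(xq))
    top_idx = np.argsort(-sims)[:top_k]  # best matches first
    return [chunks[int(i)] for i in top_idx]

search("Why should I use Llama 2?")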
Here, we can measure two metrics — embedding latency and vector dimensionality. Running all of these on Google Colab, we see the time taken to index the entire dataset for each model is:
Model | Batch size | Time taken | Vector dim |
---|---|---|---|
embed-english-v3.0 | 128 | 05:32 | 1024 |
text-embedding-ada-002 | 128 | 09:07 | 1536 |
intfloat/e5-base-v2 | 256 | 03:53 | 768 |
Ada 002 is the slowest method here. E5 was the fastest, _but_ was run on a V100 GPU instance in Google Colab — the API models don't require us to run our code on GPU instances. Another consideration is storage requirements. Higher dimensional vectors cost more to store, and these costs can build up over time.
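As a rough back-of-the-envelope illustration of those storage costs (our own arithmetic, assuming float32 vectors at 4 bytes per dimension for the ~42K chunks in this dataset):

n_vectors = 42_000  # approximate number of chunks in the dataset
for model_name, dim in [
    ("embed-english-v3.0", 1024),
    ("text-embedding-ada-002", 1536),
    ("intfloat/e5-base-v2", 768),
]:
    mb = n_vectors * dim * 4 / 1e6  # float32 = 4 bytes per dimension
    print(f"{model_name}: ~{mb:.0f} MB")  # ~172 MB, ~258 MB, ~129 MB respectively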
Performance
When testing these models, we will see relatively similar results. We're using a messy dataset, which is more challenging _but_ also more realistic.
Q1: Why should I use Llama 2?
Note: we have paraphrased the results below for brevity. See the original notebooks for full results and code [Ada 002, Cohere embed v3, E5 base v2].
Ada 002 | Embed v3 | E5 base v2 |
---|---|---|
✅ "Llama 2 is intended for commercial and research use in English. Tuned models are intended for assistant like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks." | ⚪️ "The focus of this work is to train a series of language models that achieve optimal performance at various sizes. The resulting models, called LLaMA... LLaMA-13B outperforms GPT-3 on most benchmarks, despite being 10x smaller. We believe that this model will help democratize the access and study of LLMs, since it can be run on a single GPU." | ❌ "rflowanswerability classificationtask957e2e datato _texttask288_gigaword_title generationtask1728web _nlg_data_to_texttask1358 xlsum title generationtask1529_ scitailv1.1_textual..." |
✅ "We develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. They are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed source models." | ✅ "We develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. They are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed source models." | ✅ "We develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. They are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed source models." |
✅ "These closed product LLMs are heavily fine-tuned to align with human preferences, which greatly enhances their usability and safety. We develop and release Llama 2, a family of pretrained and fine-tuned LLMs, at scales up to 70B parameters. On the series of helpfulness and safety benchmarks we tested, Llama 2 models generally perform better than existing open-source models. They also appear to be on par with some of the closed-source models." | ✅ "These closed product LLMs are heavily fine-tuned to align with human preferences, which greatly enhances their usability and safety. We develop and release Llama 2, a family of pretrained and fine-tuned LLMs, at scales up to 70B parameters. On the series of helpfulness and safety benchmarks we tested, Llama 2 models generally perform better than existing open-source models. They also appear to be on par with some of the closed-source models." | ✅ "These closed product LLMs are heavily fine-tuned to align with human preferences, which greatly enhances their usability and safety. We develop and release Llama 2, a family of pretrained and fine-tuned LLMs, at scales up to 70B parameters. On the series of helpfulness and safety benchmarks we tested, Llama 2 models generally perform better than existing open-source models. They also appear to be on par with some of the closed-source models." |
For the first query, Ada gives slightly better results. For that first result, Cohere's model returns a passage about the original LLaMA model (rather than Llama 2), and the E5 model returns some strange, malformed text. However, all of the models return the same relevant results in positions two and three.
Q2: Can you tell me about red teaming for llama 2?
Ada 002 | Embed v3 | E5 base v2 |
---|---|---|
⚪️ "Visualization of the red team attacks. Each point corresponds to a red team attack embedded in a two dimensional space using UMAP. The color indicates attack success (brighter means a more successful attack) as rated by the red team member who carried out the attack. We manually annotated attacks and found several clusters of attack types." | ⚪️ "... aiding in disinformation campaigns, generating extremist texts, spreading falsehoods, and more. As AI systems improve, the scope of possible harms seems likely to grow. One potentially useful tool for adressing harm is red teaming. We describe our early efforts to implement red teaming to make our models safer." | ⚪️ "We created this dataset to analyze and address potential harms in LLMs through red teaming. This dataset adds to a limited number of publicly-available red team datasets, and is the only dataset of red team attacks on a language model trained with RLHF as a safety technique." |
⚪️ "Red teaming ChatGPT via Jailbreaking: Observations indicate that LLMs may exhibit social prejudice and toxicity, posing ethical and societal dangers. We perform a qualitative research method called red teaming on OpenAI's ChatGPT." | ⚪️ "A red team exercise is an effort to find flaws and vulnerabilities, often performed by dedicated red teams that seek to adopy an attacker's mindset. In security, red teams are routinely tasked with emulating attackers." | ⚪️ "We conducted interviews with Trust & Safety experts and incorporated their suggested best practices into our experiments to ensure the well-being of the red team. Red team members enjoyed participating in our experiments and felt motivated to make AI systems less harmful. |
⚪️ "In the red team task instructions, we provide clear warnings that red team members may be exposed to sensitive content. Through surveys and feedback, we found that red team members enjoyed the task and did not experience significant negative emotions." | ⚪️ "Red teaming ChatGPT via Jailbreaking: Observations indicate that LLMs may exhibit social prejudice and toxicity, posing ethical and societal dangers. We perform a qualitative research method called red teaming on OpenAI's ChatGPT." | ❌ "果,并解释我所持立场的原因。因此,我致力于提供积极、有趣、实用和吸引人的回 答。我的逻辑和推理力求严密、智能和有理有据。另外,我可以提供更多相关细节来 Seed Prompts for Topic-Guided Red-Teaming Self-Instruct" |
The red teaming question returned the worst results across all models. Despite information about red teaming Llama 2 existing in the dataset, none of that information was returned. All models did return information about generic red teaming, with Cohere's model returning the most informative results (in the author's opinion). E5 returned what seems to be the poorest result due to the lack of English text — however, it does seem to be related to red teaming, just in the wrong language.
Q3: What is the difference between gpt-4 and llama?
Ada 002 | Embed v3 | E5 base v2 |
---|---|---|
✅ "31.39%LLaMA-GPT4 25.99% Tie 42.61% HonestyAlpaca 25.43%LLaMA-GPT4 16.48% Tie 58.10% Harmlessness(a) LLaMA-GPT4 vs Alpaca ( i.e.,LLaMA-GPT3 ) GPT4 44.11% LLaMA-GPT4 42.78% Tie 13.11% Helpfulness GPT4 37.48% LLaMA-GPT4 37.88% Tie 24.64% Honesty GPT4 35.36% LLaMA-GPT4 31.66% Tie 32.98% Harmlessness (b) LLaMA-GPT4 vs GPT-4" | ✅ "Second, we compare GPT-4-instruction-tuned LLaMA models against the teacher model GPT-4. The observations are quite consistent over the three criteria: GPT-4-instruction-tuned LLaMA performs similarly to the original GPT-4." | ✅ "LLaMA-GPT4 is a closer proxy to GPT-4 than Alpaca. closely follow the behavior of GPT-4. When the sequence length is short, both LLaMA-GPT4 and GPT-4 can generate responses that contains the simple ground truth answers, but add extra words to make the response more chat-like." |
✅ "Second, we compare GPT-4-instruction-tuned LLaMA models against the teacher model GPT-4. The observations are quite consistent over the three criteria: GPT-4-instruction-tuned LLaMA performs similarly to the original GPT-4." | ✅ "Instruction tuning of LLaMA with GPT-4 often achieves higher performance than tuning with text-davinci-003 (i.e. Alpaca) and no tuning (i.e. LLaMA): The 7B LLaMA GPT4 outperforms the 13B Alpaca and LLaMA. | ✅ "We compare LLaMA-GPT4 with GPT-4 and Alpaca unnatural instructions. For ROUGE-L scores, Alpaca outperforms the other models. We note that LLaMA-GPT4 and GPT4 gradually perform better when the ground truth response length increases, eventually showing higher performance when the length is longer than 4." |
✅ "We compare LLaMA-GPT4 with GPT-4 and Alpaca unnatural instructions. For ROUGE-L scores, Alpaca outperforms the other models. We note that LLaMA-GPT4 and GPT4 gradually perform better when the ground truth response length increases, eventually showing higher performance when the length is longer than 4." | ✅ "LLaMA-GPT4 is a closer proxy to GPT-4 than Alpaca. closely follow the behavior of GPT-4. When the sequence length is short, both LLaMA-GPT4 and GPT-4 can generate responses that contains the simple ground truth answers, but add extra words to make the response more chat-like." | ✅ "31.39%LLaMA-GPT4 25.99% Tie 42.61% HonestyAlpaca 25.43%LLaMA-GPT4 16.48% Tie 58.10% Harmlessness(a) LLaMA-GPT4 vs Alpaca ( i.e.,LLaMA-GPT3 ) GPT4 44.11% LLaMA-GPT4 42.78% Tie 13.11% Helpfulness GPT4 37.48% LLaMA-GPT4 37.88% Tie 24.64% Honesty GPT4 35.36% LLaMA-GPT4 31.66% Tie 32.98% Harmlessness (b) LLaMA-GPT4 vs GPT-4" |
The results when asking for a comparison between GPT-4 and Llama are good from each model. The primary difference is that Ada 002 and E5 both return a plain-text table as a response, which is harder for us to read, though most LLMs would likely still extract useful information from it. Cohere returns a set of three valuable text-only responses.
Using what we have learned here, we now have a good overview of the different types of embedding models and the qualities we might care most about when assessing which of them to use. Naturally, a big part of that assessment should focus on evaluation, which we did a little of, qualitatively, here.
New models are being added to the MTEB leaderboards almost daily — many of those showing promising state-of-the-art results. So, there is no shortage of high-quality embedding models we can use in retrieval.