Piotr Migdał

Don't use cosine similarity carelessly

14 Jan 2025 | by Piotr Migdał

Midas turned everything he touched into gold. Data scientists turn everything into vectors. We do it for a reason: as gold is the language of merchants, vectors are the language of AI [1].

Just as Midas discovered that turning everything to gold wasn't always helpful, we'll see that blindly applying cosine similarity to vectors can lead us astray. While embeddings do capture similarities, they often reflect the wrong kind - matching questions to questions rather than questions to answers, or getting distracted by superficial patterns like writing style and typos rather than meaning. This post shows you how to be more intentional about similarity and get better results.

Embeddings

Embeddings are so captivating that my most popular blog post remains "king - man + woman = queen; but why?". We have word2vec, node2vec, food2vec, game2vec, and if you can name it, someone has probably turned it into a vec. If not yet, it's your turn!

When we work with raw IDs, we're blind to relationships. Take the words "brother" and "sister": to a computer, they might as well be "xkcd42" and "banana". But with vectors, we can chart entities and the relationships between them, both as structured input to machine learning models and on their own, to find similar items.

Let's focus on sentence embeddings from Large Language Models (LLMs), as they are one of the most popular use cases for embeddings. Modern LLMs are so powerful at this that they can capture the essence of text without any fine-tuning. In fact, recent research shows these embeddings are almost as revealing as the original text - see Morris et al., Text Embeddings Reveal (Almost) As Much As Text, (2023). Yet, with great power comes great responsibility.

Example

Let's look at three sentences:

  • A: "Python can make you rich."
  • B: "Python can make you itch."
  • C: "Mastering Python can fill your pockets."

If you treat them as raw IDs, they are different strings, with no notion of similarity. Using string similarity (Levenshtein distance), A and B differ by 2 characters, while A and C are 21 characters apart. Yet semantically (unless you're allergic to money), A is closer to C than to B.
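To make the string-similarity baseline concrete, here is a minimal sketch of Levenshtein distance in plain Python. The function name `levenshtein` and the variable names are my own choices for illustration; in practice you would likely reach for an existing library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


sentence_a = "Python can make you rich."
sentence_b = "Python can make you itch."
sentence_c = "Mastering Python can fill your pockets."

distance_ab = levenshtein(sentence_a, sentence_b)
distance_ac = levenshtein(sentence_a, sentence_c)
print(distance_ab, distance_ac)
```

The distance only counts character edits, so it has no way to notice that "rich" and "fill your pockets" mean roughly the same thing.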

We can use OpenAI text-embedding-3-large to get the following vectors:

  • A: [-0.003738, -0.033263, -0.017596, 0.029024, -0.015251, ...]
  • B: [-0.066795, -0.052274, -0.015973, 0.077706, 0.044226, ...]
  • C: [-0.011167, 0.017812, -0.018655, 0.006625, 0.018506, ...]

These vectors are quite long - text-embedding-3-large has up to 3072 dimensions - to the point that we can truncate them at a minimal loss of quality. When we calculate cosine similarity, we get 0.750 between A and C (the semantically similar sentences), and 0.576 between A and B (the lexically similar ones). These numbers align with what we'd expect - the meaning matters more than spelling!

What is cosine similarity?

When comparing vectors, there's a temptingly simple solution that every data scientist reaches for, cosine similarity:

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

Geometrically speaking, it is the cosine of the angle between two vectors. However, I avoid thinking about it this way - we're dealing with spaces of dozens, hundreds, or thousands of dimensions. Our geometric intuition fails us in such high-dimensional spaces, and we shouldn't pretend otherwise.

From a numerical perspective, it is a dot product with normalized vectors. It has some appealing properties:

  • Identical vectors score a perfect 1.
  • Random vectors hover around 0 (there are many dimensions, so it averages out).
  • The result is between -1 and 1.
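These properties are easy to check numerically. Below is a small sketch using NumPy; the helper `cosine_similarity` and the random test vectors are illustrative assumptions, not data from any embedding model.

```python
import numpy as np


def cosine_similarity(a, b):
    """Dot product of the two vectors after scaling each to unit length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(seed=0)
v = rng.normal(size=3072)  # same dimensionality as text-embedding-3-large
w = rng.normal(size=3072)

same = cosine_similarity(v, v)         # identical vectors
scaled = cosine_similarity(v, 5 * v)   # vector length is ignored
opposite = cosine_similarity(v, -v)    # lower end of the range
random_pair = cosine_similarity(v, w)  # hovers around 0 in high dimensions
print(same, scaled, opposite, random_pair)
```

Note that real embeddings are not random Gaussian vectors, so unrelated texts typically score well above 0.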

Yet, this simplicity is misleading. Just because the values usually fall between 0 and 1 doesn't mean they represent probabilities or any other meaningful metric. A value of 0.6 tells us little about whether two items are really similar or not so much. And while negative values are possible, they rarely indicate semantic opposites; more often, the opposite of something is gibberish.

In other words, cosine similarity is the duct tape of vector comparisons. Sure, it sticks everything together — images, text, audio, code — but like real duct tape, it's a quick fix that often masks deeper problems rather than solving them. And just as you wouldn't use duct tape to permanently repair a water pipe, you shouldn't blindly trust cosine similarity for all your vector comparison needs.

Like a Greek tragedy, this blessing comes with a curse: when it works, it feels like effortless magic. But when it fails, we are clueless, and we often resort to impromptu fixes, each one bringing issues of its own.

Relation to correlation

Pearson correlation can be seen as a sequence of three operations:

  • Subtracting means to center the data.
  • Normalizing vectors to unit length.
  • Computing dot products between them.

When we work with vectors that are both centered ($\sum_i v_i = 0$) and normalized ($\sum_i v_i^2 = 1$), Pearson correlation, cosine similarity and dot product are the same.

In practical cases, we don't want to center or normalize vectors during each pairwise comparison - we do it once, and just use dot product. In any case, when you are fine with using cosine similarity, you should be as fine with using Pearson correlation (and vice versa).
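A quick numerical check of this equivalence, as a sketch: the helper `center_and_normalize` and the synthetic data are my own, and `np.corrcoef` serves as the reference Pearson implementation.

```python
import numpy as np


def center_and_normalize(v):
    """Subtract the mean, then scale to unit length (done once per vector)."""
    v = np.asarray(v, dtype=float)
    centered = v - v.mean()
    return centered / np.linalg.norm(centered)


rng = np.random.default_rng(seed=1)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)  # partially correlated with x

x_hat = center_and_normalize(x)
y_hat = center_and_normalize(y)

dot = float(x_hat @ y_hat)
cosine = float(x_hat @ y_hat / (np.linalg.norm(x_hat) * np.linalg.norm(y_hat)))
pearson = float(np.corrcoef(x, y)[0, 1])
print(dot, cosine, pearson)  # three names for the same number
```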

Problems with cosine similarity as a measure of similarity

Using cosine similarity as a training objective for machine learning models is perfectly valid and mathematically sound. As we have just seen, it's a combination of two fundamental operations in deep learning: dot product and normalization. The trouble begins when we venture beyond its comfort zone, specifically when:

  • The cost function used in model training isn't cosine similarity (which is usually the case!).
  • The training objective differs from what we actually care about.

Has the model ever seen cosine similarity?

A common scenario involves training with unnormalized vectors, where the model works with a function of the dot product - for example, predicting probabilities with a sigmoid function and applying a log loss cost function. Other networks operate differently, e.g. they use Euclidean distance, minimized for members of the same class and maximized for members of different classes.

The normalization gives us some nice mathematical properties (keeping results between -1 and +1, regardless of dimensions), but it's ultimately a hack. Sometimes it helps, sometimes it doesn't - see the aptly titled paper "Is Cosine-Similarity of Embeddings Really About Similarity?".
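To see how normalization can change the answer, consider a toy case where a hypothetical model scores items by raw dot product. The two-dimensional vectors and item names below are made up purely for illustration.

```python
import numpy as np


def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


query = np.array([1.0, 0.0])
items = {
    "long_but_off_axis": np.array([10.0, 10.0]),
    "short_but_aligned": np.array([2.0, 0.0]),
}

# What a dot-product model would prefer vs. what cosine similarity prefers.
dot_scores = {name: float(query @ vec) for name, vec in items.items()}
cos_scores = {name: cosine_similarity(query, vec) for name, vec in items.items()}

best_by_dot = max(dot_scores, key=dot_scores.get)
best_by_cosine = max(cos_scores, key=cos_scores.get)
print(best_by_dot, best_by_cosine)
```

The two measures pick different winners, because cosine similarity throws away vector length, which a model trained on dot products may have used to encode something.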

Sure, back in the days of VGG16, an image classification model, I was using logit vectors from the classification layer and Pearson correlation to find similar images. It kind of worked, though I was fully aware that it was a hack and just a hack.

We are safe only if the model itself uses cosine similarity or a direct function of it - usually implemented as a dot product of vectors that are kept normalized. Otherwise, we use a quantity we have no control over. It may work in one instance, but not in another. If some things are extremely similar, sure, it is likely that many different measures of similarity will give similar results. But if they are not, we are in trouble.

In general, this is part of the broader subject of unsupervised vs self-supervised machine learning. In the first, we take an arbitrary function and get some notion of similarity. Yet, there is no way to evaluate it. The second, self-supervised learning, is a predictive model, in which we can directly evaluate the quality of prediction.

Is it the right kind of similarity?

And here is the second issue - even if a model is explicitly trained on cosine similarity, we run into a deeper question: whose definition of similarity are we using?

Consider books. For a literary critic, similarity might mean sharing thematic elements. For a librarian, it's about genre classification. For a reader, it's about the emotions it evokes. For a typesetter, it's page count and format. Each perspective is valid, yet cosine similarity smashes all these nuanced views into a single number, with confidence and an illusion of objectivity.

Cartoon by Dmitry Malkov

In the US, word2vec might tell you espresso and cappuccino are practically identical. It is not a claim you would make in Italy.

When it falls apart

Let's have a task that looks simple, a simple quest from our everyday life:

  • "What did I do with my keys?"

Now, using cosine similarity, we can compare it to other notes:

  • "I left them in my pocket"
  • "They are on the table"
  • "What did I put my wallet?"
  • "What I did to my life?"

And remember, this is just a toy example with five sentences. In real-world applications, we're often dealing with thousands of documents — far more than could fit in a single context window. As your dataset grows, so does the noise sensitivity, turning your similarity scores into a game of high-dimensional roulette.

So, what can we use instead?

The most powerful approach

The best approach is to use an LLM query directly to compare two entries. So, first, start with a powerful model of your choice. Then, we can write something along the lines of:

"Is {sentence_a} similar to {sentence_b}?"

This way we harness the full power of an LLM to extract meaningful comparisons. We typically want our answers in structured output - what the field calls "tools" or "function calls" (which is really just a fancy way of saying "JSON").
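As a rough sketch of what such a comparison could look like in code: the function `compare_with_llm`, the JSON shape, and the `ask_llm` callable are all hypothetical. `ask_llm` stands in for whichever client you use to send a prompt and get text back, and `fake_llm` is a canned reply so the example runs without any API.

```python
import json
from typing import Callable

PROMPT_TEMPLATE = (
    "Is {sentence_a} similar to {sentence_b}?\n"
    'Reply with JSON only, in the form {{"similar": true or false, "reason": "..."}}.'
)


def compare_with_llm(sentence_a: str, sentence_b: str, ask_llm: Callable[[str], str]) -> dict:
    """Build the prompt, call the (hypothetical) model, and validate the JSON reply."""
    prompt = PROMPT_TEMPLATE.format(
        sentence_a=json.dumps(sentence_a),  # json.dumps adds quotes and escapes
        sentence_b=json.dumps(sentence_b),
    )
    verdict = json.loads(ask_llm(prompt))
    if not isinstance(verdict, dict) or not isinstance(verdict.get("similar"), bool):
        raise ValueError(f"Unexpected reply shape: {verdict!r}")
    return verdict


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call; always returns the same canned answer.
    return json.dumps({"similar": True, "reason": "canned reply for demonstration"})


result = compare_with_llm(
    "Python can make you rich.",
    "Mastering Python can fill your pockets.",
    fake_llm,
)
print(result["similar"], result["reason"])
```

Validating the reply matters here: a model can return malformed or unexpected JSON, and it is better to fail loudly than to treat garbage as a similarity verdict.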

However, in most cases this approach is impractical - we don't want to run such a costly operation for each query. Unless our dataset is very small, it would be prohibitively expensive. Even with a small dataset, the delays would be noticeable compared to a simple numerical operation.

Extracting the right features

So, we can go back to using embeddings. But instead of blindly trusting a black box, we can directly optimize for what we actually care about by creating task-specific embeddings. There are two main approaches:

  • Fine-tuning (teaching an old model new tricks by adjusting its weights).
  • Transfer learning (using the model's knowledge to create new, more focused embeddings).

Which one we use is ultimately a technical question - depending on the access to the model, costs, etc. Let's start with a symmetric case. Say we want to ask, "Is A similar to B?" We can write this as:

$$\text{similarity}(A, B) = \cos(u_A, u_B)$$

where $u_A = M v_A$ and $u_B = M v_B$ (with $v$ the original embedding), and $M$ is a matrix that reduces the embedding space to dimensions we actually care about. Think of it as decluttering: we're keeping only the features relevant to our specific similarity definition.

But often, similarity isn't what we're really after. Consider the question "Is document B a correct answer to question A?" (note the word "correct") and the relevant probability:

$$P(\text{B is a correct answer to A}) = \sigma(q_A \cdot k_B)$$

where $q_A = Q v_A$ and $k_B = K v_B$. The matrices $Q$ and $K$ transform our embeddings into specialized spaces for questions and answers. It's like having two different languages and learning to translate between them, rather than assuming they're the same thing.

This approach works beautifully for retrieval augmented generation (RAG) too, as we usually care not only about similar documents but about the relevant ones.

But where do we get the training data? We can use the same AI models we're working with to generate training data. Then feed it into PyTorch, TensorFlow, or your framework of choice.
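Here is a minimal sketch of the asymmetric question-answer case, under several assumptions of mine: the probability is modeled as a sigmoid of the dot product of two projected embeddings, the "embeddings" are synthetic random vectors rather than output of a real model, and plain NumPy gradient descent stands in for PyTorch or TensorFlow. All names (`Q`, `K`, `hidden_map`, the learning rate) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_pairs, embed_dim, reduced_dim = 200, 16, 4


def normalize_rows(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)


# Synthetic stand-ins for embeddings: each answer is a fixed linear "translation"
# of its question, so answers live in a different region than questions.
questions = normalize_rows(rng.normal(size=(n_pairs, embed_dim)))
hidden_map = rng.normal(size=(embed_dim, embed_dim))
answers = normalize_rows(questions @ hidden_map.T)

# Positive pairs: (question i, answer i). Negative pairs: (question i, someone else's answer).
shifted = np.roll(np.arange(n_pairs), 1)
A = np.vstack([questions, questions])
B = np.vstack([answers, answers[shifted]])
y = np.concatenate([np.ones(n_pairs), np.zeros(n_pairs)])

# Two small projection matrices, one for questions and one for answers.
Q = rng.normal(scale=0.1, size=(reduced_dim, embed_dim))
K = rng.normal(scale=0.1, size=(reduced_dim, embed_dim))


def scores(Q, K, A, B):
    """Dot product between projected questions and projected answers."""
    return np.sum((A @ Q.T) * (B @ K.T), axis=1)


def log_loss(Q, K, A, B, y):
    s = scores(Q, K, A, B)
    # Numerically stable binary cross-entropy for p = sigmoid(s).
    return float(np.mean(np.logaddexp(0.0, s) - y * s))


initial_loss = log_loss(Q, K, A, B, y)

learning_rate = 0.2
for _ in range(300):
    qa, kb = A @ Q.T, B @ K.T
    p = 1.0 / (1.0 + np.exp(-np.sum(qa * kb, axis=1)))
    g = (p - y)[:, None] / len(y)  # d(loss)/d(score), one row per pair
    grad_Q = (g * kb).T @ A
    grad_K = (g * qa).T @ B
    Q -= learning_rate * grad_Q
    K -= learning_rate * grad_K

final_loss = log_loss(Q, K, A, B, y)
print(initial_loss, final_loss)
```

In a real project the pairs and labels would come from your own data (or from data generated by an LLM), and you would hold out a validation set instead of only watching the training loss.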

Pre-prompt engineering

Sure, we can train a model. Maybe even train on artificially generated data - but what if we want to avoid this step entirely? We got used to zero-shot learning, and it is not easy to go back.

One of the quickest fixes is to add a prompt to the text, to set the apparent context. A simple example: let's take the list of Time's 100 Most Significant Figures in History. Let's say we want to see who is the most similar to Isaac Newton.

No surprise, it's other physicists and natural philosophers. Yet, let's say we want to focus on his nationality - so we add a prompt "Nationality of {person}".

Sadly, the results are underwhelming - sure, Galileo went a few places down, but Albert Einstein is listed as the most similar. So, let's try another approach, by making nationality the subject of the sentence - "This is a country that has produced many influential historical figures, including {person}".

Now we get a much better answer! To be clear - while I have found this approach useful, it is not a silver bullet. Depending on how we formulate the prompt, we can get a slight bias towards our goal, or something that actually solves our problem.
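Mechanically, pre-prompting is just wrapping each item in a template before embedding it. The sketch below shows the plumbing only: `rank_by_similarity` is a hypothetical helper, and `toy_embed` (a letter-count vector) is a placeholder so the code runs offline. It has no semantic understanding, so its rankings are meaningless; swap in a real embedding model to reproduce the effect described above.

```python
from typing import Callable, Sequence

import numpy as np

TEMPLATES = {
    "plain": "{person}",
    "nationality": "Nationality of {person}",
    "country": (
        "This is a country that has produced many influential "
        "historical figures, including {person}"
    ),
}


def rank_by_similarity(
    target: str,
    candidates: Sequence[str],
    template: str,
    embed: Callable[[str], Sequence[float]],
) -> list:
    """Embed every name wrapped in the template, then sort by cosine similarity."""

    def unit(name: str) -> np.ndarray:
        vector = np.asarray(embed(template.format(person=name)), dtype=float)
        return vector / np.linalg.norm(vector)

    target_vector = unit(target)
    scored = [(name, float(target_vector @ unit(name))) for name in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_embed(text: str) -> np.ndarray:
    # Placeholder embedding: counts of letters a-z. Not a real semantic model.
    counts = np.zeros(26)
    for char in text.lower():
        if "a" <= char <= "z":
            counts[ord(char) - ord("a")] += 1
    return counts


ranking = rank_by_similarity(
    "Isaac Newton",
    ["Albert Einstein", "Galileo Galilei", "Charles Darwin"],
    TEMPLATES["country"],
    toy_embed,
)
for name, score in ranking:
    print(f"{score:.3f}  {name}")
```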

Rewriting and context extraction

Another approach is to preprocess the text before embedding it. Here's a generic trick I often use — I ask the model:

"Rewrite the following text in standard English using Markdown. Focus on content, ignore style. Limit to 200 words."

This simple prompt works wonders. It helps avoid false matches based on superficial similarities like formatting quirks, typos, or unnecessary verbosity.

Often we want more - e.g. to extract information from a text while ignoring the rest. For example, let's say we have a chat with a client and want to suggest relevant pages, be it FAQ or product recommendations. A naive way would be to compare their discussion's embedding with the embeddings of our pages. A better approach is to first transform the conversation into a structured format focused on needs:

"You have a conversation with a client. Summarize their needs and pain points in up to 10 Markdown bullet points, up to 20 words each. Consider both explicit needs and those implied by context, tone, and other signals."

Similarly, rewrite each of your pages in the same format before embedding them. This strips away everything that isn't relevant to matching needs with solutions.
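The rewriting step can be sketched as a thin wrapper that runs before embedding. The two prompts are the ones quoted above; `preprocess` and `ask_llm` are hypothetical names, and `fake_llm` is a crude stand-in (it only lowercases and collapses whitespace) so the example runs without a model.

```python
from typing import Callable

REWRITE_PROMPT = (
    "Rewrite the following text in standard English using Markdown. "
    "Focus on content, ignore style. Limit to 200 words."
)
NEEDS_PROMPT = (
    "You have a conversation with a client. Summarize their needs and pain points "
    "in up to 10 Markdown bullet points, up to 20 words each. Consider both explicit "
    "needs and those implied by context, tone, and other signals."
)


def preprocess(text: str, instruction: str, ask_llm: Callable[[str], str]) -> str:
    """Ask the model to rewrite `text` according to `instruction` before embedding."""
    return ask_llm(f"{instruction}\n\n{text}")


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model: drops the instruction, lowercases, collapses whitespace.
    _, text = prompt.split("\n\n", 1)
    return " ".join(text.lower().split())


messy = "WHERE   are my KEYS"
tidy = "where are my keys"
same_after_rewrite = (
    preprocess(messy, REWRITE_PROMPT, fake_llm)
    == preprocess(tidy, REWRITE_PROMPT, fake_llm)
)
print(same_after_rewrite)
```

The point of the design is that both sides of the comparison (conversations and pages) pass through the same kind of rewrite, so the embeddings are compared on content rather than on surface form.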

This approach has worked wonders in many of my projects. Perhaps it will work for you too.

Recap

Let's recap the key points:

  • Cosine similarity gives us a number between -1 and 1, but don't mistake it for a probability.
  • Most models aren't trained using cosine similarity, in which case the results are just "some sort of correlations" without any guarantees.
  • Even when a model is trained with cosine similarity, we need to understand what kind of similarity it learned and if that matches our needs.
  • To use vector similarity effectively, there are a few approaches:
    • Train custom embeddings on your specific data
    • Engineer prompts to focus on relevant aspects
    • Clean and standardize text before embedding

Have you found other ways to make vector similarity work better for your use case? What approaches have you tried? What were the results?

Photo from Python Summit 2024 Warsaw - a laptop of Piotr Migdał showing "Don't use cosine similarity" talk front slide.

Thanks

I first presented this topic as a flash talk at Warsaw AI Breakfast - I am grateful for feedback from Grzegorz Kossakowski and Max Salamonowicz. I thank Rafał Małanij for inviting me to speak at Python Summit 2024 Warsaw. This blog post stemmed from interest after these presentations, as well as multiple questions on the LinkedIn post.

Footnotes

  1. To the point that my Jupyter Notebook intro to deep learning is called "Thinking in tensors, writing in PyTorch".