Piotr Migdał

Don't use cosine similarity carelessly

14 Jan 2025 | by Piotr Migdał

Midas turned everything he touched into gold. Data scientists turn everything into vectors. We do it for a reason — as gold is the language of merchants, vectors are the language of AI¹.

Just as Midas discovered that turning everything to gold wasn't always helpful, we'll see that blindly applying cosine similarity to vectors can lead us astray. While embeddings do capture similarities, they often reflect the wrong kind - matching questions to questions rather than questions to answers, or getting distracted by superficial patterns like writing style and typos rather than meaning. This post shows you how to be more intentional about similarity and get better results.

Embeddings

Embeddings are so captivating that my most popular blog post remains king - man + woman = queen; but why?. We have word2vec, node2vec, food2vec, game2vec, and if you can name it, someone has probably turned it into a vec. If not yet, it's your turn!

When we work with raw IDs, we're blind to relationships. Take the words "brother" and "sister" — to a computer, they might as well be "xkcd42" and "banana". But with vectors, we can chart entities and the relationships between them — both to provide structured input to machine learning models and, on their own, to find similar items.

Let's focus on sentence embeddings from Large Language Models (LLMs), as they are one of the most popular use cases for embeddings. Modern LLMs are so powerful at this that they can capture the essence of text without any fine-tuning. In fact, recent research shows these embeddings are almost as revealing as the original text - see Morris et al., Text Embeddings Reveal (Almost) As Much As Text (2023). Yet, with great power comes great responsibility.

Example

Let's look at three sentences:

  • A: "Python can make you rich."
  • B: "Python can make you itch."
  • C: "Mastering Python can fill your pockets."

If you treated them as raw IDs, they would just be different strings, with no notion of similarity. Using string similarity (Levenshtein distance), A and B differ by 2 characters, while A and C are 21 characters apart. Yet semantically (unless you're allergic to money), A is closer to C than to B.

We can use OpenAI text-embedding-3-large to get the following vectors:

  • A: [-0.003738, -0.033263, -0.017596, 0.029024, -0.015251, ...]
  • B: [-0.066795, -0.052274, -0.015973, 0.077706, 0.044226, ...]
  • C: [-0.011167, 0.017812, -0.018655, 0.006625, 0.018506, ...]

These vectors are quite long - text-embedding-3-large has up to 3072 dimensions - to the point that we can truncate them with minimal loss of quality. When we calculate cosine similarity, we get 0.750 between A and C (the semantically similar sentences), and 0.576 between A and B (the lexically similar ones). These numbers align with what we'd expect - the meaning matters more than spelling!
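
For reference, here is a minimal sketch of how such numbers can be reproduced with the OpenAI Python SDK and NumPy. It assumes OPENAI_API_KEY is set in the environment, and the exact scores may vary slightly between model versions:

    # Embed the three sentences and compare them pairwise.
    import numpy as np
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    sentences = [
        "Python can make you rich.",                # A
        "Python can make you itch.",                # B
        "Mastering Python can fill your pockets.",  # C
    ]
    response = client.embeddings.create(model="text-embedding-3-large", input=sentences)
    vec_a, vec_b, vec_c = (np.array(item.embedding) for item in response.data)

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    print("A vs C:", cosine_similarity(vec_a, vec_c))  # the semantically similar pair
    print("A vs B:", cosine_similarity(vec_a, vec_b))  # the lexically similar pair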

What is cosine similarity?

When comparing vectors, there's a temptingly simple solution that every data scientist reaches for — cosine similarity:

    cosine_similarity(a, b) = (a · b) / (‖a‖ ‖b‖)

Geometrically speaking, it is the cosine of the angle between two vectors. However, I avoid thinking about it this way - we're dealing with spaces of dozens, hundreds, or thousands of dimensions. Our geometric intuition fails us in such high-dimensional spaces, and we shouldn't pretend otherwise.

From a numerical perspective, it is a dot product with normalized vectors. It has some appealing properties:

  • Identical vectors score a perfect 1.
  • Random vectors hover around 0 (there are many dimensions, so it averages out).
  • The result is between -1 and 1.
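
These properties are easy to verify with a plain NumPy implementation of the formula above - a quick sanity check on random high-dimensional vectors:

    # Sanity-check the properties on random high-dimensional vectors.
    import numpy as np

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    rng = np.random.default_rng(0)
    v = rng.normal(size=3072)
    w = rng.normal(size=3072)

    print(cosine_similarity(v, v))   # identical vectors: exactly 1.0
    print(cosine_similarity(v, w))   # independent random vectors: close to 0.0
    print(cosine_similarity(v, -v))  # always within [-1, 1]; here the extreme -1.0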

Yet, this simplicity is misleading. Just because the values usually fall between 0 and 1 doesn't mean they represent probabilities or any other meaningful metric. A value of 0.6 tells us little about whether two items are really similar or not. And while negative values are possible, they rarely indicate semantic opposites — more often, the opposite of something is gibberish.

In other words, cosine similarity is the duct tape of vector comparisons. Sure, it sticks everything together — images, text, audio, code — but like real duct tape, it's a quick fix that often masks deeper problems rather than solving them. And just as you wouldn't use duct tape to permanently repair a water pipe, you shouldn't blindly trust cosine similarity for all your vector comparison needs.

Like a Greek tragedy, this blessing comes with a curse: when it works, it feels like effortless magic. But when it fails, we are clueless, and we often run into impromptu fixes, each one bringing issues on its own.

Relation to correlation

Pearson correlation can be seen as a sequence of three operations:

  • Subtracting the means to center the data.
  • Normalizing vectors to unit length.
  • Computing dot products between them.

When we work with vectors that are both centered (mean zero) and normalized (unit length), Pearson correlation, cosine similarity and dot product are the same.

In practical cases, we don't want to center or normalize vectors during each pairwise comparison - we do it once, and just use dot product. In any case, when you are fine with using cosine similarity, you should be as fine with using Pearson correlation (and vice versa).
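
A quick way to convince yourself of this equivalence, using NumPy only:

    # Pearson correlation equals cosine similarity of mean-centered vectors.
    import numpy as np

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    rng = np.random.default_rng(1)
    x = rng.normal(size=1000)
    y = 0.3 * x + rng.normal(size=1000)

    pearson = np.corrcoef(x, y)[0, 1]
    cosine_centered = cosine_similarity(x - x.mean(), y - y.mean())

    print(pearson, cosine_centered)  # identical up to floating-point error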

Problems with cosine similarity as a measure of similarity

Using cosine similarity as a training objective for machine learning models is perfectly valid and mathematically sound. As we have just seen, it's a combination of two fundamental operations in deep learning: dot product and normalization. The trouble begins when we venture beyond its comfort zone, specifically when:

  • The cost function used in model training isn't cosine similarity (and this is usually the case!).
  • The training objective differs from what we actually care about.

Has the model ever seen cosine similarity?

A common scenario involves training with unnormalized vectors, when we are dealing with a function of dot product - for example, predicting probabilities with a sigmoid function and applying log loss cost function. Other networks operate differently, e.g. they use Euclidean distance, minimized for members of the same class and maximized for members of different classes.
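
To make this concrete, here is a sketch (not any particular model's code) of such an objective: a sigmoid over an unnormalized dot product, scored with log loss. Nothing in it normalizes the vectors, so nothing guarantees that cosine similarity of the resulting embeddings behaves well:

    # A typical training objective that never sees cosine similarity.
    import numpy as np

    def predicted_probability(u, v):
        score = np.dot(u, v)                 # unnormalized dot product
        return 1.0 / (1.0 + np.exp(-score))  # sigmoid

    def log_loss(label, p, eps=1e-12):
        return -(label * np.log(p + eps) + (1 - label) * np.log(1 - p + eps))

    u = np.array([0.5, -1.2, 2.0])
    v = np.array([0.1, 0.4, 1.5])
    print(log_loss(1, predicted_probability(u, v)))  # this is what gets minimized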

The normalization gives us some nice mathematical properties (keeping results between -1 and +1, regardless of dimensions), but it's ultimately a hack. Sometimes it helps, sometimes it doesn't — see the aptly titled paper Is Cosine-Similarity of Embeddings Really About Similarity?.

Sure, back in the days of the VGG16 image classification model, I was using logit vectors from the classification layer and Pearson correlation to find similar images. It kind of worked - while being fully aware that it was a hack, and just a hack.

We are safe only if the model itself uses cosine similarity or a direct function of it - usually implemented as a dot product of vectors that are kept normalized. Otherwise, we use a quantity we have no control over. It may work in one instance, but not in another. If some things are extremely similar, sure, it is likely that many different measures of similarity will give similar results. But if they are not, we are in trouble.

In general, it is a part of a broader subject of unsupervised machine learning vs self-supervised learning. In the first one, we take an arbitrary function and get some notion of similarity. Yet, there is no way to evaluate it. The second one, self-supervised learning, is a predictive model, in which we can directly evaluate the quality of prediction.

Is it the right kind of similarity?

And here is the second issue - even if a model is explicitly trained on cosine similarity, we run into a deeper question: whose definition of similarity are we using?

Consider books. For a literary critic, similarity might mean sharing thematic elements. For a librarian, it's about genre classification. For a reader, it's about emotions it evokes. For a typesetter, it's page count and format. Each perspective is valid, yet cosine similarity smashes all these nuanced views into a single number — with confidence and an illusion of objectivity.

Cartoon by Dmitry Malkov

In the US, word2vec might tell you espresso and cappuccino are practically identical. It is not a claim you would make in Italy.

When it falls apart

Let's have a task that looks simple, a simple quest from our everyday life:

  • "What did I do with my keys?"

Now, using cosine similarity, we can compare it to other notes:

  • "I left them in my pocket"
  • "They are on the table"
  • "What did I put my wallet?"
    「我的錢包去哪了?」
  • "What I did to my life?"
    「我的人生怎麼了?」
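
To see what the rankings look like, here is a sketch that scores each note against the query. The exact numbers depend on the model, but as noted above, question-shaped notes can easily outrank the actual answers:

    # Rank the notes by cosine similarity to the query.
    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    query = "What did I do with my keys?"
    notes = [
        "I left them in my pocket",
        "They are on the table",
        "Where did I put my wallet?",
        "What did I do with my life?",
    ]
    response = client.embeddings.create(model="text-embedding-3-large", input=[query] + notes)
    query_vec, *note_vecs = (np.array(item.embedding) for item in response.data)

    scored = [(cosine_similarity(query_vec, vec), note) for vec, note in zip(note_vecs, notes)]
    for score, note in sorted(scored, reverse=True):
        print(f"{score:.3f}  {note}")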

And remember, this is just a toy example with five sentences. In real-world applications, we're often dealing with thousands of documents — far more than could fit in a single context window. As your dataset grows, so does the noise sensitivity, turning your similarity scores into a game of high-dimensional roulette.

So, what can we use instead?

The most powerful approach

The best approach is to use an LLM query directly to compare two entries. So, first, start with a powerful model of your choice. Then, we can write something along the lines of:

"Is {sentence_a} similar to {sentence_b}?"

This way we harness the full power of an LLM to extract meaningful comparisons. We typically want our answers in structured output - what the field calls "tools" or "function calls" (which is really just a fancy way of saying "JSON").
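
As a sketch, such a pairwise check could look like this with the OpenAI chat API, asking for a JSON verdict. The model name, prompt wording, and output fields below are just one possible choice:

    # Ask an LLM directly whether two entries are similar, as structured JSON.
    import json
    from openai import OpenAI

    client = OpenAI()

    def llm_similar(sentence_a: str, sentence_b: str) -> dict:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            response_format={"type": "json_object"},
            messages=[{
                "role": "user",
                "content": (
                    f'Is "{sentence_a}" similar to "{sentence_b}"? '
                    'Answer in JSON with keys "similar" (true/false) and "reason" (one sentence).'
                ),
            }],
        )
        return json.loads(response.choices[0].message.content)

    print(llm_similar("Python can make you rich.", "Mastering Python can fill your pockets."))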

However, in most cases this approach is impractical - we don't want to run such a costly operation for each query. Unless our dataset is very small, it would be prohibitively expensive. Even with a small dataset, the delays would be noticeable compared to a simple numerical operation.

Extracting the right features

So, we can go back to using embeddings. But instead of blindly trusting a black box, we can directly optimize for what we actually care about by creating task-specific embeddings. There are two main approaches:

  • Fine-tuning (teaching an old model new tricks by adjusting its weights).
  • Transfer learning (using the model's knowledge to create new, more focused embeddings).

Which one we use is ultimately a technical question - depending on access to the model, costs, etc. Let's start with a symmetric case. Say we want to ask, "Is A similar to B?" We can write this as:

    P(A ~ B) = σ(v_A · v_B)

where v_X = M u_X, u_X is the raw embedding of X, and M is a matrix that reduces the embedding space to dimensions we actually care about. Think of it as decluttering — we're keeping only the features relevant to our specific similarity definition.

But often, similarity isn't what we're really after. Consider the question "Is document B a correct answer to question A?" (note the word "correct") and the relevant probability:

    P(B answers A) = σ(q_A · a_B)

where q_A = M_q u_A and a_B = M_a u_B. The matrices M_q and M_a transform our embeddings into specialized spaces for questions and answers. It's like having two different languages and learning to translate between them, rather than assuming they're the same thing.

This approach works beautifully for retrieval augmented generation (RAG) too, as we usually care not only about similar documents but about the relevant ones.

But where do we get the training data? We can use the same AI models we're working with to generate training data. Then feed it into PyTorch, TensorFlow, or your framework of choice.
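
As a sketch of what that training code might look like in PyTorch: two projection matrices (the M_q and M_a from the formula above), trained with binary cross-entropy on hypothetical (question embedding, document embedding, is-correct-answer) triples. Data loading and negative sampling are left out:

    # Two learned projections, one for questions and one for answers (a sketch).
    import torch
    import torch.nn as nn

    class DualProjection(nn.Module):
        def __init__(self, embedding_dim=3072, projection_dim=256):
            super().__init__()
            self.project_question = nn.Linear(embedding_dim, projection_dim, bias=False)  # M_q
            self.project_answer = nn.Linear(embedding_dim, projection_dim, bias=False)    # M_a

        def forward(self, question_emb, answer_emb):
            q = self.project_question(question_emb)
            a = self.project_answer(answer_emb)
            return (q * a).sum(dim=-1)  # logit of "B answers A"; the sigmoid lives in the loss

    model = DualProjection()
    loss_fn = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # One training step on a hypothetical batch of precomputed embeddings and 0/1 labels.
    question_emb = torch.randn(32, 3072)
    document_emb = torch.randn(32, 3072)
    labels = torch.randint(0, 2, (32,)).float()

    optimizer.zero_grad()
    loss = loss_fn(model(question_emb, document_emb), labels)
    loss.backward()
    optimizer.step()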

Pre-prompt engineering

Sure, we can train a model. Maybe even train on artificially generated data - but what if we want to avoid this step entirely? We got used to zero-shot learning, and it is not easy to go back.

One of the quickest fixes is to add a prompt to the text, so as to set the apparent context. A simple example — let's take the list of Time's 100 Most Significant Figures in History. Let's say we want to see who is the most similar to Isaac Newton.

No surprise, it's other physicists and natural philosophers. Yet, let's say we want to focus on his nationality - so we add a prompt "Nationality of {person}".

Sadly, the results are underwhelming - sure, Galileo went a few places down, but Albert Einstein is listed as the most similar. So, let's try another approach, by making nationality the subject of the sentence - "This is a country that has produced many influential historical figures, including {person}".

Now we get a much better answer! To be clear - while I have found this approach useful, it is not a silver bullet. Depending on how we formulate the prompt, we can get either a slight bias towards our goal, or something that actually solves our problem.
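
As a sketch of the pre-prompting step itself (with just a handful of names from the list for illustration), the trick is a one-line template applied before embedding:

    # Embed people through a context-setting template, then rank by similarity to Newton.
    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    template = "This is a country that has produced many influential historical figures, including {person}"
    people = ["Isaac Newton", "Charles Darwin", "Galileo Galilei", "Albert Einstein", "Marie Curie"]

    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=[template.format(person=person) for person in people],
    )
    vectors = {person: np.array(item.embedding) for person, item in zip(people, response.data)}

    newton = vectors["Isaac Newton"]
    for person in people[1:]:
        print(f"{cosine_similarity(newton, vectors[person]):.3f}  {person}")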

Rewriting and context extraction

Another approach is to preprocess the text before embedding it. Here's a generic trick I often use — I ask the model:

"Rewrite the following text in standard English using Markdown. Focus on content, ignore style. Limit to 200 words."

This simple prompt works wonders. It helps avoid false matches based on superficial similarities like formatting quirks, typos, or unnecessary verbosity.

Often we want more - e.g. to extract information from a text while ignoring the rest. For example, let's say we have a chat with a client and want to suggest relevant pages, be it FAQ or product recommendations. A naive way would be to compare their discussion's embedding with the embeddings of our pages. A better approach is to first transform the conversation into a structured format focused on needs:

"You have a conversation with a client. Summarize their needs and pain points in up to 10 Markdown bullet points, up to 20 words each. Consider both explicit needs and those implied by context, tone, and other signals."

Similarly, rewrite each of your pages in the same format before embedding them. This strips away everything that isn't relevant to matching needs with solutions.
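
Putting the two ideas together, a sketch of such a pipeline might look like this. The conversation prompt is the one above; the page-side prompt is a hypothetical counterpart so that both sides end up in the same needs-focused format, and the model names are just examples:

    # Rewrite both sides into the same needs-focused format, embed, and match (a sketch).
    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    NEEDS_PROMPT = (
        "You have a conversation with a client. Summarize their needs and pain points "
        "in up to 10 Markdown bullet points, up to 20 words each. Consider both explicit "
        "needs and those implied by context, tone, and other signals.\n\n{conversation}"
    )
    # Hypothetical counterpart for the page side, mirroring the same format.
    PAGE_PROMPT = (
        "Summarize the needs and pain points this page addresses in up to 10 Markdown "
        "bullet points, up to 20 words each.\n\n{page}"
    )

    def rewrite(prompt: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    def embed(text: str):
        response = client.embeddings.create(model="text-embedding-3-large", input=[text])
        return np.array(response.data[0].embedding)

    def suggest_pages(conversation: str, pages: list[str], top_k: int = 3) -> list[str]:
        needs_vec = embed(rewrite(NEEDS_PROMPT.format(conversation=conversation)))
        page_vecs = [embed(rewrite(PAGE_PROMPT.format(page=page))) for page in pages]
        ranked = sorted(zip(pages, page_vecs),
                        key=lambda pair: cosine_similarity(needs_vec, pair[1]),
                        reverse=True)
        return [page for page, _ in ranked[:top_k]]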

This approach has worked wonders in many of my projects. Perhaps it will work for you too.

Recap

Let's recap the key points:

  • Cosine similarity gives us a number between -1 and 1, but don't mistake it for a probability.
  • Most models aren't trained using cosine similarity - in that case, the results are just "some sort of correlation", with no guarantees.
  • Even when a model is trained with cosine similarity, we need to understand what kind of similarity it learned and if that matches our needs.
  • To use vector similarity effectively, there are a few approaches:
    • Train custom embeddings on your specific data
    • Engineer prompts to focus on relevant aspects
    • Clean and standardize text before embedding

Have you found other ways to make vector similarity work better for your use case? What approaches have you tried? What were the results?

Photo from Python Summit 2024 Warsaw - a laptop of Piotr Migdał showing "Don't use cosine similarity" talk front slide.

Thanks

I first presented this topic as a flash talk at Warsaw AI Breakfast - I am grateful for feedback from Grzegorz Kossakowski and Max Salamonowicz. I thank Rafał Małanij for inviting me to speak at Python Summit 2024 Warsaw. This blog post stemmed from interest after these presentations, as well as multiple questions on the LinkedIn post.

Footnotes

  1. To the point that my Jupyter Notebook intro to deep learning is called Thinking in tensors, writing in PyTorch