
Rotary Position Embedding (RoPE)

Introduction

Positional embedding is a crucial part of transformer models such as BERT and GPT. Unlike traditional models such as RNNs or LSTMs, which understand the order of input sequences through sequential processing, transformers treat input sequences as unordered sets. This approach improves computational efficiency and overall performance, but it doesn’t account for the natural order of tokens, which is essential for understanding text. This limitation exists because the transformer architecture relies on self-attention mechanisms, which are permutation-invariant: they treat all positions equally, regardless of how elements are arranged in the sequence. Positional embeddings address this shortcoming by integrating the sequence’s order into the model’s inputs, enabling the model to maintain awareness of token positions. This understanding is essential for language tasks such as translation, generation, and comprehension, where changing the word order can drastically change sentence meaning. For example, “The cat sat on the mat” and “The mat sat on the cat” contain exactly the same words but have different meanings due to word order.

Transformer models use two primary types of positional embeddings: Absolute and Relative. Absolute positional embeddings assign a unique identifier to each position in a sequence, enabling the model to learn and utilize the absolute position of tokens. On the other hand, relative positional embeddings focus on the distances between pairs of tokens, thereby allowing the model to understand and leverage the relative positioning of tokens within a sequence.
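
To make the distinction concrete, here is a minimal sketch (toy tensors, with an assumed maximum length of 16 and d_model of 4): an absolute scheme attaches one vector to each position and adds it to the token embeddings, while a relative scheme works with the offsets between positions.

import torch
import torch.nn as nn

max_len, d_model = 16, 4
tokens = torch.randn(max_len, d_model)  # stand-in token embeddings

# Absolute: one (here learned) vector per position, added to the token embeddings
absolute_pe = nn.Embedding(max_len, d_model)
positions = torch.arange(max_len)
tokens_abs = tokens + absolute_pe(positions)  # row i now carries "I am position i"

# Relative: what matters is the offset i - j between a query position i and a key position j
rel_offsets = positions.unsqueeze(1) - positions.unsqueeze(0)  # (max_len, max_len) matrix of offsets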

Sinusoidal Positional Encoding

The original transformer model, which was introduced in the paper “Attention is All You Need”, uses sine and cosine functions to create positional embeddings. These embeddings are absolute, which means that each position in the sequence is assigned a unique sinusoidal pattern. This helps the model to effectively differentiate between the tokens and utilize their order.

Sine and cosine positional embeddings are calculated as follows for each position pos and each dimension index i of the d_model-dimensional token embedding:

PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)
PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)

This formulation ensures each position pos in the sequence receives a unique embedding, with the embeddings transitioning smoothly across positions.

Consider a scenario where we have a sequence of words and we wish to encode the position of each word. For simplicity, the model’s dimension d_model is 4.

Step-by-Step Calculation for a Single Position

  • Position 1, Dimensions 0 and 1: For the first position (pos = 0) and the first dimension pair (i = 0):
    PE(0, 0) = \sin\left(\frac{0}{10000^{0}}\right) = \sin(0) = 0
    PE(0, 1) = \cos\left(\frac{0}{10000^{0}}\right) = \cos(0) = 1

This process is repeated for each dimension of the embedding, creating a unique blend of sine and cosine values that vary smoothly and predictably across positions.

Code Implementation of Sinusoidal Positional Encoding

import numpy as np

def sinusoidal_positional_encoding(pos, d_model):
    position = np.arange(pos)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe = np.zeros((pos, d_model))
    pe[:, 0::2] = np.sin(position * div_term)
    pe[:, 1::2] = np.cos(position * div_term)
    return pe

  • np.arange(pos)[:, np.newaxis]: Creates a column vector with positions from 0 to pos-1.
  • div_term: Calculates the denominator for the exponent, scaling down the rate of change for higher dimensions.
  • pe[:, 0::2] = np.sin(position * div_term): Assigns the sine values to even indices of the embedding.
  • pe[:, 1::2] = np.cos(position * div_term): Assigns the cosine values to odd indices.

This implementation efficiently encodes positional information into a matrix, which can be added to the token embeddings, enriching them with positional context.
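
As a quick check of the worked example above, the function can be run for the toy case of d_model = 4 (a minimal usage sketch; the first row matches the hand calculation):

pe = sinusoidal_positional_encoding(pos=3, d_model=4)
print(pe.shape)  # (3, 4)
print(pe[0])     # first position: [0., 1., 0., 1.], since sin(0) = 0 and cos(0) = 1
print(pe[1])     # second position: [sin(1), cos(1), sin(0.01), cos(0.01)]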

When visualized, the encodings form a smooth, wave-like pattern, demonstrating how each position is uniquely encoded yet transitions smoothly to the next.

Despite their effectiveness, sinusoidal positional embeddings have limitations, particularly with sequences longer than those seen during training.

Rotary Positional Embedding (RoPE)

Jianlin Su et al. introduced Rotary Positional Embedding (RoPE) to solve the above problem. RoPE uses a unique method to encode positions by rotating the embedding vectors in a multidimensional space. This is different from sinusoidal embeddings, which add fixed wave patterns to the embeddings based on their positions. RoPE uses a rotation matrix to alter the representation of each token in a geometric way, similar to how relationships in physical space can be understood through angles and distances. This provides a more natural way to represent sequential information. Sinusoidal embeddings may not be effective over long distances or in complex sequences where the positional relationship is key to interpretation. RoPE’s rotation-based method maintains relative positional information, making it particularly adept at handling such challenges.

For a given token at position pos, with its embedding vector x_pos, RoPE defines a rotation matrix based on the token’s position that encodes the positional information:

RoPE(x_{pos}) = x_{pos} \cos(\theta_{pos}) + \hat{x}_{pos} \sin(\theta_{pos})

Here, x_pos is the original embedding vector, and x̂_pos is x_pos with each pair of dimensions rotated by 90° (for a 2D pair, (x_1, x_2) becomes (−x_2, x_1)), so that the combination above rotates every pair by the angle θ_pos in the embedding space. This angle is a predetermined function of the position, ensuring a unique rotation for each token.

To illustrate the process more concretely, let’s walk through how RoPE would encode a two-token sequence in a model where embeddings are two-dimensional:

  1. Initial Embeddings: Start with the initial embeddings for tokens A and B, x_A and x_B, in a 2D space.

  2. Positional Rotation: For each token, calculate rotation angles θ_A and θ_B based on their respective positions in the sequence.

  3. Apply Rotation: Rotate each token’s embedding vector by its corresponding angle. This is done through element-wise multiplication by cosine and sine of the angles.

  4. Rotated Embeddings: The rotated vectors for A and B now carry both the original content and relative positional information, enabling the model to understand the sequence order (a code sketch of these steps follows this list).
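
The walkthrough above can be written out directly for the two-dimensional case. The sketch below uses arbitrary example vectors and a hypothetical base angle theta; it rotates each token by position × theta and shows that the dot product between the rotated vectors depends only on how far apart the two tokens are, not on their absolute positions:

import torch

def rotate_2d(x, angle):
    # Rotate a 2D vector counterclockwise by `angle` (the single RoPE pair case)
    cos_a, sin_a = torch.cos(angle), torch.sin(angle)
    return torch.stack([x[0] * cos_a - x[1] * sin_a,
                        x[0] * sin_a + x[1] * cos_a])

theta = torch.tensor(0.5)       # hypothetical base angle per position step
x_a = torch.tensor([1.0, 0.0])  # token A
x_b = torch.tensor([0.0, 1.0])  # token B

# Token A at position 0, token B at position 1
print(torch.dot(rotate_2d(x_a, 0 * theta), rotate_2d(x_b, 1 * theta)))
# Same pair shifted to positions 5 and 6: the score is (numerically) the same
print(torch.dot(rotate_2d(x_a, 5 * theta), rotate_2d(x_b, 6 * theta)))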

Visualizing RoPE

  • Initial Token Positions: The starting points for Token A and Token B.
  • Token Rotations: Reflecting the positional encoding process.
  • Relative Positioning: The relative distance between Token A and B is preserved post-rotation, signifying the model’s understanding of their order and spacing.

Advantages of RoPE

  • Long-Range Context: RoPE adeptly captures relationships between tokens across lengthy sequences, a challenge for conventional positional embeddings.
  • Rotation Invariance: By design, RoPE maintains effectiveness irrespective of sequence length, addressing a limitation of sine-cosine embeddings.
  • Interpretability: The rotational approach offers an intuitive geometric interpretation of how positional information influences the attention mechanism.

Code Implementation

import torch

def rotary_position_embedding(max_seq_len, dim):
    # Calculate the angle rates based on dimension indices.
    angle_rates = 1 / torch.pow(10000, torch.arange(0, dim, 2).float() / dim)
    # Calculate the angles for each position for half of the dimensions (sine and cosine)
    angles = (torch.arange(max_seq_len).unsqueeze(1) * angle_rates.unsqueeze(0))
    # Cosines and sines of the angles to get the RoPE for each position
    position_encodings = torch.stack((angles.cos(), angles.sin()), dim=2).flatten(1)
    return position_encodings

def apply_rope_embeddings(embeddings, position_encodings):
    # Split the position encodings into cosines and sines
    cos_enc, sin_enc = position_encodings[..., 0::2], position_encodings[..., 1::2]
    # Read the even/odd channels before writing, so the second rotation line
    # does not use values that the first line has already overwritten
    x_even, x_odd = embeddings[..., 0::2], embeddings[..., 1::2]
    rotated = torch.empty_like(embeddings)  # write into a fresh tensor, leaving the input untouched
    rotated[..., 0::2] = x_even * cos_enc - x_odd * sin_enc
    rotated[..., 1::2] = x_odd * cos_enc + x_even * sin_enc
    return rotated

batch_size, max_seq_len, dim = 1, 128, 512  # Dimensions for batch size, sequence length, and embedding dimension

# Initialize random embeddings simulating a batch of token embeddings
token_embeddings = torch.randn(batch_size, max_seq_len, dim)

# Generate the position encodings for the sequence
position_encodings = rotary_position_embedding(max_seq_len, dim)

# Apply the RoPE to the token embeddings
rotated_token_embeddings = apply_rope_embeddings(token_embeddings, position_encodings)
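
Because each position’s encoding is applied as a pure rotation, the dot product between two rotated vectors depends only on the relative offset between their positions. A minimal check using the functions above, placing the same pair of vectors at two different absolute positions with the same offset:

q, k = torch.randn(dim), torch.randn(dim)
seq_q = q.expand(1, max_seq_len, dim).clone()  # the same q at every position
seq_k = k.expand(1, max_seq_len, dim).clone()  # the same k at every position
rot_q = apply_rope_embeddings(seq_q, position_encodings)
rot_k = apply_rope_embeddings(seq_k, position_encodings)

score_near = torch.dot(rot_q[0, 0], rot_k[0, 5])    # positions (0, 5), offset 5
score_far = torch.dot(rot_q[0, 10], rot_k[0, 15])   # positions (10, 15), offset 5
print(torch.allclose(score_near, score_far, atol=1e-4))  # expected: True (up to float error)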

The latest LLMs, such as Mistral and Llama, incorporate RoPE in their architectures. Understanding both the theoretical concepts and the practical code implementation will help you grasp these advanced models. I hope you found this blog informative and enjoyable. If you liked it, follow me on LinkedIn for more posts like this.

References

  1. Vaswani, Ashish, et al. “Attention Is All You Need.” arXiv, 2017, arXiv:1706.03762. Accessed 5 Apr. 2024.

  2. Su, Jianlin, et al. “RoFormer: Enhanced Transformer with Rotary Position Embedding.” arXiv, 2021, arXiv:2104.09864. Accessed 5 Apr. 2024.