A Visual Guide to Mamba and State Space Models
An Alternative to Transformers for Language Modeling
The Transformer architecture has been a major component in the success of Large Language Models (LLMs). It has been used for nearly all LLMs in use today, from open-source models like Mistral to closed-source models like ChatGPT.
To further improve LLMs, new architectures are developed that might even outperform the Transformer architecture. One of these methods is Mamba, a State Space Model.
Mamba was proposed in the paper Mamba: Linear-Time Sequence Modeling with Selective State Spaces.[1] You can find its official implementation and model checkpoints in its repository.
In this post, I will introduce the field of State Space Models in the context of language modeling and explore the concepts one by one to develop an intuition about the field. Then, we will cover how Mamba might challenge the Transformer architecture.
As a visual guide, expect many visualizations to develop an intuition about Mamba and State Space Models!
Table of Contents
Part 1: The Problem with Transformers
Part 2: The State Space Model (SSM)
Part 3: Mamba - A Selective SSM
Part 1: The Problem with Transformers
To illustrate why Mamba is such an interesting architecture, let’s first do a short recap of Transformers and explore one of their disadvantages.
A Transformer sees any textual input as a sequence that consists of tokens.
A major benefit of Transformers is that whatever input it receives, it can look back at any of the earlier tokens in the sequence to derive its representation.
The Core Components of Transformers
Remember that a Transformer consists of two structures: a set of encoder blocks for representing text and a set of decoder blocks for generating text. Together, these structures can be used for several tasks, including translation.
We can adopt this structure to create generative models by using only decoders. This Transformer-based model, the Generative Pre-trained Transformer (GPT), uses decoder blocks to complete some input text.
Let’s take a look at how that works!
A Blessing with Training…
A single decoder block consists of two main components: masked self-attention followed by a feed-forward neural network.
Self-attention is a major reason why these models work so well. It enables an uncompressed view of the entire sequence with fast training.
So how does it work?
It creates a matrix comparing each token with every token that came before. The weights in the matrix are determined by how relevant the token pairs are to one another.
During training, this matrix is created in one go. The attention between “My” and “name” does not need to be calculated first before we calculate the attention between “name” and “is”.
It enables parallelization, which speeds up training tremendously!
And the Curse with Inference!
There is a flaw, however. When generating the next token, we need to re-calculate the attention for the entire sequence, even if we already generated some tokens.
Generating tokens for a sequence of length L needs roughly L² computations, which becomes costly as the sequence length increases (doubling the length roughly quadruples the work).
This need to recalculate the entire sequence is a major bottleneck of the Transformer architecture.
Let’s look at how a “classic” technique, Recurrent Neural Networks, solves this problem of slow inference.
Are RNNs a Solution?
A Recurrent Neural Network (RNN) is a sequence-based network. It takes two inputs at each time step in a sequence, namely the input at time step t and the hidden state of the previous time step t-1, to generate the next hidden state and predict the output.
RNNs have a looping mechanism that allows them to pass information from a previous step to the next. We can “unfold” this visualization to make it more explicit.
When generating the output, the RNN only needs to consider the previous hidden state and the current input. This avoids recalculating all previous hidden states, which is what a Transformer would do.
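As a minimal sketch of this recurrence (a hypothetical vanilla RNN cell with made-up dimensions, not the exact cell used in the figures):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, W_y):
    """One RNN step: mix the current input with the previous hidden state."""
    h_t = np.tanh(W_x @ x_t + W_h @ h_prev)  # new hidden state
    y_t = W_y @ h_t                          # prediction for this time step
    return h_t, y_t

# Hypothetical sizes: 8-dimensional inputs, 16-dimensional hidden state.
rng = np.random.default_rng(0)
W_x, W_h, W_y = rng.normal(size=(16, 8)), rng.normal(size=(16, 16)), rng.normal(size=(8, 16))

h = np.zeros(16)
for x in rng.normal(size=(5, 8)):  # a toy sequence of 5 "tokens"
    h, y = rnn_step(x, h, W_x, W_h, W_y)
```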
In other words, RNNs can do inference quickly, since it scales linearly with the sequence length! In theory, they can even have an infinite context length.
To illustrate, let’s apply the RNN to the input text we have used before.
Each hidden state is the aggregation of all previous hidden states and is typically a compressed view.
There is a problem, however…
Notice that the last hidden state, when producing the name “Maarten”, does not contain information about the word “Hello” anymore. RNNs tend to forget information over time since they only consider one previous state.
This sequential nature of RNNs creates another problem. Training cannot be done in parallel since it needs to go through each step one at a time.
The problem with RNNs, compared to Transformers, is completely the opposite! Their inference is incredibly fast, but training is not parallelizable.
Can we somehow find an architecture that does parallelize training like Transformers whilst still performing inference that scales linearly with sequence length?
Yes! This is what Mamba offers but before diving into its architecture, let’s explore the world of State Space Models first.
Part 2: The State Space Model (SSM)
A State Space Model (SSM), like the Transformer and RNN, processes sequences of information, like text but also signals. In this section, we will go through the basics of SSMs and how they relate to textual data.
What is a State Space?
A State Space contains the minimum number of variables that fully describe a system. It is a way to mathematically represent a problem by defining a system's possible states.
Let’s simplify this a bit. Imagine we are navigating through a maze. The “state space” is the map of all possible locations (states). Each point represents a unique position in the maze with specific details, like how far you are from the exit.
The “state space representation” is a simplified description of this map. It shows where you are (current state), where you can go next (possible future states), and what changes take you to the next state (going right or left).
Although State Space Models use equations and matrices to track this behavior, it is simply a way to track where you are, where you can go, and how you can get there.
The variables that describe a state, in our example the X and Y coordinates, as well as the distance to the exit, can be represented as “state vectors”.
Sounds familiar? That is because embeddings or vectors in language models are also frequently used to describe the “state” of an input sequence. For instance, a vector of your current position (state vector) could look a bit like this:
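(The original post shows such a vector as a figure; the values below are purely hypothetical.)

$$
h = \begin{bmatrix} x\text{-position} \\ y\text{-position} \\ \text{distance to exit} \end{bmatrix} = \begin{bmatrix} 2 \\ 4 \\ 5 \end{bmatrix}
$$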
In terms of neural networks, the “state” of a system is typically its hidden state, which, in the context of Large Language Models, is one of the most important aspects of generating a new token.
What is a State Space Model?
SSMs are models used to describe these state representations and make predictions of what their next state could be depending on some input.
Traditionally, at time t, SSMs:
map an input sequence x(t) — (e.g., moved left and down in the maze)
to a latent state representation h(t) — (e.g., distance to exit and x/y coordinates)
and derive a predicted output sequence y(t) — (e.g., move left again to reach the exit sooner)
However, instead of using discrete sequences (like moving left once), it takes a continuous sequence as input and predicts the output sequence.
SSMs assume that dynamic systems, such as an object moving in 3D space, can be predicted from their state at time t through two equations.
By solving these equations, we assume that we can uncover the statistical principles to predict the state of a system based on observed data (input sequence and previous state).
Its goal is to find this state representation h(t) such that we can go from an input sequence to an output sequence.
These two equations are the core of the State Space Model.
The two equations will be referenced throughout this guide. To make them a bit more intuitive, they are color-coded so you can quickly reference them.
The state equation describes how the state changes (through matrix A) based on how the input influences the state (through matrix B).
As we saw before, h(t) refers to our latent state representation at any given time t, and x(t) refers to some input.
The output equation describes how the state is translated to the output (through matrix C) and how the input influences the output (through matrix D).
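Written out in standard state space form (a reconstruction of the formulas shown in the figures, using the matrices described above):

$$
\begin{aligned}
h'(t) &= \mathbf{A}\,h(t) + \mathbf{B}\,x(t) \\
y(t) &= \mathbf{C}\,h(t) + \mathbf{D}\,x(t)
\end{aligned}
$$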
NOTE: Matrices A, B, C, and D are also commonly referred to as parameters since they are learnable.
Visualizing these two equations gives us the following architecture:
Let’s go through the general technique step-by-step to understand how these matrices influence the learning process.
Assume we have some input signal x(t); this signal first gets multiplied by matrix B, which describes how the inputs influence the system.
The updated state (akin to the hidden state of a neural network) is a latent space that contains the core “knowledge” of the environment. We multiply the state with matrix A which describes how all the internal states are connected as they represent the underlying dynamics of the system.
As you might have noticed, matrix A is applied before creating the state representations and is updated after the state representation has been updated.
Then, we use matrix C to describe how the state can be translated to an output.
Finally, we can make use of matrix D to provide a direct signal from the input to the output. This is also often referred to as a skip-connection.
Since matrix D is similar to a skip-connection, the SSM is often regarded as the following without the skip-connection.
Going back to our simplified perspective, we can now focus on matrices A, B, and C as the core of the SSM.
Together, these two equations aim to predict the state of a system from observed data. Since the input is expected to be continuous, the main representation of the SSM is a continuous-time representation.
From a Continuous to a Discrete Signal
Finding the state representation h(t) is analytically challenging if you have a continuous signal. Moreover, since we generally have a discrete input (like a textual sequence), we want to discretize the model.
To do so, we make use of the Zero-order hold technique. It works as follows. First, every time we receive a discrete signal, we hold its value until we receive a new discrete signal. This process creates a continuous signal the SSM can use:
How long we hold the value is represented by a new learnable parameter, called the step size ∆. It represents the resolution of the input.
Now that we have a continuous signal for our input, we can generate a continuous output and only sample the values according to the time steps of the input.
These sampled values are our discretized output!
Mathematically, we can apply the Zero-order hold as follows:
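In formula form, this is the standard zero-order hold discretization (the same formulation used in the S4 and Mamba papers):

$$
\bar{\mathbf{A}} = \exp(\Delta\mathbf{A}), \qquad \bar{\mathbf{B}} = (\Delta\mathbf{A})^{-1}\left(\exp(\Delta\mathbf{A}) - \mathbf{I}\right)\Delta\mathbf{B}
$$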
Together, they allow us to go from a continuous SSM to a discrete SSM represented by a formulation that instead of a function-to-function, x(t) → y(t), is now a sequence-to-sequence, xₖ → yₖ:
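Reconstructed from the figure, the discrete SSM then reads:

$$
\begin{aligned}
h_k &= \bar{\mathbf{A}}\,h_{k-1} + \bar{\mathbf{B}}\,x_k \\
y_k &= \mathbf{C}\,h_k
\end{aligned}
$$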
Here, matrices A and B now represent discretized parameters of the model.
We use k instead of t to represent discretized timesteps and to make it a bit clearer when we refer to a continuous versus a discrete SSM.
NOTE: We still save the continuous form of matrix A, not the discretized version; during training, the continuous representation is discretized.
Now that we have a formulation of a discrete representation, let’s explore how we can actually compute the model.
The Recurrent Representation
Our discretized SSM allows us to formulate the problem in specific timesteps instead of continuous signals. A recurrent approach, as we saw before with RNNs, is quite useful here.
If we consider discrete timesteps instead of a continuous signal, we can reformulate the problem with timesteps:
At each timestep, we calculate how the current input (Bxₖ) influences the previous state (Ahₖ₋₁) and then calculate the predicted output (Chₖ).
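As a minimal sketch of this recurrent computation (assuming the discretized matrices A_bar, B_bar, and C are already given; the sizes are made up for illustration):

```python
import numpy as np

def recurrent_ssm(x, A_bar, B_bar, C):
    """Sequentially compute y_k = C h_k with h_k = A_bar h_{k-1} + B_bar x_k."""
    h = np.zeros(A_bar.shape[0])
    outputs = []
    for x_k in x:                    # one step per element of the input sequence
        h = A_bar @ h + B_bar * x_k  # state update (scalar input x_k)
        outputs.append(C @ h)        # project the state to an output
    return np.array(outputs)

# Toy example: state size N=4, a scalar input signal of length 6.
rng = np.random.default_rng(0)
A_bar = 0.1 * rng.normal(size=(4, 4))
B_bar, C = rng.normal(size=4), rng.normal(size=4)
y = recurrent_ssm(rng.normal(size=6), A_bar, B_bar, C)
```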
This representation might already seem a bit familiar! We can approach it the same way we did with the RNN, as we saw before.
Which we can unfold (or unroll) as such:
Notice how we can compute this discretized version using the underlying methodology of an RNN.
This technique gives us both the advantages and disadvantages of an RNN, namely fast inference and slow training.
The Convolution Representation
Another representation that we can use for SSMs is that of convolutions. Remember classic image recognition tasks where we applied filters (kernels) to derive aggregate features:
Since we are dealing with text and not images, we need a 1-dimensional perspective instead:
The kernel that we use to represent this “filter” is derived from the SSM formulation:
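Unrolling the recurrence makes this kernel explicit (a reconstruction of the formula shown in the figure); for a sequence of length L:

$$
\bar{\mathbf{K}} = \left(\mathbf{C}\bar{\mathbf{B}},\ \mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}},\ \mathbf{C}\bar{\mathbf{A}}^{2}\bar{\mathbf{B}},\ \dots,\ \mathbf{C}\bar{\mathbf{A}}^{L-1}\bar{\mathbf{B}}\right), \qquad y = x * \bar{\mathbf{K}}
$$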
Let’s explore how this kernel works in practice. Like convolution, we can use our SSM kernel to go over each set of tokens and calculate the output:
This also illustrates the effect padding might have on the output. I changed the order of padding to improve the visualization, but we often apply it at the end of a sentence.
In the next step, the kernel is moved over once to perform the next step in the calculation:
In the final step, we can see the full effect of the kernel:
A major benefit of representing the SSM as a convolution is that it can be trained in parallel like Convolutional Neural Networks (CNNs). However, due to the fixed kernel size, their inference is not as fast and unbounded as that of RNNs.
The Three Representations
These three representations, continuous, recurrent, and convolutional, all have different sets of advantages and disadvantages:
Interestingly, we now have efficient inference with the recurrent SSM and parallelizable training with the convolutional SSM.
With these representations, there is a neat trick that we can use, namely choosing a representation depending on the task. During training, we use the convolutional representation, which can be parallelized, and during inference, we use the efficient recurrent representation:
This model is referred to as the Linear State-Space Layer (LSSL).[2]
These representations share an important property, namely that of Linear Time Invariance (LTI). LTI states that the SSM’s parameters A, B, and C are fixed for all timesteps. This means that matrices A, B, and C are the same for every token the SSM generates.
In other words, regardless of what sequence you give the SSM, the values of A, B, and C remain the same. We have a static representation that is not content-aware.
Before we explore how Mamba addresses this issue, let’s explore the final piece of the puzzle, matrix A.
The Importance of Matrix A
Arguably one of the most important aspects of the SSM formulation is matrix A. As we saw before with the recurrent representation, it captures information about the previous state to build the new state.
In essence, matrix A produces the hidden state:
Creating matrix A can therefore be the difference between remembering only a few previous tokens and capturing every token we have seen thus far, especially in the context of the recurrent representation, since it only looks back at the previous state.
So how can we create matrix A in a way that retains a large memory (context size)?
We use Hungry Hungry Hippo! Or HiPPO,[3] for High-order Polynomial Projection Operators. HiPPO attempts to compress all input signals it has seen thus far into a vector of coefficients.
It uses matrix A to build a state representation that captures recent tokens well and decays older tokens. Its formula can be represented as follows:
Assuming we have a square matrix A, this gives us:
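A sketch of how such a matrix can be built (following the HiPPO-LegS construction used in the Annotated S4; treat it as one possible initialization rather than the only one):

```python
import numpy as np

def make_hippo(N):
    """HiPPO-LegS matrix: weighs recent inputs strongly and decays older ones."""
    A = np.zeros((N, N))
    for n in range(N):
        for k in range(N):
            if n > k:
                A[n, k] = np.sqrt(2 * n + 1) * np.sqrt(2 * k + 1)
            elif n == k:
                A[n, k] = n + 1
    return -A  # negated so that the continuous-time dynamics stay stable

print(make_hippo(4))
```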
Building matrix A using HiPPO was shown to be much better than initializing it as a random matrix. As a result, it more accurately reconstructs newer signals (recent tokens) compared to older signals (initial tokens).
The idea behind the HiPPO matrix is that it produces a hidden state that memorizes its history.
Mathematically, it does so by tracking the coefficients of a Legendre polynomial, which allows it to approximate all of the previous history.[4]
HiPPO was then applied to the recurrent and convolution representations that we saw before to handle long-range dependencies. The result was Structured State Space for Sequences (S4), a class of SSMs that can efficiently handle long sequences.[5]
It consists of three parts:
State Space Models
HiPPO for handling long-range dependencies
Discretization for creating recurrent and convolution representations
This class of SSMs has several benefits depending on the representation you choose (recurrent vs. convolution). It can also handle long sequences of text and store memory efficiently by building upon the HiPPO matrix.
NOTE: If you want to dive into more of the technical details on how to calculate the HiPPO matrix and build an S4 model yourself, I would HIGHLY advise going through the Annotated S4.
Part 3: Mamba - A Selective SSM
We finally have covered all the fundamentals necessary to understand what makes Mamba special. State Space Models can be used to model textual sequences but still have a set of disadvantages we want to prevent.
In this section, we will go through Mamba’s two main contributions:
A selective scan algorithm, which allows the model to filter (ir)relevant information
A hardware-aware algorithm that allows for efficient storage of (intermediate) results through parallel scan, kernel fusion, and recomputation
Together, they create the selective SSM or S6 model, which can be used, like self-attention, to create Mamba blocks.
Before exploring the two main contributions, let’s first explore why they are necessary.
What Problem does it attempt to Solve?
State Space Models, and even the S4 (Structured State Space Model), perform poorly on certain tasks that are vital in language modeling and generation, namely the ability to focus on or ignore particular inputs.
We can illustrate this with two synthetic tasks, namely selective copying and induction heads.
In the selective copying task, the goal of the SSM is to copy parts of the input and output them in order:
However, a (recurrent/convolutional) SSM performs poorly in this task since it is Linear Time Invariant. As we saw before, the matrices A, B, and C are the same for every token the SSM generates.
As a result, an SSM cannot perform content-aware reasoning since it treats each token equally due to the fixed A, B, and C matrices. This is a problem, as we want the SSM to reason about the input (prompt).
The second task an SSM performs poorly on is induction heads, where the goal is to reproduce patterns found in the input:
In the above example, we are essentially performing one-shot prompting where we attempt to “teach” the model to provide an “A:” response after every “Q:”. However, since SSMs are time-invariant, they cannot select which previous tokens to recall from their history.
Let’s illustrate this by focusing on matrix B. Regardless of what the input x is, matrix B remains exactly the same and is therefore independent of x:
Likewise, A and C also remain fixed regardless of the input. This demonstrates the static nature of the SSMs we have seen thus far.
In comparison, these tasks are relatively easy for Transformers since they dynamically change their attention based on the input sequence. They can selectively “look” or “attend” at different parts of the sequence.
The poor performance of SSMs on these tasks illustrates the underlying problem with time-invariant SSMs: the static nature of matrices A, B, and C results in problems with content-awareness.
Selectively Retain Information
The recurrent representation of an SSM creates a small state that is quite efficient as it compresses the entire history. However, compared to a Transformer model which does no compression of the history (through the attention matrix), it is much less powerful.
Mamba aims to have the best of both worlds. A small state that is as powerful as the state of a Transformer:
As teased above, it does so by compressing data selectively into the state. When you have an input sentence, there is often information, like stop words, that does not have much meaning.
To selectively compress information, we need the parameters to be dependent on the input. To do so, let’s first explore the dimensions of the input and output in an SSM during training:
In a Structured State Space Model (S4), the matrices A, B, and C are independent of the input since their dimensions N and D are static and do not change.
Instead, Mamba makes matrices B and C, and even the step size ∆, dependent on the input by incorporating the sequence length and batch size of the input:
This means that for every input token, we now have different B and C matrices, which solves the problem of content-awareness!
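A rough sketch of this selection mechanism (the layer names, sizes, and the exact way ∆ is parameterized here are my own simplifications, not the official Mamba code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Selection(nn.Module):
    """Make B, C, and the step size delta functions of the input x."""
    def __init__(self, d_model=64, d_state=16):
        super().__init__()
        self.to_B = nn.Linear(d_model, d_state)       # input-dependent B
        self.to_C = nn.Linear(d_model, d_state)       # input-dependent C
        self.to_delta = nn.Linear(d_model, d_model)   # input-dependent step size

    def forward(self, x):                     # x: (batch, length, d_model)
        B = self.to_B(x)                      # (batch, length, d_state)
        C = self.to_C(x)                      # (batch, length, d_state)
        delta = F.softplus(self.to_delta(x))  # positive step size per token
        return B, C, delta

B, C, delta = Selection()(torch.randn(2, 10, 64))
```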
NOTE: Matrix A remains the same since we want the state itself to remain static but the way it is influenced (through B and C) to be dynamic.
Together, they selectively choose what to keep in the hidden state and what to ignore since they are now dependent on the input.
A smaller step size ∆ results in ignoring specific words and instead relying more on the previous context, whilst a larger step size ∆ focuses more on the input words than on the context:
The Scan Operation
Since these matrices are now dynamic, they cannot be calculated using the convolution representation, which assumes a fixed kernel. We can only use the recurrent representation, and we lose the parallelization the convolution provides.
To enable parallelization, let’s explore how we compute the output with recurrence:
Each state is the sum of the previous state (multiplied by A) and the current input (multiplied by B). This is called a scan operation and can easily be calculated with a for loop.
Parallelization, in contrast, seems impossible since each state can only be calculated if we have the previous state. Mamba, however, makes this possible through the parallel scan algorithm.
It assumes that, thanks to the associative property, the order in which we do the operations does not matter. As a result, we can calculate the sequences in parts and iteratively combine them:
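The trick becomes clearer when each step is written as h -> a*h + b: two consecutive updates can be merged into one equivalent update, and because that merge is associative, chunks of the sequence can be combined independently. A simplified scalar sketch (not Mamba's actual GPU kernel):

```python
import numpy as np
from functools import reduce

def combine(u, v):
    """Merge two updates h -> a*h + b into one; this operator is associative."""
    a1, b1 = u
    a2, b2 = v
    return a2 * a1, a2 * b1 + b2

# Toy scalar SSM steps: h_k = a_k * h_{k-1} + b_k
rng = np.random.default_rng(0)
steps = [(rng.uniform(0.5, 1.0), rng.normal()) for _ in range(8)]

# Sequential scan (the for-loop view), starting from h = 0.
h = 0.0
for a, b in steps:
    h = a * h + b

# The same result via two independent halves that are merged at the end,
# which is what allows the work to be spread across parallel units.
left = reduce(combine, steps[:4])
right = reduce(combine, steps[4:])
a_tot, b_tot = combine(left, right)
assert np.isclose(h, a_tot * 0.0 + b_tot)
```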
Together, dynamic matrices B and C, and the parallel scan algorithm create the selective scan algorithm to represent the dynamic and fast nature of using the recurrent representation.
Hardware-aware Algorithm
A disadvantage of recent GPUs is their limited transfer (IO) speed between their small but highly efficient SRAM and their large but slightly less efficient DRAM. Frequently copying information between SRAM and DRAM becomes a bottleneck.
Mamba, like Flash Attention, attempts to limit the number of times we need to go from DRAM to SRAM and vice versa. It does so through kernel fusion, which allows the model to avoid writing intermediate results and to keep performing computations until it is done.
We can view the specific instances of DRAM and SRAM allocation by visualizing Mamba’s base architecture:
Here, the following are fused into one kernel:
The discretization step with step size ∆
The selective scan algorithm
Multiplication with C
The last piece of the hardware-aware algorithm is recomputation.
The intermediate states are not saved but are necessary for the backward pass to compute the gradients. Instead, the authors recompute those intermediate states during the backward pass.
Although this might seem inefficient, it is much less costly than reading all those intermediate states from the relatively slow DRAM.
We have now covered all components of its architecture, which is depicted in the following image from the paper:
The selective SSM. Taken from: Gu, Albert, and Tri Dao. "Mamba: Linear-time sequence modeling with selective state spaces." arXiv preprint arXiv:2312.00752 (2023).
This architecture is often referred to as a selective SSM or S6 model since it is essentially an S4 model computed with the selective scan algorithm.
The Mamba Block
The selective SSM that we have explored thus far can be implemented as a block, the same way we can represent self-attention in a decoder block.
Like the decoder, we can stack multiple Mamba blocks and use their output as the input for the next Mamba block:
It starts with a linear projection to expand upon the input embeddings. Then, a convolution before the Selective SSM is applied to prevent independent token calculations.
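A simplified sketch of that block structure (projection, short causal convolution, SSM, gating, and projection back; the names and sizes are my own and the selective SSM itself is stubbed out, so this is an outline rather than the official implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlockSketch(nn.Module):
    def __init__(self, d_model=64, d_inner=128, d_conv=4):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_inner)  # expand: SSM branch + gate branch
        self.conv = nn.Conv1d(d_inner, d_inner, d_conv,
                              groups=d_inner, padding=d_conv - 1)  # mixes nearby tokens
        self.out_proj = nn.Linear(d_inner, d_model)

    def selective_ssm(self, u):
        # Placeholder for the selective SSM (recurrence with input-dependent B, C, delta).
        return u

    def forward(self, x):                                # x: (batch, length, d_model)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        u = self.conv(u.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)  # causal conv
        u = F.silu(u)
        y = self.selective_ssm(u) * F.silu(gate)         # gated SSM branch
        return self.out_proj(y)

out = MambaBlockSketch()(torch.randn(2, 10, 64))         # -> (2, 10, 64)
```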
The Selective SSM has the following properties:
Recurrent SSM created through discretization
HiPPO initialization on matrix A to capture long-range dependencies
Selective scan algorithm to selectively compress information
Hardware-aware algorithm to speed up computation
We can expand on this architecture a bit more by looking at the code implementation and exploring what an end-to-end example would look like:
Notice some changes, like the inclusion of normalization layers and softmax for choosing the output token.
When we put everything together, we get both fast inference and fast training, and even an unbounded context!
Using this architecture, the authors found that it matches and sometimes even exceeds the performance of Transformer models of the same size!
Conclusion
This concludes our journey through State Space Models and the incredible Mamba architecture using a selective State Space Model. Hopefully, this post gives you a better understanding of the potential of State Space Models, particularly Mamba. Who knows whether it will replace Transformers, but for now, it is incredible to see such different architectures getting well-deserved attention!
Resources
Hopefully, this was an accessible introduction to Mamba and State Space Models. If you want to go deeper, I would suggest the following resources:
The Annotated S4 is a JAX implementation and guide through the S4 model and is highly advised!
A great YouTube video introducing Mamba by building it up through foundational papers.
The Mamba repository with checkpoints on Hugging Face.
An amazing series of blog posts (1, 2, 3) that introduces the S4 model.
The Mamba No. 5 (A Little Bit Of...) blog post is a great next step to dive into more technical details about Mamba, but still from an amazingly intuitive perspective.
And of course, the Mamba paper! It was even used for DNA modeling and speech generation.
References
1. Gu, Albert, and Tri Dao. "Mamba: Linear-time sequence modeling with selective state spaces." arXiv preprint arXiv:2312.00752 (2023).
2. Gu, Albert, et al. "Combining recurrent, convolutional, and continuous-time models with linear state space layers." Advances in Neural Information Processing Systems 34 (2021): 572-585.
3. Gu, Albert, et al. "HiPPO: Recurrent memory with optimal polynomial projections." Advances in Neural Information Processing Systems 33 (2020): 1474-1487.
4. Voelker, Aaron, Ivana Kajić, and Chris Eliasmith. "Legendre memory units: Continuous-time representation in recurrent neural networks." Advances in Neural Information Processing Systems 32 (2019).
5. Gu, Albert, Karan Goel, and Christopher Ré. "Efficiently modeling long sequences with structured state spaces." arXiv preprint arXiv:2111.00396 (2021).