Mixture-of-Experts with Expert Choice Routing
Abstract
Sparsely-activated Mixture-of-Experts (MoE) models allow the number of parameters to greatly increase while keeping the amount of computation for a given token or a given sample unchanged. However, a poor expert routing strategy can cause certain experts to be under-trained, leading to an expert being under- or over-specialized. Prior work allocates a fixed number of experts to each token using a top-$k$ function regardless of the relative importance of different tokens. To address this, we propose a heterogeneous mixture-of-experts employing an expert choice method. Instead of letting tokens select the top-$k$ experts, we have experts selecting the top-$k$ tokens. As a result, each token can be routed to a variable number of experts and each expert can have a fixed bucket size. We systematically study pre-training speedups using the same computational resources of the Switch Transformer top-1 and GShard top-2 gating of prior work and find that our method improves training convergence time by more than 2×. For the same computational cost, our method demonstrates higher performance in fine-tuning 11 selected tasks in the GLUE and SuperGLUE benchmarks. For a smaller activation cost, our method outperforms the T5 dense model in 7 out of the 11 tasks.
1 Introduction
Scaling up model capacity, dataset size, and training time has demonstrated huge success in enhancing the performance of computer vision architectures [13, 14, 11, 4] as well as neural language models [26, 2, 20, 27].
The final model quality has been found to have a power-law relationship with the amount of data, model size, and compute time [16, 20]. However, training efficiency, defined as the total amount of computation used to achieve a model quality superior to that of the state-of-the-art system [21], should receive greater attention as we increase our efforts towards green AI [29].
Sparsely gated mixture-of-experts [31] (MoE) provides an effective way to scale model capacity given a fixed computational cost, and has recently played an important role in increasing the training efficiency of large-scale language models [21, 10].
MoE models operate by adopting a number of experts, each as a sub-network, and by activating only one or a few experts for each input token.
A gating network must be chosen and optimized in order to route each token to the most suited expert(s).
For example, recent work has implemented sparse routing via k-means clustering [12], linear assignment to maximize token-expert affinities [22], or hashing [28, 8].
Much of the prior work uses a routing strategy based on token choice, where each token selects the best one or two experts.
We argue that the independent token choice of prior work often leads to an imbalanced load of experts, which causes training inefficiency and sub-optimal training of the model.
In order to mitigate this issue, previous sparsely gated networks introduce additional auxiliary losses as regularization to prevent too many tokens being routed to a single expert, but the effectiveness is still limited.
Recent approaches [22, 28, 8] explore alternative strategies for routing, but they focus on pre-training only and do not demonstrate performance gain on downstream tasks.
Moreover, none of the previous methods consider allocating a variable number of experts to each token based on importance, which can be beneficial.
We propose a very simple yet effective routing method we are calling expert choice.
Unlike conventional MoE where tokens select one or two top-scoring experts, our method lets each expert pick the top-$k$ tokens.
Our method guarantees perfect load balancing, allows a variable number of experts for each token, and achieves substantial gains in training efficiency and downstream performance as demonstrated in our experiments.
Our major contributions include:
• We identify common pitfalls in conventional MoE such as load imbalance as described in Section 3.1. We then propose a heterogeneous, expert choice method to provide a fluid allocation of model parameters based on a learnt token-to-expert importance. This method intrinsically guarantees load balance without imposing an auxiliary loss.
• We show our method provides over 2× faster training convergence in an 8B/64E (8 billion activated parameters, 64 experts) model, compared to the top-1 and top-2 gating counterparts in Switch Transformer [10] and GShard [21].
• We show our method demonstrates strong scaling when increasing the number of experts from 16 to 128, evaluated in training perplexity.
• We show our method demonstrates strong performance on downstream tasks selected from GLUE and SuperGLUE at all the evaluated scales. More specifically, our 8B/64E model outperforms a T5 11B dense model in 7 out of the 11 tasks evaluated.
2 Related Work
Scaling: Various approaches have been proposed to scale up neural network capacity to improve performance. Recent works have successfully scaled models to billions of parameters via various forms of model parallelism [21, 33, 27, 26, 2]. Model parallelism [30] splits weights and tensors across multiple cores while pipeline parallelism [18, 24] splits different layers across devices with micro-batches pipelined to the different layers. To enable continued scaling of neural networks, improving model training and serving efficiency has become a critical research area.
Conditional Computation: Computation decisions can be made dynamically based on the input [25, 23]. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation, by activating certain parameters and computation on demand, on a per-example or per-token basis [3]. Conditional convolution layers [1] with task-specific gating have been used to combat catastrophic forgetting when a sequence of learning problems are optimized. The gating decisions may be binary or sparse and continuous, stochastic or deterministic.
Mixture of Experts: Sparsely-gated MoE [31] is the first model to demonstrate massive improvements in model capacity, training time, or model quality with gating. Switch Transformer [10] simplifies the gating by selecting only the top expert per token using a softmax over the hidden state and demonstrates better scaling than previous work. All the prior work requires an auxiliary loss to explicitly encourage balancing. This loss term has to be carefully weighted to not overwhelm the primary loss. However, auxiliary loss does not guarantee balancing and a hard capacity factor has to be imposed. As a result, many tokens can still be unprocessed by the MoE layer. Hard MoE [12] with a single decoding layer can be efficiently trained to good effect on large scale hashtag prediction tasks. Base Layers [22] formulate a linear assignment that maximizes token-expert affinities while ensuring each expert receives an equal number of tokens. Hash layers [28, 8] devise hashing techniques on input tokens. However, the evaluations are limited to pre-training perplexity. THOR [Thor.Zuo.2021] randomly activates experts during training and inference and is trained with a consistency regularization loss. THOR has demonstrated strong performance on translation tasks. Different from these prior works, our method is a learnt method that enables heterogeneous MoE and effectively improves downstream fine-tuning performance.
3 Method
We first identify a few pitfalls in the routing method of conventional mixture-of-experts (MoE) models and then present our method using expert choice to tackle these problems.
3.1 Pitfalls of Token-Choice Routing
While MoE can be computationally advantageous compared to a dense model, a routing strategy must be used to assign each token to the most-suited experts.
Conventional MoE models employ token-choice routing, which independently selects the top-$k$ experts for each token [31, 21, 10].
We argue that this strategy has a few pitfalls that lead to sub-optimal training.
Load Imbalance:
Token-choice routing often leads to poor load balancing across experts. That is, some experts may be trained with most tokens, leaving the remaining experts under-utilized. Experts can be under-specialized because a lot of model capacity in the under-utilized experts is wasted. On the other hand, some tokens will not be processed, since over-utilized experts can only take a maximum number of tokens at each step in order to avoid running out of memory. Load imbalance can also hurt step latency, and thus inference time, as the step latency can be determined by the most loaded expert.
Previous methods add an auxiliary loss on load balancing to mitigate the issue.
However, this auxiliary loss does not guarantee a balanced load, especially during the important early stages of training.
Indeed, we empirically observe that the over-capacity ratio can reach 20%–40% for some experts in token choice routing, indicating that a significant portion of the tokens routed to these experts will be dropped.
Under-Specialization:
Each MoE layer uses a gating network to learn token-to-expert affinity. Ideally, the learnt gating network should produce the affinity such that similar or relevant tokens are routed to the same expert. A sub-optimal strategy can produce redundant experts and/or experts that are not sufficiently specialized.
Under-specialization may result from imposing a large auxiliary loss which favors more load-balanced but less effective routing. Finding the right balance on the auxiliary loss to promote both load balancing and specialization is challenging for token-choice routing.
Same Compute for Every Token:
Finally, in a token-choice strategy each token receives exactly $k$ experts and therefore occupies the same amount of compute.
We hypothesize that this is not necessary nor desired.
Instead, a MoE model should flexibly allocate its compute resource based on the complexity of the input. Motivated by the aforementioned observations, we next describe a simple yet effective method which produces load balanced assignments based on expert choice.
3.2 Heterogeneous MoE via Expert Choice
Different from conventional routing, an expert choice method independently selects the top-$k$ tokens for each expert, where $k$ is a fixed expert capacity (i.e. the number of tokens each expert can take).
Despite its simplicity, expert choice achieves perfect load balancing by design.
It also enables more flexible allocation of model compute since tokens can be received by a variable number of experts.
In our experiments, we set $k$ as
$$k = \frac{n \times c}{e} \qquad \text{(1)}$$
where $n$ is the total number of tokens in the input batch (such as batch size × sequence length), $c$ is the capacity factor, and $e$ is the number of experts.
The capacity factor $c$ denotes on average how many experts are utilized by a token.
Given input token representations $X \in \mathbb{R}^{n \times d}$, where $d$ is the model hidden dimension, our method produces a token-to-expert assignment denoted by three output matrices $I$, $G$ and $P$.
The matrix $I$ is an index matrix where $I[i, j]$ specifies the $j$-th selected token of the $i$-th expert.
The gating matrix $G \in \mathbb{R}^{e \times k}$ denotes the weight of the expert for the selected token, and $P \in \mathbb{R}^{e \times k \times n}$ refers to a one-hot version of $I$ that will be used to gather tokens for each expert.
These matrices are computed using a gating function,
$$S = \mathrm{Softmax}(X \cdot W_g), \quad S \in \mathbb{R}^{n \times e}$$
$$G, I = \mathrm{TopK}(S^\top, k), \quad P = \mathrm{Onehot}(I) \qquad \text{(2)}$$
where $S$ denotes the token-to-expert affinity scores, $W_g \in \mathbb{R}^{d \times e}$ denotes the expert embeddings, and $\mathrm{TopK}(\cdot, k)$ selects the $k$ largest entries for each row of $S^\top$.
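For concreteness, a minimal NumPy sketch of this gating step is given below. The function name, the dense softmax/argsort implementation, and the use of NumPy are illustrative choices of ours, not the TPU implementation used in the experiments.

```python
import numpy as np

def expert_choice_gating(X, W_g, capacity_factor=2):
    """Sketch of expert choice routing (Eq. 1 and Eq. 2).

    X:   [n, d] token representations
    W_g: [d, e] expert embeddings
    Returns G [e, k] gating weights, I [e, k] selected token indices,
    and P [e, k, n], a one-hot dispatch tensor derived from I.
    """
    n = X.shape[0]
    e = W_g.shape[1]
    k = int(n * capacity_factor / e)               # expert capacity, Eq. 1

    logits = X @ W_g                               # [n, e] affinity scores
    S = np.exp(logits - logits.max(axis=-1, keepdims=True))
    S = S / S.sum(axis=-1, keepdims=True)          # softmax over experts

    # Each expert independently picks its top-k tokens (rows of S^T).
    I = np.argsort(-S.T, axis=-1)[:, :k]           # [e, k] selected token ids
    G = np.take_along_axis(S.T, I, axis=-1)        # [e, k] gating weights
    P = np.eye(n)[I]                               # [e, k, n] one-hot of I
    return G, I, P
```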
Similar to Switch Transformer [10] and GShard [21], we apply mixture of experts and the gating function in the dense feed-forward (FFN) layer, as it is the most computationally expensive part in a Transformer-based network. The input to the gated FFN, denoted by $X_{in} \in \mathbb{R}^{e \times k \times d}$, is produced by gathering tokens with the permutation matrix $P$, i.e. $X_{in} = P \cdot X$.
Here $X_{in}[i] \in \mathbb{R}^{k \times d}$ denotes the input of the $i$-th expert.
Similarly, let $W_1$ and $W_2$ denote the parameters of the gated FFN in which $W_1[i]$ and $W_2[i]$ denote the parameter matrices of the $i$-th expert.
We compute the output of each expert $X_e[i]$ as follows,
$$\forall i: \quad X_e[i] = \mathrm{GeLU}(X_{in}[i] \cdot W_1[i]) \cdot W_2[i]^\top \qquad \text{(3)}$$
We omit the bias terms here for brevity.
The final output of the gated FFN layer $X_{out} \in \mathbb{R}^{n \times d}$ can be obtained given $X_e$, the permutation matrix $P$, and the gating matrix $G$,
$$X_{out}[l, d] = \sum_{i, j} P[i, j, l] \; G[i, j] \; X_e[i, j, d] \qquad \text{(4)}$$
Both $X_{in}$ and $X_{out}$ can be efficiently computed using Einstein summation (einsum) operations.
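Continuing the sketch above, the gather, per-expert FFN, and weighted scatter of Eq. 3 and Eq. 4 can be written with three einsums. The tensor layouts (W1 as [e, d, h] and W2 as [e, h, d]) and the tanh GELU approximation are assumptions made for illustration.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def expert_choice_ffn(X, G, P, W1, W2):
    """X: [n, d], G: [e, k], P: [e, k, n], W1: [e, d, h], W2: [e, h, d]."""
    X_in = np.einsum('ekn,nd->ekd', P, X)                            # gather tokens per expert
    X_e = np.einsum('ekh,ehd->ekd',
                    gelu(np.einsum('ekd,edh->ekh', X_in, W1)), W2)   # per-expert FFN, Eq. 3
    # scatter back to token order, weighted by the gating values (Eq. 4)
    X_out = np.einsum('ekn,ek,ekd->nd', P, G, X_e)
    return X_out
```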
3.3 Expert Choice with Additional Constraint
We also consider regularizing our expert choice routing by limiting the maximum number of experts for each token.
We are interested in whether adding this constraint improves pre-training and fine-tuning results.
More importantly, it helps analyze to what degree using a variable number of experts per token affects the model performance.
Let $A$ be a positive matrix where $A[i, j]$ represents whether the $i$-th expert selects the $j$-th token.
We solve the following entropy-regularized linear programming problem
$$\max_{A} \; \langle S^\top, A \rangle + \lambda H(A)$$
$$\text{s.t.} \quad \forall i: \sum_{j'} A[i, j'] = k; \quad \forall j: \sum_{i'} A[i', j] \le b; \quad \forall i, j: 0 \le A[i, j] \le 1$$
where $\langle \cdot, \cdot \rangle$ denotes the inner product, $H(A) = \sum_{ij} -A[i, j] \log A[i, j]$ is the sum of element-wise entropy,
and $b > 0$ is an integer that upper bounds the selection for each token.
Adding a small entropy term gives a near-integer solution while enabling a fast iterative solver we can run on TPUs.
Specifically, the solution space is the intersection of three convex sets each satisfying one of the linear constraints.
We use Dykstra's algorithm [9] that alternately projects the intermediate solution onto one of the convex sets (we use a small $\lambda$ and a maximum of 100 iterations).
After $A$ is computed, the routing indices $I$ are selected using $\mathrm{TopK}(A, k)$ instead.
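As a rough, self-contained illustration of this capped routing, the sketch below alternates simple scaling steps that push the solution toward the row-sum and column-bound constraints. It is a simplified Sinkhorn-style heuristic rather than the exact Dykstra projection solver described above, and the $\lambda$ value and iteration count are placeholders.

```python
import numpy as np

def capped_expert_choice(S, k, b, lam=0.1, iters=100):
    """Rough alternating-scaling heuristic for the capped routing problem above.

    S: [n, e] token-to-expert affinity scores; returns A [e, n] and I [e, k].
    lam and iters are placeholder settings, not the values used in the paper.
    """
    A = np.exp((S.T - S.T.max(axis=-1, keepdims=True)) / lam)  # entropy-regularized start
    for _ in range(iters):
        A *= k / (A.sum(axis=1, keepdims=True) + 1e-9)   # each expert selects k tokens in total
        col = A.sum(axis=0, keepdims=True) + 1e-9
        A *= np.minimum(1.0, b / col)                    # each token picked by at most b experts
        A = np.clip(A, 0.0, 1.0)                         # box constraint
    I = np.argsort(-A, axis=-1)[:, :k]                   # route with TopK(A, k)
    return A, I
```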
Model | Type | n_params | n_act-params | L | M | H | n_heads | d_head | E
---|---|---|---|---|---|---|---|---|---
0.1B | Dense | 130M | 130M | 12 | 768 | 3,072 | 12 | 64 | -
0.1B/16E | MoE | 548M | 145M | 12 | 768 | 3,072 | 12 | 64 | 16
0.1B/32E | MoE | 1.0B | 145M | 12 | 768 | 3,072 | 12 | 64 | 32
0.1B/64E | MoE | 1.9B | 145M | 12 | 768 | 3,072 | 12 | 64 | 64
0.1B/128E | MoE | 3.7B | 145M | 12 | 768 | 3,072 | 12 | 64 | 128
8B | Dense | 8.7B | 8.7B | 32 | 4,096 | 16,384 | 32 | 128 | -
8B/64E | MoE | 143B | 9.8B | 32 | 4,096 | 16,384 | 32 | 128 | 64
Table 1: Sizes and architectures of the MoE and dense models trained in this experiment. Models are grouped by the number of activated parameters per token. All trained models share the same learning hyperparameters, as described in Section 4.1.
3.4 Model Architecture
At the high level, we adopt the idea of sparsely activated Mixture-of-Experts (MoE) [31].
We use a Transformer architecture and replace the feed-forward component of every other Transformer layer with a MoE layer, following recent practice [21, 10].
Interleaving regular Transformer layers and MoE layers empirically improves model performance and training efficiency, probably because forcing some shared components in between MoE layers can mitigate the negative effects of skipping tokens.
Several additional modifications adopted in recent work have been applied in our experiments.
For example, we replace the standard positional embedding with per-layer relative positional bias [5]. In the non-MoE feed-forward sub-layers (only every other layer is a MoE layer), we replace the first linear projection and the activation function with the Gated Linear Unit [6], which computes the component-wise product of two linear transformations of the input, followed by a Gaussian Error Linear Unit [15] activation function.
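As an illustration of the non-MoE feed-forward sub-layer described above, a GEGLU-style sketch is shown below; the exact placement of the GELU relative to the gating product, and the function and parameter names, are our assumptions.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def glu_ffn(x, W, V, W_out):
    """x: [n, d]; W, V: [d, h]; W_out: [h, d].

    Component-wise product of two linear projections of the input, with a GELU
    on one branch (GEGLU-style), followed by the output projection.
    """
    return (gelu(x @ W) * (x @ V)) @ W_out
```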
As described earlier, each MoE layer consists of a group of independent feed-forward networks denoted as "experts". The gating function in Eq. 2 uses a softmax activation function to model a probability distribution over these experts. This distribution denotes the preference over experts of each incoming token, which is computed similarly to a conventional gating network [31, 10, 21].
During training, each MoE layer's learnable gating network described in Eq. 2 is trained to use the input to activate the best subset of experts using a top-$k$ function along the token dimension.
A "shuffle" stage and an "unshuffle" stage are inserted into the MoE layer, where the first stage gathers the tokens to their designated experts while the second stage permutes the tokens back to their original order in the input batch.
This step is formulated by the gather operation $X_{in} = P \cdot X$ in Section 3.2 and in Eq. 4.
Similar to conventional MoE methods, there are more parameters in the MoE layer.
However, the activated model size per token can be comparable to a dense layer because during training or inference, only a limited subset of experts is activated for any given token.
For instance, Switch Transformer [10] has only one activated expert while GShard [21] uses two experts per token. In our method, the number of activated experts can vary for each token but the overall computation is kept the same as the baseline architectures by fixing the capacity factor $c$ in Eq. 1.
Unless otherwise specified, we set $c = 2$ such that our method can be directly compared to the top-2 token-choice gating in GShard.
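As an illustrative example (the numbers are chosen for clarity rather than taken from our training configuration): with $n = 65{,}536$ tokens in a batch, $c = 2$, and $e = 64$ experts, Eq. 1 gives each expert a bucket size of $k = 65{,}536 \times 2 / 64 = 2{,}048$ tokens, so the total number of token-to-expert assignments matches that of top-2 token-choice gating over the same batch.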
We train several variants of our architecture at the 100M scale (i.e. 100M expert size) by increasing the number of experts to understand the scaling effects of our method.
We also train an 8B scale MoE model.
The large MoE model is partitioned with a 2D sharding algorithm as presented in GSPMD [36], which fully exploits the 2D topology of the TPU cluster [19].
Across different scales and setups, our method outperforms related work and demonstrates strong downstream task performance on selected tasks in GLUE and SuperGLUE.
4 Experiments
4.1 Setup
Table 1 summarizes the hyperparameter settings of different MoE models. As a reference point, we also include the respective dense model configurations with comparable numbers of activated parameters per token during inference. To study the effect of scaling the number of experts, we vary the number of experts while fixing the per-expert size to 100M parameters. For example, 0.1B/64E represents the architecture of an approximately 100M parameter dense model with every other layer replaced by a 64-expert MoE layer. The MoE model degenerates into a dense transformer architecture when each MoE layer only has one expert. In Table 1, n_params is the total number of trainable parameters, n_act-params represents the number of activated parameters per token, L is the total number of Transformer layers, M is the model dimension, H is the hidden dimension after the projection in each transformer layer, n_heads is the number of attention heads, and d_head is the hidden dimension of each attention head.
Dataset: We use the high-quality dataset from GLaM [7] of 1.6 trillion tokens that are representative of a wide range of natural language use cases. An in-house classifier is trained to classify between a collection of curated text and other webpages and estimate the content quality of a webpage. A high-quality filtered subset of webpages is combined with books, Wikipedia pages, conversations, forums, and news to create the final dataset. The data and mixture weights can be found in Table 3 in the GLaM paper.
Model Training: Our model training follows the setup of GLaM [7], where a maximum sequence length of 1024 tokens is adopted. We use an Adafactor optimizer [32] with first-moment and second-moment decay following the GLaM setup. We keep the learning rate constant for the first 10K training steps, and then decay it with an inverse square root schedule. Unlike most related works, we do not impose any auxiliary loss for load balance, such as described in Switch Transformer [10] and GShard [21]. We use the SentencePiece subword tokenizer with a vocabulary size of 256K. The largest model (8B/64E) is trained on 512 TPU V4 chips. We use a dropout rate of 0 during training as the number of tokens in the training data corpus is much greater than the total number of tokens seen during training.
4.2 Training Efficiency
Figure 2: (a) Training convergence with our method is more than 2× faster than GShard top-2 gating. (b) Training perplexity scales strongly with the number of experts while the expert size is kept constant; EC consistently outperforms GShard top-2 gating.
| | | 100M/128E | | | 100M/64E | | |
Name | Metric | Split | ST Top-1 | GS Top-2 | EC-CF2 | ST Top-1 | GS Top-2 | EC-CF2
---|---|---|---|---|---|---|---|---
BoolQ | acc | dev | 77.4 | 76.5 | 76.9 | 73.2 | 77.5 | 79.7
CB | acc | dev | 87.5 | 80.9 | 89.1 | 85.9 | 84.4 | 89.1
CoLA | acc | dev | 78.9 | 84.0 | 86.7 | 64.1 | 85.2 | 88.3
MNLI | acc | dev | 82.3 | 83.6 | 84.9 | 80.8 | 85.2 | 86.7
MRPC | acc | dev | 82.6 | 81.0 | 83.1 | 81.3 | 81.3 | 84.4
QNLI | acc | dev | 89.5 | 88.6 | 89.0 | 89.4 | 89.7 | 91.3
QQP | acc | dev | 90.6 | 90.3 | 90.4 | 88.9 | 90.5 | 91.0
RTE | acc | dev | 77.0 | 78.9 | 78.5 | 74.1 | 79.3 | 81.6
SST2 | acc | dev | 92.0 | 94.5 | 94.6 | 91.8 | 95.1 | 95.1
WiC | acc | dev | 67.8 | 65.5 | 68.1 | 64.4 | 67.8 | 65.6
WNLI | acc | dev | 65.6 | 70.3 | 67.2 | 68.8 | 68.8 | 71.7
Avg | - | - | 81.0 | 81.3 | 82.6 | 78.4 | 82.2 | 84.0
| | | 100M/32E | | | 8B/64E | | |
Name | Metric | Split | ST Top-1 | GS Top-2 | EC-CF2 | ST Top-1 | GS Top-2 | EC-CF2
BoolQ | acc | dev | 74.5 | 79.0 | 79.3 | 89.1 | 89.5 | 89.2
CB | acc | dev | 80.6 | 81.3 | 92.2 | 93.8 | 96.7 | 100
CoLA | acc | dev | 87.5 | 92.2 | 93.8 | 88.3 | 87.5 | 89.1
MNLI | acc | dev | 83.1 | 87.8 | 88.0 | 90.7 | 91.4 | 91.1
MRPC | acc | dev | 82.3 | 85.2 | 84.4 | 89.3 | 91.7 | 90.6
QNLI | acc | dev | 91.6 | 91.9 | 92.5 | 94.5 | 94.9 | 95.0
QQP | acc | dev | 90.1 | 91.5 | 92.0 | 92.1 | 92.5 | 93.8
RTE | acc | dev | 75.0 | 79.1 | 78.1 | 91.0 | 92.2 | 95.2
SST2 | acc | dev | 93.3 | 94.4 | 95.4 | 97.1 | 98.0 | 97.7
WiC | acc | dev | 62.5 | 65.9 | 69.8 | 74.5 | 76.4 | 83.8
WNLI | acc | dev | 65.6 | 64.1 | 68.8 | 78.1 | 82.8 | 92.8
Avg | - | - | 80.6 | 83.5 | 85.0 | 88.9 | 90.3 | 92.6
Table 2: Expert choice with capacity factor 2 (EC-CF2) outperforms the top-1 gating in Switch Transformer (ST) and the top-2 gating in GShard (GS) on GLUE and SuperGLUE tasks. Note that, with an expert size of 100M parameters, 100M/32E works best for our method and GShard top-2 while 100M/128E is better for Switch Transformer top-1. Our method consistently outperforms the others across all scales.
We first study training efficiency and convergence.
We use expert choice with a capacity factor of 2 (EC-CF2) to match the activated model size and computational cost on a per token basis in GShard top-2 gating and run both for a fixed number of steps.
The results are shown in Fig. 2 (a).
Compared to GShard top-2 gating, which shows stronger performance than Switch Transformer top-1 gating in both evaluation-set perplexity and fine-tuning on downstream tasks, EC-CF2 converges more than 2× faster during training. More specifically, EC-CF2 reaches the same perplexity as GShard top-2 in less than half the steps, and each GShard top-2 step is 20% slower than ours.
As explained in Section 3.1, the slower step time in top-2 gating is due to load imbalance where some experts can receive a lot more tokens than the desired capacity. As a result, the step latency will be bottlenecked by the most loaded expert.
4.3 Scaling the Number of Experts
As presented in Table 1, increasing the number of experts effectively increases model capacity without increasing activated model size. We scale the number of experts while fixing the expert size to 100M parameters for both expert choice (EC) and GShard (Top-2) methods and find both methods work well in terms of perplexity on the evaluation dataset during pre-training.
As demonstrated in Fig. 2 (b), having more experts consistently improves training perplexity.
4.4 Fine-tuning on GLUE and SuperGLUE
To validate whether improved perplexity directly translates to better performance in downstream tasks, we perform fine-tuning on 11 selected tasks from GLUE and SuperGLUE.
We compare three MoE methods including Switch Transformer top-1 gating (ST Top-1), GShard top-2 gating (GS Top-2) and our method (EC-CF2) that matches the activation memory size and computational cost of GS Top-2.
Indicated by the results in Table 2,
our EC-CF2 method consistently outperforms the related methods and yields more than 2% average accuracy increase in a large 8B/64E setting.
Table 3 further compares our 8B/64E model against its dense counterpart.
Again, our method achieves stronger fine-tuning results, increasing the average score by 3.4 points.
Interestingly, we observe the 100M/32E model setting works the best for both GS Top-2 and EC-CF2, even though the effective model capacity is smaller than that of 100M/64E and 100M/128E. This result indicates that a good training perplexity does not always translate to better performance of downstream tasks.
Model | BoolQ | CB | CoLA | MNLI | MRPC | QNLI | QQP | RTE | SST2 | WiC | WNLI | Avg
---|---|---|---|---|---|---|---|---|---|---|---|---|
Dense 8B | 88.2 | 100 | 86.4 | 91.3 | 86.7 | 94.7 | 91.2 | 92.2 | 97.2 | 75.6 | 78.1 | 89.2 |
EC-CF2 8B/64E | 89.2 | 100 | 89.1 | 91.1 | 90.6 | 95.0 | 93.8 | 95.2 | 97.7 | 83.8 | 92.8 | 92.6 |
Table 3: Comparison between the Dense 8B and Expert Choice (EC-CF2) 8B/64E models: our method significantly outperforms the dense model on downstream tasks.
4.5 Heterogeneity Matters
Capped Expert Choice: We regularized expert choice by limiting the maximum number of experts for each token, using the method described in Section 3.3.
Table 4 reports the average accuracy on the 11 selected datasets.
EC-CAP2 is the variant of our expert choice method that limits the number of experts for each token to 2.
This decreases the fine-tuning accuracy by 0.8 points on average.
In addition, EC-CAP3 allows a maximum of 3 experts per token and achieves results on par with the vanilla expert choice method.
This ablation study confirms that allowing variable number of experts per token is indeed helpful.
Variable Experts per Token:
We compute statistics on token-to-expert routing, particularly on the ratio of tokens that have been routed to a certain number of experts. According to Fig. 3, a majority of tokens have been routed to one or two experts, while 23% have been routed to three or four experts and only about 3% of tokens have been routed to more than 4 experts. This plot verifies our hypothesis that our method learns to allocate a variable number of experts to tokens, which can be beneficial for important tokens.
Method | Max # of Experts | Avg acc. |
---|---|---|
EC-CAP2 | 2 | 83.2 ± 0.4
EC-CAP3 | 3 | 84.0 ± 0.4
EC-CF2 | - | 84.0 ± 0.2
Hash Layer | - | 81.3 ± 0.1
4.6 Comparison with Hash Layer
In this section, we compare our method with Hash Layers [28]. We use a hash function to map a token ID to an expert ID. This ensures load balance and generates specialized experts. The fine-tuning results are presented in the last row of Table 4.
Hashing based routing performs worse than expert choice in terms of average scores and variance. This indicates that load balancing alone does not generate all the benefits.
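A minimal illustration of this style of static routing is shown below; Python's built-in hash is used as a stand-in for the hash function, and the function name is ours.

```python
def hash_route(token_id: int, num_experts: int) -> int:
    # Static token-to-expert assignment in the spirit of the Hash Layer baseline [28]:
    # the mapping is fixed ahead of time and independent of the token's context.
    return hash(token_id) % num_experts
```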
4.7 Ablation
Figure 4: (a) Varying the capacity factor in our expert choice method: reducing the capacity factor from two to one increases training perplexity but still outperforms top-1 gating. (b) Training perplexity comparison with dense models.
Capacity Factor:
We study the capacity factor in our expert choice method and compare the training perplexity with the baseline top-1 gating method used in Switch Transformer. As described in Eq. 1, the capacity factor determines how many experts on average each token can be routed to, and thus the bucket size $k$ of each expert. In all our previous experiments, we use a capacity factor of 2, which matches the computational footprint of the top-2 gating used in the GShard method.
To match the computation cost on a per-token basis fairly with top-1 gating used in Switch Transformer, we reduce the capacity factor to 1 and plot the training perplexity in Fig. 4 (a).
Not surprisingly, using a smaller capacity factor yields higher perplexity, but our method still significantly outperforms top-1 gating.
We further push the capacity factor down to 0.5, and observe that it still outperforms the top-1 gating.
Comparison with Dense Models on Pre-training:
We compare our method with dense models on pre-training. As shown in Fig. 4 (b), our method consistently outperforms the dense method in perplexity and convergence time. For a small expert size of 100M parameters, the benefit of sparse gating is even more significant. Orthogonal to results presented in Fig. 2 (b), where scaling the number of experts improves model performance, Fig. 4 (b) shows that increasing expert capacity also significantly increases model performance.
5 Conclusion
We propose a new routing method for sparsely activated mixture-of-experts (MoE) models.
This method addresses load imbalance and under-utilization of experts in conventional MoE methods, and enables selecting different numbers of experts for each token.
Our model demonstrates more than 2× training efficiency improvement when compared to the state-of-the-art GShard and Switch Transformer models, and also achieves strong gains when fine-tuning on 11 datasets from the GLUE and SuperGLUE benchmarks.
6 Limitations
The expert choice method might not immediately apply to auto-regressive text generation as our current implementation takes in the past and future tokens to perform the top-$k$ selection.
One possible solution is to collect a large batch of input sequences, dispatch tokens of the same sequence into separate groups, and perform expert choice routing for each group.
Another scenario where the expert choice method does not immediately apply is when the batch size becomes very small during serving or inference. A global top-$k$ can be selected instead and we can cap the number of times each expert or token gets selected.
We leave these possible improvements for future work.
Another long-standing issue with MoE has been the large memory footprint. Even though computational cost can be reduced using sparsely gated networks, the total number of parameters increases linearly or sub-linearly with the number of experts. Increasing the number of experts requires reservation of a large number of hardware devices. Therefore, dynamic (used) power is saved while static (reserved) power is not. Power saving techniques such as the ability to put hardware devices into low power states while not in use [17] can help with reducing the reserved power requirements.
References
- [1] Davide Abati, Jakub Tomczak, Tijmen Blankevoort, Simone Calderara, Rita Cucchiara, and Babak Ehteshami Bejnordi. Conditional channel gated networks for task-aware continual learning. In CVPR, pages 3930--3939. Computer Vision Foundation / IEEE, 2020.
- [2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems, 2020.
- [3] Kyunghyun Cho and Yoshua Bengio. Exponentially increasing the capacity-to-computation ratio for conditional computation in deep learning, 2014.
- [4] Zihang Dai, Hanxiao Liu, Quoc V. Le, and Mingxing Tan. CoAtNet: Marrying convolution and attention for all data sizes. In Advances in Neural Information Processing Systems, 2021.
- [5] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, July 2019. Association for Computational Linguistics.
- [6] Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, page 933–941. JMLR.org, 2017.
- [7] Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathy Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, and Claire Cui. Glam: Efficient scaling of language models with mixture-of-experts, 2021.
- [8] Dheeru Dua, Shruti Bhosale, Vedanuj Goswami, James Cross, Mike Lewis, and Angela Fan. Tricks for training sparse translation models, 2021.
- [9] Richard L Dykstra. An iterative procedure for obtaining i-projections onto the intersection of convex sets. The annals of Probability, pages 975--984, 1985.
- [10] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity, 2021.
- [11] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V. Le. NAS-FPN: learning scalable feature pyramid architecture for object detection. In CVPR, pages 7036--7045. Computer Vision Foundation / IEEE, 2019.
- [12] Sam Gross, Marc’Aurelio Ranzato, and Arthur Szlam. Hard mixtures of experts for large scale weakly supervised vision, 2017.
- [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770--778, 2016.
- [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision -- ECCV 2016, pages 630--645, Cham, 2016. Springer International Publishing.
- [15] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs), 2016.
- [16] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically, 2017.
- [17] Ping Huang, Zuocheng Xing, Tianran Wang, Qiang Wei, Hongyan Wang, and Guitao Fu. A brief survey on power gating design. In 2010 10th IEEE International Conference on Solid-State and Integrated Circuit Technology, pages 788--790, 2010.
- [18] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Xu Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. Gpipe: Efficient training of giant neural networks using pipeline parallelism. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 103--112, 2019.
- [19] Norman P. Jouppi, Doe Hyun Yoon, George Kurian, Sheng Li, Nishant Patil, James Laudon, Cliff Young, and David A. Patterson. A domain-specific supercomputer for training deep neural networks. Commun. ACM, 63(7):67--78, 2020.
- [20] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020.
- [21] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations, 2021.
- [22] Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. Base layers: Simplifying training of large, sparse models. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 6265--6274. PMLR, 18--24 Jul 2021.
- [23] Min Lin, Jie Fu, and Yoshua Bengio. Conditional computation for continual learning, 2019.
- [24] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. Pipedream: Generalized pipeline parallelism for dnn training. New York, NY, USA, 2019. Association for Computing Machinery.
- [25] Joan Puigcerver, Carlos Riquelme Ruiz, Basil Mustafa, Cédric Renggli, André Susano Pinto, Sylvain Gelly, Daniel Keysers, and Neil Houlsby. Scalable transfer learning with expert models. In ICLR. OpenReview.net, 2021.
- [26] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.
- [27] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1--140:67, 2020.
- [28] Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, and Jason Weston. Hash layers for large sparse models, 2021.
- [29] Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. Green ai, 2019.
- [30] Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and Blake Hechtman. Mesh-tensorflow: Deep learning for supercomputers. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, page 10435–10444, Red Hook, NY, USA, 2018. Curran Associates Inc.
- [31] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In ICLR (Poster). OpenReview.net, 2017.
- [32] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4596--4604. PMLR, 10--15 Jul 2018.
- [33] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020.
- [34] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems. Curran Associates, Inc.
- [35] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, November 2018. Association for Computational Linguistics.
- [36] Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake A. Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, Ruoming Pang, Noam Shazeer, Shibo Wang, Tao Wang, Yonghui Wu, and Zhifeng Chen. GSPMD: general and scalable parallelization for ML computation graphs. CoRR, abs/2105.04663, 2021.
7 Checklist
(a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Yes
(b) Have you read the ethics review guidelines and ensured that your paper conforms to them? Yes
(c) Did you discuss any potential negative societal impacts of your work? N/A. Not any.
(d) Did you describe the limitations of your work? Yes
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results? Yes. We include details in the experiment setup to help reproduce the main results.
(b) Did you specify all the training details? Yes
(c) Did you report error bars? Yes
(d) Did you include the amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? Yes
(a) If your work uses existing assets, did you cite the creators? Yes
(b) Did you mention the license of the assets? No. The used dataset is not released yet.
(c) Did you include any new assets either in the supplemental material or as a URL? No. The dataset is not released yet.
(d) Did you discuss whether and how consent was obtained from people whose data you’re using/curating? No. Not using persons’ data.
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? Yes. The dataset does not contain any personally identifiable information or offensive content.