Using Bifurcations for Diversity in Differentiable Games

Jonathan Lorraine$^{1,2,3}$  Jack Parker-Holder$^{1,4}$  Paul Vicol$^{2,3}$  Aldo Pacchiano$^{5}$  Luke Metz$^{6}$  Tal Kachman$^{7}$  Jakob Foerster$^{1}$

Abstract

Ridge Rider (RR) is an algorithm for finding diverse solutions to optimization problems by following eigenvectors of the Hessian (“ridges”). RR is designed for conservative gradient systems (i.e. settings involving a single loss function), where it branches at saddles, the only relevant bifurcation points. We generalize this idea to non-conservative, multi-agent gradient systems by identifying new types of bifurcation points and proposing a method to follow eigenvectors with complex eigenvalues. We give theoretical motivation for our method, denoted Game Ridge Rider (GRR), by leveraging machinery from the field of dynamical systems. Finally, we empirically motivate our method by constructing novel toy problems where we can visualize new phenomena and by finding diverse solutions in the iterated prisoners’ dilemma, where existing methods often converge to a single solution mode.

1. Introduction

In machine learning, optimizing non-convex losses to local minima is critical in a variety of applications. Often, being able to select a specific type of minimum is important, such as for policy optimization [1] or to avoid non-transferable features in supervised learning [2, 3].

However, an increasing number of applications require learning in games, generalizing single-objective optimization to settings where each agent controls a different subset of the parameters to optimize their own objective. Some examples are GANs [4, 5], actor-critic models [5], curriculum learning [6-9], hyperparameter optimization [10-15], adversarial examples [16, 17], learning models [18-20], domain adversarial adaptation [21], neural architecture search [22-26], multi-agent settings [27], and meta-learning [28-30].

There are two major challenges with learning in games: first, the learning dynamics can be unstable due to the non-stationarity induced by simultaneously learning agents, e.g. resulting in cycling dynamics rather than convergence. Second, different equilibria can result in vastly different rewards for the different agents. For example, in the iterated prisoners’ dilemma (Sec. 4.1), finding solutions that favor cooperation over selfishness results in a higher return for all agents; or in Hanabi, finding solutions that do not arbitrarily break symmetries results in better coordination with humans [31].

Recently, Ridge Rider (RR [32]) provided a general method for finding diverse solutions in single-objective optimization by following eigenvectors of the Hessian with negative eigenvalues. We generalize RR to multi-agent settings, where each agent optimizes its own objective. The relevant generalization of the Hessian, the game Hessian, is not symmetric and thus, in general, has complex eigenvalues (EVals) and eigenvectors (EVecs). This leads to new types of bifurcation points, where small changes in the initial parameters lead to very different optimization outcomes.
Our method, Game Ridge Rider (GRR), branches from these novel bifurcation points along the (complex) EVecs to discover diverse solutions in these settings.
Figure 1. We visualize the Game Ridge Rider (GRR) algorithm when branching at different types of bifurcations. We have two eigenvectors and can branch in opposite directions, so we display four paths in different colors. Steps along an eigenvector have magenta boundaries. Top: For the small Iterated Prisoner’s Dilemma (IPD), finding and then branching at the max-entropy saddle allows us to find the defect-defect and tit-for-tat solutions. Bottom: For the mixed problem of small IPD and matching pennies, branching at the Hopf bifurcation allows us to find both solutions.

2. Background

Appendix Table 2 summarizes our notation. Consider the minimization problem:
$$\boldsymbol{\theta}^{*} := \arg\min_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})$$
We denote the gradient of the loss at parameters $\boldsymbol{\theta}^{j}$ by $\boldsymbol{g}^{j} := \boldsymbol{g}(\boldsymbol{\theta}^{j}) := \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})\big|_{\boldsymbol{\theta}^{j}}$. We can locally minimize the loss $\mathcal{L}$ using (stochastic) gradient descent with step size $\alpha$:
$$\boldsymbol{\theta}^{j+1} = \boldsymbol{\theta}^{j} - \alpha \boldsymbol{g}^{j}$$
Due to the potentially non-convex nature of $\mathcal{L}$, multiple solutions can exist, and simply following the gradient will not explore the parameter space.

2.1. Ridge Rider

Ridge Rider (RR) [32] finds diverse solutions in gradient systems for single-objective minimization. RR first finds a saddle point, and then branches the optimization procedure following different directions (or “ridges”) given by the EVecs of the Hessian $\mathcal{H} = \nabla_{\boldsymbol{\theta}} \boldsymbol{g} = \nabla_{\boldsymbol{\theta}}(\nabla_{\boldsymbol{\theta}} \mathcal{L})$. Full computation of $\mathcal{H}$, and of its EVecs and EVals, is often prohibitively expensive; however, we can efficiently access a subset of the eigenspaces in libraries with efficient Hessian-vector products $\mathcal{H}\boldsymbol{v} = \nabla_{\boldsymbol{\theta}}((\nabla_{\boldsymbol{\theta}} \mathcal{L})\boldsymbol{v})$ [33-36]. Algorithm 1 summarizes the RR method.

2.2. Optimization in Games

Instead of simply optimizing a single loss, optimization in games involves multiple agents, each with a loss function that potentially depends on the other agents. Examples include multiplayer games (e.g. Go [37, 38] and Hanabi [39]), the iterated prisoners’ dilemma (IPD) [40, 41], and generative adversarial networks (GANs) [4]. For simplicity, we look at 2-player games with players denoted by $A$ and $B$. Each player wants to minimize their loss $\mathcal{L}_{A}, \mathcal{L}_{B}$ with their parameters $\boldsymbol{\theta}_{A} \in \mathbb{R}^{d_{A}}, \boldsymbol{\theta}_{B} \in \mathbb{R}^{d_{B}}$:
$$\boldsymbol{\theta}_{A}^{*} \in \arg\min_{\boldsymbol{\theta}_{A}} \mathcal{L}_{A}(\boldsymbol{\theta}_{A}, \boldsymbol{\theta}_{B}^{*}), \quad \boldsymbol{\theta}_{B}^{*} \in \arg\min_{\boldsymbol{\theta}_{B}} \mathcal{L}_{B}(\boldsymbol{\theta}_{A}^{*}, \boldsymbol{\theta}_{B}) \;\Longleftrightarrow\; \boldsymbol{\theta}_{A}^{*} \in \arg\min_{\boldsymbol{\theta}_{A}} \mathcal{L}_{A}^{*}(\boldsymbol{\theta}_{A}), \quad \boldsymbol{\theta}_{B}^{*} \in \arg\min_{\boldsymbol{\theta}_{B}} \mathcal{L}_{B}^{*}(\boldsymbol{\theta}_{B}) \qquad (2)$$
If $\mathcal{L}_{B}$ and $\mathcal{L}_{A}$ are differentiable in $\boldsymbol{\theta}_{B}$ and $\boldsymbol{\theta}_{A}$, we say the game is differentiable. Unfortunately, for a player to do gradient-based optimization of their objective, we must compute $d\mathcal{L}_{B}^{*}/d\boldsymbol{\theta}_{B}$, which often requires $d\boldsymbol{\theta}_{A}^{*}/d\boldsymbol{\theta}_{B}$, but $\boldsymbol{\theta}_{A}^{*}(\boldsymbol{\theta}_{B})$ and its Jacobian are typically intractable. There are various algorithms that try to find solutions like Eq. 2, often efficiently approximating $d\boldsymbol{\theta}_{A}^{*}/d\boldsymbol{\theta}_{B}$ with methods relying on efficient Jacobian-vector products. We overview these in our related work (Appendix A).
One of the most straightforward optimization methods is to find local solutions with simultaneous SGD (SimSGD). This does not take into account $d\boldsymbol{\theta}_{A}^{*}/d\boldsymbol{\theta}_{B}$ and often fails to converge. Here, $\boldsymbol{g}_{A}^{j} := \boldsymbol{g}_{A}(\boldsymbol{\theta}_{A}^{j}, \boldsymbol{\theta}_{B}^{j})$ and $\boldsymbol{g}_{B}^{j} := \boldsymbol{g}_{B}(\boldsymbol{\theta}_{A}^{j}, \boldsymbol{\theta}_{B}^{j})$ are estimators for $\nabla_{\boldsymbol{\theta}_{A}} \mathcal{L}_{A}\big|_{\boldsymbol{\theta}_{A}^{j}, \boldsymbol{\theta}_{B}^{j}}$ and $\nabla_{\boldsymbol{\theta}_{B}} \mathcal{L}_{B}\big|_{\boldsymbol{\theta}_{A}^{j}, \boldsymbol{\theta}_{B}^{j}}$:
$$\boldsymbol{\theta}_{A}^{j+1} = \boldsymbol{\theta}_{A}^{j} - \alpha \boldsymbol{g}_{A}^{j}, \quad \boldsymbol{\theta}_{B}^{j+1} = \boldsymbol{\theta}_{B}^{j} - \alpha \boldsymbol{g}_{B}^{j} \qquad \text{(SimSGD)}$$
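
As a minimal illustration of this failure mode (a toy example of ours, not from the paper), simultaneous gradient steps on the bilinear game $\mathcal{L}_{A} = \theta_{A}\theta_{B}$, $\mathcal{L}_{B} = -\theta_{A}\theta_{B}$ spiral away from the unique equilibrium at the origin:

    import numpy as np

    # Bilinear zero-sum game: L_A = a*b, L_B = -a*b, with equilibrium at (0, 0).
    a, b, alpha = 1.0, 1.0, 0.1
    for _ in range(100):
        g_a, g_b = b, -a                         # partial gradients of L_A and L_B
        a, b = a - alpha * g_a, b - alpha * g_b  # simultaneous (SimSGD) updates
    print(np.hypot(a, b))  # distance from the equilibrium grows instead of shrinking
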
We simplify notation by using the concatenation of all players’ parameters (or joint parameters) $\boldsymbol{\omega} := [\boldsymbol{\theta}_{A}, \boldsymbol{\theta}_{B}] \in \mathbb{R}^{d}$ and the joint-gradient vector field $\hat{\boldsymbol{g}}: \mathbb{R}^{d} \rightarrow \mathbb{R}^{d}$, which at the $j^{\text{th}}$ iteration is denoted:
$$\hat{\boldsymbol{g}}^{j} := \hat{\boldsymbol{g}}(\boldsymbol{\omega}^{j}) := [\boldsymbol{g}_{A}(\boldsymbol{\omega}^{j}), \boldsymbol{g}_{B}(\boldsymbol{\omega}^{j})] = [\boldsymbol{g}_{A}^{j}, \boldsymbol{g}_{B}^{j}]$$
We can write the next iterate in (SimSGD) with a fixed-point operator $\boldsymbol{F}_{\alpha}$:
$$\boldsymbol{\omega}^{j+1} = \boldsymbol{F}_{\alpha}(\boldsymbol{\omega}^{j}) = \boldsymbol{\omega}^{j} - \alpha \hat{\boldsymbol{g}}^{j}$$
The Jacobian of the fixed-point operator $\boldsymbol{F}_{\alpha}$ is useful for analysis, including bounding convergence rates near fixed points and finding points where local changes to the parameters may cause convergence to qualitatively different solutions. The fixed-point operator’s Jacobian crucially depends on the Jacobian of our update $\hat{\boldsymbol{g}}$. When our update is the gradient, we call this the game Hessian, because it generalizes the Hessian:
$$\hat{\mathcal{H}} := \nabla_{\boldsymbol{\omega}} \hat{\boldsymbol{g}} = \begin{bmatrix} \nabla_{\boldsymbol{\theta}_{A}}^{2} \mathcal{L}_{A} & \nabla_{\boldsymbol{\theta}_{A}} \nabla_{\boldsymbol{\theta}_{B}} \mathcal{L}_{A} \\ \nabla_{\boldsymbol{\theta}_{B}} \nabla_{\boldsymbol{\theta}_{A}} \mathcal{L}_{B} & \nabla_{\boldsymbol{\theta}_{B}}^{2} \mathcal{L}_{B} \end{bmatrix}, \qquad \nabla_{\boldsymbol{\omega}} \boldsymbol{F}_{\alpha}(\boldsymbol{\omega}) = I - \alpha \hat{\mathcal{H}}$$
For single-objective optimization, $\hat{\mathcal{H}}$ is the Hessian of the loss, which is symmetric and has real EVals, yielding parameter updates that form a conservative gradient vector field. However, in games with multiple objectives, $\hat{\mathcal{H}}$ is not symmetric and can have complex EVals, resulting in updates which form a non-conservative vector field.

2.3. Local Bifurcations and Separatrices

The goal of RR is to obtain a method for finding diverse solutions in minimization. There are multiple ways to try to find solutions with RR, but we focus on finding separatrices, i.e., boundaries between phase-space regions with different dynamical behavior, and branching over these boundaries. Crossing such a boundary leads to a different solution under our update flow. This behavior of crossing regions and changing behavior is in fact a local bifurcation: a qualitative change in the behavior of the solutions.
When applying RR in conservative gradient systems, saddle points and their EVecs play a key role in the shape of the phase portraits, which are a geometric representation of the underlying dynamics. The negative EVecs often align with unstable manifolds that are orthogonal to our separatrices [42], thus giving directions in which we can perturb to find different solutions (Lemma 14.3 [43]). However, in non-conservative systems there are a variety of other local bifurcations [44] besides saddle points [45]. For example, by Thm. 11.2 of [43], if all EVals have negative real part except a conjugate non-zero pair of purely imaginary EVals, then a Hopf bifurcation occurs when changing the parameters causes the pair to cross the imaginary axis. A Hopf bifurcation is a critical point where the stability of the system switches, resulting in a periodic solution.

3. Our Method: Game Ridge Rider (GRR)

In this section we introduce Game Ridge Rider (GRR), a generalization of Ridge Rider for learning in games. RR is not immediately applicable to complex EVals of the Hessian, because we would need to follow the complex EVecs, resulting in complex weights, and we may need to branch at points besides saddles. When the Hessian has real EVals, GRR is equivalent to RR.

Consider the framework for the RR method in Alg. 1. We modify the components GetRidges and EndRide, along with the proposed starting saddle. We also add a method for running different optimizers after branching with EVecs.
GetRidges finds which EVals and EVecs we should explore from a given branching point. The EVecs of a matrix are not unique in general, and may have complex entries for complex EVals. We only want real-valued updates when following our EVecs, so we select the choices with the largest real part and norm one; this is the default in PyTorch [35]. For a conjugate pair of complex EVals, this selection of EVecs corresponds to spanning the (real-valued) plane that we would rotate in under repeated multiplication of the EVec by the matrix. This also specifies the order in which we explore the EVals, which we set to be the most negative EVals first. Note that we can explore in opposite directions along each negative EVec.
EndRide is a heuristic that determines how long we follow an EVec. We experiment with stopping after a single step and with stopping when the real part of the EVal goes from negative to zero. If we have noisy EVec estimates, or we are not exactly at a separatrix, we may need to take multiple steps to find different solutions. We can attempt to find Hopf bifurcations by ending and branching when a complex EVal crosses the imaginary axis.
Optimize is a method which runs an optimization procedure. In this work we explore following the gradient (SimSGD) or LOLA [27], with user-specified optimization parameters.
Starting location ($\boldsymbol{\omega}^{\text{Saddle}}$): We start our method at some $\boldsymbol{\omega}^{\text{Saddle}} = \arg\min_{\boldsymbol{\omega}} |\hat{\boldsymbol{g}}|$. There are often multiple saddles we can begin at, so for multi-agent tabular RL (like the IPD) we heuristically begin at the maximum-entropy saddle $\arg\min_{\boldsymbol{\omega}} |\hat{\boldsymbol{g}}| - \beta H(\pi_{\boldsymbol{\omega}}(\boldsymbol{a}))$, $\beta > 0$, as in [32].

We did not change the other components, ChooseFromArchive and UpdateRidge, but summarize them below. See RR [32] for more details on their implementations.
ChooseFromArchive gives a search order over optimization branches (e.g., BFS or DFS) by outputting an index to search and the updated archive of optimization branches.
UpdateRidge updates the currently followed EVec, which may have changed due to the optimization step.
Algorithm 1 Game Ridge Rider (GRR), with modifications to RR shown in red
    Input: ω^Saddle, α, ChooseFromArchive, GetRidges, EndRide, Optimize, UpdateRidge
    A = GetRidges(ω^Saddle)                         # Init. archive
    while archive A is non-empty do
        j, A = ChooseFromArchive(A)
        (ω^j, e_j, λ_j) = A_j
        while EndRide(ω^j, e_j, λ_j) is not True do
            ω^j ← ω^j - α e_j                       # Step along the ridge e_j
            e_j, λ_j = UpdateRidge(ω^j, e_j, λ_j)
        ω^j = Optimize(ω^j)
        A = A ∪ GetRidges(ω^j)                      # Add new ridges

4. Experiments

We investigate using Game Ridge Rider on multi-agent learning problems. First, we construct a range of two-dimensional games, allowing us to qualitatively study new phenomena that occur in multi-agent settings. Next, we look at a higher-dimensional experiment: learning strategies in the iterated prisoners’ dilemma (IPD). Our method is able to find a diverse set of solutions, improving on baselines.

4.1. Test Problems

Our test problems are described in full detail in Appendix Section B.1 and summarized here. We visualize the strategy space for the 2-parameter problems in Appendix Fig. 3.

Iterated Prisoners’ Dilemma (IPD): This game is an infinite sequence of the Prisoner’s Dilemma, where the future payoff is discounted by a factor $\gamma \in [0,1)$. Each agent is conditioned on the actions in the prior state, so there are 5 parameters for each agent, i.e., the probability of cooperating at the start state or given both agents’ preceding actions. We are interested in two Nash equilibria: defect-defect (DD), where agents are selfish (giving a poor reward), and tit-for-tat (TT), where agents initially cooperate, then copy the opponent’s action (giving a higher reward).
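
As a worked sketch of this setup (our own construction, assuming the usual per-step costs of 1 for mutual cooperation, 2 for mutual defection, 3 for cooperating against a defector, 0 for defecting against a cooperator, and a sigmoid parameterization of the five probabilities), the average per-step losses have a closed form:

    import torch

    def ipd_losses(theta1, theta2, gamma=0.96):
        # theta_i holds 5 logits: P(cooperate) at the start and after each previous
        # joint action (CC, CD, DC, DD), with the player's own action listed first.
        p1, p2 = torch.sigmoid(theta1), torch.sigmoid(theta2)
        p1c = p1[1:]              # player 1's conditional coop. probs over (CC, CD, DC, DD)
        p2c = p2[[1, 3, 2, 4]]    # player 2's, re-indexed into player 1's state order
        outer = lambda a, b: torch.stack(
            [a * b, a * (1 - b), (1 - a) * b, (1 - a) * (1 - b)], dim=-1)
        P = outer(p1c, p2c)       # 4x4 transition matrix over joint actions
        p0 = outer(p1[0], p2[0])  # initial joint-action distribution
        c1 = torch.tensor([1.0, 3.0, 0.0, 2.0])   # player 1's per-step costs
        c2 = torch.tensor([1.0, 0.0, 3.0, 2.0])   # player 2's per-step costs
        v = (1 - gamma) * p0 @ torch.linalg.inv(torch.eye(4) - gamma * P)
        return v @ c1, v @ c2

    always_defect = torch.full((5,), -10.0)
    print(ipd_losses(always_defect, always_defect))   # both losses ~ 2, the DD mode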

Small IPD: This is a 2-parameter simplification of the IPD, which allows the DD and TT Nash equilibria. This game allows us to visualize some of the optimization difficulties of the full-scale IPD; however, its game Hessian has strictly real EVals, unlike that of the full-scale IPD.
Matching Pennies: This is a simplified 2-parameter version of rock-paper-scissors. The first player wins if both players select the same action, while the second player wins if they select different actions. This game has a Nash equilibrium where each player selects their action with uniform probability. This problem’s game Hessian has purely imaginary EVals, unlike the small IPD, but the game has only a single solution and is thus a poor fit for evaluating RR, which seeks a diversity of solutions.

Mixing Small IPD and Matching Pennies: This game interpolates between the small IPD and matching pennies games with an interpolation factor $\tau \in [0,1]$. With $\tau = 0.25$, the problem has two solutions: one where both players cooperate and one where both players select actions uniformly, with a Hopf bifurcation separating these.
Columns $C_0$ through $C \mid DD$ give Player 1's strategy distribution, [min, max].

| Search Strategy | Player 1 Loss $\mathcal{L}$ [min, max] | $C_0$ | $C \mid CC$ | $C \mid CD$ | $C \mid DC$ | $C \mid DD$ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Max Entropy Saddle | [1.000, 2.000] | [.001, .999] | [.041, .999] | [.004, .874] | [.000, .912] | [.000, .013] |
| 20 Random init + grad. | [1.997, 1.998] | [.043, .194] | [.142, .480] | [.041, .143] | [.055, .134] | [.001, .001] |
| 20 Random init + LOLA | [1.000, 1.396] | [.000, 1.00] | [.093, 1.00] | [.000, .966] | [.057, 1.00] | [.000, .947] |
| 1 Random init + branch | [2.000, 2.000] | [.001, .001] | [.027, .027] | [.003, .003] | [.008, .008] | [.000, .000] |
Table 1. We compare strategies for finding diverse solutions in the iterated prisoner’s dilemma (IPD). The IPD has two solution modes: solutions where both agents end up defecting, with a loss of 2, and solutions where both agents end up cooperating, with a loss of 1 (like tit-for-tat). We compare simply following gradients (SimSGD) and LOLA [27], each from random (init)ializations. We also look at the impact of starting at a saddle by branching on the EVecs at a random init. Takeaway: Our method finds solutions at both loss modes. Random inits followed by the gradient or by LOLA do not find diverse solutions: the gradient often defects, while LOLA often cooperates. If we are not at a stationary point like a saddle, then branching likely does not affect where we converge to.


4.2. Baseline Methods on Toy Problems

Fig. 2 shows the phase portrait for baseline methods on our mixed problem, i.e., the mixture of the small IPD and matching pennies with $\tau = 0.25$.
Figure 2. The phase portrait for two standard optimization algorithms on the mixed small IPD and matching pennies problem. Following the gradient is shown in red, while LOLA, a method for learning in games, is shown in blue. Following the gradient only finds the solution in the top right, because the center solution has imaginary EVals. LOLA [27] can find either solution. Takeaway: The mixture game has a range of phenomena, including an imaginary-EVal solution, a real-EVal solution, and a Hopf bifurcation. We may want to use a method for learning in games, so we can robustly converge to different solutions.

4.3. Visualizing the Spectrum on Toy Problems

In Appendix Fig. 5, we visualize the spectrum for the mixed objective to see where the stationary points are, which stationary points are solutions, how desirable the solutions are for the players, and where the bifurcation occurs.

4.4. Visualizing the Spectrum on the full IPD

Appendix Fig. 4 shows the spectrum along optimization trajectories for the IPD. During training, complex EVals cross the imaginary axis, and the final stationary point has positive and negative real EVals as well as complex EVals. Takeaway: While optimizing the IPD, we have multiple bifurcation candidates and thus multiple potential branching points for GRR.

4.5. Game Ridge Rider (GRR) on Toy Problems

In Figure 1 we use our method to find diverse solutions on toy problems with different types of bifurcations. The small IPD has a saddle bifurcation, while the mixed problem has a Hopf bifurcation. The mixture has a solution with imaginary EVals, which is unstable when following the gradient (see Figure 2), so we use LOLA after branching. Takeaway: By branching our optimization at a bifurcation and using a method for learning in games, we can find all solutions in both toy problems from a single starting point.

4.6. Game Ridge Rider on the IPD

Here, we use our method on the IPD, a larger-scale problem where existing methods have difficulty finding diverse solutions. The IPD has two solution modes: one where both agents end up defecting and one where both end up cooperating. Table 4 compares our method to following gradients and to LOLA, each run with random initializations. We also investigate the importance of starting at the max-entropy saddle. Takeaway: Our method finds solutions at both loss modes by branching at the top few EVecs, where baseline methods failed to find diverse solutions. Starting at an approximate saddle is critical for our branching to find different solutions.

5. Conclusion

In this paper we introduced Game Ridge Rider, an extension of the Ridge Rider algorithm to settings with multiple losses. We showed that in these settings a broader class of bifurcation points needs to be considered and that GRR can indeed discover them in a number of settings. Furthermore, our experimental results showed that GRR obtains a diversity of qualitatively different solutions in multi-agent settings such as iterated prisoner’s dilemma. We also provide some theoretical justification for our method by using tools from the dynamical systems literature. Prior work had failed to explore the connection between saddle points in the RR algorithm and the bifurcation points in dynamical systems.

Acknowledgements

Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute. We would also like to thank C. Daniel Freeman for feedback on this work.

References

[1] Antoine Cully, Jeff Clune, Danesh Tarapore, and JeanBaptiste Mouret. Robots that can adapt like animals. Nature, 521(7553):503-507, 2015.

[2] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.

[3] Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665-673, 2020.

[4] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.

[5] David Pfau and Oriol Vinyals. Connecting generative adversarial networks and actor-critic methods. arXiv preprint arXiv:1610.01945, 2016.

[6] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In International Conference on Machine Learning (ICML), pages 41-48, 2009.

[7] Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. Emergent tool use from multi-agent autocurricula. In International Conference on Learning Representations, 2019.

[8] David Balduzzi, Marta Garnelo, Yoram Bachrach, Wojciech Czarnecki, Julien Perolat, Max Jaderberg, and Thore Graepel. Open-ended learning in symmetric zero-sum games. In International Conference on Machine Learning, pages 434-443. PMLR, 2019.

[9] Sainbayar Sukhbaatar, Zeming Lin, Ilya Kostrikov, Gabriel Synnaeve, Arthur Szlam, and Rob Fergus. Intrinsic motivation and automatic curricula via asymmetric self-play. In International Conference on Learning Representations, 2018.

[10] Justin Domke. Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics, pages 318-326, 2012.

[11] Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning (ICML), pages 2113-2122, 2015.

[12] Jonathan Lorraine and David Duvenaud. Stochastic hyperparameter optimization through hypernetworks. arXiv preprint arXiv:1802.09419, 2018.

[13] Matthew MacKay, Paul Vicol, Jon Lorraine, David Duvenaud, and Roger Grosse. Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions. In International Conference on Learning Representations (ICLR), 2019.

[14] Aniruddh Raghu, Maithra Raghu, Simon Kornblith, David Duvenaud, and Geoffrey Hinton. Teaching with commentaries. arXiv preprint arXiv:2011.03037, 2020.

[15] Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1540-1552, 2020.

[16] Avishek Joey Bose, Gauthier Gidel, Hugo Berrard, Andre Cianflone, Pascal Vincent, Simon Lacoste-Julien, and William L Hamilton. Adversarial example games. arXiv preprint arXiv:2007.00720, 2020.

[17] Xiaoyong Yuan, Pan He, Qile Zhu, and Xiaolin Li. Adversarial examples: Attacks and defenses for deep learning. IEEE Transactions on Neural Networks and Learning Systems, 30 (9):2805-2824, 2019.

[18] Aravind Rajeswaran, Igor Mordatch, and Vikash Kumar. A game theoretic framework for model based reinforcement learning. arXiv preprint arXiv:2004.07804, 2020.

[19] Pierre-Luc Bacon, Florian Schäfer, Clement Gehring, Animashree Anandkumar, and Emma Brunskill. A Lagrangian method for inverse problems in reinforcement learning. lis.csail.mit.edu/pubs, 2019.

[20] Evgenii Nikishin, Romina Abachi, Rishabh Agarwal, and Pierre-Luc Bacon. Control-oriented model-based reinforcement learning with implicit differentiation. arXiv preprint arXiv:2106.03273, 2021.

[21] David Acuna, Guojun Zhang, Marc T Law, and Sanja Fidler. f-domain-adversarial learning: Theory and algorithms for unsupervised domain adaptation with neural networks, 2021. URL https://openreview.net/forum?id= WqXAKcw£ZtI.

[22] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.

[23] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. In AAAI Conference on Artificial Intelligence, volume 33, pages 4780-4789, 2019.

[24] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations (ICLR), 2019.

[25] Will Grathwohl, Elliot Creager, Seyed Kamyar Seyed Ghasemipour, and Richard Zemel. Gradient-based optimization of neural network architecture. 2018.

[26] George Adam and Jonathan Lorraine. Understanding neural architecture search techniques. arXiv preprint arXiv:1904.00438, 2019.

[27] Jakob Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. In International Conference on Autonomous Agents and MultiAgent Systems, pages 122-130, 2018.

[28] Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B Tenenbaum, Hugo Larochelle, and Richard S Zemel. Meta-learning for semi-supervised fewshot classification. arXiv preprint arXiv:1803.00676, 2018.
[29] Aravind Rajeswaran, Chelsea Finn, Sham Kakade, and Sergey Levine. Meta-learning with implicit gradients. arXiv preprint arXiv:1909.04630, 2019.

[30] Mengye Ren, Eleni Triantafillou, Kuan-Chieh Wang, James Lucas, Jake Snell, Xaq Pitkow, Andreas S Tolias, and Richard Zemel. Flexible few-shot learning with contextual similarity. arXiv preprint arXiv:2012.05895, 2020.

[31] Hengyuan Hu, Adam Lerer, Alex Peysakhovich, and Jakob Foerster. “other-play” for zero-shot coordination. In International Conference on Machine Learning, pages 4399-4410. PMLR, 2020.

[32] Jack Parker-Holder, Luke Metz, Cinjon Resnick, Hengyuan Hu, Adam Lerer, Alistair Letcher, Alexander Peysakhovich, Aldo Pacchiano, and Jakob Foerster. Ridge rider: Finding diverse solutions by following eigenvectors of the hessian. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 753-765. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/08425b881bcde94a383cd258cea331be-Paper.pdf.

[33] Barak A Pearlmutter. Fast exact multiplication by the hessian. Neural computation, 6(1):147-160, 1994.

[34] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
[35] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.

[36] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

[37] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484-489, 2016.

[38] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. nature, 550(7676): 354-359, 2017.

[39] Nolan Bard, Jakob N Foerster, Sarath Chandar, Neil Burch, Marc Lanctot, H Francis Song, Emilio Parisotto, Vincent Dumoulin, Subhodeep Moitra, Edward Hughes, et al. The hanabi challenge: A new frontier for ai research. Artificial Intelligence, 280:103216, 2020.

[40] Robert J Aumann. Acceptable points in games of perfect information. Pacific Journal of Mathematics, 10(2):381-417, 1960.

[41] Robert Axelrod and William Donald Hamilton. The evolution of cooperation. science, 211(4489):1390-1396, 1981.

[42] M Tabor. Chaos and integrability in nonlinear dynamics: An introduction, wileyinterscience. Chaos and Integrability in Nonlinear Dynamics: An Introduction, 1989.

[43] Jack K Hale and Hüseyin Koçak. Dynamics and bifurcations, volume 3. Springer Science & Business Media, 2012.

[44] Ippei Shimada and Tomomasa Nagashima. A numerical approach to ergodic problem of dissipative dynamical systems. Progress of theoretical physics, 61(6):1605-1616, 1979.

[45] Davide Bigoni and Oleg Kirillov. Dynamic Stability and Bifurcation in Nonconservative Mechanics. Springer, 2019.

[46] L. K. Hansen and P. Salamon. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993-1001, 1990.

[47] Y. Liu and X. Yao. Ensemble learning via negative correlation. Neural Networks, 12(10):1399-1404, 1999. ISSN 0893-6080. doi: https://doi.org/10.1016/S0893-6080(99)00073-8. URL http://www.sciencedirect.com/science/article/pii/S0893608099000738.

[48] Samarth Sinha, Homanga Bharadhwaj, Anirudh Goyal, Hugo Larochelle, Animesh Garg, and Florian Shkurti. Diversity inducing information bottleneck in model ensembles. AAAI, 2021.
[49] Andrew Slavin Ross, Weiwei Pan, Leo A. Celi, and Finale Doshi-Velez. Ensembles of locally independent prediction models. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 5527-5536. AAAI Press, 2020. URL https://aaai.org/ojs/index.php/AAAI/article/view/6004.

[50] Zelda Mariet and Suvrit Sra. Diversity Networks: Neural Network Compression Using Determinantal Point Processes. In International Conference on Learning Representations (ICLR), May 2016.

[51] Tianyu Pang, Kun Xu, Chao Du, Ning Chen, and Jun Zhu. Improving adversarial robustness via promoting ensemble diversity. In International Conference on Machine Learning, pages 4970-4979. PMLR, 2019.

[52] Joel Lehman and Kenneth O. Stanley. Exploiting openendedness to solve problems through the search for novelty. In Proceedings of the Eleventh International Conference on Artificial Life (Alife XI. MIT Press, 2008.

[53] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SJx63jRqFm.

[54] Jack Parker-Holder, Aldo Pacchiano, Krzysztof Choromanski, and Stephen Roberts. Effective diversity in populationbased reinforcement learning. In Advances in Neural Information Processing Systems 34. 2020.

[55] Justin K. Pugh, Lisa B. Soros, and Kenneth O. Stanley. Quality diversity: A new frontier for evolutionary computation. Frontiers in Robotics and AI, 3:40, 2016. ISSN 2296-9144. doi: 10.3389/frobt.2016.00040. URL https://www.frontiersin.org/article/10.3389/frobt.2016.00040.

[56] Galina M Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:747-756, 1976.

[57] Waïss Azizian, Ioannis Mitliagkas, Simon Lacoste-Julien, and Gauthier Gidel. A tight and unified analysis of gradientbased methods for a whole spectrum of differentiable games. In International Conference on Artificial Intelligence and Statistics, pages 2863-2873. PMLR, 2020.

[58] Alexander Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. arXiv preprint arXiv:1311.1869, 2013.

[59] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training gans with optimism. arXiv preprint arXiv:1711.00141, 2017.

[60] Gauthier Gidel, Reyhane Askari Hemmat, Mohammad Pezeshki, Rémi Le Priol, Gabriel Huang, Simon Lacoste-Julien, and Ioannis Mitliagkas. Negative momentum for improved game dynamics. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1802-1811. PMLR, 2019.

[61] Jonathan Lorraine, David Acuna, Paul Vicol, and David Duvenaud. Complex momentum for learning in games. arXiv preprint arXiv:2102.08431, 2021.

[62] Gauthier Gidel, Hugo Berard, Gaëtan Vignoud, Pascal Vincent, and Simon Lacoste-Julien. A variational inequality perspective on generative adversarial networks. In International Conference on Learning Representations, 2018.

[63] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of gans. arXiv preprint arXiv:1705.10461, 2017.
[63] 拉尔斯·梅舍德、塞巴斯蒂安·诺沃津和安德烈亚斯·盖格。gans 的数字。arXiv 预印本 arXiv:1705.10461,2017 年。

[64] Alistair Letcher, David Balduzzi, Sébastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. Differentiable game mechanics. The Journal of Machine Learning Research, 20(1):3032-3071, 2019.
[64] 阿利斯泰尔·莱彻、大卫·巴尔杜齐、塞巴斯蒂安·拉卡尼埃、詹姆斯·马滕斯、雅各布·福尔斯特、卡尔·图伊尔斯和托尔·格雷佩尔。可区分的游戏机制。机器学习研究杂志,20(1):3032-3071,2019 年。

[65] Eric V Mazumdar, Michael I Jordan, and S Shankar Sastry. On finding local nash equilibria (and only local nash equilibria) in zero-sum games. arXiv preprint arXiv:1901.00838, 2019.
[65] 埃里克·马祖达尔、迈克尔·乔丹和 S 香卡·萨斯特里。关于在零和博弈中找到局部纳什均衡(并且只有局部纳什均衡)。arXiv 预印本 arXiv:1901.00838,2019 年。

[66] Florian Schäfer and Anima Anandkumar. Competitive gradient descent. arXiv preprint arXiv:1905.12103, 2019.
[66] 弗洛里安·舍费尔和阿尼玛·阿南德库马尔。有竞争力的坡度下降。arXiv 预印本 arXiv:1905.12103,2019 年。

[67] Marta Garnelo, Wojciech Marian Czarnecki, Siqi Liu, Dhruva Tirumala, Junhyuk Oh, Gauthier Gidel, Hado van Hasselt, and David Balduzzi. Pick your battles: Interaction graphs as population-level objectives for strategic diversity. In Proceedings of the 20th International Conference on AuAuAuA u tonomous Agents and MultiAgent Systems, pages 1501-1503, 2021.
[67] 玛尔塔·加内罗、沃伊切赫·玛丽安·查内基、刘思琪、德鲁瓦·蒂鲁马拉、吴俊赫、高蒂尔·吉德尔、哈多·范·哈塞尔特和大卫·巴尔杜齐。选择你的战斗:互动图作为战略多样性的人口水平目标。第 20 届 AuAuAuA u tonomous Agents 和 MultiAgent 系统国际会议论文集,第 1501-1503 页,2021 年。

[68] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multiagent reinforcement learning. Nature, 575(7782):350-354, 2019.
[69] Yaodong Yang, Ying Wen, Jun Wang, Liheng Chen, Kun Shao, David Mguni, and Weinan Zhang. Multi-agent determinantal q-learning. In International Conference on Machine Learning, pages 10757-10766. PMLR, 2020.
[69] 杨耀东、温英、王军、陈立恒、邵坤、David Mguni 和 Weinan Zhang。多智能体决定性 q-learning。在机器学习国际会议中,第 10757-10766 页。PMLR,2020 年。

[70] Andrei Lupu, Hengyuan Hu, and Jakob Foerster. Trajectory diversity for zero-shot coordination. In Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, pages 1593-1595, 2021.
[70] 安德烈·卢普、胡恒元和雅各布·福斯特。用于零镜头协调的轨迹多样性。第 20 届自主代理和多代理系统国际会议论文集,第 1593-1595 页,2021 年。
Diversity in machine learning: Finding diverse solutions is often desirable in machine learning, for example to improve performance of model ensembles [46], with canonical approaches directly optimizing for negative correlation among model predictions [47]. Recently, these ideas have begun to re-emerge, improving ensemble performance [48-50] and robustness [51, 1], and boosting exploration in reinforcement learning [52-55].

Many of these approaches seek diverse solutions by following gradients of an altered, typically multi-objective, loss function. By contrast, the recent Ridge Rider (RR, [32]) algorithm searches for diverse solutions by following EVecs of the Hessian of the original loss function, producing orthogonal (loss-reducing) search directions.

Finding solutions in games: There are first-order methods for finding solutions in games, including extragradient [56, 57], optimistic gradient [58, 59], negative momentum [60], complex momentum [61], and iterate averaging [62]. There are also higher-order methods such as consensus optimization [63], symplectic gradient adjustment (SGA) [64], local symplectic surgery (LSS) [65], and competitive gradient descent (CGD) [66].

Diversity in multi-agent RL: Recently, a series of works have explored the benefit of diversity in both competitive [8, 67, 68] and cooperative [69, 70] multi-agent RL. However, once again, these approaches all consider augmented loss functions. Instead, we take inspiration from RR and extend it to the multi-agent setting.

B. Experiment Details

B.1. Test Problems

Iterated Prisoners' Dilemma (IPD): This game is an infinite sequence of the Prisoner's Dilemma, where the future payoff is discounted by a factor $\gamma \in [0, 1)$. Each agent is conditioned on the actions in the prior state $(s)$. Thus, there are 5 parameters for each agent $i$: $P^{i}(C \mid s)$, the probability of cooperating at the start state $s_{0}=\emptyset$ or at state $s_{t}=\left(a_{t-1}^{1}, a_{t-1}^{2}\right)$ for $t>0$. There are two Nash equilibria which we are interested in: Defect-Defect (DD), where agents are selfish (resulting in poor reward), and tit-for-tat (TT), where agents initially cooperate and then copy the opponent's action (resulting in higher reward).
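Because the policies and the induced Markov chain over joint actions are tiny, the discounted IPD return can be computed in closed form instead of by sampling rollouts. Below is a minimal sketch of that computation, assuming standard prisoners' dilemma payoffs of (-1, -3, 0, -2) for the joint actions (CC, CD, DC, DD) and a sigmoid parameterization of the 5 cooperation probabilities; the function name `ipd_exact_returns` and the payoff values are illustrative assumptions rather than the exact setup used in our experiments.

```python
import numpy as np

def ipd_exact_returns(theta1, theta2, gamma=0.96):
    """Exact discounted IPD returns for two 5-parameter policies (a sketch).

    theta1, theta2: length-5 logits; a sigmoid gives P(cooperate) at the start
    state and at each previous joint action (CC, CD, DC, DD).
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    p1 = sigmoid(np.asarray(theta1, dtype=float))
    p2 = sigmoid(np.asarray(theta2, dtype=float))

    # Per-state payoffs for joint actions (CC, CD, DC, DD); standard
    # prisoners' dilemma values, assumed here for illustration.
    r1 = np.array([-1., -3.,  0., -2.])
    r2 = np.array([-1.,  0., -3., -2.])

    # Distribution over the first joint action, from the start-state parameters.
    p0 = np.array([p1[0] * p2[0],
                   p1[0] * (1 - p2[0]),
                   (1 - p1[0]) * p2[0],
                   (1 - p1[0]) * (1 - p2[0])])

    # Transition matrix between joint states s_t = (a^1_{t-1}, a^2_{t-1});
    # both players condition on the same joint state, ordered (CC, CD, DC, DD).
    c1, c2 = p1[1:], p2[1:]
    P = np.stack([c1 * c2, c1 * (1 - c2), (1 - c1) * c2, (1 - c1) * (1 - c2)],
                 axis=1)

    # Discounted visitation d = p0^T (I - gamma P)^{-1}; the return is d @ r_i.
    d = np.linalg.solve((np.eye(4) - gamma * P).T, p0)
    return d @ r1, d @ r2

# Tit-for-tat vs. tit-for-tat: cooperate at the start, then cooperate exactly
# when the opponent cooperated on the previous step.
tft1 = np.array([10.,  10., -10.,  10., -10.])  # player 1 reacts to a^2_{t-1}
tft2 = np.array([10.,  10.,  10., -10., -10.])  # player 2 reacts to a^1_{t-1}
print(ipd_exact_returns(tft1, tft2))  # close to (-1/(1-gamma), -1/(1-gamma)) = (-25, -25)
```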
Small IPD: This is a 2-parameter simplification of IPD, which allows the DD and TT Nash equilibria. We fix the strategy when our opponent defects to be defection with high probability. We also constrain the probability of cooperating to depend only on whether the opponent cooperates, and in the initial state we assume our opponent cooperated. This game allows us to visualize some of the optimization difficulties of the full-scale IPD; however, unlike the full-scale IPD, its game Hessian has strictly real EVals. See Fig. 2, top, for a visualization of the strategy space.
Matching Pennies: This is a simplified 2-parameter version of rock-paper-scissors, where each player selects Cooperate or Defect. This game has a single Nash equilibrium, where each player selects their action with uniform probability. Notably, this problem's game Hessian has purely imaginary EVals, so following the gradient does not converge to the solution and we need a method for learning in games, like LOLA. Also, because this game has only a single solution, it is a poor fit for evaluating RR, which finds a diversity of solutions. See Fig. 2, bottom, for a visualization of the strategy space. A small sketch of the purely imaginary spectrum follows.
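To make the purely imaginary EVals concrete, here is a minimal sketch with an assumed +/-1 payoff for matching and a direct probability parameterization (the parameterization used in our experiments may differ). At the uniform equilibrium, the Jacobian of the joint-gradient field, i.e. the game Hessian, has eigenvalues of roughly +/-4i, which is why simultaneous gradient descent cycles rather than converges.

```python
import numpy as np

# Player A cooperates with probability p, player B with probability q.
# Assume A wants the actions to match and B wants them to mismatch, with
# losses L_A = -(2p - 1)(2q - 1) and L_B = +(2p - 1)(2q - 1).
def joint_grad(p, q):
    # Each player differentiates only their own loss w.r.t. their own parameter.
    return np.array([-2.0 * (2 * q - 1),   # dL_A/dp
                      2.0 * (2 * p - 1)])  # dL_B/dq

def game_hessian(w, eps=1e-5):
    # Jacobian of the joint-gradient field, via central differences.
    J = np.zeros((2, 2))
    for i in range(2):
        dw = np.zeros(2); dw[i] = eps
        J[:, i] = (joint_grad(*(w + dw)) - joint_grad(*(w - dw))) / (2 * eps)
    return J

H = game_hessian(np.array([0.5, 0.5]))  # at the uniform Nash equilibrium
print(np.linalg.eigvals(H))  # ~ +/-4i: purely imaginary, so following the
                             # gradient orbits (1/2, 1/2) instead of converging
```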
Mixing Small IPD and Matching Pennies: This game interpolates between the Small IPD and Matching Pennies games, with the loss for player $i$ given by $\mathcal{L}_{\text{mix}, P_{i}, \tau}=\tau \mathcal{L}_{\text{smallIPD}, P_{i}}+(1-\tau) \mathcal{L}_{\text{matchingPennies}, P_{i}}$. This problem has two solutions: one where both players cooperate, and one where both players select actions uniformly. The uniform-action solution has imaginary EVals, so it is only stable under a method for learning in games, while the both-cooperate solution has real EVals. There is a Hopf bifurcation separating these solutions, as sketched below. See Fig. 2 for standard methods on this problem and Appendix Fig. 3 to contrast this problem with Small IPD or Matching Pennies. See Appendix Fig. 5 to visualize the eigenstructure of this problem.
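One way to locate the Hopf bifurcation along the mixture weight $\tau$ is to scan $\tau$ and flag where a complex-conjugate EVal pair of the game Hessian crosses the imaginary axis. The sketch below is a diagnostic illustration rather than the procedure used for our figures: `game_hessian_fn` is a hypothetical callable returning the game Hessian at the relevant fixed point, and the toy Hessian has EVals $(\tau - 0.5) \pm i$, so the crossing occurs at $\tau = 0.5$.

```python
import numpy as np

def hopf_crossings(game_hessian_fn, taus):
    """Return intervals of tau where a complex EVal pair's real part changes
    sign, i.e. where the pair crosses the imaginary axis (a Hopf signature)."""
    real_sign = []
    for tau in taus:
        evals = np.linalg.eigvals(game_hessian_fn(tau))
        pair = evals[np.abs(evals.imag) > 1e-8]  # keep complex-conjugate EVals
        real_sign.append(np.sign(pair.real).min() if pair.size else np.nan)
    return [(taus[i], taus[i + 1]) for i in range(len(taus) - 1)
            if real_sign[i] * real_sign[i + 1] < 0]

# Toy game Hessian with EVals (tau - 0.5) +/- i, crossing at tau = 0.5.
toy_hessian = lambda tau: np.array([[tau - 0.5, -1.0],
                                    [1.0, tau - 0.5]])
print(hopf_crossings(toy_hessian, np.linspace(0.0, 1.0, 10)))  # one interval bracketing 0.5
```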
Table 2. Notation

| Symbol | Meaning |
| :--- | :--- |
| RR | Ridge Rider [32] |
| IPD | Iterated Prisoners' Dilemma |
| GAN | Generative Adversarial Network [4] |
| LOLA | Learning with opponent learning awareness [27] |
| EVec, EVal | Shorthand for Eigenvector or Eigenvalue |
| SGD | Stochastic Gradient Descent |
| SimSGD | Simultaneous SGD |
| $:=$ | Defined to be equal to |
| $x, y, z, \cdots \in \mathbb{C}$ | Scalars |
| $\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{z}, \cdots \in \mathbb{C}^{n}$ | Vectors |
| $\boldsymbol{X}, \boldsymbol{Y}, \boldsymbol{Z}, \cdots \in \mathbb{C}^{n \times n}$ | Matrices |
| $\boldsymbol{X}^{\top}$ | The transpose of matrix $\boldsymbol{X}$ |
| $\boldsymbol{I}$ | The identity matrix |
| $\Re(z), \Im(z)$ | The real or imaginary component of $z \in \mathbb{C}$ |
| $i$ | The imaginary unit: $z \in \mathbb{C} \Longrightarrow z=\Re(z)+i \Im(z)$ |
| $\bar{z}$ | The complex conjugate of $z \in \mathbb{C}$ |
| $\lvert z\rvert:=\sqrt{z \bar{z}}$ | The magnitude or modulus of $z \in \mathbb{C}$ |
| $\arg(z)$ | The argument or phase of $z \in \mathbb{C} \Longrightarrow z=\lvert z\rvert \exp (i \arg (z))$ |
| $A, B$ | Symbols for the outer/inner players |
| $d_{A}, d_{B} \in \mathbb{N}$ | The number of weights for the outer/inner players |
| $\boldsymbol{\theta}$ | A symbol for the parameters or weights of a player |
| $\boldsymbol{\theta}_{A} \in \mathbb{R}^{d_{A}}, \boldsymbol{\theta}_{B} \in \mathbb{R}^{d_{B}}$ | The outer/inner parameters or weights |
| $\mathcal{L}: \mathbb{R}^{n} \rightarrow \mathbb{R}$ | A symbol for a loss |
| $\mathcal{L}_{A}(\boldsymbol{\theta}_{A}, \boldsymbol{\theta}_{B}), \mathcal{L}_{B}(\boldsymbol{\theta}_{A}, \boldsymbol{\theta}_{B})$ | The outer/inner losses $\mathbb{R}^{d_{A}+d_{B}} \mapsto \mathbb{R}$ |
| $\boldsymbol{g}_{A}(\boldsymbol{\theta}_{A}, \boldsymbol{\theta}_{B}), \boldsymbol{g}_{B}(\boldsymbol{\theta}_{A}, \boldsymbol{\theta}_{B})$ | Gradient of the outer/inner losses w.r.t. their weights, in $\mathbb{R}^{d_{A} / d_{B}}$ |
| $\boldsymbol{\theta}_{B}^{*}(\boldsymbol{\theta}_{A}):=\arg \min_{\boldsymbol{\theta}_{B}} \mathcal{L}_{B}(\boldsymbol{\theta}_{A}, \boldsymbol{\theta}_{B})$ | The best response of the inner player to the outer player |
| $\mathcal{L}_{A}^{*}(\boldsymbol{\theta}_{A}):=\mathcal{L}_{A}(\boldsymbol{\theta}_{A}, \boldsymbol{\theta}_{B}^{*}(\boldsymbol{\theta}_{A}))$ | The outer loss with a best-responding inner player |
| $\boldsymbol{\theta}_{A}^{*}:=\arg \min_{\boldsymbol{\theta}_{A}} \mathcal{L}_{A}^{*}(\boldsymbol{\theta}_{A})$ | Outer optimal weights with a best-responding inner player |
| $d:=d_{A}+d_{B}$ | The combined number of weights for both players |
| $\boldsymbol{\omega}:=[\boldsymbol{\theta}_{A}, \boldsymbol{\theta}_{B}] \in \mathbb{R}^{d}$ | A concatenation of the outer/inner weights |
| $\hat{\boldsymbol{g}}(\boldsymbol{\omega}):=[\boldsymbol{g}_{A}(\boldsymbol{\omega}), \boldsymbol{g}_{B}(\boldsymbol{\omega})] \in \mathbb{R}^{d}$ | A concatenation of the outer/inner gradients |
| $\boldsymbol{\omega}^{0}=[\boldsymbol{\theta}_{A}^{0}, \boldsymbol{\theta}_{B}^{0}] \in \mathbb{R}^{d}$ | The initial parameter values |
| $j$ | An iteration number |
| $\hat{\boldsymbol{g}}^{j}:=\hat{\boldsymbol{g}}(\boldsymbol{\omega}^{j}) \in \mathbb{R}^{d}$ | The joint-gradient vector field at weights $\boldsymbol{\omega}^{j}$ |
| $\nabla_{\boldsymbol{\omega}} \hat{\boldsymbol{g}}^{j}:=\nabla_{\boldsymbol{\omega}} \hat{\boldsymbol{g}}\big\rvert_{\boldsymbol{\omega}^{j}} \in \mathbb{R}^{d \times d}$ | The Jacobian of the joint-gradient $\hat{\boldsymbol{g}}$ at weights $\boldsymbol{\omega}^{j}$ |
| $\hat{\mathcal{H}}$ | The game Hessian |
| $\boldsymbol{\omega}^{\text{Saddle}}$ | A saddle point |
| $\alpha \in \mathbb{C}$ | The step size or learning rate |
| $\lambda \in \mathbb{C}, \boldsymbol{e}$ | Notation for an arbitrary EVal or EVec |
| $\operatorname{Sp}(\boldsymbol{M}) \in \mathbb{C}^{n}$ | The spectrum, or set of eigenvalues, of $\boldsymbol{M} \in \mathbb{R}^{n \times n}$ |
| $\rho(\boldsymbol{M}):=\max_{z \in \operatorname{Sp}(\boldsymbol{M})}\lvert z\rvert$ | The spectral radius, in $\mathbb{R}^{+}$, of $\boldsymbol{M} \in \mathbb{R}^{n \times n}$ |
| $\boldsymbol{F}_{\alpha}(\boldsymbol{\omega})$ | Fixed-point operator for our optimization |
| $\mathcal{A}$ | The archive from our method |
| $\gamma$ | Discount factor |
| $\tau$ | The mixture weighting for the objectives |
Figure 3. This shows the phase portrait of two standard optimization algorithms on a range of problems. Following the gradient is shown in red, while LOLA, a method for learning in games, is shown in blue. Left: The Small IPD, which has solutions in the top right and bottom left. Middle: Matching Pennies, which has a single solution in the middle. Following the gradient does not find this solution because it has imaginary EVals, so we must use a method like LOLA. Right: A mixture of Small IPD and Matching Pennies. Following the gradient only finds the solution in the top right, because the center solution has imaginary EVals. LOLA can find either solution. Takeaway: The mixture game has a range of phenomena, including an imaginary-EVal solution, a real-EVal solution, and a Hopf bifurcation. We may want to use a method for learning in games so that we can robustly converge to the different solutions.
Figure 4. This shows the spectrum of the game Hessian in log-polar coordinates during training. The spectrum at the start of training is drawn with low alpha, while at the end it is drawn with high alpha. We also color each EVec based on how much it points at a player, calculated as the ratio of the norm of the first player's components of the EVec to the norm of the entire EVec, $\left|e_{1: d_{B}}\right|_{1} /|e|_{1}$. Takeaway: Only some EVals are real and lie entirely in a single player's space; these align with search directions for single-objective RR. During training, the EVals cross the imaginary axis, i.e., where $\arg (\lambda)= \pm \pi / 2$, shown in red, indicating potential Hopf bifurcations. At the end of training we have positive (i.e., $\arg (\lambda)=0$) and negative (i.e., $\arg (\lambda)= \pm \pi$) real EVals, showing potential bifurcations that are similar to saddles.
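A plot in this style can be generated directly from the EVal/EVec decomposition of the game Hessian at each training checkpoint. Below is a minimal sketch, where `plot_spectrum`, `game_hessian`, and the checkpoint variables are placeholder names for illustration rather than our actual plotting code; the coloring follows the ratio $\left|e_{1: d_{B}}\right|_{1} /|e|_{1}$ from the caption.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_spectrum(H, d_B, **scatter_kw):
    """Log-polar scatter of Sp(H), colored by how much each EVec points at
    the first d_B coordinates, i.e. |e_{1:d_B}|_1 / |e|_1."""
    evals, evecs = np.linalg.eig(H)
    align = np.abs(evecs[:d_B]).sum(axis=0) / np.abs(evecs).sum(axis=0)
    plt.scatter(np.angle(evals), np.log10(np.abs(evals) + 1e-12),
                c=align, cmap="coolwarm", vmin=0.0, vmax=1.0, **scatter_kw)
    plt.axvline(+np.pi / 2, color="red", lw=0.5)  # the imaginary axis, where
    plt.axvline(-np.pi / 2, color="red", lw=0.5)  # Hopf bifurcations can occur
    plt.xlabel(r"$\arg(\lambda)$")
    plt.ylabel(r"$\log_{10}|\lambda|$")

# e.g., at training iteration j out of num_steps checkpoints:
#   plot_spectrum(game_hessian(omega_j), d_B=num_inner_params, alpha=j / num_steps)
```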
Figure 5. We display various aspects of the players' learning dynamics for the Small IPD and Matching Pennies mixture problem. Top left: The log-norm of the joint gradient $\hat{\boldsymbol{g}}$. When this is 0, i.e., at the corners of the grid and at the center, we are at a stationary point, which is necessary, but not sufficient, for a solution. Top right: The loss averaged over both players, allowing us to assess how desirable different solutions are. Middle left: The log magnitude of the game Hessian's first EVal $\lambda$. Middle right: The arg of $\lambda$. Bottom left: The real part of $\lambda$. Bottom right: The imaginary part of $\lambda$. Takeaway: This range of visualizations allows us to see where the stationary points are, which stationary points are solutions, how desirable the solutions are for the players, and where the bifurcation occurs. Note: It is difficult to see that the gradient norm goes to 0 near the corners of the grid, but this does, in fact, begin to happen if we get close enough (in both the mixed objective and the Small IPD).

${}^{1}$Facebook AI Research ${}^{2}$University of Toronto ${}^{3}$Vector Institute ${}^{4}$University of Oxford ${}^{5}$UC Berkeley ${}^{6}$Google Brain ${}^{7}$Radboud University. Correspondence to: <lorraine@cs.toronto.edu>.