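Loss of Plasticity in Deep Continual Learning
Abstract
Modern deep-learning systems are specialized to problem settings in which training occurs once and then never again, as opposed to continual-learning settings in which training occurs continually. If deep-learning systems are applied in a continual-learning setting, then it is well known that they may fail to remember earlier examples. More fundamental, but less well known, is that they may also lose their ability to learn on new examples, a phenomenon called loss of plasticity. We provide direct demonstrations of loss of plasticity using the MNIST and ImageNet datasets repurposed for continual learning as sequences of tasks. In ImageNet, binary classification performance dropped from 89% accuracy on an early task to 77%, about the level of a linear network, on the 2000th task. Loss of plasticity occurred with a wide range of deep network architectures, optimizers, and activation functions, and with batch normalization and dropout, but was substantially eased by $L^{2}$-regularization, particularly when combined with weight perturbation. Further, we introduce a new algorithm, continual backpropagation, which slightly modifies conventional backpropagation to reinitialize a small fraction of less-used units after each example and appears to maintain plasticity indefinitely.
Keywords: continual learning, deep learning, loss of plasticity, ImageNet, MNIST, backpropagation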
Journal: Neural Networks
[inst1]organization=Department of Computing Science, University of Alberta
[inst2]organization=CIFAR AI Chair, Alberta Machine Intelligence Institute
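1 Loss of Plasticity
Modern deep-learning systems have become specialized to problem settings in which training occurs once on a large dataset and then never again. All the early successes of deep learning in speech recognition [1] and image classification [2] used this train-once setting. Later, when deep learning was adapted to reinforcement learning (e.g., Mnih et al. [3]), techniques such as replay buffers and batching were introduced, making it very close to a train-once setting. Recent applications of deep learning such as GPT-3 [4] and DALL-E [5] were also trained as a single large batch of data. Of course, in many applications the data distribution changes over time and training must continue in some form; in these cases, the most common strategy is to keep collecting data and then, at times, train a new network from scratch, again in a train-once setting. The train-once setting is integral to the design and practice of modern deep-learning methods.
In contrast, the continual-learning problem setting focuses on learning continually from new data. The continual-learning setting is advantageous for problems in which the learning system faces a changing data stream. For example, consider a robot tasked with navigating a house. Under the train-once setting, the robot would have to be retrained from scratch, or risk becoming obsolete, every time the layout of the house changes. If the layout changes often, it would need to be retrained from scratch over and over. Under the continual-learning setting, on the other hand, the robot could simply learn from the new data and continually adapt to changes in the house. Continual learning has received growing attention in recent years, and new specialized conferences are being organized to focus on it, such as the Conference on Lifelong Learning Agents (CoLLAs). In this paper, we focus on the continual-learning setting.
When exposed to new data, deep-learning systems tend to forget most of what they have previously learned, a phenomenon known as “catastrophic forgetting.” In other words, deep-learning methods do not maintain stability in continual-learning problems. This phenomenon was first shown in early neural networks in the late 1900s [6, 7]. More recently, with the advent of deep learning, catastrophic forgetting has received renewed attention, and many papers have been devoted to maintaining stability in deep continual learning [8, 9, 10, 11, 12, 13, 14, 15].
Distinct from catastrophic forgetting, and more fundamental to continual learning, is the ability to keep learning from new data. We refer to this ability as plasticity. Maintaining plasticity is essential for continual-learning systems because it allows them to adapt to changes in the data stream. A continual-learning system that loses plasticity becomes useless if its data stream changes. In this paper, we focus on the problem of loss of plasticity. This focus is distinct from, but complementary to, the more common focus on catastrophic forgetting.
Do deep-learning systems lose plasticity in continual-learning problems? Some evidence that they do comes from the psychology literature of the early 2000s. Ellis and Lambon Ralph [16], Zevin and Seidenberg [17], and Bonin et al. [18] showed loss of plasticity in early neural networks on supervised regression problems. These papers used a setup in which a set of examples was presented to the network for a certain number of epochs, the training set was then expanded with additional examples, and training continued for another number of epochs. They found that, after controlling for the number of epochs, the error on examples in the original training set was lower than on the examples added later. These papers provide evidence that deep learning, and the backpropagation algorithm [19] on which it is based, tends to lose plasticity. A limitation of these early works is that the networks used were relatively shallow by today's standards, and the algorithms used were not the ones most popular today. For example, they did not use the ReLU activation function or the Adam optimizer. The early psychological studies of artificial neural networks provide tantalizing hints, but they do not give a definitive answer to the question of whether modern deep-learning networks exhibit loss of plasticity.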
Turning now to the machine learning literature, we can find several recent studies suggesting loss of plasticity in deep learning. Chaudhry et al. [20] observed loss of plasticity in continual image classification problems. Their experiments are not fully satisfactory as a demonstration of loss of plasticity because they introduced other confounding variables. In their setup, when a new task was presented, new outputs, called heads, were added to the network; the number of outputs grew as additional tasks were encountered. Loss of plasticity was thereby confounded with the effects of interference from old heads. Chaudhry et al. found that when old heads were removed at the start of a new task, the loss of plasticity was minimal, which suggests that the loss of plasticity they observed was mainly due to interference from old heads. A second limitation of Chaudhry et al.’s study is that it used only ten tasks and thus did not assess loss of plasticity when deep learning methods face a long sequence of tasks.
Ash and Adams [21] showed that initializing a neural network by first training it on a fraction of the dataset can lead to much lower final performance compared to the case in which the network is trained once on the entire dataset. The worse performance after pretraining is an important example of loss of plasticity in deep learning and hints at the possibility of a major weakness in full continual learning with many task changes or nonstationarity. Berariu et al. [22] extended Ash and Adams’ work to show that the more pretraining stages there were, the worse the final performance. However, Berariu et al.’s experiments were performed in a stationary problem setting, and the loss of plasticity they observed was small. Although the results in these papers hint at a fundamental loss of plasticity in deep learning systems, none directly demonstrated loss of plasticity in continual learning.
More evidence for loss of plasticity in modern deep learning has been obtained in the reinforcement learning community, where recent papers have shown substantial loss of plasticity. Nikishin et al. [23] showed that early learning in reinforcement learning problems could harm later learning, an effect they termed “primacy bias.” This result could be due to the loss of plasticity of deep learning networks in continual learning settings, as reinforcement learning is inherently continual due to changes in the policy. Lyle et al. [24] also showed that some deep reinforcement learning agents could lose the ability to learn new functions over time. These are important data points, but the conclusions that can be drawn from them are limited by the inherent complexity of modern deep reinforcement learning. All of these papers, both from the psychology literature at the turn of the century and recent papers in machine learning and reinforcement learning, are evidence that deep-learning systems lose plasticity, but fall short of a fully satisfactory demonstration of the phenomenon.
In this paper, we attempt a more definitive answer to the question of loss of plasticity in modern deep learning. We show that deep-learning methods suffer from loss of plasticity in continual supervised learning problems and that such a loss of plasticity can be severe. First, we demonstrate that deep learning suffers from loss of plasticity in a continual supervised-learning problem based on the ImageNet dataset that involves thousands of learning tasks. Using supervised learning tasks instead of reinforcement learning removes the complexity and associated confounding that inevitably arises in reinforcement learning. And having thousands of tasks allows us to assess the full extent of the loss of plasticity. Then we use two computationally cheaper problems (a variant of MNIST and the Slowly-Changing Regression problem) to establish the generality of deep learning’s loss of plasticity over a wide range of hyperparameters, optimizers, network sizes, and activation functions.
After showing the severity and generality of loss of plasticity in deep learning, we seek a deeper understanding of its causes. This leads us to explore existing methods that might ameliorate loss of plasticity. We find that some methods, like Adam [25] and dropout [26], significantly exacerbate loss of plasticity, while other methods, like $L^{2}$-regularization [27, pp. 227–230] and shrink-and-perturb [21], substantially ease the loss of plasticity in many cases. Finally, we propose a new algorithm, continual backpropagation, that robustly solves the problem of loss of plasticity in both our systematic experiments with supervised learning problems and in a preliminary experiment with a reinforcement learning problem. The usual backpropagation algorithm has two crucial components: 1) initialization of the network’s connection weights with small random numbers and 2) stochastic gradient descent on every presentation of a training example [19]. Continual backpropagation extends the initialization to all time steps by selectively reinitializing a small fraction of low-utility hidden units at every presentation of a training example. Continual backpropagation performs both gradient descent and selective reinitialization continually.
2 Loss of Plasticity in ImageNet
The main results of this paper are contained in this section. Recall that the primary point of this paper is to establish in a more definitive way that deep learning methods lose the ability to learn in continual learning. To establish this, we use the classical test problem known as ImageNet and adapt it to continual learning. ImageNet has been historically important for deep learning and is still a widely used test bed. A classical deep learning problem like ImageNet is the best place to show the issue of loss of plasticity. In this section, we show that deep learning suffers from loss of plasticity in a continual learning problem based on ImageNet.
ImageNet is a large database of images and their labels that has been influential throughout machine learning [28, Imagenet.org] and that played a pivotal role in the rise of deep learning [2]. ImageNet allowed researchers to conclusively show that deep learning can solve a problem like object recognition, a hallmark of intelligence. Classical machine-learning methods had a classification error rate of 25% [29], whereas deep-learning methods reduced it to below 5% [30], the human error rate on the dataset.
The ImageNet database comprises millions of images labelled by nouns (classes) such as types of animals and everyday objects. In the standard train-once setting, the dataset is partitioned into a training set and a test set. The learning system is first trained on the training set, and then the learner is frozen, and its performance is measured on the test set.
Table 1: Network architecture for Continual ImageNet

| Layer | Type | Filters / output size | Activation | Convolutional filter shape | Convolutional filter stride | MaxPooling filter shape | MaxPooling filter stride |
|---|---|---|---|---|---|---|---|
| 1 | Convolutional + MaxPooling | 32 filters | ReLU | (5,5) | (1,1) | (2,2) | (1,1) |
| 2 | Convolutional + MaxPooling | 64 filters | ReLU | (3,3) | (1,1) | (2,2) | (1,1) |
| 3 | Convolutional + MaxPooling | 128 filters | ReLU | (3,3) | (1,1) | (2,2) | (1,1) |
| 4 | Fully Connected | output size 128 | ReLU | - | - | - | - |
| 5 | Fully Connected | output size 128 | ReLU | - | - | - | - |
| 6 | Fully Connected | output size 2 | Linear | - | - | - | - |
We sought to adapt ImageNet from the train-once setting to the continual-learning setting while minimizing all other changes. The ImageNet database we used consists of 1000 classes, each of 700 images. We constructed from these a near-endless sequence of binary classification tasks. For example, the first task might be to distinguish Class1 from Class2, and the second task might be to distinguish Class3 from Class4. Each binary classification task was handled in the conventional way. The 700 images for each class were divided into 600 images for a training set and 100 images for a test set. On each task, the deep learning network was first trained on the training set of 1200 images, and then its classification accuracy was assessed on the test set of 200 images. That training consisted of multiple passes through the training set, called epochs. After training and testing on one task, the next task was begun based on a different pair of classes. By selecting the pairs of classes randomly without replacement, we generated a sequence of 2000 different tasks. All tasks used the downsampled 32x32 version of the ImageNet dataset, as is often done to save computation [31]. We call the resulting continual-learning problem Continual ImageNet. In this section, we use Continual ImageNet to establish loss of plasticity.
To show that deep learning loses the ability to learn, we applied a wide variety of standard deep learning networks to Continual ImageNet. We describe here a specific, representative convolutional network with ReLU activation function. The network had three convolutional-plus-max-pooling layers followed by three fully connected layers, as detailed in Table 1. This network is narrower than those commonly used on the ImageNet dataset because here we are trying to discriminate only two classes at a time rather than all 1000. The final layer consisted of just two units, the heads, corresponding to the two classes. At task changes, the input weights of the heads were reset to zero. Resetting the heads in this way can be viewed as introducing new heads for the new classes. This resetting of the output weights is not ideal for studying plasticity, as the learning system gets access to privileged information on the timing of task changes (and we do not use it in the experiments reported later in this paper). We use it here because it is closest to the standard practice in deep continual learning, which is to introduce new heads for new classes [8, 9, 10, 11].
We tested many variations of the network architectures, hyperparameters, and optimizers to obtain good performance on the first binary classification task. We now describe the details of one such training procedure applied to the network described in Table 1. The network was trained using stochastic gradient descent (SGD) with momentum on the cross-entropy loss and initialized once prior to the first task. For each task, the learning algorithm performed 250 passes through the training set using minibatches of size 100. The momentum hyperparameter was 0.9. We tested various step-size parameters but are only presenting the performance for step sizes 0.01, 0.001, and 0.0001 for clarity of the figure. We performed 30 runs for each hyperparameter value, varying the sequence of tasks and other randomness. Across different hyperparameters, the same sequences of pairs of classes were used.
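The task construction described above can be sketched as follows. This is only an illustration under stated assumptions: "without replacement" is read here as "no class pair is ever reused" (individual classes may still recur across tasks), and the function names, the seed, and the use of integer image ids are hypothetical rather than taken from the authors' code.

```python
import random

def make_task_sequence(num_classes=1000, num_tasks=2000, seed=0):
    """Sample distinct unordered class pairs, one pair per binary task."""
    rng = random.Random(seed)
    seen = set()
    tasks = []
    while len(tasks) < num_tasks:
        a, b = rng.sample(range(num_classes), 2)  # two different classes
        pair = (min(a, b), max(a, b))
        if pair not in seen:  # assumption: a pair is never used twice
            seen.add(pair)
            tasks.append(pair)
    return tasks

def split_class_images(image_ids, num_train=600):
    """Split one class's image ids into a training part and a test part."""
    return image_ids[:num_train], image_ids[num_train:]

tasks = make_task_sequence()
train_ids, test_ids = split_class_images(list(range(700)))
```

With two classes per task, the 600/100 split per class gives the 1200 training images and 200 test images per task mentioned in the text.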
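To make the layer sizes in Table 1 concrete, the following sketch traces the spatial size of the feature maps through the three convolutional-plus-max-pooling layers. It assumes a 32x32 input, no padding, and the strides listed in Table 1; the padding choice is an assumption, not something the text states, so the authors' actual implementation may produce different sizes.

```python
def conv_or_pool_out(size, kernel, stride=1):
    """Output width of an unpadded convolution or pooling window."""
    return (size - kernel) // stride + 1

def feature_map_sizes(input_size=32):
    """Trace the spatial size through the three conv + max-pool layers of Table 1."""
    layers = [(5, 32), (3, 64), (3, 128)]  # (conv kernel width, number of filters)
    size, trace = input_size, []
    for kernel, filters in layers:
        size = conv_or_pool_out(size, kernel)  # convolution, stride 1
        size = conv_or_pool_out(size, 2)       # 2x2 max pooling, stride 1
        trace.append((filters, size, size))
    return trace

trace = feature_map_sizes()
filters, height, width = trace[-1]
flattened = filters * height * width  # input width of the first fully connected layer
```

Under these assumptions the last feature map is 128 x 21 x 21, so the first fully connected layer would receive 56,448 inputs.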
As a measure of performance on a task, we used the percentage of the 200 test images that were correctly classified. This measure is plotted as a function of task number for the first ten tasks (left panel of Figure 1) and in bins of 50 tasks for all 2000 tasks (right panel of Figure 1) for the best step sizes. The first data point in the right panel is the average performance on the first 50 tasks; the next is the average performance over the next 50 tasks, and so on.
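The binning used for the right panel can be sketched as a reshape followed by a mean. The accuracy values below are synthetic placeholders generated at random, not results from the experiments, and `bin_means` is a hypothetical helper name.

```python
import numpy as np

def bin_means(per_task_accuracy, bin_size=50):
    """Average per-task accuracy over consecutive, non-overlapping bins."""
    acc = np.asarray(per_task_accuracy, dtype=float)
    if acc.size % bin_size != 0:
        raise ValueError("number of tasks must be a multiple of bin_size")
    return acc.reshape(-1, bin_size).mean(axis=1)

# Synthetic stand-in for 2000 per-task test accuracies (in percent).
rng = np.random.default_rng(0)
synthetic_accuracy = rng.uniform(50.0, 100.0, size=2000)
binned = bin_means(synthetic_accuracy)  # 40 points, one per 50 tasks
```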
The network had poor performance on the 2000th task for all hyperparameter values. The largest step size, 0.01, performed the best on the first two tasks, but performance then fell on subsequent tasks, eventually reaching a level below the linear baseline. For even larger step sizes like 0.1, the network diverged after the first few tasks. At the smaller step sizes, performance rose initially and then fell and was only slightly better than the linear baseline after 2000 tasks. The performance for intermediate step sizes like 0.003 followed the general trend of the step sizes presented in Figure 1. We have found this to be a common pattern in our experiments: for a well-tuned network, performance first improves, then falls substantially, ending near or below the linear baseline. We have observed this pattern for many network architectures, hyperparameters, and optimizers. The specific choices of network architecture, hyperparameters, and optimizers affect when the performance starts to drop, but we observed a severe performance drop for a wide range of choices. To increase confidence in the results and to ensure that the performance drop is not due to some bug in the implementation, one of the authors independently reproduced the results of this experiment.
This experiment was designed to closely follow standard deep-learning practice in every way possible, except in requiring the network to keep learning. Failure to learn better than a linear network in later tasks is substantial evidence that the standard practice of deep learning simply does not work in continual problems. We showed systematically through a direct experiment that deep learning consistently loses the ability to learn new things, i.e., deep learning loses plasticity, in continual learning problems. However, there are of course many other variations of deep learning that could be tried. To reach the next level of confidence and understanding, we switch in the next section to a less computationally complex problem where we can study the nuances of the phenomenon more thoroughly.
3 Robust Loss of Plasticity in Permuted MNIST
We now use a computationally cheap problem based on the MNIST dataset [32] to test the generality of loss of plasticity. MNIST is one of the most common supervised-learning datasets used in deep learning. It consists of 60,000 28x28 grayscale images of the handwritten digits from 0 to 9 together with their correct labels. For example, the left image in Figure 2a shows an image that is labelled by the digit 7. The smaller number of classes and the simpler images enable much smaller deep learning networks to perform well on this dataset than are needed on ImageNet. The smaller networks in turn mean much less computation is needed to perform the experiments and thus experiments can be performed in greater quantities and under a variety of different conditions, enabling us to perform deeper and more extensive studies of plasticity. As with ImageNet, MNIST is typically used in a train-once setting.
We created a continual supervised learning problem using permuted MNIST datasets [33, 34]. An individual permuted MNIST dataset is created by permuting the pixels in the original MNIST dataset. For example, the pixel that originally appeared in the upper-left corner might be moved to the lower-right corner, or to any other particular place in the image, and so on for all the other pixels. The right image in Figure 2a is an example of such a permuted image for an image labelled with the digit 7. Given a way of permuting, all 60,000 images are permuted in the same way to produce the new permuted MNIST dataset.
By repeatedly randomly selecting from the approximately $10^{1930}$ possible permutations, we created a sequence of 800 permuted MNIST datasets and supervised learning tasks. For each task, we presented each of its 60,000 images one by one in a random order to the learning network. Then we moved to the next permuted MNIST task and repeated the whole procedure, and so on for up to 800 tasks. No indication was given to the network at the time of task switching. With the pixels being permuted in a completely unrelated way, we might expect classification performance to fall dramatically at the time of each task switch. Nevertheless, across tasks, there could be some savings, some improvement of speed of learning, or alternatively there could be loss of plasticity, that is, loss of the ability to learn across tasks. The network was trained on a single pass through the data, and there were no minibatches. We call this problem Online Permuted MNIST.
We applied feedforward neural networks with three hidden layers to Online Permuted MNIST. We did not use convolutional layers, as they could not be helpful on the permuted problem because the spatial information is lost (in MNIST, convolutional layers are often not used even on the standard, non-permuted problem). For each example, the network estimated the probabilities of each of the 10 classes (0–9), compared them to the correct label, and performed SGD on the cross-entropy loss. As a measure of performance, we recorded the percentage of times the network’s highest-probability label was the correct one over a task’s 60,000 images. We plot this per-task performance measure versus task number in Figure 2b. The weights were initialized according to a Kaiming distribution.
The first panel of Figure 2b shows the progression of performance across tasks for a network with 2000 units per layer, and various values of the step-size parameter. Note that performance first increased across tasks, then began falling steadily across all subsequent tasks. This drop in performance means that the network is slowly losing the ability to learn on new tasks. This loss of plasticity is consistent with the loss of plasticity that we observed in ImageNet.
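A minimal sketch of how one permuted task can be generated, and of where the figure of roughly $10^{1930}$ comes from: a 28x28 image has 784 pixels, so there are 784! possible permutations. The random array below is only a stand-in for flattened MNIST images, and the helper name is hypothetical.

```python
import math
import numpy as np

def permute_dataset(images, permutation):
    """Apply one fixed pixel permutation to every flattened image."""
    return images[:, permutation]

rng = np.random.default_rng(0)
num_pixels = 28 * 28
images = rng.random((5, num_pixels))        # stand-in for flattened MNIST images
permutation = rng.permutation(num_pixels)   # one task corresponds to one permutation
permuted = permute_dataset(images, permutation)

# The number of distinct permutations is 784!; lgamma(n + 1) equals ln(n!).
log10_permutations = math.lgamma(num_pixels + 1) / math.log(10)
```

Because the same index array is applied to every row, each image keeps exactly the same pixel values, only in a different order, and `log10_permutations` evaluates to a little over 1930.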
Next we varied the network size. Instead of 2000 units per layer, we tried 100, 1000, and 10,000 units per layer. In this experiment we ran for only 150 tasks, primarily because the largest network took much longer to run. The time courses of performance at good step sizes for each network size are shown in the middle panel of Figure 2b. The loss of plasticity with continued training is most pronounced at the smaller network sizes, but even the largest networks show some loss of plasticity.
Next we studied the effect of the rate at which the task changed. Going back to the original network with 2000-unit layers, instead of changing the permutation each 60,000 examples, we now changed it after each 10,000, 100,000, or 1M examples, and ran for 48M examples total no matter how often the task changed. The examples in these cases were selected randomly from the permuted MNIST task’s dataset, with replacement. As a performance measure of the network on a task, we used the percent correct over all of the task’s images. That is, there were only 48 data points at the slowest rate, and 4800 data points at the fastest rate. The progression of performance is shown in the right plot in Figure 2b. Again performance fell across tasks, even if the change was very infrequent. Altogether these results show that the phenomenon of loss of plasticity robustly arises in this form of backpropagation.
Finally, we tested if different activation functions remove the loss of plasticity. We tested these activation functions in a new idealized Slowly-Changing Regression problem. It is an even cheaper problem where we can run a single experiment on a CPU in 15 minutes, allowing us to do extensive studies. The details of this problem are given in Appendix B. In this problem, we show loss of plasticity for networks with six different activation functions: sigmoid, tanh, ELU, leaky ReLU, ReLU, and Swish.
So far in the paper, we performed direct tests of loss of plasticity with backpropagation. Our experiments in Continual ImageNet, Online Permuted MNIST, and Slowly-Changing Regression show that loss of plasticity is a general phenomenon, and it can be catastrophic in some cases. It happens for a wide range of activation functions, hyperparameters, rates of distribution change, and for both under- and over-parameterized networks. Although continual learning has been studied for a long time [35, 36, 37], a direct test of the ability to maintain plasticity was missing. The experiments so far in the paper fill this gap and show that conventional backpropagation does not work in continual learning.
There are two major goals in continual learning: maintaining stability and maintaining plasticity. Maintaining stability is concerned with memorizing useful information, and maintaining plasticity is about finding new useful information when the data distribution changes. It is difficult for current deep-learning methods to remember useful information as they tend to forget previously learned information [6, 7, 37]. We focused on continually finding useful information, not on remembering useful information. Our work on loss of plasticity is different from but complementary to the work on maintaining stability.
4 Understanding Loss of Plasticity
In this section, we turn our attention to understanding why backpropagation loses plasticity in continual learning problems. The only difference in the learning algorithm across time is the network weights. In the beginning, the weights were small random numbers as they were sampled from the initial distribution; however, after learning some tasks, the weights became optimized for the most recent task. Thus, the starting weights for the next task are qualitatively different from those for the first task. As this difference in the weights is the only difference in the learning algorithm across time, the initial weight distribution must have some unique properties that make backpropagation plastic in the beginning. The initial random distribution might have many properties that enable plasticity, like the diversity of units, non-saturated units, small weight magnitude, and so on.
As we demonstrate now, many advantages of the initial distribution are lost concurrently with the loss of plasticity. The loss of each of these advantages partially explains the degradation in performance that we have observed. We then provide arguments for how the loss of these advantages could contribute to the loss of plasticity and measures that quantify how prevalent the phenomenon is. We provide an in-depth study in the Online Permuted MNIST problem that will serve as motivation for several solution methods that could mitigate the loss of plasticity.
The first noticeable phenomenon that occurs concurrently with the loss of plasticity is the continual increase in the fraction of constant units. When a unit becomes constant, the gradients flowing back from the unit become zero or very close to zero, severely slowing down learning. In the case of ReLU activations, this occurs when the output of the activations is zero for all examples of the task; such units are often said to be dead [38, 39]. In the case of the sigmoid or tanh activation functions, this phenomenon occurs when the output of a unit is too close to either of the extreme values of the activation function; such units are often said to be saturated (see [40] and [41, Chapter 19]).
To measure the number of dead units in a network with ReLU activation, we count the number of units with a value of zero for all examples in a random sample of two thousand images at the beginning of each new task. An analogous measure in the case of sigmoid or tanh activations is the number of units that are within $\epsilon$ of either of the extreme values of the function for some small positive $\epsilon$ [42]. We only focus on ReLU networks in this section.
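The dead-unit measure for ReLU networks can be sketched as below. The activations are synthetic, with two units forced to be negative before the ReLU so that they are dead by construction; the function name and array layout (examples along rows, units along columns) are assumptions made for illustration.

```python
import numpy as np

def fraction_dead_relu_units(activations):
    """Fraction of units whose ReLU output is zero on every sampled example.

    activations: array of shape (num_examples, num_units) with post-ReLU outputs.
    """
    dead = np.all(activations == 0.0, axis=0)
    return dead.mean()

rng = np.random.default_rng(0)
pre_activations = rng.normal(size=(2000, 8))  # 2000 sampled examples, 8 units
pre_activations[:, 0] = -1.0                  # unit 0 is always negative
pre_activations[:, 1] = -0.5                  # unit 1 is always negative
hidden = np.maximum(pre_activations, 0.0)     # ReLU
dead_fraction = fraction_dead_relu_units(hidden)
```

Here two of the eight units output zero on all 2000 examples, so the measured fraction is 0.25.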
In our experiments in the online permuted MNIST problem, the deterioration of online performance is accompanied by a large increase in the number of dead units (left panel of Figure 3). For the step size of 0.01, up to 25% of units die after 800 tasks. This increase in the number of dead units partially explains why the performance of backpropagation degrades over time. In the next section, we will see that methods that stop the units from dying can significantly reduce the loss of plasticity. This suggests that the increase in dead units is one of the causes of the loss of plasticity in backpropagation.
Another phenomenon that occurs with the loss of plasticity is the steady growth of the network’s average weight magnitude. We measure the average magnitude of the weights by adding up their absolute values and dividing by the total number of weights in the network. In the permuted MNIST experiment, the degradation of online classification accuracy of backpropagation observed in Figure 2b is associated with an increase in the average magnitude of the weights (center panel of Figure 3).
The growth of the network’s weights can represent a problem because large weight magnitudes are often associated with learning instability. For example, an increase in the magnitude of the weights is associated with the well-known problem of exploding gradients in recurrent neural networks [43]. Because the weights of the network are a multiplicative factor of the gradient, a steady increase in the magnitude of the weights could lead to divergence of the stochastic approximation algorithms used for training the network [44, 45]. Moreover, the convergence of descent algorithms, such as stochastic gradient descent, requires gradients to remain bounded [46].
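The weight-magnitude measure is a single ratio over all parameter arrays. In this sketch the list of arrays stands in for a network's weight matrices; whether bias terms are included is an assumption left to the caller.

```python
import numpy as np

def average_weight_magnitude(weight_arrays):
    """Sum of absolute weight values divided by the total number of weights."""
    total_abs = sum(np.abs(w).sum() for w in weight_arrays)
    total_count = sum(w.size for w in weight_arrays)
    return total_abs / total_count

# Toy weights chosen so the result is easy to check by hand: 12 / 6.
weights = [np.array([[1.0, -2.0], [0.0, 3.0]]), np.array([-4.0, 2.0])]
average_magnitude = average_weight_magnitude(weights)
```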
The last phenomenon that occurs with the loss of plasticity is the drop in the effective rank of the representation. Similar to the rank of a matrix, which represents the number of linearly independent dimensions, the effective rank takes into consideration how each dimension influences the transformation induced by a matrix [47]. A high effective rank signals that most of the dimensions of the matrix contribute equally to the transformation induced by the matrix. On the other hand, a low effective rank corresponds to few dimensions having any significant effect on the transformation, implying that the information in most of the dimensions is close to being redundant.
Formally, consider a matrix $\Phi\in\mathbb{R}^{n\times m}$ with singular values $\sigma_{k}$ for $k=1,2,...,q$, and $q=\min(n,m)$. Let $p_{k}=\sigma_{k}/\|\boldsymbol{\sigma}\|_{1}$, where $\boldsymbol{\sigma}$ is the vector containing all the singular values, and $\|\cdot\|_{1}$ is the $\ell^{1}$-norm. The effective rank of matrix $\Phi$, or $\text{erank}(\Phi)$, is defined as
$\displaystyle\text{erank}(\Phi)\overset{.}{=}\exp\left\{H(p_{1},p_{2},...,p_{q})\right\},\text{where }H(p_{1},p_{2},...,p_{q})=-\sum^{q}_{k=1}p_{k}\log(p_{k}).$  (1)
Note that the effective rank is a continuous measure that ranges between one and the rank of matrix $\Phi$.
In the case of neural networks, the effective rank of a hidden layer measures the number of units that can produce the output of the layer. If a hidden layer has low effective rank, then a small number of units can produce the output of the layer meaning that many of the units in the hidden layer are not providing any useful information. We approximate the effective rank on a random sample of two thousand examples before training on each task.
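The effective rank can be computed directly from the singular values. The sketch below follows the standard definition, the exponential of the Shannon entropy of the normalized singular values (with the usual leading minus sign and the convention $0\log 0=0$); the function name and the error raised for an all-zero matrix are choices made for this illustration.

```python
import numpy as np

def effective_rank(matrix):
    """exp of the Shannon entropy of the L1-normalized singular values."""
    sigma = np.linalg.svd(matrix, compute_uv=False)
    total = sigma.sum()
    if total == 0.0:
        raise ValueError("effective rank is undefined for an all-zero matrix")
    p = sigma / total
    p = p[p > 0]                       # convention: 0 * log(0) = 0
    entropy = -np.sum(p * np.log(p))
    return float(np.exp(entropy))

erank_identity = effective_rank(np.eye(4))        # all dimensions contribute equally
erank_rank_one = effective_rank(np.ones((4, 4)))  # a single dominant dimension
```

For a hidden layer, `matrix` would hold the layer's outputs on the sampled examples, one row per example. The identity matrix gives an effective rank of 4, while the all-ones matrix gives a value of essentially 1, matching the stated range between one and the rank.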
In our experiments, loss of plasticity is accompanied by a decrease in the average effective rank of the network (right panel of Figure 3). This phenomenon in itself is not necessarily a problem. After all, it has been shown that gradient-based optimization seems to favour low-rank solutions through implicit regularization of the loss function or implicit minimization of the rank itself [48, 49]. However, a low-rank solution might not be the best starting point for learning from new observations because most of the hidden units provide little to no information.
The decrease in effective rank could explain the loss of plasticity in our experiments in the following way. After each task, the learning algorithm finds a low-rank solution for the current task, which then serves as the initialization for the next task. As the process continues, the effective rank of the representation layer keeps decreasing after each task, limiting the number of solutions that the network is able to represent immediately at the start of each new task.
In this section, we looked deeper at the networks that lost plasticity in the Online Permuted MNIST problem. We noted that the only difference in the learning algorithm across time is the weights of the network, which means that the initial weight distribution has some properties that allowed the learning algorithm to be plastic in the beginning. And as learning progressed, the weights of the network got away from the initial distribution, and the algorithm started to lose plasticity. We found that loss of plasticity is correlated with an increasing weight magnitude, a decrease in the effective rank of the representation, and an increase in the fraction of dead units. Each of these correlates partially explains the loss of plasticity faced by backpropagation.
5 Existing DeepLearning Methods for Mitigating Loss of Plasticity
We now investigate several existing methods that could help mitigate the loss of plasticity. We study five existing methods: $L^{2}$regularization [27, pp. 227–230], dropout [26], online normalization [50], shrinkandperturb [21], and Adam [25]. We chose $L^{2}$regularization, dropout, normalization, and Adam because these methods are commonly used in deep learning practice. While shrinkandperturb is not a commonly used method, we chose it because it reduces the failure of pretraining, a problem which is an instance of loss of plasticity. For each method, we give a brief description of how the method works and give reasons for why one expects the method to work well in general. Then, to assess if these methods can mitigate the loss of plasticity, we apply them on Online Permuted MNIST. We also use the three correlates of loss of plasticity found in the previous section to get a deeper understanding of the performance of these methods. At the end of this section, we present results on Online Permuted MNIST and discuss the benefits and shortcomings of each method.
An intuitive way to address the loss of plasticity is to use parameter regularization as loss of plasticity is associated with a growth of weight magnitudes, shown in Section 4. We used $L^{2}$regularization, which adds a penalty to the loss function proportional to the $\ell^{2}$norm of the weights of the network. The networks trained using $L^{2}$regularization are expected to have a smaller weight magnitude than networks trained with backpropagation alone because $L^{2}$regularization directly reduces the weight magnitude. Additionally, it is possible that stopping the growth of weight magnitudes would reduce the loss of plasticity as growing weight magnitude is associated with loss of plasticity.
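To see why this kind of regularization directly reduces weight magnitude, consider one SGD step on a loss with an added penalty. The sketch assumes the common form of the penalty, $(\lambda/2)\|w\|_{2}^{2}$, whose gradient is $\lambda w$; the function name, the parameter names, and the values used are illustrative assumptions, not settings from the experiments.

```python
import numpy as np

def sgd_step_with_l2(weights, grad, step_size=0.01, weight_decay=1e-3):
    """One SGD step on loss + (weight_decay / 2) * ||weights||^2."""
    return weights - step_size * (grad + weight_decay * weights)

w = np.array([1.0, -2.0])
zero_grad = np.zeros_like(w)  # isolate the effect of the penalty term
w_next = sgd_step_with_l2(w, zero_grad, step_size=0.1, weight_decay=0.5)
```

With a zero loss gradient, each weight is simply multiplied by `1 - step_size * weight_decay` (0.95 here), which is the shrinking effect described above.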
A method related to parameter regularization is shrink-and-perturb [21]. As the name suggests, shrink-and-perturb performs two operations: first it shrinks all the weights, and then it adds random noise to them. Because of the shrinking step, shrink-and-perturb is expected to result in a smaller average weight magnitude than backpropagation. Moreover, one might expect that the parameter noise introduced by shrink-and-perturb would help decrease the number of dead units and increase the effective rank of the network. As we observed in the previous section, loss of plasticity is associated with a large number of dead units, a low effective rank, and large weights. If shrink-and-perturb mitigates all three correlates of loss of plasticity, it might reduce loss of plasticity.
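The two operations can be sketched as follows. This is a minimal illustration under stated assumptions: the noise is taken to be zero-mean Gaussian, and the names `shrink` and `noise_std` and their default values are hypothetical rather than taken from the experiments.

```python
import random


def shrink_and_perturb(weights, shrink=0.999, noise_std=0.01, rng=random):
    """Shrink every weight, then add zero-mean Gaussian noise to it.

    Assumptions: multiplicative shrinkage by `shrink` and Gaussian noise
    with standard deviation `noise_std` (both values are illustrative).
    """
    return [shrink * w + rng.gauss(0.0, noise_std) for w in weights]


# With the noise switched off, only the shrinking step remains.
shrunk_only = shrink_and_perturb([1.0, -2.0], shrink=0.5, noise_std=0.0)
# With noise, each weight is shrunk and then randomly perturbed.
perturbed = shrink_and_perturb([1.0, -2.0], shrink=0.5, noise_std=0.01)
```

Setting the noise to zero reduces the sketch to pure shrinkage, which makes the contribution of the perturbation step easy to isolate.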
An important technique in modern deep learning is dropout [26]. Dropout randomly sets each hidden unit to zero with a small probability. The original motivation for dropout was to prevent hidden units from co-adapting, that is, relying on each other to generate accurate predictions [26]. Moreover, dropout can also be seen as a method for introducing randomness during training, which prevents units from over-specializing and makes the overall network robust to noise [27, pp. 255–265]. Because units in networks trained with dropout are trained not to rely on information conveyed by other units, one might expect that dropout will increase the effective rank of the hidden layers. Additionally, the increase in the effective rank of networks trained using dropout might also be accompanied by a decrease in the loss of plasticity.
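The masking step itself can be sketched in a few lines. This is only an illustration of the description above: the function name is hypothetical, and the rescaling of surviving activations used by many implementations is deliberately left out because the text does not specify it.

```python
import random


def dropout(activations, p=0.1, rng=random):
    """Set each hidden activation to zero independently with probability p.

    Training-time behaviour only; no rescaling of the surviving units
    is applied in this sketch (an assumption made for simplicity).
    """
    return [0.0 if rng.random() < p else h for h in activations]


hidden = [0.3, 1.2, 0.0, 2.5]
kept_all = dropout(hidden, p=0.0)     # nothing is dropped
dropped_all = dropout(hidden, p=1.0)  # every unit is dropped
```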
Another commonly used technique in deep learning is batch normalization [51]. In batch normalization, the output of each hidden layer is normalized and rescaled using statistics computed from each mini-batch of data. Batch normalization was initially proposed to address the problem of internal covariate shift in neural networks [51]. After its introduction, it was also found that batch normalization improves the optimization conditions when training neural networks by smoothing the optimization landscape of the loss function [52] and by improving the condition number (the ratio of the largest to the smallest singular value) of the weight matrices of the network [50]. Normalization is expected to mitigate the problem of dead units because it is designed to ensure that units have a mean pre-activation of zero and a variance of one. Moreover, because it also improves the condition number of the weight matrices of the network [50], one might expect that networks trained with normalization would have a higher effective rank than networks trained with backpropagation alone. As a consequence of all these effects, networks trained with normalization might show a smaller loss of plasticity than networks trained with backpropagation alone.
No assessment of alternative methods would be complete without Adam [25], as it is considered one of the most useful tools in modern deep learning. The Adam optimizer is a variant of stochastic gradient descent that, instead of using the gradient directly, updates the weights using an estimate of the first moment of the gradient scaled inversely by the square root of an estimate of the second moment of the gradient. Adam has been observed to work well with nonstationary losses [25], especially in deep reinforcement learning settings [53, 54, 55]. It is possible that the robustness of Adam to nonstationary losses might help mitigate the loss of plasticity observed in nonstationary continual learning problems, such as the Permuted MNIST problem.
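For concreteness, a single Adam update for one scalar weight can be sketched as below. The decay rates 0.9 and 0.999 and the constant $10^{-8}$ are the defaults from the original Adam paper; the step size and the function name `adam_step` are illustrative assumptions.

```python
import math


def adam_step(w, g, m, v, t, step_size=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar weight w with gradient g at step t >= 1."""
    m = beta1 * m + (1 - beta1) * g        # running first moment
    v = beta2 * v + (1 - beta2) * g * g    # running second moment
    m_hat = m / (1 - beta1 ** t)           # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    w = w - step_size * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Two gradients that differ by a factor of 100 in scale.
w_a, _, _ = adam_step(w=1.0, g=4.0, m=0.0, v=0.0, t=1)
w_b, _, _ = adam_step(w=1.0, g=400.0, m=0.0, v=0.0, t=1)
```

On this first step both calls move the weight by roughly one step size, because the bias-corrected ratio reduces to approximately $g/|g|$. The size of the update is therefore largely insensitive to the scale of the gradient.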
Properly implementing these methods for the Online Permuted MNIST problem requires us to understand some of their details as well as the hyperparameters they come with. $L^{2}$-regularization adds a penalty to the loss function proportional to the $\ell^{2}$-norm of the weights of the network. This introduces a hyperparameter $\lambda$ that modulates the contribution of the penalty term. The incremental version of shrink-and-perturb introduces the same regularization term as $L^{2}$-regularization, and it also adds a small amount of random noise to the weights on each update. The noise introduces another hyperparameter, the variance of the noise. In dropout, the probability with which hidden units are set to zero is a hyperparameter, and we refer to this probability as $p$. Batch normalization is not amenable to the online setting used in the Online Permuted MNIST problem. Thus, we used online normalization [56], an online variant of batch normalization. Online normalization introduces two hyperparameters used for the incremental estimation of the statistics in the normalization steps. Adam has two hyperparameters, which are used to compute the moving averages of the first and second moments of the gradient. We used the default values of these hyperparameters proposed in the original paper.
We tested these methods in the Online Permuted MNIST problem using the same network architecture we used in Section 4: a fully connected network with three hidden layers of 2000 units each and ten output units. As in Section 4, we measure the online classification accuracy on all 60,000 examples of each task.
All these methods come with a set of hyperparameters, and the exact performance of each method depends on the values of these hyperparameters. However, we can still find hyperparameter values that represent how well each method can perform. For each method, we tested various combinations of hyperparameters. In this section, we only present results for one set of these hyperparameters. Appendix A shows the results for three different combinations of hyperparameters and provides more details on hyperparameter selection. In general, we chose the hyperparameter value that had the highest average classification accuracy over the 800 tasks. A picture emerges when we plot the curve for these hyperparameter values, shown in Figure 4a. As in the plots in Section 4, the first point in the plot is the average classification accuracy over the first task, the next point is the average over the next task, and so on. The line for each method is not necessarily its best possible performance, but it is representative of that best performance. Slightly better performance might be obtained by tuning these hyperparameters more finely; nevertheless, the pattern of results and the behaviour of these methods are well summarized by Figure 4a.
Changing the hyperparameter setting changed the performance of each method. For Adam, different values of the step-size parameter changed how fast the performance dropped. With dropout, we found that the higher the dropout probability, the faster and sharper the drop in performance. Dropout with a probability of 0.01 performed the best, and its performance was almost identical to that of backpropagation. However, Figure 4a shows the performance for a dropout probability of 0.1 to represent how the method performs. For online normalization, changing the hyperparameters changed when the performance of the method peaked, and it also slightly changed how fast the method reached its peak performance. The weight-decay parameter in $L^{2}$-regularization controlled the peak of the performance and how fast it dropped. Shrink-and-perturb was also sensitive to the variance of the noise. If the noise was too high, the loss of plasticity was much more severe, and if it was too low, it did not have any effect.
We also tested how these methods affect the three correlates of loss of plasticity we identified in Section 4. Figure 4b shows the summaries of the three correlates for the loss of plasticity for the hyperparameter settings used in Figure 4a.
For $L^{2}$-regularization, the weight magnitude does not continually increase. Moreover, as expected, the non-increasing weight magnitude is associated with a lower loss of plasticity. However, $L^{2}$-regularization does not fully mitigate the loss of plasticity. We explain this result using the other two correlates for the loss of plasticity. While $L^{2}$-regularization reduces the average weight magnitude of the network, it increases the percentage of dead units and decreases the effective rank.
Similar to $L^{2}$-regularization, shrink-and-perturb stops the weight magnitude from continually increasing. Moreover, it also reduces the percentage of dead units. Its effective rank is lower than that of backpropagation, but still higher than that of $L^{2}$-regularization. Shrink-and-perturb not only has a lower loss of plasticity than backpropagation; its loss of plasticity is minimal, and it has the highest classification accuracy on the 800th task among all the methods we have tested so far.
One would have thought that dropout would increase the effective rank compared to backpropagation. This is not what we found; dropout has about the same effective rank as backpropagation. Moreover, dropout has about the same weight magnitude and number of dead units as backpropagation. Surprisingly, dropout resulted in a higher loss of plasticity than backpropagation. The poor performance of dropout is not explained by our three correlates of loss of plasticity. A thorough investigation of dropout is beyond the scope of this paper, but it would be an interesting direction for future work.
Online normalization was expected to result in fewer dead units and a higher effective rank than networks trained with backpropagation. This happens in the earlier tasks, but both measures deteriorate over time. In the later tasks, the network trained using online normalization has a higher percentage of dead units and a lower effective rank than the network trained using backpropagation. The online classification accuracy is consistent with these results. Initially, online normalization yields better online classification accuracy, but in later tasks its classification accuracy falls below that of backpropagation.
Due to Adam’s robustness to nonstationary losses, one would have expected that Adam would result in a lower loss of plasticity than backpropagation. This is the opposite of what happens. Adam’s loss of plasticity can be described as catastrophic, as its performance plummets drastically. Consistent with our previous results, Adam scores poorly on all three correlates of loss of plasticity. There is a dramatic drop in the effective rank of the network trained with Adam. We also tested Adam with different activation functions on the Slowly-Changing Regression problem and found that loss of plasticity with Adam is usually worse than with SGD.
Many methods that one might have thought would help mitigate the loss of plasticity significantly worsened it. The loss of plasticity with Adam is particularly dramatic, and the network trained with Adam quickly lost almost all of its diversity, as measured by the effective rank. This dramatic loss of plasticity with Adam is an important result for deep reinforcement learning, as Adam is the default optimizer in deep reinforcement learning, and reinforcement learning is inherently continual due to the ever-changing policy. Similar to Adam, other commonly used methods like dropout and normalization worsen the loss of plasticity. Normalization has better performance in the beginning, but later it has a sharper drop in performance than backpropagation. In our experiment, dropout simply made the performance worse: the higher the dropout probability, the larger the loss of plasticity. These results mean that some of the most successful tools in the train-once setting do not transfer to continual learning, and we need to focus on directly developing tools for continual learning.
None of the existing methods fully maintains plasticity. Some popular methods like normalization, Adam, and dropout worsen the loss of plasticity. On the other hand, $L^{2}$-regularization and shrink-and-perturb reduce the loss of plasticity. Shrink-and-perturb is particularly effective, as it almost entirely mitigates the loss of plasticity. However, even with shrink-and-perturb, there is a minimal loss of plasticity. Additionally, both shrink-and-perturb and $L^{2}$-regularization are very sensitive to hyperparameter values. They only reduce the loss of plasticity for a very small range of hyperparameter values, while for other values they make the loss of plasticity worse. This sensitivity to hyperparameters can limit the application of these methods to continual learning. Moreover, shrink-and-perturb does not fully resolve the three correlates of loss of plasticity: it has a lower effective rank than backpropagation, and it still has a high fraction of dead units. Shrink-and-perturb shrinks the weights and adds randomness to them at each step, whereas $L^{2}$-regularization, which only shrinks the weights, does not on its own fully mitigate the loss of plasticity. This means that both parts of shrink-and-perturb are important for reducing the loss of plasticity. Somewhat surprisingly, continual injection of randomness is important for mitigating the loss of plasticity.
6 Continual Backpropagation: Stochastic Gradient Descent with Selective Reinitialization
We now attempt to develop a new algorithm that can fully mitigate loss of plasticity in continual learning problems as well as resolve all three correlates of loss of plasticity. In the previous section, we learned that continual injection of randomness is important for reducing the loss of plasticity. However, the continual injection of randomness in the previous section was tied to the idea of shrinking the weights. Prior work [57] proposed a more direct way of continually injecting randomness by selectively reinitializing low-utility units in the network. However, the ideas presented in that paper were not fully developed and could only be used with neural networks with a single hidden layer and a single output, so they cannot be used with modern deep learning in their current form. In this section, we fully develop the idea of selective reinitialization so that it can be used with modern deep learning. The resulting algorithm combines conventional backpropagation with selective reinitialization. We call it continual backpropagation.
In one sense, continual backpropagation is a simple and natural extension of the conventional backpropagation algorithm to continual learning. The conventional backpropagation algorithm has two main parts: initialization with small random weights and gradient descent at every time step. This algorithm is designed for the train-once setting, where learning happens once and never again. It only initializes the connections with small random numbers in the beginning, but continual backpropagation does so continually. Continual backpropagation makes conventional backpropagation continual by performing similar computations at all times. The guiding principle behind continual backpropagation is that good continual learning algorithms should do similar computations at all times.
Continual backpropagation selectively reinitializes low-utility units in the network. Selective reinitialization has two steps: the first is to find low-utility units, and the second is to reinitialize them. At every time step, a fraction $\rho$ of the hidden units in every layer, called the replacement rate, is reinitialized. When a new hidden unit is added, its outgoing weights are initialized to zero, which ensures that the newly added unit does not affect the already learned function. However, initializing the outgoing weights to zero makes the new unit vulnerable to immediate reinitialization, as it has zero utility. To prevent this, new units are protected from reinitialization for a maturity threshold of $m$ updates.
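The selection and reinitialization steps for a single layer can be sketched as follows. This is an illustration under stated assumptions, not the paper's Algorithm 1: utilities are taken as given, fractional replacement counts are accumulated across steps in a hypothetical `carry` variable, and new input weights are drawn from a uniform distribution with an illustrative `init_scale`.

```python
import random


def select_units_to_reinit(utilities, ages, replacement_rate,
                           maturity_threshold, carry=0.0):
    """Pick the lowest-utility mature units of one layer.

    Returns (indices to reinitialize, updated fractional carry).
    """
    # Only units older than the maturity threshold are eligible.
    eligible = [i for i, age in enumerate(ages) if age > maturity_threshold]
    # Accumulate fractional replacements across steps (an assumption).
    carry += replacement_rate * len(eligible)
    n_replace = int(carry)
    carry -= n_replace
    lowest = sorted(eligible, key=lambda i: utilities[i])[:n_replace]
    return lowest, carry


def reinit_unit(in_weights, out_weights, ages, i, init_scale=0.1, rng=random):
    """Reinitialize unit i in place: fresh inputs, zero outputs, age reset."""
    in_weights[i] = [rng.uniform(-init_scale, init_scale) for _ in in_weights[i]]
    out_weights[i] = [0.0] * len(out_weights[i])
    ages[i] = 0


utilities = [0.5, 0.1, 0.9, 0.05]
ages = [200, 200, 200, 10]  # unit 3 is still immature
in_w = [[1.0, 1.0] for _ in range(4)]
out_w = [[1.0, 1.0] for _ in range(4)]
chosen, carry = select_units_to_reinit(utilities, ages, 0.5, 100)
for i in chosen:
    reinit_unit(in_w, out_w, ages, i)
```

In this toy layer, unit 3 has the lowest utility but is protected by the maturity threshold, so unit 1 is replaced instead. Its outgoing weights become zero and its age is reset, so it is in turn protected for the next $m$ updates.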
One major limitation of prior work on selective reinitialization is that the utility measure is limited to networks with a single hidden layer and one output. We overcome this limitation by proposing a utility measure that can be applied to arbitrary networks. Our utility measure has two parts. The first part measures the contribution of a unit to its consumers. A consumer is any unit that uses the output of a given unit; consumers can be other hidden units or the output units of the network. The second part of the utility measures a unit’s ability to adapt.
The first part of our utility measure, called the contribution utility, is defined for each connection or weight and each unit. The basic intuition behind the contribution utility is that the magnitude of the product of a unit’s activation and its outgoing weight gives information about how valuable this connection is to its consumers. If a hidden unit’s contribution to its consumer is small, its contribution can be overwhelmed by contributions from other hidden units. In such a case, the hidden unit is not useful to its consumer. The same measure of connection utility has been proposed for the network pruning problem [58]. We define the contribution utility of a hidden unit as the sum of the utilities of all its outgoing connections. The contribution utility is measured as a running average of instantaneous contributions with a decay rate, $\eta$. In a feed-forward neural network, the contribution utility, $c_{l,i,t}$, of the $i$th hidden unit in layer $l$ at time $t$ is updated as
$\displaystyle c_{l,i,t}$  $\displaystyle=\eta*c_{l,i,t-1}+(1-\eta)*|h_{l,i,t}|*\sum_{k=1}^{n_{l+1}}|w_{l,i,k,t}|,$  (2)
where $h_{l,i,t}$ is the output of the $i^{th}$ hidden unit in layer $l$ at time $t$, $w_{l,i,k,t}$ is the weight connecting the $i^{th}$ unit in layer $l$ to the $k^{th}$ unit in layer $l+1$ at time $t$, and $n_{l+1}$ is the number of units in layer $l+1$.
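The running-average update of the contribution utility for a single hidden unit can be sketched as below, taking magnitudes of the activation and of the outgoing weights as described in the text. The function name and the value of the decay rate are illustrative assumptions.

```python
def update_contribution_utility(c_prev, h, out_weights, decay=0.99):
    """Running average of |h| * sum_k |w_k| for one hidden unit.

    c_prev      -- previous contribution utility of the unit
    h           -- current output (activation) of the unit
    out_weights -- weights from this unit to its consumers in layer l + 1
    decay       -- decay rate eta of the running average (illustrative value)
    """
    instantaneous = abs(h) * sum(abs(w) for w in out_weights)
    return decay * c_prev + (1 - decay) * instantaneous


# A unit with output -2 and outgoing weights 0.5 and -1.5 contributes
# |-2| * (0.5 + 1.5) = 4 at this step.
c_next = update_contribution_utility(c_prev=0.0, h=-2.0,
                                     out_weights=[0.5, -1.5], decay=0.9)
```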
We use a mean-corrected version of the contribution utility. The contribution utility is inspired by the network pruning problem. The network pruning problem arises in the train-once problem setting, where the connections are removed after learning is done. However, we are studying the continual learning setting, where hidden units must be replaced while learning, so we need to consider the learning process’s effect on a unit’s contribution. Specifically, we need to remove the part of the contribution from the utility which is correlated with the bias. In the special case when a consumer has just one input unit and a bias, SGD will transfer the average part of the contribution to the bias unit over time when the consumer is removed. We define the mean-corrected contribution utility, $z_{l,i,t}$, as the product of the magnitude of the connecting weight and the magnitude of the difference between the activation and its average value:
$\displaystyle f_{l,i,t}$  $\displaystyle=\eta*f_{l,i,t-1}+(1-\eta)*h_{l,i,t},$  (3)
$\displaystyle\hat{f}_{l,i,t}$  $\displaystyle=\frac{f_{l,i,t-1}}{1-\eta^{a_{l,i,t}}},$  (4)
$\displaystyle z_{l,i,t}$  $\displaystyle=\eta*z_{l,i,t-1}+(1-\eta)*|h_{l,i,t}-\hat{f}_{l,i,t}|*\sum_{k=1}^{n_{l+1}}|w_{l,i,k,t}|,$  (5)
where $h_{l,i,t}$ is the hidden unit’s output, $w_{l,i,k,t}$ is the weight connecting the hidden unit to the $k$th unit in layer $l+1$, $n_{l+1}$ is the number of units in layer $l+1$, and $a_{l,i,t}$ is the age of the hidden unit at time $t$. Here, $f_{l,i,t}$ is a running average of $h_{l,i,t}$, and $\hat{f}_{l,i,t}$ is the bias-corrected estimate. Additionally, when a unit is removed, we transfer its average contribution, $\hat{f}_{l,i,t}*w_{l,i,k,t}$, to the bias of each of its consumers so that the consumers are less affected by the unit’s removal. This idea of moving the average contribution to the bias has also been used for the network pruning task [59].
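A minimal sketch of the mean-corrected update and of the transfer to the consumers’ biases is given below, following the definitions above for a single hidden unit. The function names and the decay value are illustrative assumptions, and the unit’s age is assumed to be at least one so that the bias correction is well defined.

```python
def update_mean_corrected_utility(f_prev, z_prev, h, out_weights, age, decay=0.99):
    """Return (f_new, z_new, f_hat) for one hidden unit.

    f_hat is the bias-corrected running mean of the unit's output, computed
    from the previous running average; age >= 1 is assumed.
    """
    f_hat = f_prev / (1 - decay ** age)
    f_new = decay * f_prev + (1 - decay) * h
    instantaneous = abs(h - f_hat) * sum(abs(w) for w in out_weights)
    z_new = decay * z_prev + (1 - decay) * instantaneous
    return f_new, z_new, f_hat


def transfer_mean_to_consumer_biases(consumer_biases, out_weights, f_hat):
    """Add the unit's average contribution f_hat * w_k to each consumer's bias."""
    return [b + f_hat * w for b, w in zip(consumer_biases, out_weights)]


f_new, z_new, f_hat = update_mean_corrected_utility(
    f_prev=1.0, z_prev=0.0, h=3.0, out_weights=[1.0, -1.0], age=1, decay=0.5)
new_biases = transfer_mean_to_consumer_biases([0.0, 1.0], [1.0, -1.0], f_hat)
```

In this toy call the bias-corrected mean is 2, so only the deviation $|3-2|=1$ of the activation from its mean counts toward the utility. The mean part is what gets moved into the consumers’ biases on removal.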
The second part of our utility measure captures how fast a unit can adapt. We measure the ability to adapt as the inverse of the average magnitude of the unit’s input weights, and we call it the adaptation utility. Intuitively, the adaptation utility tries to capture how fast a hidden unit can change the function it is representing. Additionally, the inverse of the weight magnitude is a particularly reasonable measure of the speed of adaptation for Adam-type optimizers. In Adam, the change in a weight in a single update is upper bounded either by the step-size parameter or by a small multiple of the step-size parameter [25]. So, during each update, hidden units with smaller weights can have a larger relative change in the function they represent.
Finally, we define the overall utility of a hidden unit as the running average of the product of its mean-corrected contribution utility and adaptation utility. The overall utility, $\hat{u}_{l,i,t}$, becomes
$\displaystyle y_{l,i,t}$  $\displaystyle=\frac{|h_{l,i,t}-\hat{f}_{l,i,t}|*\sum_{k=1}^{n_{l+1}}|w_{l,i,k,t}|}{\sum_{j=1}^{n_{l-1}}|w_{l-1,j,i,t}|},$  (6)
$\displaystyle u_{l,i,t}$  $\displaystyle=\eta*u_{l,i,t-1}+(1-\eta)*y_{l,i,t},$  (7)
$\displaystyle\hat{u}_{l,i,t}$  $\displaystyle=\frac{u_{l,i,t-1}}{1-\eta^{a_{l,i,t}}}.$  (8)
The instantaneous overall utility is depicted in Figure 5.
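Putting the two parts together, the overall utility of a single hidden unit can be sketched as follows. The sketch divides the mean-corrected contribution by the summed magnitude of the unit’s input weights, which is assumed to be nonzero. The function name and the decay value are illustrative, and the unit’s age is assumed to be at least one.

```python
def update_overall_utility(u_prev, h, f_hat, out_weights, in_weights, age, decay=0.99):
    """Return (u_new, u_hat) for one hidden unit.

    y is the instantaneous utility: mean-corrected contribution divided by
    the summed magnitude of the input weights (assumed nonzero).
    u_hat is the bias-corrected utility computed from the previous average.
    """
    contribution = abs(h - f_hat) * sum(abs(w) for w in out_weights)
    y = contribution / sum(abs(w) for w in in_weights)
    u_hat = u_prev / (1 - decay ** age)
    u_new = decay * u_prev + (1 - decay) * y
    return u_new, u_hat


u_new, u_hat = update_overall_utility(
    u_prev=1.0, h=3.0, f_hat=2.0, out_weights=[1.0, -1.0],
    in_weights=[0.5, -0.5, 1.0], age=1, decay=0.5)
```

Units with large input weights receive a lower utility for the same contribution, reflecting the idea that they adapt more slowly.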
The final algorithm combines conventional backpropagation with selective reinitialization to continually inject random hidden units from the initial distribution. Continual backpropagation performs a gradient-descent step and a selective-reinitialization step at each update. Algorithm 1 specifies the continual backpropagation algorithm for a feed-forward neural network. Our continual backpropagation algorithm overcomes the limitations of prior work [60, 57] on selective reinitialization and makes it compatible with modern deep learning. Prior work had two significant limitations. First, their algorithm was only applicable to neural networks with a single hidden layer and a single output. Second, it was limited to LTU activations, binary weights, and SGD. We overcome all of these limitations. Our algorithm is applicable to arbitrary feed-forward networks. We describe how to use it with modern activations and optimizers like Adam in Appendix E. The name “Continual” backpropagation comes from an algorithmic perspective. The backpropagation algorithm, as proposed by [19], had two parts: initialization with small random numbers and gradient descent. However, initialization only happens at the start, so backpropagation is not a continual algorithm, as it does not do similar computations at all times. On the other hand, continual backpropagation is continual, as it performs similar computations at all times.
We then applied continual backpropagation to Continual ImageNet, Online Permuted MNIST, and Slowly-Changing Regression. We started with Online Permuted MNIST. We used the same network as in the previous section, a network with 3 hidden layers with 2000 hidden units each. We trained the network using SGD with a step size of 0.003. For continual backpropagation, we show the online classification accuracy for various values of the replacement rate. The replacement rate is the main hyperparameter in continual backpropagation; it controls how rapidly units are reinitialized in the network. For example, a replacement rate of $10^{-4}$ for our network with 2000 hidden units in each layer would mean replacing one unit in each layer after every 5 examples. The hyperparameters for $L^{2}$-regularization, shrink-and-perturb, online normalization, Adam, and dropout were chosen as described in the previous section. The online classification accuracy of various algorithms on Online Permuted MNIST is presented in Figure 6a. The results are averaged over thirty runs.
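As a quick check of the arithmetic in this example, reading the replacement rate as $\rho=10^{-4}$ and taking $n_{l}=2000$ hidden units per layer, the expected number of units replaced per example in each layer is
$\displaystyle\rho*n_{l}=10^{-4}*2000=0.2,$
so that, on average, one unit is replaced in each layer every $1/0.2=5$ examples.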
Among all the algorithms, only continual backpropagation has a non-degrading performance. The performance of all the other algorithms degrades over time. Additionally, continual backpropagation is stable for a wide range of hyperparameter values. Note that the two best-performing algorithms are continual backpropagation and shrink-and-perturb. Both of these algorithms, by design, keep weight magnitudes small and maintain diversity of representation.
Let us take a deeper look at the network that is learning via continual backpropagation. The evolution of the correlates of loss of plasticity when using continual backpropagation is shown in Figure 6b. Continual backpropagation mitigates all three correlates of loss of plasticity. It has almost no dead units, stops the network weights from growing, and maintains a high effective rank across tasks. All algorithms that maintain a low weight magnitude reduced loss of plasticity. This supports our claim that low weight magnitudes are important for maintaining plasticity. The algorithms that maintain low weight magnitudes were continual backpropagation, $L^{2}$-regularization, and shrink-and-perturb. Shrink-and-perturb and continual backpropagation have an additional advantage over $L^{2}$-regularization: they inject randomness into the network. This injection of randomness leads to a higher effective rank and a lower number of dead units, which leads to both of these algorithms performing better than $L^{2}$-regularization. However, continual backpropagation injects randomness selectively, effectively removing all dead units from the network and leading to a higher effective rank. This smaller number of dead units and higher effective rank explain the better performance of continual backpropagation.
Then we applied continual backpropagation to Continual ImageNet. As in the experiments on Continual ImageNet in Section 2, we used SGD with momentum and the same network that is described in Table 1. We also tested $L^{2}$-regularization and shrink-and-perturb on Continual ImageNet, as these are the only two methods that reduced loss of plasticity in Online Permuted MNIST. For all algorithms, we present the performance for the hyperparameter value that had the largest average classification accuracy over 5000 tasks. The classification accuracy of various algorithms on Continual ImageNet is shown in Figure 7. The results are averaged over thirty runs. The details of the hyperparameter selection for all algorithms used in Figure 7 are presented in Appendix A. The first point in the plot is the average accuracy on the first 50 tasks, the next is the average accuracy over the next 50 tasks, and so on.
Continual backpropagation fully mitigates the loss of plasticity in Continual ImageNet. Its classification accuracy on the 5000th task is better than on the first task. It also outperforms existing techniques like $L^{2}$-regularization and shrink-and-perturb.
In this section, we introduced continual backpropagation, which continually reinitializes low-utility hidden units alongside gradient descent. Continual backpropagation fully maintains plasticity in Continual ImageNet and Online Permuted MNIST. We also tested continual backpropagation on Slowly-Changing Regression and found that it overcomes the loss of plasticity for all activation functions for both SGD and Adam. Continual backpropagation outperforms all the existing methods on all three continual learning problems. It is much less sensitive to its hyperparameters than other algorithms like $L^{2}$-regularization and shrink-and-perturb. It also mitigates all three correlates of loss of plasticity, as it maintains a low average weight magnitude, a very small percentage of dead units, and a high effective rank. Additionally, we performed an ablation study for the utility measure in continual backpropagation. The results of the ablation study are presented in Appendix D, and they show that all components of the utility measure are important for best performance. Apart from the three supervised learning problems, we also tested these algorithms in a continual reinforcement learning problem. The results of that experiment are largely consistent with the results of the supervised learning experiments and are presented in Appendix C.
Continual backpropagation opens new directions for algorithmic exploration. We now take a deeper look at how continual backpropagation is connected to other ideas in the literature, and how it suggests many exciting directions for future work. We expect that future work will explore these directions and propose new and more robust variants of continual backpropagation.
Selective reinitialization uses a utility measure to find and replace low-utility units. One limitation of continual backpropagation is that the utility measure is based on heuristics. Even though it performs well, future work on more principled utility measures will improve the foundations of continual backpropagation. Our current utility measure is not a global measure of utility, as it does not consider how a given unit affects the overall represented function. One possibility is to develop utility measures where utility is propagated backwards from the loss function. The idea of utility in continual backpropagation is closely related to connection utility in the neural network pruning literature. Various papers [61, 62, 63] have proposed different measures of connection utility for the network pruning problem. Adapting these utility measures to mitigate the loss of plasticity is a promising direction for new continual backpropagation algorithms.
The idea of selective reinitialization is similar to the emerging idea of dynamic sparse training [64, 65]. In dynamic sparse training, a sparse network is trained from scratch and connections between different units are generated and removed during training. Removing connections requires a measure of utility, and the initialization of new connections requires a generator similar to selective reinitialization. The main difference between dynamic sparse training and continual backpropagation is that dynamic sparse training operates on connections between units while continual backpropagation operates on units. Consequently, the generator in dynamic sparse training must also decide which new connections to grow. Dynamic sparse training has achieved promising results in supervised and reinforcement learning problems [66, 67], where dynamic sparse networks achieve performance close to dense networks even at high sparsity levels. Dynamic sparse training is a promising idea to maintain plasticity.
The idea of adding new units to neural networks is present in the continual learning literature [68, 69, 9]. This idea is usually manifested in algorithms that dynamically increase the size of the network. So, these methods do not have an upper limit on memory requirements. For example, Rusu et al. [69] presented a method that keeps expanding the size of the network as it sees more and more data. Their method expands the network by allocating a new subnetwork whenever there is a new task. Although these methods are related to the ideas in continual backpropagation, none are suitable for comparison, as continual backpropagation is designed for the case when the learning system has finite memory. And these methods would therefore require nontrivial modification to apply to our setting of finite memory.
Prior works on the importance of initialization have focused on finding the correct weight magnitude to initialize the weights. Glorot and Bengio [40] analytically and empirically showed that it is essential to initialize the weights so that the gradients do not become exponentially small in the initial layers of a network with sigmoid activations and the gradient is preserved across layers. He et al. [70] built on this idea for the ReLU activation function. Sutskever et al. [71] showed that initialization with small weights is critical for sigmoid activations, as they may saturate if the weights are too large. Despite all this work on the importance of initialization, the fact that its benefits are only present in the beginning but not continually has been overlooked, as most of these papers focused on the train-once setting, where learning has to be done just once.
Loss of plasticity might also be connected to the lottery ticket hypothesis [72]. The hypothesis states that randomly initialized networks contain subnetworks that can achieve performance close to that of the original network with a similar number of updates. These subnetworks are called winning tickets. We found that in continual learning problems, the effective rank of the representation at the beginning of tasks decreases over time. In a sense, the network obtained after training on multiple tasks has less randomness and diversity than the original random network. The reduced randomness might mean that the network has fewer winning tickets, and this reduced number of winning tickets might explain the loss of plasticity. Fully exploring the connection between loss of plasticity and the lottery ticket hypothesis can deepen our understanding of loss of plasticity.
Some recent works have focused on quickly adapting to the changes in the data stream [73, 74, 75]. However, the problem settings in these papers were offline as they had two separate phases, one for learning and the other for evaluation. To use these methods online, they have to be pretrained on tasks that represent tasks the learner will encounter during the online evaluation phase. This requirement of having access to representative tasks in the pretraining phase is not realistic for lifelong learning systems as the real world is nonstationary, and even the distribution of tasks can change over time. These methods are not comparable to those we studied in our work, as we studied fully online methods that do not require pretraining.
In this work, we found that methods that continually injected randomness while maintaining small weight magnitudes significantly reduced the loss of plasticity. Many works have found that adding noise while training neural networks can improve training and testing performance. The main benefits of adding noise have been reported to be avoiding overfitting and improving training performance [76, 77, 78]. However, the benefits of injected noise are controversial because it can be tricky to inject noise to avoid poor performance. For example, Greff et al. [79] claimed adding noise always worsened their network performance. In our case, when the data distribution is nonstationary, we found that continually injecting noise can help maintain the plasticity in neural networks.
The loss of plasticity also provides an alternative explanation for various recent results in deep reinforcement learning. The reinforcement learning problem has various sources of nonstationarity, from the changing policy to the changing target function. Due to these nonstationarities, we expect that deep neural networks will suffer from a loss of plasticity in reinforcement learning problems. Igl et al. [80] found that deep reinforcement learning systems can lose their generalization abilities in the presence of nonstationarities. This loss of generalization ability could be an instance of plasticity loss, similar to the plasticity loss we observed in Continual ImageNet and online Permuted MNIST. Kumar et al. [81] observed a reduction in the effective rank of the representation in some deep reinforcement learning algorithms. That reduction in the effective rank is similar to the rank reduction we saw in the online Permuted MNIST problem, and their observation could be another instance of plasticity loss in deep networks. Nikishin et al. [23] proposed resetting some layers of the network to overcome plasticity loss, which is in line with our results that initialization with small random numbers is essential for learning. Their method drastically changes the network by reinitializing a large part of it, while continual backpropagation changes the network only slightly, but does so continually.
7 Discussion
In this paper, we have shown significantly more directly, thoroughly, and systematically than in prior work that standard deep learning methods fail in continual learning settings. By standard deep learning methods, we mean methods that are specialized to the train-once setting, and by fail, we mean that they lose the ability to learn new things. We tested deep learning methods in continual supervised learning problems derived from standard datasets like ImageNet and MNIST. These were the simplest experiments where loss of plasticity could have happened, as the experiments were designed to follow standard deep learning practices except for requiring the system to keep learning on new tasks. Staying close to the standard practice ensured there were no confounding issues, and the experiments directly showed that deep learning methods lose plasticity. Deep learning methods lose plasticity in both classification and regression problems. Plasticity loss happened for various activation functions, optimizers, network sizes, and various popular methods like dropout, normalization, Adam, and regularization, meaning that loss of plasticity is a widespread phenomenon. We performed these experiments systematically with the highest methodological standards by performing wide parameter sweeps and at least 30 independent runs for all methods. In continual learning problems, standard deep learning methods do not work well.
We also have a significantly better understanding of loss of plasticity and the solution methods. The root cause of loss of plasticity is that the benefits provided by initialization with small random numbers are lost over time. A deep dive into the networks revealed that many useful properties of initialization, like small weight magnitudes, few dead units, and a diverse representation, are lost over time. Many commonly used methods like Adam and dropout significantly worsened plasticity loss, while properly tuned $L^{2}$ regularization and shrink-and-perturb mitigated plasticity loss in many cases. We developed the continual backpropagation algorithm, which extends the conventional backpropagation algorithm by selectively reinitializing a small fraction of low-utility hidden units at each update. Continual backpropagation fully mitigated plasticity loss in all supervised continual learning problems. It ranks units by their utility to the network’s functioning, and more improvements are possible in how ranking is done, particularly for recurrent networks.
This work opens many directions for understanding and solving the loss of plasticity. Although we have made significant progress in understanding the loss of plasticity, it remains unclear which specific properties of initialization with small random numbers are important for maintaining plasticity. Recent work [82, 83] has made exciting progress in this direction, and it remains an important avenue for future work. A crucial missing piece in the continual learning literature is the formalization of different phenomena like loss of plasticity and catastrophic forgetting. New papers [84] are already attempting to fill this gap in the literature, and future work that formalizes these concepts will be a valuable contribution to our understanding of continual learning. One of the primary consumers of ideas in continual learning is reinforcement learning. Recent work has improved performance in many reinforcement learning problems by applying plasticity-preserving methods [85, 86, 87, 88]. Modern deep reinforcement learning methods use techniques like target networks and replay buffers to make reinforcement learning an almost train-once setting. A more direct application of continual learning is in methods that do not use techniques to make reinforcement learning a train-once setting. We look ahead to a promising future of better and more robust continual learning algorithms.
Acknowledgements
We would like to thank Martha White and Prabhat Nagarajan for their feedback on the work. We gratefully acknowledge the Digital Research Alliance of Canada for providing the computational resources to carry out the experiments in this paper. We also acknowledge the funding from the Canada CIFAR AI Chairs Program, DeepMind, the Alberta Machine Intelligence Institute (Amii), and the Natural Sciences and Engineering Research Council (NSERC) of Canada. This work was made possible by the stimulating and supportive research environment fostered by the members of the Reinforcement Learning and Artificial Intelligence (RLAI) Laboratory, particularly the insightful discussions held at the agent state group.
References
 [1] Hinton, G. et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal processing magazine 29, 82–97 (2012).
 [2] Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, vol. 25 (Curran Associates, Inc., 2012).
 [3] Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
 [4] Brown, T. et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, vol. 33, 1877–1901 (Curran Associates, Inc., 2020).
 [5] Ramesh, A. et al. Zero-shot text-to-image generation. In Proceedings of the 38th International Conference on Machine Learning, vol. 139 of Proceedings of Machine Learning Research, 8821–8831 (PMLR, 2021).
 [6] McCloskey, M. & Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, vol. 24, 109–165 (Elsevier, 1989).
 [7] French, R. M. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences 3, 128–135 (1999).
 [8] Kirkpatrick, J. et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114, 3521–3526 (2017).
 [9] Yoon, J., Yang, E., Lee, J. & Hwang, S. J. Lifelong learning with dynamically expandable networks. In International Conference on Learning Representations (2018).
 [10] Aljundi, R. et al. Online continual learning with maximal interfered retrieval. In Advances in Neural Information Processing Systems, vol. 32 (Curran Associates, Inc., 2019).
 [11] Golkar, S., Kagan, M. & Cho, K. Continual learning via neural pruning. In Real Neurons & Hidden Units: Future directions at the intersection of neuroscience and artificial intelligence @ NeurIPS 2019 (2019).
 [12] Riemer, M. et al. Learning to learn without forgetting by maximizing transfer and minimizing interference. In International Conference on Learning Representations (2019).
 [13] Rajasegaran, J., Hayat, M., Khan, S. H., Khan, F. S. & Shao, L. Random path selection for continual learning. In Advances in Neural Information Processing Systems, vol. 32 (Curran Associates, Inc., 2019).
 [14] Javed, K. & White, M. Meta-learning representations for continual learning. In Advances in Neural Information Processing Systems, vol. 32 (Curran Associates, Inc., 2019).
 [15] Veniat, T., Denoyer, L. & Ranzato, M. Efficient continual learning with modular networks and task-driven priors. In International Conference on Learning Representations (2021).
 [16] Ellis, A. W. & Lambon Ralph, M. A. Age of acquisition effects in adult lexical processing reflect loss of plasticity in maturing systems: insights from connectionist networks. Journal of Experimental Psychology: Learning, memory, and cognition 26, 1103 (2000).
 [17] Zevin, J. D. & Seidenberg, M. S. Age of acquisition effects in word reading and other tasks. Journal of Memory and Language 47, 1–29 (2002).
 [18] Bonin, P., Barry, C., Méot, A. & Chalard, M. The influence of age of acquisition in word reading and other tasks: A never ending story? Journal of Memory and Language 50, 456–476 (2004).
 [19] Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536 (1986).
 [20] Chaudhry, A., Dokania, P. K., Ajanthan, T. & Torr, P. H. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), 532–547 (2018).
 [21] Ash, J. & Adams, R. P. On warm-starting neural network training. In Advances in Neural Information Processing Systems, vol. 33, 3884–3894 (Curran Associates, Inc., 2020).
 [22] Berariu, T. et al. A study on the plasticity of neural networks. arXiv preprint arXiv:2106.00042 (2021).
 [23] Nikishin, E., Schwarzer, M., D’Oro, P., Bacon, P.-L. & Courville, A. The primacy bias in deep reinforcement learning. In International Conference on Machine Learning, 16828–16847 (PMLR, 2022).
 [24] Lyle, C., Rowland, M. & Dabney, W. Understanding and preventing capacity loss in reinforcement learning. In International Conference on Learning Representations (2022).
 [25] Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, Track Proceedings (2015).
 [26] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012).
 [27] Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).
 [28] Deng, J. et al. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255 (2009).
 [29] Russakovsky, O. et al. Imagenet large scale visual recognition challenge. International journal of computer vision 115, 211–252 (2015).
 [30] Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 7132–7141 (2018).
 [31] Chrabaszcz, P., Loshchilov, I. & Hutter, F. A downsampled variant of imagenet as an alternative to the cifar datasets. arXiv preprint arXiv:1707.08819 (2017).
 [32] LeCun, Y. & Cortes, C. MNIST handwritten digit database (1998). URL http://yann.lecun.com/exdb/mnist/.
 [33] Goodfellow, I., Mirza, M., Xiao, D., Courville, A. & Bengio, Y. An empirical investigation of catastrophic forgetting in gradient-based neural networks. In International Conference on Learning Representations (2014).
 [34] Zenke, F., Poole, B. & Ganguli, S. Continual learning through synaptic intelligence. In International Conference on Machine Learning, 3987–3995 (PMLR, 2017).
 [35] Caruana, R. Multitask learning. Machine learning 28, 41–75 (1997).
 [36] Ring, M. B. Child: A first step towards continual learning. In Learning to learn, 261–292 (Springer, 1998).
 [37] Parisi, G. I., Kemker, R., Part, J. L., Kanan, C. & Wermter, S. Continual lifelong learning with neural networks: A review. Neural Networks 113, 54–71 (2019).
 [38] Lu, L., Shin, Y., Su, Y. & Karniadakis, G. E. Dying relu and initialization: Theory and numerical examples. arXiv preprint arXiv:1903.06733 (2019).
 [39] Shin, Y. & Karniadakis, G. E. Trainability of relu networks and data-dependent initialization. Journal of Machine Learning for Modeling and Computing 1 (2020).
 [40] Glorot, X. & Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, vol. 9 of Proceedings of Machine Learning Research, 249–256 (PMLR, 2010).
 [41] Montavon, G., Orr, G. & Müller, K.-R. Neural networks: tricks of the trade, vol. 7700 (Springer, 2012).
 [42] Rakitianskaia, A. & Engelbrecht, A. Measuring saturation in neural networks. In 2015 IEEE Symposium Series on Computational Intelligence, 1423–1430 (2015).
 [43] Pascanu, R., Mikolov, T. & Bengio, Y. On the difficulty of training recurrent neural networks. In International conference on machine learning, 1310–1318 (PMLR, 2013).
 [44] Philipp, G., Song, D. & Carbonell, J. G. The exploding gradient problem demystified - definition, prevalence, impact, origin, tradeoffs, and solutions. arXiv preprint arXiv:1712.05577 (2017).
 [45] Philipp, G., Song, D. & Carbonell, J. G. Gradients explode - deep networks are shallow - resnet explained (2018). URL https://openreview.net/forum?id=HkpYwMZRb.
 [46] Bertsekas, D. Nonlinear Programming (Athena Scientific, 2016), 3 edn.
 [47] Roy, O. & Vetterli, M. The effective rank: A measure of effective dimensionality. In 2007 15th European Signal Processing Conference, 606–610 (2007).
 [48] Smith, S. L., Dherin, B., Barrett, D. & De, S. On the origin of implicit regularization in stochastic gradient descent. In International Conference on Learning Representations (2021).
 [49] Razin, N. & Cohen, N. Implicit regularization in deep learning may not be explainable by norms. In Advances in Neural Information Processing Systems, vol. 33, 21174–21187 (Curran Associates, Inc., 2020).
 [50] Bjorck, N., Gomes, C. P., Selman, B. & Weinberger, K. Q. Understanding batch normalization. In Advances in Neural Information Processing Systems, vol. 31 (Curran Associates, Inc., 2018).
 [51] Ioffe, S. & Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, 448–456 (PMLR, 2015).
 [52] Santurkar, S., Tsipras, D., Ilyas, A. & Madry, A. How does batch normalization help optimization? Advances in neural information processing systems 31 (2018).
 [53] Lillicrap, T. P. et al. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR) (2016).
 [54] Mnih, V. et al. Asynchronous methods for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, vol. 48 of Proceedings of Machine Learning Research, 1928–1937 (PMLR, 2016).
 [55] Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
 [56] Chiley, V. et al. Online normalization for training neural networks. Advances in Neural Information Processing Systems 32 (2019).
 [57] Mahmood, A. R. & Sutton, R. S. Representation search through generate and test. In AAAI Workshop: Learning Rich Representations from Low-Level Sensors, vol. 10 (2013).
 [58] Hu, H., Peng, R., Tai, Y.-W. & Tang, C.-K. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250 (2016).
 [59] Thimm, G. & Fiesler, E. Evaluating pruning methods. In Proceedings of the International Symposium on Artificial neural networks, 20–25 (1995).
 [60] Kaelbling, L. P. Learning in embedded systems (MIT press, 1993).
 [61] LeCun, Y., Denker, J. & Solla, S. Optimal brain damage. Advances in neural information processing systems 2 (1989).
 [62] Han, S., Mao, H. & Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations (2016).
 [63] Liu, J., Xu, Z., Shi, R., Cheung, R. C. C. & So, H. K. Dynamic sparse training: Find efficient sparse network from scratch with trainable masked layers. In International Conference on Learning Representations (2020).
 [64] Mocanu, D. C. et al. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature communications 9, 1–12 (2018).
 [65] Bellec, G., Kappel, D., Maass, W. & Legenstein, R. Deep rewiring: Training very sparse deep networks. In International Conference on Learning Representations (2018).
 [66] Chen, T. et al. Chasing sparsity in vision transformers: An end-to-end exploration. Advances in Neural Information Processing Systems 34, 19974–19988 (2021).
 [67] Sokar, G., Mocanu, E., Mocanu, D. C., Pechenizkiy, M. & Stone, P. Dynamic sparse training for deep reinforcement learning. The 31st International Joint Conference on Artificial Intelligence (2022).
 [68] Zhou, G., Sohn, K. & Lee, H. Online incremental feature learning with denoising autoencoders. In Artificial intelligence and statistics, 1453–1461 (PMLR, 2012).
 [69] Rusu, A. A. et al. Progressive neural networks. arXiv preprint arXiv:1606.04671 (2016).
 [70] He, K., Zhang, X., Ren, S. & Sun, J. Delving deep into rectifiers: Surpassing humanlevel performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, 1026–1034 (2015).
 [71] Sutskever, I., Martens, J., Dahl, G. & Hinton, G. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, vol. 28 of Proceedings of Machine Learning Research, 1139–1147 (PMLR, 2013).
 [72] Frankle, J. & Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (2019).
 [73] Finn, C., Abbeel, P. & Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, vol. 70 of Proceedings of Machine Learning Research, 1126–1135 (PMLR, 2017).
 [74] Wang, Y.-X., Ramanan, D. & Hebert, M. Growing a brain: Fine-tuning by increasing model capacity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2471–2480 (2017).
 [75] Nagabandi, A. et al. Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. In International Conference on Learning Representations (2019).
 [76] Holmstrom, L., Koistinen, P. et al. Using additive noise in backpropagation training. IEEE transactions on neural networks 3, 24–38 (1992).
 [77] Graves, A., Mohamed, A.-r. & Hinton, G. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649 (2013).
 [78] Neelakantan, A. et al. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807 (2015).
 [79] Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R. & Schmidhuber, J. Lstm: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems 28, 2222–2232 (2017).
 [80] Igl, M., Farquhar, G., Luketina, J., Boehmer, W. & Whiteson, S. Transient nonstationarity and generalisation in deep reinforcement learning. In International Conference on Learning Representations (2021).
 [81] Kumar, A., Agarwal, R., Ghosh, D. & Levine, S. Implicit under-parameterization inhibits data-efficient deep reinforcement learning. In International Conference on Learning Representations (2021).
 [82] Lyle, C. et al. Understanding plasticity in neural networks. In Krause, A. et al. (eds.) Proceedings of the 40th International Conference on Machine Learning, vol. 202 of Proceedings of Machine Learning Research, 23190–23211 (PMLR, 2023).
 [83] Abbas, Z., Zhao, R., Modayil, J., White, A. & Machado, M. C. Loss of plasticity in continual deep reinforcement learning. arXiv preprint arXiv:2303.07507 (2023).
 [84] Kumar, S. et al. Continual learning as computationally constrained reinforcement learning. arXiv preprint arXiv:2307.04345 (2023).
 [85] Nikishin, E. et al. Deep reinforcement learning with plasticity injection. arXiv preprint arXiv:2305.15555 (2023).
 [86] D’Oro, P. et al. Sample-efficient reinforcement learning by breaking the replay ratio barrier. In The Eleventh International Conference on Learning Representations (2023).
 [87] Sokar, G., Agarwal, R., Castro, P. S. & Evci, U. The dormant neuron phenomenon in deep reinforcement learning. In Krause, A. et al. (eds.) Proceedings of the 40th International Conference on Machine Learning, vol. 202 of Proceedings of Machine Learning Research, 32145–32168 (PMLR, 2023).
 [88] Schwarzer, M. et al. Bigger, better, faster: Human-level Atari with human-level efficiency. In Krause, A. et al. (eds.) Proceedings of the 40th International Conference on Machine Learning, vol. 202 of Proceedings of Machine Learning Research, 30365–30380 (PMLR, 2023).
 [89] Clevert, D., Unterthiner, T. & Hochreiter, S. Fast and accurate deep network learning by exponential linear units (elus). In 4th International Conference on Learning Representations (2016).
 [90] Maas, A. L., Hannun, A. Y. & Ng, A. Y. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing (2013).
 [91] Nair, V. & Hinton, G. E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, 807–814 (2010).
 [92] Ramachandran, P., Zoph, B. & Le, Q. V. Searching for activation functions. In 6th International Conference on Learning Representations, May 3, 2018, Workshop Track Proceedings (OpenReview.net, 2018).
 [93] McCulloch, W. S. & Pitts, W. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics 5, 115–133 (1943).
 [94] Sutton, R. S. & Whitehead, S. D. Online learning with random representations. In Proceedings of the Tenth International Conference on International Conference on Machine Learning, ICML’93, 314–321 (Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993).
 [95] Coumans, E. & Bai, Y. Pybullet, a python module for physics simulation for games, robotics and machine learning. http://pybullet.org (2016).
 [96] Andrychowicz, M. et al. What matters for on-policy deep actor-critic methods? a large-scale study. In International Conference on Learning Representations (2021).
 [97] Lecun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324 (1998).
Appendix A Methods
In this section, we describe the procedure for selecting hyperparameters for various algorithms for the online Permuted MNIST and Continual ImageNet problems.
In online Permuted MNIST, we tested a wide range of hyperparameter settings for each method. In Figure 8, we show three different hyperparameter settings for each algorithm. The performance followed a similar trend for other hyperparameter values. All the algorithms used a step size of 0.003, which was the best-performing step size for backpropagation in Figure 2b (Left). In the case of shrink-and-perturb, the scaling factor was set to the same value as the best regularization parameter found for $L^{2}$ regularization, which is a suitable choice because shrink-and-perturb is equivalent to $L^{2}$ regularization with parameter noise.
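The stated equivalence between shrink-and-perturb and $L^{2}$ regularization with parameter noise can be illustrated with a single stochastic gradient step. The sketch below assumes a penalty of the form $(\lambda/2)\lVert w\rVert^{2}$ and Gaussian noise added at every update; the function names and the parameterization by a noise standard deviation rather than a variance are hypothetical choices for illustration, not a description of our implementation.

```python
import numpy as np


def l2_sgd_step(w, grad, step_size, weight_decay):
    """One SGD step on loss + (weight_decay / 2) * ||w||^2."""
    return w - step_size * (grad + weight_decay * w)


def shrink_and_perturb_step(w, grad, step_size, weight_decay, noise_std, rng):
    """Shrink the weights, take the gradient step, then add Gaussian noise.

    Assumed form for this sketch: the shrink factor is
    (1 - step_size * weight_decay), which makes the noiseless case
    identical to `l2_sgd_step`.
    """
    shrunk = (1.0 - step_size * weight_decay) * w
    noise = rng.normal(0.0, noise_std, size=w.shape)
    return shrunk - step_size * grad + noise
```

With `noise_std` set to zero the two updates coincide, which is the sense in which the same regularization parameter is a natural choice for both methods.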
Due to limited computational resources, we only used ten runs for the hyperparameter sweep. For all the methods except for dropout, we selected the hyperparameter setting that resulted in the highest average percent correct during the whole training period and ran 20 more runs with that hyperparameter value. For dropout, we selected a dropout probability of 0.1 instead of the hyperparameter setting with the highest online classification accuracy. We did this because the hyperparameter setting with the highest online classification accuracy was 0.03, which performs the same and is almost the same algorithm as backpropagation and would not be representative of the behaviour of dropout.
The parameter sweep for continual backpropagation is presented in the main body of the paper (Figure 6, right). It follows the same experimental design except that all the lines in the plot correspond to the average over 30 runs.
We now describe the hyperparameter selection for $L^{2}$ regularization, shrink-and-perturb, and continual backpropagation on Continual ImageNet. The main text presents the results for these algorithms on Continual ImageNet in Figure 7. We performed a grid search for all algorithms to find the best set of hyperparameters. The values of hyperparameters used for the grid search are described in Table 2. $L^{2}$ regularization has two hyperparameters: step size and weight decay. Shrink-and-perturb has three hyperparameters: step size, weight decay, and noise variance. We swept over two hyperparameters of continual backpropagation: step size and replacement rate. The maturity threshold in continual backpropagation was set to 100 for all sets of hyperparameters. For both backpropagation and $L^{2}$ regularization, the performance was poor for step sizes of 0.1 or 0.003. So, we chose to only use step sizes of 0.03 and 0.01 for continual backpropagation and shrink-and-perturb. We performed ten independent runs for all sets of hyperparameters. We chose the set of hyperparameters with the highest average classification accuracy over 5000 tasks as the best-performing set of hyperparameters for a given algorithm. Then we performed 20 additional runs to complete 30 runs for the best-performing set of hyperparameters to produce the results in Figure 7.
Table 2: Hyperparameter selection in Continual ImageNet

| Algorithm | Values of step size | Values of weight decay / replacement rate | Values of noise variance |
| --- | --- | --- | --- |
| $L^{2}$ regularization | {0.1, 0.03, 0.01, 0.003} | {3e-5, 1e-5, 3e-6, 1e-6} | Not applicable |
| Shrink-and-perturb | {0.03, 0.01} | {3e-5, 1e-5, 3e-6, 1e-6} | {1e-4, 1e-5, 1e-6, 1e-7} |
| Continual backpropagation | {0.03, 0.01} | {3e-3, 1e-3, 3e-4, 1e-4, 3e-5} | Not applicable |
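The grid search described above can be sketched as follows. This is a minimal illustration of enumerating the grids in Table 2 (read with negative exponents, e.g. 3e-5) and picking the configuration with the highest average accuracy across runs; the dictionary layout and the helper names `configurations` and `select_best` are hypothetical and not taken from our code.

```python
from itertools import product

# Hyperparameter grids as listed in Table 2 (exponents are negative).
GRIDS = {
    "l2_regularization": {
        "step_size": [0.1, 0.03, 0.01, 0.003],
        "weight_decay": [3e-5, 1e-5, 3e-6, 1e-6],
    },
    "shrink_and_perturb": {
        "step_size": [0.03, 0.01],
        "weight_decay": [3e-5, 1e-5, 3e-6, 1e-6],
        "noise_variance": [1e-4, 1e-5, 1e-6, 1e-7],
    },
    "continual_backprop": {
        "step_size": [0.03, 0.01],
        "replacement_rate": [3e-3, 1e-3, 3e-4, 1e-4, 3e-5],
    },
}


def configurations(grid):
    """Return every combination of values in a grid as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


def select_best(results):
    """Pick the key whose runs have the highest mean accuracy.

    `results` maps a configuration identifier to a list of per-run
    average accuracies (one entry per independent run).
    """
    return max(results, key=lambda cfg: sum(results[cfg]) / len(results[cfg]))
```

Under this reading, the sweep covers 16 configurations for $L^{2}$ regularization, 32 for shrink-and-perturb, and 10 for continual backpropagation, each evaluated with ten runs before the best one is extended to 30 runs.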
Appendix B Loss of Plasticity With Different Activations in the Slowly Changing Regression Problem
There remains the issue of the network’s activation function. We have been using ReLU, the most popular choice at present, but there are a number of other possibilities. For these experiments, we switched to an even smaller, more idealized problem. Slowly-Changing Regression is a computationally inexpensive problem where we can run a single experiment on a CPU core in 15 minutes, allowing us to do extensive studies. As its name suggests, this problem is a regression problem (meaning the labels are real numbers, with a squared loss, rather than nominal values with a cross-entropy loss), and the nonstationarity is slow and continual rather than abrupt as in a switch from one task to another. In Slowly-Changing Regression, we study loss of plasticity for networks with six popular activation functions: sigmoid, tanh, ELU [89], Leaky ReLU [90], ReLU [91], and Swish [92].
In Slowly-Changing Regression, the learner receives a sequence of examples. The input for each example is a binary vector of size $m+1$. The input has $f$ slow-changing bits, $m-f$ random bits, and then one constant bit. The first $f$ bits in the input vector change slowly. After every $T$ examples, one of the first $f$ bits is chosen uniformly at random, and its value is flipped. These first $f$ bits remain fixed for the next $T$ examples. The parameter $T$ allows us to control the rate at which the input distribution changes. The next $m-f$ bits are randomly sampled for each example. Lastly, the $(m+1)$-th bit is a bias term with a constant value 1.
The target output is generated by running the input through a neural network, which is set at the start of the experiment and kept fixed. As this network generates the target output and represents the desired solution, we call it the target network. The weights of the target network are randomly chosen to be +1 or -1. The target network has one hidden layer with a Linear Threshold Unit [93], or LTU, activation. The value of the $i$-th LTU is one if the input is above a threshold $\theta_{i}$, and zero otherwise. $\theta_{i}$ is set equal to $(m+1)\cdot\beta-S_{i}$, where $\beta\in[0,1]$ and $S_{i}$ is the number of input weights with negative value [94]. The details of the input and target function in the Slowly-Changing Regression problem are also described in Figure 9.
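The input and target generation described above can be sketched as a small data stream. This is a minimal sketch under stated assumptions: the class name, the random initial values of the slow bits, the exact step at which a bit is flipped, and the way the output bias is attached (a constant 1 appended to the hidden activations) are choices made for illustration rather than details given in the text.

```python
import numpy as np


class SlowlyChangingRegression:
    """Illustrative stream of (input, target) pairs for the problem."""

    def __init__(self, m=21, f=15, n_hidden=100, T=10_000, beta=0.7, seed=0):
        self.m, self.f, self.T = m, f, T
        self.rng = np.random.default_rng(seed)
        # Fixed target network with weights chosen uniformly from {-1, +1}.
        self.hidden_w = self.rng.choice([-1.0, 1.0], size=(n_hidden, m + 1))
        self.output_w = self.rng.choice([-1.0, 1.0], size=n_hidden + 1)
        # LTU thresholds: theta_i = (m + 1) * beta - S_i, where S_i is the
        # number of negative input weights of hidden unit i.
        num_negative = (self.hidden_w < 0).sum(axis=1)
        self.thresholds = (m + 1) * beta - num_negative
        # Assumption: the slow bits start at random values.
        self.slow_bits = self.rng.integers(0, 2, size=f).astype(float)
        self.t = 0

    def step(self):
        # After every T examples, flip one slow bit chosen uniformly.
        if self.t > 0 and self.t % self.T == 0:
            i = self.rng.integers(self.f)
            self.slow_bits[i] = 1.0 - self.slow_bits[i]
        self.t += 1
        random_bits = self.rng.integers(0, 2, size=self.m - self.f).astype(float)
        x = np.concatenate([self.slow_bits, random_bits, [1.0]])  # last bit is the bias
        hidden = (self.hidden_w @ x > self.thresholds).astype(float)
        # Assumption: the output bias is a constant 1 appended to the hidden layer.
        y = float(self.output_w @ np.append(hidden, 1.0))
        return x, y
```

Calling `step()` repeatedly yields the example sequence; a smaller `T` makes the input distribution change more often.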
The details of the specific instance of the Slowly-Changing Regression problem we use in this paper and the learning network used to predict its output are listed in Table B. Note that the target network is more complex than the learning network: the target network is wider, with 100 hidden units, while the learner has just five hidden units. Thus, because the input distribution changes every $T$ examples and the target function is more complex than what the learner can represent, there is a need to track the best approximation.
Slowly-Changing Regression problem parameters

| Parameter name | Description | Value |
| --- | --- | --- |
| $m$ | Number of input bits | 21 |
| $f$ | Number of flipping bits | 15 |
| $n$ | Number of hidden units | 100 |
| $T$ | Duration between bit flips | 10,000 time steps |
| Bias | Include bias term in input and output layers | True |
| $\theta_{i}$ | LTU threshold | $(m+1)\cdot\beta-S_{i}$ |
| $\beta$ | Proportion used in LTU threshold | 0.7 |

Learning network parameters

| Parameter name | Value |
| --- | --- |
| Number of hidden layers | 1 |
| Number of units in each hidden layer | 5 |
We applied learning networks with different activation functions to the Slowly-Changing Regression problem. The learner used the backpropagation algorithm to learn the weights of the network. We used the uniform Kaiming distribution [70] to initialize the learning network’s weights. The distribution is $U(-b,b)$ with bound $b=gain\cdot\sqrt{\frac{3}{num\_inputs}}$, where $gain$ is chosen such that the magnitude of inputs does not change across layers. For tanh, sigmoid, ReLU, and Leaky ReLU, $gain$ is 5/3, 1, $\sqrt{2}$, and $\sqrt{2/(1+\alpha^{2})}$, respectively. For ELU and Swish, we used $gain=\sqrt{2}$, as was done in the original papers [89, 92].
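The initialization above can be written out directly from the bound $b=gain\cdot\sqrt{3/num\_inputs}$. The sketch below is illustrative only: the function name, the weight-matrix layout, and the Leaky ReLU slope $\alpha=0.01$ are assumptions, since the slope used in the experiments is not stated here.

```python
import math

import numpy as np

LEAKY_SLOPE = 0.01  # assumed value of alpha; not specified in the text

# Gains as listed in the text for each activation function.
GAINS = {
    "tanh": 5.0 / 3.0,
    "sigmoid": 1.0,
    "relu": math.sqrt(2.0),
    "leaky_relu": math.sqrt(2.0 / (1.0 + LEAKY_SLOPE ** 2)),
    "elu": math.sqrt(2.0),
    "swish": math.sqrt(2.0),
}


def kaiming_uniform(num_inputs, num_outputs, activation, rng):
    """Sample a (num_outputs, num_inputs) weight matrix from U(-b, b)."""
    bound = GAINS[activation] * math.sqrt(3.0 / num_inputs)
    return rng.uniform(-bound, bound, size=(num_outputs, num_inputs))
```

For the learner in this problem, the hidden layer would be drawn as `kaiming_uniform(22, 5, "relu", np.random.default_rng(0))`, given 22 inputs (including the bias bit) and five hidden units.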
We ran the experiment on the Slowlychanging regression problem for 3M examples. For each activation and value of step size, we performed 100 independent runs. First, we generated 100 sequences of examples (inputoutput pairs) for the 100 runs. Then these 100 sequences of examples were used for experiments with all activations and values of the step size parameter. We used the same sequence of examples to control the randomness in the data stream across activations and step sizes.
The results of the experiments are shown in Figure 10. We measured the squared error, the square of the difference between the true target and the prediction made by the learning network. In Figure 10c, the squared error is presented in bins of 40k examples. This means that the first data point is the average squared error on the first 40k examples, the next is the average squared error on the next 40k examples, and so on. The shaded region in the figure shows the standard error of the binned error.
Figure 10 shows that in slowly-changing regression, after performing well initially, the error increases for all step sizes and activations. For some activations, like ReLU and tanh, loss of plasticity is severe, and the error increases to the level of the linear baseline. For other activations, like ELU, loss of plasticity is less severe but still significant. These results mean that loss of plasticity is not resolved by using commonly used activations.
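The binning procedure can be sketched as follows. The function name `binned_error` is hypothetical, and the sketch assumes that the standard error is taken across independent runs, which the text does not state explicitly.

```python
import math

def binned_error(errors_per_run, bin_size=40_000):
    """Average per-example squared errors into bins and report the standard error.

    errors_per_run: list of runs, each a list of per-example squared errors.
    Returns (means, std_errs), one entry per bin.
    """
    num_bins = len(errors_per_run[0]) // bin_size
    num_runs = len(errors_per_run)
    means, std_errs = [], []
    for b in range(num_bins):
        lo, hi = b * bin_size, (b + 1) * bin_size
        # Average squared error of each run inside this bin.
        per_run = [sum(run[lo:hi]) / bin_size for run in errors_per_run]
        mean = sum(per_run) / num_runs
        # Standard error of the mean across runs (assumption: across-run spread).
        var = sum((v - mean) ** 2 for v in per_run) / (num_runs - 1)
        means.append(mean)
        std_errs.append(math.sqrt(var / num_runs))
    return means, std_errs
```

With 3M examples and a bin size of 40k, each run contributes 75 points to the curve.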
Appendix C Extension to a continual RL problem, Slippery Ant
One of the major consumers of ideas in continual learning is the field of reinforcement learning as there are many sources of nonstationarity in reinforcement learning. In this section, we explore if the findings from continual supervised learning transfer to continual reinforcement learning (RL). However, a full exploration of this idea is beyond the scope of the current work. Fully exploring the issue of loss of plasticity in reinforcement learning is much harder because many additional confounders arise in reinforcement learning, and proper empirical research will require an extensive parameter sweep over tens of hyperparameters in each algorithm. So, the results in this section are preliminary as we did not perform full parameter sweeps in this section.
We developed a new continual RL problem, SlipperyAnt. SlipperyAnt is a continual variant of PyBullet’s [95] Ant problem. A continual variant is needed as the problems in PyBullet are stationary. In our problem, the environment changes after a prespecified time, making it necessary for the learner to adapt. We change the friction between the agent and the ground of the standard PyBullet Ant problem. In the standard problem, the value of friction is $1.5$. After every 10M time steps, we change the friction by log-uniformly sampling it from $[10^{-4},10^{4}]$.
We used the PPO algorithm [55] to solve the SlipperyAnt problem. Two separate networks were used for the policy and the value function, and both had two hidden layers with 256 units. These networks had tanh activation as it performs the best with on-policy algorithms like PPO [96]. The weights of both networks were updated using Adam.
To combine continual backpropagation with PPO, we used continual backpropagation instead of backpropagation to update the networks’ weights. Whenever the weights are updated using Adam, we also update them using generate-and-test. We call the resulting algorithm Continual PPO, and we describe it in Appendix E.
All the algorithms that we tested are built on PPO, and we used a standard set of parameters for PPO. In addition to the parameters for PPO, all algorithms require us to choose additional parameters. For $L^{2}$, we chose the best weight decay. For S&P, we chose the best perturbation for the value of weight decay found for $L^{2}$. For continual PPO, we chose the best (replacement rate, maturity threshold) pair. All parameter settings are described in Appendix E. All other parameters were the same in PPO, PPO+$L^{2}$, PPO+S&P, and continual PPO. The performance of PPO, PPO+$L^{2}$, PPO+S&P, and continual PPO with LeCun initialization [97] on SlipperyAnt is shown in Figure 11. The results are averaged over 100 runs.
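Log-uniform sampling means that the exponent, rather than the friction itself, is drawn uniformly, so every decade between $10^{-4}$ and $10^{4}$ is equally likely. A minimal sketch follows; `sample_friction` is a hypothetical helper, and how the sampled value is passed to the simulator is not shown.

```python
import random

def sample_friction(rng, low_exp=-4.0, high_exp=4.0):
    """Draw a friction value log-uniformly from [10**low_exp, 10**high_exp]."""
    # Uniform in the base-10 exponent gives a log-uniform value.
    return 10.0 ** rng.uniform(low_exp, high_exp)

# One new friction value per environment change (every 10M steps in the text).
rng = random.Random(0)
frictions = [sample_friction(rng) for _ in range(5)]
```

Sampling the friction uniformly from the same interval would instead put almost all of the mass near $10^{4}$, so low-friction environments would almost never occur.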
The performance of PPO degrades dramatically as the environment changes in SlipperyAnt. Whenever the environment changes, there is a drop in performance as the agent has to learn a new policy. Due to these sudden drops, the performance plot looks like a sawtooth, where each tooth shows the performance on one task. Across tasks, the performance of the PPO agent drops dramatically. This degradation is similar to the degradation of backpropagation’s performance on continual supervised learning problems, where it performs well initially, but its performance gets worse over time. This similarity with the performance of backpropagation is not surprising as backpropagation lies at the foundation of modern deep reinforcement learning algorithms like PPO.
All three of continual PPO, PPO+S&P, and PPO+$L^{2}$ perform substantially better than PPO. Continual PPO performs best: it continues to perform almost as well as it does initially. There is still a small performance drop with continual PPO, which could be due to the additional confounders introduced by PPO. Fully understanding and mitigating loss of plasticity in continual RL problems is an important open avenue for future work.
In this section, we performed preliminary experiments in a reinforcement learning problem. These preliminary results in continual reinforcement learning problems are largely consistent with those in continual supervised learning problems. The fact that continual PPO does not fully maintain plasticity is interesting and worthy of future exploration. Additionally, we showed that continual backpropagation can be directly combined with existing algorithms. Continual PPO is an example of how to create continual variants of existing deep reinforcement learning algorithms.
Appendix D Ablation Study for the utility measure
The overall utility measure consists of two parts, the contribution utility and the adaptation utility. We compared various parts of the utility measure on the slowly-changing regression problem. We use a learning network with tanh activation and Adam with a step size of 0.01. We also compare our utility measure with random utility and weight-magnitude-based utility. The results for this comparison are presented in Figure 12. We compared the following utility measures.

1. Random utility: the utility $r$ at every time step is sampled uniformly at random from $U[0,1]$:
$r_{l,i,t}=rand(0,1)$
2. Weight-magnitude-based utility: $wm$ at every time step is updated as:
$wm_{l,i,t}=(1-\eta)*\sum_{k=1}^{n_{l+1}}w_{l,i,k,t}+\eta*wm_{l,i,t-1}$
3. Contribution utility: $c$ at every time step is updated as described in Equation 2.
4. Mean-corrected contribution utility: $z$ at every time step is updated as described in Equation 3.
5. Adaptation utility: $a$ at every time step is updated as:
$a_{l,i,t}=(1-\eta)*\frac{1}{\sum_{j=1}^{n_{l-1}}w_{l,j,i,t}}+\eta*a_{l,i,t-1}$
6. Overall utility: $u$ at every time step is updated as described in Equation 6.
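The random, weight-magnitude, and adaptation utilities listed above share the same running-average form, which can be sketched as below. The function names are hypothetical, and the sketch follows the formulas exactly as they are printed in the list, summing the weights as written. The contribution-based utilities depend on Equations 2, 3, and 6 and are not reproduced here.

```python
import random

def decayed_average(previous, new_value, eta):
    """One step of the running average (1 - eta) * new_value + eta * previous."""
    return (1.0 - eta) * new_value + eta * previous

def random_utility(rng=random):
    """Item 1: a fresh draw from U[0, 1] at every time step."""
    return rng.random()

def weight_magnitude_utility(prev_wm, outgoing_weights, eta):
    """Item 2: running average of the summed outgoing weights of a unit."""
    return decayed_average(prev_wm, sum(outgoing_weights), eta)

def adaptation_utility(prev_a, incoming_weights, eta):
    """Item 5: running average of the reciprocal of the summed incoming weights."""
    return decayed_average(prev_a, 1.0 / sum(incoming_weights), eta)
```

A value of $\eta$ close to one makes each utility change slowly, so a single unusual time step has little effect on which units look least useful.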
The results for various utility measures are presented in Figure 12. The results show that all the components of our utility measure are needed for the best performance. They also show that our utility measure performs significantly better than random utility and weight-magnitude utility.
Next, we compared the two best-performing utility measures from the slowly-changing regression problem, overall and mean-corrected contribution, on SlipperyAnt. The results are presented in Figure 13. The results show that both utility measures perform significantly better than PPO. However, the overall utility performs significantly better than the mean-corrected contribution utility. This difference is more pronounced near the end, when continual PPO with the overall utility measure performs almost as well as at the beginning.
Appendix E Continual PPO
We used the following values of the parameters for PPO, PPO+$L^{2}$, and Continual PPO in our experiments on our non-stationary RL problem. For the parameters specific to Continual PPO and PPO+$L^{2}$, we chose the best-performing values.

1. Policy network: (256, tanh, 256, tanh, linear) + standard deviation variable
2. Value network: (256, tanh, 256, tanh, linear)
3. Iteration size: 4096
4. Number of epochs: 10
5. Mini-batch size: 128
6. GAE, $\lambda$: 0.95
7. Discount factor, $\gamma$: 0.99
8. Clip parameter: 0.2
9. Optimizer: Adam
10. Optimizer step size: $1e-4$
11. Optimizer $\beta$s: (0.9, 0.999)
12. Weight decay (for $L^{2}$): $\{10^{-3},10^{-4},10^{-5},10^{-6}\}$
13. Replacement rate, maturity threshold (for CPPO): $\{(10^{-3},1e2),(10^{-3},5e2),(10^{-4},5e3),(10^{-4},5e2),(10^{-5},1e4),(10^{-5},5e4)\}$
PPO uses the Adam optimizer. To properly use Adam with continual backpropagation, we need to modify the Adam optimizer. Adam maintains estimates of the average gradient and the average squared gradient. Continual backpropagation resets some of the weights whenever a feature is replaced. Whenever continual backpropagation resets a weight, we set its estimated gradient and squared gradient to zero. Adam also maintains a 'timestep' parameter to get an unbiased estimate of the average gradient. Again, whenever continual backpropagation resets a weight, we set its timestep parameter to zero. The continual backpropagation algorithm with Adam is described in Algorithm 3, and Algorithm 2 describes the Continual PPO algorithm.
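The Adam modification can be illustrated with a small optimizer over a flat list of weights. This is a sketch under assumptions rather than the authors' Algorithm 3: the class name `AdamWithReset`, the per-weight timestep list (one reading of "its timestep parameter"), and `eps=1e-8` are all hypothetical choices. The step size and $\beta$ values are the ones listed above.

```python
import math

class AdamWithReset:
    """Adam over a flat list of weights, with a separate step count per weight."""

    def __init__(self, num_weights, step_size=1e-4, betas=(0.9, 0.999), eps=1e-8):
        self.step_size, self.betas, self.eps = step_size, betas, eps
        self.m = [0.0] * num_weights  # running average of the gradient
        self.v = [0.0] * num_weights  # running average of the squared gradient
        self.t = [0] * num_weights    # per-weight timestep for bias correction

    def step(self, weights, grads):
        """Apply one in-place Adam update to every weight."""
        b1, b2 = self.betas
        for i, g in enumerate(grads):
            self.t[i] += 1
            self.m[i] = b1 * self.m[i] + (1 - b1) * g
            self.v[i] = b2 * self.v[i] + (1 - b2) * g * g
            # Bias correction uses the weight's own timestep.
            m_hat = self.m[i] / (1 - b1 ** self.t[i])
            v_hat = self.v[i] / (1 - b2 ** self.t[i])
            weights[i] -= self.step_size * m_hat / (math.sqrt(v_hat) + self.eps)

    def reset(self, indices):
        """Clear the Adam state of weights that were just re-initialized."""
        for i in indices:
            self.m[i] = 0.0
            self.v[i] = 0.0
            self.t[i] = 0
```

Resetting the timestep together with the two averages matters because the bias correction assumes the averages started from zero at step zero. Leaving a large timestep in place would make the first updates of a freshly re-initialized weight much smaller than intended.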