Integrating Present and Past
in Unsupervised Continual Learning

Yipeng Zhang

^{1, 2}

, Laurent Charlin

^{1, 3}

, Richard Zemel

^{4}

, and Mengye Ren

^{5}

^{1}

Mila,

^{2}

Université de Montréal,

^{3}

HEC Montréal,

^{4}

Columbia University,

^{5}

NYU
{yipeng.zhang, lcharlin}@mila.quebec
zemel@cs.columbia.edu, mengye@cs.nyu.edu Work done partially at Columbia University.

Abstract

We formulate a unifying framework for unsupervised continual learning (UCL), which disentangles learning objectives that are specific to the present and the past data, encompassing stability, plasticity, and cross-task consolidation. The framework reveals that many existing UCL approaches overlook cross-task consolidation and try to balance plasticity and stability in a shared embedding space. This results in worse performance due to a lack of within-task data diversity and reduced effectiveness in learning the current task. Our method, Osiris, which explicitly optimizes all three objectives on separate embedding spaces, achieves state-of-the-art performance on all benchmarks, including two novel benchmarks proposed in this paper featuring semantically structured task sequences. Compared to standard benchmarks, these two structured benchmarks more closely resemble visual signals received by humans and animals when navigating real-world environments. Finally, we show some preliminary evidence that continual models can benefit from such more realistic learning scenarios. ²²2Published as a conference paper at CoLLAs 2024. Code is available at https://github.com/SkrighYZ/Osiris.

1 Introduction

Humans and animals learn visual knowledge through continuous streams of experiences. In machine learning, continual learning is central to the development of learning from data that changes over time (Parisi et al., 2019). In continual learning, the learner typically encounters a non-stationary data stream through a series of learning episodes, similar to how humans learn (Kurby and Zacks, 2008), where each episode assumes a stationary data distribution. When this learning process is unlabelled, it is referred to as unsupervised continual learning (UCL) (Madaan et al., 2022; Fini et al., 2022). A popular current approach to UCL uses self-supervised learning (SSL) (Chen et al., 2020; Zbontar et al., 2021), which aims at learning invariant representations across pairs of visually similar images. Representations learned with SSL are believed to exhibit less forgetting (McCloskey and Cohen, 1989; French, 1999) than when learned with supervised objectives such as cross-entropy (Madaan et al., 2022; Davari et al., 2022). This confers SSL an important advantage since minimizing forgetting is a central objective of continual learning (Parisi et al., 2019).

However, models trained in the UCL setting still do not perform as well as models trained from iid data (offline). Learning from iid data is ideal but unrealistic when the underlying data distribution changes over time. In contrast, in UCL, models only have access to data from the present distribution and limited access to the past. Despite recent effort in advancing UCL (Madaan et al., 2022; Fini et al., 2022; Gomez-Villa et al., 2024), limited progress has been made in closing this performance gap.

In this study, we take a step back to examine what features current UCL methods learn and why such challenges persist. Our investigation yields a unifying framework that disentangles the learning objectives specific to the present and past data. In particular, our framework jointly optimizes: 1) plasticity for learning within the present episode, 2) consolidation for integrating the present and the past representations, and 3) stability for maintaining the past representations.

Our framework reveals that existing UCL methods either are not very effective at optimizing for plasticity or lack an explicit formulation of cross-task³³3We use task and episode interchangeably. consolidation. To improve plasticity, we find that it is crucial to project features to an embedding space exclusively for optimizing the current-task objective since optimizing other objectives on the same space can impair the model’s ability to adapt to the new data distribution. Meanwhile, the lack of an explicit cross-task consolidation objective reduces the data diversity within a batch and causes the present-task representations to overlap with the past ones.

We address these two limitations by explicitly optimizing all three objectives in our framework with two parallel projector branches. Our innovations are inspired by recent progress in the SSL literature on contrastive loss decomposition (Wang and Isola, 2020) and leveraging multiple embedding spaces (Xiao et al., 2021). We name our method Osiris (Optimizing stability, plasticity, and cross-task consolidation via isolated spaces). Osiris achieves state-of-the-art performance on a suite of UCL benchmarks, including the standard Split-CIFAR-100 (Rebuffi et al., 2017) where tasks consist of randomly-drawn object classes. Additionally, we find that BatchNorm (Ioffe and Szegedy, 2015) is not suitable for UCL since it presupposes a stationary distribution, and advise future studies to use GroupNorm (Wu and He, 2018) instead.

Besides existing benchmarks, we consider the impact of structure in the sequence of episodes typically encountered in our everyday experiences. We build temporally structured task sequences of CIFAR-100 and Tiny-ImageNet images (Le and Yang, 2015; Deng et al., 2009) that resemble the structure of visual signals that humans and animals receive when navigating real-world environments. Interestingly, on the Structured Tiny-ImageNet benchmark, our method outperforms the offline iid model, showing some preliminary evidence that UCL algorithms can benefit from real-world task structures.

In summary, our main contributions are:

We propose a unifying framework for UCL consisting of three objectives that integrate the present and the past tasks, and show that existing methods optimize a subset of these objectives but not all of them.
We propose Osiris, a UCL method that directly optimizes all three objectives in our framework.
For emulating more realistic learning environments, we propose two UCL benchmarks, Structured CIFAR-100 and Structured Tiny-ImageNet,that feature semantic structure on classes within or across tasks. We also propose two new metrics to measure plasticity and consolidationin UCL.
We show that Osiris achieves state-of-the-art performance on all benchmarks, matching offline iid learning on the standard Split-CIFAR-100 with five tasks, and even outperforms it on the Structured Tiny-ImageNet benchmark.

2 Preliminaries

2.1 Self-Supervised Learning

Self-supervised learning (SSL) objectives are remarkably effective in learning good representations from unlabeled image data (Chen et al., 2020; Zbontar et al., 2021; He et al., 2020; Caron et al., 2020, 2021). Their idea is to enforce the model to be invariant to low-level cropping and distortions of the image, which encourages it to encode semantically meaningful features. We focus our analysis on the representative contrastive learning method SimCLR (Chen et al., 2020) because it has well-studied geometric properties on the feature space, which can help analyses (Wang and Isola, 2020), and it exhibits strong performance in our experiments.

Formally, let $A$ be a stochastic function that applies augmentations (random cropping, color jittering, etc.) to $x_{i} \sim D$ . For brevity, we fix the anchor $x_{i}$ when describing the loss and denote the two augmented views of the anchor with $x_{i}, x_{i}^{'} = A (x_{i})$ , and augmented views of other images with $x_{j} (j \neq i)$ . Let $f_{Θ} : X \to R^{d_{E}}$ denote our hypothesis family parameterized by $Θ$ , where $d_{E}$ is the output feature dimension of our model and $X \supseteq D$ . Let $g_{Φ} : R^{d_{E}} \to R^{d_{P}}$ be a non-linear function that projects $f_{Θ} (\cdot)$ to some subspace of $R^{d_{P}}$ . Then, the contrastive loss is defined as

L^{S S L} (D; f_{Θ}, g_{Φ}) = E_{x_{i} \sim D, {x_{j}}_{j} i i d \sim D} [- log \frac{exp (z_{i}^{⊤} z_{i}^{'}) / τ}{exp (z_{i}^{⊤} z_{i}^{'}) / τ + \sum_{j \neq i} exp (z_{i}^{⊤} z_{j}) / τ}],

(1)

where $z_{i}$ is the normalized feature vector, i.e, $z_{i} = g_{Φ} (f_{Θ} (x_{i})) / ∥ g_{Φ} (f_{Θ} (x_{i})) ∥_{2}$ and similarly for $z_{i}^{'}$ and $z_{j}$ ; $z_{i}, z_{i}^{'}, z_{j} \in S^{d_{P} - 1}$ where $S^{d_{P} - 1}$ denotes the $d_{P}$ -dimensional unit sphere. $τ$ is a temperature hyperparameter which we omit for brevity in our analysis. Intuitively, the gradient of this loss with respect to $z_{i}$ is a weighted (with weights in $[0, 1]$ ) sum of $- z_{i}^{'}$ and every $z_{j} (j \neq i)$ . The optimal model minimizes the distance between representations of positive pairs and maximizes the pairwise distance of different inputs. In the remainder of this paper, we use $L^{S S L} (D; f_{Θ}, g_{Φ})$ to denote the contrastive loss in Eq. 1 within set $D$ on the normalized output space of $g_{Φ} \circ f_{Θ}$ for brevity.

Generalized contrastive loss.

We can extend Eq. 1 to a more general form, $L^{S S L} (S_{+}, S_{-}; f_{1}, f_{2})$ , to denote the asymmetric contrastive loss where we use views of the same example in set $S_{+}$ as positive pairs and views of examples in set $S_{-}$ as negatives. One augmented view of the anchor, $x_{i}$ , is encoded by $f_{1}$ and the comparands, $x_{i}^{'}$ and $x_{j}$ ’s, are encoded by $f_{2}$ . Formally,

L^{S S L} (S_{+}, S_{-}; f_{1}, f_{2}) = E_{x_{i} \sim S_{+}, {x_{j}}_{j} \sim S_{-}} [- log \frac{exp (s (f_{1} (x_{i}), f_{2} (x_{i}^{'}))}{exp (s (f_{1} (x_{i}), f_{2} (x_{i}^{'}))) + \sum_{j} exp (s (f_{1} (x_{i}), f_{2} (x_{j})))}],

(2)

where $s (\cdot, \cdot)$ denotes the cosine similarity function.

2.2 Unsupervised Continual Learning

UCL studies the problem of representation learning on a set of unlabeled data distributions, ${D_{1}, \dots, D_{T}}$ , which the learner sequentially observes. Within each task $D_{t}$ , the learner is presented a batch of randomly selected examples $X = {x_{i}}_{i = 1}^{B}$ at each step, where $x_{i} i i d \sim D_{t}$ and $B$ is the batch size. Suppose $x$ is an image that belongs to some semantic concept class; then, the learner does not know either the class label or the task label $t$ and only observes the image itself. The goal is to learn a good $Θ$ such that $f_{Θ} (x)$ encodes useful information about $x \sim X$ that can be directly used in subsequent tasks, where $⋃_{t = 1}^{T} D_{t} \subseteq X$ . A similar learning setup, with no knowledge of task labels but with class labels, is typically referred to as class-incremental learning in the supervised continual learning (SCL) literature (van de Ven et al., 2022).

To use SSL objectives as our learning signal in UCL, the expectation in Eq. 1 can be estimated by averaging the loss over all examples in a batch $X$ . For each $x_{i}$ , we use $x_{i}$ as the anchor and $X ∖ {x_{i}}$ as negatives. Two common baselines are considered in UCL:

Sequential Finetuning (FT): At task $t$ , we only sample the batch $X$ from $D_{t}$ .
Offline Training (Offline): $X$ is sampled iid from $D = ⋃_{t = 1}^{T} D_{t}$ throughout training.

In SCL, FT serves as the performance lower bound of models trained sequentially on $D_{1 \dots T}$ , whereas Offline is expected to be a soft⁴⁴4See Sec. 4.4 for cases when UCL methods outperform Offline. upper bound because it has access to the full dataset at any training step.

3 Dissecting the Learning Objective of UCL

We organize this section as follows. In Sec. 3.1, we describe three desirable properties for representation learning under the UCL setting. We then present Osiris in Sec. 3.2, which explicitly optimizes these properties. Finally, in Sec. 3.3, we show that Osiris is an instance of a more general framework, and that existing UCL methods implicitly address a subset of its components.

3.1 Three Desirable Properties

Features that facilitate plasticity or stability are commonly studied in UCL. In this section, we highlight another category of features, which we call cross-task consolidation features. We argue that UCL models need to consider plasticity, stability, and consolidation, in order to achieve good performance.

Plasticity and stability.

Plasticity refers to the model’s ability to optimize the learning objective on the present task $D_{t}$ . Intuitively, FT usually learns the present task well because it does not consider data in other tasks. On the other hand, stability refers to the model’s ability to maintain performance on past tasks. This is commonly achieved either by regularizing the model—with some previous checkpoints—in their parameter space (Kirkpatrick et al., 2017; Schwarz et al., 2018; Chaudhry et al., 2018) or their output space (Buzzega et al., 2020; Fini et al., 2022), or by jointly optimizing the learning objective on some data sampled from $D_{1 \dots t - 1}$ so that the model still performs well in expectation on previous tasks (Lin, 1992; Robins, 1995; Madaan et al., 2022). In practice, the distribution $D_{1 \dots t - 1}$ is usually estimated online with a memory buffer.

The stability-plasticity dilemma (Ditzler et al., 2015) refers to the conundrum where parameters of continual learning models need to be stable in order not to forget learned knowledge but also need to be plastic to improve the representations continually. Tackling this challenge has been the main focus of prior work in both SCL and UCL.

Cross-task consolidation.

Consolidation refers to the ability to distinguish data from different tasks. For example, if one task contains images of cats and dogs and another contains images of tigers and wolves, then learning to contrast dogs and wolves may yield fine-grained features that help reduce cross-task errors. Consolidation has been explored in SCL by leveraging class labels (Hou et al., 2019; Abati et al., 2020; Masana et al., 2021; Kim et al., 2022) or frozen representations (Aljundi et al., 2017; Wang et al., 2023), but it has been overlooked in UCL. Since we want to continually improve a unified representation for all seen data without labels, existing methods are not applicable.

3.2 Osiris: Integrating Objectives of Present and Past

Now we present Osiris, a method that explicitly optimizes plasticity, stability, and cross-task consolidation. All Osiris’s losses share the same encoder $f_{Θ}$ but may use different nonlinear MLP projectors denoted by $g_{Φ}, h_{Ψ} : R^{d_{E}} \to R^{d_{P}}$ . We illustrate the method in Fig. 1.

To estimate $D_{1 \dots t - 1}$ online, Osiris uses a memory buffer $M$ to store data examples observed by the model so far. In this study, we assume the sampling strategy for data storing and retrieval are both uniform, with the former being achieved through online reservoir sampling (Vitter, 1985). Various works have studied non-uniform storing and retrieval (Aljundi et al., 2019a, b, c; Yoon et al., 2021; Gu et al., 2022), but they are orthogonal to this study. Throughout our analysis, we use $X$ to denote a batch of data sampled iid from $D_{t}$ and $Y$ to denote a batch sampled iid from $M$ .

3.2.1 Plasticity Loss

The loss of the current task in the form of Eq. 1 is

L_{c u r r e n t} = L^{S S L} (X; f_{Θ}, g_{Φ}) .

(3)

It has been shown by Wang and Isola (2020) that, asymptotically, the perfect minimizer of $L_{c u r r e n t}$ projects all $x \in D_{t}$ uniformly to the representation space, a unit hypersphere. We hypothesize that additional losses may prevent the model from learning this solution on $D_{t}$ effectively, because the uniform distribution for $D_{t}$ is unlikely the global optima for other losses.

Fortunately, prior SSL work offer insights on tackling the stability-plasticity dilemma: the backbone encoder $f_{Θ}$ encodes the necessary information that helps minimize SSL losses on multiple nonlinearly projected output spaces (Xiao et al., 2021; Chen et al., 2020). To learn the new task effectively, we do not apply other losses on the output space of $g_{Φ} \circ f_{Θ}$ ; additional losses are calculated on representations projected from the outputs of $f_{Θ}$ with some other projector $h_{Ψ}$ . This allows the model to freely distribute $g_{Φ} (f_{Θ} (x))$ in order to optimize $L_{c u r r e n t}$ , while potentially maintaining some other distributions of $f_{Θ} (x)$ or $h_{Ψ} (f_{Θ} (x))$ . The benefit of this approach is that $f_{Θ}$ still encodes the features that help optimize $L_{c u r r e n t}$ on the output space of $g_{Φ} \circ f_{Θ}$ .

3.2.2 Stability Loss

Like most prior studies in continual learning, we introduce a loss to promote stability and reduce forgetting. We study two approaches, which we discuss next. They both use the projector $h_{Ψ}$ .

Osiris-D(istillation).

The first approach uses distillation and requires storing a frozen checkpoint of the encoder, $f_{Θ_{t - 1}^{*}}$ , at the end of task $t - 1$ . It asks the current model to predict a data example from a batch of examples encoded by the checkpoint, therefore encouraging it to retain previously learned features. The loss can be written with the notation of Eq. 2 as

(4)

The idea of $L_{p a s t - D}$ is similar to CaSSLe (Fini et al., 2022), but we distill our model with $f_{Θ_{t - 1}^{*}} (x)$ and not $g_{Φ_{t - 1}^{*}} (f_{Θ_{t - 1}^{*}} (x))$ , where $g_{Φ_{t - 1}^{*}}$ is the projector checkpoint. Our approach has several advantages: (a) it has been shown that the encoder output produces better representations than the projector output (Chen et al., 2020); (b) the gradient of our stability loss does not pass through $g_{Φ}$ , which allows more freedom in exploring the current-task features; and (c) we do not need to store $g_{Φ_{t - 1}^{*}}$ . In practice, we find the symmetric loss works better than having only the first term. Note that this induces negligible computational overheads because we can reuse the representations of $X$ to compute the second term.

Osiris-R(eplay).

The second approach does not require storing parameters. It applies the contrastive loss on $Y$ alone. The loss can be written as

L_{p a s t - R} = L^{S S L} (Y; f_{Θ}, h_{Ψ}) .

(5)

This loss prevents forgetting by optimizing the learning objective on $D_{1 \dots t - 1}$ , in expectation. It is similar to ER, to be discussed in Sec. 3.3, but uses a different projector.

Remark.

Although using both $L_{p a s t - R}$ and $L_{p a s t - D}$ may yield better performance, we emphasize that this is not our goal. Because both losses aim to reduce forgetting, we use them to demonstrate the flexibility of our framework and investigate the pros and cons of replay versus distillation in our experiments. We expect that it is possible to use any replay or output regularization method explored in SCL (e.g., DER in Buzzega et al. 2020) as our stability loss, as long as they operate on the output space of $h_{Ψ}$ .

Figure 1: Left: illustration of our method. Dashed arrows denote optional computations because the stability loss $L_{p a s t}$ can be achieved through distillation or replay. Right: conceptual loss space. A separate projector helps with optimization.

3.2.3 Cross-Task Consolidation Loss

Recall from Sec. 3.2.1 that the perfect minimizer of $L_{c u r r e n t}$ projects all data from $D_{t}$ uniformly to the representation space. Similarly, representations of $D_{1 \dots t - 1}$ encoded by the perfect minimizer of $L_{p a s t}$ are also distributed uniformly. This means that the model may still suffer from representation overlaps between $D_{1 \dots t - 1}$ and $D_{t}$ (or between a pair of tasks from $D_{1 \dots t - 1}$ ) even if they successfully optimize $L_{c u r r e n t}$ and $L_{p a s t}$ . Although using separate projectors for $L_{c u r r e n t}$ and $L_{p a s t}$ may help, we propose to introduce a loss that explicitly reduces the overlap.

Consider features that are useful in discriminating instances of $D_{t}$ from those of $D_{1 \dots t - 1}$ ; they may not be readily encoded in $f_{Θ_{t - 1}^{*}}$ , as the model has not seen any data from $D_{t}$ at the end of task $t - 1$ . Thus, distillation does not help much in this case. Instead, we propose to leverage the memory $M$ . We find using an additional projector for this loss is unnecessary, so we reuse the output space of $h_{Ψ}$ . Our consolidation loss is

(6)

This loss contrasts the current task and the memory, which promotes learning features that help discriminate the instances from the current task and past tasks. Similar to Eq. 4, we use a symmetric loss for $L_{c r o s s}$ .

Remark.

One might expect that $L_{c r o s s}$ encourages representations of $M$ to collapse to a single point to minimize their similarity with $D_{t}$ . We believe that this is unlikely because we find empirically that the stability loss helps the model to learn well-behaved representations of $M$ . For example, optimizing $L_{p a s t - D}$ , which is calculated on $D_{t}$ , yields features that are transferable to $M$ . Similarly, $L_{p a s t - R}$ directly prevents collapses with the contrastive loss on $M$ . In addition, since data storing is performed online, $M$ may contain examples from $D_{t}$ while learning it, causing $L_{c r o s s}$ to contrast examples within $D_{t}$ . Nevertheless, because most SSL objectives encourage discrimination on the instance level rather than the class level, the model would still learn useful features with $L_{c r o s s}$ in this scenario.

3.2.4 Overall Loss

The overall loss of our model is

L = L_{c u r r e n t} + \frac{1}{2} (L_{c r o s s} + L_{p a s t}) .

(7)

Similarly to Fini et al. (2022), we do not perform hyperparameter tuning on the loss weights (although it might yield better results) and fix the additional weights to sum to one to demonstrate the potential of this framework. In Fig. 1, we illustrate the conceptual optimization landscape. As discussed above, no solution minimizes all three losses individually in the same space. With isolated features spaces, our method reduces the extent to which $L_{c r o s s}$ or $L_{p a s t}$ directly constrain the model from learning the current task, which promotes plasticity. Note that $f_{Θ}$ still needs to maintain a unified representation that preserves information useful for minimizing all three losses.

3.3 Expressing UCL Methods with A Unifying Framework

Now, we extend Osiris to a more general framework in the form of a unified optimization objective, based on the encoder-projector architecture of SSL models. At task $t$ , it can be expressed as the following:

L_{X i i d \sim D_{t}, Y i i d \sim M}^{*} (X, Y; f_{Θ}, f_{Θ_{t - 1}^{*}}, g_{Φ}, g_{Φ_{t - 1}^{*}}, h_{Ψ}, m_{Ω}) = L_{c u r r e n t} (X; f_{Θ}, g_{Φ}) + λ_{1} L_{c r o s s} (X, Y; f_{Θ}, h_{Ψ}) + λ_{2} L_{p a s t} (X, Y; f_{Θ}, f_{Θ_{t - 1}^{*}}, g_{Φ}, g_{Φ_{t - 1}^{*}}, m_{Ω}) .

(8)

Here, we reuse the notations from Sec. 3.2 and introduce some new ones: $m_{Ω}$ is an additional nonlinear MLP projector parameterized by $Ω$ , and $λ_{1}$ , $λ_{2}$ denote the loss weights ( $λ_{1}, λ_{2} \geq 0$ ). The three terms in Eq. 8 learn features that promote plasticity, cross-task consolidation, and stability, respectively. Note that it is not necessary for the objective to use all the arguments included in the parentheses or different parameterizations for $g_{Φ}, h_{Ψ}, m_{Ω}$ ; we only use Eq. 8 as the general form. Nevertheless, an ideal model explores good features from the first two terms and uses the third term to ensure that it does not forget them over time.

Interestingly, most existing UCL methods implicitly optimize the terms in Eq. 8, which we formally show next. Our analysis is similar to Wang et al. (2024), but is distinct in that we decompose the objective from a representation learning perspective rather than a methodology perspective. We fix the first term (shared across methods) and let $f_{Θ}$ and $g_{Φ}$ be the current-task encoder and projector as described in Sec. 2.1. One exception is dynamic model architectures where new parameters are added during learning (Yoon et al., 2018; Rusu et al., 2016); we do not discuss architecture-based methods as it would be equivalent to progressively adding arguments to the first term. We illustrate the focus of different UCL methods in Table 1.

Elastic Weight Consolidation (EWC)

is a classic baseline in continual learning (Kirkpatrick et al., 2017; Schwarz et al., 2018; Chaudhry et al., 2018). It uses $(Θ - Θ_{t - 1}^{*})^{⊤} F_{t} (Θ - Θ_{t - 1}^{*})$ as $L_{p a s t}$ , where $F_{t}$ is the diagonal Fisher information matrix at task $t$ that can be estimated with $F_{t - 1}, D_{t - 1}, f_{Θ_{t - 1}^{*}}$ , and $g_{Φ_{t - 1}^{*}}$ at the end of task $t - 1$ . EWC does not use a memory buffer, but needs to store $F_{t}$ and $Θ_{t - 1}^{*}$ . It sets $λ_{1} = 0$ and does not consider cross-task consolidation.

	Method	$L_{c u r r e n t}$	$L_{c r o s s}$	$L_{p a s t}$
	FT	$✓$
Replay	ER (Lin, 1992; Robins, 1995)			$✓$
	ER+		$✓$	$✓$
	ER++		$✓$	$✓$
	DER (Buzzega et al., 2020)			$✓$
	LUMP $^{‡}$ (Madaan et al., 2022)		$✓$	$✓$
Distill	EWC (Schwarz et al., 2018)			$✓$
	CaSSLe (Fini et al., 2022)	$✓$		$✓$
	POCON (Gomez-Villa et al., 2024)	$✓$		$✓$
Ours	Osiris-R (Replay)	$✓$	$✓$	$✓$
Ours	Osiris-D (Replay+Distillation)	$✓$	$✓$	$✓$

Table 1: Comparison of UCL methods based on the feature components optimized.

^{‡}

Analysis for LUMP is based on applying mixup in the latent space.

CaSSLe

(Fini et al., 2022) is the previous state of the art in UCL. It uses the SSL objective as $L_{p a s t}$ to regularize the model on a separate output space projected from the main model output. The CaSSLe loss can be expressed with Eq. 2 as $L^{S S L} (X, X; m_{Ω} \circ f_{Θ}, g_{Φ_{t - 1}^{*}} \circ f_{Θ_{t - 1}^{*}})$ . CaSSLe uses another projector $h$ on the output of $g$ to form $m$ , i.e., $m_{Ω} = h_{Ψ} \circ g_{Φ}$ . This objective encourages the main model $g_{Φ} \circ f_{Θ}$ to encode information that can be used to predict representations of a previous checkpoint, $g_{Φ_{t - 1}^{*}} \circ f_{Θ_{t - 1}^{*}}$ , thereby restricting the main model from losing features learned. The current task loss still acts on $g_{Φ} \circ f_{Θ}$ . Since the gradient from the regularization is back-propagated to $g$ , CaSSLe may still limit the model’s ability to learn the new task effectively. It also does not consider cross-task consolidation ( $λ_{1} = 0$ ) and does not use a memory buffer.

Experience Replay (ER)

(Lin, 1992; Robins, 1995) is a classic replay-based baseline. It uses $L^{S S L} (Y; f_{Θ}, g_{Φ})$ as $L_{p a s t}$ and sets $λ_{1} = 0$ . Unlike regularization-based methods, this loss may implicitly allow the model to discover new features that improve cross-task consolidation since $Y$ includes examples from $D_{1 \dots t - 1}$ .

ER+ and ER++

are our attempts to improve ER which, in addition to ER, exploit the abundance of the current task data for $L_{c r o s s}$ . One such way is to add the memory examples into negatives augmenting the current task negatives, i.e., $L^{E R +} = L^{S S L} (X, X \cup Y; g_{Φ} \circ f_{Θ}, g_{Φ} \circ f_{Θ})$ ; it is similar to the asymmetric loss used by Cha et al. (2021) in SCL and we refer to it as ER+. This loss indeed pushes representations of the current task and the memory apart, but it does not yield gradient that enforces alignment (Wang and Isola, 2020) of different views generated from memory examples. An alternative way is to use a full SSL loss on the union of the current batch and memory, i.e., $L^{E R + +} = L^{S S L} (X \cup Y; f_{Θ}, g_{Φ})$ . We refer to this method as ER++.

Dark Experience Replay (DER)

(Buzzega et al., 2020; Madaan et al., 2022) is an improved version of ER, where $λ_{1}$ remains $0$ and $L_{p a s t}$ becomes a regularizer on the output space which is empirically estimated with $\frac{1}{| Y |} \sum_{y \in Y} ∥ g_{Φ} (f_{Θ} (y)) - z ∥_{2}^{2}$ , where $z$ is the representation $g (f (y))$ encoded by the model when $y$ is stored into $M$ . It does not consider cross-task consolidation because this objective does not encourage learning new features.

Lump

(Madaan et al., 2022) is the state-of-the-art replay-based UCL method. It applies mixup (Zhang et al., 2018), a linear interpolation between $x \in X$ and $y \in Y$ , to generate inputs: ${~ x}_{i} = ν x_{i} + (1 - ν) y_{i}$ where $ν \sim Beta (α, α)$ with $α$ being a hyperparameter. The batch $~ X = {_{i}}_{i = 1}^{B}$ is passed to the only loss term of the model, $L^{S S L} (~ X; f_{Θ}, g_{Φ})$ . This loss form does not fall directly into our framework because it is hard to disentangle the effect of the loss on points in between data examples. We give the proposition below and prove it in Appendix A.

Proposition 1.

Let $ν \sim Beta (α, α)$ and let be as described above. Define $z_{i} := g (f (x_{i})) / ∥ g (f (x_{i})) ∥_{2}$ , $u_{i} := g (f (y_{i})) / ∥ g (f (y_{i})) ∥_{2}$ , and $_{i} := g (f (_{i})) / ∥ g (f (_{i})) ∥_{2}$ for all $i \in {1, \dots | X |}$ . Suppose that the representations are linear in between $x_{i}$ and $y_{i}$ , i.e., $_{i} = ν z_{i} + (1 - ν) u_{i}$ . Then

	$E_{ν} [\frac{\partial L^{L U M P}}{\partial z_{i}}]$	$= - a_{i} z_{i}^{'} + \sum j \neq i a_{j} z_{j}      L_{c u r r e n t} + \sum j \neq i b_{j} u_{j}      L_{c r o s s} - b_{i} u_{i}^{'},$		(9)
	$E_{ν} [\frac{\partial L^{L U M P}}{\partial u_{i}}]$	$= - c_{i} u_{i}^{'} + \sum j \neq i c_{j} u_{j}      L_{p a s t} + \sum j \neq i d_{j} z_{j}      L_{c r o s s} - d_{i} z_{i}^{'},$		(10)

where $a_{(\cdot)}, b_{(\cdot)}, c_{(\cdot)}, d_{(\cdot)} \geq 0$ are scalar functions of $α$ and the softmax probabilities of predictions.

The first equation represents the gradient of the loss w.r.t. representations of examples in the current-task batch $X$ . In contrast, the second equation represents the gradient w.r.t. examples in the batch $Y$ sampled from the memory. Recall from Sec. 2.1 that $(\cdot)^{'}$ denotes the representation of another view of the same input. The linearity assumption we make above may not seem to be true in general, but note that mixup aims to help the model behave linearly in between examples to generalize better (Zhang et al., 2018). Nevertheless, we aim to estimate LUMP’s effect with this decomposition. Besides the last term in each equation, which indicates the gradient that pulls $z_{i}$ and $u_{i}$ (the pair of examples being mixed) closer, Prop. 1 says that the gradient of LUMP with contrastive learning can be decomposed and affects all components of Eq. 8: (a) current task learning, (b) cross-task discrimination, and (c) past task learning. However, since all the losses act on the same output space and the coefficients are correlated, this does not allow flexible control of the learning emphasis.

4 Experiments

We organize this section as follows. In Sec 4.1, we describe our baselines and benchmarks, and discuss the incompatibility between BatchNorm and UCL. We then detail the evaluation metrics used in this study in Sec. 4.2. We show our results on the standard benchmarks in Sec. 4.3 and on the structured benchmarks in Sec. 4.4. Finally, in Sec. 4.5, we analyze the models’ behavior with additional experiments.

4.1 Experimental Protocol

Baselines.

We compare our method with FT and Offline defined in Sec. 2.2. We also compare with classic CL methods: online EWC (Schwarz et al., 2018), ER (Lin, 1992; Robins, 1995), DER (Buzzega et al., 2020) which we described in Sec. 3.3. We find that we can improve DER’s performance by normalizing the features before performing L2 regularization, so we report the performance of improved DER only. We also report the performance of ER+ and ER++ described in Sec. 3.3. In addition to these standard continual learning baselines, we compare with methods designed specifically for UCL, CaSSLe (Fini et al., 2022) and LUMP (Madaan et al., 2022). Concurrent work, POCON (Gomez-Villa et al., 2024), also aims to maximize plasticity for UCL. We use the online version of POCON for a fair comparison so that all methods observe the data the same number of times. Finally, we do not compare with C $^{2}$ ASR (Cheng et al., 2023) because it requires sorting the entire task stream and is an additional plug-in for UCL methods.

Figure 2: KNN accuracy of methods trained on the 20-task Split-CIFAR-100 with BatchNorm (BN) or GroupNorm (GN). $D_{1}$ -Only denotes an offline model trained for the same number of steps but only on the first task. CaSSLe (Fini et al., 2022) and LUMP (Madaan et al., 2022) are state-of-the-art UCL methods. The incompatibility between BN and UCL can be mitigated by using GN instead.

Benchmarks.

Standard Split-CIFAR-100. Following Madaan et al. (2022); Fini et al. (2022), we evaluate the models on the 5-task and 20-task sequences of CIFAR-100 (Krizhevsky, 2009). It contains 50,000 32 $\times$ 32 images from 100 classes that are randomly grouped into a disjoint set of tasks.
Structured CIFAR-100. In real-world environments, consecutive visual scenes are often similar and correlated in time. For example, seeing a great white shark immediately followed by office chairs is rather unlikely. We construct a temporally structured CIFAR-100 sequence by grouping the classes with the same superclass label (provided by the dataset) into a task and randomly shuffle the task order, which results in ten tasks. Examples of superclasses include vehicles, flowers, and aquatic mammals.
Structured Tiny-ImageNet. Real-world environments also boast an abundance of hierarchies. We may visit different city blocks in an urban area and then tour multiple spots in a wild park. To create a task sequence that captures hierarchical environment structure, we use Tiny-ImageNet-200 (Le and Yang, 2015; Deng et al., 2009), which includes 100,000 images of size 64 $\times$ 64 categorized into 200 classes whose location spans different environments. We first use a pre-trained scene classifier trained on Places365 (Zhou et al., 2017) to classify all images into indoor, city, and wild environments. We then use a majority vote to decide the environment label for each class. Finally, we arrange the classes in the order of indoor $\to$ city $\to$ wild and group them into ten tasks in order. This leads to four tasks indoors, three tasks in the city, and three tasks in the wild. Compared with Structured CIFAR-100, which only enforces correlation within each task, this benchmark additionally imposes correlation between consecutive tasks. It provides a realistic structure, but also aligns nicely with the classic task-incremental learning setup.

Implementation details.

We use a single-head ResNet-18 (He et al., 2016) as our backbone encoder. We use a two-layer MLP with a hidden dimension of 2048 and an output dimension of 128 as the projector. We use the ReLU function as activation after the hidden layer but not the output layer following Chen et al. (2020). For POCON, we follow the authors’ implementation and use four-layer projectors for distillation. We set the memory size $| M | = 500$ for all experiments. We train the models with a batch size of 256 for 200 epochs for UCL training, following Madaan et al. (2022). All methods use the same loss as FT during the first task. We provide additional training, data augmentation hyperparameters, and other details in Appendix B.

Incompatibility between BatchNorm and UCL.

It has been shown that BatchNorm (Ioffe and Szegedy, 2015) is not suitable for SCL since its running estimates of the feature moments (over the batch dimension) are biased towards the most recent task (Pham et al., 2021). An alternative normalization layer is GroupNorm (Wu and He, 2018), where the batch statistics are not needed because the normalization is applied along the feature dimension. The performance of BatchNorm and GroupNorm has not been investigated in UCL, although UCL usually requires much more training iterations than SCL. Thus, we hypothesize that using BatchNorm is harmful in UCL and the improvement of UCL methods over FT may not be as large as previously believed. In Fig. 2, we show that after we switch to GroupNorm, FT becomes a very strong baseline, and existing UCL methods do not help as much, although all methods show improved performance. Moreover, a model trained on only one task outperforms FT when equipped with BatchNorm but not GroupNorm, further showing the detrimental effect of BatchNorm in UCL. Therefore, we use GroupNorm in our experiments to highlight the core factors contributing to UCL performance.

4.2 Metrics for Evaluating UCL Methods

We have discussed the elements contributing to learning a good representation in UCL. We now introduce accompanying fine-grained metrics to evaluate them. After unsupervised training, we keep the encoders and discard the projectors following standard practice because projectors learn the invariance that minimizes the SSL loss and may discard too much information for downstream tasks (Chen et al., 2020). Following Madaan et al. (2022), let $A_{t, i}$ be the weighted KNN (Wu et al., 2018) test accuracy of the encoder $f_{Θ_{t}^{*}}$ on task $i$ after it finishes training on task $t$ . Note that the KNN classifier does not know the task labels (see Appendix C for results when task labels are given); therefore, it is important for $f_{Θ_{t}^{*}}$ to obtain a good representation geometry over the entire dataset. We report the mean and standard deviation of results obtained from three random seeds in all of our tables and plots. We use five metrics in this study; accuracy, forgetting, and forward transfer are commonly used and we propose knowledge gain and cross-task consolidation score to measure the model’s ability to optimize $L_{c u r r e n t}$ and $L_{c r o s s}$ in Eq. 8:

Overall Accuracy (A) is the accuracy of the final model on all classes in the dataset: $A = \frac{1}{T} \sum_{i = 1}^{T} A_{T, i}$ .
Forgetting (F) measures the difference between the model’s best accuracy on task $i$ at any point during training and its accuracy on task $i$ after training: $F = \frac{1}{T - 1} \sum_{i = 1}^{T - 1} {max}_{t \in {1, \dots T}} (A_{t, i} - A_{T, i})$ . Forgetting measures the model’s ability to optimize $L_{p a s t}$ in Eq. 8.
Knowldege Gain (K). It has been shown that representations learned in UCL are significantly less prone to forgetting and are more stable than their SCL counterparts (Davari et al., 2022; Madaan et al., 2022). On the other hand, models learned continually lose plasticity (Abbas et al., 2023). Thus, we use knowledge gain to quantify the accuracy increase on task $i$ before and after the model is trained on it. It is defined as $K = \frac{1}{T - 1} \sum_{i = 2}^{T} (A_{i, i} - A_{i - 1, i})$ . Knowledge gain is similar to the SCL metrics proposed by Chaudhry et al. (2018); Koh et al. (2023), but it is simple to calculate and is more suitable for UCL because UCL models generally have a reasonable performance on a task before even learning on it thanks to the generalization capability of SSL. Knowledge gain quantifies the model’s ability to optimize $L_{c u r r e n t}$ .
Cross-Task Consolidation (C) is defined as the test accuracy of a task-level KNN classifier on the frozen representations of the final model. As discussed in the previous sections, new knowledge acquisition is quantified by both knowledge gain and the ability to learn features that discriminate data across tasks, i.e., to optimize $L_{c r o s s}$ .
Forward Transfer (T) quantifies the generalization ability of UCL models by measuring how much of the learned representation can be helpful to an unseen task. It is defined as $T = \frac{1}{T - 1} \sum_{i = 2}^{T} (A_{i - 1, i} - R_{i})$ where $R_{i}$ is the accuracy of a randomly initialized model on task $i$ . It is used by Madaan et al. (2022); Fini et al. (2022).

	5-Task Split-CIFAR-100					20-Task Split-CIFAR-100
	A ( $↑$ )	F ( $↓$ )	K ( $↑$ )	C ( $↑$ )	T ( $↑$ )	A ( $↑$ )	F ( $↓$ )	K ( $↑$ )	C ( $↑$ )	T ( $↑$ )
FT	50.7 ( $\pm$ 0.4)	2.9 ( $\pm$ 0.0)	9.0 ( $\pm$ 0.1)	59.1 ( $\pm$ 0.2)	30.7 ( $\pm$ 0.3)	45.6 ( $\pm$ 0.3)	2.9 ( $\pm$ 0.4)	2.5 ( $\pm$ 0.1)	47.2 ( $\pm$ 0.1)	28.9 ( $\pm$ 0.2)
ER	51.5 ( $\pm$ 0.4)	2.7 ( $\pm$ 0.4)	8.4 ( $\pm$ 0.3)	59.7 ( $\pm$ 0.4)	31.8 ( $\pm$ 0.2)	47.1 ( $\pm$ 0.7)	3.4 ( $\pm$ 0.6)	3.5 ( $\pm$ 0.4)	48.2 ( $\pm$ 0.5)	30.1 ( $\pm$ 0.2)
DER $^{†}$	51.0 ( $\pm$ 0.6)	3.0 ( $\pm$ 0.7)	9.6 ( $\pm$ 0.3)	59.0 ( $\pm$ 0.4)	30.5 ( $\pm$ 0.1)	45.7 ( $\pm$ 0.1)	2.6 ( $\pm$ 0.2)	2.6 ( $\pm$ 0.4)	47.2 ( $\pm$ 0.2)	28.7 ( $\pm$ 0.2)
LUMP	50.2 ( $\pm$ 0.6)	1.4 ( $\pm$ 1.1)	7.3 ( $\pm$ 0.3)	58.4 ( $\pm$ 0.4)	30.2 ( $\pm$ 0.1)	47.7 ( $\pm$ 1.1)	2.6 ( $\pm$ 0.9)	3.1 ( $\pm$ 0.3)	49.1 ( $\pm$ 1.0)	29.7 ( $\pm$ 0.0)
ER+	51.8 ( $\pm$ 0.6)	3.4 ( $\pm$ 0.5)	10.2 ( $\pm$ 0.3)	60.1 ( $\pm$ 0.3)	31.1 ( $\pm$ 0.4)	46.7 ( $\pm$ 0.3)	3.1 ( $\pm$ 0.3)	4.4 ( $\pm$ 0.0)	48.0 ( $\pm$ 0.1)	28.8 ( $\pm$ 0.1)
ER++	51.8 ( $\pm$ 0.3)	2.9 ( $\pm$ 0.6)	9.1 ( $\pm$ 0.4)	59.8 ( $\pm$ 0.3)	31.6 ( $\pm$ 0.4)	47.7 ( $\pm$ 0.3)	3.7 ( $\pm$ 0.3)	5.0 ( $\pm$ 0.4)	49.0 ( $\pm$ 0.5)	30.0 ( $\pm$ 0.1)
Osiris-R (Ours)	52.3 ( $\pm$ 0.5)	2.5 ( $\pm$ 0.7)	8.5 ( $\pm$ 0.1)	60.1 ( $\pm$ 0.1)	32.1 ( $\pm$ 0.2)	49.3 ( $\pm$ 0.3)	3.1 ( $\pm$ 0.2)	4.7 ( $\pm$ 0.2)	50.5 ( $\pm$ 0.5)	31.5 ( $\pm$ 0.3)
EWC	43.8 ( $\pm$ 0.6)	2.3 ( $\pm$ 0.8)	5.0 ( $\pm$ 0.4)	53.9 ( $\pm$ 0.7)	26.4 ( $\pm$ 0.3)	37.6 ( $\pm$ 0.2)	1.7 ( $\pm$ 0.1)	2.0 ( $\pm$ 0.3)	39.4 ( $\pm$ 0.3)	21.4 ( $\pm$ 0.4)
CaSSLe	51.2 ( $\pm$ 0.3)	0.7 ( $\pm$ 0.1)	7.2 ( $\pm$ 0.5)	59.5 ( $\pm$ 0.3)	30.4 ( $\pm$ 0.5)	48.0 ( $\pm$ 0.1)	1.9 ( $\pm$ 0.1)	-0.4 ( $\pm$ 0.2)	49.2 ( $\pm$ 0.2)	30.2 ( $\pm$ 0.2)
POCON $^{⋆}$	50.6 ( $\pm$ 0.7)	3.2 ( $\pm$ 1.0)	9.3 ( $\pm$ 0.4)	59.3 ( $\pm$ 0.3)	30.6 ( $\pm$ 0.3)	45.2 ( $\pm$ 0.4)	3.0 ( $\pm$ 0.6)	2.7 ( $\pm$ 0.4)	46.8 ( $\pm$ 0.3)	28.8 ( $\pm$ 0.1)
Osiris-D (Ours)	53.0 ( $\pm$ 0.2)	1.6 ( $\pm$ 0.5)	8.4 ( $\pm$ 0.3)	60.5 ( $\pm$ 0.1)	31.7 ( $\pm$ 0.3)	50.1 ( $\pm$ 0.2)	2.3 ( $\pm$ 0.2)	4.2 ( $\pm$ 0.3)	51.3 ( $\pm$ 0.1)	31.3 ( $\pm$ 0.4)
Offline	52.5 ( $\pm$ 0.4)	-	-	60.0 ( $\pm$ 0.4)	-	52.5 ( $\pm$ 0.4)	-	-	53.9 ( $\pm$ 0.2)	-

Table 2: Results on standard Split-CIFAR-100 with five or 20 tasks. The best model in each column is made bold and the second-best model is underlined.

^{†}

Improved DER.

^{⋆}

Online version of POCON. We separate replay-based (top) and distillation-based methods (bottom) for easier comparisons.

	Structured CIFAR-100					Structured Tiny-ImageNet
	A ( $↑$ )	F ( $↓$ )	K ( $↑$ )	C ( $↑$ )	T ( $↑$ )	A ( $↑$ )	F ( $↓$ )	K ( $↑$ )	C ( $↑$ )	T ( $↑$ )
FT	45.0 ( $\pm$ 0.6)	5.5 ( $\pm$ 0.4)	7.8 ( $\pm$ 0.3)	59.2 ( $\pm$ 0.5)	28.7 ( $\pm$ 0.2)	34.2 ( $\pm$ 0.2)	4.7 ( $\pm$ 0.1)	5.7 ( $\pm$ 0.4)	43.5 ( $\pm$ 0.2)	26.0 ( $\pm$ 0.2)
LUMP	48.5 ( $\pm$ 0.7)	3.6 ( $\pm$ 0.6)	7.9 ( $\pm$ 0.5)	62.7 ( $\pm$ 0.7)	29.7 ( $\pm$ 0.6)	36.0 ( $\pm$ 1.0)	2.9 ( $\pm$ 1.1)	5.3 ( $\pm$ 0.2)	45.2 ( $\pm$ 0.8)	26.2 ( $\pm$ 0.4)
CaSSLe	47.4 ( $\pm$ 0.3)	1.9 ( $\pm$ 0.3)	4.1 ( $\pm$ 0.2)	61.6 ( $\pm$ 0.1)	30.0 ( $\pm$ 0.3)	35.4 ( $\pm$ 0.3)	2.1 ( $\pm$ 0.2)	3.2 ( $\pm$ 0.1)	43.9 ( $\pm$ 0.3)	26.7 ( $\pm$ 0.1)
POCON $^{⋆}$	45.8 ( $\pm$ 0.5)	5.2 ( $\pm$ 0.3)	7.7 ( $\pm$ 0.3)	59.6 ( $\pm$ 0.8)	29.3 ( $\pm$ 0.4)	34.4 ( $\pm$ 0.6)	3.8 ( $\pm$ 0.8)	5.5 ( $\pm$ 0.2)	43.6 ( $\pm$ 0.5)	25.7 ( $\pm$ 0.1)
Osiris-R (Ours)	49.0 ( $\pm$ 0.4)	5.0 ( $\pm$ 0.6)	8.3 ( $\pm$ 0.6)	62.8 ( $\pm$ 0.2)	31.7 ( $\pm$ 0.1)	36.3 ( $\pm$ 0.1)	3.6 ( $\pm$ 0.2)	5.5 ( $\pm$ 0.1)	45.1 ( $\pm$ 0.1)	27.5 ( $\pm$ 0.3)
Osiris-D (Ours)	49.8 ( $\pm$ 0.1)	4.4 ( $\pm$ 0.3)	8.4 ( $\pm$ 0.2)	64.2 ( $\pm$ 0.1)	31.5 ( $\pm$ 0.2)	37.5 ( $\pm$ 0.4)	2.9 ( $\pm$ 0.2)	5.0 ( $\pm$ 0.4)	46.5 ( $\pm$ 0.2)	28.1 ( $\pm$ 0.1)
Offline	52.5 ( $\pm$ 0.4)	-	-	67.9 ( $\pm$ 0.1)	-	36.8 ( $\pm$ 0.1)	-	-	46.4 ( $\pm$ 0.2)	-

Table 3: Results on 10-task sequences on structured CIFAR-100 and Tiny-ImageNet. The two best models are marked. Osiris-D performs the best, surpassing Offline on Structured Tiny-ImageNet.

4.3 Results on the Standard Benchmarks

In Table 2, we report the results of all models on Split-CIFAR-100. Our first observation is that FT is already a very strong baseline with a final accuracy only 2% below the Offline model’s accuracy on the 5-task sequence. Moreover, Osiris-D closes this gap with Offline on both overall accuracy ( $p = 0.185$ by an unpaired t-test) and consolidation score. We hypothesize that this is in part attributed to that Osiris-D leverages the data ordering by contrasting the current task and the memory, although task labels are not explicitly given. On the other hand, Offline does not have any task label information. Among all methods, Osiris consistently achieves the highest overall accuracy, consolidation score, and forward transfer, regardless of the number of tasks. Comparing Osiris-R and Osiris-D, we find that there’s still a trade-off between plasticity and stability. Osiris-R shows more knowledge gain at the expense of higher forgetting, and Osiris-D shows lower forgetting but sometimes lower knowledge gain.

CaSSLe shows low forgetting on both task sequences but lower knowledge gains than all the other methods except EWC, indicating that parallel projectors (in Osiris) might be a better choice than sequential ones (in CaSSLe) at improving plasticity. POCON is designed to maximize plasticity by distilling the model with a single-task expert, achieving the highest knowledge gain among the distillation-based methods (except Osiris-D on the 20 tasks). LUMP improves over FT on the 20-task but not on the 5-task sequence, and we hypothesize that it is because memory is more important when individual tasks are less diverse. Interestingly, the knowledge gain of FT is not the upper bound for UCL methods. This may be because UCL methods implicitly leverage learned representations or memory to help learn a new task (a form of forward transfer). For example, the methods with the highest knowledge gain on each benchmark are ER+ and ER++, which both involve contrasting the current task and the memory, a component that is not present in ER.

General findings.

Similar to Madaan et al. (2022), we find all UCL methods exhibit very low forgetting compared to previously reported numbers in SCL (Chaudhry et al., 2018). In general, the models with the lowest forgetting applies distillation. Indeed, a previously hypothesized criticism for replay-based methods is that they are prune to overfitting to memory (Fini et al., 2022), which we analyze empirically in Sec. 4.5. On the other hand, they show more plasticity and have higher knowledge gain on the new task.

4.4 Results on the Structured Benchmarks

We show the results of recent UCL methods on the structured benchmarks in Table 3. On Structured CIFAR-100, the methods show higher forgetting than on both random 5-task and 20-task sequences (a caveat for such a comparison is that here we have ten tasks). Nevertheless, CaSSLe and LUMP show relatively low forgetting but fail to address some of the other components of our framework (Eq. 8). All UCL methods improve over FT, and the two variants of Osiris outperform others in terms of knowledge gain, consolidation, forward transfer, and overall accuracy. This indicates that Osiris is robust to correlated task sequences.

On Structured Tiny-ImageNet, FT shows the highest knowledge gain, which means it benefits more from intra-task similarity. We hypothesize that contrastive learning benefits from high intra-task similarity because it provides hard negative pairs. Osiris again outperforms other UCL methods in terms of knowledge gain, consolidation, and forward transfer. Surprisingly, Osiris-D obtains better accuracy than Offline ( $p = 0.008$ by an unpaired t-test). From a curriculum-learning perspective, this suggests that the realistic, hierarchical structure offers a better task ordering than randomly constructed scenarios. We explore how such a task ordering affects the representation structure in the next section.

4.5 Analysis

Balancing stability, plasticity, and consolidation.

We now examine how UCL methods balance plasticity, cross-task consolidation, and stability. We use Osiris-D for our analysis in this section since Osiris-R shows similar behavior. Similar to the analysis provided by Gomez-Villa et al. (2024), in Fig. 3, we plot for different UCL methods the current task accuracy (plasticity), task-level KNN accuracy (cross-task consolidation), and accuracy of the first task (stability) throughout training on standard, 20-task Split-CIFAR-100. We plot the accuracy for all the other tasks in Appendix D. The first observation is that Osiris performs relatively well on all three aspects throughout training. LUMP and CaSSLe have similar overall accuracy in Table 2. They show the same level of cross-task consolidation in Fig. 3 because they do not directly enforce it. Among the two, LUMP shows higher plasticity but lower stability near the end of training. Both methods show better overall accuracy than FT, which may be attributed to their better consolidation scores.

	A ( $↑$ )	F ( $↓$ )	K ( $↑$ )	C ( $↑$ )	T ( $↑$ )
w/o isolated space	47.8 ( $\pm$ 0.3)	2.8 ( $\pm$ 0.3)	2.0 ( $\pm$ 0.4)	49.3 ( $\pm$ 0.2)	30.4 ( $\pm$ 0.3)
w/o $L_{cross}$	45.9 ( $\pm$ 0.1)	2.5 ( $\pm$ 0.3)	1.8 ( $\pm$ 0.2)	47.2 ( $\pm$ 0.2)	29.2 ( $\pm$ 0.2)
w/o $L_{past}$	49.2 ( $\pm$ 0.4)	2.6 ( $\pm$ 0.3)	4.8 ( $\pm$ 0.5)	50.4 ( $\pm$ 0.3)	30.5 ( $\pm$ 0.2)
Full	50.1 ( $\pm$ 0.2)	2.3 ( $\pm$ 0.2)	4.2 ( $\pm$ 0.3)	51.3 ( $\pm$ 0.1)	31.3 ( $\pm$ 0.4)

Table 4: Ablation of Osiris-D’s components on 20-task Split-CIFAR-100. Isolating feature spaces is crucial for high knowledge gain;

L_{cross}

is important for knowledge gain and separation of task representations;

L_{past}

helps reduce forgetting.

Method	Env. 1	Env. 2	Env. 3
Offline	44.2 ( $\pm$ 0.6)	60.9 ( $\pm$ 0.7)	50.3 ( $\pm$ 0.5)
FT	39.0 ( $\pm$ 0.2)	57.1 ( $\pm$ 0.4)	54.7 ( $\pm$ 0.5)
Osiris-D	43.7 ( $\pm$ 0.6)	60.2 ( $\pm$ 0.7)	55.1 ( $\pm$ 0.5)

Table 5: Within-environment accuracy on Structured Tiny-ImageNet. Both FT and Osiris-D outperforms Offline on Environment 3, but only Osiris-D achieves performance similar to Offline on Env. 1 and 2.

Figure 4: Mean cosine similarity between pairs of examples drawn from pairs of classes. Environment switches are marked with dashed white lines. Classes within the third environment are projected to nearby positions on the representation space by Offline, but not by FT and Osiris-D.

Ablations.

In Table 4, we show the results of our framework after removing each component. When using a shared projector for all three losses, the model’s knowledge gain drops from 4.2% to 2.0%, which shows that using separate spaces helps plasticity, as we have hypothesized. When not using $L_{c r o s s}$ , the model shows the lowest cross-task discrimination score among all the models being compared here. It also shows low forgetting because the only loss applied on output space of $h \circ f$ is now the distillation loss ( $L_{p a s t}$ ). On the other hand, after removing $L_{p a s t}$ , the model exhibits the highest knowledge gain, which means that learning the new task gains benefit from $L_{c r o s s}$ . Finally, with all the components, Osiris balances all three aspects of our framework and achieves the best scores on all metrics except knowledge gain, without requiring manually adjusting the loss weights.

How does a structured task sequence affect representation?

In Table 5, we show the within-environment accuracy for Offline, FT, and Osiris-D. Both FT and Osiris perform better than Offline on the last environment but not as well on the first two. Compared to FT, Osiris shows less forgetting and performs better in previously-observed environments. To examine their representation geometry, we plot the mean cosine similarity matrix between features of examples from pairs of classes, for these three methods, in Fig. 4. The matrix shows that classes within the last environment are naturally projected by Offline to nearby positions on the representation hypersphere. Differently, FT and Osiris can better distinguish between classes in the third environment. Among the two, Osiris distinguishes between different classes within the first two environments better. Together with within-environment accuracy, this provides evidence that UCL methods benefit from the ordered task sequence such that they distinguish examples in the last environment better. At the same time, Offline is less sensitive distinguishing these examples.

Does replay-based methods overfit to memory?

It has been shown that replay-based methods can overfit to memory in SCL (Verwimp et al., 2021; Buzzega et al., 2021). The same phenomenon has been hypothesized by Fini et al. (2022) to also exist in UCL, but not yet empirically verified. As SSL learns features that generalize better (Madaan et al., 2022), we test this hypothesis empirically. In Fig. 2(b), we plot for replay-based methods the relative difference between the contrastive loss of a batch sampled from all data observed ( $L_{a l l}$ ) and from the memory ( $L_{m e m}$ ), both excluding any data from the current task. The relative difference is defined as $\frac{L_{a l l} - L_{m e m}}{L_{a l l}}$ . A large difference could indicate the failure to generalize the representations learned from memory to past tasks, which defeats the purpose of replay. For Osiris, we plot the curves with $h \circ f$ here in Fig. 2(b) and for $g \circ f$ in Appendix D. In Fig. 2(b), the curves for ER, LUMP, and Osiris-R increase at first and become relatively stable afterwards. Their final values are all much larger than zero, indicating the possibility of overfitting to memory. Since Osiris-D does not explicitly minimize the contrastive loss on the memory, it does not overfit to it and shows close-to-zero relative loss difference. This could explain why it always has lower forgetting than Osiris-R.

While Fig. 2(b) could indicate that Osiris-R overfits to the memory with $h \circ f$ , the effect of $L_{c r o s s}$ on the encoder $f$ appears to be less sensitive to the precision of the representations produced by $h \circ f$ . In all of our results (Tables 2, 3, 6, and 7), Osiris-R and Osiris-D are consistently the best performers in consolidation scores, which leads to good overall accuracy. Additionally, neither Osiris-R nor Osiris-D overfits with the representations produced by $g \circ f$ as shown in Fig. 4(a). Overall, our findings empirically support the claim in prior work that using replay for UCL may cause the model to overfit to the memory (Fini et al., 2022), but also show that we still need the memory to improve consolidation, which is crucial for performance.

5 Related Work

Self-supervised learning.

A large body of work in SSL belongs to the contrastive learning family (Chen et al., 2020; He et al., 2020; Misra and Maaten, 2020; Hadsell et al., 2006; Tian et al., 2020; Oord et al., 2018; Wu et al., 2018; Gutmann and Hyvärinen, 2010), which we focus on in this study. The main idea is to match the representations of two augmented views of the same image and repel the representations of different images to learn semantically meaningful representations. Clustering-based methods share high-level intuition but perform contrastive learning on the cluster level rather than the instance level (Caron et al., 2020; He et al., 2016). Other works have explored relaxing the need for negative pairs, usually by asymmetric architectures (Grill et al., 2020; Chen and He, 2021) or losses that enforce variance in representations (Zbontar et al., 2021; Bardes et al., 2022; Ermolov et al., 2021). SSL methods that work on transformers have also been proposed in recent years (He et al., 2022; Caron et al., 2021).

Continual learning.

SCL methods are commonly partitioned into three categories. Regularization-based methods (Kirkpatrick et al., 2017; Schwarz et al., 2018; Chaudhry et al., 2018; Zenke et al., 2017; Aljundi et al., 2018; Castro et al., 2018; Douillard et al., 2020; Hou et al., 2019; Wu et al., 2019) regularize model parameters such that they do not drift too far from previous optima. Replay-based methods (Buzzega et al., 2020; Robins, 1995; Hayes et al., 2020; Rebuffi et al., 2017; Chaudhry et al., 2019; Lopez-Paz and Ranzato, 2017; Aljundi et al., 2019a; Ostapenko et al., 2019) use a memory buffer storing data from past task and use them for replay. Finally, architecture-based methods (Ostapenko et al., 2021; Rusu et al., 2016; Serra et al., 2018; Li et al., 2019) dynamically introduce new parameters for each task to reduce forgetting. Limited progress has been made in UCL. The area is first explored by Rao et al. (2019); Smith et al. (2021), but their work is limited to small datasets such as handwritten digits and is hard to scale. Recent work focuses on improving SSL-based UCL (Madaan et al., 2022; Fini et al., 2022; Gomez-Villa et al., 2024, 2022; Cheng et al., 2023). In contrast, we investigate what features are essential in UCL and offer new insights and practical guidance around normalization. SSL also helps SCL (Cha et al., 2021; Caccia et al., 2021) by improving the model’s representations. It is worth mentioning that properties of representations learned with SSL in continual learning have been empirically studied (Davari et al., 2022; Galashov et al., 2023; Gallardo et al., 2021).

6 Conclusion

This work identifies three key components in UCL for integrating representation learning in the present and the past tasks: plasticity, stability and cross-task consolidation. Existing methods fall under our unifying framework by optimizing only a subset of objectives, whereas our proposed method Osiris explicitly optimizes and balances all three desiderata. Osiris achieves state-of-the-art performance on all UCL benchmarks and shows better accuracy on our realistic Structured Tiny-ImageNet benchmark than offline iid training. Our work sheds new light on the potential learning mechanisms of continual learning agents in the real world. Future work will extend our framework to non-contrastive SSL approaches and evaluate more realistic learning environments, such as lifelong video recordings.

Acknowledgment

We appreciate the constructive feedback from five anonymous reviewers. We also thank Lucas Caccia, Siddarth Venkatraman, and Michael Chong Wang for helpful discussions and Avery Ryoo for pointing out some typos in earlier drafts. LC and YZ acknowledge the generous support of the CIFAR AI Chair program. This work obtained support by the funds provided by the National Science Foundation and by DoD OUSD (R&E) under Cooperative Agreement PHY-2229929 (The NSF AI Institute for Artificial and Natural Intelligence). This research was also enabled in part by compute resources provided by Mila (mila.quebec).

References

D. Abati, J. Tomczak, T. Blankevoort, S. Calderara, R. Cucchiara, and B. E. Bejnordi (2020) Conditional channel gated networks for task-aware continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3931–3940. Cited by: §3.1.
Z. Abbas, R. Zhao, J. Modayil, A. White, and M. C. Machado (2023) Loss of plasticity in continual deep reinforcement learning. In Conference on Lifelong Learning Agents (CoLLAs), Proceedings of Machine Learning Research, Vol. 232, pp. 620–636. External Links: Link Cited by: 3rd item.
R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars (2018) Memory aware synapses: learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 139–154. Cited by: §5.
R. Aljundi, E. Belilovsky, T. Tuytelaars, L. Charlin, M. Caccia, M. Lin, and L. Page-Caccia (2019a) Online continual learning with maximally interfered retrieval. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §3.2, §5.
R. Aljundi, P. Chakravarty, and T. Tuytelaars (2017) Expert gate: lifelong learning with a network of experts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3366–3375. Cited by: §3.1.
R. Aljundi, K. Kelchtermans, and T. Tuytelaars (2019b) Task-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11254–11263. Cited by: §3.2.
R. Aljundi, M. Lin, B. Goujaud, and Y. Bengio (2019c) Gradient based sample selection for online continual learning. Advances in Neural Information Processing Systems (NeurIPS) 32. Cited by: §3.2.
A. Bardes, J. Ponce, and Y. LeCun (2022) VICReg: variance-invariance-covariance regularization for self-supervised learning. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §5.
P. Buzzega, M. Boschini, A. Porrello, D. Abati, and S. Calderara (2020) Dark experience for general continual learning: a strong, simple baseline. Advances in Neural Information Processing Systems (NeurIPS) 33, pp. 15920–15930. Cited by: §3.1, §3.2.2, §3.3, Table 1, §4.1, §5.
P. Buzzega, M. Boschini, A. Porrello, and S. Calderara (2021) Rethinking experience replay: a bag of tricks for continual learning. In International Conference on Pattern Recognition (ICPR), pp. 2180–2187. Cited by: §4.5.
L. Caccia, R. Aljundi, N. Asadi, T. Tuytelaars, J. Pineau, and E. Belilovsky (2021) New insights on reducing abrupt representation change in online continual learning. In International Conference on Learning Representations (ICLR), Cited by: §5.
M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin (2020) Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems (NeurIPS) 33, pp. 9912–9924. Cited by: §2.1, §5.
M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin (2021) Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9650–9660. Cited by: §2.1, §5.
F. M. Castro, M. J. Marín-Jiménez, N. Guil, C. Schmid, and K. Alahari (2018) End-to-end incremental learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 233–248. Cited by: §5.
H. Cha, J. Lee, and J. Shin (2021) Co2l: contrastive continual learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9516–9525. Cited by: §3.3, §5.
A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. Torr (2018) Riemannian walk for incremental learning: understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 532–547. Cited by: §3.1, §3.3, 3rd item, §4.3, §5.
A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny (2019) Efficient lifelong learning with a-GEM. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §5.
T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (ICML), pp. 1597–1607. Cited by: §1, §2.1, §3.2.1, §3.2.2, §4.1, §4.2, §5.
X. Chen and K. He (2021) Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15750–15758. Cited by: §5.
H. Cheng, H. Wen, X. Zhang, H. Qiu, L. Wang, and H. Li (2023) Contrastive continuity on augmentation stability rehearsal for continual self-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 5707–5717. Cited by: §4.1, §5.
M. Davari, N. Asadi, S. Mudur, R. Aljundi, and E. Belilovsky (2022) Probing representation forgetting in supervised and unsupervised continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16712–16721. Cited by: §1, 3rd item, §5.
J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255. Cited by: §1, 3rd item.
G. Ditzler, M. Roveri, C. Alippi, and R. Polikar (2015) Learning in nonstationary environments: a survey. IEEE Computational Intelligence Magazine 10 (4), pp. 12–25. Cited by: §3.1.
A. Douillard, M. Cord, C. Ollion, T. Robert, and E. Valle (2020) Podnet: pooled outputs distillation for small-tasks incremental learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 86–102. Cited by: §5.
A. Ermolov, A. Siarohin, E. Sangineto, and N. Sebe (2021) Whitening for self-supervised representation learning. In International Conference on Machine Learning (ICML), pp. 3015–3024. Cited by: §5.
E. Fini, V. G. T. Da Costa, X. Alameda-Pineda, E. Ricci, K. Alahari, and J. Mairal (2022) Self-supervised models are continual learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9621–9630. Cited by: §1, §1, §3.1, §3.2.2, §3.2.4, §3.3, Table 1, Figure 2, 1st item, 5th item, §4.1, §4.3, §4.5, §4.5, §5.
R. M. French (1999) Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences 3 (4), pp. 128–135. Cited by: §1.
A. Galashov, J. Mitrovic, D. Tirumala, Y. W. Teh, T. Nguyen, A. Chaudhry, and R. Pascanu (2023) Continually learning representations at scale. In Conference on Lifelong Learning Agents (CoLLAs), pp. 534–547. Cited by: §5.
J. Gallardo, T. L. Hayes, and C. Kanan (2021) Self-supervised training enhances online continual learning. In British Machine Vision Conference (BMVC), Cited by: §5.
A. Gomez-Villa, B. Twardowski, K. Wang, and J. van de Weijer (2024) Plasticity-optimized complementary networks for unsupervised continual learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 1690–1700. Cited by: §1, Table 1, §4.1, §4.5, §5.
A. Gomez-Villa, B. Twardowski, L. Yu, A. D. Bagdanov, and J. van de Weijer (2022) Continually learning self-supervised representations with projected functional regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3867–3877. Cited by: §5.
J. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, et al. (2020) Bootstrap your own latent-a new approach to self-supervised learning. Advances in Neural Information Processing Systems (NeurIPS) 33, pp. 21271–21284. Cited by: Appendix B, §5.
Y. Gu, X. Yang, K. Wei, and C. Deng (2022) Not just selection, but exploration: online class-incremental continual learning via dual view consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7442–7451. Cited by: §3.2.
M. Gutmann and A. Hyvärinen (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 297–304. Cited by: §5.
R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2, pp. 1735–1742. Cited by: §5.
T. L. Hayes, K. Kafle, R. Shrestha, M. Acharya, and C. Kanan (2020) Remind your neural network to prevent catastrophic forgetting. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 466–483. Cited by: §5.
K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick (2022) Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16000–16009. Cited by: §5.
K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9729–9738. Cited by: Appendix B, §2.1, §5.
K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §4.1, §5.
S. Hou, X. Pan, C. C. Loy, Z. Wang, and D. Lin (2019) Learning a unified classifier incrementally via rebalancing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 831–839. Cited by: §3.1, §5.
S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pp. 448–456. Cited by: §1, §4.1.
G. Kim, C. Xiao, T. Konishi, Z. Ke, and B. Liu (2022) A theoretical study on solving continual learning. Advances in Neural Information Processing Systems (NeurIPS) 35, pp. 5065–5079. Cited by: §3.1.
J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences (PNAS) 114 (13), pp. 3521–3526. Cited by: §3.1, §3.3, §5.
H. Koh, M. Seo, J. Bang, H. Song, D. Hong, S. Park, J. Ha, and J. Choi (2023) Online boundary-free continual learning by scheduled data prior. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: 3rd item.
A. Krizhevsky (2009) Learning multiple layers of features from tiny images. University of Toronto. Cited by: 1st item.
C. A. Kurby and J. M. Zacks (2008) Segmentation in the perception and memory of events. Trends in Cognitive Sciences 12 (2), pp. 72–79. Cited by: §1.
Y. Le and X. Yang (2015) Tiny imagenet visual recognition challenge. CS 231N 7 (7), pp. 3. Cited by: §1, 3rd item.
X. Li, Y. Zhou, T. Wu, R. Socher, and C. Xiong (2019) Learn to grow: a continual structure learning framework for overcoming catastrophic forgetting. In International Conference on Machine Learning (ICML), pp. 3925–3934. Cited by: §5.
L. Lin (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning 8, pp. 293–321. Cited by: §3.1, §3.3, Table 1, §4.1.
D. Lopez-Paz and M. Ranzato (2017) Gradient episodic memory for continual learning. Advances in Neural Information Processing Systems (NeurIPS) 30. Cited by: §5.
D. Madaan, J. Yoon, Y. Li, Y. Liu, and S. J. Hwang (2022) Representational continuity for unsupervised continual learning. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: Appendix B, Appendix C, §1, §1, §3.1, §3.3, §3.3, Table 1, Figure 2, 1st item, 3rd item, 5th item, §4.1, §4.1, §4.2, §4.3, §4.5, §5.
M. Masana, J. Van de Weijer, B. Twardowski, et al. (2021) On the importance of cross-task features for class-incremental learning. arXiv preprint arXiv:2106.11930. Cited by: §3.1.
M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of Learning and Motivation, Vol. 24, pp. 109–165. Cited by: §1.
D. Misra (2020) Mish: a self regularized non-monotonic activation function. In British Machine Vision Conference (BMVC), Cited by: Appendix B.
I. Misra and L. v. d. Maaten (2020) Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §5.
A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §5.
O. Ostapenko, M. Puscas, T. Klein, P. Jahnichen, and M. Nabi (2019) Learning to remember: a synaptic plasticity driven framework for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11321–11329. Cited by: §5.
O. Ostapenko, P. Rodriguez, M. Caccia, and L. Charlin (2021) Continual learning via local module composition. Advances in Neural Information Processing Systems (NeurIPS) 34, pp. 30298–30312. Cited by: §5.
G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter (2019) Continual lifelong learning with neural networks: a review. Neural Networks 113, pp. 54–71. Cited by: §1.
Q. Pham, C. Liu, and H. Steven (2021) Continual normalization: rethinking batch normalization for online continual learning. In International Conference on Learning Representations (ICLR), Cited by: §4.1.
S. Qiao, H. Wang, C. Liu, W. Shen, and A. Yuille (2019) Micro-batch training with batch-channel normalization and weight standardization. arXiv preprint arXiv:1903.10520. Cited by: Appendix B.
D. Rao, F. Visin, A. Rusu, R. Pascanu, Y. W. Teh, and R. Hadsell (2019) Continual unsupervised representation learning. Advances in Neural Information Processing Systems (NeurIPS) 32. Cited by: §5.
S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) ICaRL: incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2001–2010. Cited by: §1, §5.
A. Robins (1995) Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science 7 (2), pp. 123–146. Cited by: §3.1, §3.3, Table 1, §4.1, §5.
A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell (2016) Progressive neural networks. arXiv preprint arXiv:1606.04671. Cited by: §3.3, §5.
J. Schwarz, W. Czarnecki, J. Luketina, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell (2018) Progress & compress: a scalable framework for continual learning. In International Conference on Machine Learning (ICML), pp. 4528–4537. Cited by: Appendix B, §3.1, §3.3, Table 1, §4.1, §5.
J. Serra, D. Suris, M. Miron, and A. Karatzoglou (2018) Overcoming catastrophic forgetting with hard attention to the task. In International Conference on Machine Learning (ICML), pp. 4548–4557. Cited by: §5.
J. Smith, C. Taylor, S. Baer, and C. Dovrolis (2021) Unsupervised progressive learning and the stam architecture. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Cited by: §5.
Y. Tian, D. Krishnan, and P. Isola (2020) Contrastive multiview coding. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 776–794. Cited by: §5.
G. M. van de Ven, T. Tuytelaars, and A. S. Tolias (2022) Three types of incremental learning. Nature Machine Intelligence 4 (12), pp. 1185–1197. External Links: Document, ISBN 2522-5839, Link Cited by: §2.2.
E. Verwimp, M. De Lange, and T. Tuytelaars (2021) Rehearsal revealed: the limits and merits of revisiting samples in continual learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9385–9394. Cited by: §4.5.
J. S. Vitter (1985) Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS) 11 (1), pp. 37–57. Cited by: §3.2.
L. Wang, J. Xie, X. Zhang, M. Huang, H. Su, and J. Zhu (2023) Hierarchical decomposition of prompt-based continual learning: rethinking obscured sub-optimality. In Advances in Neural Information Processing Systems (NeurIPS), External Links: Link Cited by: §3.1.
T. Wang and P. Isola (2020) Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning (ICML), pp. 9929–9939. Cited by: §1, §2.1, §3.2.1, §3.3.
Z. Wang, Y. Li, L. Shen, and H. Huang (2024) A unified and general framework for continual learning. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §3.3.
Y. Wu, Y. Chen, L. Wang, Y. Ye, Z. Liu, Y. Guo, and Y. Fu (2019) Large scale incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 374–382. Cited by: §5.
Y. Wu and K. He (2018) Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19. Cited by: §1, §4.1.
Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3733–3742. Cited by: Appendix B, §4.2, §5.
T. Xiao, X. Wang, A. A. Efros, and T. Darrell (2021) What should not be contrastive in contrastive learning. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §1, §3.2.1.
J. Yoon, D. Madaan, E. Yang, and S. J. Hwang (2021) Online coreset selection for rehearsal-based continual learning. In International Conference on Learning Representations (ICLR), Cited by: §3.2.
J. Yoon, E. Yang, J. Lee, and S. J. Hwang (2018) Lifelong learning with dynamically expandable networks. In International Conference on Learning Representations (ICLR), Cited by: §3.3.
Y. You, I. Gitman, and B. Ginsburg (2017) Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888. Cited by: Appendix B.
J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny (2021) Barlow twins: self-supervised learning via redundancy reduction. In International Conference on Machine Learning (ICML), pp. 12310–12320. Cited by: Appendix B, §1, §2.1, §5.
F. Zenke, B. Poole, and S. Ganguli (2017) Continual learning through synaptic intelligence. In International Conference on Machine Learning (ICML), pp. 3987–3995. Cited by: §5.
H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2018) Mixup: beyond empirical risk minimization. In International Conference on Learning Representations (ICLR), Cited by: §3.3, §3.3.
B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba (2017) Places: a 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Cited by: 3rd item.

Appendix A Proof of Proposition 1

Proof.

Let $ν \sim Beta (α, α)$ . Define $z_{i} := g (f (x_{i})) / ∥ g (f (x_{i})) ∥_{2}$ , $u_{i} := g (f (y_{i})) / ∥ g (f (y_{i})) ∥_{2}$ , and $_{i} := g (f (_{i})) / ∥ g (f (_{i})) ∥_{2}$ for all $i \in {0, \dots | X |}$ . Assume that the representations are also linearly mixed in the representation space, i.e., $_{i} = ν z_{i} + (1 - ν) u_{i}$ .

Let $_{i}$ be the anchor, the representation of its other augmented view ${_{i}}^{'}$ as the positive, and a set of representations of mixed examples ${_{j}}_{j \neq i}$ as negatives. Then we can express the LUMP loss as

(11)

For convenience, define scalar quantities $p_{i}$ , $p_{j}$ ’s as the softmax probability of predictions:

	$p_{i}$	$= \frac{exp ({_{i}}^{⊤} {_{i}}^{'})}{exp ({_{i}}^{⊤} {_{i}}^{'}) + \sum_{j \neq i} exp ({_{i}}^{⊤}_{j})}, and$		(12)
	$p_{j}$	$= \frac{exp ({_{i}}^{⊤}_{j})}{exp ({_{i}}^{⊤} {_{i}}^{'}) + \sum_{k \neq i} exp ({_{i}}^{⊤}_{k})}, if j \neq i,$		(13)

where $0 \leq p_{i}, p_{j} \leq 1$ . Then gradient of the LUMP loss in Eq. 11 w.r.t. the anchor is

$E_{ν} [\frac{\partial L^{L U M P}}{\partial_{i}}]$	$= E_{ν} ⎡ ⎣ (p_{i} - 1) {_{i}}^{'} + \sum j \neq i p_{j}_{j} ⎤ ⎦$	(14)
	$= E_{ν} ⎡ ⎣ (p_{i} - 1) (ν z_{i}^{'} + (1 - ν) u_{i}^{'}) + \sum j \neq i p_{j} (ν z_{j} + (1 - ν) u_{j}) ⎤ ⎦$	(15)
	$= E_{ν} ⎡ ⎣ (p_{i} - 1) (ν) z_{i}^{'} + (p_{i} - 1) (1 - ν) u_{i}^{'} + \sum j \neq i (p_{j}) (ν) z_{j} + \sum j \neq i (p_{j}) (1 - ν) u_{j} ⎤ ⎦,$	(16)

where the equality holds between lines 14 and 15 because of our linearity assumption. Therefore, the gradient w.r.t. the current-task example’s representation is

	$E_{ν} [\frac{\partial L^{L U M P}}{\partial z_{i}}] = E_{ν} [\frac{\partial L^{L U M P}}{\partial_{i}} \frac{\partial_{i}}{\partial z_{i}}] = E_{ν} [ν \frac{\partial L^{L U M P}}{\partial_{i}}]$		(17)
	$= E_{ν} ⎡ ⎣ (p_{i} - 1) (ν^{2}) z_{i}^{'} + (p_{i} - 1) (1 - ν) (ν) u_{i}^{'} + \sum j \neq i (p_{j}) (ν^{2}) z_{j} + \sum j \neq i (p_{j}) (1 - ν) (ν) u_{j} ⎤ ⎦$		(18)
			(19)
	$= (p_{i} - 1) E [ν^{2}] z_{i}^{'} + \sum j \neq i (p_{j}) E [ν^{2}] z_{j} + \sum j \neq i (p_{j}) (E [ν] - E [ν^{2}]) u_{j} + (p_{i} - 1) (E [ν] - E [ν^{2}]) u_{i}^{'} .$		(20)

Notice that this is a weighted sum of $z_{i}^{'}$ , ${z_{j}}_{j}$ , ${u_{j}}_{j}$ , and $u_{i}^{'}$ . The first and second terms are the weighted representations of views of $x \in X$ the other two terms are the weighted representations of views of $y \in Y$ .

Now, the coefficients involve $p_{i}, p_{j}$ , which are the softmax probabilities given by Eq. 12 and Eq. 13; they also involve $E [ν], E [ν^{2}]$ , which are the first and second moments of Beta( $α$ , $α$ ) where $E [ν^{2}], E [ν] - E [ν^{2}] > 0$ . Therefore, it is easy to see that the coefficients on $z_{i}^{'}, u_{i}^{'}$ are negative, and the coefficients on ${z_{j}}_{j}$ and ${u_{j}}_{j}$ are positive. This conclusion holds as long as $E [ν^{2}], E [ν] - E [ν^{2}] > 0$ are satisfied; i.e., $ν$ does not necessarily need to be sampled from a symmetric Beta distribution as chosen by LUMP.

This concludes our proof for the first equality. The proof for $E_{ν} [\frac{\partial L^{L U M P}}{\partial u_{i}}]$ is similar due to symmetry. ∎

Appendix B Implementation Details

General details.

For GroupNorm, we set the number of groups to $min (32, ⌊ # channels / 4 ⌋)$ . We replace the ReLU activation functions following GroupNorm with Mish activation (Misra, 2020) to avoid dying ReLUs, or elimination singularities (Qiao et al., 2019). We use the same data augmentation procedure and parameters as Zbontar et al. (2021); Grill et al. (2020), and change the resized image size to $32 \times 32$ for CIFAR-100 and $64 \times 64$ for Tiny-ImageNet. We set $τ = 0.1$ for all contrastive losses. We train all our models for $200$ epochs with SGD optimizer, learning rate of 0.03, weight decay of $5 e^{- 4}$ , and batch size of $| X | = 256$ on all benchmarks, following Madaan et al. (2022). We do not use LARS (You et al., 2017) because we do not find it make significant differences on batches of size 256. We train Offline for the same number of steps as continual models. The models are trained on two NVIDIA Quadro RTX 8000 GPUs for all experiments. For KNN evaluation, we follow He et al. (2020); Wu et al. (2018)’s set up, where we set $k = 200$ and temperature $τ_{KNN} = 0.1$ .

UCL methods.

All hyperparameters are fine-tuned manually with at least three values. We adapt the official implementations of CaSSLe, LUMP to our codebase. Since POCON is a concurrent work and its code has not been made public yet, we re-implement it. For online EWC, we normalize the diagonal Fisher information matrix for each task following the authors (Schwarz et al., 2018) and set $γ = 1$ ; we use $50$ , $100$ as the weight on the regularization loss on the five and 20 task sequences of Split-CIFAR-100, respectively. For LUMP, we set the Beta distribution parameter $α = 0.4$ on all experiments; it’s the same value used by the authors on Tiny-ImageNet, and we find it performs better than the authors’ parameter $α = 0.1$ on CIFAR-100. For online POCON, we save the model checkpoint every $d s = 2, 000$ steps for CIFAR-100 and $d s = 4, 000$ steps for Tiny-ImageNet; we set the weights to $1$ on all its loss terms. For DER, ER, and ER+, we set the weights to $1$ on the additional loss terms. For all replay-based methods except for LUMP (which requires $| Y | = | X | = 256$ ), we uniformly sample $| Y | = \frac{3}{4} | X | = 192$ examples from the memory at each step for replay.

Appendix C Evaluation Results Given Task Identities

	5-Task Split-CIFAR-100					20-Task Split-CIFAR-100
	A ( $↑$ )	F ( $↓$ )	K ( $↑$ )	C ( $↑$ )	T ( $↑$ )	A ( $↑$ )	F ( $↓$ )	K ( $↑$ )	C ( $↑$ )	T ( $↑$ )
FT	72.4 ( $\pm$ 0.3)	1.7 ( $\pm$ 0.3)	7.9 ( $\pm$ 0.0)	59.1 ( $\pm$ 0.2)	36.3 ( $\pm$ 0.5)	84.4 ( $\pm$ 0.1)	4.2 ( $\pm$ 0.1)	6.6 ( $\pm$ 0.0)	47.2 ( $\pm$ 0.1)	28.6 ( $\pm$ 0.1)
ER	72.9 ( $\pm$ 0.2)	1.1 ( $\pm$ 0.2)	6.9 ( $\pm$ 0.3)	59.7 ( $\pm$ 0.4)	36.9 ( $\pm$ 0.3)	85.1 ( $\pm$ 0.2)	3.0 ( $\pm$ 0.2)	5.1 ( $\pm$ 0.4)	48.2 ( $\pm$ 0.5)	29.6 ( $\pm$ 0.3)
DER $^{†}$	72.5 ( $\pm$ 0.3)	1.6 ( $\pm$ 0.4)	8.1 ( $\pm$ 0.2)	59.0 ( $\pm$ 0.4)	36.2 ( $\pm$ 0.4)	84.2 ( $\pm$ 0.2)	4.2 ( $\pm$ 0.2)	6.7 ( $\pm$ 0.5)	47.2 ( $\pm$ 0.2)	28.5 ( $\pm$ 0.2)
LUMP	71.2 ( $\pm$ 0.5)	0.9 ( $\pm$ 0.5)	5.8 ( $\pm$ 0.4)	58.4 ( $\pm$ 0.4)	35.8 ( $\pm$ 0.3)	85.3 ( $\pm$ 0.6)	2.4 ( $\pm$ 0.7)	4.3 ( $\pm$ 0.3)	49.1 ( $\pm$ 1.0)	29.3 ( $\pm$ 0.1)
ER+	73.0 ( $\pm$ 0.5)	1.7 ( $\pm$ 0.3)	7.9 ( $\pm$ 0.2)	60.1 ( $\pm$ 0.3)	36.8 ( $\pm$ 0.5)	85.1 ( $\pm$ 0.2)	3.9 ( $\pm$ 0.1)	6.4 ( $\pm$ 0.2)	48.0 ( $\pm$ 0.1)	29.2 ( $\pm$ 0.1)
ER++	72.5 ( $\pm$ 0.5)	1.3 ( $\pm$ 0.6)	7.0 ( $\pm$ 0.0)	59.8 ( $\pm$ 0.3)	36.8 ( $\pm$ 0.4)	85.3 ( $\pm$ 0.2)	2.8 ( $\pm$ 0.2)	4.9 ( $\pm$ 0.1)	49.0 ( $\pm$ 0.5)	29.6 ( $\pm$ 0.0)
Osiris-R (Ours)	73.3 ( $\pm$ 0.2)	1.3 ( $\pm$ 0.3)	6.7 ( $\pm$ 0.4)	60.1 ( $\pm$ 0.1)	37.7 ( $\pm$ 0.1)	86.5 ( $\pm$ 0.2)	2.5 ( $\pm$ 0.2)	5.4 ( $\pm$ 0.5)	50.5 ( $\pm$ 0.5)	30.2 ( $\pm$ 0.4)
EWC	65.6 ( $\pm$ 0.7)	1.4 ( $\pm$ 0.8)	3.7 ( $\pm$ 0.5)	53.9 ( $\pm$ 0.7)	32.8 ( $\pm$ 0.7)	79.0 ( $\pm$ 0.2)	1.9 ( $\pm$ 0.1)	3.7 ( $\pm$ 0.1)	39.4 ( $\pm$ 0.3)	24.0 ( $\pm$ 0.2)
CaSSLe	72.6 ( $\pm$ 0.2)	0.5 ( $\pm$ 0.4)	6.7 ( $\pm$ 0.5)	59.5 ( $\pm$ 0.3)	36.1 ( $\pm$ 0.4)	85.4 ( $\pm$ 0.1)	2.5 ( $\pm$ 0.1)	5.1 ( $\pm$ 0.2)	49.2 ( $\pm$ 0.2)	28.8 ( $\pm$ 0.2)
POCON $^{⋆}$	72.1 ( $\pm$ 0.7)	2.2 ( $\pm$ 0.8)	8.0 ( $\pm$ 0.5)	59.3 ( $\pm$ 0.3)	36.4 ( $\pm$ 0.4)	84.0 ( $\pm$ 0.2)	4.4 ( $\pm$ 0.4)	6.9 ( $\pm$ 0.3)	46.8 ( $\pm$ 0.3)	28.3 ( $\pm$ 0.3)
Osiris-D (Ours)	73.9 ( $\pm$ 0.1)	0.7 ( $\pm$ 0.3)	6.7 ( $\pm$ 0.5)	60.5 ( $\pm$ 0.1)	37.3 ( $\pm$ 0.2)	86.7 ( $\pm$ 0.3)	2.2 ( $\pm$ 0.3)	5.1 ( $\pm$ 0.2)	51.3 ( $\pm$ 0.1)	30.0 ( $\pm$ 0.2)
Offline	74.0 ( $\pm$ 0.2)	-	-	60.4 ( $\pm$ 0.4)	-	88.7 ( $\pm$ 0.1)	-	-	53.9 ( $\pm$ 0.2)	-

Table 6: Results on standard Split-CIFAR-100 with five or 20 tasks. The ground truth task identity for each example is given to the model at test time, so the model predicts the most probable class within the task. The best model in each column is made bold and the second-best model is underlined.

^{†}

Improved DER.

^{⋆}

Online version of POCON. We separate replay-based (top) and distillation-based methods (bottom) for easier comparisons.

	Structured CIFAR-100					Structured Tiny-ImageNet
	A ( $↑$ )	F ( $↓$ )	K ( $↑$ )	C ( $↑$ )	T ( $↑$ )	A ( $↑$ )	F ( $↓$ )	K ( $↑$ )	C ( $↑$ )	T ( $↑$ )
FT	64.6 ( $\pm$ 0.8)	7.6 ( $\pm$ 0.8)	10.0 ( $\pm$ 0.2)	59.2 ( $\pm$ 0.5)	29.9 ( $\pm$ 0.3)	57.8 ( $\pm$ 0.2)	6.2 ( $\pm$ 0.3)	6.4 ( $\pm$ 0.2)	43.5 ( $\pm$ 0.2)	34.4 ( $\pm$ 0.1)
LUMP	66.6 ( $\pm$ 0.4)	3.7 ( $\pm$ 0.2)	7.2 ( $\pm$ 0.4)	62.7 ( $\pm$ 0.7)	30.6 ( $\pm$ 0.1)	59.5 ( $\pm$ 0.8)	2.9 ( $\pm$ 1.1)	4.9 ( $\pm$ 0.2)	45.2 ( $\pm$ 0.8)	34.1 ( $\pm$ 0.3)
CaSSLe	66.4 ( $\pm$ 0.2)	4.2 ( $\pm$ 0.4)	7.7 ( $\pm$ 0.3)	61.6 ( $\pm$ 0.1)	30.5 ( $\pm$ 0.1)	59.0 ( $\pm$ 0.6)	4.3 ( $\pm$ 0.3)	5.4 ( $\pm$ 0.6)	43.9 ( $\pm$ 0.3)	34.7 ( $\pm$ 0.3)
POCON $^{⋆}$	64.5 ( $\pm$ 0.4)	7.6 ( $\pm$ 0.6)	9.8 ( $\pm$ 0.3)	59.6 ( $\pm$ 0.8)	30.0 ( $\pm$ 0.3)	58.1 ( $\pm$ 0.7)	5.5 ( $\pm$ 0.9)	6.6 ( $\pm$ 0.1)	43.6 ( $\pm$ 0.5)	34.0 ( $\pm$ 0.3)
Osiris-R (Ours)	67.3 ( $\pm$ 0.2)	4.9 ( $\pm$ 0.2)	7.6 ( $\pm$ 0.1)	62.8 ( $\pm$ 0.2)	32.3 ( $\pm$ 0.1)	59.9 ( $\pm$ 0.2)	4.4 ( $\pm$ 0.3)	5.9 ( $\pm$ 0.3)	45.1 ( $\pm$ 0.1)	35.2 ( $\pm$ 0.4)
Osiris-D (Ours)	67.8 ( $\pm$ 0.1)	4.6 ( $\pm$ 0.1)	8.0 ( $\pm$ 0.2)	64.2 ( $\pm$ 0.1)	32.0 ( $\pm$ 0.1)	61.5 ( $\pm$ 0.1)	3.2 ( $\pm$ 0.3)	5.6 ( $\pm$ 0.5)	46.5 ( $\pm$ 0.2)	35.6 ( $\pm$ 0.1)
Offline	69.2 ( $\pm$ 0.3)	-	-	67.9 ( $\pm$ 0.1)	-	60.7 ( $\pm$ 0.2)	-	-	46.4 ( $\pm$ 0.2)	-

Table 7: Results on the 10-task sequences on structured CIFAR-100 and Tiny-ImageNet. The ground truth task identity for each example is given to the model at test time, so the model predicts the most probable class within the task. The two best models are marked. Osiris-D performs the best, surpassing Offline on Structured Tiny-ImageNet.

	A ( $↑$ )	F ( $↓$ )	K ( $↑$ )	C ( $↑$ )	T ( $↑$ )
w/o isolated space	85.6 ( $\pm$ 0.3)	2.8 ( $\pm$ 0.6)	4.9 ( $\pm$ 0.2)	49.3 ( $\pm$ 0.2)	29.8 ( $\pm$ 0.1)
w/o $L_{cross}$	84.4 ( $\pm$ 0.2)	4.1 ( $\pm$ 0.3)	6.2 ( $\pm$ 0.4)	47.2 ( $\pm$ 0.2)	28.9 ( $\pm$ 0.2)
w/o $L_{past}$	86.3 ( $\pm$ 0.3)	3.1 ( $\pm$ 0.2)	6.1 ( $\pm$ 0.3)	50.4 ( $\pm$ 0.3)	30.0 ( $\pm$ 0.1)
Full	86.7 ( $\pm$ 0.3)	2.2 ( $\pm$ 0.3)	5.1 ( $\pm$ 0.2)	51.3 ( $\pm$ 0.1)	30.0 ( $\pm$ 0.2)

Table 8: Ablation of Osiris-D’s components on 20-task Split-CIFAR-100. The ground truth task identity for each example is given to the model at test time, so the model predicts the most probable class within the task.

L_{cross}

benefits representations even in within-task discrimination when not considering classes from other tasks.

We do not provide access to the task identity for each example at test time in Sec. 4. In the scenario where task labels are given during evaluation (Madaan et al., 2022), it remains an open question whether consolidation provides benefits. We hypothesize that the consolidation term is beneficial because it potentially learns a broader set of features than FT by increasing the diversity of batches for the contrastive loss.

To provide some evidence, we perform the same set of evaluations as in Sec. 4, but with task labels given to the model. In other words, the model predicts the most probable class among classes within the same task as the ground truth class $i$ when calculating $A_{\cdot, i}$ . The consolidation score (task-level KNN accuracy) is still calculated without task identities, and thus stays the same. The results are shown in Table 6, Table 7, and Table 8. We find that the consolidation score still correlates with the accuracy, and Osiris-D consistently outperforms other models. In Table 8, without $L_{c r o s s}$ , Osiris-D experiences a 4.1% consolidation score drop and a 2.3% drop in accuracy. This shows that $L_{c r o s s}$ benefits the representation even in within-task discrimination.

Additionally, with consolidation, we expect the representations to separate all classes regardless of task identity. Therefore, they put the least assumptions on what classes we try to discriminate at test time and should appeal for a broader set of downstream use cases. For example, the test data is not required to be a subset of one of the tasks seen during training.

Appendix D Additional Plots

In this section, we show four additional sets of plots. First, to complement Fig. 2(b), we plot the relative loss difference curves for Osiris-R and Osiris-D on the outputs of $g \circ f$ , i.e., on the space where $L_{c u r r e n t}$ is applied, in Fig. 4(a). Then, Fig. 6 shows the accuracy versus extra storage for different UCL models on 20-task Split-CIFAR-100. Fig. 7 visualizes the pairwise cosine similarity distributions between examples from different class pairs. Finally, Fig. 8 through Fig. 11 plot the per-task accuracy of FT, LUMP, CaSSLe, and Osiris-D throughout training on 20-task Split-CIFAR-100.

(a) Osiris’s curves calculated with $g \circ f$ .

Figure 6: Accuracy on 20-task Split-CIFAR-100 versus additional storage. Storage is calculated by counting each additional model parameter besides the main model as well as each channel of each pixel in memory as a 64-bit float.

Figure 7: Left: pairwise feature similarity distributions between examples from the same (positive) or different (negative) classes, within each environment of Structured Tiny-ImageNet, as well as on all data (overall). The densities are estimated with a Gaussian kernel. The intersections are marked with a darker shade and the values are obtained by integrating the shaded area. An empty intersection of supports of the two distributions sufficiently entails a KNN classifier with perfect accuracy, but it is not necessary. Right: mean cosine similarity between pairs of examples drawn from pairs of classes.

Figure 8: Task 1-5 accuracy throughout training on 20-task Split-CIFAR-100.

Figure 9: Task 6-10 accuracy throughout training on 20-task Split-CIFAR-100.

Figure 10: Task 11-15 accuracy throughout training on 20-task Split-CIFAR-100.

Figure 11: Task 15-20 accuracy throughout training on 20-task Split-CIFAR-100.

Appendix E Structured Class Order

We list below the classes within each task in Structured Tiny-ImageNet which we described in Sec. 4. Tasks 1-4 contain classes in an indoor environment. Task 5 contains classes in both indoor and city environments and serves as a soft transition. Tasks 6 and 7 contain classes in a city environment. Finally, tasks 9 and 10 contain classes in a natural, wild environment.

Task 1: trilobite, binoculars, American lobster, bow tie, volleyball, banana, fur coat, barbershop, sombrero, water jug, bathtub, beer bottle, bell pepper, hourglass, ice cream, altar, lampshade, boa constrictor, frying pan, Christmas stocking.
Task 2: turnstile, tabby, potter’s wheel, chain, lemon, pill bottle, iPod, cockroach, oboe, punching bag, abacus, refrigerator, sock, bannister, candle, plate, ice lolly, Yorkshire terrier, apron, drumstick.
Task 3: poncho, dining table, neck brace, guacamole, gasmask, backpack, academic gown, vestment, cash machine, CD player, espresso, potpie, syringe, orange, plunger, desk, Chihuahua, miniskirt, pretzel, bucket.
Task 4: organ, chest, guinea pig, stopwatch, sandal, broom, pomegranate, barrel, wok, comic book, computer keyboard, meat loaf, pizza, basketball, remote control, teapot, mashed potato, teddy, cardigan, space heater.
Task 5: Egyptian cat, rocking chair, wooden spoon, pop bottle, sunglasses, magnetic compass, sewing machine, jellyfish, beaker, Labrador retriever, dumbbell, nail, obelisk, lifeboat, steel arch bridge, moving van, gondola, military uniform, pole, beach wagon.
Task 6: freight car, torch, umbrella, rugby ball, limousine, projectile, brass, go-kart, confectionery, pay-phone, German shepherd, reel, trolleybus, crane, fountain, jinrikisha, convertible, tractor, butcher shop, thatch.
Task 7: suspension bridge, bullet train, kimono, picket fence, water tower, school bus, maypole, birdhouse, sports car, beacon, parking meter, bikini, swimming trunks, flagpole, triumphal arch, cannon, Persian cat, scoreboard, police van, lawn mower.
Task 8: dragonfly, scorpion, American alligator, tarantula, lion, golden retriever, mantis, bullfrog, African elephant, snail, bighorn, baboon, sea cucumber, brown bear, cougar, seashore, king penguin, koala, ladybug, tailed frog.
Task 9: black widow, ox, grasshopper, acorn, fly, Arabian camel, coral reef, cliff dwelling, goldfish, goose, spider web, brain coral, barn, monarch, black stork, spiny lobster, standard poodle, sulphur butterfly, viaduct, albatross.
Task 10: sea slug, chimpanzee, snorkel, slug, gazelle, dam, European fire salamander, hog, centipede, lesser panda, walking stick, lakeside, bee, mushroom, dugong, cauliflower, bison, alp, orangutan, cliff.

Integrating Present and Past in Unsupervised Continual Learning

Abstract

1 Introduction

2 Preliminaries

2.1 Self-Supervised Learning

Generalized contrastive loss.

2.2 Unsupervised Continual Learning

3 Dissecting the Learning Objective of UCL

3.1 Three Desirable Properties

Plasticity and stability.

Cross-task consolidation.

3.2 Osiris: Integrating Objectives of Present and Past

3.2.1 Plasticity Loss

3.2.2 Stability Loss

Osiris-D(istillation).

Osiris-R(eplay).

Remark.

3.2.3 Cross-Task Consolidation Loss

Remark.

3.2.4 Overall Loss

3.3 Expressing UCL Methods with A Unifying Framework

Elastic Weight Consolidation (EWC)

CaSSLe

Experience Replay (ER)

ER+ and ER++

Dark Experience Replay (DER)

Lump

Proposition 1.

4 Experiments

4.1 Experimental Protocol

Baselines.

Benchmarks.

Implementation details.

Incompatibility between BatchNorm and UCL.

4.2 Metrics for Evaluating UCL Methods

4.3 Results on the Standard Benchmarks

General findings.

4.4 Results on the Structured Benchmarks

4.5 Analysis

Balancing stability, plasticity, and consolidation.

Ablations.

How does a structured task sequence affect representation?

Does replay-based methods overfit to memory?

5 Related Work

Self-supervised learning.

Continual learning.

6 Conclusion

Acknowledgment

References

Appendix A Proof of Proposition 1

Proof.

Appendix B Implementation Details

General details.

UCL methods.

Appendix C Evaluation Results Given Task Identities

Appendix D Additional Plots

Appendix E Structured Class Order

Integrating Present and Past
in Unsupervised Continual Learning