License: CC BY-NC-SA 4.0
arXiv:2402.16674v1 [cs.CV] 26 Feb 2024

ConSept: Continual Semantic Segmentation via Adapter-based Vision Transformer

Bowen Dong$^{1,2}$, Guanglei Yang$^{1}$, Wangmeng Zuo$^{1}$, Lei Zhang$^{2}$
1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China (e-mail: [email protected]).
2. Department of Computing, the Hong Kong Polytechnic University (e-mail: [email protected]).
✉ denotes corresponding author.
Abstract

In this paper, we delve into the realm of vision transformers for continual semantic segmentation, a problem that has not been sufficiently explored in previous literature. Empirical investigations on the adaptation of existing frameworks to vanilla ViT reveal that incorporating visual adapters into ViTs or fine-tuning ViTs with distillation terms is advantageous for enhancing the segmentation capability of novel classes. These findings motivate us to propose Continual semantic Segmentation via Adapter-based ViT, namely ConSept. Within the simplified architecture of ViT with linear segmentation head, ConSept integrates lightweight attention-based adapters into vanilla ViTs. Capitalizing on the feature adaptation abilities of these adapters, ConSept not only retains superior segmentation ability for old classes, but also attains promising segmentation quality for novel classes. To further harness the intrinsic anti-catastrophic forgetting ability of ConSept and concurrently enhance the segmentation capabilities for both old and new classes, we propose two key strategies: distillation with a deterministic old-classes boundary for improved anti-catastrophic forgetting, and dual dice losses to regularize segmentation maps, thereby improving overall segmentation performance. Extensive experiments show the effectiveness of ConSept on multiple continual semantic segmentation benchmarks under overlapped or disjoint settings. Code will be publicly available at https://github.com/DongSky/ConSept.

Index Terms:
Continual Learning, Semantic Segmentation, Vision Transformer, Visual Adapter, Knowledge Distillation.

I Introduction

Semantic segmentation [1, 2, 3] aims to recognize and segment region masks for the given categories. The evolution of semantic segmentation algorithms [3, 4, 5, 6, 7] has allowed the segmentation models to produce precise masks for individual images. However, in practical scenarios, these aforementioned algorithms are expected to possess the capability to assimilate novel concepts continually through specific learning paradigms, i.e., continual semantic segmentation, which ensures that the acquired algorithms exhibit proficient performance across both old and newly added categories and mitigate the phenomenon of catastrophic forgetting [8, 9]. In this paper, we focus on the class-incremental setting [10, 11, 12] and delve into discussions regarding methods for continual semantic segmentation.

Generally speaking, prior networks dedicated to continual semantic segmentation [13, 14, 10, 15] commonly employ a base learner built from a CNN-based feature extractor [16] and a DeepLab segmentation head [3]. These learners are then combined with strategies such as a) knowledge distillation methods [10, 15], b) regularization-based methods [17, 18], and c) exemplar-replay methods [19, 20, 21]. These frameworks notably enhance the learner’s resilience against catastrophic forgetting. However, existing CNN-based methods encounter two fundamental challenges. Firstly, the fixed-size convolution kernels impede the long-range interaction capacity of feature extractors, thereby constraining the overall segmentation performance. Secondly, the anti-catastrophic forgetting ability on old classes remains limited, which restricts the applicability of continual semantic segmentation methods.

Refer to caption
Figure 1: Performance comparison between ConSept and state-of-the-art methods [10, 14, 22]. ConSept obtains the best performance on all PASCAL VOC benchmarks with overlapped setting. Best viewed in color.

Inspired by state-of-the-art Vision Transformer (ViT)-based segmentation methods [23, 5, 6], one can utilize the ViT feature extractor to construct ViT-based continual semantic segmentation frameworks [22, 24, 25] for better performance. However, existing state-of-the-art ViT-based continual segmenters face several critical issues, including the reliance on heavy segmentation decoders [22, 25, 24], and the need of extra region proposals [24] to maintain optimal performance. Such drawbacks limit the practical applications of such methods. Consequently, a question arises: could we formulate an effective continual segmentation framework using vision transformers with straightforward linear segmentation heads?

To answer this question, we start from a pretrained ViT [26] with a linear segmentation head to conduct a preliminary study. Upon integrating such a model into a classical CNN-based continual semantic segmentation method (e.g., SSUL [13]), the ViT-based continual segmenter exhibits comparable performance on base classes but experiences a notable decline in mIoU on novel classes, consequently resulting in an overall performance degradation. Through empirical investigation into pretrained vision transformers, we find that a frozen pretrained vision transformer tends to overfit on base classes, thereby preserving feasible anti-catastrophic forgetting capability on these classes while concurrently constraining its generalization capacity to novel categories. Furthermore, a comprehensive empirical analysis in Sec. III-C reveals that incorporating adapters into the ViT feature extractor and introducing additional distillation terms are advantageous for both old and novel classes in the context of continual semantic segmentation. These findings naturally motivate us to design an adapter-based vision transformer and integrate it with specifically designed continual learning frameworks. Such an architecture, even in the absence of complex segmentation decoders, could accomplish the dual objectives of: 1) enhancing anti-catastrophic forgetting ability for base classes, and 2) attaining robust segmentation capability for novel classes. With the above motivations, we present ConSept, a straightforward yet efficacious continual semantic segmentation method.

Maintaining a simplistic macro architecture comprising solely ViT [26] and a linear segmentation head, ConSept sheds light on ViT-based continual semantic segmentation. A primary concern on freezing ViT lies in its sub-optimal segmentation performance on novel categories. In response to this challenge, we propose a pivotal strategy, namely fine-tuning with adapters for better generalization, which is the most significant component of ConSept. Inspired by prior works on visual adapters [27, 28, 29] for transfer learning, we integrate a shallow convolution-based stem block and lightweight attention-based adapters into the pretrained ViT [26], resulting in a dual-path feature extractor. The image features refined by these adapters are subsequently input into a simple linear segmentation head for prediction. With less than 10% additional parameters, ConSept notably enhances the generalization ability for novel classes, preserving the anti-catastrophic forgetting capability for base classes effectively and efficiently. Note that conventional approaches [13, 14] often adopt the strategy of freezing feature extractors during training to mitigate the risk of catastrophic forgetting. However, in the context of ViT-based scenarios shown in Table I, this strategy imposes substantial limitations on the segmentation ability of novel classes. To mitigate the adverse effects associated with frozen ViT and leverage the benefits of adapters within ConSept, we advocate fine-tuning the entire feature extractor. This strategic adjustment ensures the network’s capacity to learn representative features for new concepts, reinforcing its generalization ability on novel classes.

In conjunction with the adapter-based macro design in ConSept, we introduce two additional strategies to further harness the efficacy of adapter-based ConSept for continual segmentation. The first strategy involves distillation with a deterministic old-classes boundary for better anti-catastrophic forgetting. Directly fine-tuning the ViT feature extractor without any constraints could still lead to catastrophic forgetting for base classes [22, 10]. To further enhance the anti-forgetting capability, we propose to employ a dense mean-square loss and a dense contrastive loss to distill image features between old and new models, and to keep the linear head for old classes frozen to uphold a well-defined decision boundary for segmentation. The second strategy revolves around dual dice losses for better regularization. Specifically, we incorporate two dice losses to assess errors in segmentation predictions. For the class-specific dice loss term, we adhere to the classical dice loss [30] to enhance overall segmentation proficiency. Meanwhile, for the old-new dice loss term, we binarize the pseudo ground-truth maps into regions corresponding to old classes and new classes, and apply a dice loss constraint to improve the segmentation quality for novel classes.

Experiments are conducted on two challenging continual semantic segmentation benchmarks, i.e., PASCAL VOC [1] and ADE20K [31]. As shown in Fig. 1, ConSept achieves leading performance on PASCAL VOC benchmarks under various class-incremental learning steps. Even in the challenging scenarios with more classes, ConSept can still obtain remarkable segmentation performance for both base and novel classes. The promising results demonstrate that, without relying on extra object proposals from external models [14, 24] or a heavy segmentation decoder [22], ConSept can still enhance the anti-catastrophic forgetting capability on base classes while effectively improving the segmentation ability for novel classes. Our method provides a strong and stable baseline for ViT-based continual semantic segmentation tasks.

II Related Work

II-A Continual Learning

Deep neural networks with the vanilla “pre-training then finetuning” paradigm often face the problem of severe catastrophic forgetting [8] in various tasks (e.g., image classification), resulting in unsatisfying performance on old categories. To tackle this issue, continual learning approaches [9] have been proposed to maintain the recognition ability on old classes while obtaining promising recognition accuracy on novel categories. Specifically, Li et al. proposed LwF [9], which builds the consistency between old models and new models on confidence scores of seen categories, thus reducing the effect of forgetting during novel classes training. Further works follow the fundamental design of LwF, and address catastrophic forgetting with four classes of methods. The first class is distillation-based methods, which conduct knowledge distillation on feature representation [32, 33] or output confidence scores [9] between old and new models. The second class is replay-based methods [34, 33], which store a small number of exemplars from seen categories, and utilize them during the training of newly-added tasks to avoid catastrophic forgetting explicitly. The third class is regularization-based methods [35], which introduce explicit constraints on model parameters between two tasks or design implicit regularization on feature representations to avoid forgetting. The last class is architecture-based methods [36], which mainly focus on fine-tuning specially designed new layers or adapters to obtain promising performance on new tasks and maintain accuracy on old tasks. Our work can be categorized as a distillation-based method. Different from the above works, we focus on the more challenging continual semantic segmentation task.

II-B Continual Semantic Segmentation

Semantic segmentation [37, 7, 5, 6] aims to identify the category for each pixel in an image. To meet the need of continual learning, Michieli et al. [38] started from DeepLab [3] with CNN-based backbone [16] and proposed the basic structure of continual semantic segmentation framework via knowledge distillation. Methods of continual semantic segmentation can be also divided into three categories. The first category is distillation-based methods [10, 15]. Cermelli et al. [10] proposed the continual semantic segmentation task as well as the baseline method namely MiB. During training of new tasks, MiB reconstructs the confidence score of background class from newly-added classes to maintain the consistency between old and new models to avoid catastrophic forgetting. Douillard et al. proposed PLOP [15], which extends POD feature [32] by using the spatial-pyramid scheme, obtaining better performance on base classes. The second category is regularization-based methods [17, 18]. For example, Michieli et al. proposed SDR [17], which leverages prototype matching to enforce latent space consistency on old classes, thus avoiding catastrophic forgetting. Lin et al. proposed the structure preserving loss to maintain inter-class structure and intra-class structure, respectively. The third category is replay-based methods [19, 20, 21]. Similar to replay-based continual image classification, these methods store a small number of representative samples with old classes annotation, and replay these samples during new tasks training.

Nevertheless, though vision transformers have shown remarkable performance on dense prediction tasks, only a few works [25, 22] have managed to investigate transformer-based continual segmentation. All these methods largely leverage specially-designed transformer-based decoders [5, 4] to obtain promising performance. Our work also focuses on continual semantic segmentation with ViTs. Different from previous works, our work illustrates that, even without heavy decoder for accurate segmentation, we can still ensure high prediction quality for both old and new categories.

II-C Vision Transformers

Benefiting from stacked self-attention mechanisms [39] and feed-forward networks, vision transformers (ViTs) [26, 40] have demonstrated remarkable performance on various vision tasks [41, 42, 5, 6, 43, 44, 45]. Conventional exploration of vision transformers focuses on two paradigms. The first is the classical “large-scale pre-training then fine-tuning” paradigm [41, 42, 6, 5, 46, 44]. This paradigm starts from ImageNet-pretrained [47, 48, 49] vision transformers and then conducts full fine-tuning [5, 6, 41, 40] or parameter-efficient tuning [50, 28, 44] on standard downstream tasks (e.g., detection [2] or segmentation [31]), thus obtaining desired performance on corresponding tasks. The second paradigm is transformer-based few-shot learning [51, 43, 52], which focuses on learning robust and generalized image feature representations, such that the learned ViTs are able to recognize novel categories without adaptation or with only fast low-shot adaptation. The above works mainly concern the performance on newly-added tasks and do not evaluate the capability against catastrophic forgetting. Recent works have explored continual learning with vision transformers in conventional continual image classification settings [53] or learning with pretrained models settings [11, 12, 54, 55, 56]. Different from previous works, we aim to explore the continual learning capability of ViTs on the more challenging downstream segmentation tasks [25, 22]. Our experimental results illustrate that, even without complicated segmentation heads [5, 57, 7], we can still obtain promising results on multiple continual semantic segmentation benchmarks.

III Preliminary Study

III-A Problem Definition

Building upon prior studies [17, 10, 22, 13, 14], we focus on continual semantic segmentation tasks within the class-incremental setting spanning $T$ steps. The task is defined as follows. Given a semantic segmentation dataset $\mathcal{D}$ and its corresponding class set $\mathcal{C}$, we partition $\mathcal{C}$ into $T$ non-overlapping subsets of classes, denoted by $\{\mathcal{C}_1, \dots, \mathcal{C}_T\}$. Accordingly, the dataset $\mathcal{D}$ is partitioned into $T$ subsets $\{\mathcal{D}_1, \dots, \mathcal{D}_T\}$, representing distinct tasks. For each task $t$ with data $\mathcal{D}_t$, where $1 \leq t \leq T$, the ground-truth segmentation masks corresponding to the classes in $\mathcal{C}_t$ are accessible. It is conventionally assumed that $|\mathcal{C}_2| = \dots = |\mathcal{C}_T|$, yielding $X$-$Y$ continual semantic segmentation benchmarks, where $X = |\mathcal{C}_1|$ denotes the number of base classes and $Y = |\mathcal{C}_2|$ signifies the number of novel classes introduced in each new task.
The purpose of continual semantic segmentation is to learn a feature extractor $f$ as well as a segmentation decoder $g$ on each $\mathcal{D}_t$ sequentially (i.e., from $\mathcal{D}_1$ to $\mathcal{D}_T$) under the constraint that training annotations from $\mathcal{D}_{1:t-1}$ are not accessible. At each step $t > 1$, the optimized $f$ with $g$ should obtain promising segmentation capability for both old categories $\mathcal{C}_{1:t-1}$ and newly learned categories $\mathcal{C}_t$.
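
As an illustration of the X-Y naming convention above, the following minimal sketch builds the class subsets for a class-incremental schedule. The function name `split_classes`, the use of contiguous class ids starting at 1, and the exclusion of the background class (id 0) are assumptions made for this example only; they are not taken from the ConSept code base.

```python
from typing import List


def split_classes(num_classes: int, num_base: int, num_per_step: int) -> List[List[int]]:
    """Partition class ids 1..num_classes into an X-Y class-incremental schedule.

    The first subset holds `num_base` (X) base classes; every later subset
    holds `num_per_step` (Y) novel classes. Background (id 0) is left out,
    which is an assumption of this sketch.
    """
    remaining = num_classes - num_base
    if num_base <= 0 or num_per_step <= 0 or remaining < 0:
        raise ValueError("invalid split sizes")
    if remaining % num_per_step != 0:
        raise ValueError("novel classes must divide evenly into steps")

    class_ids = list(range(1, num_classes + 1))
    subsets = [class_ids[:num_base]]  # C_1: base classes
    for start in range(num_base, num_classes, num_per_step):
        subsets.append(class_ids[start:start + num_per_step])  # C_2 .. C_T
    return subsets


# PASCAL VOC has 20 foreground classes; the 15-1 benchmark therefore has T = 6 steps.
voc_15_1 = split_classes(num_classes=20, num_base=15, num_per_step=1)
print(len(voc_15_1))  # 6
print(voc_15_1[1])    # [16]
```

Under these assumptions, the 15-1 split yields one base step followed by five single-class steps, matching the six steps reported for VOC 15-1 in Table I.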

Distinguished from other tasks in visual recognition, the continual semantic segmentation task confronts two distinctive challenges. First, the complexity of accurately predicting pixel-level classification results exceeds that of obtaining image-level counterparts, and second, owing to the scarcity of abundant segmentation masks for $\mathcal{C}_{1:t-1}$ during training at step $t$, the segmentation network often faces pronounced catastrophic forgetting concerning $\mathcal{C}_{1:t-1}$. Consequently, the formulation of a straightforward yet efficient framework for continual semantic segmentation becomes highly demanding. Based on the remarkable progress in ViT-based dense prediction methods [6, 5, 23], in this work we investigate in-depth how a vanilla ViT works on continual semantic segmentation.

TABLE I: Performance investigation of CNN-based and ViT-based continual semantic segmentation methods on the PASCAL VOC 15-1 task (6 steps) under overlapped setting, reported as mIoU (%) with the gap to joint training in parentheses. (Upper) Replacing the CNN backbone with vanilla ViT in the SSUL framework results in performance deterioration. (Lower) Various ViT-based variants in continual segmentation. Our findings indicate: 1) an extremely lightweight linear decoder reduces forgetting on base classes; 2) fixing the backbone limits ViT’s generalization for continual segmentation; and 3) introducing an adapter enhances anti-catastrophic forgetting and improves segmentation performance.
Base Framework    | Backbone        | Decoder        | Freeze | Distill | Adapter | 0-15            | 16-20           | all
SSUL [13] (Joint) | ResNet-101 [16] | DeepLab V3 [3] | ✓      | -       | -       | 82.70           | 75.00           | 80.90
SSUL [13]         | ResNet-101 [16] | DeepLab V3 [3] | ✓      | -       | -       | 78.40 (-4.30%)  | 49.00 (-26.00%) | 71.40 (-9.50%)
SSUL [13]         | ViT-B [26]      | DeepLab V3 [3] | ✓      | -       | -       | 76.68 (-5.02%)  | 43.78 (-31.22%) | 68.85 (-12.05%)
SSUL [13]         | ViT-B [26]      | Linear         | ✓      | -       | -       | 80.81 (-1.89%)  | 26.87 (-48.13%) | 67.97 (-12.93%)
SSUL [13]         | ViT-B [26]      | Linear         | -      | -       | -       | 39.06 (-43.66%) | 42.25 (-32.75%) | 39.82 (-41.08%)
SSUL [13]         | ViT-B [26]      | Linear         | -      | ✓       | -       | 80.43 (-2.27%)  | 62.67 (-12.33%) | 76.20 (-4.70%)
SSUL [13]         | ViT-B [26]      | Linear         | ✓      | -       | ✓       | 81.40 (-1.30%)  | 53.40 (-21.60%) | 74.80 (-6.10%)

III-B Vanilla ViT on Continual Semantic Segmentation

We are motivated by the following two factors to investigate vanilla ViT for continual semantic segmentation: 1) the impressive performance demonstrated by pretrained vision transformers on segmentation tasks [5, 6, 23]; and 2) the limited exploration of ViTs’ anti-catastrophic forgetting capability in dense prediction scenarios. Given these considerations, ViTs are expected to outperform CNN-based approaches for continual semantic segmentation. We adopt the classical continual semantic segmentation framework SSUL [13] as the baseline method, which utilizes a frozen CNN feature extractor and the exemplar replay technique for optimal performance. Specifically, we replace the original ResNet-101 feature extractor in SSUL with the ImageNet-pretrained ViT-B feature extractor [26] and maintain other components unchanged. The baseline SSUL and the ViT-B variant are evaluated on the PASCAL VOC 15-1 benchmark under an overlapped setting [1, 10]. The results are presented in the upper part of Table I. Though introducing larger-range interactions and additional training parameters, the ViT-B variant does not surpass the baseline, achieving only 76.68% mIoU on base classes, 1.72% lower than its CNN counterpart. Compared to the CNN-based SSUL baseline, the ViT-B variant exhibits a significant performance decline by 5.22% in mIoU on all novel classes, indicating challenges in learning new concepts in continual segmentation tasks. These findings prompt an investigation into the limitations and potential solutions for ViT-based continual semantic segmentation.

III-C Analysis

Based on prior ViT works [58, 43], if we perform fine-tuning on $\mathcal{D}_1$, the segmentation network $f$ with decoder $g$ may overfit to base classes $\mathcal{C}_1$, limiting its generalization performance to novel categories $\mathcal{C}_{2:T}$ and resulting in lower overall mIoU. To test this hypothesis, we replace the DeepLab V3 segmentation head with a simple linear decoder, directly correlating predictions to ViT output features. In the lower part of Table I, we see that introducing a linear head for the ViT-B variant yields a 4.13% mIoU improvement on base classes, surpassing the CNN-based SSUL baseline by 2.4%. This suggests that pretrained ViT possesses sufficient anti-catastrophic forgetting capability on base classes. However, once the complex decoder structure that accommodates features for novel classes is removed, the mIoU of novel classes drops notably to 26.87%, underscoring the limited generalization performance of pretrained ViT to novel classes. This observation substantiates our hypothesis. Consequently, an intuitive question arises: can we employ the pretrained vanilla ViT with a linear decoder to formulate an effective continual segmentation network, achieving sustained strong segmentation for base classes and robust segmentation capability for novel classes?

The answer to the above question unfolds in two aspects. Firstly, drawing inspiration from successful endeavors in parameter-efficient fine-tuning [59, 60, 28, 27, 29], we posit that incorporating lightweight modules, such as adapters [28, 29, 61], into the ViT feature extractor can enhance segmentation quality for novel classes. To validate this hypothesis, we employ the ViT-B feature extractor with a linear decoder and introduce adapters between transformer blocks. Subsequently, we optimize this variant using the SSUL framework on the PASCAL VOC 15-1 benchmark. As illustrated in Table I, the ViT-B variant with adapters achieves 53.40% mIoU for novel classes, compared with 26.87% without adapters, indicating substantial improvement in generalization even when the feature extractor $f$ is frozen after base-class training. Surprisingly, additional adapters also benefit base classes, resulting in a roughly 0.6% mIoU improvement (from 80.81% to 81.40%). These findings solidly support our first hypothesis.

Secondly, instead of freezing the feature extractor $f$ after base-class training, fully fine-tuning $f$ with a distillation term in subsequent training stages yields benefits for novel classes. We adapt SSUL to enable continuous parameter updates for $f$ throughout training, optimizing the corresponding ViT-B with a linear head. Additionally, introducing feature distillation as a regularization term in this SSUL variant enhances anti-catastrophic forgetting. In the lower section of Table I, compared to ViT-B-based SSUL with a linear head, the fully fine-tuned counterpart achieves a significant ($\sim$15.4%) improvement in novel classes mIoU. With the added distillation loss, the mIoU for novel classes further increases to 62.67%, while the anti-catastrophic forgetting ability is maintained. These findings affirm our second hypothesis. Consequently, both modifications indicate that leveraging ViT-B with a simple linear head yields promising continual segmentation results, inspiring the design of our continual semantic segmentation framework, i.e., ConSept.

Refer to caption
Figure 2: Overview of our proposed ConSept. The pipeline is primarily grounded on SSUL [13], replacing its segmentation network with a vanilla ViT accompanied by a linear head. To fully harness the anti-catastrophic forgetting capability of ViT and enhance the generalization performance in continual segmentation scenarios, we integrate adapters into ViTs, resulting in a dual-path feature extractor with a full fine-tuning learning paradigm, which is the key element of ConSept. Additionally, ConSept employs feature distillation with a frozen old-class linear head to enhance its anti-catastrophic forgetting ability and incorporates dual dice losses to regularize the segmentation maps for overall segmentation performance.

IV Proposed Method

Building upon the insights gained in Sec. III-C, we propose to utilize the ImageNet-pretrained vanilla ViT feature extractor and a straightforward linear segmentation decoder to attain robust anti-catastrophic forgetting capability and generalization performance to novel classes within an appropriate training framework. In this section, we present our perspectives on training such a ViT-based continual semantic segmentation network and introduce the proposed solution, namely ConSept.

IV-A Overview of ConSept

Fig. 2 delineates the comprehensive pipeline of our proposed ConSept. In the training phase of step t(t>1)𝑡𝑡1t(t>1)italic_t ( italic_t > 1 ), we consider an input image 𝐈𝐈\mathbf{I}bold_I with the corresponding ground-truth segmentation mask Stsubscript𝑆𝑡S_{t}italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT randomly sampled from 𝒟tsubscript𝒟𝑡\mathcal{D}_{t}caligraphic_D start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. The mask annotation 𝐌^^𝐌\mathbf{\hat{M}}over^ start_ARG bold_M end_ARG exclusively encompasses classes in 𝒞tsubscript𝒞𝑡\mathcal{C}_{t}caligraphic_C start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. The input image 𝐈𝐈\mathbf{I}bold_I undergoes processing through the feature extractor ft1subscript𝑓𝑡1f_{t-1}italic_f start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT optimized in the preceding t1𝑡1t-1italic_t - 1 learning steps. Here, ft1subscript𝑓𝑡1f_{t-1}italic_f start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT encompasses both a vision transformer [26] and the associated ViT-Adapter [29], yielding multi-scale features 𝐅t1={𝐅t10,,𝐅t13}subscript𝐅𝑡1subscriptsuperscript𝐅0𝑡1subscriptsuperscript𝐅3𝑡1\mathbf{F}_{t-1}=\{\mathbf{F}^{0}_{t-1},...,\mathbf{F}^{3}_{t-1}\}bold_F start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT = { bold_F start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT , … , bold_F start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT }. For simplicity, we denote by 𝐅t10subscriptsuperscript𝐅0𝑡1\mathbf{F}^{0}_{t-1}bold_F start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT the feature from the shallow layers of ft1subscript𝑓𝑡1f_{t-1}italic_f start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT with the largest spatial resolution, and 𝐅t13subscriptsuperscript𝐅3𝑡1\mathbf{F}^{3}_{t-1}bold_F start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT conversely. 
Following the approach of [6], all features in 𝐅t1subscript𝐅𝑡1\mathbf{F}_{t-1}bold_F start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT are interpolated to a consistent spatial resolution and concatenated into a fused feature 𝐅t1fusesubscriptsuperscript𝐅fuse𝑡1\mathbf{F}^{\text{fuse}}_{t-1}bold_F start_POSTSUPERSCRIPT fuse end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT. Subsequently, a linear projection layer ht1subscript𝑡1h_{t-1}italic_h start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT maps 𝐅t1fusesubscriptsuperscript𝐅fuse𝑡1\mathbf{F}^{\text{fuse}}_{t-1}bold_F start_POSTSUPERSCRIPT fuse end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT to 𝐅^t1fusesubscriptsuperscript^𝐅fuse𝑡1\mathbf{\hat{F}}^{\text{fuse}}_{t-1}over^ start_ARG bold_F end_ARG start_POSTSUPERSCRIPT fuse end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT. Finally, the linear segmentation head g1:t1subscript𝑔:1𝑡1g_{1:t-1}italic_g start_POSTSUBSCRIPT 1 : italic_t - 1 end_POSTSUBSCRIPT of old classes is employed to predict segmentation masks 𝐌t1subscript𝐌𝑡1\mathbf{M}_{t-1}bold_M start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT for classes in 𝒞1:t1subscript𝒞:1𝑡1\mathcal{C}_{1:t-1}caligraphic_C start_POSTSUBSCRIPT 1 : italic_t - 1 end_POSTSUBSCRIPT. With the same architecture, the input image 𝐈𝐈\mathbf{I}bold_I is also processed through the t𝑡titalic_t-step feature extractor ftsubscript𝑓𝑡f_{t}italic_f start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT with the corresponding projection layer htsubscript𝑡h_{t}italic_h start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, yielding multi-scale features 𝐅tsubscript𝐅𝑡\mathbf{F}_{t}bold_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and projected fused features 𝐅^tfusesubscriptsuperscript^𝐅fuse𝑡\mathbf{\hat{F}}^{\text{fuse}}_{t}over^ start_ARG bold_F end_ARG start_POSTSUPERSCRIPT fuse end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. 
Then, to predict the segmentation masks for old and new categories, both the linear head of old classes $g_{1:t-1}$ and its counterpart for new classes $g_t$ are applied as follows:

$$\mathbf{M}_{t} = [g_{1:t-1}(\hat{\mathbf{F}}^{\text{fuse}}_{t}),\ g_{t}(\hat{\mathbf{F}}^{\text{fuse}}_{t})], \tag{1}$$

where $[\dots]$ denotes the concatenation operation.
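As an illustration of Eq. (1), the following NumPy sketch applies an old-class linear head and a new-class linear head to the same projected feature and concatenates their outputs along the class axis. The tensor shapes, the random weights, and the helper name `predict_logits` are assumptions made for this example and are not taken from the ConSept code base.

```python
import numpy as np

def predict_logits(feat, w_old, b_old, w_new, b_new):
    """Concatenate old- and new-class per-pixel scores (cf. Eq. 1).

    feat:  (N, D) projected fused feature, one row per pixel.
    w_old: (D, C_old) weights of the old-class linear head g_{1:t-1}.
    w_new: (D, C_new) weights of the new-class linear head g_t.
    """
    old_logits = feat @ w_old + b_old      # (N, C_old)
    new_logits = feat @ w_new + b_new      # (N, C_new)
    return np.concatenate([old_logits, new_logits], axis=1)

rng = np.random.default_rng(0)
n_pixels, dim, c_old, c_new = 6, 8, 16, 1  # hypothetical sizes
feat = rng.normal(size=(n_pixels, dim))
w_old, b_old = rng.normal(size=(dim, c_old)), np.zeros(c_old)
w_new, b_new = rng.normal(size=(dim, c_new)), np.zeros(c_new)

logits = predict_logits(feat, w_old, b_old, w_new, b_new)
print(logits.shape)
```

The first `c_old` columns depend only on the old head, so freezing that head leaves the old-class decision boundary untouched while the new columns are learned.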

During optimization, we first map the predicted segmentation mask $\mathbf{M}_{t-1}$ into the pseudo ground-truth mask $\hat{\mathbf{S}}_{1:t-1}$ as follows:

$$\hat{\mathbf{S}}_{1:t-1} = \max_{\mathcal{C}_{1:t-1}} \sigma(\mathbf{M}_{t-1}). \tag{2}$$

Subsequently, the pseudo ground-truth mask $\hat{\mathbf{S}}_{1:t-1}$ of old classes is merged with the ground-truth mask $\mathbf{S}_t$ for new classes in $\mathcal{C}_t$ to form the final pseudo ground-truth mask $\hat{\mathbf{S}}_{1:t}$ covering classes in $\mathcal{C}_{1:t}$. Specifically, for each pixel $i$, the pseudo ground-truth label $\hat{\mathbf{S}}_{1:t,i}$ is defined as follows:

$$\hat{\mathbf{S}}_{1:t,i} = \begin{cases} \mathbf{S}_{t,i}, & \mathbf{S}_{t,i} \in \mathcal{C}_{t} \\ \hat{\mathbf{S}}_{1:t-1,i}, & \mathbf{S}_{t,i} \notin \mathcal{C}_{t} \end{cases} \tag{3}$$

Unlike [13, 14], ConSept does not introduce an “unknown” class, yet it still obtains robust and promising performance. Finally, the segmentation loss $\mathcal{L}_{\text{bce}}+\mathcal{L}_{\text{dual\_dice}}$ between $\mathbf{M}_t$ and $\hat{\mathbf{S}}_{1:t}$ is minimized together with the distillation loss $\mathcal{L}_{\text{mse}}+\mathcal{L}_{\text{contrast}}$ between $\mathbf{F}_t$ and $\mathbf{F}_{t-1}$.
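The pseudo-label construction of Eqs. (2) and (3) can be sketched as follows. This is a minimal NumPy illustration, assuming that $\sigma$ is a sigmoid and that the pseudo label of a pixel is the arg-max old class of the step $t-1$ model; the function name `merge_pseudo_labels` and the toy inputs are hypothetical.

```python
import numpy as np

def merge_pseudo_labels(gt, old_logits, new_classes):
    """Fuse new-class ground truth with old-model pseudo labels (cf. Eqs. 2-3).

    gt:          (H, W) int labels; only classes in `new_classes` are trusted.
    old_logits:  (C_old, H, W) scores of the step t-1 model for old classes.
    new_classes: iterable of class ids annotated at step t.
    """
    scores = 1.0 / (1.0 + np.exp(-old_logits))  # sigmoid, assumed for sigma
    pseudo_old = scores.argmax(axis=0)          # most confident old class per pixel
    is_new = np.isin(gt, list(new_classes))
    return np.where(is_new, gt, pseudo_old)

# Toy example: 3 old classes (ids 0-2), one new class (id 3), 2x2 image.
gt = np.array([[3, 0],
               [0, 3]])                         # pixels not labeled 3 are unannotated
old_logits = np.zeros((3, 2, 2))
old_logits[1, 0, 1] = 5.0                       # old model predicts class 1 at (0, 1)
old_logits[2, 1, 0] = 5.0                       # old model predicts class 2 at (1, 0)
merged = merge_pseudo_labels(gt, old_logits, new_classes=[3])
print(merged)
```

Pixels annotated with a new class keep their ground-truth label, and all other pixels inherit the old model's prediction, so no separate "unknown" label is needed.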

IV-B Better Generalization: Fine-tuning with Adapters

Given the notable progress of CNN-based visual adapters [27, 28, 29] in transfer learning, and in accordance with our first observation in Sec. III-C, one may hypothesize that introducing adapters or fine-tuning the feature extractor $f_t$ can enhance the generalization ability on novel classes, thereby improving the overall segmentation performance. Therefore, we propose to integrate lightweight adapters into ConSept. Specifically, ConSept comprises a shallow convolution-based stem block [62] and multiple lightweight cross-attention layers to construct adapters. As there is no interaction between old and new models, we omit the step number $t$ in this context for simplicity. To elaborate, we first use the stem block to extract multi-scale features and then flatten them, resulting in the initial adapter feature $\mathbf{x}^{0}_{\text{ada}}$:

$$\mathbf{x}^{0}_{\text{ada}} = \text{flatten}(\text{stem}(\mathbf{I})). \tag{4}$$

Then, for the $l$-th ViT-based feature $\mathbf{x}^{l}_{\text{vit}}$ generated by the $l$-th group of transformer blocks, we aggregate $\mathbf{x}^{l}_{\text{ada}}$ into $\mathbf{x}^{l}_{\text{vit}}$ using cross-attention [39]:

$$\mathbf{x}^{l}_{\text{vit}} = \mathbf{x}^{l}_{\text{vit}} + \text{Attn}(\text{norm}(\mathbf{x}^{l}_{\text{vit}}),\ \text{norm}(\mathbf{x}^{l}_{\text{ada}}),\ \text{norm}(\mathbf{x}^{l}_{\text{ada}})), \tag{5}$$

where $\text{norm}(\cdot)$ denotes LayerNorm [63], and $\text{Attn}(\mathbf{q},\mathbf{k},\mathbf{v})$ denotes cross-attention [39]. Subsequently, we refine the adapter feature using cross-attention with the $(l+1)$-th ViT-based feature $\mathbf{x}^{l+1}_{\text{vit}}$ followed by a feed-forward network, yielding $\mathbf{x}^{l+1}_{\text{ada}}$ as follows:

$$\begin{aligned} \hat{\mathbf{x}}^{l}_{\text{ada}} &= \mathbf{x}^{l}_{\text{ada}} + \text{Attn}(\text{norm}(\mathbf{x}^{l}_{\text{ada}}),\ \text{norm}(\mathbf{x}^{l+1}_{\text{vit}}),\ \text{norm}(\mathbf{x}^{l+1}_{\text{vit}})) \\ \mathbf{x}^{l+1}_{\text{ada}} &= \hat{\mathbf{x}}^{l}_{\text{ada}} + \text{FFN}(\text{norm}(\hat{\mathbf{x}}^{l}_{\text{ada}})) \end{aligned}, \tag{6}$$

where FFN denotes the feed-forward network. By aggregating features with $n$ individual adapters, we obtain the final adapter feature $\mathbf{x}^{n}_{\text{ada}}$, which is then split and reshaped to form $\mathbf{F}$. In ConSept, we set $n=4$ as the number of adapters, striking a balance between performance and additional parameters.
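To make the interaction of Eqs. (5) and (6) concrete, the sketch below runs the two cross-attention updates with plain NumPy. It is a simplified illustration under several assumptions: attention is single-head without learned projections, LayerNorm has no affine parameters, and the transformer blocks between adapters are replaced by a caller-supplied stand-in (`vit_blocks`). None of these names or simplifications come from the ConSept implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def cross_attn(q, k, v):
    """Single-head scaled dot-product attention without learned projections."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def ffn(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2            # two-layer MLP with ReLU

def adapter_step(x_vit, x_ada, vit_blocks, w1, w2):
    # Eq. 5: inject adapter tokens into the ViT tokens (queries come from the ViT).
    x_vit = x_vit + cross_attn(layer_norm(x_vit), layer_norm(x_ada), layer_norm(x_ada))
    # Stand-in for the next group of transformer blocks.
    x_vit_next = vit_blocks(x_vit)
    # Eq. 6: refine adapter tokens from the updated ViT tokens, then apply the FFN.
    kv = layer_norm(x_vit_next)
    x_hat = x_ada + cross_attn(layer_norm(x_ada), kv, kv)
    x_ada_next = x_hat + ffn(layer_norm(x_hat), w1, w2)
    return x_vit_next, x_ada_next

rng = np.random.default_rng(0)
dim, n_vit_tokens, n_ada_tokens, n_adapters = 16, 10, 21, 4
x_vit = rng.normal(size=(n_vit_tokens, dim))
x_ada = rng.normal(size=(n_ada_tokens, dim))       # plays the role of x_ada^0 (Eq. 4)
w1 = rng.normal(size=(dim, 4 * dim)) * 0.02
w2 = rng.normal(size=(4 * dim, dim)) * 0.02
for _ in range(n_adapters):
    x_vit, x_ada = adapter_step(x_vit, x_ada, lambda z: z, w1, w2)
print(x_vit.shape, x_ada.shape)
```

Both token sets keep their shapes across the $n=4$ interactions because every update is residual, which is what allows the adapter to be attached to a pretrained ViT without changing its layout.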

The incorporation of adapters into ConSept offers dual advantages. Firstly, the usage of lightweight and efficient modules (i.e., with fewer than 10% additional parameters) in the feature extractor significantly enhances the generalization ability of ViT in class-incremental scenarios, without adversely affecting the optimization procedure or the final performance. Secondly, our proposed method remains compatible with other techniques that employ more complicated segmentation decoders [14, 22, 24] or additional supervision [14, 24] to achieve superior performance.

Moreover, conventional approaches [13, 14] typically freeze the parameters of the feature extractor during training steps with $t>1$, updating only the parameters of the segmentation decoder. This strategy aims to preserve the feature representation and segmentation ability of base classes. In contrast to these methods, in the training phases for both base and novel classes, we concurrently update the parameters of the feature extractor $f_t$ and the segmentation decoder $g_t$. Our approach allows $f_t$ to accumulate both texture and semantic information related to the novel classes $\mathcal{C}_t$, enabling ConSept to correctly localize regions from $\mathcal{C}_t$ and avoid suboptimal segmentation performance on $\mathcal{C}_t$. The experimental results in Sec. V-D verify the impact of incorporating adapters for ViT and adopting full fine-tuning in ConSept.

In addition to the pivotal macro design of ConSept, we introduce two additional strategies in the subsequent sections to enhance the anti-catastrophic forgetting ability and overall segmentation performance of ConSept.

IV-C Better Anti-Forgetting: Distillation with Deterministic Old-classes Boundary

As shown in Table I, simple fine-tuning inevitably leads to forgetting in ConSept, particularly for $\mathcal{C}_{1:t-1}$. We address this issue by imposing a feature distillation constraint on $\mathbf{F}_t$ through our proposed dual feature distillation loss $\mathcal{L}_{\text{distill}}$, which has two components. Firstly, to ensure consistency between the features $\mathbf{F}_t$ from the $t$-step network and the corresponding old features $\mathbf{F}_{t-1}$, we employ the dense mean-square loss $\mathcal{L}_{\text{mse}}$ as the primary distillation loss. Specifically, over the features $\mathbf{F}^{i}_{t}$, it is defined as follows:

$$\mathcal{L}_{\text{mse}} = \sum_{i} (\mathbf{F}^{i}_{t} - \mathbf{F}^{i}_{t-1})^{2}. \tag{7}$$

Secondly, the vanilla $\mathcal{L}_{\text{mse}}$ may incorrectly force features in regions of novel classes to stay the same as in $\mathbf{F}_{t-1}$, potentially limiting the performance on $\mathcal{C}_t$. To enhance the discriminative ability of features from different regions, we introduce the contrastive distillation loss $\mathcal{L}_{\text{contrast}}$ between $\mathbf{F}_t$ and $\mathbf{F}_{t-1}$ as follows:

$$\mathcal{L}_{\text{contrast}} = \sum_{i}\sum_{j} -\log \frac{\langle \mathbf{F}^{i}_{t,j}, \mathbf{F}^{i}_{t-1,j} \rangle}{\sum_{k} \langle \mathbf{F}^{i}_{t,j}, \mathbf{F}^{i}_{t-1,k} \rangle}, \tag{8}$$

where $\langle\cdot,\cdot\rangle$ denotes cosine similarity, $i$ indexes the layer from which the features are taken, and $j$ and $k$ index pixels. Our proposed dual distillation loss $\mathcal{L}_{\text{distill}}$ is therefore formulated as follows:

$$\mathcal{L}_{\text{distill}} = \mathcal{L}_{\text{mse}} + \mathcal{L}_{\text{contrast}}. \tag{9}$$

During experiments, we apply $\mathcal{L}_{\text{distill}}$ exclusively to the feature $\mathbf{F}^{3}_{t}$ to prevent overfitting. The rationale and impact of this specific choice are discussed in Section V-D.
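A minimal NumPy sketch of the two distillation terms on a single feature level is given below. The MSE term averages the squared differences, and the contrastive term exponentiates the cosine similarities with a temperature `tau` (an InfoNCE-style softmax). Both choices are assumptions of this sketch that keep the logarithm well defined; Eq. (8) as written uses the raw similarities, and the function names are hypothetical.

```python
import numpy as np

def mse_distill(f_new, f_old):
    """Mean squared difference between new and old features (cf. Eq. 7)."""
    return float(np.mean((f_new - f_old) ** 2))

def contrastive_distill(f_new, f_old, tau=0.1):
    """Pixel-wise contrastive distillation (cf. Eq. 8), softmax variant.

    f_new, f_old: (N, D) features of N pixels from the step-t and step-(t-1) models.
    The positive for pixel j is the old feature at the same location j.
    """
    a = f_new / np.linalg.norm(f_new, axis=1, keepdims=True)
    b = f_old / np.linalg.norm(f_old, axis=1, keepdims=True)
    sim = a @ b.T / tau                             # (N, N) scaled cosine similarities
    sim = sim - sim.max(axis=1, keepdims=True)      # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
f_old = rng.normal(size=(32, 64))
f_same = f_old.copy()                               # no drift from the old model
f_drift = f_old + rng.normal(size=f_old.shape)      # features moved away
loss_same = mse_distill(f_same, f_old) + contrastive_distill(f_same, f_old)
loss_drift = mse_distill(f_drift, f_old) + contrastive_distill(f_drift, f_old)
print(loss_same < loss_drift)
```

The MSE term penalizes any deviation from the old feature, whereas the contrastive term only asks each pixel to stay closer to its own old feature than to the old features of other pixels.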

Moreover, we introduce anti-forgetting constraints into the lightweight segmentation decoder to further maintain performance on base classes. Assuming that ConSept has achieved optimal segmentation performance on $\mathcal{C}_{1:t-1}$ during step $t-1$, the learned linear segmentation head represents an ideal decision boundary for $\mathcal{C}_{1:t-1}$. We use this decision boundary to constrain $f_t$. Consequently, we fine-tune the old classes with only a frozen linear head. Specifically, during training on base classes, we update all ConSept parameters to ensure optimal segmentation ability. For $t>1$, we freeze the parameters of the linear head for old classes $g_{1:t-1}$ and keep the other parameters updated during training. The predictions $\mathbf{M}_t$ from ConSept in step $t$ then provide confidence scores for old classes that are constrained by a fixed segmentation boundary. This restriction ensures that the corresponding feature $\mathbf{F}_t$ exhibits little or no variation in the regions of base classes, thereby enhancing the anti-catastrophic forgetting ability. The effect of the distillation loss and the frozen head is examined in the experimental results in Section V-D. The following section discusses how to further enhance the continual segmentation performance of ConSept through additional regularization.
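The freezing scheme described above can be illustrated with a toy update rule. The sketch uses a plain SGD step as a stand-in for the optimizer, and the parameter names `backbone`, `head_old`, and `head_new` are hypothetical labels for $f_t$, $g_{1:t-1}$, and $g_t$.

```python
import numpy as np

def sgd_step(params, grads, lr, frozen=("head_old",)):
    """Plain SGD update that skips parameters listed in `frozen`."""
    return {
        name: value if name in frozen else value - lr * grads[name]
        for name, value in params.items()
    }

params = {
    "backbone": np.ones(3),
    "head_old": np.ones(3),   # g_{1:t-1}, kept fixed for t > 1
    "head_new": np.ones(3),   # g_t, trained at step t
}
grads = {name: np.full(3, 0.5) for name in params}
updated = sgd_step(params, grads, lr=0.1)
print(updated["head_old"], updated["head_new"])
```

Because the old head never moves, any change in old-class scores must come from the features, which is what lets the fixed boundary act as a constraint on the feature extractor.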

IV-D Better Regularization: Dual Dice Losses for Segmentation

We further enhance the training of ConSept in terms of segmentation loss. While the binary cross-entropy loss with fused pseudo ground-truth masks offers dense supervision for segmentation predictions, two issues remain in the context of continual segmentation tasks. First, the highly imbalanced distribution of pseudo or ground-truth masks between old and new classes constrains the performance on old classes. Second, the limited training data and iterations in continual learning steps hamper the discriminative ability for new classes. Therefore, additional regularization of the segmentation predictions is crucial for the efficacy of ConSept.

With this motivation, we propose dual dice losses $\mathcal{L}_{\text{dual\_dice}}$ for ConSept. $\mathcal{L}_{\text{dual\_dice}}$ comprises two components, namely the class-specific dice loss $\mathcal{L}_{\text{c-dice}}$ and the old-new dice loss $\mathcal{L}_{\text{on-dice}}$. Specifically, to ensure that old classes are well learned and catastrophic forgetting is reduced, we compute the multi-class dice loss [64, 30] between the predictions $\mathbf{M}_t$ in step $t$ and the corresponding pseudo ground-truths $\hat{\mathbf{S}}_{1:t}$, as shown below:

$$\mathcal{L}_{\text{c-dice}} = \frac{1}{|\mathcal{C}_{1:t}|} \sum_{c\in\mathcal{C}_{1:t}} \frac{2\sum_{j=1}^{N} \mathbf{M}[c]_{t,j}\, \hat{\mathbf{S}}[c]_{1:t,j}}{\sum_{j=1}^{N} \mathbf{M}[c]_{t,j}^{2} + \sum_{j=1}^{N} \hat{\mathbf{S}}[c]_{1:t,j}^{2}}, \tag{10}$$

where $[c]$ indicates the confidence score or one-hot ground-truth label for class $c$ in $\mathbf{M}_t$ or $\hat{\mathbf{S}}_{1:t}$, and $j$ and $N$ denote the pixel index and the total number of pixels, respectively.
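The per-class ratio in Eq. (10) can be computed as in the NumPy sketch below. The function names and the small `eps` stabilizer are assumptions of this example. The sketch also assumes that the quantity actually minimized is one minus the Dice coefficient, which is the common convention but is not spelled out in the equation.

```python
import numpy as np

def class_dice(probs, labels, num_classes, eps=1e-6):
    """Mean per-class Dice coefficient between scores and one-hot labels (cf. Eq. 10).

    probs:  (C, N) per-class confidence scores in [0, 1] for N pixels.
    labels: (N,) integer (pseudo) ground-truth labels.
    """
    one_hot = np.eye(num_classes)[labels].T         # (C, N)
    inter = 2.0 * (probs * one_hot).sum(axis=1)
    denom = (probs ** 2).sum(axis=1) + (one_hot ** 2).sum(axis=1)
    return float(np.mean(inter / (denom + eps)))

def class_dice_loss(probs, labels, num_classes):
    # Assumption: the minimized loss is one minus the Dice coefficient.
    return 1.0 - class_dice(probs, labels, num_classes)

labels = np.array([0, 1, 2, 1])
perfect = np.eye(3)[labels].T                       # scores equal to the one-hot labels
uniform = np.full((3, 4), 1.0 / 3.0)                # uninformative scores
loss_perfect = class_dice_loss(perfect, labels, 3)
loss_uniform = class_dice_loss(uniform, labels, 3)
print(loss_perfect, loss_uniform)
```

Each class contributes equally to the average regardless of how many pixels it covers, which is why this term counteracts the pixel imbalance between old and new classes.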

The old-new dice loss aims to enhance the discriminative ability for novel classes in step $t$. To achieve this goal, an intuitive approach is to binarize the multi-class predictions and the corresponding pseudo masks into two classes, i.e., an “old” class and a “new” class. Specifically, for $\mathbf{M}_t$, we aggregate the confidence scores of classes in $\mathcal{C}_{1:t-1}$ to obtain the confidence score of the “old” class and aggregate the confidence scores of classes in $\mathcal{C}_t$ to obtain the score of the “new” class. For $\hat{\mathbf{S}}_{1:t}$, we first convert it into the corresponding one-hot label formulation and obtain $\hat{\mathbf{S}}^{0\text{-}1}_{1:t}$; then, we binarize $\hat{\mathbf{S}}^{0\text{-}1}_{1:t}$ into old classes and new classes. The overall conversion procedure is summarized as:

$$\begin{aligned} \tilde{\mathbf{M}}_{t} &= \left[\sum(\mathbf{M}_{t}[\mathcal{C}_{1:t-1}]),\ \sum(\mathbf{M}_{t}[\mathcal{C}_{t}])\right] \\ \tilde{\mathbf{S}}_{1:t} &= \left[\max(\hat{\mathbf{S}}^{0\text{-}1}_{1:t}[\mathcal{C}_{1:t-1}]),\ \max(\hat{\mathbf{S}}^{0\text{-}1}_{1:t}[\mathcal{C}_{t}])\right] \end{aligned} \tag{11}$$

Finally, we apply the dice loss to $\tilde{\mathbf{M}}_t$ and $\tilde{\mathbf{S}}_{1:t}$ as follows:

$$\mathcal{L}_{\text{on-dice}} = \frac{2\sum_{j=1}^{N} \tilde{\mathbf{M}}_{t,j}\, \tilde{\mathbf{S}}_{1:t,j}}{\sum_{j=1}^{N} \tilde{\mathbf{M}}_{t,j}^{2} + \sum_{j=1}^{N} \tilde{\mathbf{S}}_{1:t,j}^{2}}, \tag{12}$$

where $j$ and $N$ denote the pixel index and the total number of pixels, respectively.
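The binarization of Eq. (11) and the ratio of Eq. (12) can be sketched in NumPy as follows. The sketch assumes that old classes occupy the first `num_old` rows of the score tensor and that the products in Eq. (12) are summed over both the "old" and "new" channels; the function names are hypothetical.

```python
import numpy as np

def binarize_old_new(probs, labels, num_old):
    """Collapse per-class scores and labels into "old" vs "new" (cf. Eq. 11).

    probs:  (C, N) per-class confidence scores; the first `num_old` rows are old classes.
    labels: (N,) integer pseudo ground-truth labels in [0, C).
    """
    m_tilde = np.stack([probs[:num_old].sum(axis=0), probs[num_old:].sum(axis=0)])
    one_hot = np.eye(probs.shape[0])[labels].T      # (C, N)
    s_tilde = np.stack([one_hot[:num_old].max(axis=0), one_hot[num_old:].max(axis=0)])
    return m_tilde, s_tilde

def old_new_dice(m_tilde, s_tilde, eps=1e-6):
    """Dice coefficient over the two-channel old/new maps (cf. Eq. 12)."""
    inter = 2.0 * (m_tilde * s_tilde).sum()
    denom = (m_tilde ** 2).sum() + (s_tilde ** 2).sum()
    return float(inter / (denom + eps))

labels = np.array([0, 1, 2, 2])                     # classes 0-1 are old, class 2 is new
probs = np.eye(3)[labels].T                         # a perfect prediction
m_tilde, s_tilde = binarize_old_new(probs, labels, num_old=2)
dice = old_new_dice(m_tilde, s_tilde)
print(s_tilde, dice)
```

Confusing two old classes with each other leaves this term unchanged; only mistakes across the old/new boundary are penalized, which matches the stated goal of sharpening the separation of novel classes.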

Finally, we summarize the training loss $\mathcal{L}_t$ ($1\leq t\leq T$) for optimizing ConSept. During the training of step 1 (i.e., training on base classes), only training data with annotations for base classes are involved in the optimization, and the distillation loss $\mathcal{L}_{\text{distill}}$ is not calculated. Only the segmentation loss (i.e., the binary cross-entropy loss $\mathcal{L}_{\text{bce}}$ and the class-specific dice loss [64] $\mathcal{L}_{\text{c-dice}}$) between the prediction $\mathbf{M}_1$ and the corresponding ground-truth mask $\mathbf{S}_1$ is considered:

$$\mathcal{L}_{t=1} = \mathcal{L}_{\text{bce}} + \mathcal{L}_{\text{c-dice}}. \tag{13}$$

For training at any step $t$ with $t>1$ (i.e., training on tasks with novel classes), both the distillation loss $\mathcal{L}_{\text{distill}}$ and the dual dice losses $\mathcal{L}_{\text{dual\_dice}}$ are considered. Therefore, the training objective at step $t$ ($t>1$) is expressed as follows:

$$\mathcal{L}_{t>1} = \mathcal{L}_{\text{bce}} + \mathcal{L}_{\text{c-dice}} + \mathcal{L}_{\text{on-dice}} + \mathcal{L}_{\text{mse}} + \mathcal{L}_{\text{contrast}}. \tag{14}$$
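The step-dependent objective of Eqs. (13) and (14) amounts to selecting which terms are active. The short sketch below makes that selection explicit; the function name, the dictionary keys, and the numeric values are hypothetical placeholders, and all terms are summed with unit weight as in the equations.

```python
def total_loss(step, terms):
    """Sum the loss terms active at a given continual-learning step (cf. Eqs. 13-14).

    step:  1 for base-class training, >1 for novel-class steps.
    terms: dict of already computed scalar losses keyed by name.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    active = ["bce", "c_dice"]
    if step > 1:
        # Old-new dice and both distillation terms need a previous model.
        active += ["on_dice", "mse", "contrast"]
    return sum(terms[name] for name in active)

terms = {"bce": 0.40, "c_dice": 0.30, "on_dice": 0.20, "mse": 0.05, "contrast": 0.10}
print(total_loss(1, terms))
print(total_loss(2, terms))
```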

V Experiments

V-A Datasets and Metrics

Datasets. Aligned with prior state-of-the-art continual semantic segmentation methods [14, 10, 22], we assess ConSept on the PASCAL VOC dataset [1] and ADE20K dataset [31]. The PASCAL VOC dataset comprises 10,582 training images and 1,449 validation images, featuring 20 foreground classes and 1 background class. The ADE20K dataset encompasses approximately 20,000 training images and 2,000 validation images, presenting a more challenging scenario with 150 categories compared to PASCAL VOC.

Protocols. As in prior research [14, 10, 22], on the PASCAL VOC dataset, we assess ConSept using 15-1, 15-5, and 19-1 tasks in both overlapped and disjoint settings. Task “X-Y” denotes using the initial $X$ categories as base classes and incorporating $Y$ novel classes for each new task. On the ADE20K dataset, we evaluate ConSept using 100-10, 100-50, and 50-50 tasks under the overlapped setting.
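As a small worked example of the “X-Y” naming, the sketch below enumerates the classes learned at each step. The helper `class_splits` is hypothetical, class ids are assumed to start at 1, and the handling of the background class is omitted.

```python
def class_splits(num_classes, task):
    """Return the class ids learned at each step of an "X-Y" task.

    num_classes: number of foreground classes (e.g. 20 for PASCAL VOC).
    task:        string such as "15-1", "15-5" or "19-1".
    """
    base, inc = (int(v) for v in task.split("-"))
    splits = [list(range(1, base + 1))]             # step 1: base classes
    for start in range(base + 1, num_classes + 1, inc):
        splits.append(list(range(start, min(start + inc, num_classes + 1))))
    return splits

for task in ("15-1", "15-5", "19-1"):
    print(task, len(class_splits(20, task)), "steps")
```

For PASCAL VOC this gives 6, 2, and 2 steps for the 15-1, 15-5, and 19-1 tasks, respectively; with 150 classes, the 100-10, 100-50, and 50-50 tasks give 6, 2, and 3 steps.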

Metrics. In alignment with prior methods [14, 10, 22], we employ the mean Intersection-over-Union (mIoU) as the primary metric for performance comparison with state-of-the-art methods.
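For reference, mIoU can be computed from a confusion matrix as in the NumPy sketch below. This is a generic illustration rather than the evaluation code used in the paper; in particular, it has no ignore-label handling and averages only over classes that appear in the prediction or the target.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union over classes present in prediction or target."""
    idx = target.ravel() * num_classes + pred.ravel()
    conf = np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(conf).astype(float)             # correctly labeled pixels per class
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    valid = union > 0
    return float((inter[valid] / union[valid]).mean())

target = np.array([[0, 0], [1, 2]])
pred = np.array([[0, 1], [1, 2]])
miou = mean_iou(pred, target, num_classes=3)
print(miou)
```

In this toy case the per-class IoUs are 1/2, 1/2, and 1, so the mean is 2/3.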

V-B Implementation Details

Building upon prior works in continual semantic segmentation for both CNNs [14, 10, 15, 17] and ViTs [25, 22], ConSept is initialized with ImageNet-pretrained ViT-B parameters [26, 47]. The newly added convolution layers are initialized from a Kaiming normal distribution [16], and linear layers from a truncated normal distribution [65, 26]. The input shape is $512\times 512$, and fundamental data augmentation techniques (i.e., random crop and resize, and random horizontal flipping) are used in training and evaluation. During training, we use the AdamW optimizer [66] with a weight decay of 0.01 to optimize ConSept. The base task is trained for 100 epochs following previous ViT-based segmentation methods [4, 5, 6], while we halve the number of training epochs for novel tasks to mitigate catastrophic forgetting. The initial learning rate is $2\times 10^{-5}$, and it is multiplied by 10 for the linear segmentation head. Following [3], we use a polynomial learning rate scheduler with a power of $0.9$ to adjust the learning rate during training. All code is implemented in PyTorch [67], and all experiments are conducted on two NVIDIA RTX A6000 GPUs.
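The learning-rate settings above can be sketched with the usual polynomial decay rule, lr = base_lr × (1 − step / total_steps)^power. The function name `poly_lr` and the choice of 1000 total steps are illustrative assumptions; the text does not state whether the schedule is stepped per iteration or per epoch.

```python
def poly_lr(base_lr, step, total_steps, power=0.9):
    """Polynomial decay: base_lr * (1 - step / total_steps) ** power."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    return base_lr * (1.0 - step / total_steps) ** power


BACKBONE_LR = 2e-5            # initial learning rate from the text
HEAD_LR = 10 * BACKBONE_LR    # linear segmentation head uses a 10x rate

TOTAL_STEPS = 1000            # hypothetical schedule length
for step in (0, 250, 500, 750, 1000):
    print(step,
          poly_lr(BACKBONE_LR, step, TOTAL_STEPS),
          poly_lr(HEAD_LR, step, TOTAL_STEPS))
```

Both parameter groups follow the same decay curve; only the starting value differs.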

TABLE II: Comparison of diverse continual semantic segmentation methods on PASCAL VOC benchmarks under the overlapped setting. Best results are bolded, and the second-best results are underlined. ConSept achieves competitive performance across all three tasks compared to state-of-the-art methods.

| Method | Backbone | Decoder | 15-1 (6 steps): 0-15 | 15-1: 16-20 | 15-1: all | 15-5 (2 steps): 0-15 | 15-5: 16-20 | 15-5: all | 19-1 (2 steps): 0-19 | 19-1: 20 | 19-1: all |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CNN-based Methods | | | | | | | | | | | |
| ILT [38] | ResNet-101 | DeepLab V3 | 8.75 | 7.99 | 8.56 | 67.48 | 39.23 | 60.45 | 67.75 | 10.88 | 65.05 |
| MiB [10] | ResNet-101 | DeepLab V3 | 34.22 | 13.50 | 29.29 | 76.37 | 49.97 | 70.08 | 71.43 | 23.59 | 69.15 |
| SDR [17] | ResNet-101 | DeepLab V3 | 44.70 | 21.80 | 39.20 | 75.40 | 52.60 | 69.90 | 69.10 | 32.60 | 67.40 |
| PLOP [15] | ResNet-101 | DeepLab V3 | 65.12 | 21.11 | 54.64 | 75.73 | 51.71 | 70.09 | 75.35 | 37.35 | 73.54 |
| RECALL [21] | ResNet-101 | DeepLab V3 | 65.70 | 47.80 | 62.70 | 66.60 | 50.90 | 64.00 | 67.90 | 53.50 | 68.40 |
| REMIND [68] | ResNet-101 | DeepLab V3 | 68.30 | 27.23 | 58.52 | 76.11 | 50.74 | 70.07 | 76.48 | 32.34 | 74.38 |
| SSUL [13] | ResNet-101 | DeepLab V3 | 78.40 | 49.00 | 71.40 | 78.40 | 55.80 | 73.00 | 77.80 | 49.80 | 76.50 |
| MicroSeg [14] | ResNet-101 | MicroSeg | 81.30 | 52.50 | 74.40 | 82.00 | 59.20 | 76.60 | 79.30 | 62.90 | 78.50 |
| Joint (CNN) Upper Bound | ResNet-101 | DeepLab V3 | 82.70 | 75.00 | 80.90 | 82.70 | 75.00 | 80.90 | 81.0 | 79.1 | 80.9 |
| ViT-based Methods | | | | | | | | | | | |
| MiB [10] | ViT-Base | DeepLab V3 | 72.55 | 23.14 | 61.73 | 78.62 | 63.10 | 75.62 | 79.91 | 47.70 | 79.10 |
| RBC [69] | ViT-Base | DeepLab V3 | 75.90 | 40.15 | 68.24 | 78.86 | 62.01 | 75.53 | 80.24 | 38.79 | 78.99 |
| MicroSeg [14] | Swin-Base | MicroSeg | 82.00 | 47.30 | 73.70 | 82.90 | 60.10 | 77.50 | 81.00 | 62.40 | 80.00 |
| CoinSeg [24] | Swin-Base | MicroSeg | 84.10 | 65.60 | 79.60 | 84.10 | 69.90 | 80.80 | 82.70 | 52.60 | 79.80 |
| Incrementer [22] | ViT-Base | Segmenter | 79.60 | 59.56 | 75.55 | 82.53 | 69.25 | 79.93 | 82.54 | 60.95 | 82.14 |
| ConSept (Ours) | ViT-Base | Linear | 84.53 | 66.06 | 80.13 | 84.94 | 73.69 | 82.26 | 84.03 | 72.02 | 83.46 |
| Joint (ViT) Upper Bound | ViT-Base | Linear | 85.66 | 82.69 | 84.95 | 85.66 | 82.69 | 84.95 | 84.97 | 84.63 | 84.95 |

TABLE III: Comparison of diverse continual semantic segmentation methods on PASCAL VOC benchmarks under the disjoint setting. Best results are bolded, and the second-best results are underlined. ConSept achieves competitive performance across all three tasks compared to state-of-the-art methods.

| Method | Backbone | Decoder | 15-1 (6 steps): 0-15 | 15-1: 16-20 | 15-1: all | 15-5 (2 steps): 0-15 | 15-5: 16-20 | 15-5: all | 19-1 (2 steps): 0-19 | 19-1: 20 | 19-1: all |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CNN-based Methods | | | | | | | | | | | |
| ILT [38] | ResNet-101 | DeepLab V3 | 3.70 | 5.70 | 4.20 | 63.20 | 39.50 | 57.30 | 69.10 | 16.40 | 66.40 |
| MiB [10] | ResNet-101 | DeepLab V3 | 46.20 | 12.90 | 37.90 | 71.80 | 43.30 | 64.70 | 69.60 | 25.60 | 67.40 |
| SDR [17] | ResNet-101 | DeepLab V3 | 59.20 | 12.90 | 48.10 | 73.50 | 47.30 | 67.20 | 69.90 | 37.30 | 68.40 |
| PLOP [15] | ResNet-101 | DeepLab V3 | 57.86 | 13.67 | 46.48 | 71.00 | 42.82 | 64.29 | 75.37 | 38.89 | 73.64 |
| RECALL [21] | ResNet-101 | DeepLab V3 | 66.00 | 44.90 | 62.10 | 66.30 | 49.80 | 63.50 | 65.20 | 50.10 | 65.80 |
| Joint (CNN) Upper Bound | ResNet-101 | DeepLab V3 | 82.70 | 75.00 | 80.90 | 82.70 | 75.00 | 80.90 | 81.0 | 79.1 | 80.9 |
| ViT-based Methods | | | | | | | | | | | |
| MiB [10] | ViT-Base | DeepLab V3 | 66.74 | 26.32 | 58.28 | 74.98 | 59.90 | 72.27 | 80.61 | 45.17 | 79.61 |
| RBC [69] | ViT-Base | DeepLab V3 | 69.03 | 28.37 | 60.54 | 77.70 | 59.06 | 74.05 | 80.94 | 42.05 | 79.68 |
| Incrementer [22] | ViT-Base | Segmenter | 81.42 | 57.05 | 76.25 | 81.59 | 62.17 | 77.60 | 82.39 | 64.18 | 82.15 |
| ConSept (Ours) | ViT-Base | Linear | 80.54 | 63.88 | 76.57 | 82.06 | 71.79 | 79.62 | 83.92 | 71.31 | 83.32 |
| Joint (ViT) Upper Bound | ViT-Base | Linear | 85.66 | 82.69 | 84.95 | 85.66 | 82.69 | 84.95 | 84.97 | 84.63 | 84.95 |

V-C Comparison with State-of-the-Art Methods

Results on PASCAL VOC. We evaluate the competing methods on the PASCAL VOC benchmarks [1] under the overlapped setting in Table II. ConSept achieves state-of-the-art mIoU across all classes. Notably, without introducing intricate segmentation decoders [5], ConSept outperforms Incrementer [22] by 4.58%, 2.33%, and 1.32% in all-classes mIoU for the 15-1, 15-5, and 19-1 tasks, respectively. Compared to CoinSeg [24], which leverages a superior pretrained feature extractor (i.e., ImageNet-pretrained [47] Swin-Base [70]) and region proposal knowledge from external foundation models, ConSept still surpasses it by 0.53%, 1.46%, and 3.66% in all-classes mIoU for the 15-1, 15-5, and 19-1 tasks, respectively. Furthermore, to assess our method’s anti-catastrophic forgetting capability, we use the architecture of ConSept and retrain an oracle model on all categories as an upper bound of mIoU, shown in the last row of Table II. ConSept exhibits approximately 1% forgetting on base classes in mIoU, substantially outperforming other methods while achieving superior mIoU on novel classes. These encouraging results affirm the effectiveness of ConSept.

ConSept also obtains competitive results in the more challenging disjoint setting on PASCAL VOC [1]. As shown in Table III, ConSept shows state-of-the-art performance across all three benchmarks. Specifically, ConSept achieves 79.62% and 83.32% mIoU for the 15-5 and 19-1 tasks, respectively, surpassing Incrementer [22] by 2.02% and 1.17%.

TABLE IV: Comparison of different continual semantic segmentation methods on ADE20K benchmarks under the overlapped setting. Best results are bolded, and the second-best results are underlined. ConSept obtains competitive performance against state-of-the-art methods on all three tasks.

| Method | Backbone | Decoder | 100-50 (2 steps): 1-100 | 100-50: 101-150 | 100-50: all | 50-50 (3 steps): 1-50 | 50-50: 51-150 | 50-50: all | 100-10 (6 steps): 1-100 | 100-10: 101-150 | 100-10: all |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CNN-based Methods | | | | | | | | | | | |
| ILT [38] | ResNet-101 | DeepLab V3 | 18.30 | 14.40 | 17.00 | 3.50 | 12.90 | 9.70 | 0.10 | 3.10 | 1.10 |
| MiB [10] | ResNet-101 | DeepLab V3 | 40.52 | 17.17 | 32.79 | 45.57 | 21.01 | 29.31 | 38.21 | 11.12 | 29.24 |
| SDR [17] | ResNet-101 | DeepLab V3 | 37.40 | 24.80 | 33.20 | 40.90 | 23.80 | 29.50 | 28.90 | 7.40 | 21.70 |
| PLOP [15] | ResNet-101 | DeepLab V3 | 41.66 | 15.42 | 32.97 | 47.75 | 21.60 | 30.43 | 39.42 | 13.63 | 30.88 |
| RBC [69] | ResNet-101 | DeepLab V3 | 42.90 | 21.49 | 35.81 | 49.59 | 26.32 | 34.18 | 39.01 | 21.67 | 33.27 |
| REMIND [68] | ResNet-101 | DeepLab V3 | 41.55 | 19.16 | 34.14 | 47.11 | 20.35 | 29.39 | 38.96 | 21.28 | 33.11 |
| SSUL [13] | ResNet-101 | DeepLab V3 | 42.80 | 17.50 | 34.40 | 49.10 | 20.10 | 29.80 | 42.90 | 17.70 | 34.50 |
| MicroSeg [14] | ResNet-101 | MicroSeg | 43.40 | 20.90 | 35.90 | 49.80 | 22.00 | 31.40 | 43.70 | 22.20 | 36.60 |
| Joint (CNN) Upper Bound | ResNet-101 | DeepLab V3 | 43.90 | 27.20 | 38.30 | 50.90 | 32.10 | 38.30 | 43.90 | 27.20 | 38.30 |
| ViT-based Methods | | | | | | | | | | | |
| MiB [10] | ViT-Base | DeepLab V3 | 46.40 | 34.95 | 42.58 | 52.21 | 35.56 | 41.11 | 42.95 | 30.80 | 38.90 |
| MicroSeg [14] | Swin-Base | MicroSeg | 41.10 | 24.10 | 35.40 | 49.80 | 23.90 | 32.50 | 41.00 | 22.60 | 34.80 |
| CoinSeg [24] | Swin-Base | MicroSeg | 41.60 | 26.70 | 36.60 | 49.00 | 28.90 | 35.60 | 42.10 | 24.50 | 36.20 |
| Incrementer [22] | ViT-Base | Segmenter | 49.42 | 35.62 | 44.82 | 56.15 | 37.81 | 43.92 | 48.47 | 34.62 | 43.85 |
| ConSept (Ours) | ViT-Base | Linear | 51.42 | 36.58 | 46.51 | 56.92 | 38.63 | 44.81 | 49.37 | 33.00 | 43.95 |
| Joint (ViT) Upper Bound | ViT-Base | Linear | 51.39 | 39.02 | 47.30 | 57.41 | 42.14 | 47.30 | 51.39 | 39.02 | 47.30 |
Figure 3: Visual comparison between previous state-of-the-art methods (i.e., SSUL [13], MicroSeg [14]) and ConSept on PASCAL VOC [1] 15-1 benchmark under the overlapped setting. Our method performs the best on both base and novel classes.

Results on ADE20K Benchmarks. We then compare ConSept with prior state-of-the-art methods on the more challenging ADE20K benchmarks, with the results shown in Table IV. ConSept is competitive with state-of-the-art methods. Specifically, ConSept attains 46.51% and 44.81% mIoU for the 100-50 and 50-50 tasks, respectively, surpassing the state-of-the-art ViT-based method Incrementer [22] by about 1.7% and 0.9%. For the 100-10 task, ConSept achieves a competitive 43.95% mIoU across all classes. Additionally, ConSept shows a smaller drop in base-class mIoU than Incrementer across all three tasks. These results demonstrate the effectiveness of ConSept in intricate and challenging scenarios.

TABLE V: Comparison of different continual semantic segmentation methods on the PASCAL VOC 10-1 task (11 steps) under the overlapped setting. Best results are bolded, and the second-best results are underlined. Values in parentheses are differences from the corresponding joint-training model. ConSept obtains competitive performance against state-of-the-art methods. Meanwhile, ConSept exhibits strong anti-catastrophic forgetting ability for both base classes and novel categories in different training steps.

| Method | 0-10 | 11-20 | all |
|---|---|---|---|
| Joint (CNN) | 82.1 | 79.6 | 80.9 |
| MiB [10] | 20.0 (-62.1%) | 20.1 (-59.5%) | 20.1 (-60.8%) |
| SDR [17] | 32.4 (-49.7%) | 17.1 (-62.5%) | 25.1 (-55.8%) |
| PLOP [15] | 44.0 (-38.1%) | 15.5 (-64.1%) | 30.5 (-50.4%) |
| RECALL [21] | 59.5 (-22.6%) | 46.7 (-32.9%) | 54.8 (-26.1%) |
| RCIL [71] | 55.4 (-26.7%) | 15.1 (-64.5%) | 34.3 (-46.6%) |
| SSUL (CNN) [13] | 74.0 (-8.1%) | 53.2 (-26.4%) | 64.1 (-16.8%) |
| MicroSeg (CNN) [14] | 77.2 (-4.9%) | 57.2 (-22.4%) | 67.7 (-13.2%) |
| Joint (Swin-B) | 82.4 | 83.0 | 82.7 |
| SSUL (Swin-B) [13] | 75.3 (-7.1%) | 54.1 (-28.9%) | 65.2 (-17.5%) |
| MicroSeg (Swin-B) [14] | 78.9 (-3.5%) | 59.2 (-23.8%) | 70.1 (-12.6%) |
| CoinSeg [24] | 81.3 (-1.1%) | 64.4 (-18.6%) | 73.7 (-9.0%) |
| Joint (ViT-B) | 84.4 | 85.5 | 84.9 |
| Incrementer [22] | 77.6 (-6.8%) | 60.3 (-25.2%) | 70.2 (-14.6%) |
| ConSept (Ours) | 77.5 (-6.9%) | 69.5 (-16.0%) | 73.7 (-11.2%) |

Qualitative Results Analysis. In addition to quantitative results, we visualize the segmentation maps to compare ConSept with other methods. On the PASCAL VOC 15-1 benchmark [1], we compare SSUL [13], MicroSeg [14] and ConSept, all of which are based on the pretrained ViT, under the overlapped setting. As illustrated in Fig. 3, ConSept accurately localizes regions of base classes while generating more precise segmentation masks for novel classes. Specifically, ConSept produces more accurate masks for the novel “train” class and correctly localizes the newly-added “potted plant” and “person” classes.

Furthermore, we explore whether ConSept can learn new concepts without forgetting its segmentation capability for old classes. Fig. 4 illustrates the per-step visualization results of ConSept on the PASCAL VOC 15-1 benchmark [1]. ConSept successfully learns the concepts of “sofa”, “train”, and “tv monitor” after the corresponding task learning, while preserving segmentation quality on base classes. Additionally, Fig. 5 depicts per-step visualization results of ConSept on the ADE20K 100-10 benchmark [31]. Even in more complex scenarios, ConSept maintains stable anti-catastrophic forgetting capability for old classes and strong generalization ability for novel classes.

Figure 4: Visualization of predictions from ConSept on PASCAL VOC 15-1 task with overlapped setting. ConSept exhibits stable anti-catastrophic forgetting ability for old classes and good generalization ability for novel classes.
Figure 5: Visualization of ConSept on ADE20K 100-10 task with overlapped setting. ConSept performs well on more challenging tasks.

Results on Longer Training Steps. To demonstrate the effectiveness of ConSept on tasks with more training steps, we evaluate it on the PASCAL VOC 10-1 task (i.e., 11 steps in total) under the overlapped setting. Because novel categories are introduced one at a time over this long sequence of steps, we evaluate both the performance and the anti-catastrophic forgetting ability on old and new categories. Table V displays the results. ConSept attains the best mIoU on novel classes and ties CoinSeg (Swin-B) for the best all-classes mIoU (73.7%), surpassing CoinSeg by 5.1% in mIoU for novel classes. Regarding the anti-catastrophic forgetting ability, compared to the corresponding joint training model, ConSept experiences drops of only 16.0% and 11.2% in mIoU for novel classes and all classes, respectively. Simultaneously, ConSept sustains a decrease of less than 7% in mIoU for base classes, indicating a favorable trade-off between anti-catastrophic forgetting ability for base and novel classes. These promising results demonstrate the strong performance of ConSept in continual semantic segmentation with many more training steps.
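The parenthesized gaps in Table V are plain differences between a continual model and its joint-training counterpart. A minimal sketch, using the ConSept and Joint (ViT-B) rows of Table V and a hypothetical helper `drop_vs_joint`:

```python
def drop_vs_joint(continual, joint):
    """Difference in mIoU (percentage points) relative to joint training."""
    return {k: round(continual[k] - joint[k], 1) for k in continual}


# Values copied from Table V (PASCAL VOC 10-1, overlapped setting).
joint_vit_b = {"base": 84.4, "novel": 85.5, "all": 84.9}
consept = {"base": 77.5, "novel": 69.5, "all": 73.7}

drops = drop_vs_joint(consept, joint_vit_b)
print(drops)
```

The same helper can be applied to any other row, as long as it is paired with the joint model that shares its backbone.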

V-D Ablation Studies and Analysis

To examine each component of ConSept, we perform ablation studies and analyses on the PASCAL VOC 15-1 benchmark under the overlapped setting in this section.

Ablation Study. We systematically examine the impact of each essential component of ConSept in Table VI. Type (a) is our baseline, which employs ViT-B with a linear segmentation head as the continual segmenter and is optimized via the SSUL [13] framework. The baseline achieves 80.81% base classes mIoU and 67.97% overall mIoU, but reaches only 26.87% novel classes mIoU. Introducing adapters to the vision transformer in type (b) yields a substantial improvement of 26.51% in novel classes mIoU and 6.79% in overall mIoU. Type (c) further applies full fine-tuning with a frozen old-classes linear head and regularization through the distillation loss, leading to an additional $\sim$11% improvement in novel classes mIoU while maintaining similar segmentation performance on base classes. These results suggest that ConSept benefits from fine-tuning with distillation to enhance generalization ability on new classes, while preserving sufficient anti-catastrophic forgetting capability on old classes. Finally, introducing the dual dice losses $\mathcal{L}_{\text{dual\_dice}}$ in the full ConSept, type (e), yields an overall mIoU of 80.13%. Notably, compared to the full ConSept, the type (d) variant without $\mathcal{L}_{\text{distill}}$ degrades to 77.65% overall mIoU, with performance losses of $\sim$2.7% and $\sim$1.7% in base and novel classes mIoU, respectively. In conclusion, these evaluation results highlight the effectiveness of our proposed components.

TABLE VI: Ablation study of each key component in ConSept on VOC 15-1 (6 steps), where “Linear” means using linear segmentation head. Values in parentheses are differences from type (a).

| Type | Linear | Adapter | Tuning | $\mathcal{L}_{\text{distill}}$ | $\mathcal{L}_{\text{dual-dice}}$ | 0-15 | 16-20 | all |
|---|---|---|---|---|---|---|---|---|
| (a) | ✓ | - | - | - | - | 80.81 | 26.87 | 67.97 |
| (b) | ✓ | ✓ | - | - | - | 81.45 (+0.64%) | 53.38 (+26.51%) | 74.76 (+6.79%) |
| (c) | ✓ | ✓ | ✓ | ✓ | - | 80.31 (-0.50%) | 64.54 (+37.67%) | 76.55 (+8.58%) |
| (d) | ✓ | ✓ | ✓ | - | ✓ | 81.81 (+1.00%) | 64.35 (+37.48%) | 77.65 (+9.68%) |
| (e) | ✓ | ✓ | ✓ | ✓ | ✓ | 84.53 (+3.72%) | 66.06 (+39.19%) | 80.13 (+12.16%) |

TABLE VII: Ablation study on distillation loss in ConSept (15-1, 6 steps).

| $\mathcal{L}_{\text{mse}}$ | $\mathcal{L}_{\text{contrast}}$ | 0-15 | 16-20 | all |
|---|---|---|---|---|
| - | - | 81.81 | 64.35 | 77.65 |
| ✓ | - | 84.35 | 63.77 | 79.45 |
| ✓ | ✓ | 84.53 | 66.06 | 80.13 |

Effect of Distillation Loss. Subsequently, we investigate the impact of individual components in the distillation loss employed in ConSept. As depicted in Table VII, ConSept achieves only 81.81% in terms of base classes mIoU when $\mathcal{L}_{\text{mse}}$ and $\mathcal{L}_{\text{contrast}}$ are excluded. Upon introducing $\mathcal{L}_{\text{mse}}$, the performance improves to 84.35% and 79.45% in terms of mIoU on base classes and all classes, respectively. Finally, the inclusion of the auxiliary loss $\mathcal{L}_{\text{contrast}}$ results in an additional improvement of $\sim$2% in novel classes mIoU. These findings suggest that introducing $\mathcal{L}_{\text{mse}}$ as the primary distillation loss contributes to the anti-catastrophic forgetting ability of base classes, while introducing the auxiliary loss $\mathcal{L}_{\text{contrast}}$ enhances the generalization capability for novel classes.

Distillation Strategy in ConSept. Given the observed improvement in performance resulting from the introduction of multi-scale features for segmentation, we explore whether the distillation loss should be applied to features at all scales. To investigate the impact of candidate layers for distillation, we conduct an ablation study on ConSept with varying selections of distilled features; the quantitative results are presented in Table VIII. Applying the distillation loss solely to the feature $\mathbf{F}^{3}_{t}$ from the deepest layers yields the best performance for both base and novel classes. When the distillation loss is extended to features from shallow layers (i.e., $\mathbf{F}^{0-2}_{t}$), ConSept maintains similar mIoU on base classes, with only a marginal decrease of about 0.5%. However, performance on novel classes degrades notably, by $\sim$2%. These results suggest that applying the distillation loss exclusively to $\mathbf{F}^{3}_{t}$ is sufficient for ConSept. A plausible explanation is that the network must adapt to the texture and luminance information of novel classes in newly added data, which requires allowing features in regions associated with novel classes to change. Excessive distillation on $\mathbf{F}^{0-2}_{t}$ may therefore impede the generalization ability for novel classes.

TABLE VIII: Ablation study on distilled feature selection in ConSept (15-1, 6 steps).

| Distilled Features | 0-15 | 16-20 | all |
|---|---|---|---|
| $\mathbf{F}^{3}_{t}$ | 84.53 | 66.06 | 80.13 |
| $\mathbf{F}^{2-3}_{t}$ | 84.53 | 63.99 | 79.64 |
| $\mathbf{F}^{0-3}_{t}$ | 83.99 | 63.84 | 79.19 |

TABLE IX: Ablation study on deterministic boundary in ConSept (15-1, 6 steps).

| Frozen Head | 0-15 | 16-20 | all |
|---|---|---|---|
| - | 81.79 | 66.34 | 78.11 |
| ✓ | 84.53 | 66.06 | 80.13 |

Deterministic Boundary or Full Fine-tuning. Additionally, we seek to understand the impact of the deterministic boundary on old classes. The frozen linear head for old classes intuitively ensures a well-defined decision boundary among classes, thereby preserving the segmentation capability of base classes and mitigating forgetting. To assess this effect, we compare the segmentation quality of ConSept with a frozen linear segmentation head for old classes against its fully fine-tuned counterpart, as outlined in Table IX. Freezing the linear head for old classes results in a 2.74% improvement in base classes mIoU and a 2.02% improvement in overall mIoU for ConSept. Notably, the mIoU of novel classes remains nearly unchanged (66.34% vs. 66.06%). These findings suggest that freezing the linear head for old classes improves the anti-catastrophic forgetting capability for base classes, and consequently enhances the overall segmentation performance.
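One way to picture the frozen old-classes head is as a parameter update that skips the classifier rows belonging to old classes. The sketch below uses a plain gradient step on nested lists purely for illustration; the function name `update_head`, the row-per-class weight layout, and the use of vanilla gradient descent instead of AdamW are all assumptions, not the authors' implementation.

```python
def update_head(weight, grad, lr, num_old):
    """One gradient step on a linear head that leaves old-class rows fixed.

    weight, grad: lists of rows, one row of weights per class (assumed layout).
    num_old: number of leading rows (old classes) that stay frozen.
    """
    updated = []
    for cls, (w_row, g_row) in enumerate(zip(weight, grad)):
        if cls < num_old:
            updated.append(list(w_row))  # frozen: copied unchanged
        else:
            updated.append([w - lr * g for w, g in zip(w_row, g_row)])
    return updated


head = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]   # two old classes, one novel class
grads = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
new_head = update_head(head, grads, lr=0.5, num_old=2)
```

Only the last row changes, so the decision boundary among old classes defined by the first two rows is kept as it was.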

TABLE X: Ablation study on dual dice losses in ConSept (15-1, 6 steps).

| $\mathcal{L}_{\text{c-dice}}$ | $\mathcal{L}_{\text{on-dice}}$ | 0-15 | 16-20 | all |
|---|---|---|---|---|
| - | - | 80.31 | 64.54 | 76.55 |
| ✓ | - | 81.86 | 64.71 | 77.76 |
| ✓ | ✓ | 84.53 | 66.06 | 80.13 |

Effect of Dice Losses. Finally, we explore the impact of our proposed dual dice losses. The evaluation results are presented in Table X. Without the class-specific dice loss $\mathcal{L}_{\text{c-dice}}$ and the old-new dice loss $\mathcal{L}_{\text{on-dice}}$, ConSept achieves a base classes mIoU of 80.31% and an overall mIoU of 76.55%. Introducing $\mathcal{L}_{\text{c-dice}}$ results in a 1.55% improvement in base classes mIoU with little impact on novel classes’ performance. Further inclusion of $\mathcal{L}_{\text{on-dice}}$ leads to a notable boost of 2.37% in overall mIoU and a 1.35% improvement in novel classes mIoU. These findings highlight that $\mathcal{L}_{\text{c-dice}}$ enhances segmentation quality for base classes during new-task training, while $\mathcal{L}_{\text{on-dice}}$ further improves the segmentation ability for novel classes.
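For readers unfamiliar with dice-style objectives, a generic soft Dice loss between a predicted probability map and a binary mask can be sketched as below. This is the standard formulation only; how the paper's class-specific and old-new variants select and group masks is defined in its method section and is not reproduced here. The function name and the smoothing constant `eps=1.0` are assumptions.

```python
def soft_dice_loss(probs, mask, eps=1.0):
    """1 - soft Dice coefficient for flat probability and binary-mask lists."""
    inter = sum(p * m for p, m in zip(probs, mask))
    denom = sum(probs) + sum(mask)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


mask = [1.0, 0.0, 1.0, 0.0]
perfect = soft_dice_loss([1.0, 0.0, 1.0, 0.0], mask)   # prediction equals mask
inverted = soft_dice_loss([0.0, 1.0, 0.0, 1.0], mask)  # prediction has no overlap
```

Because the loss depends on region overlap rather than per-pixel error, it is less sensitive to class imbalance than a purely pixel-wise term.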

VI Conclusion

In this paper, we investigated continual semantic segmentation with vision transformers. We empirically found that vanilla ViT inherently exhibits viable anti-catastrophic forgetting ability for base classes, and that adapters can address its remaining weakness on novel classes without extra negative effects. Therefore, we proposed, to the best of our knowledge for the first time, an adapter-based vision transformer for continual semantic segmentation tasks, namely ConSept. Specifically, ConSept inserted lightweight attention-based adapters into a pretrained ViT and adopted a dual-path architecture for segmentation. With less than 10% additional parameters, ConSept obtained better segmentation ability for old classes and achieved promising segmentation quality on novel classes. To further exploit the anti-catastrophic forgetting ability of ConSept, we proposed a distillation method with a deterministic old-classes boundary, and dual dice losses that regularize segmentation maps to improve overall segmentation performance. Empirical quantitative and qualitative results illustrated that ConSept obtained new state-of-the-art performance on various continual semantic segmentation tasks. Meanwhile, ConSept demonstrated promising anti-catastrophic forgetting capability for both old classes and novel classes.

References

  • [1] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, pp. 303–338, 2010.
  • [2] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision.   Springer, 2014.
  • [3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, 2017.
  • [4] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmentation,” 2022.
  • [5] R. Strudel, R. Garcia, I. Laptev, and C. Schmid, “Segmenter: Transformer for semantic segmentation,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 7262–7272.
  • [6] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “Segformer: Simple and efficient design for semantic segmentation with transformers,” Advances in Neural Information Processing Systems, vol. 34, pp. 12 077–12 090, 2021.
  • [7] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, “Unified perceptual parsing for scene understanding,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 418–434.
  • [8] R. M. French, “Catastrophic forgetting in connectionist networks,” Trends in cognitive sciences, vol. 3, no. 4, pp. 128–135, 1999.
  • [9] Z. Li and D. Hoiem, “Learning without forgetting,” in European Conference on Computer Vision.   Springer, 2016, pp. 614–629.
  • [10] F. Cermelli, M. Mancini, S. R. Bulo, E. Ricci, and B. Caputo, “Modeling the background for incremental learning in semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9233–9242.
  • [11] Z. Wang, Z. Zhang, C.-Y. Lee, H. Zhang, R. Sun, X. Ren, G. Su, V. Perot, J. Dy, and T. Pfister, “Learning to prompt for continual learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 139–149.
  • [12] Z. Wang, Z. Zhang, S. Ebrahimi, R. Sun, H. Zhang, C.-Y. Lee, X. Ren, G. Su, V. Perot, J. Dy et al., “Dualprompt: Complementary prompting for rehearsal-free continual learning,” European Conference on Computer Vision, 2022.
  • [13] S. Cha, Y. Yoo, T. Moon et al., “Ssul: Semantic segmentation with unknown label for exemplar-based class-incremental learning,” Advances in neural information processing systems, vol. 34, 2021.
  • [14] Z. Zhang, G. Gao, Z. Fang, J. Jiao, and Y. Wei, “Mining unseen classes via regional objectness: A simple baseline for incremental segmentation,” in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35.   Curran Associates, Inc., 2022, pp. 24 340–24 353.
  • [15] A. Douillard, Y. Chen, A. Dapogny, and M. Cord, “Plop: Learning without forgetting for continual semantic segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 4040–4050.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
  • [17] U. Michieli and P. Zanuttigh, “Continual semantic segmentation via repulsion-attraction of sparse and disentangled latent representations,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 1114–1124.
  • [18] Z. Lin, Z. Wang, and Y. Zhang, “Continual semantic segmentation via structure preserving and projected feature alignment,” in European Conference on Computer Vision.   Springer, 2022, pp. 345–361.
  • [19] L. Zhu, T. Chen, J. Yin, S. See, and J. Liu, “Continual semantic segmentation with automatic memory sample selection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3082–3092.
  • [20] S. Yan, J. Zhou, J. Xie, S. Zhang, and X. He, “An em framework for online incremental learning of semantic segmentation,” in Proceedings of the 29th ACM international conference on multimedia, 2021.
  • [21] A. Maracani, U. Michieli, M. Toldo, and P. Zanuttigh, “Recall: Replay-based continual learning in semantic segmentation,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021.
  • [22] C. Shang, H. Li, F. Meng, Q. Wu, H. Qiu, and L. Wang, “Incrementer: Transformer for class-incremental semantic segmentation with knowledge distillation focusing on old class,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
  • [23] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr, and L. Zhang, “Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,” in CVPR, 2021.
  • [24] Z. Zhang, G. Gao, J. Jiao, C. H. Liu, and Y. Wei, “Coinseg: Contrast inter-and intra-class representations for incremental segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 843–853.
  • [25] F. Cermelli, M. Cord, and A. Douillard, “Comformer: Continual learning in semantic and panoptic segmentation,” IEEE/CVF Computer Vision and Pattern Recognition Conference, 2023.
  • [26] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [27] X. Nie, B. Ni, J. Chang, G. Meng, C. Huo, Z. Zhang, S. Xiang, Q. Tian, and C. Pan, “Pro-tuning: Unified prompt tuning for vision tasks,” 2022.
  • [28] H. Chen, R. Tao, H. Zhang, Y. Wang, W. Ye, J. Wang, G. Hu, and M. Savvides, “Conv-adapter: Exploring parameter efficient transfer learning for convnets,” 2022.
  • [29] Z. Chen, Y. Duan, W. Wang, J. He, T. Lu, J. Dai, and Y. Qiao, “Vision transformer adapter for dense predictions,” in The Eleventh International Conference on Learning Representations, 2023.
  • [30] C. H. Sudre, W. Li, T. Vercauteren, S. Ourselin, and M. Jorge Cardoso, “Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: Third International Workshop, DLMIA 2017, and 7th International Workshop, ML-CDS 2017, Held in Conjunction with MICCAI 2017, Québec City, QC, Canada, September 14, Proceedings 3.   Springer, 2017.
  • [31] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ade20k dataset,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017.
  • [32] A. Douillard, M. Cord, C. Ollion, T. Robert, and E. Valle, “Podnet: Pooled outputs distillation for small-tasks incremental learning,” in 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16.   Springer, 2020, pp. 86–102.
  • [33] M. Kang, J. Park, and B. Han, “Class-incremental learning by knowledge distillation with adaptive feature consolidation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022.
  • [34] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert, “icarl: Incremental classifier and representation learning,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2017.
  • [35] K. Joseph, S. Khan, F. S. Khan, R. M. Anwer, and V. N. Balasubramanian, “Energy-based latent aligner for incremental learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7452–7461.
  • [36] X. Li, Y. Zhou, T. Wu, R. Socher, and C. Xiong, “Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting,” in International Conference on Machine Learning, 2019.
  • [37] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
  • [38] U. Michieli and P. Zanuttigh, “Incremental learning techniques for semantic segmentation,” in Proceedings of the IEEE/CVF international conference on computer vision workshops, 2019.
  • [39] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [40] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” in International conference on machine learning.   PMLR, 2021, pp. 10 347–10 357.
  • [41] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Computer Vision - ECCV 2020 - 16th European Conference, vol. 12346.   Springer, 2020.
  • [42] Y. Fang, B. Liao, X. Wang, J. Fang, J. Qi, R. Wu, J. Niu, and W. Liu, “You only look at one sequence: Rethinking transformer in vision through object detection,” Advances in Neural Information Processing Systems, vol. 34, 2021.
  • [43] B. Dong, P. Zhou, S. Yan, and W. Zuo, “Self-promoted supervision for few-shot transformer,” in European Conference on Computer Vision.   Springer, 2022, pp. 329–347.
  • [44] ——, “LPT: Long-tailed prompt tuning for image classification,” in The Eleventh International Conference on Learning Representations, 2023.
  • [45] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning.   PMLR, 2021, pp. 8748–8763.
  • [46] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr, and L. Zhang, “Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 6881–6890.
  • [47] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition.   IEEE, 2009, pp. 248–255.
  • [48] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in Proceedings of the International Conference on Computer Vision, 2021.
  • [49] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” arXiv preprint arXiv:2111.06377, 2021.
  • [50] M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. Belongie, B. Hariharan, and S.-N. Lim, “Visual prompt tuning,” in European Conference on Computer Vision (ECCV), 2022.
  • [51] Y. He, W. Liang, D. Zhao, H.-Y. Zhou, W. Ge, Y. Yu, and W. Zhang, “Attribute surrogates learning and spectral tokens pooling in transformers for few-shot learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9119–9129.
  • [52] W. Chen, C. Si, Z. Zhang, L. Wang, Z. Wang, and T. Tan, “Semantic prompt for few-shot image recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23 581–23 591.
  • [53] P. Yu, Y. Chen, Y. Jin, and Z. Liu, “Improving vision transformers for incremental learning,” 2021.
  • [54] J. S. Smith, L. Karlinsky, V. Gutta, P. Cascante-Bonilla, D. Kim, A. Arbelle, R. Panda, R. Feris, and Z. Kira, “CODA-Prompt: Continual decomposed attention-based prompting for rehearsal-free continual learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11 909–11 919.
  • [55] K. Jeeveswaran, P. Bhat, B. Zonooz, and E. Arani, “BiRT: Bio-inspired replay in vision transformers for continual learning,” in International Conference on Machine Learning, 2023.
  • [56] A. Mohamed, R. Grandhe, K. J. Joseph, S. Khan, and F. Khan, “D3Former: Debiased dual distilled transformer for incremental learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2023, pp. 2420–2429.
  • [57] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017.
  • [58] Y. Liu, E. Sangineto, W. Bi, N. Sebe, B. Lepri, and M. De Nadai, “Efficient training of visual transformers with small datasets,” in Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., 2021.
  • [59] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in International Conference on Learning Representations, 2022.
  • [60] J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, and G. Neubig, “Towards a unified view of parameter-efficient transfer learning,” in International Conference on Learning Representations, 2022.
  • [61] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer learning for NLP,” in Proceedings of the 36th International Conference on Machine Learning, vol. 97, 2019, pp. 2790–2799.
  • [62] T. Xiao, P. Dollar, M. Singh, E. Mintun, T. Darrell, and R. Girshick, “Early convolutions help transformers see better,” in Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., 2021.
  • [63] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
  • [64] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-Net: Fully convolutional neural networks for volumetric medical image segmentation,” in 2016 Fourth International Conference on 3D Vision (3DV).   IEEE, 2016.
  • [65] J. Burkardt, “The truncated normal distribution,” Department of Scientific Computing Website, Florida State University, vol. 1, p. 35, 2014.
  • [66] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2019.
  • [67] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32.   Curran Associates, Inc., 2019, pp. 8024–8035.
  • [68] M. H. Phan, S. L. Phung, L. Tran-Thanh, A. Bouzerdoum et al., “Class similarity weighted knowledge distillation for continual semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 866–16 875.
  • [69] H. Zhao, F. Yang, X. Fu, and X. Li, “RBC: Rectifying the biased context in continual semantic segmentation,” in European Conference on Computer Vision.   Springer, 2022, pp. 55–72.
  • [70] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 10 012–10 022.
  • [71] C.-B. Zhang, J.-W. Xiao, X. Liu, Y.-C. Chen, and M.-M. Cheng, “Representation compensation networks for continual semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7053–7064.