DPA: Dual Prototypes Alignment for Unsupervised Adaptation of Vision-Language Models
Abstract
Vision-language models (VLMs), e.g., CLIP, have shown remarkable potential in zero-shot image classification. However, adapting these models to new domains remains challenging, especially in unsupervised settings where labelled data is unavailable. Recent research has proposed pseudo-labelling approaches to adapt CLIP in an unsupervised manner using unlabelled target data. Nonetheless, these methods struggle due to noisy pseudo-labels resulting from the misalignment between CLIP’s visual and textual representations. This study introduces DPA, an unsupervised domain adaptation method for VLMs. DPA introduces the concept of dual prototypes, acting as distinct classifiers, along with the convex combination of their outputs, thereby leading to accurate pseudo-label construction. Next, it ranks pseudo-labels to facilitate robust self-training, particularly during early training. Finally, it addresses visual-textual misalignment by aligning textual prototypes with image prototypes to further improve the adaptation performance. Experiments on 13 downstream vision tasks demonstrate that DPA significantly outperforms zero-shot CLIP and the state-of-the-art unsupervised adaptation baselines.
1 Introduction
Context and background: Vision-language models (VLMs) [1, 2, 3, 4] have shown promising potential in zero-shot image classification. One notable example is CLIP [1]. Despite the impressive zero-shot capabilities of CLIP, its performance can be impacted by the discrepancy between the pretraining image-text pairs and the downstream task images [1, 5]. To address this limitation, several studies have attempted to enhance CLIP transfer performance on downstream tasks by leveraging limited labelled samples from the target domains [6, 7, 8, 9, 10]. However, in many practical applications, such as security and medical diagnostics, collecting labelled samples can be particularly challenging due to high labelling costs and data privacy concerns. In such contexts, unsupervised learning presents a promising alternative. Several contributions for adapting CLIP to a target domain using an unlabelled dataset have been introduced recently [11, 12, 13, 14, 15, 16]. A common approach involves leveraging CLIP to generate pseudo-labels, which are then used to fine-tune CLIP in an unsupervised manner. However, this method encounters significant challenges due to the noisy pseudo-labels generated by CLIP. Despite efforts by recent methods for the unsupervised adaptation of CLIP, a significant modality gap persists between the text and vision representations [16, 17], leading to inaccurate pseudo-labels causing confirmation bias during adaptation.
Motivation: Recent studies highlight a significant factor behind the performance issues of CLIP in unsupervised adaptation scenarios: the visual domain gap between the source images used to train CLIP and the target images, which typically arises when the target samples originate from an uncommon domain [16]. As shown in Figure 1-(a), the t-SNE projection of visual and textual embeddings on the EuroSAT dataset [18] reveals a significant misalignment, resulting in misclassifications. Contemporary methods (e.g., [16]) attempt to address this misalignment by learning a projection space and using label propagation [19] to improve pseudo-labels. However, they can be limited, as they are computationally expensive and/or show suboptimal performance in an inductive setting. Further, they struggle on datasets with a large number of classes, necessitating turning off label propagation and falling back on model predictions. Effectively addressing these challenges in an inductive setting is crucial for improving CLIP’s performance in unsupervised adaptation scenarios.
Our proposal: To address these challenges, we propose DPA, a novel unsupervised adaptation method for VLMs (Figure 2). DPA aims at addressing the domain gap between visual and textual representations in downstream tasks. DPA introduces the idea of dual prototypes, namely image and textual prototypes, which act as distinct classifiers whose outputs are fused via a convex combination to generate accurate pseudo-labels. There are two main reasons for introducing dual prototypes. Firstly, the misalignment between the image representation and its textual representation in zero-shot CLIP, due to domain shift, often leads to inaccurate pseudo-labels (see Figure 1-(a)). Secondly, image prototypes, which tend to be less affected by noise, as illustrated in Figure 1, are generally closer to the true image representation than textual prototypes. Additionally, we tackle the misalignment between visual and textual embeddings by aligning textual prototypes with image prototypes. This alignment process further enhances the performance of unsupervised CLIP adaptation.
Contributions: 1) We propose DPA, a novel unsupervised domain adaptation framework for VLMs which allows adapting these powerful models to new domains without requiring labelled data from the target distribution. 2) We introduce a novel approach for generating accurate pseudo-labels by leveraging two distinct prototypes and fusing their outputs via a convex combination. We also propose to rank pseudo-labels in the classification loss to mitigate their noise, especially during early training. Moreover, we propose to tackle the visual-textual misalignment by aligning textual prototypes with image prototypes. 3) Experiments on 13 downstream vision tasks demonstrate consistent and significant performance enhancements over zero-shot CLIP and the state-of-the-art baselines.
2 Related Work
Large-scale Vision-Language (VL) Models: Among other vision-language models [20, 21, 2, 1, 22, 3, 23], CLIP [1] stands out as a pioneering example, aligning visual and textual features via a contrastive objective on a vast web-crawled collection of image-text pairs. This alignment empowers CLIP to generalize to diverse downstream classification tasks effectively. Building on CLIP’s success, subsequent research has explored methods for efficiently transferring the pre-trained model to handle diverse downstream tasks with limited labelled target data. These approaches often leverage vision-specific adapters [24, 6, 25, 26], language-specific adapters [7, 8, 10], or both [9]. However, these techniques require a minimum number of labelled samples, posing challenges due to data scarcity, annotation expenses, privacy issues, and practical limitations. Recent advancements have seen the emergence of unsupervised adaptation techniques, focusing on tailoring CLIP to target tasks using unlabelled datasets [14, 13, 16, 12, 11, 15]. However, these methods still depend on additional supervision signals, such as fine-tuning classifiers with a large language model like GPT-3, as seen in LaFTer [12]. They also often require significant computational resources, as exemplified by MUST [11]. ReCLIP [16] tackles misaligned embeddings through source-free domain learning, utilizing a projection space to align these embeddings and employing pseudo-labels for self-training. Despite this progress, there remains significant potential for improving the performance of unsupervised adaptation techniques across various image classification benchmarks. In contrast, our work introduces a fully label-free approach for adapting VLMs to a target task by generating more accurate pseudo-labels grounded on image and text prototypes.
Prototype-based Learning: Prototype-based learning represents the data with a fixed set of distinctive prototypes and classifies test samples by comparing them directly with these prototypes. It has been extensively studied in various contexts, including few-shot classification [27, 28], unsupervised learning [29, 30] and supervised classification [31, 32]. Our method incorporates two types of prototypes, image prototypes and textual prototypes, and harnesses both to generate accurate pseudo-labels and effectively adapt CLIP to the target dataset.
Pseudo-labelling: The success of pseudo-labelling has been demonstrated in various domains, including vision [33, 34] and vision-language tasks [11, 12, 14, 16]. The pseudo-labelling pipeline of DPA leverages consistency regularization [35] to produce consistent predictions across different data augmentations, thereby enhancing the accuracy of pseudo-labels.
3 Methodology
This study addresses the task of unsupervised adaptation of vision-language models (i.e., CLIP) for image classification. Following are the settings for this task:
- Pre-trained CLIP consists of a visual encoder and a textual encoder.
- Target dataset consists of unlabelled images.
- Unique class names for the unlabelled target dataset.
We use a pre-trained CLIP to process the unlabelled target data. Initially, the source model assigns a pseudo-label to each unlabelled target image. During the adaptation phase, we employ two prototypes to generate pseudo-labels for the adaptation of CLIP to the target domain.
3.1 Zero-shot Visual Classification in CLIP
CLIP achieves impressive zero-shot performance on image classification tasks by learning a joint embedding space that aligns image and text representations. During pre-training, CLIP minimizes a symmetric contrastive loss between semantically similar image-text pairs sampled from a large-scale source dataset :
(1) |
where and , with both and . The parameter represents a learned temperature, while indicates the mini-batch size. For inference, given an unlabelled target image and a list of class names , we leverage the pre-trained textual encoder to generate text embeddings for each class. Specifically, a series of well-crafted prompts are fed into , producing a set of text embeddings for each class, denoted as where . The zero-shot prediction is then obtained as follows:
(2) |
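For reference, the two expressions described above can be sketched in a standard form. The notation is assumed for illustration and may differ from the authors': $f_v$ denotes the visual encoder, $z_i$ and $u_i$ the L2-normalized image and text embeddings of the $i$-th pair, $\tau$ the temperature, $B$ the mini-batch size, and $\bar{u}_c$ the text embedding of class $c$ among $C$ classes.

```latex
% Sketch of Equation 1: symmetric image-to-text and text-to-image contrastive loss
\mathcal{L}_{\mathrm{CLIP}} = -\frac{1}{2B} \sum_{i=1}^{B} \left[
  \log \frac{\exp(z_i^{\top} u_i / \tau)}{\sum_{j=1}^{B} \exp(z_i^{\top} u_j / \tau)}
  + \log \frac{\exp(u_i^{\top} z_i / \tau)}{\sum_{j=1}^{B} \exp(u_i^{\top} z_j / \tau)}
\right]

% Sketch of Equation 2: zero-shot prediction for a target image x
\hat{y} = \arg\max_{c \in \{1, \dots, C\}} \; z^{\top} \bar{u}_c,
\qquad z = \frac{f_v(x)}{\lVert f_v(x) \rVert}
```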
3.2 Proposed Framework (DPA)
Motivation: While zero-shot CLIP shows impressive performance across diverse domains, it still exhibits notable limitations compared to models adapted under supervised conditions due to the misalignment between image and corresponding textual representations [1]. As depicted in Figure 1-(a), this discrepancy is evident in CLIP when applied to uncommon domains. Such disparities can lead to inaccurate pseudo-labels, especially when utilizing zero-shot CLIP to generate pseudo-labels for the unsupervised adaptation [16]. Our objective is to adapt CLIP to the target domain using its unlabelled samples by bridging the gap between image representations and their textual representations, all without dependence on ground-truth labels. Simultaneously, we seek to achieve efficient parameter-based unsupervised adaptation for CLIP. Our objective relies on two key components: (1) ensuring the accuracy of the pseudo-labels generated for the unlabelled samples, and (2) aligning textual representations with their corresponding visual representations. To generate the pseudo-labels for adapting CLIP to the target domain, we introduce dual prototypes which act as two different classifiers and fuse their complementary outputs via convex combination. Furthermore, we propose visual-textual prototype alignment towards further improving the adaptation performance.
Textual Prototypes Construction: To create textual prototypes, we obtain the textual representation for each class in the target domain. This is achieved by incorporating each class name into a predefined prompt template. It is then processed by CLIP’s text encoder to produce a textual representation . We utilize distinct prompts for each class, resulting in corresponding textual representations , where is the dimensionality of the embedding space. The textual prototype for class is constructed by averaging these individual textual representations:
(3) |
where . Finally, we define the textual prototypes for the set of classes as . This approach enables us to capture a comprehensive and robust representation of class semantics by consolidating textual representations derived from multiple prompts. Finally, these textual prototypes act as the initialization for a parametric classifier that is updated during training alongside CLIP parameters.
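The averaging step described above can be sketched in a few lines of NumPy. This is only an illustration: the function names, the random stand-in for the text-encoder outputs, and the re-normalization after averaging are assumptions, not details confirmed by the paper.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def build_textual_prototypes(prompt_embeddings):
    """Average per-prompt text embeddings into one prototype per class.

    prompt_embeddings: array of shape (C, M, d), i.e. M prompt embeddings
    for each of C classes in a d-dimensional embedding space.
    Returns an array of shape (C, d).
    """
    emb = l2_normalize(np.asarray(prompt_embeddings, dtype=np.float64))
    prototypes = emb.mean(axis=1)      # average over the M prompts
    return l2_normalize(prototypes)    # assumption: re-normalize the mean


# Hypothetical stand-in for text-encoder outputs: 3 classes, 5 prompts, d = 8.
rng = np.random.default_rng(0)
fake_prompt_embeddings = rng.normal(size=(3, 5, 8))
text_prototypes = build_textual_prototypes(fake_prompt_embeddings)
print(text_prototypes.shape)  # (3, 8)
```

In the method described here, these prototypes would only be the initialization; they are subsequently updated as parameters during training.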
Image Prototypes Construction: To construct the image prototypes, CLIP’s image encoder processes a weakly-augmented unlabelled image and generates a feature representation , where . Without class labels, we utilize Equation 2 to generate pseudo-labels from zero-shot CLIP. To obtain the image prototype for a class , we average the feature representation of the unlabelled images assigned to class as follows:
(4) |
where . For the set of classes, we define the image prototypes as . The image prototypes function as a non-parametric classifier. Given that the pseudo-labels generated by zero-shot CLIP are initially noisy, our approach, inspired by previous works [36, 37, 38], employs a memory bank . This memory bank facilitates the incremental update of image prototypes in a non-parametric way throughout the training process. Specifically, stores the feature representations of all unlabelled target samples along with their corresponding pseudo-labels: . As the image prototypes are non-parametric, we iteratively refine them using the memory bank as follows: (1) We initially calculate the image prototypes using Equation 4, utilizing the image features and the pseudo-labels generated from zero-shot CLIP across the entire target dataset. (2) During training, we continuously update the memory bank by storing the image representations alongside their corresponding pseudo-labels. (3) At the end of each epoch, we update the prototypes using Equation 4, incorporating the image representations and their corresponding pseudo-labels from . (4) We repeat steps (2) and (3) until the completion of the training process. DPA maximizes the utilization of all available training samples to enhance the image prototypes. This iterative process ensures that each prototype continuously evolves, progressively assimilating relevant knowledge from individual training samples throughout the model training phase, while mitigating the error propagation commonly associated with pseudo-labels.
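Steps (1) to (4) above can be sketched as a small memory bank that is overwritten during training and averaged per class at the end of each epoch. The class and function names are hypothetical, and the handling of classes that currently receive no pseudo-labels (keeping a fallback prototype, or zeros) is an assumption that the text does not specify.

```python
import numpy as np


def compute_image_prototypes(features, pseudo_labels, num_classes, fallback=None):
    """Average the features assigned to each pseudo-label class.

    features: (N, d) image features; pseudo_labels: (N,) integer labels.
    Classes without any assigned sample keep `fallback[c]` if given,
    otherwise a zero vector (assumption).
    """
    features = np.asarray(features, dtype=np.float64)
    pseudo_labels = np.asarray(pseudo_labels)
    prototypes = np.zeros((num_classes, features.shape[1]))
    for c in range(num_classes):
        mask = pseudo_labels == c
        if mask.any():
            prototypes[c] = features[mask].mean(axis=0)
        elif fallback is not None:
            prototypes[c] = fallback[c]
    return prototypes


class MemoryBank:
    """Stores the latest feature and pseudo-label of every target sample."""

    def __init__(self, num_samples, dim):
        self.features = np.zeros((num_samples, dim))
        self.labels = np.full(num_samples, -1, dtype=np.int64)  # -1 = unseen

    def update(self, indices, features, pseudo_labels):
        # Step (2): overwrite the entries of the samples in the current batch.
        self.features[indices] = features
        self.labels[indices] = pseudo_labels

    def prototypes(self, num_classes, fallback=None):
        # Step (3): recompute the prototypes from everything stored so far.
        seen = self.labels >= 0
        return compute_image_prototypes(
            self.features[seen], self.labels[seen], num_classes, fallback
        )


# Toy example: four samples, two of which fall into each of classes 0 and 1.
bank = MemoryBank(num_samples=4, dim=2)
bank.update(
    np.array([0, 1, 2, 3]),
    np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]),
    np.array([0, 0, 1, 1]),
)
image_prototypes = bank.prototypes(num_classes=3)
print(image_prototypes)
```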
Pseudo-labels Generation: Both image and textual prototypes serve as distinct classifiers for the unlabelled target dataset. We generate a pseudo-label (PL) for the weakly-augmented unlabelled target image from these two prototypes. Initially, we determine the similarity between the image features and the textual prototypes as follows:
(5) |
Then, we find the similarity between the image features and the image prototypes as follows:
(6) |
Finally, we fuse and to form as follows:
(7) |
Following [39], we perform distribution alignment (DA) to prevent the model’s prediction from collapsing onto specific classes. Here, , where represents a running average of during training, and is a hyperparameter. Finally, the PL is obtained as . This process of generating pseudo-labels improves the accuracy of the PL during training, indicating that these PLs may initially be noisy (see Figure 3). Therefore, assigning uniform weights to all pseudo-labels could hinder the adaptation process. To address this challenge, we propose adjusting the classification loss weight for the pseudo-label based on its similarity to the textual and image prototypes, as follows:
(8) |
represents the cosine similarity between the image features and the corresponding prototypes derived from the PLs. Finally, we utilize the pseudo-label , obtained from a weakly-augmented image , as self-supervision for the strongly-augmented counterpart as follows:
(9) |
where represents the weight calculated from Equation 8, and represents the probabilistic output for the strongly-augmented image . To alleviate the confirmation bias induced by CLIP [40], we adopt “fairness” regularization as suggested in [11] as follows:
(10) |
where represents the model’s average prediction from the strongly augmented images across the batch. By incorporating this fairness regularization, we aim to encourage the model to make more uniform predictions across classes, reducing the tendency to overfit the PLs and promoting a more balanced adaptation to the target domain. Equations 9 and 10 are used to update both the image encoder and the textual prototypes.
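The pseudo-labelling pipeline described in this subsection (similarity to both sets of prototypes, convex fusion, distribution alignment, weighting, and the two losses) can be sketched as follows. Several details are assumptions made for illustration rather than the authors' exact formulation: the softmax temperature, the mixing coefficient `alpha`, the form of distribution alignment (dividing by a running mean and renormalizing), the weight (mean cosine similarity to the two prototypes of the pseudo-labelled class, clipped to [0, 1]), and the fairness term (negative mean log of the batch-averaged prediction).

```python
import numpy as np


def softmax(logits, axis=-1):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def fuse_predictions(feats, text_protos, image_protos, alpha=0.5, temperature=0.01):
    """Equations 5-7 (sketch): two prototype classifiers fused by convex combination."""
    f = l2_normalize(feats)
    p_text = softmax(f @ l2_normalize(text_protos).T / temperature)
    p_image = softmax(f @ l2_normalize(image_protos).T / temperature)
    return alpha * p_text + (1.0 - alpha) * p_image


def distribution_alignment(probs, running_mean, momentum=0.9):
    """Assumed DA form: divide by a running mean of predictions, then renormalize."""
    running_mean = momentum * running_mean + (1.0 - momentum) * probs.mean(axis=0)
    aligned = probs / running_mean
    return aligned / aligned.sum(axis=1, keepdims=True), running_mean


def pseudo_label_weights(feats, text_protos, image_protos, labels):
    """Equation 8 (assumed form): mean cosine similarity to the labelled class prototypes."""
    f = l2_normalize(feats)
    sim_text = np.sum(f * l2_normalize(text_protos)[labels], axis=1)
    sim_image = np.sum(f * l2_normalize(image_protos)[labels], axis=1)
    return np.clip(0.5 * (sim_text + sim_image), 0.0, 1.0)


def weighted_cross_entropy(strong_probs, labels, weights, eps=1e-12):
    """Equation 9 (sketch): weighted cross-entropy on the strongly-augmented view."""
    picked = strong_probs[np.arange(len(labels)), labels]
    return float(-(weights * np.log(picked + eps)).mean())


def fairness_regularizer(strong_probs, eps=1e-12):
    """Equation 10 (assumed form): push the batch-mean prediction towards uniform."""
    mean_pred = strong_probs.mean(axis=0)
    return float(-np.log(mean_pred + eps).mean())


# Toy batch with random stand-ins for encoder features and model outputs.
rng = np.random.default_rng(0)
C, d, B = 3, 4, 6
text_protos = rng.normal(size=(C, d))
image_protos = rng.normal(size=(C, d))
weak_feats = rng.normal(size=(B, d))
strong_probs = softmax(rng.normal(size=(B, C)))

fused = fuse_predictions(weak_feats, text_protos, image_protos)
aligned, running = distribution_alignment(fused, np.full(C, 1.0 / C))
labels = aligned.argmax(axis=1)
w = pseudo_label_weights(weak_feats, text_protos, image_protos, labels)
loss = weighted_cross_entropy(strong_probs, labels, w) + fairness_regularizer(strong_probs)
```

Setting `alpha=1.0` recovers pseudo-labels from the textual prototypes alone, which corresponds to the text-only setting examined in the ablations.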
Prototypes Alignment: During training, the textual prototypes undergo continual updates to align with the associated image representation, leveraging pseudo-labels through Equations 9 and 10. Conversely, the image prototypes are refined by averaging image features, as detailed in Section 3.2. We exploit the image prototypes to provide additional supervision for updating the textual prototypes beyond pseudo-labels alone. These image prototypes, less influenced by erroneous pseudo-labels and closer to image features (as illustrated in Figure 1), offer extra guidance for refining the textual prototypes. Our primary objective is to maximize the (cosine) similarity between corresponding image and textual prototypes. To this end, we employ the InfoNCE loss [41] to directly assess the alignment between the corresponding image and textual prototype vectors, as follows:
(11) |
With this alignment, we maximize the (cosine) similarity between the image prototypes and the textual prototypes, encouraging the model to learn more closely aligned representations. The overall loss used for training on the target data combines the classification loss (Equation 9), the fairness regularization (Equation 10), and the alignment loss (Equation 11), balanced by non-learnable parameters.
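A minimal sketch of an InfoNCE-style alignment between the two sets of prototypes is given below. It treats the image prototype of the same class as the positive and the image prototypes of all other classes as negatives. The one-directional (text-to-image) form, the temperature value, and the names `info_nce_prototype_alignment` and `total_loss` with their default weights are assumptions for illustration.

```python
import numpy as np


def info_nce_prototype_alignment(text_protos, image_protos, temperature=0.07):
    """InfoNCE between class-wise text and image prototypes, both of shape (C, d)."""
    t = text_protos / np.linalg.norm(text_protos, axis=1, keepdims=True)
    v = image_protos / np.linalg.norm(image_protos, axis=1, keepdims=True)
    logits = t @ v.T / temperature                      # (C, C) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))           # positives on the diagonal


def total_loss(l_cls, l_reg, l_align, w_cls=1.0, w_reg=1.0, w_align=1.0):
    """Weighted sum of the three losses with fixed, non-learnable weights."""
    return w_cls * l_cls + w_reg * l_reg + w_align * l_align


# Perfectly aligned prototypes give a much lower loss than mismatched ones.
aligned_loss = info_nce_prototype_alignment(np.eye(3), np.eye(3))
shuffled_loss = info_nce_prototype_alignment(np.eye(3), np.roll(np.eye(3), 1, axis=0))
print(aligned_loss < shuffled_loss)  # True
```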
4 Experiments
Datasets and baselines: We extensively evaluate our approach on 13 diverse datasets: ImageNet [42], Caltech101 [43], DTD [44], EuroSAT [18], FGVCAircraft [45], Food101 [46], Flowers102 [47], OxfordPets [48], SUN397 [49], StanfordCars [50], CIFAR10/100 [51], and UCF101 [52]. We compare against four baselines: zero-shot CLIP [1] and three state-of-the-art (SOTA) unsupervised adaptation methods, UPL [14], POUF [13], and LaFTer [12].
Implementation Details: For all experiments, unless otherwise specified, we utilize a ViT-B/32 CLIP pre-trained by OpenAI [1]. During unsupervised fine-tuning, we update only the layer-normalization weights of the image encoder and the textual prototypes. This strategy is effective and stable for adapting models with noisy supervision [53, 54]. The text encoder of CLIP is discarded after constructing the textual prototypes. Input images are standardized to a size of 224 × 224. During training, we apply RandomResizedCrop, Flip, and RandAugment [55] as the strong augmentation, and Resize+RandomCrop as the weak augmentation used for generating pseudo-labels. For testing, we resize the image and extract a center crop. We utilize the AdamW optimizer [56] with a cosine learning rate schedule. For fair comparisons, we reproduce the results of UPL, POUF, and LaFTer using their publicly available code. For more implementation details, please refer to the supplementary materials.
Results: Table 1 reports the top-1 accuracy of our proposed method DPA and four baselines (zero-shot CLIP, UPL, POUF, and LaFTer) on 13 datasets. DPA outperforms zero-shot CLIP on every dataset, improving average top-1 accuracy from 64.46 to 72.29 (+7.83 points, Avg column of Table 1). Compared to UPL, which optimizes prompts, DPA still achieves consistent improvements, with an average gain of 8.24 points, at minimal additional computational overhead despite slightly more parameters. DPA also surpasses POUF and LaFTer, with average gains of 7.47 and 3.93 points, respectively, across the 13 datasets. Notably, DPA outperforms LaFTer without requiring additional text corpora or further pre-training.
Method | ImgNet | Caltech | DTD | ESAT | FGVCA | Food | Flower | OxPets | SUN | StCars | CIFAR10 | CIFAR100 | UCF | Avg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Zero-shot CLIP [1] | 63.30 | 90.69 | 44.42 | 43.84 | 19.50 | 82.40 | 66.46 | 87.50 | 61.99 | 58.74 | 89.80 | 65.10 | 64.20 | 64.46 |
UPL [14] | 58.22 | 92.36 | 45.37 | 51.88 | 17.07 | 84.25 | 67.40 | 83.84 | 62.12 | 49.41 | 91.26 | 67.41 | 62.04 | 64.05 |
POUF [13] | 52.20 | 94.10 | 46.10 | 62.90 | 18.20 | 82.10 | 67.80 | 87.80 | 60.00 | 57.70 | 90.50 | 62.00 | 61.20 | 64.82 |
LaFTer [12] | 61.63 | 94.39 | 50.32 | 69.96 | 19.86 | 82.45 | 72.43 | 84.93 | 65.87 | 57.44 | 94.57 | 69.79 | 65.08 | 68.36 |
DPA | 64.64 | 96.06 | 55.69 | 80.04 | 20.67 | 84.76 | 75.56 | 90.71 | 68.13 | 62.62 | 95.97 | 76.47 | 68.49 | 72.29 |
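As a quick arithmetic check, the Avg column of Table 1 and the average gains of DPA over each baseline can be recomputed from the per-dataset accuracies. The snippet is only a sketch: the dictionary copies the rows of Table 1, and the variable names are our own.

```python
# Per-dataset top-1 accuracies copied from Table 1 (13 datasets, Avg column excluded).
table1 = {
    "Zero-shot CLIP": [63.30, 90.69, 44.42, 43.84, 19.50, 82.40, 66.46,
                       87.50, 61.99, 58.74, 89.80, 65.10, 64.20],
    "UPL": [58.22, 92.36, 45.37, 51.88, 17.07, 84.25, 67.40,
            83.84, 62.12, 49.41, 91.26, 67.41, 62.04],
    "POUF": [52.20, 94.10, 46.10, 62.90, 18.20, 82.10, 67.80,
             87.80, 60.00, 57.70, 90.50, 62.00, 61.20],
    "LaFTer": [61.63, 94.39, 50.32, 69.96, 19.86, 82.45, 72.43,
               84.93, 65.87, 57.44, 94.57, 69.79, 65.08],
    "DPA": [64.64, 96.06, 55.69, 80.04, 20.67, 84.76, 75.56,
            90.71, 68.13, 62.62, 95.97, 76.47, 68.49],
}

# Mean accuracy per method, rounded to two decimals as in the table.
averages = {name: round(sum(acc) / len(acc), 2) for name, acc in table1.items()}

# Gain of DPA over each baseline, in percentage points of average accuracy.
gains = {
    name: round(averages["DPA"] - avg, 2)
    for name, avg in averages.items()
    if name != "DPA"
}

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
for name, gain in gains.items():
    print(f"DPA vs {name}: +{gain:.2f}")
```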
For ablation studies, we use 11 out of 13 datasets, excluding ImageNet and SUN397 due to their large size. Excluding these large-scale datasets allows us to perform more extensive experiments and analyses while maintaining computational feasibility.
Analysis of Model Components: We investigate the impact of different components by training four distinct models. Base: iteratively generates pseudo-labels using textual prototypes only and is self-trained with the regularization loss. Center: builds on the Base model and generates PLs using both image and text prototypes, as detailed in Section 3.2. Center+: adds the weighting strategy to the Center model. Center++Align: adds the alignment loss, as detailed in Section 3.2. As shown in Table 2, the Base model, which relies solely on self-training, improves notably over zero-shot CLIP (69.15 vs. 64.79 average accuracy). However, as training progresses, the impact of confirmation bias becomes evident, leading to a decrease in the accuracy of the pseudo-labels, as illustrated in Figure 3-(a). To mitigate the effects of confirmation bias, we introduce the pseudo-label generation strategy in the Center model. This approach enhances the quality of pseudo-labels throughout training and raises the average accuracy to 72.07, a gain of 7.28 points over zero-shot CLIP. Incorporating the weighting strategy in the Center+ model raises the average to 72.80 (+8.01 points over zero-shot CLIP). By assigning appropriate weights to the self-training loss, the model can better balance the contributions of high-confidence and low-confidence pseudo-labels, leading to more stable training. Finally, including the alignment loss in the Center++Align model yields the best average accuracy of 73.37 (+8.58 points over zero-shot CLIP). By explicitly encouraging the alignment between image and text prototypes, the model learns more robust and transferable representations, which are crucial for effective unsupervised adaptation. We also analyse the evolution of pseudo-label accuracy over training epochs for our method (DPA) compared to the Base model (Figure 3-(a,b,c)).
Method | Caltech | DTD | ESAT | FGVCA | Food | Flower | OxPets | StCars | CIFAR10 | CIFAR100 | UCF | Avg |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Zero-shot CLIP | 90.69 | 44.42 | 43.84 | 19.50 | 82.40 | 66.46 | 87.50 | 58.74 | 89.80 | 65.10 | 64.20 | 64.79 |
Base | 93.57 | 48.99 | 61.94 | 19.20 | 84.10 | 68.45 | 90.24 | 59.25 | 95.95 | 73.55 | 65.45 | 69.15 |
Center | 95.44 | 55.53 | 70.56 | 19.80 | 84.65 | 75.27 | 90.71 | 61.53 | 95.96 | 76.01 | 67.30 | 72.07 |
Center+ | 95.46 | 54.54 | 80.06 | 19.56 | 84.63 | 75.44 | 90.49 | 61.19 | 95.97 | 75.92 | 67.51 | 72.80 |
Center++Align(DPA) | 96.06 | 55.69 | 80.04 | 20.67 | 84.76 | 75.56 | 90.71 | 62.62 | 95.97 | 76.47 | 68.49 | 73.37 |
Incorporating Image Prototypes as Pseudo-Label Source: We analyse the impact of incorporating image prototypes as an additional classifier. We exclude Equation 7 and generate PLs solely based on text prototypes, while preserving the alignment of image-text prototypes and the weighting strategy. We update solely using the pseudo-labels derived from the textual prototypes. Table 3 shows that the alignment and weighting strategies generally perform well across datasets, with minor differences compared to the entire model, except for EuroSAT and Flowers-102.
Method | Caltech | DTD | ESAT | FGVCA | Food | Flower | OxPets | StCars | CIFAR10 | CIFAR100 | UCF | Avg |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Zero-shot CLIP | 90.69 | 44.42 | 43.84 | 19.50 | 82.40 | 66.46 | 87.50 | 58.74 | 89.80 | 65.10 | 64.20 | 64.79 |
DPA (w/o PLs Generation) | 95.41 | 54.20 | 61.94 | 19.44 | 84.49 | 69.22 | 91.17 | 61.95 | 95.93 | 76.12 | 67.96 | 70.71 |
DPA | 96.06 | 55.69 | 80.04 | 20.67 | 84.76 | 75.56 | 90.71 | 62.62 | 95.97 | 76.47 | 68.49 | 73.37 |
Impact of Image Prototype Initialization on Performance: We evaluate three initialization methods for the image prototypes: mean, weighted mean, and similarity-weighted mean [28] (Table 4). All three initialization strategies yield significant improvements over zero-shot CLIP.
Method | Caltech | DTD | ESAT | FGVCA | Food | Flower | OxPets | StCars | CIFAR10 | CIFAR100 | UCF | Avg |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Zero-shot CLIP | 90.69 | 44.42 | 43.84 | 19.50 | 82.40 | 66.46 | 87.50 | 58.74 | 89.80 | 65.10 | 64.20 | 64.79 |
DPA (Weighted Mean) | 96.00 | 54.52 | 68.34 | 20.31 | 84.70 | 73.81 | 90.49 | 63.14 | 95.96 | 75.81 | 68.04 | 71.92 |
DPA (Similarity Weighted Mean) | 94.52 | 55.16 | 79.76 | 21.12 | 84.70 | 77.06 | 90.84 | 62.98 | 95.93 | 76.33 | 68.28 | 73.33 |
DPA (Mean) | 96.06 | 55.69 | 80.04 | 20.67 | 84.76 | 75.56 | 90.71 | 62.62 | 95.97 | 76.47 | 68.49 | 73.37 |
Robustness Analysis under Distribution Shifts: To evaluate robustness against natural distribution shifts, we show results on ImageNet-Sketch [57], ImageNet-Adversarial [58], and ImageNet-Rendition [59]. We first fine-tune a ViT-B/32 model on ImageNet using our method, achieving an accuracy of 64.64 on the ImageNet validation set (Table 5). We then explore transductive transfer learning [60, 61] with our approach, which assumes access to unlabelled test images from the new distribution. We observe a modest improvement under distribution shift on 2 of the 3 datasets (Table 5). This slight improvement is attributed to utilizing only a subset of the ImageNet training set, a deliberate choice made to manage computational resources given its substantial size.
Method | ImageNet | ImageNet-S | ImageNet-A | ImageNet-R |
---|---|---|---|---|
Zero-shot CLIP | 63.30 | 42.10 | 68.60 | 32.10 |
DPA (Transductive) | 64.64 | 42.60 | 68.40 | 32.20 |
Full Fine-tuning: We compare full fine-tuning of the model against fine-tuning only the layer-normalization weights under noisy PLs (Table 6). The last row shows the PL accuracy of zero-shot CLIP, which is used at the start of training to construct the initial image prototypes. On datasets with noisy PLs, Table 6 highlights the importance of fine-tuning only the layer-normalization weights during adaptation to prevent overfitting and maximize performance.
Method | Caltech | DTD | ESAT | FGVCA | Food | Flower | OxPets | StCars | CIFAR10 | CIFAR100 | UCF | Avg |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Zero-shot CLIP | 90.69 | 44.42 | 43.84 | 19.50 | 82.40 | 66.46 | 87.50 | 58.74 | 89.80 | 65.10 | 64.20 | 64.79 |
DPA (LayerNorm Tuning) | 96.06 | 55.69 | 80.04 | 20.67 | 84.76 | 75.56 | 90.71 | 62.62 | 95.97 | 76.47 | 68.49 | 73.37 |
DPA (Full Tuning) | 94.57 | 53.03 | 62.92 | 17.37 | 82.62 | 75.23 | 91.14 | 54.06 | 97.44 | 78.91 | 68.57 | 70.53 |
DPA (PLs accuracy of Zero-shot CLIP) | 88.83 | 43.54 | 44.90 | 17.79 | 77.91 | 66.53 | 83.80 | 57.81 | 89.64 | 65.60 | 64.10 | 63.68 |
With other VLMs: We evaluate the efficacy of DPA with other VLMs in Table 7. Specifically, we conduct experiments on SLIP [62], a version of CLIP incorporating self-supervised learning objectives during pre-training. Additionally, we evaluate DPA using a CLIP architecture other than ViT-B/32. As shown in Table 7, DPA delivers substantial and consistent improvements across diverse VLMs and architectures.
Method (Init / Adapt) | Caltech | DTD | ESAT | FGVCA | Food | Flower | OxPets | StCars | CIFAR10 | CIFAR100 | UCF |
---|---|---|---|---|---|---|---|---|---|---|---|
SLIP (ViT-L/16) | 83.90 / 90.20 | 25.53 / 36.28 | 20.90 / 31.50 | 8.40 / 9.18 | 60.45 / 68.08 | 64.23 / 75.07 | 33.22 / 44.02 | 8.00 / 9.03 | 81.67 / 88.10 | 48.70 / 59.88 | 38.09 / 52.76 |
DPA (ViT-B/16) | 92.60 / 96.09 | 44.70 / 50.69 | 49.00 / 81.22 | 23.97 / 25.14 | 87.80 / 89.83 | 70.89 / 78.68 | 89.00 / 93.19 | 64.70 / 68.45 | 90.80 / 96.43 | 68.22 / 78.10 | 69.10 / 74.62 |
Comparison with ReCLIP: We compare DPA with ReCLIP [16], a very recent method for unsupervised adaptation of VLMs (Table 8). DPA outperforms ReCLIP on all eight datasets in the inductive setting and on six of eight datasets in the transductive setting, with higher average accuracy in both. DPA also trains more efficiently, with fewer parameters and reduced computational complexity (Figure 4).
Method | Caltech | DTD | ESAT | FGVCA | Flower | OxPets | StCars | UCF | Avg |
---|---|---|---|---|---|---|---|---|---|
Zero-shot CLIP | 90.69 | 44.42 | 43.84 | 19.50 | 66.46 | 87.50 | 58.74 | 64.20 | 59.42 |
Inductive | |||||||||
ReCLIP† [16] | 95.94 | 53.88 | 70.80 | 18.87 | 72.63 | 87.49 | 59.22 | 67.01 | 65.73 |
DPA | 96.06 | 55.69 | 80.04 | 20.67 | 75.56 | 90.71 | 62.62 | 68.49 | 68.73 |
Transductive | |||||||||
ReCLIP [16] | 92.43 | 52.50 | 59.30 | 20.34 | 70.65 | 88.42 | 59.06 | 69.13 | 63.98 |
DPA | 91.62 | 57.13 | 71.84 | 20.76 | 74.14 | 91.11 | 62.58 | 68.44 | 67.20 |
Comparison with Few-Shot methods: We compare DPA with the few-shot methods CoOp [7] and MaPLe [9] in Table 9. DPA matches or surpasses the few-shot methods on 6 of 11 datasets.
Method | Caltech | DTD | ESAT | FGVCA | Food | Flower | OxPets | StCars | CIFAR10 | CIFAR100 | UCF | Avg |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Zero-shot CLIP | 90.69 | 44.42 | 43.84 | 19.50 | 82.40 | 66.46 | 87.50 | 58.74 | 89.80 | 65.10 | 64.20 | 64.79 |
DPA | 96.06 | 55.69 | 80.04 | 20.67 | 84.76 | 75.56 | 90.71 | 62.62 | 95.97 | 76.47 | 68.49 | 73.37 |
8-shot | ||||||||||||
CoOp | 93.96 | 61.82 | 77.17 | 26.73 | 78.84 | 88.71 | 85.61 | 66.20 | 94.63 | 75.91 | 77.66 | 75.20 |
MaPLe | 93.23 | 34.99 | 65.25 | 21.96 | 80.51 | 68.78 | 81.85 | 58.51 | 88.23 | 71.14 | 69.18 | 66.69 |
16-shot | ||||||||||||
CoOp | 95.09 | 67.55 | 76.78 | 30.93 | 79.12 | 93.34 | 86.70 | 70.73 | 94.50 | 75.78 | 79.51 | 77.28 |
MaPLe | 93.35 | 53.43 | 71.49 | 21.48 | 81.01 | 72.47 | 86.54 | 60.28 | 91.94 | 71.74 | 70.26 | 70.36 |
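The count reported for Table 9 can be checked with a short script. As an assumption about how the comparison is made, the sketch below counts a dataset as a win when DPA is at least as accurate as the best of the four few-shot rows (CoOp and MaPLe at 8 and 16 shots); all numbers are copied from Table 9.

```python
datasets = ["Caltech", "DTD", "ESAT", "FGVCA", "Food", "Flower",
            "OxPets", "StCars", "CIFAR10", "CIFAR100", "UCF"]

dpa = [96.06, 55.69, 80.04, 20.67, 84.76, 75.56, 90.71, 62.62, 95.97, 76.47, 68.49]

few_shot = {
    "CoOp (8-shot)": [93.96, 61.82, 77.17, 26.73, 78.84, 88.71,
                      85.61, 66.20, 94.63, 75.91, 77.66],
    "MaPLe (8-shot)": [93.23, 34.99, 65.25, 21.96, 80.51, 68.78,
                       81.85, 58.51, 88.23, 71.14, 69.18],
    "CoOp (16-shot)": [95.09, 67.55, 76.78, 30.93, 79.12, 93.34,
                       86.70, 70.73, 94.50, 75.78, 79.51],
    "MaPLe (16-shot)": [93.35, 53.43, 71.49, 21.48, 81.01, 72.47,
                        86.54, 60.28, 91.94, 71.74, 70.26],
}

# Best few-shot accuracy per dataset across the four few-shot rows.
best_few_shot = [max(column) for column in zip(*few_shot.values())]

# Datasets where label-free DPA is at least as accurate as the best few-shot row.
wins = [name for name, ours, best in zip(datasets, dpa, best_few_shot) if ours >= best]

print(len(wins), wins)
# 6 ['Caltech', 'ESAT', 'Food', 'OxPets', 'CIFAR10', 'CIFAR100']
```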
Limitations and broader impact: While our method shows promising results, there is still room for improvement in mitigating zero-shot CLIP’s strong confirmation biases, which was not the primary focus of this work. Figure 5 shows that using dual classifiers to generate pseudo-labels in source-free adaptation reduces false positive rates across all categories. However, because of zero-shot CLIP’s strong confirmation biases, our method still consistently produces false positive predictions for certain classes. The same confusion between classes is also evident in ReCLIP, indicating that addressing these strong biases requires more robust de-biasing strategies.
5 Conclusion
We present DPA, an unsupervised domain adaptation method for VLMs that aims at bridging the domain gap between visual and textual representations. DPA introduces a novel idea of dual prototypes, which work as two distinct classifiers, and their outputs are fused via a convex combination. It also ranks pseudo-labels for robust self-training during early training. DPA also enhances the alignment of visual-textual prototypes to improve adaptation performance further. DPA shows improved pseudo-labelling accuracy, which leads to notable performance improvements on various downstream tasks. DPA significantly outperforms zero-shot CLIP and the state-of-the-art baselines across 13 downstream vision tasks.
References
- [1] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
- [2] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 4904–4916.
- [3] J. Yang, J. Duan, S. Tran, Y. Xu, S. Chanda, L. Chen, B. Zeng, T. Chilimbi, and J. Huang, “Vision-language pre-training with triple contrastive learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15 671–15 680.
- [4] Y. Li, F. Liang, L. Zhao, Y. Cui, W. Ouyang, J. Shao, F. Yu, and J. Yan, “Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=zq1iJkNk3uN
- [5] B. An, S. Zhu, M.-A. Panaitescu-Liess, C. K. Mummadi, and F. Huang, “PerceptionCLIP: Visual classification by inferring and conditioning on contexts,” in The Twelfth International Conference on Learning Representations, 2024.
- [6] R. Zhang, W. Zhang, R. Fang, P. Gao, K. Li, J. Dai, Y. Qiao, and H. Li, “Tip-adapter: Training-free adaption of clip for few-shot classification,” in European Conference on Computer Vision. Springer, 2022, pp. 493–510.
- [7] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” International Journal of Computer Vision, vol. 130, no. 9, pp. 2337–2348, 2022.
- [8] ——, “Conditional prompt learning for vision-language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 816–16 825.
- [9] M. U. Khattak, H. Rasheed, M. Maaz, S. Khan, and F. S. Khan, “Maple: Multi-modal prompt learning,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
- [10] M. U. Khattak, S. T. Wasim, M. Naseer, S. Khan, M.-H. Yang, and F. S. Khan, “Self-regulating prompts: Foundational model adaptation without forgetting,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 15 190–15 200.
- [11] J. Li, S. Savarese, and S. C. H. Hoi, “Masked unsupervised self-training for label-free image classification,” in ICLR, 2023.
- [12] M. J. Mirza, L. Karlinsky, W. Lin, M. Kozinski, H. Possegger, R. Feris, and H. Bischof, “Lafter: Label-free tuning of zero-shot classifier using language and unlabeled image collections,” in Conference on Neural Information Processing Systems (NeurIPS), 2023.
- [13] K. Tanwisuth, S. Zhang, H. Zheng, P. He, and M. Zhou, “Pouf: Prompt-oriented unsupervised fine-tuning for large pre-trained models,” in International Conference on Machine Learning. PMLR, 2023, pp. 33 816–33 832.
- [14] T. Huang, J. Chu, and F. Wei, “Unsupervised prompt learning for vision-language models,” arXiv preprint arXiv:2204.03649, 2022.
- [15] Q. Qian, Y. Xu, and J. Hu, “Intra-modal proxy learning for zero-shot visual categorization with clip,” Advances in Neural Information Processing Systems, vol. 36, 2024.
- [16] X. Hu, K. Zhang, L. Xia, A. Chen, J. Luo, Y. Sun, K. Wang, N. Qiao, X. Zeng, M. Sun et al., “Reclip: Refine contrastive language image pre-training with source free domain adaptation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 2994–3003.
- [17] V. W. Liang, Y. Zhang, Y. Kwon, S. Yeung, and J. Y. Zou, “Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 17 612–17 625, 2022.
- [18] P. Helber, B. Bischke, A. Dengel, and D. Borth, “Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 12, no. 7, pp. 2217–2226, 2019.
- [19] A. Iscen, G. Tolias, Y. Avrithis, and O. Chum, “Label propagation for deep semi-supervised learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
- [20] K. Desai and J. Johnson, “Virtex: Learning visual representations from textual annotations,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 11 162–11 173.
- [21] M. B. Sariyildiz, J. Perez, and D. Larlus, “Learning visual representations with caption annotations,” in European Conference on Computer Vision. Springer, 2020, pp. 153–170.
- [22] Q. Cui, B. Zhou, Y. Guo, W. Yin, H. Wu, O. Yoshie, and Y. Chen, “Contrastive vision-language pre-training with limited resources,” in European Conference on Computer Vision. Springer, 2022, pp. 236–253.
- [23] X. Hu, Z. Gan, J. Wang, Z. Yang, Z. Liu, Y. Lu, and L. Wang, “Scaling up vision-language pre-training for image captioning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 980–17 989.
- [24] P. Gao, S. Geng, R. Zhang, T. Ma, R. Fang, Y. Zhang, H. Li, and Y. Qiao, “Clip-adapter: Better vision-language models with feature adapters,” International Journal of Computer Vision, vol. 132, no. 2, pp. 581–595, 2024.
- [25] R. Zhang, X. Hu, B. Li, S. Huang, H. Deng, Y. Qiao, P. Gao, and H. Li, “Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15 211–15 222.
- [26] X. Zhu, R. Zhang, B. He, A. Zhou, D. Wang, B. Zhao, and P. Gao, “Not all features matter: Enhancing few-shot clip with adaptive prior refinement,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2605–2615.
- [27] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” Advances in neural information processing systems, vol. 30, 2017.
- [28] K. J. Liang, S. B. Rangrej, V. Petrovic, and T. Hassner, “Few-shot learning with noisy labels,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9089–9098.
- [29] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, “Unsupervised feature learning via non-parametric instance discrimination,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3733–3742.
- [30] W. Xu, Y. Xian, J. Wang, B. Schiele, and Z. Akata, “Attribute prototype network for zero-shot learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 21 969–21 980, 2020.
- [31] H.-M. Yang, X.-Y. Zhang, F. Yin, and C.-L. Liu, “Robust classification with convolutional prototype learning,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3474–3482.
- [32] P. Mettes, E. Van der Pol, and C. Snoek, “Hyperspherical prototype networks,” Advances in neural information processing systems, vol. 32, 2019.
- [33] X. Zhai, X. Wang, B. Mustafa, A. Steiner, D. Keysers, A. Kolesnikov, and L. Beyer, “Lit: Zero-shot transfer with locked-image text tuning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18 123–18 133.
- [34] A. Sahito, E. Frank, and B. Pfahringer, “Better self-training for image classification through self-supervision,” in Australasian Joint Conference on Artificial Intelligence. Springer, 2022, pp. 645–657.
- [35] K. Sohn, D. Berthelot, N. Carlini, Z. Zhang, H. Zhang, C. A. Raffel, E. D. Cubuk, A. Kurakin, and C.-L. Li, “Fixmatch: Simplifying semi-supervised learning with consistency and confidence,” Advances in neural information processing systems, vol. 33, pp. 596–608, 2020.
- [36] L. Zhou, N. Li, M. Ye, X. Zhu, and S. Tang, “Source-free domain adaptation with class prototype discovery,” Pattern recognition, vol. 145, p. 109974, 2024.
- [37] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VII 14. Springer, 2016, pp. 499–515.
- [38] S. Xie, Z. Zheng, L. Chen, and C. Chen, “Learning semantic representations for unsupervised domain adaptation,” in International conference on machine learning. PMLR, 2018, pp. 5423–5432.
- [39] J. Li, C. Xiong, and S. C. Hoi, “Comatch: Semi-supervised learning with contrastive graph regularization,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 9475–9484.
- [40] X. Wang, Z. Wu, L. Lian, and S. X. Yu, “Debiased learning from naturally imbalanced pseudo-labels,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 14 647–14 657.
- [41] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
- [42] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009, pp. 248–255.
- [43] L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,” in 2004 conference on computer vision and pattern recognition workshop. IEEE, 2004, pp. 178–178.
- [44] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, “Describing textures in the wild,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 3606–3613.
- [45] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, “Fine-grained visual classification of aircraft,” arXiv preprint arXiv:1306.5151, 2013.
- [46] L. Bossard, M. Guillaumin, and L. V. Gool, “Food-101–mining discriminative components with random forests,” in European conference on computer vision. Springer, 2014, pp. 446–461.
- [47] M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing. IEEE, 2008, pp. 722–729.
- [48] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, “Cats and dogs,” in 2012 IEEE conference on computer vision and pattern recognition. IEEE, 2012, pp. 3498–3505.
- [49] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sun database: Large-scale scene recognition from abbey to zoo,” in 2010 IEEE Computer Society Conference on computer vision and Pattern Recognition. IEEE, 2010, pp. 3485–3492.
- [50] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3d object representations for fine-grained categorization,” in Proceedings of the IEEE International Conference on computer vision workshops, 2013, pp. 554–561.
- [51] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
- [52] K. Soomro, A. R. Zamir, and M. Shah, “A dataset of 101 human action classes from videos in the wild,” Center for Research in Computer Vision, vol. 2, no. 11, pp. 1–7, 2012.
- [53] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
- [54] D. Wang, E. Shelhamer, S. Liu, B. Olshausen, and T. Darrell, “Tent: Fully test-time adaptation by entropy minimization,” arXiv preprint arXiv:2006.10726, 2020.
- [55] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le, “Randaugment: Practical automated data augmentation with a reduced search space,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2020, pp. 702–703.
- [56] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
- [57] H. Wang, S. Ge, Z. Lipton, and E. P. Xing, “Learning robust global representations by penalizing local predictive power,” Advances in Neural Information Processing Systems, vol. 32, 2019.
- [58] D. Hendrycks, K. Zhao, S. Basart, J. Steinhardt, and D. Song, “Natural adversarial examples,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 15 262–15 271.
- [59] D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Parajuli, M. Guo et al., “The many faces of robustness: A critical analysis of out-of-distribution generalization,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 8340–8349.
- [60] Y. Xian, B. Schiele, and Z. Akata, “Zero-shot learning-the good, the bad and the ugly,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4582–4591.
- [61] M. Rohrbach, S. Ebert, and B. Schiele, “Transfer learning in a transductive setting,” Advances in neural information processing systems, vol. 26, 2013.
- [62] N. Mu, A. Kirillov, D. Wagner, and S. Xie, “Slip: Self-supervision meets language-image pre-training,” in European conference on computer vision. Springer, 2022, pp. 529–544.