DPA: Dual Prototypes Alignment for Unsupervised Adaptation of Vision-Language Models

Eman Ali
Mohamed Bin Zayed University of Artificial Intelligence
Abu Dhabi, UAE
Sathira Silva
Mohamed Bin Zayed University of Artificial Intelligence
Abu Dhabi, UAE
Muhammad Haris Khan
Mohamed Bin Zayed University of Artificial Intelligence
Abu Dhabi, UAE
Abstract

Vision-language models (VLMs), e.g., CLIP, have shown remarkable potential in zero-shot image classification. However, adapting these models to new domains remains challenging, especially in unsupervised settings where labelled data is unavailable. Recent research has proposed pseudo-labelling approaches to adapt CLIP in an unsupervised manner using unlabelled target data. Nonetheless, these methods struggle due to noisy pseudo-labels resulting from the misalignment between CLIP’s visual and textual representations. This study introduces DPA, an unsupervised domain adaptation method for VLMs. DPA introduces the concept of dual prototypes, acting as distinct classifiers, along with the convex combination of their outputs, thereby leading to accurate pseudo-label construction. Next, it ranks pseudo-labels to facilitate robust self-training, particularly during early training. Finally, it addresses visual-textual misalignment by aligning textual prototypes with image prototypes to further improve the adaptation performance. Experiments on 13 downstream vision tasks demonstrate that DPA significantly outperforms zero-shot CLIP and the state-of-the-art unsupervised adaptation baselines.

1 Introduction

Context and background: Vision-language models (VLMs) [1, 2, 3, 4] have shown promising potential in zero-shot image classification. One notable example is CLIP [1]. Despite the impressive zero-shot capabilities of CLIP, its performance can be impacted by the discrepancy between the pretraining image-text pairs and the downstream task images [1, 5]. To address this limitation, several studies have attempted to enhance CLIP transfer performance on downstream tasks by leveraging limited labelled samples from the target domains [6, 7, 8, 9, 10]. However, in many practical applications, such as security and medical diagnostics, collecting labelled samples can be particularly challenging due to high labelling costs and data privacy concerns. In such contexts, unsupervised learning presents a promising alternative. Several contributions for adapting CLIP to a target domain using an unlabelled dataset have been introduced recently [11, 12, 13, 14, 15, 16]. A common approach involves leveraging CLIP to generate pseudo-labels, which are then used to fine-tune CLIP in an unsupervised manner. However, this method encounters significant challenges due to the noisy pseudo-labels generated by CLIP. Despite efforts by recent methods for the unsupervised adaptation of CLIP, a significant modality gap persists between the text and vision representations [16, 17], leading to inaccurate pseudo-labels causing confirmation bias during adaptation.

Motivation: Recent studies highlight a significant factor behind CLIP’s performance issues in unsupervised adaptation scenarios: a visual domain gap between the source images used to train CLIP and the target images, which typically arises when the target samples originate from an uncommon domain [16]. As shown in Figure 1-(a), the t-SNE projection of visual and textual embeddings on the EuroSAT dataset [18] reveals a significant misalignment, resulting in misclassifications. Contemporary methods (e.g., [16]) attempt to address this misalignment by learning a projection space and using label propagation [19] to improve pseudo-labels. However, these methods can be computationally expensive, can perform suboptimally in an inductive setting, or both. Further, they struggle on datasets with a large number of classes, which necessitates turning off label propagation and relying on model predictions instead. Effectively addressing these challenges in an inductive setting is therefore crucial for improving CLIP’s performance in unsupervised adaptation scenarios.

Figure 1: Comparison of t-SNE projections for zero-shot CLIP [1], ReCLIP [16], and DPA visual embeddings, along with their corresponding visual (circle, ●) and textual (star, ★) prototypes on the EuroSAT dataset. The visual prototypes are computed as the mean of each cluster. For the purpose of enhancing the clarity of visualizations, prior to applying the t-SNE projection, class-agnostic and redundant features are removed from all the embeddings using a fixed projection following [16]. The cosine similarities between visual and textual prototypes and the well-separated visual clusters, as illustrated in (c), demonstrate the superior performance of our method in capturing both inter-modal and intra-modal alignment compared to existing approaches depicted in (a) and (b).

Our proposal: To address these challenges, we propose DPA, a novel unsupervised adaptation method for VLMs (Figure 2). DPA aims to address the domain gap between visual and textual representations in downstream tasks. It introduces dual prototypes, namely image and textual prototypes, which act as distinct classifiers whose outputs are fused via a convex combination to generate accurate pseudo-labels. There are two main reasons for introducing dual prototypes. Firstly, the misalignment between the image representation and its textual representation in zero-shot CLIP, due to domain shift, often leads to inaccurate pseudo-labels (see Figure 1-(a)). Secondly, image prototypes, which tend to be less affected by noise, as illustrated in Figure 1, are generally closer to the true image representation than textual prototypes. Additionally, we tackle the misalignment between visual and textual embeddings by aligning textual prototypes with image prototypes. This alignment process further enhances the performance of unsupervised CLIP adaptation.

Contributions: 1) We propose DPA, a novel unsupervised domain adaptation framework for VLMs which allows adapting these powerful models to new domains without requiring labelled data from the target distribution. 2) We introduce a novel approach for generating accurate pseudo-labels by leveraging two distinct prototypes and fusing their outputs via a convex combination. We also propose to rank pseudo-labels in the classification loss to mitigate their noise, especially during early training. Moreover, we propose to tackle the visual-textual misalignment by aligning textual prototypes with image prototypes. 3) Experiments on 13 downstream vision tasks demonstrate consistent and significant performance enhancements over zero-shot CLIP and the state-of-the-art baselines.


Figure 2: The overall framework of DPA. (a) Given a target dataset, DPA utilizes a set of carefully designed prompts to initialize the textual prototypes using CLIP’s zero-shot textual encoder $E_t$. (b) To achieve effective self-training, DPA introduces dual prototypes, namely image and textual prototypes, that behave like two distinct classifiers, and it fuses their outputs via convex combination to form accurate pseudo-labels (PLs). Moreover, it ranks PLs in the classification loss to alleviate the impact of noisy PLs during early self-training. Finally, DPA aligns textual and visual prototypes to adjust to the target feature semantic relations learned by the visual encoder. (c) During inference, DPA discards the visual prototypes and relies solely on the textual prototypes for prediction.

2 Related Work

Large-scale Vision-Language (VL) Models: Among other vision-language models [20, 21, 2, 1, 22, 3, 23], CLIP [1] stands out as a pioneering example, aligning visual and textual features via a contrastive objective on a vast web-crawled collection of image-text pairs. This alignment empowers CLIP to generalize effectively to diverse downstream classification tasks. Building on CLIP’s success, subsequent research has explored methods for efficiently transferring the pre-trained model to diverse downstream tasks with limited labelled target data. These approaches often leverage vision-specific adapters [24, 6, 25, 26], language-specific adapters [7, 8, 10], or both [9]. However, these techniques require a minimum number of labelled samples, posing challenges due to data scarcity, annotation expenses, privacy issues, and practical limitations. Recent work has introduced unsupervised adaptation techniques that tailor CLIP to target tasks using unlabelled datasets [14, 13, 16, 12, 11, 15]. However, these methods still depend on additional supervision signals, such as fine-tuning classifiers with a large language model like GPT-3, as seen in LaFTer [12]. They also often require significant computational resources, as exemplified by MUST [11]. ReCLIP [16] tackles misaligned embeddings through source-free domain learning, utilizing a projection space to align these embeddings and employing pseudo-labels for self-training. Despite this progress, there remains significant potential for improving the performance of unsupervised adaptation techniques across various image classification benchmarks. In contrast, our work introduces a fully label-free approach for adapting VLMs to a target task by generating more accurate pseudo-labels grounded on image and text prototypes.

Prototype-based Learning: Prototype-based learning represents the data with a fixed set of distinctive prototypes and performs learning by comparing the test samples directly with these prototypes. It has been extensively studied in various contexts, including few-shot classification [27, 28], unsupervised learning [29, 30] and supervised classification [31, 32]. Our method incorporates two types of prototypes, image prototypes and textual prototypes, and combines them to generate accurate pseudo-labels and effectively adapt CLIP to the target dataset.

Pseudo-labelling: The success of pseudo-labelling has been demonstrated in various domains, including vision [33, 34] and vision-language tasks [11, 12, 14, 16]. The pseudo-labelling pipeline of DPA leverages consistency regularization [35] to produce consistent predictions across different data augmentations, thereby enhancing the accuracy of pseudo-labels.

3 Methodology

This study addresses the task of unsupervised adaptation of vision-language models (i.e., CLIP) for image classification. The setting for this task is as follows:

  • Pre-trained CLIP consists of a visual encoder $E_v$ and a textual encoder $E_t$.

  • Target dataset $\mathcal{D}_t = \{\mathcal{X}_t\}$ consists of unlabelled images $\{x_i\}_{i=1}^{N}$, where $x_i \in \mathcal{X}_t$.

  • Unique class names $\mathcal{C} = \{c_j\}_{j=1}^{C}$ for the unlabelled target dataset.

We use a pre-trained CLIP to process the unlabelled target data $\mathcal{D}_t$. Initially, the source model assigns a pseudo-label to each unlabelled target image $x_i$. During the adaptation phase, we employ two prototypes to generate pseudo-labels for the adaptation of CLIP to the target domain.

3.1 Zero-shot Visual Classification in CLIP

CLIP achieves impressive zero-shot performance on image classification tasks by learning a joint embedding space that aligns image and text representations. During pre-training, CLIP minimizes a symmetric contrastive loss between semantically similar image-text pairs $\{(x_i^s, \mathbf{t}_i^s)\}_{i=1}^{n}$ sampled from a large-scale source dataset $\mathcal{D}_s$:

$$\sum_{i} -\log\frac{\exp\left({f_i^{s}}^{T} \cdot z_i / \tau\right)}{\sum_{j}^{m} \exp\left({f_i^{s}}^{T} \cdot z_j / \tau\right)} - \log\frac{\exp\left(z_i^{T} \cdot f_i^{s} / \tau\right)}{\sum_{j}^{m} \exp\left(z_j^{T} \cdot f_i^{s} / \tau\right)}, \tag{1}$$

where $f_i^s = E_v(x_i^s)$ and $z_i = E_t(\mathbf{t}_i^s)$, with both $f_i^s \in \mathbb{R}^{1 \times d}$ and $z_i \in \mathbb{R}^{1 \times d}$. The parameter $\tau$ represents a learned temperature, while $m$ indicates the mini-batch size. For inference, given an unlabelled target image $x$ with feature $f = E_v(x)$ and a list of class names $\mathcal{C}$, we leverage the pre-trained textual encoder $E_t$ to generate text embeddings for each class. Specifically, a series of well-crafted prompts are fed into $E_t$, producing a set of text embeddings for each class, denoted as $\mathbf{Z}$ where $\mathbf{Z} \in \mathbb{R}^{C \times d}$. The zero-shot prediction $\hat{y}$ is then obtained as follows:

$$\hat{y} = \operatorname*{argmax}_{c} \left(f \cdot \mathbf{Z}^{T}\right) \tag{2}$$
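As a minimal sketch of Equation 2, the snippet below scores one image feature against a matrix of class text embeddings and takes the argmax. Plain Python lists stand in for tensors, the toy values are illustrative rather than real CLIP embeddings, and the L2 normalization (so that the dot product equals cosine similarity) is an assumption that the text above does not state explicitly. The names `l2_normalize` and `zero_shot_predict` are hypothetical helpers.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_predict(f, Z):
    """Eq. 2: score image feature f (length d) against each row of Z (C x d)."""
    f = l2_normalize(f)  # assumption: unit-norm features, so dot product = cosine
    scores = [sum(a * b for a, b in zip(f, l2_normalize(z))) for z in Z]
    y_hat = max(range(len(scores)), key=scores.__getitem__)  # argmax over classes
    return y_hat, scores

# Toy inputs (illustrative values, not real CLIP embeddings): C = 3, d = 3.
Z = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
f = [0.2, 0.9, 0.1]
y_hat, scores = zero_shot_predict(f, Z)
print(y_hat, [round(s, 3) for s in scores])
```

In this toy case the feature is closest to the second text embedding, so the predicted class index is 1.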

3.2 Proposed Framework (DPA)

Motivation: While zero-shot CLIP shows impressive performance across diverse domains, it still exhibits notable limitations compared to models adapted under supervised conditions due to the misalignment between image and corresponding textual representations [1]. As depicted in Figure 1-(a), this discrepancy is evident in CLIP when applied to uncommon domains. Such disparities can lead to inaccurate pseudo-labels, especially when utilizing zero-shot CLIP to generate pseudo-labels for the unsupervised adaptation [16]. Our objective is to adapt CLIP to the target domain using its unlabelled samples by bridging the gap between image representations and their textual representations, all without dependence on ground-truth labels. Simultaneously, we seek to achieve efficient parameter-based unsupervised adaptation for CLIP. Our objective relies on two key components: (1) ensuring the accuracy of the pseudo-labels generated for the unlabelled samples, and (2) aligning textual representations with their corresponding visual representations. To generate the pseudo-labels for adapting CLIP to the target domain, we introduce dual prototypes which act as two different classifiers and fuse their complementary outputs via convex combination. Furthermore, we propose visual-textual prototype alignment towards further improving the adaptation performance.

Textual Prototypes Construction: To create textual prototypes, we obtain the textual representation for each class in the target domain. This is achieved by incorporating each class name $c_j$ into a predefined prompt template. It is then processed by CLIP’s text encoder to produce a textual representation $z_j$. We utilize $k$ distinct prompts for each class, resulting in $k$ corresponding textual representations $z_j \in \mathbb{R}^{k \times d}$, where $d$ is the dimensionality of the embedding space. The textual prototype $\mathbf{Z}_j$ for class $c_j$ is constructed by averaging these $k$ individual textual representations:

$$\mathbf{Z}_j = \frac{1}{k} \sum_{i=1}^{k} z_{ji} \tag{3}$$

where $\mathbf{Z}_j \in \mathbb{R}^{1 \times d}$. We then define the textual prototypes for the set of $C$ classes as $\mathbf{Z} \in \mathbb{R}^{C \times d}$. This approach enables us to capture a comprehensive and robust representation of class semantics by consolidating textual representations derived from multiple prompts. These textual prototypes act as the initialization for a parametric classifier that is updated during training alongside CLIP parameters.
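The prompt-averaging step of Equation 3 can be sketched as follows. The encoder here is a hypothetical deterministic stand-in (`toy_text_encoder`) rather than CLIP’s text encoder, and the two prompt templates and class names are illustrative assumptions, not the prompts used in the experiments.

```python
def toy_text_encoder(prompt, d=4):
    """Hypothetical stand-in for E_t: a deterministic d-dim vector built from character codes."""
    vec = [0.0] * d
    for pos, ch in enumerate(prompt):
        vec[pos % d] += ord(ch) / 1000.0
    return vec

def textual_prototypes(class_names, templates, encoder):
    """Eq. 3: average the k prompt embeddings of each class into one prototype (C x d)."""
    Z = []
    for name in class_names:
        reps = [encoder(t.format(name)) for t in templates]  # k x d representations
        k = len(reps)
        Z.append([sum(column) / k for column in zip(*reps)])  # mean over the k prompts
    return Z

# Illustrative templates and class names (assumptions for this sketch).
templates = ["a photo of a {}.", "a satellite photo of a {}."]
class_names = ["forest", "river"]
Z = textual_prototypes(class_names, templates, toy_text_encoder)
print(len(Z), len(Z[0]))
```

With a real model, `encoder` would wrap the frozen text encoder, and the resulting rows of `Z` would initialize the trainable classifier weights described above.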

Image Prototypes Construction: To construct the image prototypes, CLIP’s image encoder processes a weakly-augmented unlabelled image $\alpha(x)$ and generates a feature representation $f = E_v(\alpha(x))$, where $f \in \mathbb{R}^{1 \times d}$. Without class labels, we utilize Equation 2 to generate pseudo-labels $\hat{y}$ from zero-shot CLIP. To obtain the image prototype $\mathbf{P}_j$ for a class $c_j$, we average the feature representation $f$ of the unlabelled images assigned to class $c_j$ as follows:

$$\mathbf{P}_j = \frac{1}{N_j} \sum_{i=1}^{N} \mathbb{I}_{\hat{y}_i = j} \, f_i \tag{4}$$

where $N_j = \sum_{i=1}^{N} \mathbb{I}_{\hat{y}_i = j}$. For the set of $C$ classes, we define the image prototypes as $\mathbf{P} \in \mathbb{R}^{C \times d}$. The image prototypes function as a non-parametric classifier. Given that the pseudo-labels generated by zero-shot CLIP are initially noisy, our approach, inspired by previous works [36, 37, 38], employs a memory bank $\mathbf{MB}$. This memory bank facilitates the incremental update of image prototypes in a non-parametric way throughout the training process. Specifically, $\mathbf{MB}$ stores the feature representations of all unlabelled target samples along with their corresponding pseudo-labels: $\mathbf{MB} = \{(f_i, \hat{y}_i)\}_{i=1}^{N}$. As the image prototypes are non-parametric, we iteratively refine them using the memory bank as follows: (1) We initially calculate the image prototypes using Equation 4, utilizing the image features and the pseudo-labels generated from zero-shot CLIP across the entire target dataset. (2) During training, we continuously update the memory bank by storing the image representations alongside their corresponding pseudo-labels. (3) At the end of each epoch, we update the prototypes $\mathbf{P}_j$ using Equation 4, incorporating the image representations and their corresponding pseudo-labels from $\mathbf{MB}$. (4) We repeat steps (2) and (3) until the completion of the training process. DPA maximizes the utilization of all available training samples to enhance the image prototypes. This iterative process ensures that each prototype continuously evolves, progressively assimilating relevant knowledge from individual training samples throughout the model training phase, while mitigating the error propagation commonly associated with pseudo-labels.
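Steps (1) to (3) above can be illustrated with a small sketch in which the memory bank is a dictionary from sample index to a (feature, pseudo-label) pair. The function name `image_prototypes` is hypothetical, and the fallback for a class with no assigned samples (keep the previous prototype, or use zeros) is an assumption: Equation 4 is undefined when $N_j = 0$ and the text does not specify this case.

```python
def image_prototypes(memory_bank, num_classes, dim, previous=None):
    """Eq. 4: per-class mean of the features whose pseudo-label equals that class."""
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for feature, label in memory_bank.values():
        counts[label] += 1
        for t in range(dim):
            sums[label][t] += feature[t]
    prototypes = []
    for j in range(num_classes):
        if counts[j] > 0:
            prototypes.append([s / counts[j] for s in sums[j]])
        elif previous is not None:
            prototypes.append(list(previous[j]))  # assumption: keep the old prototype
        else:
            prototypes.append([0.0] * dim)        # assumption: zero vector if nothing known
    return prototypes

# Step (1): initial bank from zero-shot pseudo-labels (toy values).
memory_bank = {0: ([1.0, 0.0], 0), 1: ([3.0, 0.0], 0), 2: ([0.0, 2.0], 1)}
P_init = image_prototypes(memory_bank, num_classes=2, dim=2)

# Step (2): during training, sample 1 gets a new feature and a new pseudo-label.
memory_bank[1] = ([0.0, 4.0], 1)

# Step (3): at the end of the epoch, recompute the prototypes from the bank.
P_epoch1 = image_prototypes(memory_bank, num_classes=2, dim=2, previous=P_init)
print(P_init, P_epoch1)
```

Because the prototypes are recomputed from stored features rather than learned by gradient descent, a sample whose pseudo-label changes simply moves from one class mean to another at the next epoch boundary.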

Pseudo-labels Generation: Both image and textual prototypes serve as distinct classifiers for the unlabelled target dataset. We generate a pseudo-label (PL) for the weakly-augmented unlabelled target image $\alpha(x)$ from these two prototypes. Initially, we determine the similarity between the image features and the textual prototypes as follows:

$$p_t = f \cdot \mathbf{Z}^{T} \tag{5}$$

Then, we find the similarity between the image features and the image prototypes as follows:

$$p_v = f \cdot \mathbf{P}^{T} \tag{6}$$

Finally, we fuse $p_t$ and $p_v$ to form $\hat{p}$ as follows:

$$\hat{p} = \beta \cdot DA(p_t) + (1 - \beta) \cdot p_v \tag{7}$$

Following [39], we perform distribution alignment (DA) to prevent the model’s prediction from collapsing onto specific classes. Here, $DA(p_t) = p_t / \bar{p}_t$, where $\bar{p}_t$ represents a running average of $p_t$ during training, and $\beta \in [0, 1]$ is a hyperparameter. Finally, the PL is obtained as $\hat{y} = \operatorname*{argmax}_{c} \hat{p}$. The accuracy of the PLs generated in this way improves as training progresses, which indicates that the early PLs may be noisy (see Figure 3). Therefore, assigning uniform weights to all pseudo-labels could hinder the adaptation process. To address this challenge, we propose adjusting the classification loss weight for the pseudo-label $\hat{y}$ based on its similarity to the textual and image prototypes, as follows:

$$w_x = \langle f, \mathbf{Z}(\hat{y}) \rangle \, \langle f, \mathbf{P}(\hat{y}) \rangle \tag{8}$$

where $\langle \cdot, \cdot \rangle$ represents the cosine similarity between the image features and the corresponding prototypes derived from the PLs. Finally, we utilize the pseudo-label $\hat{y}$, obtained from a weakly-augmented image $\alpha(x)$, as self-supervision for the strongly-augmented counterpart $\mathcal{A}(x)$ as follows:

$$\mathcal{L}_{st} = -\,\mathbb{E}_{x \in \mathcal{X}_t} \left[ w_x \cdot \sum_{j=1}^{C} \mathbb{I}_{\hat{y} = j} \log\left(p_{\mathcal{A}(x)}\right) \right], \tag{9}$$

where $w_x$ represents the weight calculated from Equation 8, and $p_{\mathcal{A}(x)} = E_v(\mathcal{A}(x)) \cdot \mathbf{Z}^{T}$ represents the probabilistic output for the strongly-augmented image $\mathcal{A}(x)$. To alleviate the confirmation bias induced by CLIP [40], we adopt “fairness” regularization as suggested in [11] as follows:

$$\mathcal{L}_{reg} = -\frac{1}{C} \sum_{j=1}^{C} \log\left(\bar{p}_{\mathcal{A}(x), j}\right), \tag{10}$$

where $\bar{p}_{\mathcal{A}(x)}$ represents the model’s average prediction from the strongly augmented images across the batch. By incorporating this fairness regularization, we aim to encourage the model to make more uniform predictions across classes, reducing the tendency to overfit the PLs and promoting a more balanced adaptation to the target domain. Equations 9 and 10 are used to update both the image encoder and the textual prototypes.
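The pseudo-label construction of Equations 5 to 8 can be sketched for a single sample as follows. Several details are assumptions made only for this illustration: the similarities are converted to probabilities with a temperature softmax (the text writes $p_t$ and $p_v$ as raw similarities), the distribution-aligned vector is renormalized to sum to one, and the values of `beta`, `temperature` and the running average `p_bar_t` are arbitrary toy choices. The helper names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def softmax(scores, temperature):
    """Temperature softmax, shifted by the maximum for numerical stability."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_pseudo_label(f, Z, P, p_bar_t, beta, temperature):
    p_t = softmax([cosine(f, z) for z in Z], temperature)  # Eq. 5, softmax is assumed
    p_v = softmax([cosine(f, p) for p in P], temperature)  # Eq. 6, softmax is assumed
    da = [a / b for a, b in zip(p_t, p_bar_t)]              # DA(p_t) = p_t / p_bar_t
    da_sum = sum(da)
    da = [x / da_sum for x in da]                           # renormalization is assumed
    p_hat = [beta * a + (1 - beta) * b for a, b in zip(da, p_v)]  # Eq. 7
    y_hat = max(range(len(p_hat)), key=p_hat.__getitem__)   # pseudo-label
    w_x = cosine(f, Z[y_hat]) * cosine(f, P[y_hat])         # Eq. 8
    return y_hat, w_x, p_hat

# Toy case: the textual prototypes favour class 0, the image prototypes favour class 1.
f = [0.8, 0.6]
Z = [[1.0, 0.0], [0.0, 1.0]]
P = [[1.0, -1.0], [0.6, 0.8]]
text_scores = [cosine(f, z) for z in Z]
text_only = max(range(2), key=text_scores.__getitem__)
y_hat, w_x, p_hat = fuse_pseudo_label(f, Z, P, p_bar_t=[0.5, 0.5], beta=0.5, temperature=0.1)
print(text_only, y_hat, round(w_x, 3))
```

In this toy configuration the textual classifier alone would pick class 0, while the fused prediction picks class 1 because the image prototype for class 1 lies much closer to the feature. The weight $w_x$ stays moderate (about 0.576) since the chosen textual prototype agrees only partially with the feature.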

Prototypes Alignment: During training, the textual prototypes undergo continual updates to align with the associated image representations, leveraging pseudo-labels through Equations 9 and 10. Conversely, the image prototypes are refined by averaging image features, as detailed in Section 3.2. We exploit the image prototypes to provide additional supervision for updating textual prototypes beyond pseudo-labels alone. These image prototypes, less influenced by erroneous pseudo-labels and closer to image features (as illustrated in Figure 1), offer extra guidance for refining textual prototypes. Our primary objective is to optimize the relationship between image prototypes $\mathbf{P}$ and textual prototypes $\mathbf{Z}$ by maximizing the (cosine) similarity between these feature sets. To accomplish this optimization, we employ the InfoNCE loss [41] to directly assess the alignment between the corresponding feature vectors in $\mathbf{P}$ and $\mathbf{Z}$, as follows:

$$\mathcal{L}_{align}=-\sum_{j=1}^{C}\log\frac{\exp(\mathbf{P}_{j}\cdot\mathbf{Z}_{j}^{T}/\tau)}{\sum_{r}\exp(\mathbf{P}_{j}\cdot\mathbf{Z}_{r}^{T}/\tau)} \tag{11}$$

With this alignment, we maximize the (cosine) similarity between the image prototypes $\mathbf{P}$ and the textual prototypes $\mathbf{Z}$, encouraging the model to learn more closely aligned representations. The overall loss function used for training on the target data is $\mathcal{L}=\lambda_{1}\mathcal{L}_{st}+\lambda_{2}\mathcal{L}_{reg}+\lambda_{3}\mathcal{L}_{align}$, where $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are non-learnable weighting hyperparameters.
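The alignment loss of Equation 11 and the weighted overall objective can be sketched in a few lines of plain Python. This is an illustrative sketch only: the temperature `tau=0.07`, the default `lambdas`, the explicit L2 normalization (used so that dot products are cosine similarities), and the toy prototypes are all assumptions, since their actual values are not given in this section.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def alignment_loss(image_protos, text_protos, tau=0.07):
    """Eq. 11: InfoNCE with (P_j, Z_j) as the positive pair and all Z_r in the denominator."""
    P = [l2_normalize(p) for p in image_protos]
    Z = [l2_normalize(z) for z in text_protos]
    loss = 0.0
    for j, p_j in enumerate(P):
        logits = [sum(a * b for a, b in zip(p_j, z_r)) / tau for z_r in Z]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss -= logits[j] - log_denominator
    return loss

def total_loss(l_st, l_reg, l_align, lambdas=(1.0, 1.0, 1.0)):
    """Overall objective: lambda_1 * L_st + lambda_2 * L_reg + lambda_3 * L_align."""
    lam1, lam2, lam3 = lambdas
    return lam1 * l_st + lam2 * l_reg + lam3 * l_align

# Toy prototypes for C = 3 classes in a 3-dimensional feature space.
image_protos = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
aligned_text = [[0.9, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 1.1]]
shuffled_text = [aligned_text[1], aligned_text[2], aligned_text[0]]

print(alignment_loss(image_protos, aligned_text))   # matching pairs are most similar
print(alignment_loss(image_protos, shuffled_text))  # positives are mismatched
```

When each textual prototype is closest to the image prototype of its own class, the loss is near zero; shuffling the class order of the textual prototypes makes it much larger, which is the behaviour the alignment term relies on.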

4 Experiments

Datasets and baselines: We extensively evaluate our approach on 13 diverse datasets: ImageNet [42], Caltech101 [43], DTD [44], EuroSAT [18], FGVCAircraft [45], Food101 [46], Flowers102 [47], OxfordPets [48], SUN397 [49], StanfordCars [50], CIFAR10/100 [51], and UCF101 [52]. We compare against four baselines: zero-shot CLIP [1] and three state-of-the-art (SOTA) unsupervised adaptation methods, UPL [14], POUF [13], and LaFTer [12].

Implementation Details: For all experiments, unless otherwise specified, we use a ViT-B/32 CLIP pre-trained by OpenAI [1]. During unsupervised fine-tuning, we update only the layer-normalization weights of the image encoder and the textual prototypes. This strategy is effective and stable for adapting models with noisy supervision [53, 54]. The text encoder of CLIP is discarded after the textual prototypes are constructed. Input images are standardized to a size of 224 × 224. During training, we apply RandomResizedCrop, Flip, and RandAugment [55] as the strong augmentation, and Resize+RandomCrop for generating pseudo-labels. For testing, we extract a center crop after resizing the image to $224$. We use the AdamW optimizer [56] with a cosine learning rate schedule. For fair comparison, we reproduce the results of UPL, POUF, and LaFTer using their publicly available code. For more implementation details, please refer to the supplementary materials.

Results: Table 1 reports the top-1 accuracy of our proposed method DPA and four baselines (zero-shot CLIP, UPL, POUF, and LaFTer) on 13 datasets. Our approach consistently outperforms zero-shot CLIP, improving top-1 accuracy by $+7.83\%$ on average. Compared to UPL, which optimizes prompts, DPA achieves consistent improvements with an average gain of $+8.24\%$ and minimal additional computational overhead, despite a slightly larger number of parameters. DPA also surpasses POUF and LaFTer, with average gains of $+7.47\%$ and $+3.93\%$, respectively, across the 13 datasets. Notably, DPA outperforms LaFTer without requiring additional text corpora or further pre-training.

Table 1: Comparison with SOTA unsupervised adaptation methods.
Method ImgNet Caltech DTD ESAT FGVCA Food Flower OxPets SUN StCars CIFAR10 CIFAR100 UCF Avg
Zero-shot CLIP [1] 63.30 90.69 44.42 43.84 19.50 82.40 66.46 87.50 61.99 58.74 89.80 65.10 64.20 64.46
UPL [14] 58.22 92.36 45.37 51.88 17.07 84.25 67.40 83.84 62.12 49.41 91.26 67.41 62.04 64.05
POUF [13] 52.20 94.10 46.10 62.90 18.20 82.10 67.80 87.80 60.00 57.70 90.50 62.00 61.20 64.82
LaFTer [12] 61.63 94.39 50.32 69.96 19.86 82.45 72.43 84.93 65.87 57.44 94.57 69.79 65.08 68.36
DPA 64.64 96.06 55.69 80.04 20.67 84.76 75.56 90.71 68.13 62.62 95.97 76.47 68.49 72.29
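The average gains quoted in the Results paragraph can be checked directly from the Avg column of Table 1. The short script below is only a worked arithmetic check; the dictionary name and layout are illustrative.

```python
# Average top-1 accuracy (%) over the 13 datasets, copied from the Avg column of Table 1.
avg_top1 = {
    "Zero-shot CLIP": 64.46,
    "UPL": 64.05,
    "POUF": 64.82,
    "LaFTer": 68.36,
    "DPA": 72.29,
}

# Gain of DPA over each baseline, rounded to two decimals as in the text.
gains = {
    name: round(avg_top1["DPA"] - acc, 2)
    for name, acc in avg_top1.items()
    if name != "DPA"
}

for name, gain in gains.items():
    print(f"DPA vs. {name}: +{gain:.2f}")
```

These differences are in absolute accuracy points, matching the $+7.83$, $+8.24$, $+7.47$, and $+3.93$ figures reported above.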

For ablation studies, we use 11 out of 13 datasets, excluding ImageNet and SUN397 due to their large size. Excluding these large-scale datasets allows us to perform more extensive experiments and analyses while maintaining computational feasibility.

Analysis of Model Components: We investigate the impact of the individual components by training four distinct models. Base: iteratively generates pseudo-labels using textual prototypes and incorporates the regularization loss during self-training. Center: builds on the Base model and generates PLs using both image and text prototypes, as detailed in Section 3.2. Center+$w$: adds the weighting strategy to the Center model. Center+$w$+Align: adds the alignment loss, as detailed in Section 3.2. As shown in Table 2, the Base model, which relies solely on self-training, improves notably over zero-shot CLIP. However, as training progresses, the impact of confirmation bias becomes evident and the accuracy of the pseudo-labels decreases, as illustrated in Figure 3-(a). To mitigate confirmation bias, the Center model introduces our pseudo-label generation strategy. This improves the quality of the pseudo-labels throughout training and yields a gain of $+7.28\%$ over zero-shot CLIP in average accuracy. Adding the weighting strategy in the Center+$w$ model raises the gain over zero-shot CLIP to $+8.01\%$. By assigning appropriate weights to the self-training loss, the model better balances the contributions of high-confidence and low-confidence pseudo-labels, which leads to more stable training. Finally, including the alignment loss in the Center+$w$+Align model raises the gain over zero-shot CLIP to $+8.58\%$ in average accuracy. By explicitly encouraging alignment between image and text prototypes, the model learns more robust and transferable representations, which are crucial for effective unsupervised adaptation.
We also analyse the evolution of pseudo-label accuracy over training epochs for our method (DPA) compared to the Base model (Figure 3-(a,b,c)).

Table 2: Ablation study. See text for details.
Method Caltech DTD ESAT FGVCA Food Flower OxPets StCars CIFAR10 CIFAR100 UCF Avg
Zero-shot CLIP 90.69 44.42 43.84 19.50 82.40 66.46 87.50 58.74 89.80 65.10 64.20 64.79
Base 93.57 48.99 61.94 19.20 84.10 68.45 90.24 59.25 95.95 73.55 65.45 69.15
Center 95.44 55.53 70.56 19.80 84.65 75.27 90.71 61.53 95.96 76.01 67.30 72.07
Center+$w$ 95.46 54.54 80.06 19.56 84.63 75.44 90.49 61.19 95.97 75.92 67.51 72.80
Center+$w$+Align (DPA) 96.06 55.69 80.04 20.67 84.76 75.56 90.71 62.62 95.97 76.47 68.49 73.37
Figure 3: Base vs. DPA in (a) PLs, (b) training and (c) testing accuracy on EuroSAT dataset.

Incorporating Image Prototypes as Pseudo-Label Source: We analyse the impact of incorporating image prototypes as an additional classifier. We exclude Equation 7 and generate PLs solely based on text prototypes, while preserving the alignment of image-text prototypes and the weighting strategy. We update $\mathbf{MB}$ solely using the pseudo-labels derived from the textual prototypes. Table 3 shows that the alignment and weighting strategies alone perform well across datasets, with minor differences compared to the full model, except on EuroSAT and Flowers102, where accuracy drops from 80.04 to 61.94 and from 75.56 to 69.22, respectively.

Table 3: Image prototypes vs. text prototypes for pseudo-label generation.
Method Caltech DTD ESAT FGVCA Food Flower OxPets StCars CIFAR10 CIFAR100 UCF Avg
Zero-shot CLIP 90.69 44.42 43.84 19.50 82.40 66.46 87.50 58.74 89.80 65.10 64.20 64.79
DPA (w/o PLs Generation) 95.41 54.20 61.94 19.44 84.49 69.22 91.17 61.95 95.93 76.12 67.96 70.71
DPA 96.06 55.69 80.04 20.67 84.76 75.56 90.71 62.62 95.97 76.47 68.49 73.37

Impact of Image Prototype Initialization on Performance: We evaluate three initialization methods for the image prototypes: mean, weighted mean, and similarity weighted mean [28] (Table 4). All three initialization strategies yield significant improvements over zero-shot CLIP.

Table 4: Performance of DPA with three different initializations for image prototypes.
Method Caltech DTD ESAT FGVCA Food Flower OxPets StCars CIFAR10 CIFAR100 UCF Avg
Zero-shot CLIP 90.69 44.42 43.84 19.50 82.40 66.46 87.50 58.74 89.80 65.10 64.20 64.79
DPA (Weighted Mean) 96.00 54.52 68.34 20.31 84.70 73.81 90.49 63.14 95.96 75.81 68.04 71.92
DPA (Similarity Weighted Mean) 94.52 55.16 79.76 21.12 84.70 77.06 90.84 62.98 95.93 76.33 68.28 73.33
DPA (Mean) 96.06 55.69 80.04 20.67 84.76 75.56 90.71 62.62 95.97 76.47 68.49 73.37

Robustness Analysis under Distribution Shifts: To evaluate robustness against natural distribution shifts, we report results on ImageNet-Sketch [57], ImageNet-Adversarial [58], and ImageNet-Rendition [59]. We first evaluate a ViT-B/32 model fine-tuned on ImageNet with our method, which achieves an accuracy of $64.64\%$ on the ImageNet validation set. We then explore transductive transfer learning [60, 61], which assumes access to unlabelled test images from the new distribution (Table 5). We observe a modest improvement under distribution shift on 2 out of 3 datasets (ImageNet-S and ImageNet-R). We attribute the small size of this improvement to using only $25\%$ of the ImageNet training set, a deliberate choice to keep the computational cost manageable given the size of ImageNet.

Table 5: Transductive adaptation to distribution shift
Method ImageNet ImageNet-S ImageNet-A ImageNet-R
Zero-shot CLIP 63.30 42.10 68.60 32.10
DPA (Transductive) 64.64 42.60 68.40 32.20

Full Fine-tuning: We compare full fine-tuning of the model with fine-tuning only the layer-normalization weights under noisy PLs (Table 6). The last row shows the accuracy of the zero-shot CLIP PLs used at the start of training to construct the initial image prototypes. On datasets with noisy PLs, Table 6 highlights the importance of fine-tuning only the layer-normalization weights during adaptation to prevent overfitting and maximize performance.

Table 6: Comparison of full fine-tuning vs. layer-normalization weights fine-tuning for DPA
Method Caltech DTD ESAT FGVCA Food Flower OxPets StCars CIFAR10 CIFAR100 UCF Avg
Zero-shot CLIP 90.69 44.42 43.84 19.50 82.40 66.46 87.50 58.74 89.80 65.10 64.20 64.79
DPA (LayerNorm Tuning) 96.06 55.69 80.04 20.67 84.76 75.56 90.71 62.62 95.97 76.47 68.49 73.37
DPA (Full Tuning) 94.57 53.03 62.92 17.37 82.62 75.23 91.14 54.06 97.44 78.91 68.57 70.53
DPA (PLs accuracy of Zero-shot CLIP) 88.83 43.54 44.90 17.79 77.91 66.53 83.80 57.81 89.64 65.60 64.10 63.68

With other VLMs: We evaluate the efficacy of DPA with other VLMs in Table 7. Specifically, we conduct experiments on SLIP [62], a version of CLIP that incorporates self-supervised learning objectives during pre-training. We also evaluate DPA with a different CLIP architecture than ViT-B/32. As shown in Table 7, DPA yields substantial and consistent improvements across diverse VLMs and architectures.

Table 7: Effectiveness of DPA on different model architectures and pre-training strategies.
Method Caltech DTD ESAT FGVCA Food Flower OxPets StCars CIFAR10 CIFAR100 UCF
Init → Adapt Init → Adapt Init → Adapt Init → Adapt Init → Adapt Init → Adapt Init → Adapt Init → Adapt Init → Adapt Init → Adapt Init → Adapt
SLIP (ViT-L/16) 83.90 → 90.20 25.53 → 36.28 20.90 → 31.50 8.40 → 9.18 60.45 → 68.08 64.23 → 75.07 33.22 → 44.02 8.00 → 9.03 81.67 → 88.10 48.70 → 59.88 38.09 → 52.76
DPA (ViT-B/16) 92.60 → 96.09 44.70 → 50.69 49.00 → 81.22 23.97 → 25.14 87.80 → 89.83 70.89 → 78.68 89.00 → 93.19 64.70 → 68.45 90.80 → 96.43 68.22 → 78.10 69.10 → 74.62

Comparison with ReCLIP: We compare DPA with ReCLIP [16], a very recent method for unsupervised adaptation of VLMs (Table 8). DPA outperforms ReCLIP on all eight datasets in the inductive setting and on six of eight in the transductive setting, with higher average accuracy in both. DPA is also more efficient to train, with fewer trainable parameters and lower computational cost (Figure 4).

Table 8: Comparison of top-1 classification accuracy (%) of DPA with ReCLIP [16], a source-free unsupervised domain adaptation method for CLIP. ReCLIP was originally trained in a transductive setting; for the inductive rows, we trained it in an inductive setting for fair comparison.
Method Caltech DTD ESAT FGVCA Flower OxPets StCars UCF Avg
Zero-shot CLIP 90.69 44.42 43.84 19.50 66.46 87.50 58.74 64.20 59.42
Inductive
ReCLIP [16] 95.94 53.88 70.80 18.87 72.63 87.49 59.22 67.01 65.73
DPA 96.06 55.69 80.04 20.67 75.56 90.71 62.62 68.49 68.73
Transductive
ReCLIP [16] 92.43 52.50 59.30 20.34 70.65 88.42 59.06 69.13 63.98
DPA 91.62 57.13 71.84 20.76 74.14 91.11 62.58 68.44 67.20
Figure 4: Efficiency comparison of DPA with baselines. The radius of each circle represents trainable parameters in each method.

Comparison with Few-Shot methods: We compare DPA with the few-shot methods CoOp [7] and MaPLe [9] in Table 9. DPA matches or surpasses the few-shot methods on 6 of 11 datasets.

Table 9: Comparison with few-shot methods across multiple datasets.
Method Caltech DTD ESAT FGVCA Food Flower OxPets StCars CIFAR10 CIFAR100 UCF Avg
Zero-shot CLIP 90.69 44.42 43.84 19.50 82.40 66.46 87.50 58.74 89.80 65.10 64.20 64.79
DPA 96.06 55.69 80.04 20.67 84.76 75.56 90.71 62.62 95.97 76.47 68.49 73.37
8-shot
CoOp 93.96 61.82 77.17 26.73 78.84 88.71 85.61 66.20 94.63 75.91 77.66 75.20
MaPLe 93.23 34.99 65.25 21.96 80.51 68.78 81.85 58.51 88.23 71.14 69.18 66.69
16-shot
CoOp 95.09 67.55 76.78 30.93 79.12 93.34 86.70 70.73 94.50 75.78 79.51 77.28
MaPLe 93.35 53.43 71.49 21.48 81.01 72.47 86.54 60.28 91.94 71.74 70.26 70.36

Limitations and broader impact: While our method shows promising results, there is still room for improvement in mitigating zero-shot CLIP’s strong confirmation biases, as illustrated in Figure 5; addressing them was not the primary focus of this work. The figure shows that using dual classifiers to generate pseudo-labels in source-free adaptation reduces false positive rates across all categories. However, because zero-shot CLIP has strong confirmation biases, our method still consistently produces false positive predictions for certain classes. The same confusion between classes is evident in ReCLIP, indicating that addressing these strong biases requires more robust de-biasing strategies.

Figure 5: Confusion matrices of (a) zero-shot CLIP, (b) ReCLIP [16] (inductive training), and (c) our method (DPA).

5 Conclusion

We present DPA, an unsupervised domain adaptation method for VLMs that aims to bridge the domain gap between visual and textual representations. DPA introduces dual prototypes, which act as two distinct classifiers whose outputs are fused via a convex combination. It also ranks pseudo-labels for robust self-training during early training and aligns textual prototypes with image prototypes to further improve adaptation performance. The resulting improvement in pseudo-labelling accuracy leads to notable performance gains on various downstream tasks: DPA significantly outperforms zero-shot CLIP and the state-of-the-art baselines across 13 downstream vision tasks.

References

  • [1] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning.   PMLR, 2021, pp. 8748–8763.
  • [2] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in International Conference on Machine Learning.   PMLR, 2021, pp. 4904–4916.
  • [3] J. Yang, J. Duan, S. Tran, Y. Xu, S. Chanda, L. Chen, B. Zeng, T. Chilimbi, and J. Huang, “Vision-language pre-training with triple contrastive learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15 671–15 680.
  • [4] Y. Li, F. Liang, L. Zhao, Y. Cui, W. Ouyang, J. Shao, F. Yu, and J. Yan, “Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=zq1iJkNk3uN
  • [5] B. An, S. Zhu, M.-A. Panaitescu-Liess, C. K. Mummadi, and F. Huang, “PerceptionCLIP: Visual classification by inferring and conditioning on contexts,” in The Twelfth International Conference on Learning Representations, 2024.
  • [6] R. Zhang, W. Zhang, R. Fang, P. Gao, K. Li, J. Dai, Y. Qiao, and H. Li, “Tip-adapter: Training-free adaption of clip for few-shot classification,” in European Conference on Computer Vision.   Springer, 2022, pp. 493–510.
  • [7] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” International Journal of Computer Vision, vol. 130, no. 9, pp. 2337–2348, 2022.
  • [8] ——, “Conditional prompt learning for vision-language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 816–16 825.
  • [9] M. U. Khattak, H. Rasheed, M. Maaz, S. Khan, and F. S. Khan, “Maple: Multi-modal prompt learning,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
  • [10] M. U. Khattak, S. T. Wasim, M. Naseer, S. Khan, M.-H. Yang, and F. S. Khan, “Self-regulating prompts: Foundational model adaptation without forgetting,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 15 190–15 200.
  • [11] J. Li, S. Savarese, and S. C. H. Hoi, “Masked unsupervised self-training for label-free image classification,” in ICLR, 2023.
  • [12] M. J. Mirza, L. Karlinsky, W. Lin, M. Kozinski, H. Possegger, R. Feris, and H. Bischof, “Lafter: Label-free tuning of zero-shot classifier using language and unlabeled image collections,” in Conference on Neural Information Processing Systems (NeurIPS), 2023.
  • [13] K. Tanwisuth, S. Zhang, H. Zheng, P. He, and M. Zhou, “Pouf: Prompt-oriented unsupervised fine-tuning for large pre-trained models,” in International Conference on Machine Learning.   PMLR, 2023, pp. 33 816–33 832.
  • [14] T. Huang, J. Chu, and F. Wei, “Unsupervised prompt learning for vision-language models,” arXiv preprint arXiv:2204.03649, 2022.
  • [15] Q. Qian, Y. Xu, and J. Hu, “Intra-modal proxy learning for zero-shot visual categorization with clip,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [16] X. Hu, K. Zhang, L. Xia, A. Chen, J. Luo, Y. Sun, K. Wang, N. Qiao, X. Zeng, M. Sun et al., “Reclip: Refine contrastive language image pre-training with source free domain adaptation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 2994–3003.
  • [17] V. W. Liang, Y. Zhang, Y. Kwon, S. Yeung, and J. Y. Zou, “Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 17 612–17 625, 2022.
  • [18] P. Helber, B. Bischke, A. Dengel, and D. Borth, “Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 12, no. 7, pp. 2217–2226, 2019.
  • [19] A. Iscen, G. Tolias, Y. Avrithis, and O. Chum, “Label propagation for deep semi-supervised learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [20] K. Desai and J. Johnson, “Virtex: Learning visual representations from textual annotations,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 11 162–11 173.
  • [21] M. B. Sariyildiz, J. Perez, and D. Larlus, “Learning visual representations with caption annotations,” in European Conference on Computer Vision.   Springer, 2020, pp. 153–170.
  • [22] Q. Cui, B. Zhou, Y. Guo, W. Yin, H. Wu, O. Yoshie, and Y. Chen, “Contrastive vision-language pre-training with limited resources,” in European Conference on Computer Vision.   Springer, 2022, pp. 236–253.
  • [23] X. Hu, Z. Gan, J. Wang, Z. Yang, Z. Liu, Y. Lu, and L. Wang, “Scaling up vision-language pre-training for image captioning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 980–17 989.
  • [24] P. Gao, S. Geng, R. Zhang, T. Ma, R. Fang, Y. Zhang, H. Li, and Y. Qiao, “Clip-adapter: Better vision-language models with feature adapters,” International Journal of Computer Vision, vol. 132, no. 2, pp. 581–595, 2024.
  • [25] R. Zhang, X. Hu, B. Li, S. Huang, H. Deng, Y. Qiao, P. Gao, and H. Li, “Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15 211–15 222.
  • [26] X. Zhu, R. Zhang, B. He, A. Zhou, D. Wang, B. Zhao, and P. Gao, “Not all features matter: Enhancing few-shot clip with adaptive prior refinement,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2605–2615.
  • [27] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” Advances in neural information processing systems, vol. 30, 2017.
  • [28] K. J. Liang, S. B. Rangrej, V. Petrovic, and T. Hassner, “Few-shot learning with noisy labels,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9089–9098.
  • [29] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, “Unsupervised feature learning via non-parametric instance discrimination,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3733–3742.
  • [30] W. Xu, Y. Xian, J. Wang, B. Schiele, and Z. Akata, “Attribute prototype network for zero-shot learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 21 969–21 980, 2020.
  • [31] H.-M. Yang, X.-Y. Zhang, F. Yin, and C.-L. Liu, “Robust classification with convolutional prototype learning,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3474–3482.
  • [32] P. Mettes, E. Van der Pol, and C. Snoek, “Hyperspherical prototype networks,” Advances in neural information processing systems, vol. 32, 2019.
  • [33] X. Zhai, X. Wang, B. Mustafa, A. Steiner, D. Keysers, A. Kolesnikov, and L. Beyer, “Lit: Zero-shot transfer with locked-image text tuning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18 123–18 133.
  • [34] A. Sahito, E. Frank, and B. Pfahringer, “Better self-training for image classification through self-supervision,” in Australasian Joint Conference on Artificial Intelligence.   Springer, 2022, pp. 645–657.
  • [35] K. Sohn, D. Berthelot, N. Carlini, Z. Zhang, H. Zhang, C. A. Raffel, E. D. Cubuk, A. Kurakin, and C.-L. Li, “Fixmatch: Simplifying semi-supervised learning with consistency and confidence,” Advances in neural information processing systems, vol. 33, pp. 596–608, 2020.
  • [36] L. Zhou, N. Li, M. Ye, X. Zhu, and S. Tang, “Source-free domain adaptation with class prototype discovery,” Pattern recognition, vol. 145, p. 109974, 2024.
  • [37] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in Computer vision–ECCV 2016: 14th European conference, amsterdam, the netherlands, October 11–14, 2016, proceedings, part VII 14.   Springer, 2016, pp. 499–515.
  • [38] S. Xie, Z. Zheng, L. Chen, and C. Chen, “Learning semantic representations for unsupervised domain adaptation,” in International conference on machine learning.   PMLR, 2018, pp. 5423–5432.
  • [39] J. Li, C. Xiong, and S. C. Hoi, “Comatch: Semi-supervised learning with contrastive graph regularization,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 9475–9484.
  • [40] X. Wang, Z. Wu, L. Lian, and S. X. Yu, “Debiased learning from naturally imbalanced pseudo-labels,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 14 647–14 657.
  • [41] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
  • [42] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition.   IEEE, 2009, pp. 248–255.
  • [43] L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,” in 2004 conference on computer vision and pattern recognition workshop.   IEEE, 2004, pp. 178–178.
  • [44] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, “Describing textures in the wild,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 3606–3613.
  • [45] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, “Fine-grained visual classification of aircraft,” arXiv preprint arXiv:1306.5151, 2013.
  • [46] L. Bossard, M. Guillaumin, and L. V. Gool, “Food-101–mining discriminative components with random forests,” in European conference on computer vision.   Springer, 2014, pp. 446–461.
  • [47] M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing.   IEEE, 2008, pp. 722–729.
  • [48] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, “Cats and dogs,” in 2012 IEEE conference on computer vision and pattern recognition.   IEEE, 2012, pp. 3498–3505.
  • [49] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sun database: Large-scale scene recognition from abbey to zoo,” in 2010 IEEE Computer Society Conference on computer vision and Pattern Recognition.   IEEE, 2010, pp. 3485–3492.
  • [50] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3d object representations for fine-grained categorization,” in Proceedings of the IEEE International Conference on computer vision workshops, 2013, pp. 554–561.
  • [51] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
  • [52] K. Soomro, A. R. Zamir, and M. Shah, “A dataset of 101 human action classes from videos in the wild,” Center for Research in Computer Vision, vol. 2, no. 11, pp. 1–7, 2012.
  • [53] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
  • [54] D. Wang, E. Shelhamer, S. Liu, B. Olshausen, and T. Darrell, “Tent: Fully test-time adaptation by entropy minimization,” arXiv preprint arXiv:2006.10726, 2020.
  • [55] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le, “Randaugment: Practical automated data augmentation with a reduced search space,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2020, pp. 702–703.
  • [56] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
  • [57] H. Wang, S. Ge, Z. Lipton, and E. P. Xing, “Learning robust global representations by penalizing local predictive power,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [58] D. Hendrycks, K. Zhao, S. Basart, J. Steinhardt, and D. Song, “Natural adversarial examples,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 15 262–15 271.
  • [59] D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Parajuli, M. Guo et al., “The many faces of robustness: A critical analysis of out-of-distribution generalization,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 8340–8349.
  • [60] Y. Xian, B. Schiele, and Z. Akata, “Zero-shot learning-the good, the bad and the ugly,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4582–4591.
  • [61] M. Rohrbach, S. Ebert, and B. Schiele, “Transfer learning in a transductive setting,” Advances in neural information processing systems, vol. 26, 2013.
  • [62] N. Mu, A. Kirillov, D. Wagner, and S. Xie, “Slip: Self-supervision meets language-image pre-training,” in European conference on computer vision.   Springer, 2022, pp. 529–544.