AdvQDet: Detecting Query-Based Adversarial Attacks with Adversarial Contrastive Prompt Tuning

Xin Wang    Kai Chen    Xingjun Ma    Zhineng Chen    Jingjing Chen    Yu-Gang Jiang
Abstract

Deep neural networks (DNNs) are known to be vulnerable to adversarial attacks even under a black-box setting where the adversary can only query the model. Particularly, query-based black-box adversarial attacks estimate adversarial gradients based on the returned probability vectors of the target model for a sequence of queries. During this process, the queries made to the target model are intermediate adversarial examples crafted at the previous attack step, which share high similarities in the pixel space. Motivated by this observation, stateful detection methods have been proposed to detect and reject query-based attacks. While demonstrating promising results, these methods either have been evaded by more advanced attacks or suffer from low efficiency in terms of the number of shots (queries) required to detect different attacks. Arguably, the key challenge here is to assign high similarity scores for any two intermediate adversarial examples perturbed from the same clean image. To address this challenge, we propose a novel Adversarial Contrastive Prompt Tuning (ACPT) method to robustly fine-tune the CLIP image encoder to extract similar embeddings for any two intermediate adversarial queries. With ACPT, we further introduce a detection framework AdvQDet that can detect 7 state-of-the-art query-based attacks with a >99% detection rate within 5 shots. We also show that ACPT is robust to 3 types of adaptive attacks. Code is available at https://github.com/xinwong/AdvQDet.


1 Introduction

In the past decade, deep neural networks (DNNs) have made remarkable achievements across a wide range of fields, such as computer vision (He et al., 2016; Dosovitskiy et al., 2010), natural language processing (Vaswani et al., 2017; Devlin et al., 2018), and multimodal learning (Radford et al., 2021; Ramesh et al., 2022). Despite these advancements, studies have shown that DNNs are extremely vulnerable to small adversarial perturbations at the inference stage (Szegedy et al., 2013), which are input perturbations generated to maximize the prediction error of the model. The adversarially perturbed inputs are known as adversarial examples (attacks), and the weakness of DNNs to adversarial attacks is known as adversarial vulnerability. This has raised serious security concerns about the deployment of DNNs in safety-critical scenarios, such as autonomous driving (Eykholt et al., 2018; Cao et al., 2019) and medical diagnosis (Finlayson et al., 2019; Ma et al., 2021).

An adversary could generate adversarial attacks in either a white-box or a black-box setting according to the threat model. In the white-box setting, the adversary has full access to the model’s parameters and thus can directly compute the adversarial gradients to generate adversarial examples (Szegedy et al., 2013; Madry et al., 2017; Ma et al., 2024). In the black-box setting, however, the adversary can only query the target model to estimate the adversarial gradients based on the model returns (probability vectors or hard labels) (Chen et al., 2017; Brendel et al., 2018; Jiang et al., 2019; Tong et al., 2023; Zhao et al., 2024). Black-box attacks can also be achieved by transfer-based attacks, i.e., generating the attacks based on a surrogate model that is similar to the target model and then applying the generated adversarial examples to attack the target model (Dong et al., 2018; Wu et al., 2020; Chen et al., 2023; Wei et al., 2023). Compared to white-box attacks, black-box attacks pose a more practical threat as most commercial models are kept secret from the users except their APIs. In this work, we focus on query-based black-box adversarial attacks and study the detectability of the malicious queries made by these attacks to the target model.

Figure 1: Query-based attack and stateful detection.

Existing defense approaches against adversarial attacks can be categorized into adversarial training methods (Madry et al., 2017; Wang et al., 2019; Zhang et al., 2019; Wang et al., 2020; Bai et al., 2021) and adversarial example detection methods (Xu et al., 2017; Ma et al., 2018). Although adversarial training has been demonstrated to be one of the most effective defense methods against white-box attacks (Croce & Hein, 2020), it relies on an expensive min-max training of the model. This reduces its utility on large models, as even standard training could cost millions of dollars (Sharir et al., 2020). Adversarial example detection methods, on the other hand, were mostly developed for white-box attacks and thus cannot be applied to detect the intermediate queries made by a black-box adversary to the target model.

One inherent weakness of query-based attacks is that they have to query the target model many times with similar (and partially adversarial) examples generated during the attack process. These similar queries may be easily detected and rejected by the defender, ideally making the attack fail within the first few attempts. This is known as stateful detection against black-box attacks (Chen et al., 2020b; Li et al., 2022a; Choi et al., 2023). As depicted in Fig. 1, by maintaining a list of historical queries, stateful detection finds the historical query most similar to the current query to determine whether the current query is an adversarial example. If the similarity exceeds a certain threshold, the current query is detected as an adversarial example. Here, the length of the list introduces a tradeoff between defense effectiveness and storage cost, i.e., a longer list makes the defense more reliable and the attack more expensive but incurs more storage (for each user).

As pixel space detection is sensitive to non-adversarial transformations (e.g., rotation and translation), Chen et al. (Chen et al., 2020b) proposed to leverage a pre-trained CNN to extract features and compare the mean feature similarity between the current query and the last 50 queries from the same user to identify potential attacks. This method can be easily bypassed by Sybil attacks in which the adversary creates multiple fake accounts to evade detection. The Blacklight detection method (Li et al., 2022a) computes the feature similarity (according to the Hamming distance) between the current query and each of the historical queries from all users, and flags the query if any similarity score is above a certain threshold. Blacklight is thus robust to Sybil attacks. However, it has been shown that existing stateful detection methods all suffer from a poor tradeoff between the detection rate and false positive rate (Hooda et al., 2023), i.e., thresholds set for a high detection rate tend to cause a high false positive rate. This will greatly harm the experience of benign users. Furthermore, the above detection methods have been bypassed by an adaptive attack that attempts to generate dissimilar queries using adaptive step sizes (Feng et al., 2023).

Arguably, the key to the reliable detection of query-based attacks is training a robust feature extractor that always produces similar feature vectors for any two adversarial queries crafted from the same image, even for adaptive attacks. In light of this, we propose a simple yet effective framework, Adversarial Contrastive Prompt Tuning (ACPT), to train reliable feature extractors for accurate and robust detection of query-based attacks. Specifically, ACPT finetunes the CLIP image encoder on ImageNet via prompt tuning using two types of losses: 1) contrastive losses to pull together the representations of a clean image and all its adversarial counterparts under data augmentations, and 2) adversarial losses to make it robust to adaptive attacks. Although only finetuned on ImageNet, ACPT demonstrates superb zero-shot capability and achieves the best detection performance across a wide range of datasets.

In summary, our main contributions are:

  • We propose a novel Adversarial Contrastive Prompt Tuning (ACPT) framework that can train robust feature extractors for stateful detection of query-based attacks.

  • We conduct extensive experiments on 5 benchmark datasets against 7 query-based attacks, and show that ACPT can achieve average detection rates of 97% and 99% under 3-shot and 5-shot detection, surpassing the best baseline by >48% and >49%, respectively.

  • We also show that ACPT is robust to adaptive attacks created by either plugging in an adaptive strategy to existing attacks or a new adaptive strategy that exploits the CLIP image encoder backbone to evade the detection.

Figure 2: An overview of our proposed AdvQDet framework. The current query (e.g., an image) is compared in the embedding space of the CLIP image encoder (finetuned by our ACPT method) with all past queries to detect whether there exists a similar historical embedding. Once the query is detected as an attack (i.e., a similar historical embedding is found), a cached output from its last queries can be directly returned to avoid returning new information to the adversary.

2 Related Work

Here, we briefly review related works on query-based attacks and stateful detection. We also review existing adversarial contrastive learning techniques, which are closely related to our adversarial contrastive prompt tuning approach.

Query-based Attacks. These attacks query the target model repetitively with adversarial examples generated at intermediate steps to obtain more information to enhance the attack. Based on the return type of the target model, query-based attacks can be categorized into score-based attacks (the target model returns confidence scores) and decision-based attacks (the target model returns category labels). The zeroth order optimization (ZOO) (Chen et al., 2017) attack is one classic score-based attack that exploits finite differences to estimate the adversarial gradients. Compared to ZOO, the autoencoder-based ZOOM (AutoZOOM) (Tu et al., 2019) attack effectively lowers the average query count required to find successful adversarial examples. Ilyas et al. (Ilyas et al., 2018) explored a variant of Natural Evolutionary Strategies (NES) to estimate the adversarial gradient under more restrictive threat models. Andriushchenko et al. (Andriushchenko et al., 2020) further introduced a query-efficient score-based black-box attack, the Square attack, using a randomized search scheme.

For decision-based attacks, the confidence scores are no longer accessible to the adversary, who can only use the label information as a substitute. The Boundary attack (Brendel et al., 2018) and the label-only version of the NES attack (Ilyas et al., 2018) are pioneering works in this field. Cheng et al. (Cheng et al., 2019) proposed a novel OPT approach to formulate decision-based attacks as real-valued optimization problems. By using the sign of the gradients rather than the raw gradients, Cheng et al. (Cheng et al., 2020) further introduced a query-efficient Sign-OPT method to overcome the query limitations faced by all query-based attacks. Another notable method, HopSkipJumpAttack (HSJA) (Chen et al., 2020a), employs unbiased gradient estimation at the decision boundary to make the attack more efficient. Following this, an array of decision-based attacks, such as QEBA (Li et al., 2020) and SurFree (Maho et al., 2021), have been developed to reduce the number of queries required to attack unseen DNNs, or decrease the maximum allowed perturbation strength (Chen & Gu, 2020).

Table 1: A summary of different stateful detection methods.

Method       Encoder          Metric        Action
SD           CNN Encoder      $L_2$ Norm    Ban Account
Blacklight   Pixel-SHA        Hamming       Reject Query
PIHA         Percept. Hash    Hamming       Reject Query
Ours         ACPT             Cosine        Return Cache

Stateful Detection. The intuition behind stateful detection is the fact that query-based attacks need to query the target model many times with highly similar queries, as part of the exploration process to find successful adversarial examples. It is thus expected that malicious queries with high similarities can be easily detected in either the pixel or representation space. The stateful detection (SD) method introduced in (Chen et al., 2020b) was the first to examine the users’ historical queries to detect query-based attacks. Specifically, SD first extracts the feature of the current query (e.g., an image) using an image encoder and then computes the $L_2$ distance between the query feature and its $k$-nearest neighbors found in historical queries of the same user. SD is not robust to Sybil attacks where the adversary creates many fake accounts to distribute the queries and evade user-wise detection. Unlike SD, the Blacklight (Li et al., 2022a) detection method replaces the feature extractor with the Pixel-SHA probabilistic hash function, which calculates the hash representation for the input image. By further creating a global hash-table to store the historical queries of all users, it establishes a lightweight detection module that can efficiently address the problem of Sybil attacks. Based on Blacklight, PIHA (Choi et al., 2023) adopts the perceptual image hashing scheme as its feature extractor. It has been shown that stateful detection is also effective against model extraction attacks, which also require a large number of queries to the target model. For example, the PRADA (Juuti et al., 2019) method detects model extraction attacks by analyzing the distribution of consecutive API queries from a user and its deviation from a Gaussian distribution. The SEAT (Zhang et al., 2021) method acquires a similarity encoder via adversarial training, which enables the identification of accounts conducting model extraction attacks. A summary of these methods can be found in Table 1.

Adversarial Contrastive Learning. Contrastive learning (CL) (Oord et al., 2018; Chen et al., 2020c; He et al., 2020) is a self-supervised representation learning technique that leverages large-scale unlabeled datasets to train powerful feature extractors. Recently, the concepts of adversarial contrastive learning (ACL) (Jiang et al., 2020; Kim et al., 2020; Ho & Nvasconcelos, 2020; Fan et al., 2021; Luo et al., 2023; Xu et al., 2024b, a) and adversarial prompt tuning (APT) (Zhang et al., 2024, 2024) have been explored as a robust representation learning technique to combine adversarial training with contrastive learning or prompt tuning. Inspired by SimCLR (Chen et al., 2020c), Jiang et al. (Jiang et al., 2020) introduced an unsupervised robust pre-training framework that effectively combines adversarial learning with contrastive pre-training. To avoid implicit knowledge of invariance caused by static augmentation, Dynamic Adversarial Contrastive Learning (DYNACL) (Luo et al., 2023) employs a dynamic augmentation schedule to bridge the gap between training and test data distributions. Xu et al. (Xu et al., 2024a) further incorporated causal reasoning and robustness-aware coreset selection (RCS) to help interpret ACL and improve its performance.

3 Proposed Detection Framework

We first describe our threat model, formulate the detection problem, and then introduce the proposed AdvQDet framework.

3.1 Threat Model

In this work, we assume a query-based black-box threat model where the adversary generates adversarial examples to attack a target model by making multiple queries to the model and using the model returns to optimize the adversarial examples iteratively. Here, the defender is the owner of the target model who can deploy any defense strategies to defend against potential attacks. In this work, we focus on detection-based defense, which can be deployed in parallel with other defense strategies. However, the defender does not know which user is the attacker nor when the malicious query will arrive. Therefore, the defender may have to store a large number of historical queries of all users to allow a long-range detection of malicious queries. As such, there exists a tradeoff between query storage and detection range. The goal of the defender is to detect any query-based attacks within a minimum number of attempts by the attacker, which forms a few-shot detection setting. There may also exist adaptive attacks that exploit adaptive strategies to evade detection.

Figure 3: Our proposed ACPT method. It finetunes the CLIP image encoder using two contrastive losses defined on cleanly and adversarially paired images obtained from the same clean image via data augmentation followed by the PGD attack.

3.2 Problem Formulation

We denote $f_{\theta}(\mathbf{x}) \to y$ as a DNN parameterized by $\theta$, where $\mathbf{x} \in \mathcal{X}$ is a clean sample and $y \in \mathcal{Y}$ is its ground-truth label. In image classification tasks, $\mathbf{x}$ represents a clean image, and $y \in \{y_1, y_2, \ldots, y_k\}$ is its categorization label; whereas in image captioning tasks, $\mathbf{x}$ is a clean image and $y$ is its associated caption. Given a clean sample $\mathbf{x} \in [0,1]^d$ and a target model $f_{\theta}(\cdot)$, a query-based adversarial attack aims to generate an adversarial example $\mathbf{x}'$ that maximizes the loss of the model as follows:

$$\mathbf{x}' = \operatorname*{arg\,max}_{\|\mathbf{x}' - \mathbf{x}\|_{\infty} \leq \epsilon} \ell(f(\mathbf{x}'), y), \tag{1}$$

where $\ell(\cdot)$ is the loss function, $\mathbf{x}'$ is an intermediate-step adversarial example, and $\epsilon$ is the perturbation budget. An adversarial attack can either be untargeted as formulated above or targeted toward a target label $y'$. Please note that our work does not differentiate between targeted and untargeted attacks.

A query-based black-box attack solves the above adversarial optimization problem by estimating the adversarial gradients iteratively as follows:

$$\mathbf{x}'_{t+1} = \mathbf{x}'_{t} + \eta \operatorname{sign}(\hat{\bm{g}}), \tag{2}$$

where $\mathbf{x}'_{t}$ is the intermediate adversarial example obtained at the $t$-th iteration, $\eta$ is the perturbation step size, $\operatorname{sign}(\cdot)$ is the sign function, and $\hat{\bm{g}}$ is the estimated gradient based on target model output $f(\mathbf{x}'_{t})$ using a black-box optimization method such as finite difference (Chen et al., 2017) or NES (Ilyas et al., 2018).
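To illustrate why the queries issued while evaluating Eq. (2) stay close to one another, the following is a minimal sketch of one attack iteration in plain Python. It is not the implementation of any cited attack: the function names, the antithetic Gaussian sampling, the toy linear loss, the default values of `sigma`, `n_pairs`, and `eta`, and the projection back onto the $\epsilon$-ball after the step are all assumptions made for illustration.

```python
import random


def nes_gradient(loss_fn, x, sigma=0.01, n_pairs=10, rng=None):
    """Estimate the gradient of loss_fn at x from loss queries only.

    Antithetic Gaussian sampling in the style of NES (illustrative only).
    Every evaluated point is returned so the caller can inspect the queries.
    """
    rng = rng or random.Random(0)
    grad = [0.0] * len(x)
    queries = []
    for _ in range(n_pairs):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        x_plus = [xi + sigma * ui for xi, ui in zip(x, u)]
        x_minus = [xi - sigma * ui for xi, ui in zip(x, u)]
        queries.extend([x_plus, x_minus])
        diff = loss_fn(x_plus) - loss_fn(x_minus)
        for i, ui in enumerate(u):
            grad[i] += diff * ui / (2.0 * sigma * n_pairs)
    return grad, queries


def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def attack_step(loss_fn, x_adv, x_clean, eta=0.01, eps=0.05):
    """One iteration of Eq. (2), then projection (an assumption of this sketch)."""
    g_hat, queries = nes_gradient(loss_fn, x_adv)
    x_next = []
    for xa, xc, g in zip(x_adv, x_clean, g_hat):
        stepped = xa + eta * sign(g)
        stepped = min(max(stepped, xc - eps), xc + eps)  # stay in the L_inf ball
        x_next.append(min(max(stepped, 0.0), 1.0))       # stay in the pixel range
    return x_next, queries


# Hypothetical stand-in for the loss of the target model.
weights = [0.3, -0.2, 0.5, 0.1]
toy_loss = lambda x: sum(w * xi for w, xi in zip(weights, x))

x_clean = [0.5, 0.5, 0.5, 0.5]
x_next, queries = attack_step(toy_loss, x_clean, x_clean)
max_shift = max(abs(q - c) for query in queries for q, c in zip(query, x_clean))
print(len(queries), max_shift)
```

Each query differs from the current iterate only by small Gaussian noise, which is the kind of similarity that stateful detection relies on.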

For a current query $\mathbf{x}_t$, the task of stateful detection is to determine whether there exists a historical query $\mathbf{x}_k$ such that their similarity exceeds a certain threshold $\mu$. Formally, it is:

$$det(\mathbf{x}_t) = \begin{cases} 1, & \text{if}\ sim(E(\mathbf{x}_t), E(\mathbf{x}_k)) > \mu,\ \exists\, \mathbf{x}_k \in Q \\ 0, & \text{otherwise}, \end{cases} \tag{3}$$

where $det(\cdot)$ is the detection function, $sim(\cdot,\cdot)$ is the similarity function, $E(\cdot)$ is an encoder (feature extractor) that extracts the embedding of $\mathbf{x}_t$ or $\mathbf{x}_k$, $\mu$ is a threshold hyper-parameter, and $Q$ is an embedding bank that stores the embeddings of historical queries from all users. Here, a $det(\cdot)$ value of $1$ indicates an attack. Note that $E(\cdot)$ is a different model from the target model $f(\cdot)$; it is a CLIP (Radford et al., 2021) image encoder adversarially finetuned by our ACPT method.
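The decision rule in Eq. (3) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: embeddings are given as lists of floats, cosine similarity is used for $sim(\cdot,\cdot)$, and the function names `cosine_similarity` and `det`, as well as the example threshold of 0.9, are chosen here and are not taken from the released code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def det(query_embedding, bank, mu):
    """Eq. (3): return 1 if any stored embedding has similarity above mu, else 0."""
    return int(any(cosine_similarity(query_embedding, e) > mu for e in bank))


# Hypothetical 2-D embeddings; real CLIP embeddings have many more dimensions.
bank = [[0.99, 0.10], [0.00, 1.00]]
print(det([1.0, 0.0], bank, mu=0.9))    # near-duplicate of the first entry
print(det([-1.0, 0.0], bank, mu=0.9))   # dissimilar to every entry
```

An empty bank never triggers a detection, so the first query from any source is always answered by the target model.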

3.3 AdvQDet Framework

3.3.1 Overview

As illustrated in Figure 2, AdvQDet consists of 2 main components: 1) the ACPT finetuned image encoder and 2) a similarity calculation module. The detection procedure of AdvQDet is as follows. For a current query $\mathbf{x}_t$, it first feeds the image into the ACPT finetuned image encoder to extract its embedding. The similarity calculation module then compares the embedding with the $N-1$ historical embeddings of the past queries (from all users) to compute the similarity scores. If any of the $N-1$ similarity scores, say that of a past query $\mathbf{x}_k$, is above a pre-defined threshold $\mu$, query $\mathbf{x}_t$ is flagged as a potential attack. Instead of rejecting $\mathbf{x}_t$, one plausible defense action is to simply return the cached output for $\mathbf{x}_k$. Note that existing detection methods employ two types of strategies for the embedding bank $Q$. The SD method (Chen et al., 2020b) creates a local bank for each user, while the later methods Blacklight (Li et al., 2022a) and PIHA (Choi et al., 2023) maintain a global bank for all users. Our AdvQDet also adopts the global bank strategy as it is robust to Sybil attacks. Next, we will introduce the two components in detail.
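The detection procedure described above can be sketched as a small wrapper around the target model. This is a hedged illustration rather than the released AdvQDet code: the class name `StatefulDetector`, the bounded `deque` used as the global bank, the `capacity` default, and the choice to store a flagged query together with the cached output it received are all assumptions. The `encoder` and `model` arguments are placeholders for the ACPT-finetuned encoder and the protected model.

```python
import math
from collections import deque


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class StatefulDetector:
    """Global embedding bank shared by all users (hypothetical sketch)."""

    def __init__(self, encoder, model, mu, capacity=1000):
        self.encoder = encoder              # stand-in for the finetuned encoder E
        self.model = model                  # the protected target model f
        self.mu = mu                        # similarity threshold
        self.bank = deque(maxlen=capacity)  # (embedding, output) pairs

    def query(self, x):
        """Return (output, flagged) for one incoming query."""
        emb = self.encoder(x)
        best_sim, best_out = -1.0, None
        for past_emb, past_out in self.bank:
            s = _cosine(emb, past_emb)
            if s > best_sim:
                best_sim, best_out = s, past_out
        flagged = best_sim > self.mu
        # On a hit, reuse the cached output so no new information is returned.
        output = best_out if flagged else self.model(x)
        self.bank.append((emb, output))
        return output, flagged


# Toy usage: identity encoder and a model that records how often it is called.
calls = []

def toy_model(x):
    calls.append(x)
    return sum(x)

detector = StatefulDetector(encoder=lambda x: x, model=toy_model, mu=0.95)
out1, flag1 = detector.query([1.0, 0.0])
out2, flag2 = detector.query([1.0, 0.01])   # near-duplicate of the first query
out3, flag3 = detector.query([0.0, 1.0])    # unrelated query
print(flag1, flag2, flag3, len(calls))
```

Because the bank is shared across users, splitting near-duplicate queries over several accounts does not change the outcome in this sketch.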

3.3.2 Adversarial Contrastive Prompt Tuning

As depicted in Figure 3, ACPT adopts a two-stream contrastive prompt tuning paradigm (Jiang et al., 2020): a clean stream and an adversarial stream. In the clean stream, the two augmented views (e.g., $\tilde{\mathbf{x}}_i$ and $\tilde{\mathbf{x}}_j$) of a clean image $\mathbf{x}$ form a Clean-to-Clean (C2C) pair. The purpose of the clean stream is to pull together the augmented versions of the same image, making the encoder robust to different types of image transformations. In the adversarial stream, the adversarial examples of the two augmented images are generated using PGD (Madry et al., 2017) to form an Adversarial-to-Adversarial (A2A) pair. The purpose of the adversarial stream is to make the encoder robust to adaptive attacks that exploit adversarial perturbations to bypass the detection. Together, the two streams robustify the image encoder against both regular transformations and adversarial perturbations. Note that the clean stream itself is the standard SimCLR (Chen et al., 2020c).

To exploit the superb feature extraction capability of large-scale pre-trained models, we adopt the image encoder of CLIP (Radford et al., 2021) and apply ACPT to finetune the encoder on ImageNet. ACPT adopts visual prompt tuning with learnable prompt tokens concatenated to the original input tokens. The tuning loss of ACPT is defined as follows:

$$\ell_{NT}(\tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}_j; p) = -\log \frac{\exp(sim(E(\tilde{\mathbf{x}}_i, p), E(\tilde{\mathbf{x}}_j, p))/\tau)}{\sum_{k=1}^{2N} \exp(sim(E(\tilde{\mathbf{x}}_i, p), E(\tilde{\mathbf{x}}_k, p))/\tau)}, \tag{4}$$

$$\ell_{ANT}(\tilde{\mathbf{x}}'_i, \tilde{\mathbf{x}}'_j; p) = -\log \frac{\exp(sim(E(\tilde{\mathbf{x}}'_i, p), E(\tilde{\mathbf{x}}'_j, p))/\tau)}{\sum_{k=1}^{2N} \exp(sim(E(\tilde{\mathbf{x}}'_i, p), E(\tilde{\mathbf{x}}'_k, p))/\tau)}, \tag{5}$$

$$\ell_{\text{ACPT}} = \alpha\, \ell_{NT}(\tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}_j; p) + (1 - \alpha)\, \ell_{ANT}(\tilde{\mathbf{x}}'_i, \tilde{\mathbf{x}}'_j; p), \tag{6}$$

where $p$ is the prompt token, $E(\cdot)$ is the CLIP image encoder, $sim(\cdot,\cdot)$ is the cosine similarity function, $\tau$ is the temperature, and $\alpha = 0.5$ is a hyperparameter balancing the two loss terms.
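As a concrete reading of Eqs. (4) to (6), the sketch below evaluates the loss for one anchor and positive pair on precomputed embeddings, using only the Python standard library. It leaves out the encoder, the prompt tokens, PGD, and gradient computation, so it is not a training implementation. The function names, the temperature default of 0.5, and the `exclude_self` flag are assumptions: the denominator as printed sums over all $2N$ embeddings, whereas SimCLR usually omits the $k = i$ term, so the flag allows either reading.

```python
import math


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nt_xent(embeddings, i, j, tau=0.5, exclude_self=False):
    """Eq. (4)/(5) for anchor i and positive j over a batch of 2N embeddings."""
    logits = [_cosine(embeddings[i], e) / tau for e in embeddings]
    ks = [k for k in range(len(embeddings)) if not (exclude_self and k == i)]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits[k] for k in ks)
    log_denominator = m + math.log(sum(math.exp(logits[k] - m) for k in ks))
    return log_denominator - logits[j]


def acpt_loss(clean_embeddings, adv_embeddings, i, j, alpha=0.5, tau=0.5):
    """Eq. (6): weighted sum of the clean and adversarial contrastive terms."""
    clean_term = nt_xent(clean_embeddings, i, j, tau)
    adv_term = nt_xent(adv_embeddings, i, j, tau)
    return alpha * clean_term + (1.0 - alpha) * adv_term


# Hypothetical 2-D embeddings: views 0 and 1 come from one image, 2 and 3 from another.
clean = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
adv = [[0.9, 0.1], [1.0, 0.2], [0.1, 0.9], [0.2, 1.0]]
print(nt_xent(clean, 0, 1), nt_xent(clean, 0, 2), acpt_loss(clean, adv, 0, 1))
```

The loss is small when the positive pair is aligned and the other embeddings in the batch point elsewhere, which is the property the detector later exploits.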

Comparing the definition of $\ell_{\text{ACPT}}$ and Eq. (3), one might find that $\ell_{\text{ACPT}}$ directly optimizes the feature similarity between the clean and adversarial image pairs. This effectively reduces the difference between variants of the same image in the latent space, making the detection of query attacks much easier.

3.3.3 Similarity Calculation

Following prior works (Li et al., 2022a; Choi et al., 2023), we extract and save the embedding of each query image into an embedding bank $Q$. The embedding bank is maintained globally for all users so as to be robust to Sybil attacks. Two problems arise with the embedding bank: 1) the storage cost and 2) the computational cost. The two costs can be reduced by using the techniques introduced in (Chen et al., 2020b). Next, we will provide an analysis of the two costs and show that it is practically feasible to store a global embedding bank and perform the similarity search efficiently.

In terms of the storage cost, each query results in a vector embedding with dimension $d = 512$, which takes 2048 bytes at float32 precision. Suppose there are 1 million users, each querying 100 times; storing all these query embeddings then takes approximately 190.73 GB. By switching to float16 precision, the storage can be reduced to approximately 95.37 GB.
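The storage figures above can be reproduced with a short calculation. The helper name `bank_size_gib` is chosen here for illustration; note that the quoted numbers correspond to dividing the byte count by $2^{30}$, i.e., they are gibibytes in strict terms.

```python
def bank_size_gib(n_users, queries_per_user, dim, bytes_per_value):
    """Size of the embedding bank in units of 2**30 bytes."""
    total_bytes = n_users * queries_per_user * dim * bytes_per_value
    return total_bytes / 2**30


bytes_per_embedding = 512 * 4                      # 2048 bytes at float32
fp32 = bank_size_gib(1_000_000, 100, 512, 4)       # float32: 4 bytes per value
fp16 = bank_size_gib(1_000_000, 100, 512, 2)       # float16: 2 bytes per value
print(f"{bytes_per_embedding} bytes, {fp32:.2f} vs {fp16:.2f}")
# prints "2048 bytes, 190.73 vs 95.37"
```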

In terms of computational cost, one can use the Automatic Mixed Precision (AMP) technique to reduce the memory cost and accelerate computations without sacrificing the detection performance. AMP automatically determines the appropriate precision (single or half) for each operation. When calculating the cosine similarity between an individual embedding vector and each embedding in the embedding bank, the computational complexity is $O(n \times d)$, where $n$ is the number of embeddings in the bank and $d$ is the dimension of the embedding vector. There are established techniques we can use to speed up high-dimensional similarity searches, such as product quantization (PQ), hierarchical navigable small worlds (HNSW), and locality-sensitive hashing (LSH). Popular similarity search tools like clip retrieval (Beaumont, 2022), Faiss (Johnson et al., 2019), and AutoFaiss all provide efficient solutions for searching over a large-scale vector database. Here, we conduct an efficiency test to compute the cosine similarity between two tensors of shapes $(1, 512)$ and $(1\text{M}, 512)$ using an NVIDIA RTX 3090 GPU, CUDA 11.3, and PyTorch v1.12.0. It takes 8.29 and 2.63 milliseconds for float32 and float16, respectively. These costs are manageable for an AI company to run a commercial product/service that supports up to 1 million users.

3.3.4 Defense Action.

Once a query is detected as an attack, the defender can take one of several defense actions: 1) rejecting the query, which is applicable only when the false positive rate is low, as it may otherwise harm the user experience; 2) limiting the number and frequency of the user's queries, which will attract the attacker's attention; 3) returning intentionally perturbed outputs to the user, which still risks leaking gradient (or other) information; 4) banning accounts or blocking IP addresses, which is an aggressive action that should be taken only in extreme cases; and 5) simply returning the cached output of the previous similar query, which is a plausible action that neither exposes new information to the user nor harms the user experience.
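
The fifth action (returning the cached output of the most similar previous query) can be sketched as follows. This is an illustrative sketch, not the released implementation: the class name, the `encoder` and `model` callables, and the choice to also store flagged queries in the bank are assumptions made for the example.

```python
import math
from typing import Any, Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


class CachedResponseDefender:
    """Hypothetical wrapper that answers near-duplicate queries from a cache."""

    def __init__(self, encoder: Callable[[Any], Sequence[float]],
                 model: Callable[[Any], Any], threshold: float = 0.95):
        self.encoder = encoder      # maps a query to its embedding
        self.model = model          # the protected target model
        self.threshold = threshold  # similarity threshold
        self.bank: List[Tuple[List[float], Any]] = []  # (embedding, output)

    def respond(self, query: Any) -> Tuple[Any, bool]:
        """Return (output, flagged) for a query."""
        emb = list(self.encoder(query))
        best_sim, best_out = -math.inf, None
        for stored_emb, stored_out in self.bank:
            sim = cosine(emb, stored_emb)
            if sim > best_sim:
                best_sim, best_out = sim, stored_out
        flagged = best_sim >= self.threshold
        # A flagged query reuses the cached output, so the target model is
        # not evaluated and no new information is revealed.
        output = best_out if flagged else self.model(query)
        self.bank.append((emb, output))  # assumption: every query is stored
        return output, flagged


# Toy usage: queries are already 2-d "embeddings", the model is a stub.
calls = []

def toy_model(x):
    calls.append(x)
    return "cat" if x[0] > x[1] else "dog"

defender = CachedResponseDefender(encoder=lambda x: x, model=toy_model)
out1, flagged1 = defender.respond([1.0, 0.0])    # first query: answered normally
out2, flagged2 = defender.respond([0.99, 0.05])  # near-duplicate: cached answer
out3, flagged3 = defender.respond([0.0, 1.0])    # unrelated: answered normally
```

In this toy run the target model is evaluated only for the first and third queries; the near-duplicate second query receives the first query's output unchanged.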

4 Experiments

We evaluated our detection method against 7 state-of-the-art query-based attacks and 3 types of adaptive attacks. We first describe our experimental setting and then present the results of 1) defense effectiveness across different datasets, 2) robustness to adaptive attacks, and 3) ablation study.

4.1 Experimental Setup

Datasets and Models. We experiment on 5 benchmark datasets: CIFAR-10 (Krizhevsky et al., 2009), GTSRB (Stallkamp et al., 2012), ImageNet (Russakovsky et al., 2015), Flowers (Nilsback & Zisserman, 2008), and Pets (Parkhi et al., 2012). We use ImageNet pre-trained models (such as ResNet20, ResNet101, and ViT-B/16) and fine-tune them on the other four datasets. A summary of these datasets and the corresponding models can be found in the Appendix.

Attack Configuration. We evaluate against 7 query-based attacks, including Boundary (Brendel et al., 2018), HSJA (Chen et al., 2020a), NESS (Ilyas et al., 2018), QEBA (Li et al., 2020), Square (Andriushchenko et al., 2020), SurFree (Maho et al., 2021), and ZOO (Chen et al., 2017), as described in Section 2. We also apply an adaptive strategy called Oracle-guided Adaptive Rejection Sampling (OARS) (Feng et al., 2023) to enhance the above query-based attacks and evaluate against these enhanced attacks. OARS utilizes an adaptive distribution and a resampling technique for gradient estimation, aiming to evade stateful defenses during the generation of adversarial examples. Throughout the experiments, we execute each attack until an adversarial example is successfully crafted or the maximum query limit is reached, whichever occurs first. The hyperparameters for these attacks are set following the Adversarial Robustness Toolbox (ART) library (Nicolae et al., 2018). For the attacks, we set the perturbation budget to $\epsilon=0.05$ and limit the query budget to 100,000. For the CIFAR-10 and GTSRB datasets, we randomly choose 1,000 images from their respective test sets, uniformly across all categories. For the ImageNet, Flowers, and Pets datasets, due to the high computational costs of query-based attacks, we randomly select 100 images from the validation/test sets.
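
The stopping rule used for every attack (stop at the first adversarial example or when the query budget is exhausted, whichever comes first) can be sketched as a simple loop. The function and argument names below are hypothetical, and the sketch assumes, for simplicity, that each attack iteration issues exactly one query; real attacks may issue several queries per iteration.

```python
from typing import Any, Callable, Tuple


def run_until_success_or_budget(step: Callable[[Any], Any],
                                is_adversarial: Callable[[Any], bool],
                                x0: Any,
                                max_queries: int = 100_000
                                ) -> Tuple[Any, bool, int]:
    """Iterate an attack until it succeeds or the query budget is used up.

    `step` performs one attack iteration (assumed to cost one query) and
    returns the next candidate; returns (candidate, success, queries_used).
    """
    x, queries = x0, 0
    while queries < max_queries:
        x = step(x)
        queries += 1
        if is_adversarial(x):
            return x, True, queries
    return x, False, queries


# Toy stand-ins: the "attack" increments a counter and "succeeds" at 5.
_, ok, used = run_until_success_or_budget(lambda x: x + 1, lambda x: x >= 5, 0)
_, ok_small, used_small = run_until_success_or_budget(
    lambda x: x + 1, lambda x: x >= 5, 0, max_queries=3)
```

With the default budget the toy attack stops at its first success; with a budget of 3 it is cut off before succeeding and counts as a failed attack.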

Defense Configuration. For existing stateful detection methods, we use their originally proposed configurations, as detailed in Table 1. Specifically, for the SD (Chen et al., 2020b) defense, we set the number of neighbors to $k=50$ and the detection threshold to $\mu=10$. For Blacklight (Li et al., 2022a), the quantization step is set to 50, with window sizes of 20 for CIFAR-10 and 50 for ImageNet. PIHA (Choi et al., 2023) adopts a block size of $7\times 7$ and a detection threshold of $\mu=0.05$.

Table 2: The ASR ($\downarrow$), 3/5-shot detection rate ($\uparrow$), and mean detection counts ($\downarrow$) of different detection methods against 7 query-based attacks across 5 datasets. The best and second-best results are boldfaced and underscored, respectively.
| Dataset | Attack | w/o Defense: ASR | w/o Defense: Query | Blacklight: ASR | Blacklight: 3/5-shot DR | Blacklight: mDC | PIHA: ASR | PIHA: 3/5-shot DR | PIHA: mDC | AdvQDet (Ours): ASR | AdvQDet (Ours): 3/5-shot DR | AdvQDet (Ours): mDC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | Boundary | 100% | 591.97 | 0% | 94%/97% | 3.23 | 0% | 75%/93% | 3.87 | 0% | 100%/100% | 3.00 |
| CIFAR-10 | HSJA | 100% | 265.11 | 0% | 0%/0% | 7.28 | 0% | 1%/14% | 7.77 | 0% | 76%/100% | 2.90 |
| CIFAR-10 | NESS | 100% | 15144.82 | 0% | 100%/100% | 3.00 | 0% | 89%/97% | 3.64 | 0% | 98%/98% | 2.81 |
| CIFAR-10 | QEBA | 100% | 316.41 | 0% | 0%/0% | 7.28 | 0% | 1%/14% | 7.77 | 0% | 76%/100% | 2.90 |
| CIFAR-10 | Square | 100% | 17.37 | 0% | 100%/100% | 2.00 | 28% | 61%/64% | 2.96 | 0% | 100%/100% | 2.00 |
| CIFAR-10 | SurFree | 100% | 77.13 | 0% | 0%/0% | 8.66 | 0% | 3%/10% | 8.85 | 0% | 100%/100% | 2.00 |
| CIFAR-10 | ZOO | 71% | 16649.93 | 0% | 100%/100% | 2.00 | 0% | 100%/100% | 2.00 | 0% | 100%/100% | 2.00 |
| ImageNet | Boundary | 100% | 5776.94 | 4% | 16%/19% | 238.21 | 8% | 0%/0% | 228.88 | 0% | 100%/100% | 3.00 |
| ImageNet | HSJA | 74% | 79621.63 | 0% | 0%/0% | 8.51 | 0% | 0%/1% | 9.56 | 0% | 83%/100% | 3.86 |
| ImageNet | NESS | 99% | 13276.7 | 0% | 100%/100% | 3.07 | 10% | 19%/21% | 266.88 | 0% | 99%/100% | 2.51 |
| ImageNet | QEBA | 59% | 55173.28 | 0% | 0%/0% | 8.51 | 0% | 0%/1% | 9.56 | 0% | 83%/100% | 3.86 |
| ImageNet | Square | 100% | 108.2 | 0% | 100%/100% | 2.00 | 30% | 22%/24% | 9.1 | 0% | 100%/100% | 2.00 |
| ImageNet | SurFree | 100% | 534.95 | 0% | 0%/0% | 9.02 | 0% | 0%/1% | 9.68 | 0% | 100%/100% | 2.04 |
| ImageNet | ZOO | 75% | 9986.08 | 0% | 100%/100% | 2.00 | 0% | 99%/99% | 4.26 | 0% | 100%/100% | 2.00 |
| GTSRB | Boundary | 100% | 1908.37 | 0% | 100%/100% | 3.03 | 0% | 81%/93% | 3.97 | 0% | 100%/100% | 3.00 |
| GTSRB | HSJA | 100% | 1808.87 | 0% | 0%/0% | 7.29 | 0% | 11%/56% | 6.47 | 0% | 100%/100% | 2.56 |
| GTSRB | NESS | 49% | 51501.31 | 0% | 100%/100% | 3.00 | 0% | 50%/77% | 5.16 | 0% | 95%/96% | 4.84 |
| GTSRB | QEBA | 100% | 780.26 | 0% | 0%/0% | 7.29 | 0% | 11%/56% | 6.47 | 0% | 100%/100% | 2.58 |
| GTSRB | Square | 100% | 2577.15 | 0% | 100%/100% | 2.00 | 7% | 71%/71% | 3.68 | 0% | 100%/100% | 2.00 |
| GTSRB | SurFree | 75% | 225.77 | 0% | 0%/5% | 7.84 | 0% | 16%/51% | 6.56 | 0% | 100%/100% | 2.00 |
| GTSRB | ZOO | 42% | 18708.50 | 0% | 100%/100% | 2.00 | 0% | 100%/100% | 2.00 | 0% | 100%/100% | 2.00 |
| Flowers | Boundary | 96% | 5118.87 | 15% | 6%/9% | 297.24 | 25% | 0%/0% | 375.63 | 0% | 100%/100% | 3.00 |
| Flowers | HSJA | 56% | 59574.49 | 0% | 0%/0% | 8.67 | 0% | 0%/0% | 9.26 | 0% | 99%/100% | 3.77 |
| Flowers | NESS | 95% | 17092.08 | 0% | 100%/100% | 3.01 | 6% | 53%/64% | 101.58 | 0% | 99%/99% | 2.56 |
| Flowers | QEBA | 100% | 54968.15 | 0% | 0%/0% | 8.67 | 0% | 0%/0% | 9.26 | 0% | 99%/100% | 3.77 |
| Flowers | Square | 99% | 324.59 | 0% | 100%/100% | 2.00 | 29% | 48%/50% | 5.49 | 0% | 100%/100% | 2.00 |
| Flowers | SurFree | 99% | 1704.45 | 0% | 0%/0% | 9.98 | 0% | 0%/0% | 10.71 | 0% | 100%/100% | 2.00 |
| Flowers | ZOO | 87% | 9197.09 | 0% | 100%/100% | 2.00 | 0% | 98%/99% | 2.07 | 0% | 100%/100% | 2.00 |
| Pets | Boundary | 95% | 7958.55 | 3% | 16%/18% | 245.19 | 2% | 0%/0% | 197.13 | 0% | 100%/100% | 3.00 |
| Pets | HSJA | 97% | 2277.19 | 0% | 0%/0% | 8.61 | 0% | 0%/1% | 9.45 | 0% | 100%/100% | 3.61 |
| Pets | NESS | 94% | 23424.64 | 0% | 100%/100% | 3.07 | 12% | 6%/10% | 425.60 | 0% | 100%/100% | 2.00 |
| Pets | QEBA | 97% | 1061.13 | 0% | 0%/0% | 8.61 | 0% | 0%/1% | 9.45 | 0% | 100%/100% | 3.61 |
| Pets | Square | 100% | 148.85 | 0% | 100%/100% | 2.00 | 8% | 14%/14% | 10.45 | 0% | 100%/100% | 2.00 |
| Pets | SurFree | 100% | 754.22 | 0% | 0%/0% | 10.88 | 0% | 0%/2% | 11.08 | 0% | 100%/100% | 2.03 |
| Pets | ZOO | 86% | 7919.80 | 0% | 100%/100% | 2.00 | 0% | 100%/100% | 2.00 | 0% | 100%/100% | 2.00 |
| Average | | 90% | 17671.10 | 1% | 49%/50% | 27.12 | 5% | 32%/39% | 51.03 | 0% | 97%/99% | 2.66 |
Table 3: The ASR ($\downarrow$) and mean detection counts ($\downarrow$) of different detection methods against 6 query-based attacks enhanced by the OARS adaptive strategy. The results are shown for the CIFAR-10 and ImageNet datasets, with the best results being boldfaced.
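| Dataset | Attack | w/o Defense: ASR | w/o Defense: Query | SD: ASR | SD: mDC | Blacklight: ASR | Blacklight: mDC | PIHA: ASR | PIHA: mDC | AdvQDet: ASR | AdvQDet: mDC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | Boundary-OARS | 100% | 610.31 | 100% | 51.00 | 100% | 3.17 | 94% | 4.01 | 0% | 3.00 |
| CIFAR-10 | HSJA-OARS | 100% | 439.78 | 100% | 51.00 | 100% | 7.28 | 93% | 7.77 | 0% | 2.90 |
| CIFAR-10 | NESS-OARS | 100% | 969.14 | 53% | 51.00 | 97% | 596.50 | 97% | 381.78 | 0% | 3.00 |
| CIFAR-10 | QEBA-OARS | 100% | 457.14 | 100% | 51.00 | 98% | 7.28 | 93% | 7.77 | 0% | 2.90 |
| CIFAR-10 | Square-OARS | 100% | 183.64 | 100% | 51.00 | 98% | 64.94 | 100% | 83.85 | 0% | 3.20 |
| CIFAR-10 | SurFree-OARS | 100% | 170.52 | 65% | 51.41 | 92% | 8.66 | 61% | 8.85 | 0% | 2.00 |
| ImageNet | Boundary-OARS | 100% | 5743.65 | N/A | N/A | 37% | 194.75 | 39% | 208.10 | 0% | 3.00 |
| ImageNet | HSJA-OARS | 100% | 1908.77 | N/A | N/A | 93% | 9.00 | 98% | 9.56 | 0% | 3.86 |
| ImageNet | NESS-OARS | 100% | 5207.24 | N/A | N/A | 89% | 282.51 | 55% | 428.31 | 0% | 3.01 |
| ImageNet | QEBA-OARS | 100% | 1040.41 | N/A | N/A | 73% | 8.51 | 100% | 11.00 | 0% | 3.86 |
| ImageNet | Square-OARS | 99% | 840.77 | N/A | N/A | 83% | 40.88 | 99% | 70.87 | 0% | 2.53 |
| ImageNet | SurFree-OARS | 100% | 1519.29 | N/A | N/A | 87% | 9.02 | 100% | 9.68 | 0% | 2.04 |
| Average | | 99% | 1590.89 | 86.33% | 51.07 | 87.25% | 102.71 | 85.75% | 102.63 | 0% | 2.94 |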

Implementation Details. For our AdvQDet, we fine-tune the CLIP image encoder using ACPT for 20 epochs with a batch size of $bs=1024$ and a learning rate of 0.04 on ImageNet. To generate a batch of positive pairs for fine-tuning, we sample $bs$ images from the training set and then follow SimCLR to obtain two augmented views $(\tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}_j)$ of each image. We apply the PGD attack to craft the adversarial views $(\tilde{\mathbf{x}}'_i, \tilde{\mathbf{x}}'_j)$ with a perturbation budget of 8/255 for 5 steps. After obtaining the four views $(\tilde{\mathbf{x}}'_i, \tilde{\mathbf{x}}'_j, \tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}_j)$, we fine-tune the prompt tokens by minimizing the adversarial contrastive loss described in Section 3.3.2. There are $K=20$ learnable prompt tokens, optimized by SGD with the learning rate adjusted by cosine annealing.
For detection, we set a similarity threshold of $\mu=0.95$ for the low-resolution datasets CIFAR-10 and GTSRB, and $\mu=0.9$ for the high-resolution datasets ImageNet, Flowers, and Pets.
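
As an illustration of the learning-rate schedule mentioned above, the sketch below evaluates the standard cosine annealing rule for a base learning rate of 0.04 over 20 epochs. The per-epoch granularity, the minimum learning rate of 0, and the absence of warm-up are assumptions of this sketch; the text does not specify them.

```python
import math


def cosine_annealed_lr(epoch: int, total_epochs: int = 20,
                       base_lr: float = 0.04, min_lr: float = 0.0) -> float:
    """Standard cosine annealing: decays from base_lr towards min_lr."""
    cos_term = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cos_term)


# Learning rate used in each of the 20 fine-tuning epochs (epochs 0..19).
schedule = [cosine_annealed_lr(e) for e in range(20)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.5f}")
```

The schedule starts at the base learning rate, reaches half of it at the midpoint (epoch 10), and decreases monotonically in between and after.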

Performance Metrics. We consider three performance metrics: 1) attack success rate (ASR), which is the percentage of successful adversarial examples under the attack budget; 2) 3/5-shot detection rate (DR), which is the rate of successful detection when the defender has seen only the first 3 or 5 queries of an attack (i.e., a clean query followed by a sequence of adversarial queries); and 3) mean detection counts (mDC), which is the average number of queries required for the defender to detect each attack.
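
The three metrics can be computed from per-attack records as in the following sketch. The record format (the 1-based index of the first flagged query, or `None` if the attack was never flagged) and the choice to average mDC over detected attacks only are assumptions made for illustration; the text does not spell out how undetected attacks enter the mDC.

```python
from typing import Dict, List, Optional


def detection_metrics(first_flagged: List[Optional[int]],
                      attack_succeeded: List[bool]) -> Dict[str, float]:
    """Compute ASR, 3/5-shot detection rate, and mean detection counts.

    first_flagged[i] is the 1-based index of the query at which attack i
    was first detected, or None if it was never detected.
    """
    n = len(first_flagged)
    if n == 0 or n != len(attack_succeeded):
        raise ValueError("inputs must be non-empty and of equal length")
    detected = [q for q in first_flagged if q is not None]

    def k_shot_dr(k: int) -> float:
        # Fraction of all attacks detected within their first k queries.
        return sum(q <= k for q in detected) / n

    return {
        "ASR": sum(attack_succeeded) / n,
        "DR@3": k_shot_dr(3),
        "DR@5": k_shot_dr(5),
        # Assumption: averaged over detected attacks only.
        "mDC": sum(detected) / len(detected) if detected else float("nan"),
    }


# Four toy attack runs: three detected early, one never detected.
metrics = detection_metrics([2, 3, 4, None], [False, False, False, True])
print(metrics)
```

In the toy example, two of four attacks are detected within 3 queries and three of four within 5, and the single undetected attack is the only one that succeeds.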

4.2 Main Results

We compare our AdvQDet method with existing stateful detection methods. For a fair comparison, we adopt the same defense pipeline for all methods, i.e., we detect each query based on the historical queries from all users, and only the way the similarity score is computed differs across detection methods. The detection performance results are reported in Table 2, where the third and fourth columns report the results without any defense. It is evident that, although most attacks can achieve a high ASR (nearly 100%) in the absence of detection, they often require a large number of queries to succeed. According to the results, the Square attack is generally the most efficient and effective: it requires the fewest queries on four of the five datasets (GTSRB being the exception) and achieves an ASR of at least 99% on all datasets.

For the detection methods, our AdvQDet achieves the best average performance of 0% ASR, a 97%/99% 3/5-shot detection rate, and an average of 2.66 queries for successful detection, surpassing the existing methods Blacklight and PIHA by a large margin. Moreover, AdvQDet demonstrates the best performance and almost 100% 3/5-shot detection rates in most scenarios. It is not always the best, however: for example, Blacklight works better than AdvQDet against the NESS attack in terms of 3/5-shot detection rates. This is because the NESS attack uses a large Gaussian noise distribution to estimate the adversarial gradients, which tends to cause large distortions to the query images and thus to their features, whereas Blacklight extracts hashes of the image that are relatively robust to large perturbations. On the other hand, Blacklight fails badly against the HSJA, QEBA, and SurFree attacks, with almost 0% 3/5-shot detection rates. It is worth mentioning that AdvQDet is very close to Blacklight against NESS and detects the attack with fewer queries on four of the five datasets.

Although query-based attacks generally require many queries while detection needs only a few, some attacks can still bypass the existing detection methods Blacklight and PIHA, for example, the Square, NESS, and Boundary attacks on the high-resolution datasets ImageNet, Flowers, and Pets. By contrast, not a single existing query-based attack can evade our detection, leaving an ASR of 0% in all scenarios. Efficiency is another advantage of our AdvQDet method: it takes only 2.66 queries on average to detect all 7 attacks. Note that the first query made by most attacks is a clean image, the second query is often an initialized image with Gaussian noise, and the third query is an adversarial query. This means that our method can detect most of the attacks based on the first two queries, for example, against the HSJA and QEBA attacks.

4.3 Robustness Against Adaptive Attacks

Here, we evaluate the robustness of our method to adaptive attacks where the attackers are aware of our detection pipeline. Particularly, we consider three adaptive attacks: 1) using OARS (Feng et al., 2023) adaptive strategy to boost existing attacks; 2) the attacker knows the backbone (CLIP image encoder) of our AdvQDet; and 3) white-box attacks where the attacker knows every detail of our detector (but the target model is still black-box).

OARS Adaptive Attack. OARS employs step size adaptation and resampling mechanisms to evade stateful detection. We boost existing attacks including Boundary, HSJA, NESS, QEBA, Square, and SurFree using the OARS adaptive strategy. We did not consider the ZOO attack as its adaptive strategy is not compatible with OARS and it is also omitted from the OARS paper (Feng et al., 2023). The robustness results on the CIFAR-10 and ImageNet datasets are shown in Table 3. It is clear that when there is no defense, all adaptive attacks achieve an ASR of $\geq 99\%$, with the query number increasing significantly on high-resolution images (ImageNet).

Our AdvQDet is robust to OARS adaptive attacks and can successfully detect all 6 adaptive attacks within an average of 3 shots while reducing the ASR to 0%. The SD detection method, however, fails on ImageNet as its feature extractor is dataset-dependent and thus does not apply to ImageNet images. Since SD requires the last 50 queries to detect the current one, its mean detection counts are all above 50. Blacklight and PIHA have both been bypassed by all adaptive attacks, with ASRs ranging from 37% to 100%. Interestingly, Blacklight is more susceptible to adaptive attacks on the low-resolution dataset CIFAR-10, while PIHA is vulnerable on both the low- and high-resolution datasets CIFAR-10 and ImageNet.

The Backbone is Compromised. Here, we test the case where the attacker knows the CLIP image encoder used in AdvQDet (but not the visual prompt tokens). In this case, the attacker can mount a white-box attack on the CLIP image encoder while mounting a query attack on the target model. Specifically, the attacker adopts an alternating optimization strategy that first performs one step (query) of the black-box attack and then 10 steps of the white-box PGD attack. As shown in Figure 4, AdvQDet is also robust to this adaptive attack, maintaining a high similarity score for the first 50 queries. Moreover, AdvQDet becomes more robust when we increase the token length of ACPT.
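
The alternating schedule described above (one black-box query step followed by 10 white-box PGD steps) amounts to the control flow sketched below. The two step functions are hypothetical placeholders supplied by the caller; only the interleaving is illustrated, not the attacks themselves.

```python
from typing import Any, Callable


def alternating_attack(x: Any,
                       black_box_step: Callable[[Any], Any],
                       white_box_step: Callable[[Any], Any],
                       num_rounds: int,
                       white_box_steps: int = 10) -> Any:
    """Interleave one query-based step with several local white-box steps."""
    for _ in range(num_rounds):
        # One step of the query-based attack: costs a query to the target model.
        x = black_box_step(x)
        # Local PGD steps against the known encoder: no queries are issued.
        for _ in range(white_box_steps):
            x = white_box_step(x)
    return x


# Toy stand-ins that only count how often each kind of step is executed.
counts = {"queries": 0, "pgd_steps": 0}

def toy_black_box_step(x):
    counts["queries"] += 1
    return x

def toy_white_box_step(x):
    counts["pgd_steps"] += 1
    return x

alternating_attack(0.0, toy_black_box_step, toy_white_box_step, num_rounds=3)
```

Only the black-box steps are visible to the defender, so three rounds of this schedule produce three queries but thirty local PGD steps.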

White-box Attack. In this case, we follow a similar adaptive pipeline as in the above backbone adaptive attack setting, but the attacker directly attacks our ACPT-tuned image encoder. The results, also presented in Figure 4, indicate that AdvQDet is moderately robust to white-box attacks with a slightly reduced similarity score, and that increasing the token length of ACPT can effectively increase the chance of the attack being detected. Note that in both experiments, the detection is deemed successful whenever the similarity score is above the threshold, which occurs within the first 5 queries. We also observed that white-box attacks against our AdvQDet took roughly 100x more queries to converge. These results suggest that with ACPT, we can have a reliable query attack detector with good effectiveness, efficiency, and robustness.

Figure 4: The similarity score of the first 50 queries for backbone adaptive attacks (“BAA-x”) and white-box attacks (“WB-x”) on ImageNet, with x denoting the token length. The black dashed line marks the detection threshold.

5 Limitation

As a stateful detection method, our AdvQDet also faces certain limitations that deserve further research. Notably, it cannot defend against transfer-based attacks, as they do not require querying the target model. This limitation can potentially be addressed by incorporating white-box adversarial example detection methods into the pipeline of AdvQDet. The storage and computational costs are another limitation of AdvQDet. More effective partitioning and acceleration techniques can be developed in future work to facilitate the industrial deployment of AdvQDet. On the other hand, besides its effectiveness, efficiency, and robustness, AdvQDet has the potential to be applied to detect multimodal query-based attacks against vision language models (VLMs) like GPT-4V (OpenAI, 2023). Although there is still much room for improvement, we believe AdvQDet offers a reliable solution for detecting real-world adversarial attacks.

6 Conclusion

In this paper, we proposed a novel stateful detection framework to detect query-based black-box adversarial attacks. Our work is motivated by the observation that query-based attacks launch multiple visually similar queries to the target model, which might be easily detected by a robust feature extractor (image encoder). To this end, we proposed an efficient tuning-based method called Adversarial Contrastive Prompt Tuning (ACPT) to robustify the CLIP image encoder on ImageNet. The ACPT-tuned encoder serves as a general-purpose encoder for the detection of query-based attacks and demonstrates strong zero-shot generalization capability across different datasets. With ACPT, we introduced the AdvQDet framework, which extracts and saves the embeddings of the query images and maintains a global embedding bank for all users. AdvQDet computes the embedding similarity between the current query and all historical queries to identify whether the query is malicious (similar to an existing one). We demonstrated the effectiveness, efficiency, and robustness of AdvQDet against existing query-based attacks, adaptive attacks, and even white-box attacks. Our work showcases the possibility of achieving strong and consistent defense against query-based adversarial attacks.

References

  • Andriushchenko et al. (2020) Andriushchenko, M., Croce, F., Flammarion, N., and Hein, M. Square attack: a query-efficient black-box adversarial attack via random search. In European conference on computer vision, pp.  484–501. Springer, 2020.
  • Bai et al. (2021) Bai, Y., Zeng, Y., Jiang, Y., Xia, S.-T., Ma, X., and Wang, Y. Improving adversarial robustness via channel-wise activation suppressing. In International Conference on Learning Representations, 2021.
  • Beaumont (2022) Beaumont, R. Clip retrieval: Easily compute clip embeddings and build a clip retrieval system with them. https://github.com/rom1504/clip-retrieval, 2022.
  • Brendel et al. (2018) Brendel, W., Rauber, J., and Bethge, M. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. In International Conference on Learning Representations, 2018.
  • Cao et al. (2019) Cao, Y., Xiao, C., Yang, D., Fang, J., Yang, R., Liu, M., and Li, B. Adversarial objects against lidar-based autonomous driving systems. arXiv preprint arXiv:1907.05418, 2019.
  • Chen & Gu (2020) Chen, J. and Gu, Q. Rays: A ray searching method for hard-label adversarial attack. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp.  1739–1747, 2020.
  • Chen et al. (2020a) Chen, J., Jordan, M. I., and Wainwright, M. J. Hopskipjumpattack: A query-efficient decision-based attack. In 2020 IEEE Symposium on Security and Privacy (SP), pp.  1277–1294. IEEE, 2020a.
  • Chen et al. (2023) Chen, K., Wei, Z., Chen, J., Wu, Z., and Jiang, Y.-G. Gcma: Generative cross-modal transferable adversarial attacks from images to videos. In Proceedings of the 31st ACM International Conference on Multimedia, pp.  698–708, 2023.
  • Chen et al. (2017) Chen, P.-Y., Zhang, H., Sharma, Y., Yi, J., and Hsieh, C.-J. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM workshop on artificial intelligence and security, pp.  15–26, 2017.
  • Chen et al. (2020b) Chen, S., Carlini, N., and Wagner, D. Stateful detection of black-box adversarial attacks. In Proceedings of the 1st ACM Workshop on Security and Privacy on Artificial Intelligence, pp.  30–39, 2020b.
  • Chen et al. (2020c) Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp.  1597–1607. PMLR, 2020c.
  • Cheng et al. (2019) Cheng, M., Zhang, H., Hsieh, C.-J., Le, T., Chen, P.-Y., and Yi, J. Query-efficient hard-label black-box attack: An optimization-based approach. In International Conference on Learning Representations, 2019.
  • Cheng et al. (2020) Cheng, M., Singh, S., Chen, P. H., Chen, P.-Y., Liu, S., and Hsieh, C.-J. Sign-opt: A query-efficient hard-label adversarial attack. In International Conference on Learning Representations, 2020.
  • Choi et al. (2023) Choi, S.-H., Shin, J., and Choi, Y.-H. Piha: Detection method using perceptual image hashing against query-based adversarial attacks. Future Generation Computer Systems, 145:563–577, 2023.
  • Croce & Hein (2020) Croce, F. and Hein, M. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International Conference on Machine Learning, 2020.
  • Devlin et al. (2018) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Dong et al. (2018) Dong, Y., Liao, F., Pang, T., Su, H., Zhu, J., Hu, X., and Li, J. Boosting adversarial attacks with momentum. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  9185–9193, 2018.
  • Dosovitskiy et al. (2010) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Eykholt et al. (2018) Eykholt, K., Evtimov, I., Fernandes, E., Li, B., Rahmati, A., Xiao, C., Prakash, A., Kohno, T., and Song, D. Robust physical-world attacks on deep learning visual classification. In CVPR, pp.  1625–1634, 2018.
  • Fan et al. (2021) Fan, L., Liu, S., Chen, P.-Y., Zhang, G., and Gan, C. When does contrastive learning preserve adversarial robustness from pretraining to finetuning? Advances in neural information processing systems, 34:21480–21492, 2021.
  • Feng et al. (2023) Feng, R., Hooda, A., Mangaokar, N., Fawaz, K., Jha, S., and Prakash, A. Stateful defenses for machine learning models are not yet secure against black-box attacks. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, pp.  786–800, 2023.
  • Finlayson et al. (2019) Finlayson, S. G., Bowers, J. D., Ito, J., Zittrain, J. L., Beam, A. L., and Kohane, I. S. Adversarial attacks on medical machine learning. Science, 363(6433):1287–1289, 2019.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • He et al. (2020) He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  9729–9738, 2020.
  • Ho & Nvasconcelos (2020) Ho, C.-H. and Vasconcelos, N. Contrastive learning with adversarial examples. Advances in Neural Information Processing Systems, 33:17081–17093, 2020.
  • Hooda et al. (2023) Hooda, A., Mangaokar, N., Feng, R., Fawaz, K., Jha, S., and Prakash, A. Theoretically principled trade-off for stateful defenses against query-based black-box attacks. arXiv preprint arXiv:2307.16331, 2023.
  • Ilyas et al. (2018) Ilyas, A., Engstrom, L., Athalye, A., and Lin, J. Black-box adversarial attacks with limited queries and information. In International conference on machine learning, pp.  2137–2146. PMLR, 2018.
  • Jiang et al. (2019) Jiang, L., Ma, X., Chen, S., Bailey, J., and Jiang, Y.-G. Black-box adversarial attacks on video recognition models. In ACM International Conference on Multimedia, pp.  864–872, 2019.
  • Jiang et al. (2020) Jiang, Z., Chen, T., Chen, T., and Wang, Z. Robust pre-training by adversarial contrastive learning. Advances in neural information processing systems, 33:16199–16210, 2020.
  • Johnson et al. (2019) Johnson, J., Douze, M., and Jégou, H. Billion-scale similarity search with gpus. IEEE Transactions on Big Data, 7(3):535–547, 2019.
  • Juuti et al. (2019) Juuti, M., Szyller, S., Marchal, S., and Asokan, N. Prada: protecting against dnn model stealing attacks. In 2019 IEEE European Symposium on Security and Privacy (EuroS&P), pp.  512–527. IEEE, 2019.
  • Kim et al. (2020) Kim, M., Tack, J., and Hwang, S. J. Adversarial self-supervised contrastive learning. Advances in Neural Information Processing Systems, 33:2983–2994, 2020.
  • Krizhevsky et al. (2009) Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.
  • Li et al. (2020) Li, H., Xu, X., Zhang, X., Yang, S., and Li, B. Qeba: Query-efficient boundary-based blackbox attack. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  1221–1230, 2020.
  • Li et al. (2022a) Li, H., Shan, S., Wenger, E., Zhang, J., Zheng, H., and Zhao, B. Y. Blacklight: Scalable defense for neural networks against Query-Based Black-Box attacks. In 31st USENIX Security Symposium (USENIX Security 22), pp.  2117–2134, 2022a.
  • Li et al. (2022b) Li, J., Li, D., Xiong, C., and Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pp.  12888–12900. PMLR, 2022b.
  • Li et al. (2023) Li, J., Li, D., Savarese, S., and Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pp.  19730–19742. PMLR, 2023.
  • Liu et al. (2024) Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
  • Luo et al. (2023) Luo, R., Wang, Y., and Wang, Y. Rethinking the effect of data augmentation in adversarial contrastive learning. In The Eleventh International Conference on Learning Representations, 2023.
  • Ma et al. (2018) Ma, X., Li, B., Wang, Y., Erfani, S. M., Wijewickrema, S., Schoenebeck, G., Song, D., Houle, M. E., and Bailey, J. Characterizing adversarial subspaces using local intrinsic dimensionality. In International Conference on Learning Representations, 2018.
  • Ma et al. (2021) Ma, X., Niu, Y., Gu, L., Wang, Y., Zhao, Y., Bailey, J., and Lu, F. Understanding adversarial attacks on deep learning based medical image analysis systems. Pattern Recognition, 110:107332, 2021.
  • Ma et al. (2024) Ma, X., Jiang, L., Huang, H., Weng, Z., Bailey, J., and Jiang, Y.-G. Imbalanced gradients: a subtle cause of overestimated adversarial robustness. Machine Learning, 113(5):2301–2326, 2024.
  • Madry et al. (2017) Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  • Maho et al. (2021) Maho, T., Furon, T., and Le Merrer, E. Surfree: a fast surrogate-free black-box attack. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10430–10439, 2021.
  • Nicolae et al. (2018) Nicolae, M.-I., Sinn, M., Tran, M. N., Buesser, B., Rawat, A., Wistuba, M., Zantedeschi, V., Baracaldo, N., Chen, B., Ludwig, H., Molloy, I., and Edwards, B. Adversarial robustness toolbox v1.2.0. CoRR, 1807.01069, 2018. URL https://arxiv.org/pdf/1807.01069.
  • Nilsback & Zisserman (2008) Nilsback, M.-E. and Zisserman, A. Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing, pp.  722–729. IEEE, 2008.
  • Oord et al. (2018) Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • OpenAI (2023) OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Parkhi et al. (2012) Parkhi, O. M., Vedaldi, A., Zisserman, A., and Jawahar, C. Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition, pp.  3498–3505. IEEE, 2012.
  • Radford et al. (2021) Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
  • Ramesh et al. (2022) Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  • Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015.
  • Sharir et al. (2020) Sharir, O., Peleg, B., and Shoham, Y. The cost of training nlp models: A concise overview. arXiv preprint arXiv:2004.08900, 2020.
  • Stallkamp et al. (2012) Stallkamp, J., Schlipsing, M., Salmen, J., and Igel, C. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural networks, 32:323–332, 2012.
  • Szegedy et al. (2013) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. In International Conference on Learning Representations, 2013.
  • Tong et al. (2023) Tong, C., Zheng, X., Li, J., Ma, X., Gao, L., and Xiang, Y. Query-efficient black-box adversarial attacks on automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
  • Tu et al. (2019) Tu, C.-C., Ting, P., Chen, P.-Y., Liu, S., Zhang, H., Yi, J., Hsieh, C.-J., and Cheng, S.-M. Autozoom: Autoencoder-based zeroth order optimization method for attacking black-box neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, pp.  742–749, 2019.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Wang et al. (2019) Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., and Gu, Q. On the convergence and robustness of adversarial training. In International Conference on Machine Learning, 2019.
  • Wang et al. (2020) Wang, Y., Zou, D., Yi, J., Bailey, J., Ma, X., and Gu, Q. Improving adversarial robustness requires revisiting misclassified examples. In International conference on learning representations, 2020.
  • Wei et al. (2023) Wei, Z., Chen, J., Wu, Z., and Jiang, Y.-G. Adaptive cross-modal transferable adversarial attacks from images to videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • Wu et al. (2020) Wu, D., Wang, Y., Xia, S.-T., Bailey, J., and Ma, X. Skip connections matter: On the transferability of adversarial examples generated with resnets. In International Conference on Learning Representations, 2020.
  • Xu et al. (2017) Xu, W., Evans, D., and Qi, Y. Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155, 2017.
  • Xu et al. (2024a) Xu, X., Zhang, J., Liu, F., Sugiyama, M., and Kankanhalli, M. S. Efficient adversarial contrastive learning via robustness-aware coreset selection. Advances in Neural Information Processing Systems, 36, 2024a.
  • Xu et al. (2024b) Xu, X., Zhang, J., Liu, F., Sugiyama, M., and Kankanhalli, M. S. Enhancing adversarial contrastive learning via adversarial invariant regularization. Advances in Neural Information Processing Systems, 36, 2024b.
  • Zhang et al. (2019) Zhang, H., Yu, Y., Jiao, J., Xing, E., El Ghaoui, L., and Jordan, M. Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning, pp.  7472–7482. PMLR, 2019.
  • Zhang et al. (2024) Zhang, J., Ma, X., Wang, X., Qiu, L., Wang, J., Jiang, Y.-G., and Sang, J. Adversarial prompt tuning for vision-language models. In European Conference on Computer Vision, 2024.
  • Zhang et al. (2021) Zhang, Z., Chen, Y., and Wagner, D. Seat: similarity encoder by adversarial training for detecting model extraction attack queries. In Proceedings of the 14th ACM Workshop on Artificial Intelligence and Security, pp.  37–48, 2021.
  • Zhao et al. (2024) Zhao, Y., Pang, T., Du, C., Yang, X., Li, C., Cheung, N.-M., and Lin, M. On evaluating adversarial robustness of large vision-language models. Advances in Neural Information Processing Systems, 36, 2024.
  • Zhu et al. (2023) Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

Appendix A Datasets and Models

We evaluate Blacklight, PIHA, and our ACPT method on 5 benchmark datasets: CIFAR-10, GTSRB, ImageNet, Flowers, and Pets. Training was performed on NVIDIA RTX 3090 GPUs using PyTorch and the Adam optimizer for 100 epochs. Table 4 summarizes the 5 image classification tasks used in our experiments.

Table 4: A summary of the datasets and the corresponding models used in our experiments.
Dataset    Model      Dimension    Categories  Top-1 Acc.
CIFAR-10   ResNet20   32×32×3      10          91.73%
GTSRB      ResNet34   32×32×3      43          94.96%
Flowers    ResNet101  224×224×3    102         85.80%
ImageNet   ResNet152  224×224×3    1000        78.33%
Pets       ViT-B/16   224×224×3    37          93.13%

Appendix B Effect of Prompt Token Length

Here, we analyze the impact of the prompt token length of ACPT on the detection performance, with varying token lengths $K \in [0, 30]$. Note that when $K = 0$, the ACPT-tuned encoder degenerates to the vanilla CLIP image encoder. As depicted in Figure 5, our AdvQDet can reliably distinguish between benign and adversarial queries, assigning high average similarity scores (close to 1 almost everywhere) to adversarial queries. The difference is more pronounced as the token length of ACPT increases.

Figure 5: The average similarity score of the first 50 benign and adversarial queries under varying prompt token length (“ACPT-x” with x denoting the token length) on ImageNet.

Appendix C Detecting Query-Based Attacks on Vision-Language Models

Our previous experiments have shown that the ACPT method is effective, efficient, and robust at detecting query-based attacks in image classification tasks. Here, we extend ACPT to detect query-based attacks on the image captioning task. Figure 6 visualizes the process of query-based attacks, showing the intermediate adversarial examples generated by Boundary (Brendel et al., 2018), HSJA (Chen et al., 2020a), NESS (Ilyas et al., 2018), QEBA (Li et al., 2020), Square (Andriushchenko et al., 2020), SurFree (Maho et al., 2021), ZOO (Chen et al., 2017), and AttackVLM (Zhao et al., 2024). The first seven attacks are designed for image classification, whereas AttackVLM targets image captioning. The visualization reveals that the underlying process of such attacks is consistent across different tasks.

Unlike query-based attacks on image classification, AttackVLM (Zhao et al., 2024) first employs pre-trained CLIP (Radford et al., 2021) and BLIP (Li et al., 2022b) as surrogate models to craft adversarial examples by matching either image or textual embeddings. These adversarial examples are then transferred to other large Vision-Language Models (VLMs), including MiniGPT-4 (Zhu et al., 2023), LLaVA (Liu et al., 2024), and BLIP-2 (Li et al., 2023). Furthermore, AttackVLM uses the transfer-based attack as the initial step of a query-based attack, which significantly boosts the effectiveness of targeted evasion, i.e., steering large VLMs to generate a targeted response. Despite these advanced techniques, our experiments show that ACPT can effectively detect AttackVLM attacks within 3 attempts.

Figure 6: Malicious queries ($x_0, x_1, \ldots, x_{49}$) generated by 8 query-based attacks (Boundary, HSJA, NESS, QEBA, Square, SurFree, ZOO, and AttackVLM) exhibit notable differences during the generation process. However, the sequences of query images produced by these attacks are highly similar to one another.

Appendix D The Trade-off: Detection Rate vs. False Positives Rate

The trade-off is related to the encoder $E$ and the distribution of query data. Following the OARS work (Hooda et al., 2023), we assume an isotropic Gaussian distribution $\mathcal{N}(\mathbf{p}_{\mathbf{x}}, I\sigma^{2})$ for benign queries, and another Gaussian distribution $\delta \sim \mathcal{N}(0, I\beta^{2})$ for adversarial perturbations. A false negative occurs when encoder $E$ fails to identify the malicious query $\mathbf{x}+\delta$, especially if the embeddings of $\mathbf{x}$ and $\mathbf{x}+\delta$ are significantly different, meaning $sim(E(\mathbf{x}), E(\mathbf{x}+\delta)) \leq \mu$. Consequently, we define the detection rate as $\alpha^{\text{det}} = \mathbb{P}[sim(E(\mathbf{x}), E(\mathbf{x}+\delta)) \geq \mu]$, while the false positive rate can be expressed as $\alpha^{\text{fp}} = \mathbb{P}[sim(E(\mathbf{x}_{1}), E(\mathbf{x}_{2})) \geq \mu]$.
Furthermore, the trade-off between the detection rate $\alpha^{\text{det}}$ and the false positive rate $\alpha^{\text{fp}}$ is influenced by the standard deviation $\beta$ of the perturbation distribution and the expected spread $\sigma$ of natural queries. We observe that natural images are sufficiently spread out, while adversarial examples generated by query-based attacks tend to cluster more tightly. This suggests that a stronger encoder can achieve a high detection rate while maintaining a low false positive rate. Additionally, by implementing an effective defense action, such as returning cached predictions, our approach is designed to minimize the impact of false positives on benign users.
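To make the roles of $\sigma$, $\beta$, and $\mu$ concrete, the two rates defined above can be estimated by Monte Carlo sampling. The following is only a minimal sketch under strong simplifying assumptions that are not part of our method: the encoder $E$ defaults to the identity map, $sim$ is taken to be cosine similarity, the benign center $\mathbf{p}_{\mathbf{x}}$ is set to the zero vector, and the function name `estimate_rates` together with all default values are hypothetical choices for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def estimate_rates(dim=64, sigma=1.0, beta=0.1, mu=0.9,
                   trials=2000, seed=0, encoder=None):
    """Monte Carlo estimate of (detection rate, false positive rate).

    Assumptions (illustrative only): benign queries ~ N(0, I * sigma^2),
    perturbations ~ N(0, I * beta^2), identity encoder unless one is given.
    """
    if encoder is None:
        encoder = lambda x: x  # toy stand-in for the image encoder E
    rng = random.Random(seed)
    detected = 0
    false_pos = 0
    for _ in range(trials):
        x1 = [rng.gauss(0.0, sigma) for _ in range(dim)]  # benign query
        x2 = [rng.gauss(0.0, sigma) for _ in range(dim)]  # unrelated benign query
        delta = [rng.gauss(0.0, beta) for _ in range(dim)]  # perturbation
        x_adv = [a + d for a, d in zip(x1, delta)]
        # Detection event: sim(E(x), E(x + delta)) >= mu
        if cosine(encoder(x1), encoder(x_adv)) >= mu:
            detected += 1
        # False positive event: sim(E(x1), E(x2)) >= mu
        if cosine(encoder(x1), encoder(x2)) >= mu:
            false_pos += 1
    return detected / trials, false_pos / trials


if __name__ == "__main__":
    for beta in (0.1, 0.5, 1.0):
        det, fp = estimate_rates(beta=beta)
        print(f"beta={beta}: detection rate={det:.3f}, false positive rate={fp:.3f}")
```

In this toy setting, a small $\beta$ relative to $\sigma$ keeps perturbed queries close to their source, so the detection rate stays high while unrelated benign pairs rarely exceed the threshold; increasing $\beta$ lowers the detection rate at a fixed $\mu$. A learned encoder would replace the identity map through the `encoder` argument.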