License: arXiv.org perpetual non-exclusive license
arXiv:2401.00897v1 [cs.CV] 31 Dec 2023

Masked Modeling for Self-supervised Representation Learning on Vision and Beyond

Siyuan Li*, Luyuan Zhang*, Zedong Wang, Di Wu, Lirong Wu, Zicheng Liu, Jun Xia, Cheng Tan,
Yang Liu, Baigui Sun, Stan Z. Li†
Siyuan Li and Luyuan Zhang are co-first authors. Stan Z. Li is the corresponding author. Siyuan Li, Luyuan Zhang, Zedong Wang, Di Wu, Lirong Wu, Zicheng Liu, Jun Xia, Cheng Tan, and Stan Z. Li are from the AI Lab, Research Center for Industries of the Future, Westlake University, Hangzhou, Zhejiang, China, 310030.
E-mail: [email protected][email protected]; [email protected][email protected][email protected][email protected]; [email protected][email protected]. cn; [email protected]. Siyuan Li, Yang Liu, and Baigui Sun are with the DAMO Academy, Hangzhou, Zhejiang, China.
Email: [email protected]; [email protected].
Abstract

As the deep learning revolution marches on, self-supervised learning has garnered increasing attention in recent years thanks to its remarkable representation learning ability and low dependence on labeled data. Among the varied self-supervised techniques, masked modeling has emerged as a distinctive approach that predicts parts of the original data that are proportionally masked during training. This paradigm enables deep models to learn robust representations and has demonstrated exceptional performance in computer vision, natural language processing, and other modalities. In this survey, we present a comprehensive review of the masked modeling framework and its methodology. We elaborate on the details of techniques within masked modeling, including diverse masking strategies, recovering targets, network architectures, and more. Then, we systematically investigate its wide-ranging applications across domains. Furthermore, we explore the commonalities and differences between masked modeling methods in different fields. Toward the end of this paper, we conclude by discussing the limitations of current techniques and pointing out several potential avenues for advancing masked modeling research. A paper list project accompanying this survey is available at https://github.com/Lupin1998/Awesome-MIM.

Index Terms:
Self-supervised Learning, Masked Modeling, Generative Model, Natural Language Processing, Audio and Speech, Graph

1 Introduction

Deep learning has made tremendous progress over the past decade, with an early emphasis on supervised learning approaches [81, 80, 147, 123] that depend heavily on large-scale labeled data. However, self-supervised learning (SSL) and pretraining techniques [144] have burgeoned, captivating the deep learning community with their advanced transferability and reduced dependence on labels. Fundamentally, SSL aims to learn valuable representations from unlabeled data, e.g., intrinsic data structures, with designated pretext tasks. The development of SSL and pretraining techniques has been rapid, with a proliferation of variants across modalities and fields. To date, their evolution has followed markedly different trajectories depending on the specific modality and domain. Thus, it is crucial to provide an up-to-date survey of the rapidly growing field of masked modeling. The development timeline of SSL is schematically illustrated in Figure 1.

Early Attempts. Due to the underwhelming results from discriminative pretext tasks, early-stage SSL methods were dominated by generative objectives. Research at that time focused heavily on generative modeling itself, such as image and text generation tasks, with pretraining treated as a byproduct rather than the major concern. Even today, generative approaches remain at the heart of self-supervision, including Autoencoder-based models [167, 50], GAN-based models [14], and diffusion-based models [83]. In contrast, earlier discriminative SSL frameworks hinged on ad-hoc pretext tasks. Methods like [44] and [166] introduced other tasks such as colorization and shuffle-reconstruction. [151] pioneered the use of masked inputs for reconstruction, which served as a precursor to today’s masked modeling. However, these approaches had not yet entered the mainstream.

Figure 1: Research in self-supervised learning (SSL) can be broadly categorized into Generative and Discriminative paradigms. We reviewed major SSL research since 2008 and found that SSL has followed distinct developmental trajectories and stages across time periods and modalities. Since 2018, SSL in NLP has been dominated by generative masked language modeling, which remains mainstream. In computer vision, discriminative contrastive learning dominated from 2018 to 2021 before masked image modeling gained prominence after 2022.

Language Domain. In 2018, BERT [43] and GPT [91] introduced Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) for natural language processing (NLP), ushering in more standardized objectives. Because of the remarkable performance of BERT and GPT, generative pretraining methods based on MLM and NSP have become the mainstream approaches for NLP. From 2018 to 2020, the NLP community mainly focused on refining pretraining strategies based on MLM and NSP. After contrastive learning was theoretically formalized, some 2021 works [64] explored contrastive discriminative pretraining for NLP. However, MLM-based research remains in a dominant position.

Vision Domain. In contrast to NLP, self-supervised pretraining in computer vision (CV) has followed a more complex and diverse development. In 2018, theoretical advances in contrastive learning like [113] and [233] established their foundations, enabling significant performance gains in linear evaluation protocols. This catalyzed the rise of discriminative models for SSL in computer vision. From 2019 to 2021, CV research was dominated by contrastive approaches, with influential frameworks like [79], [31], and [71] achieving impressive results. During this period, some generative models like iGPT [30] adopted auto-regressive pretraining with a GPT-2 [91] backbone. However, due to performance limitations, generative self-supervision had minimal impact compared to contrastive learning. This changed in 2021 when Vision Transformers [49] (ViT) altered the CV self-supervision landscape. Post-ViT [49], CV research began emulating BERT [43] by tokenizing images and then pretraining transformers. MAE [78] formally introduced masked image modeling, achieving strong performance. Since then, CV self-supervision has focused on generative reconstruction and masked modeling.

Multimodality. The earliest multimodal pre-trained models emerged in 2020, with VL-BERT [200] fusing modalities using a transformer architecture. In 2021, CLIP [176] combined computer vision and NLP modalities, ushering in an era of contrastive learning for multimodal pretraining that became mainstream in academia. Proposed in 2022, BEiT.v3 [220] introduced masked modeling as a pretraining technique for multimodal models, while MetaTransformer [272] combined multiple approaches. Since then, masked modeling has played a pivotal role in multimodal research.

Other Domains. SSL has been broadly applied across modalities beyond NLP and CV, including Audio, Speech, Biology, Video, and others. Research on SSL pretraining for Audio and Speech has closely followed the paradigms in CV and NLP. When contrastive learning gained popularity in 2018, influential speech models like [39] and [8] adopted contrastive learning for pretraining. Notably, [8] used masked modeling as a data augmentation technique for contrastive learning. In 2021, [27], followed by [95] in 2022, drew inspiration from masked image modeling in CV to implement masked spectrum modeling for audio. Since then, masked modeling has been a main direction in audio and speech research. After AlphaFold [103] achieved a breakthrough in accurate protein structure prediction in 2021, masked modeling was also introduced into Biology and Chemistry to assist scientists under the AI-for-Science (AI4Sci) research paradigm.

Masked modeling has demonstrated compelling performance across modalities, including vision, language, speech, and beyond. With its widespread adoption, the landscape of masked modeling research has grown increasingly diverse. A multitude of masked modeling methods have emerged, creating a complex ecosystem of models tailored to different data types and tasks. Therefore, it is highly worthwhile to systematically review recent advances and provide structured categorization of the extensive masked modeling literature. In this paper, we conduct an extensive survey of the masked modeling research landscape. We thoroughly investigate the latest innovations in self-supervised representation learning across vision, NLP, speech, and other domains. Our main contribution is a comprehensive taxonomy that organizes the extensive body of masked modeling techniques into coherent groups according to training objectives, model architectures, and applications. This framing elucidates the relationships between existing methods and paves the way for developing new masked modeling techniques. Our review and classification provide a holistic reference to inform and accelerate future masked modeling research across modalities.

To sum up, our contributions include:

  1. We provide a timely literature review and a comprehensive framework, taking computer vision as an instance, to holistically conceptualize masked modeling principles that can categorize different applications to date across domains and modalities under a common lens.

  2. We meticulously review and discuss the technical details within the masked modeling framework, such as masking strategies, targets, networks, and more, to give researchers a better grasp of the techniques involved and deeper insight into them.

  3. We systematically survey the downstream applications of masked modeling in vision, presenting the technical challenges and further showcasing their widespread applicability to other modalities and domains beyond vision, such as audio, speech, graph, biology, and more.

  4. Through extensive algorithmic research and detailed evaluations, we provide a collection of comprehensive tables and awesome lists of masked modeling methods on GitHub. Finally, we identify future directions for masked modeling research and offer heuristic suggestions and reflections on these directions.

2 Preliminary

2.1 Notations

The notations used in this survey are listed in Table I. Below, we describe how the symbols and variables in the table relate to one another.

In this paper, $x$ denotes a data sequence, which can be a sentence in NLP, a patch sequence in CV, or a data sequence in another modality. In CV, $\boldsymbol{x}$ denotes a patch sequence with $\boldsymbol{x}\in\mathbb{R}^{N\times(P^{2}C)}$, where $N$ denotes the number of patches and $P^{2}\times C$ denotes the dimension of a patch vector. That means $\boldsymbol{x}=[\boldsymbol{x}_{1},\boldsymbol{x}_{2},\cdots,\boldsymbol{x}_{n}]$, where $\boldsymbol{x}_{i}$ denotes a patch and $\boldsymbol{x}_{i}=[x_{i}^{1},x_{i}^{2},\cdots,x_{i}^{P^{2}\times C}]$.
In this paper, we use $\boldsymbol{x}^{k},\boldsymbol{x}_{i}^{k}$ to denote different sequences and patches, and we use $\boldsymbol{x}^{v_{i}}$ to denote the different views of the patch. In NLP, $\boldsymbol{x}=[\boldsymbol{x_{1}},\boldsymbol{x_{2}},\cdots,\boldsymbol{x_{L}}]$ denotes the original sentence. We use $\boldsymbol{e}=[\boldsymbol{e_{1}},\boldsymbol{e_{2}},\cdots,\boldsymbol{e_{L}}]$ to denote the embedded sequence. The encoder and decoder are denoted by $f_{\theta}(\cdot)$ and $g_{\phi}(\cdot)$, where $\theta$ and $\phi$ are learnable parameters.
In masked modeling, some tokens or patches of $\boldsymbol{x}$ are selected to be masked, and we use $\mathcal{M}=\{0,1\}^{N}$ to denote the mask set, which means a masked sequence can be denoted as $\boldsymbol{x}\odot\mathcal{M}=[\boldsymbol{x_{1}},\cdots,\boldsymbol{x_{i-1}},0,\boldsymbol{x_{i+1}},\cdots,\boldsymbol{x_{n}}]$. The remaining visible patches (tokens) can be denoted as $\tilde{x}=x_{i=1,\mathbb{I}_{\mathcal{M}=1}}^{N}$ or $\tilde{e}=e_{i=1,\mathbb{I}_{\mathcal{M}=1}}^{N}$.
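
As a minimal sketch of the masking notation above (not the implementation of any surveyed method), the following Python snippet draws a random binary mask $\mathcal{M}\in\{0,1\}^{N}$, forms $\boldsymbol{x}\odot\mathcal{M}$ by zeroing the masked positions, and collects the visible elements $\tilde{\boldsymbol{x}}$. The function names, the toy data, the mask ratio, and the convention that $\mathcal{M}_{i}=0$ marks a masked position (chosen so that the element-wise product zeroes it) are assumptions made for illustration.

```python
import random

def random_mask(num_tokens, mask_ratio, seed=0):
    """Return a 0/1 list M of length num_tokens; 0 = masked, 1 = visible (assumed convention)."""
    rng = random.Random(seed)
    num_masked = int(round(num_tokens * mask_ratio))
    masked_idx = set(rng.sample(range(num_tokens), num_masked))
    return [0 if i in masked_idx else 1 for i in range(num_tokens)]

def apply_mask(x, mask):
    """Element-wise product x ⊙ M: masked patches become all-zero vectors."""
    return [[v * m for v in patch] for patch, m in zip(x, mask)]

def visible_tokens(x, mask):
    """The visible subsequence x~ (patches whose mask entry is 1)."""
    return [patch for patch, m in zip(x, mask) if m == 1]

# Toy sequence: N = 8 patches, each a vector of dimension P^2 * C = 4.
x = [[float(i)] * 4 for i in range(1, 9)]
mask = random_mask(num_tokens=8, mask_ratio=0.75)
x_masked = apply_mask(x, mask)      # same length as x, masked patches zeroed
x_visible = visible_tokens(x, mask)  # only the patches an encoder would see
```

With a mask ratio of 0.75, six of the eight toy patches are zeroed and two remain visible. Whether an encoder consumes `x_masked` or only `x_visible` is a design choice that differs between methods.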

Figure 2: Illustration of two popular self-supervised learning frameworks. For simplicity, the input data can be serialized and transformed into a sequence of embedded tokens. (a) Contrastive learning learns discriminative representation from two augmented views by aligning two projected tokens. (b) Masked modeling learns contextual information by the generative paradigm that reconstructs the masked tokens, which can be
| Notations | Descriptions |
| --- | --- |
| $\mathbb{R}^{m\times n}$ | Two-dimensional tensor space |
| $\mathbb{R}^{m\times n\times p}$ | Three-dimensional tensor space |
| $\mathcal{N}$ | Natural number set from $1$ to $N$ |
| $\boldsymbol{x}$, $\boldsymbol{x}_{i}$ | A data sequence and its $i$-th element |
| $\boldsymbol{x}_{m:n}$ | The subsequence from $m$ to $n$ in $\boldsymbol{x}$ |
| $\boldsymbol{m}$ | Encoded representations of masked patch/token |
| $\boldsymbol{z}$ | Latent variables |
| $\mathcal{M}=\{0,1\}^{N}$ | A set of masks for $N$ elements |
| $\mathcal{M}_{i}$ | The $i$-th element in set $\mathcal{M}$ |
| $\tilde{\boldsymbol{x}}$ | The set of visible tokens in masked sequence |
| $\theta,\omega,\gamma,\cdots$ | Parameters of the deep neural networks |
| $\tau$ | Temperature parameter in contrastive learning |
| $\lambda$ | Weights of loss functions |
| Natural Language Processing (NLP) | |
| $\boldsymbol{e}$ | Embedded word tokens |
| $\mathcal{V}$, $v_{i}$ | Vocabulary set (or codebook) and its $i$-th element |
| Computer Vision (CV) | |
| $\mathbf{X}$ | Images |
| $\mathbf{X}^{v}$ | Different views of the image $\mathbf{X}$ |
| $\boldsymbol{x}^{v_{i}}$ | A patch sequence with different views |
| $\boldsymbol{q}_{\phi}(\cdot\mid\cdot)$ | The quantization tokenizer |
| $\boldsymbol{p}_{\psi}(\cdot\mid\cdot)$ | The decoder to train the tokenizer |
| $f_{\theta}(\cdot)$ | Encoder with parameter $\theta$ |
| $f_{\theta^{\prime}}(\cdot)$ | The teacher model with parameter $\theta^{\prime}$ |
| $g_{\theta}(\cdot)$ | Decoder with parameter $\theta$ |
| $\nabla(\cdot)$ | Gradient function |
| $\mathcal{T}(\cdot)$ | The transformation function |
| $\mathbb{I}_{(\cdot)}$ | An indicator function |
| $\mathcal{G}(\cdot)$ | Generator in adversarial learning |
| $\mathcal{D}(\cdot)$ | Discriminator in adversarial learning |
| $\mathcal{F}(\cdot)$ | Fourier transform function |
| $p(\cdot)$ | Probability density function |
| $p(\cdot\mid\cdot)$ | Conditional probability distribution |
| $\textrm{sg}(\cdot)$ | Stop-gradient operation |
| $\langle\cdot,\cdot\rangle$ | Inner product function |
| $\lvert\cdot\rvert$ | Cardinality of the set |
| $\lVert\cdot\rVert$ | Norm of the vector |
| $\mathcal{S}$ | Similarity measurement function |
| $(\cdot)^{T}$ | Transpose function |
| $\odot$ | Element-wise multiplication operation |

TABLE I: Mathematical notations.

2.2 Self-Supervised Learning

In this subsection, we give a brief introduction to SSL methods. SSL methods are typically divided into two categories [142], i.e., Generative and Discriminative, as shown in Figure 3.

Generative models usually encode the input $x$ into a latent variable $z$ and decode $z$ to reconstruct the input $x$ with an encoder-decoder architecture.

Auto-Regressive (AR) models factorize the prediction of one input into a series of regressions performed one by one, where the current output depends on the previous inputs or outputs in the sequence. GPT [91] and Transformer [209] are both AR models. The learning objective of the AR model can be formulated as:

$$\max_{\theta}\, p_{\theta}(\boldsymbol{x})=\sum_{t=1}^{T}\log p_{\theta}(\boldsymbol{x}_{t}|\boldsymbol{x}_{1:t-1}), \tag{1}$$

where each variable is dependent on previous variables [142].
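
To make the sum in Eq. (1) concrete, the following sketch evaluates the auto-regressive log-likelihood of a short sequence under a toy model. The conditional probability table, the start symbol, and the simplification that each token depends only on its immediate predecessor are assumptions chosen for illustration; a real AR model conditions on the full prefix $\boldsymbol{x}_{1:t-1}$.

```python
import math

# Hypothetical toy model: p(x_t | x_{1:t-1}) is assumed to depend only on the previous token.
START = "<s>"
COND_PROB = {
    (START, "a"): 0.6, (START, "b"): 0.4,
    ("a", "a"): 0.1, ("a", "b"): 0.9,
    ("b", "a"): 0.5, ("b", "b"): 0.5,
}

def ar_log_likelihood(sequence):
    """Sum over t of log p(x_t | previous token), mirroring the right-hand side of Eq. (1)."""
    total = 0.0
    prev = START
    for token in sequence:
        total += math.log(COND_PROB[(prev, token)])
        prev = token
    return total

log_p = ar_log_likelihood(["a", "b", "b"])
# By the chain rule, exp(log_p) equals 0.6 * 0.9 * 0.5 for this toy table.
```

Training an AR model amounts to adjusting the parameters that produce these conditional probabilities so that the summed log-probability of the training sequences increases.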

Auto-Encoder (AE) models typically reconstruct the input from a corrupted version of it. The learning objective of the AE model can be formulated as:

$$\min\mathcal{L}\big(\boldsymbol{x},g_{\textrm{dec}}(f_{\textrm{enc}}(\boldsymbol{x}))\big). \tag{2}$$

We further divide the AE model into denoising AE and masked AE. The denoising AE is trained to reconstruct clean data from noisy or corrupted input. By removing noise or corruption, the model learns robust representations. A masked auto-encoder is instead trained to predict missing or masked portions of the input data. By reconstructing the missing parts, the model learns contextual representations.
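
The two AE variants differ mainly in how the input is corrupted before it is encoded. The sketch below contrasts the two corruptions on a toy vector and evaluates a reconstruction loss against the clean target. The function names, the use of mean squared error for $\mathcal{L}$, and the identity mapping standing in for $g_{\textrm{dec}}(f_{\textrm{enc}}(\cdot))$ are assumptions made for illustration only.

```python
import random

def corrupt_with_noise(x, sigma, rng):
    """Denoising-AE style corruption: add Gaussian noise to every element."""
    return [v + rng.gauss(0.0, sigma) for v in x]

def corrupt_with_mask(x, mask_ratio, rng):
    """Masked-AE style corruption: zero out a random subset of elements."""
    num_masked = int(round(len(x) * mask_ratio))
    masked = set(rng.sample(range(len(x)), num_masked))
    return [0.0 if i in masked else v for i, v in enumerate(x)], masked

def mse(target, prediction):
    """Reconstruction loss L(x, x_hat), here assumed to be mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(target, prediction)) / len(target)

rng = random.Random(0)
x = [1.0, 2.0, 3.0, 4.0]
x_noisy = corrupt_with_noise(x, sigma=0.1, rng=rng)
x_masked, masked_idx = corrupt_with_mask(x, mask_ratio=0.5, rng=rng)

# With an identity "model" (a stand-in for the encoder-decoder), the loss simply
# measures how far each corrupted input is from the clean target x.
loss_noisy = mse(x, x_noisy)
loss_masked = mse(x, x_masked)
```

In the noisy case every element contributes a small error, whereas in the masked case the error is concentrated entirely on the masked positions, which is what pushes a masked AE to infer content from context.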

Flow-based models aim to learn densities $p(x)$ from data. Suppose a latent variable $z$ follows a known distribution $p_{Z}(z)$ and define $z=f_{\theta}(x)$. The learning objective is to maximize the likelihood [142]:

$$\begin{aligned}&\max_{\theta}\sum_{i}\log p_{\theta}(x^{(i)})\\=&\max_{\theta}\sum_{i}\log p_{Z}(f_{\theta}(x^{(i)}))+\log\bigg|\frac{\partial f_{\theta}}{\partial x}(x^{(i)})\bigg|.\end{aligned} \tag{3}$$
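
The change-of-variables objective in Eq. (3) can be checked on a one-dimensional example. The sketch below assumes an affine map $f_{\theta}(x)=ax+b$ and a standard normal base density $p_{Z}$; both choices, as well as the function names and toy data, are illustrative assumptions. For such a map the induced density of $x$ is a Gaussian with mean $-b/a$ and standard deviation $1/|a|$, which gives an independent reference value.

```python
import math

def standard_normal_logpdf(z):
    """log p_Z(z) for the assumed standard normal base density."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def affine_flow_logpdf(x, scale, shift):
    """log p_theta(x) = log p_Z(f_theta(x)) + log|d f_theta / dx| for f_theta(x) = scale * x + shift."""
    z = scale * x + shift
    return standard_normal_logpdf(z) + math.log(abs(scale))

def normal_logpdf(x, mean, std):
    """Closed-form Gaussian log-density, used only as a reference."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)

data = [0.3, 1.2, 2.5]
scale, shift = 2.0, -3.0
log_likelihood = sum(affine_flow_logpdf(x, scale, shift) for x in data)
reference = sum(normal_logpdf(x, mean=-shift / scale, std=1.0 / abs(scale)) for x in data)
```

The two sums agree up to floating-point error, which illustrates why the log-determinant term is needed: without it, the transformed density would not integrate to one.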

GAN-based models, also known as adversarial learning, train two models in competition with each other, typically a generator $\mathcal{G}$ and a discriminator $\mathcal{D}$, which can be formulated as:

$$\begin{aligned}&\min_{\mathcal{G}}\max_{\mathcal{D}}V(\mathcal{D},\mathcal{G})=\\&\mathbb{E}_{x\sim p_{\text{data}}(x)}\big[\log\mathcal{D}(x)\big]+\mathbb{E}_{z\sim p_{z}(z)}\big[\log(1-\mathcal{D}(\mathcal{G}(z)))\big].\end{aligned} \tag{4}$$
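
As a rough illustration of Eq. (4), the following sketch estimates the value function $V(\mathcal{D},\mathcal{G})$ by Monte Carlo sampling. The toy data distribution, the generator, and both discriminators are hypothetical stand-ins; no training is performed, and the snippet only evaluates the objective for fixed functions.

```python
import math
import random

def value_function(discriminator, generator, real_samples, noise_samples):
    """Monte Carlo estimate of V(D, G): mean log D(x) plus mean log(1 - D(G(z)))."""
    real_term = sum(math.log(discriminator(x)) for x in real_samples) / len(real_samples)
    fake_term = sum(math.log(1.0 - discriminator(generator(z))) for z in noise_samples) / len(noise_samples)
    return real_term + fake_term

def generator(z):
    """Hypothetical G that shifts noise so it matches the toy data distribution N(2, 1)."""
    return z + 2.0

def blind_discriminator(x):
    """A D that cannot tell real from generated samples."""
    return 0.5

def sigmoid_discriminator(x):
    """An arbitrary alternative D, chosen only for comparison."""
    return 1.0 / (1.0 + math.exp(-(x - 1.0)))

rng = random.Random(0)
real = [rng.gauss(2.0, 1.0) for _ in range(1000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]

v_blind = value_function(blind_discriminator, generator, real, noise)
v_sigmoid = value_function(sigmoid_discriminator, generator, real, noise)
```

When the discriminator outputs 0.5 everywhere, the value is exactly $-\log 4$. Because the assumed generator already matches the data distribution here, the alternative discriminator cannot do better in expectation, which reflects the equilibrium of the min-max game.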

Diffusion-based models first corrupt images with a sequence of Gaussian noise steps and then restore the image through the model. The procedure is divided into a forward process and a reverse process. The forward process adds cumulative Gaussian noise to the image, which can be modeled as follows:

$$\begin{aligned}q(x_{t}|x_{t-1})&=\mathcal{N}(x_{t};\sqrt{1-\beta_{t}}\,x_{t-1},\beta_{t}\mathbf{I}),\\q(x_{1:T}|x_{0})&=\prod_{t=1}^{T}q(x_{t}|x_{t-1}),\end{aligned} \tag{5}$$

in which $\beta_{t}$ is the variance of the Gaussian noise added at step $t$. The reverse process of the diffusion-based model, which involves denoising and inference, has a learning objective that can be modeled as follows:

$$p_{\theta}(x_{0:T})=p(x_{T})\prod_{t=1}^{T}p_{\theta}(x_{t-1}|x_{t}); \tag{6}$$

$$p_{\theta}(x_{t-1}|x_{t})=\mathcal{N}(x_{t-1};\mu_{\theta}(x_{t},t),\Sigma_{\theta}(x_{t},t)). \tag{7}$$
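
The forward process in Eq. (5) can be simulated directly by chaining its Gaussian transitions. The sketch below is a minimal illustration under assumed settings: the linear schedule for $\beta_{t}$, the number of steps, and the toy input are hypothetical choices, and the learned reverse process of Eqs. (6) and (7) is not implemented.

```python
import math
import random

def forward_step(x_prev, beta_t, rng):
    """Draw x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I), one transition of Eq. (5)."""
    return [math.sqrt(1.0 - beta_t) * v + math.sqrt(beta_t) * rng.gauss(0.0, 1.0) for v in x_prev]

def forward_trajectory(x0, betas, seed=0):
    """Return [x_0, x_1, ..., x_T] by chaining the Markov transitions q(x_t | x_{t-1})."""
    rng = random.Random(seed)
    trajectory = [list(x0)]
    for beta_t in betas:
        trajectory.append(forward_step(trajectory[-1], beta_t, rng))
    return trajectory

# Hypothetical linear schedule with T = 1000 steps (values chosen for illustration only).
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
x0 = [1.0, -1.0, 0.5]
trajectory = forward_trajectory(x0, betas)

# Product of the per-step mean coefficients: the fraction of x_0 that survives
# in the mean of x_T after all T noising steps.
signal_scale = math.prod(math.sqrt(1.0 - b) for b in betas)
```

Under this schedule `signal_scale` is well below 0.01, so the final sample is almost pure Gaussian noise, which is why the reverse process can start from a simple prior $p(x_{T})$.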
Figure 3: Self-supervised learning is commonly divided into generative and discriminative paradigms [142]. Generative models can be further divided into AR, AE, flow-based, and GAN-based models, and AE models can be divided into denoising AE and masked AE.

Discriminative models are typically formulated using contrastive learning objectives. The core idea of contrastive learning is to train encoders to produce similar representations for semantically related instances while distinguishing unrelated samples [142]. Context-instance contrast compares an encoded local feature with the global representation of the same sample. In contrast, instance-instance contrast focuses on instance-level representations and examines the commonalities across multiple samples [142]. InfoNCE [167] is one of the basic loss functions in contrastive learning. It can be formulated as:

$$\mathcal{L}_{\text{infoNCE}}=-\mathbb{E}_{(\boldsymbol{x}^{i},\boldsymbol{x}^{j})\sim p(\boldsymbol{x})}\left[\log\frac{\exp(f(\boldsymbol{x}^{i})^{T}f(\boldsymbol{x}^{j})/\tau)}{\sum_{k=1}^{K}\exp(f(\boldsymbol{x}^{i})^{T}f(\boldsymbol{x}^{k})/\tau)}\right]. \tag{8}$$
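
A small numerical example may help in reading the InfoNCE loss. The sketch below computes the loss for a single anchor in the usual negative-log form, that is, the negative log of the softmax probability assigned to the positive among $K$ candidates. The embeddings are given directly as vectors standing in for $f(\boldsymbol{x})$, the candidate set is assumed to contain the positive, and the function names and values are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce(anchor, candidates, positive_index, temperature):
    """-log( exp(s_pos / tau) / sum_k exp(s_k / tau) ) for one anchor embedding."""
    logits = [dot(anchor, c) / temperature for c in candidates]
    max_logit = max(logits)  # subtract the max before exponentiating for numerical stability
    log_denominator = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_denominator - logits[positive_index]

anchor = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]  # index 0 plays the positive, the rest are negatives
loss_aligned = info_nce(anchor, candidates, positive_index=0, temperature=0.1)
loss_misaligned = info_nce(anchor, candidates, positive_index=2, temperature=0.1)
```

The loss is close to zero when the positive is the candidate most similar to the anchor and grows large when it is the least similar. A smaller temperature $\tau$ sharpens this contrast.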

2.3 Masked Modeling

Masked Language Modeling. Masked Language Modeling (MLM) was first introduced in BERT. The central idea of MLM is to randomly mask tokens within a sentence by replacing them with a mask vector. The encoder then predicts the original tokens at the masked positions. We formally define the problem of MLM as follows:

A sentence $\boldsymbol{x}=[\boldsymbol{x_{1}},\boldsymbol{x_{2}},\cdots,\boldsymbol{x_{L}}]$ is first tokenized as $\boldsymbol{e}=[\boldsymbol{e_{1}},\boldsymbol{e_{2}},\cdots,\boldsymbol{e_{L}}]$ through a tokenizer $\boldsymbol{q}_{\phi}(\cdot|\cdot)$, in which $L$ denotes the number of tokens in the sentence. The masked sequence of the embedded sentence, $\boldsymbol{e}\odot\mathcal{M}$, is fed into a Transformer encoder $f_{\theta}(\cdot)$. $m_{i}=f_{\theta}(\tilde{e})$ is the hidden state of the last layer at the masked position and can be regarded as a fusion of the contextualized representations of the surrounding tokens. The MLM task [115] can be formulated mathematically as:

$$\mathcal{L}_{\textrm{MLM}}(x)=-\frac{1}{\|\mathcal{M}\|}\sum_{i\in\mathcal{N}}\mathbb{I}_{\{\mathcal{M}_{i}=1\}}\log\frac{\exp(m_{i}\cdot e_{i})}{\sum_{k=1}^{|\mathcal{V}|}\exp(m_{i}\cdot e_{k})}. \tag{9}$$
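The MLM objective in Eq. (9) is a cross-entropy over the vocabulary, averaged over the masked positions only. The sketch below, written in plain Python for illustration, assumes that the hidden states $m_{i}$ and the token embeddings $e_{k}$ are available as lists of floats; the function name `mlm_loss` and the toy values are hypothetical.

```python
import math

def mlm_loss(hidden, embeddings, token_ids, mask):
    """Average cross-entropy over masked positions, following Eq. (9).

    hidden      -- list of L hidden states m_i (each a list of floats)
    embeddings  -- list of |V| token embeddings e_k
    token_ids   -- list of L ground-truth vocabulary indices
    mask        -- list of L indicators, 1 = masked position
    """
    total, num_masked = 0.0, 0
    for m_i, target, is_masked in zip(hidden, token_ids, mask):
        if not is_masked:
            continue  # unmasked positions do not contribute to the loss
        logits = [sum(a * b for a, b in zip(m_i, e_k)) for e_k in embeddings]
        shift = max(logits)
        log_z = shift + math.log(sum(math.exp(z - shift) for z in logits))
        total += log_z - logits[target]  # -log softmax probability of the true token
        num_masked += 1
    return total / num_masked if num_masked else 0.0

# Toy vocabulary of two tokens and a three-token sentence with two masked positions.
embeddings = [[1.0, 0.0], [0.0, 1.0]]
hidden = [[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]]
token_ids = [0, 0, 1]
mask = [0, 1, 1]
loss = mlm_loss(hidden, embeddings, token_ids, mask)
```

Because both masked hidden states point towards the embedding of the correct token, the resulting loss is close to zero; a hidden state that carries no information (all zeros) would yield $\log|\mathcal{V}|$ instead.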

Masked Image Modeling. The core concept of Masked Image Modeling (MIM) aligns with that of MLM. It involves masking certain pixel regions of the input image and reconstructing the original image based on the unmasked portions. Given that images lack the tokenizer structure inherent in natural language, the intuitive approach is to reconstruct pixel values directly. However, due to the high redundancy and dimensionality of image pixel information, pixel-level reconstruction is often challenging, which has historically hindered the progress of MIM. It was not until the introduction of ViT, which segments images into patches, that MIM began to emerge as a feasible approach. We formally define the problem of MIM as follows: An image $\mathbf{X}\in\mathbb{R}^{H\times W\times C}$ is partitioned into multiple patches $\boldsymbol{x}\in\mathbb{R}^{N\times(P^{2}C)}$, $\boldsymbol{x}=[\boldsymbol{x_{1}},\boldsymbol{x_{2}},\cdots,\boldsymbol{x_{N}}]$, where $N$ denotes the number of patches. The masked sequence can be denoted as $\boldsymbol{x}\odot\mathcal{M}$. The remaining unmasked patches $\tilde{\boldsymbol{x}}$ are used to reconstruct the original pixels through an encoder $f_{\theta}(\cdot)$ and a decoder $g_{\theta}(\cdot)$. As in NLP, we use $m_{i}$ to denote the hidden state at the masked position, with $m_{i}=f_{\theta}(\tilde{x})$. The learning objective can be formulated as:

$$\mathcal{L}_{\textrm{MIM}}=\frac{1}{\|\mathcal{M}\|}\sum_{i\in\mathcal{N}}\mathbb{I}_{\{\mathcal{M}_{i}=1\}}\|m_{i}-\boldsymbol{x_{i}}\|^{2}. \tag{10}$$
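The two ingredients of this definition, partitioning an image into $N$ patches of length $P^{2}C$ and averaging the squared error over the masked patches, can be sketched as follows. This is a minimal illustration in plain Python that assumes the image is given as nested lists and that the predictions for all patches are already available; `patchify`, `mim_loss`, and the toy image are hypothetical names and values.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into N patches of length P*P*C."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    flat.extend(image[r][c])  # append the C channel values
            patches.append(flat)
    return patches

def mim_loss(predictions, patches, mask):
    """Squared error summed over masked patches, divided by the number of masked patches."""
    num_masked = sum(mask)
    total = 0.0
    for pred, target, is_masked in zip(predictions, patches, mask):
        if is_masked:
            total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / num_masked if num_masked else 0.0

# Toy 4x4 single-channel image, split into four 2x2 patches.
image = [[[float(4 * r + c)] for c in range(4)] for r in range(4)]
patches = patchify(image, patch=2)
mask = [1, 0, 0, 1]                          # 1 = masked patch
predictions = [[0.0] * 4 for _ in patches]   # a dummy all-zero reconstruction
loss = mim_loss(predictions, patches, mask)
```

Only the first and last patch enter the loss here, mirroring the indicator $\mathbb{I}_{\{\mathcal{M}_{i}=1\}}$ in Eq. (10).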
Figure 4: The overview of the basic MIM framework, containing four building blocks with their internal components and functionalities. All MIM research can be summarized as innovations upon these four blocks, i.e., Masking, Encoder, Target, and Head. Frameworks of masked modeling in other modalities are similar to this framework.

Beyond. Beyond computer vision and natural language processing, masked modeling can also be applied to various data structures and multimodal domains. The core idea is to mask parts of the input vector with mask tokens and then reconstruct the data through an encoder-decoder framework. Masked data modeling can be formally described as follows: given an input sequence $x$ of any modality, we generate the corrupted sample $x\odot\mathcal{M}$ by replacing the elements in $x_{m}$ with mask tokens [MASK]. We use $\mathcal{S}(\cdot,\cdot)$ to denote the similarity between the predicted mask tokens and the original data. The learning objective can be formulated as:

$$\mathcal{L}_{\textrm{MDM}}=\frac{1}{\|\mathcal{M}\|}\sum_{i\in\mathcal{N}}\mathbb{I}_{\{\mathcal{M}_{i}=1\}}\mathcal{S}(m_{i},x_{i}), \tag{11}$$

in which $\mathcal{S}(\cdot,\cdot)$ can be MSE or another function that measures similarity.

3 Basic Framework and a Unified Perspective

In this section, we will introduce a unified perspective for Masked Modeling, offering a comprehensive categorization of masked modeling research. This will be complemented by an in-depth exposition of the basic framework of Masked Modeling, ensuring a profound understanding of its intricacies. Since masked modeling has been most thoroughly explored and developed in computer vision with the most comprehensive techniques and has laid the foundation for developments across domains, this paper takes masked image modeling as an example to elucidate masked modeling from the perspective of computer vision.

3.1 A Unified Perspective

Based on the current research on MIM for self-supervised pre-training, this paper conducts an in-depth investigation and proposes a unified research framework and paradigm for MIM, providing a detailed classification of existing studies. The framework mainly consists of four modules, namely: Mask, Target, Encoder, and Head. An overview of our framework is visually presented in Figure 4. In the following subsections, we will elaborate on the specific contents of these four modules.

Mask: The Mask module generates a mask set $\mathcal{M}$. The masked image can be denoted as $\boldsymbol{x}\odot\mathcal{M}$. Typical mask strategies include Random Mask, Attention Mask, Contextual Mask, and so on.

Target: The Target module’s role is to generate supervisory signals. It can be formulated as $\mathcal{T}(f_{\omega}(\boldsymbol{x}))$, where $f_{\omega}(\cdot)$ is a model with parameters $\omega$. Within this module, models like VQ-GAN [50] and dVAE can be used as tools to extract these signals, and different supervisory signals can lead to different model preferences.

Encoder: The Encoder $f_{\theta}(\cdot)$ is the network that MIM pre-training aims to learn and can adopt various network architectures, such as Transformer, CNN, or a hybrid of both. The input for the Encoder can be either the visible patches alone or a combination of visible and masked patches.

Head: The Head module’s purpose is to compare the supervisory signals with the encoded features. The primary task of MIM is to reconstruct the original image, so the most common head is the MIM head, which reconstructs the original image or its features. Additionally, there’s the Contrastive head, which employs contrastive learning to enhance the model’s performance.

Based on the unified perspective we proposed, the MIM problem can be mathematically represented as:

$$\mathcal{L}_{\textrm{MIM}}=\mathcal{S}(\mathcal{T}_{1}(f_{\omega}(\boldsymbol{x})),\mathcal{T}_{2}(g_{\gamma}(f_{\theta}(\boldsymbol{x}\odot\mathcal{M})))). \tag{12}$$
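One way to read Eq. (12) is as a composition of interchangeable modules. The sketch below expresses this composition in plain Python with every module passed in as a callable. It is only an illustration: the function names are hypothetical, and the corruption $\boldsymbol{x}\odot\mathcal{M}$ is assumed here to zero out the masked entries, which is one of several possible choices.

```python
def unified_mim_loss(x, mask, target_model, encoder, head, t1, t2, similarity):
    """Evaluate S(T1(f_omega(x)), T2(g_gamma(f_theta(x masked by M)))) for one sample."""
    # Corrupted input: entries with mask == 1 are zeroed out (an assumption).
    corrupted = [0.0 if m else v for v, m in zip(x, mask)]
    target = t1(target_model(x))               # supervisory signal from the clean input
    prediction = t2(head(encoder(corrupted)))  # prediction from the masked input
    return similarity(target, prediction)

def identity(v):
    return v

def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

# Degenerate pixel-target instance: every module is the identity, so the loss
# only measures what masking removed from the input.
x = [1.0, 2.0, 3.0, 4.0]
mask = [0, 1, 0, 1]
loss = unified_mim_loss(x, mask, identity, identity, identity, identity, identity, mse)
```

Swapping `target_model` for a tokenizer or a teacher network, or `similarity` for a cross-entropy, changes the Target and Head modules without touching the rest of the composition.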

Permuting and combining these four modules, we have meticulously categorized the research on MIM. The detailed classification is presented in Table III.

Figure 5: MAE proposed a basic framework for MIM pre-training, where the visible patches are encoded while the encoded features are decoded together with masked patches to reconstruct the pixel. The figure is reproduced from [78].

3.2 Basic Framework

iGPT [30]: The input image $\mathbf{X}$ is downsampled and arranged into a pixel sequence $\boldsymbol{x}$, which is fed into a Transformer structure identical to GPT-2 [91]. The model predicts the value of the next pixel $\boldsymbol{x}_{t}$ based on the preceding pixel values $\boldsymbol{x}_{1:t-1}$. Given that iGPT predicts pixel values in sequence, its masking approach can be considered “Basic Masking”, with the target being the Token. Following GPT, the encoder of iGPT is a Transformer and the decoder is a Linear MIM Head. The loss of iGPT can be formulated as Eq. 1.

MAE [78]: The overview of MAE can be seen in Figure 5. The input image $\mathbf{X}\in\mathbb{R}^{H\times W\times C}$ is partitioned into multiple patches $\boldsymbol{x}\in\mathbb{R}^{N\times(P^{2}C)}$, where approximately 75% of the patches are Randomly Masked. The remaining unmasked patches $\tilde{x}$ are then fed into the Transformer Encoder $f_{\theta}(\cdot)$, which generates the features. These features, in conjunction with the masked patches, are input into the Transformer Decoder $g_{\omega}(\cdot)$ for the purpose of reconstructing the Pixels of the original image. The quality of the reconstruction is measured using the MSE loss function. MAE [78] is formulated as:

$$\frac{1}{\|\mathcal{M}\|}\|g_{\omega}(f_{\theta}(\tilde{x}))-\tilde{\boldsymbol{x}}\|^{2}. \tag{13}$$
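The random masking step of MAE can be illustrated with a short sketch that shuffles the patch indices and keeps the first quarter as visible. This is a plain-Python illustration under stated assumptions, not the reference implementation: the function name `random_masking`, the seed argument, and the patch count of 196 are chosen for the example only.

```python
import random

def random_masking(num_patches, mask_ratio=0.75, seed=None):
    """Return (visible_ids, masked_ids) for a random mask over patch indices."""
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)                      # random permutation of patch indices
    num_visible = int(num_patches * (1 - mask_ratio))
    visible = sorted(order[:num_visible])   # patches passed to the encoder
    masked = sorted(order[num_visible:])    # patches to be reconstructed
    return visible, masked

visible_ids, masked_ids = random_masking(num_patches=196, mask_ratio=0.75, seed=0)
```

Since only the visible indices are encoded, the encoder processes about a quarter of the patches, while the decoder receives the encoded features together with placeholders for the masked positions.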

iGPT [30] and MAE [78] represent two distinct basic frameworks within MIM research: iGPT [30] is based on the Auto-Regressive MIM research paradigm, while MAE is grounded in the Auto-Encoder paradigm. Both fit within the classification we proposed. The four modules of iGPT [30] can be classified as Basic Masking (Auto-Regressive Masking) + Transformer + Tokenizer + MIM Head, whereas MAE can be categorized as Basic Masking (Random) + Transformer + Pixel + MIM Head. Table II summarizes the differences between iGPT and MAE.

| Model | MAE | iGPT |
| --- | --- | --- |
| Mask | Basic (Random) | Basic (AR Mask) |
| Encoder | Transformer | Transformer |
| Target | Pixel | Token |
| Head | MIM Head (Transformer) | MIM Head (Linear) |
| Category | BTPM | BTTM |
| Type | AE | AR |
TABLE II: This table outlines four parts of iGPT and MAE, where iGPT and MAE represent two different types of research. iGPT is a generative model based on auto-regressive methods, while MAE is based on auto-encoders.
Categories: BTPM, BTFM, BTTM, BTTB, BTFB, BTPC, ATFB, ATPM, ATFM, BCPM, BTPB, BTFC, ATPC, ATFC
MAE[78] SimMIM[241] RePre[216] DMAE[231] RCMAE[118] RMAE[164] Hiera[190] BootMAE[47] SdAE[33] TTT-MAE[61] MaskVLM[112] MAE-lite[218]
CAE[32]
SIM[202]
dBOT[143]
MaskDistill[172]
CAE.V2[269]
FastMIM[73]
Data2Vec[6]
MFM[239]
MP3[18]
MaskFeat[224]
MultiMAE[4]
iGPT[30]
iBOT[279]
BEiT[12]
BEiT.V2[171]
BEiT.V3[220]
MaPeT[13]
RandSAC[92]
MaskGIT[20]
CIM[275]
mcBEiT[127]
MVP[225]
PeCo[46]
MAGE[125]
MaskCLIP[48]
Ge2AE[137]
ConMIM[253]
LayerGrafted[100]
SDMAE[102]
MST[130]
ADIOS[194]
UnMAE[128]
SemMAE[121]
LoMaR[26]
i-MAE[263]
ccMIM[268]
AutoMAE[25]
HPM[212]
I-JEPA[2]
MixMIM[139]
ObjMAE[229]
AttMask[104]
MILAN[87]
DMJD[154]
MaskAlign
data2vec2.0[5]
ConvNeXt.V2[228]
SparK[203]
ConvMAE[63]
CAN[161]
MSN[1]
ExtreMA[232]
MimCo[59]
FLIP[129]
MOMA[252]
D-iGPT[181]
CMAE[99]
ACLIP[251]
TABLE III: We conducted a comprehensive survey of research related to MIM and categorized them according to the four modules we proposed. We divided the Mask strategy into Basic Mask and Advanced Mask, the Encoder Architecture into CNN and Transformer, the learning Target into Pixel, Tokenizer, and Feature, and the Head into MIM Head, Contrastive Head, and their combination. We use the initials of each module to form a category name; for example, MAE is categorized as BTPM because it uses a Transformer as the encoder structure, a Random Mask as the masking strategy, a Pixel as the target, and MIM Head for reconstruction. Note that we only list the widely known methods for BTPM, BTFM, BTTM, and ATPM because they cover most of the existing MIM algorithms. Refer to Table IV for detailed information and categories.
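The naming rule for the categories can be written down as a small lookup. The sketch below is illustrative only: the dictionary keys are hypothetical labels, and mapping the combination of MIM Head and Contrastive Head to the initial B is inferred from category names such as BTTB rather than stated explicitly.

```python
# Initials for each module option (Mask, Encoder, Target, Head).
MASK = {"Basic": "B", "Advanced": "A"}
ENCODER = {"Transformer": "T", "CNN": "C"}
TARGET = {"Pixel": "P", "Tokenizer": "T", "Feature": "F"}
HEAD = {"MIM": "M", "Contrastive": "C", "Both": "B"}  # "Both" -> "B" is an inference

def category(mask, encoder, target, head):
    """Concatenate the initials in the order Mask, Encoder, Target, Head."""
    return MASK[mask] + ENCODER[encoder] + TARGET[target] + HEAD[head]

mae_category = category("Basic", "Transformer", "Pixel", "MIM")
igpt_category = category("Basic", "Transformer", "Tokenizer", "MIM")
```

With these labels, MAE maps to BTPM and iGPT to BTTM, matching the categories given for the two basic frameworks.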

4 Method

In this section, we will sequentially introduce the four essential modules for the MIM Framework, i.e., Mask Strategy, Targets, Architecture of the encoder, and MIM Head. Within each module, there are many studies; we will provide a more detailed classification and summary. After presenting the MIM Framework, we will discuss some research on MIM theory and several fundamental directions where MIM is applied, such as multimodality and large-scale models.

4.1 Masking Strategy

This subsection spotlights typical masking strategies employed in MIM. For classification purposes, we divide masking strategies into basic and advanced masking. Basic masking, which encompasses pixel-wise predictions based on AR models and the Random Mask introduced by MAE, has been elaborated upon in Sec. 3. Consequently, the ensuing discussion primarily focuses on Advanced Masking techniques. As illustrated in the accompanying figure, Advanced Masking can be further subdivided into four types: Hard Sampling, Mixture, Adversarial Mask, and Contextual Mask.

Remark: Despite their excellent performance, Mixture Mask and Adversarial Mask incur a higher computational cost. An attention-based mask strategy usually performs better in mining hard samples and costs less.

4.1.1 Hard Sampling

AttMask [104]: In the AttMask [104] framework, a teacher model $f_{\theta^{\prime}}$ is employed to extract the attention maps $\hat{a}$ and image features $f_{\theta^{\prime}}(\boldsymbol{x})$ from the input images $\mathbf{X}$ and patches $\boldsymbol{x}$. The student model $f_{\theta}$ then masks the regions with high attention scores in the attention maps. Subsequently, using the visible regions of the image $\tilde{\boldsymbol{x}}$, the student model predicts the masked regions, and the teacher model’s output $f_{\theta^{\prime}}(\boldsymbol{x})$ serves as a supervisory signal for learning. The reconstruction loss of AttMask can be formulated as:

$$\mathcal{L}_{\textrm{MIM}}=\sum_{v}\sum_{i\in\mathcal{N}}\mathbb{I}_{\{\mathcal{M}_{i}=0\}}f_{\theta}(\boldsymbol{x}^{v}\odot\mathcal{M})_{i}\log f_{\theta^{\prime}}(\boldsymbol{x}^{v}\odot\mathcal{M})_{i}. \tag{14}$$

Employing attentive masking, AttMask not only delivers excellent results but also has relatively lower computational overhead. In our classification, AttMask is categorized as Advanced Mask + Transformer + Features + MIM Head (ATFM).
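The idea of masking the most attended regions can be sketched as a ranking over per-patch attention scores. The snippet below is a simplified illustration in plain Python: the function name, the mask ratio of 0.5, and the hard top-k selection are assumptions made for the example, and the selection rule used by AttMask itself may differ in its details.

```python
def attention_guided_mask(attention, mask_ratio=0.5):
    """Mask the patches with the highest attention scores (1 = masked)."""
    num_masked = int(len(attention) * mask_ratio)
    # Patch indices ordered from most to least attended.
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    mask = [0] * len(attention)
    for i in ranked[:num_masked]:
        mask[i] = 1
    return mask

# Toy attention scores of a teacher for six patches.
attention = [0.05, 0.40, 0.10, 0.30, 0.02, 0.13]
mask = attention_guided_mask(attention, mask_ratio=0.5)
```

Compared with random masking, this selection hides the patches the teacher considers most informative, which makes the prediction task harder at the cost of one extra forward pass through the teacher.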

HPM [212] introduces a teacher-student framework. The teacher model $f_{\theta^{\prime}}$ predicts the reconstruction loss for each patch $x_{i}$, while the student model $f_{\theta}$ masks and reconstructs the image $\boldsymbol{x}$ using an “easy to hard” approach guided by the teacher model. In this process, the student model learns the mask set $\mathcal{M}$ with higher reconstruction loss. In our classification, HPM [212] is categorized as ATFM. The objective of HPM consists of a reconstruction loss and a prediction loss; the reconstruction loss is formulated as:

$$\mathcal{L}_{\textrm{rec}}=\frac{1}{\|\mathcal{M}\|}\sum_{\mathcal{M}}\|f_{\theta}(\boldsymbol{x}\odot{\mathcal{M}})-f_{\theta^{\prime}}(\boldsymbol{x})\|^{2}. \tag{15}$$

The $l_{1}$ distance and cross-entropy can also be used to measure the distance. The prediction loss can be formulated as:

$$\mathcal{L}_{\textrm{pred}}=\left(g_{\omega}(f_{\theta}(\boldsymbol{x}\odot\mathcal{M}))-\mathcal{L}_{\textrm{rec}}\right)^{2}\odot(1-\mathcal{M}). \tag{16}$$

Meanwhile, SemMAE [121] (Advanced Mask + Transformer + Pixel + MIM Head, ATPM) implements a semantic-based masking strategy through semantic information learned by ViT. MILAN [87] (ATFM) combines an attention mask with an online feature as the target. ObjMAE [229] (ATFM) proposes an object-wise mask strategy that discards non-object patches.

4.1.2 Mixture

MixedAE [28]: Building upon the framework proposed by MAE, MixedAE blends portions of different images as input to the network and enhances the model’s representational capacity by incorporating contrastive learning. MixedAE is categorized as ATPM. The loss function for this contrastive learning can be formulated as:

$$\mathcal{L}_{\textrm{con}}=-\log\frac{\exp(\cos(\boldsymbol{x}^{v_{a}}_{i},\boldsymbol{x}^{v_{b}}_{j})/\tau)}{\sum_{k=1}^{2N}\mathbb{I}_{\{k\neq i\}}\exp(\cos(\boldsymbol{x}^{v_{a}}_{i},\boldsymbol{x}^{v_{b}}_{k})/\tau)}. \tag{17}$$

MixMIM [139] (ATPM) utilizes both mixed masking and attention mask as masking methods and changes the network architecture to a hierarchical Transformer. i-MAE [263] (ATPM) designs a mixed masking strategy for its input and introduces a linear layer that separates the mixed input before reconstruction to improve performance.

4.1.3 Adversarial

ADIOS [194] (ATPM) combines MIM with adversarial learning. Generator $\mathcal{G}$ produces images with different masks based on the original image, while Discriminator $\mathcal{D}$ aligns the generated images with the original ones. The former aims to maximize the distance between the original image and the masked image, while the latter aims to minimize the distance between the restored image and the original image. Since ADIOS does not rely on the block construction of the Transformer, it can be implemented with CNN backbones. AutoMAE [25] (ATPM), on the other hand, introduces a Mask Generator based on the MAE architecture to generate different mask strategies. The Mask Generator’s learning objective is to select patches that are as difficult to reconstruct as possible, increasing the difficulty of the reconstruction task. In contrast, the encoder adaptively reconstructs the original image under the different masks.

4.1.4 Contextual Masking

UnMAE [128] (ATPM) proposes a Uniform Masking strategy in which the selection of the masked portion consists of two steps: Uniform Sampling and Secondary Masking. The former randomly samples a patch from each $2\times 2$ grid, while the latter randomly masks a portion of the already sampled area. Additionally, UnMAE supports a pyramid-structured Transformer architecture. This context-based masking approach can better extract local information. LoMaR [26] (ATPM), on the other hand, builds upon MAE by using small-window patches for local reconstruction prediction, improving efficiency and accuracy compared to MAE.
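The two steps of Uniform Masking described above can be sketched as follows. This is a plain-Python illustration rather than the original implementation: the function name, the grid size of 14 x 14 patches, and the secondary masking ratio of 0.25 are assumptions made for the example.

```python
import random

def uniform_masking(grid_h, grid_w, secondary_ratio=0.25, seed=None):
    """Sketch of Uniform Sampling followed by Secondary Masking.

    Returns (sampled, secondary): `sampled` holds one (row, col) patch per
    2x2 cell; `secondary` is the subset of `sampled` that is masked again.
    """
    assert grid_h % 2 == 0 and grid_w % 2 == 0
    rng = random.Random(seed)
    sampled = []
    for top in range(0, grid_h, 2):
        for left in range(0, grid_w, 2):
            # Uniform Sampling: keep exactly one of the four patches in this cell.
            sampled.append((top + rng.randrange(2), left + rng.randrange(2)))
    # Secondary Masking: mask a random portion of the sampled patches.
    num_secondary = int(len(sampled) * secondary_ratio)
    secondary = rng.sample(sampled, num_secondary)
    return sampled, secondary

sampled, secondary = uniform_masking(grid_h=14, grid_w=14, seed=0)
```

Because exactly one patch is kept per cell, the sampled patches cover the image evenly, which is what allows them to be rearranged into a smaller regular grid for pyramid-structured encoders.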

Figure 6: The types of the MIM target include three categories. For the Tokenizer, we divided it into Online and Offline sections. In the Feature section, we categorized Features into Low-level features, High-level features, and Fourier features. The Low-level features include Position and HOG features, while the High-level features comprise Online teacher, Offline teacher, and Feature distillation.

4.2 Different Targets

In this subsection, we will delve into the targets used during MIM training. For classification purposes, we categorize these targets into three main types: tokenizer, pixel, and features. Delving deeper, these categories can be further detailed, with comprehensive explanations provided in the accompanying Figure 6.

4.2.1 Raw Pixel

Raw Pixel is the most fundamental Target in MIM. Classic models like MAE and SimMIM [241] (BTPM) are based on Raw Pixel for image reconstruction. I-JEPA [2] (ATPM) also reconstructs through pixels. What sets it apart is that I-JEPA uses a Context Patch as the input for the Encoder, and the reconstruction target is the three different patches adjacent to the Context Patch. By reconstructing through the Context Patch, I-JEPA can achieve better contextual representation capabilities while also reducing computational overhead.

4.2.2 Tokenizer

The Tokenizer converts low-level features (pixels) into higher-level semantic features (visual tokens), transferring the approach of natural language processing to computer vision and improving the model’s representational power.

A tokenizer is a mapping function $\boldsymbol{q}_{\phi}(\boldsymbol{z}|\boldsymbol{x})$ that encodes an image $\mathbf{X}\in\mathbb{R}^{H\times W\times C}$ into $z=z_{1},\dots,z_{|\mathcal{V}|}\in\mathcal{V}^{h\times w}$, where the vocabulary $\mathcal{V}=\{1,\dots,|\mathcal{V}|\}$ contains token indices. These latent variables represent high-level semantic features of certain parts of the image. Hence, we can represent an image based on the dictionary $\mathcal{V}$, which can be used as the supervisory signal for MIM. The tokenizer $\boldsymbol{q}_{\phi}(\boldsymbol{z}|\boldsymbol{x})$ maps image pixels $x$ into discrete tokens $z$ according to a visual codebook [207] (i.e., vocabulary), and the decoder $\boldsymbol{p}_{\psi}(\boldsymbol{x}|\boldsymbol{z})$ learns to reconstruct the image based on the visual tokens $\boldsymbol{z}$ [12]. The learning objective of the tokenizer can be formulated as:

$$\max\mathbb{E}_{\boldsymbol{z}\sim\boldsymbol{q}_{\phi}(\boldsymbol{z}|\boldsymbol{x})}\left(\log\boldsymbol{p}_{\psi}(\boldsymbol{x}|\boldsymbol{z})\right) \tag{18}$$

Methods for training the tokenizer include dVAE and VQ-GAN [50].
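To illustrate how a visual codebook turns continuous patch features into discrete token indices, the sketch below assigns each feature to its nearest codebook entry. This nearest-neighbor lookup follows the general vector-quantization style and is an assumption made for illustration; it is not the dVAE procedure, and the function name, codebook, and features are toy values.

```python
def tokenize(patch_features, codebook):
    """Map each patch feature to the index of its nearest codebook entry."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [
        min(range(len(codebook)), key=lambda k: sq_dist(feat, codebook[k]))
        for feat in patch_features
    ]

codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]   # |V| = 3 toy code vectors
patch_features = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8], [1.1, 0.9]]
tokens = tokenize(patch_features, codebook)
```

The resulting indices play the role of the visual tokens $z$: the encoder is later trained to predict them at masked positions instead of regressing raw pixel values.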

BEiT [12] (Basic Mask + Transformer + Tokenizer + MIM Head, BTTM): BEiT introduces the concept of tokens in the visual domain. The training of BEiT is divided into two stages. The first stage involves training the tokenizer using dVAE [12], and the second stage involves training the model’s encoder using the tokenizer. In the first stage, BEiT discretely encodes an image $\mathbf{X}\in\mathbb{R}^{H\times W\times C}$ into $z=z_{1},\dots,z_{N}\in\mathcal{V}^{h\times w}$, where the vocabulary $\mathcal{V}=\{1,\dots,|\mathcal{V}|\}$ contains discrete token indices. After the tokenizer is pre-trained, the encoder $f$ encodes the unmasked regions of an image, and the encoded features are then passed through the MIM Head, with discrete image tokens serving as the supervision signal for learning. The learning objective of BEiT [12] can be formulated as:

$$\max\sum_{\boldsymbol{x}\in\mathcal{D}}\mathbb{E}_{\mathcal{M}}\left[\sum_{i\in\mathcal{N}}\mathbb{I}_{\{\mathcal{M}_{i}=0\}}\log p_{\text{MIM}}(z_{i}\,|\,\boldsymbol{x}\odot\mathcal{M})\right], \tag{19}$$

where $\mathcal{D}$ denotes the training corpus.
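As a rough illustration of the objective in Eq. (19), the sketch below evaluates its negated, per-image form: the average negative log-likelihood of the tokenizer ids at masked positions. The function name, the list-based inputs, and the 0-based token ids are assumptions made for this example, not BEiT's implementation.

```python
import math

def masked_token_nll(logits, token_ids, mask):
    """Average negative log-likelihood of tokenizer ids at masked positions.

    logits:    one list of |V| scores per patch (output of a hypothetical MIM head)
    token_ids: tokenizer id z_i per patch, here 0-based (assumption)
    mask:      M_i per patch, 0 = masked (predicted), 1 = visible (ignored)
    """
    total, count = 0.0, 0
    for scores, z_i, m_i in zip(logits, token_ids, mask):
        if m_i != 0:
            continue  # indicator I{M_i = 0}: only masked patches contribute
        peak = max(scores)  # subtract the max for a numerically stable log-softmax
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += -(scores[z_i] - log_norm)
        count += 1
    return total / max(count, 1)
```

Minimizing this quantity over a dataset corresponds to maximizing the log-likelihood term inside the expectation; visible patches never enter the loss.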

iBOT [279] (BTTM): The authors formulate MIM as a knowledge-distillation task and perform self-distillation in a teacher-student framework. The online tokenizer, jointly optimized with the MIM objective, progressively captures high-level visual semantics and eliminates the need for a separate tokenizer pre-training stage. The teacher model is updated from the student model with an exponential moving average (EMA):

$$\boldsymbol{\theta}^{\prime}\leftarrow m\boldsymbol{\theta}^{\prime}+(1-m)\boldsymbol{\theta}. \tag{20}$$
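The EMA rule in Eq. (20) can be sketched in a few lines. Here the teacher and student parameters are flat lists of floats and the default momentum value is an illustrative assumption; a real model would apply the same rule tensor by tensor.

```python
def ema_update(teacher, student, m=0.996):
    """Return new teacher parameters: theta' <- m * theta' + (1 - m) * theta.

    teacher, student: flat lists of floats of equal length (assumed layout).
    m: momentum in [0, 1]; the default is only an illustrative choice.
    """
    return [m * t + (1.0 - m) * s for t, s in zip(teacher, student)]
```

With `m` close to 1 the teacher changes slowly, so it acts as a smoothed history of the student.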

Building on the framework of BEiT, BEiTv2 [171] (BTTM) employs distillation on VQ to transform the discrete semantic space into compact codes. Building further upon BEiTv2, BEiTv3 [220] (BTTM) integrates MoE and multimodality to design specialized tokenizers for vision, language, and vision-language tasks, and scales up the model. PeCo [46] (BTTM) uses a perceptual prediction target to train a perceptual codebook. mc-BEiT [127] (BTTM) represents a masked patch with a soft probability vector instead of a unique token id. CIM [53] (BTTM) proposes an encoder-enhancer architecture in which a small pre-trained BEiT is used as the encoder and a CNN-based model can serve as the enhancer. CIM is trained with either a pixel reconstruction loss or a GAN loss.

4.2.3 Low-Level Features

Some models use low-level image features as training targets, which helps to enhance the model’s learning of detailed information in images. These low-level image features typically come in three types: HOG Features, positional information of the image, and Fourier Features.

HOG Features. MaskFeat [224] (Basic Mask + Transformer + Feature + MIM Head, BTFM) proposes a framework based on MAE. Notably, the supervision signal for training the model is derived from the HOG features of the original image. FastMIM [73] (BTFM) designs a hierarchical Transformer and uses HOG features as the target.
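To make the notion of a HOG target concrete, the following simplified sketch builds a gradient-orientation histogram for a single grayscale patch. It omits the cell and block normalization of full HOG, and the function name, the 9-bin default, and the nested-list input are assumptions for illustration rather than any paper's implementation.

```python
import math

def orientation_histogram(patch, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations.

    patch: grayscale values as an H x W nested list (assumed input format).
    Only interior pixels are used, with central differences as gradients.
    """
    height, width = len(patch), len(patch[0])
    hist = [0.0] * bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % math.pi  # unsigned orientation in [0, pi)
            index = min(int(angle / math.pi * bins), bins - 1)
            hist[index] += magnitude
    return hist
```

A model regressing such histograms for masked patches is supervised by local edge structure rather than raw pixel values.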

Position. DILEMMA [191] (BTFM) employs a teacher model to generate position encodings. The student model is trained to predict new positions and to judge whether each prediction is correct. MP3 [261] (BTFM) trains a masked transformer to predict the position of patches using MAE as a loss function. SDMAE [243] (ATFM) combines position prediction loss, pixel loss, and global contrastive loss to train its backbone. DropPos [211] (BTFM) randomly selects a subset of patches and replaces their positional encodings with mask tokens. The positional encodings are then reconstructed.

Fourier Features. Models that use Fourier features can generally be divided into two main categories. Calculating the loss in the Fourier domain: Ge2AE [137] (Basic Mask + Transformer + Feature + Both Head, BTFB) reconstructs in the Fourier domain while computing both a contrastive loss and a reconstruction loss. A2MIM [124] (BCFM) uses the intermediate-layer features of CNN-based and ViT-based encoders to reconstruct the ground truth in both the spatial domain and the frequency domain. The discrete Fourier transform of each channel is defined as:

$$\mathcal{F}_{(u,v)}=\sum_{H,W}x(h,w)\,e^{-2\pi j\left(\frac{uh}{H}+\frac{vw}{W}\right)}. \tag{21}$$
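Eq. (21) can be evaluated directly for a small channel. The naive sketch below assumes the sum runs over all pixel positions with `h` and `w` indexed from 0, which is the usual DFT convention but is not stated in the equation; it is meant for checking the definition, not for speed.

```python
import cmath

def dft2(x):
    """Naive 2D DFT of one channel given as an H x W nested list.

    Returns F with F[u][v] = sum_{h,w} x[h][w] * exp(-2*pi*j*(u*h/H + v*w/W)),
    using 0-based h and w (assumed convention).
    """
    height, width = len(x), len(x[0])
    return [
        [
            sum(
                x[h][w] * cmath.exp(-2j * cmath.pi * (u * h / height + v * w / width))
                for h in range(height)
                for w in range(width)
            )
            for v in range(width)
        ]
        for u in range(height)
    ]
```

For the 2 x 2 input `[[1, 2], [3, 4]]`, the entry `F[0][0]` is the plain sum of all pixels, which illustrates that the zero frequency carries the mean intensity.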

The learning objective in the frequency domain can be formulated as follows:

$$\mathcal{L}_{freq}=\sum_{C,H,W}\omega\,\big\lVert\mathcal{F}\big(x\odot\mathcal{M}+\mathrm{de}(x)\odot(1-\mathcal{M})\big)-\mathcal{F}(x)\big\rVert, \tag{22}$$

where $\omega=\omega(u,v)$ is a dynamic frequency weighting matrix. Masking in the Fourier domain: MFM [239] (BTFM) masks in the frequency domain, adds noise, and then reconstructs the image. MSCN [101] (BTFM) masks in the frequency domain, integrates contrastive learning, and employs a contrastive loss. PixMIM [145] (BTFM) reconstructs the image in both the spatial and frequency domains.

4.2.4 High-Level Features

Some models take high-level features extracted from images as training targets. Research in this area is often associated with knowledge distillation or teacher models, using the outputs of a teacher model or distilled image features as training targets. These methods can typically be categorized into offline teacher, online teacher, and feature distillation.

Offline Teacher. MILAN [87] (ATFM) uses CLIP [176] to generate attention maps that guide the masking and to produce the features used as the target. MOMA [252] (Basic Mask + Transformer + Feature + Contrastive Head, BTFC) builds upon MAE and uses the features of multiple pre-trained teachers as the prediction target. Img2vec [168] (BTFM) uses a pre-trained ConvNet as the teacher model to extract features. Based on the MAE framework, it reconstructs patches and combines contrastive learning to compute a global loss. TinyMIM [182] (BTFM) finds that using the intermediate-layer features of the teacher model often yields better results, with a smaller gap to downstream tasks. As a result, TinyMIM uses these intermediate-layer features as targets.

Online Teacher. data2vec [6] (BTFM) uses contextualized representations from an online teacher model and covers several modalities, including speech, natural language processing, and computer vision. data2vec updates the teacher’s parameters with EMA:

$$\boldsymbol{\theta}^{\prime}\leftarrow\tau\boldsymbol{\theta}^{\prime}+(1-\tau)\boldsymbol{\theta} \tag{23}$$

data2vec.v2 [5] (ATFM), building on the foundation of data2vec, introduces a multi-mask training method to enhance efficiency and reduce computational costs. dBOT [143] (BTFM), based on iBOT, has designed a multi-stage distillation scheme, concluding that teacher models with different parameters tend to have consistent performance in student models after multi-stage distillation. BootMAE [47] (BTPM), while using online features as prediction targets, also adds the task of reconstructing image pixels. Unlike directly calculating the loss between features, RC-MAE [118] (BTPM) inputs the masked image into two transformer encoders with EMA-updated parameters. It then computes the contrastive loss of the reconstructed image, supplemented by a task of pixel-level image reconstruction. MaskDistill [172] (BTFM) MaskCLIP [48] (Basic Mask + Transformer + Feature + Both Head, BTFB) integrates multiple techniques, including MIM, multi-modality, online features, and contrastive learning.

Feature Distillation. DMJD [154] (ATFM) introduces a disjoint mask and simultaneously trains the encoder with feature distillation and prediction reconstruction. CAE.v2 [269] (BTFM) distills CLIP and is supplemented with a task to reconstruct CLIP features. SdAE [33] (BTPM) delves into creating effective views for the teacher branch and proposes a multi-fold masking strategy to reduce computational complexity.

4.3 Different Network Architecture

Compared to the traditional Transformer, Hierarchical Transformers often have lower computational overhead, faster training speeds, and better generalization capabilities on downstream tasks. As a result, some research focuses on enhancing the encoder structure to improve the operational efficiency of MIM or devising novel masking techniques to make MIM adaptable to various network architectures.

Transferring the encoder to a hierarchical vision Transformer: GreenMIM [94] (BTPM) feeds the masked image $\mathbf{X}\odot\mathcal{M}$ into a hierarchical Transformer encoder. At each stage, Group Window Attention performs self-attention on the masked image within groups. To avoid unnecessary computation in areas that are masked or contain no useful information, sparse convolution is introduced to discard invisible patches and process only the visible ones when performing patch merging, similar to Figure 7. The processed data is then fed into the Transformer decoder to reconstruct the original image. In this way, GreenMIM reaches a computational efficiency comparable to MAE [78] while maintaining performance comparable to SimMIM [241]. HiViT [271] (BTPM) removes local inter-unit operations, resulting in structurally simple hierarchical vision transformers. Hiera [190] (BTPM) eliminates the need for many of the complex components found in other hierarchical vision transformers and achieves superior accuracy. ConvMAE [63] (Basic Mask + CNN + Pixel + MIM Head, BCPM) proposes a multi-scale hybrid convolution-transformer encoder, and employs masked convolution to prevent information leakage in the convolution blocks and a block-wise mask to reduce the computational cost. SparseMAE [276] (BCPM) introduces sparse MHSA and FFN blocks for sparse pre-training.

Make MIM Compatible with Convolutional Neural Networks: CIM [53] (Basic Mask + CNN + Tokenizer + MIM Head, BCTM) employs an auxiliary generator equipped with a compact trainable BEiT to corrupt the input images, thereby enhancing the network’s capability to either restore the original image pixels or predict whether each visual token has been replaced by a sample from the generator. Due to CIM’s approach of using an auxiliary generator to corrupt the input, there’s no need for specific input formats or preprocessing, which is compatible with CNNs. Since the objective of the enhancing network is to either restore the original image or predict if each visual token has been replaced by a generator sample, only forward propagation is required, ensuring compatibility with CNN architectures. A2MIM [124] (BCFM) introduces a unified architecture compatible with both Transformers and CNNs. Specifically, A2MIM posits that masking at the block embedding layer aligns well with the attention mechanism of Transformers, offering robustness against occlusion. For CNNs, masking at the network’s input stages leads to low-order interactions, undermining CNN’s context extraction capability. Therefore, the authors suggest masking intermediate features encompassing semantic and spatial information, allowing the mask token to encode interactions with a moderate number of tokens.

Figure 7: Illustration of MIM for CNN architectures with the sparse convolutions and masking [228, 203], where the encoder only aggregates information of visible tokens. The figure is reproduced from [228].

Specially designed for Convolutional Neural Networks: Spark [203] (BCPM) identifies two reasons why MIM cannot be directly implemented on CNNs: convolutional operations cannot handle irregular, randomly masked input images, and the single-scale nature of BERT pre-training is fundamentally inconsistent with the hierarchical structure of convolutional networks. To address this, Spark proposes treating unmasked pixels as 3D point clouds and using sparse convolution for encoding, allowing the model to operate on irregularly masked images, as in Figure 7. To integrate the hierarchical structure of convolutional networks, the authors introduce a hierarchical decoder that reconstructs images from multi-scale features. They validate this approach on traditional convolutional neural network models such as ResNet and ConvNeXt, and its performance shows significant improvements compared to contrastive learning and Transformer-based MIM. ConvNext.v2 [228] (BCPM) introduces a fully convolutional masked auto-encoder. Its core structure is based on ConvNext, where the convolution operation is transformed into sparse convolution. The decoder employs a lightweight ConvNext block, which simultaneously processes encoded pixels and masked tokens for image reconstruction, effectively migrating MIM to the CNN structure. Additionally, ConvNext.v2 proposes a Global Response Normalization layer that normalizes the feature map on each channel and can handle batches of any size. The framework of ConvNext.v2 is shown in Figure 7.

Figure 8: The types of MIM Head include Linear or MLP, Transformer, or a combination of CNN and Transformer. The Contrastive Head section is categorized based on the algorithm type into Token-level and Global-level.

4.4 Head

The choice of the head in MIM varies significantly. In our classification, we distinguish three categories of heads: Contrastive Head, MIM Head, and a combination of both. Both the MIM Head and the Contrastive Head can have diverse internal architectures, which are illustrated in Figure 8. In the following, we discuss the MIM Head and the Contrastive Head separately.

4.4.1 MIM Decoder

Linear or MLP: SimMIM [241] (BTPM) essentially adopts the framework of MAE but with several significant modifications. In SimMIM, the encoder processes both the visible patches and the masked tokens simultaneously. Remarkably, SimMIM’s decoder achieves satisfactory results using just a linear prediction head. A detailed comparison between SimMIM and MAE can be found in Figure 9. Other models that use linear heads include BEiT [12], BEiT.v2 [171], and data2vec [6], among others.

Transformer Decoder: The Transformer Decoder is the most widely used in MIM. More details are represented in Figure 8.

Combined Transformer with CNN: LocalMAE [213] (BTFM) employs intermediate features from multiple stages for multi-scale reconstruction. In the reconstruction segment, LocalMAE introduces a Transformer-Deconvolution-MLP architecture for the task.

Remark: One might wonder why certain models achieve good reconstruction results with just a simple, lightweight Linear Head, while others require a more intricate Transformer decoder. The key lies in whether the input to the encoder includes the masked tokens. If the patches fed to the encoder include the masked tokens, these tokens can interact with the visible patches within the encoder. This interaction allows the encoder to capture certain image information early on, making it feasible to reconstruct the original image effectively with just a Linear Head. Conversely, if the encoder does not receive the masked tokens, these tokens need to interact with the visible patches within a more complex Transformer decoder to reconstruct the original image. Figure 9 compares SimMIM and MAE in detail.
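The difference described in this remark shows up directly in the length of the encoder input. The sketch below builds both variants for one image, using patch indices as stand-ins for patch embeddings; the function name, the `"[MASK]"` placeholder, and the fixed seed are illustrative assumptions.

```python
import random

def encoder_inputs(num_patches, mask_ratio, seed=0):
    """Build MAE-style and SimMIM-style encoder inputs for one image.

    Patch indices stand in for patch embeddings; "[MASK]" stands in for a
    learnable mask token (both are placeholders for illustration).
    """
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    # MAE-style: the encoder only receives the visible patches.
    mae_input = [i for i in range(num_patches) if i not in masked]
    # SimMIM-style: masked positions stay in the sequence as mask tokens.
    simmim_input = ["[MASK]" if i in masked else i for i in range(num_patches)]
    return mae_input, simmim_input
```

With 196 patches and a mask ratio of 0.75, the MAE-style encoder sees 49 tokens while the SimMIM-style encoder sees all 196, which is why the former leaves the interaction between mask tokens and visible patches to the decoder.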

Figure 9: The most significant difference between SimMIM and MAE lies in whether the input to the encoder includes the masked tokens and the structure of the MIM Head. An in-depth explanation of this aspect can be found in the designated Sec. 4.4.1.
| Model | MAE | SimMIM |
| --- | --- | --- |
| Mask | Random | Random |
| Encoder | Transformer | Transformer |
| Target | Raw Pixel | Raw Pixel |
| Input | Visible | Visible and Masked |
| Head | Transformer | Linear |
| Method | Auto-Encoder | Auto-Encoder |
Figure 10: Two categories of MIM methods combined with contrastive learning: token-level and global-level contrasting heads. For the token-level head, encoded tokens are subjected to an MLP Projection to compute contrastive learning loss. The global-level head aggregates global information on MIM targets and tokens before calculating the contrastive loss.

4.4.2 Combined with Contrastive Head

There are typically two approaches to combining contrastive learning and masked image modeling. The first incorporates masked images as a data augmentation technique and applies them within the contrastive learning framework to benefit contrastive learning. The second uses the standard masked image modeling framework and adds contrastive learning objectives in the prediction head to benefit masked image modeling. In this section, we detail both lines of work and elaborate on the network architecture of the contrastive prediction head.

Mask as Data Augmentation: MSN [1] (BTFC) utilizes masked images as a data augmentation technique and incorporates them into the framework of PCL [122]. MSCN [101] (BTFM) and Mimco [59] (BTFC) incorporate masked images as data augmentation into the frameworks of SimCLR and BYOL contrastive learning, respectively, to benefit contrastive representation learning. This achieves an integration of masked modeling and contrastive learning.

Add Contrastive Loss: This line of work builds upon masked modeling and incorporates a contrastive prediction head by adding or replacing the original MIM head. It can be categorized into two groups: token-level contrastive learning and global-level contrastive learning. Details are illustrated in Figure 10. Token Level Contrastive: ConMIM [253] (Basic Mask + Transformer + Pixel + Contrastive Head, BTPC) utilizes two transformer encoders, one for masked images and another for unmasked images. The branch that takes the masked images as input predicts the original images. The features obtained from the prediction are contrasted with those from the unmasked images through contrastive learning. The contrastive loss can be written as:

$$\mathcal{L}_{\text{con}}(x)=-\log\frac{\exp\left(\langle f(\boldsymbol{x}_{i}),\boldsymbol{x}_{j}\rangle/\tau\right)}{\sum_{k=1}^{2N}\mathbb{I}_{\{k\neq i\}}\exp\left(\langle f(\boldsymbol{x}_{i}),\boldsymbol{x}_{k}\rangle/\tau\right)}, \tag{24}$$
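A direct, unbatched reading of Eq. (24) is sketched below. The prediction `pred_i` plays the role of $f(\boldsymbol{x}_{i})$, `targets` holds the $2N$ candidate vectors, and `j` indexes the positive; the function name, the list-based vectors, the default temperature, and the requirement `j != i` are assumptions for this example.

```python
import math

def contrastive_loss(pred_i, targets, i, j, tau=0.1):
    """-log( exp(<pred_i, x_j>/tau) / sum_{k != i} exp(<pred_i, x_k>/tau) ).

    pred_i:  list of floats, stands for f(x_i)
    targets: list of vectors x_k (0-based k here)
    i, j:    anchor index (excluded from the denominator) and positive index
    """
    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))

    sims = [dot(pred_i, t) / tau for t in targets]
    others = [s for k, s in enumerate(sims) if k != i]
    peak = max(others)  # log-sum-exp trick for numerical stability
    log_denominator = peak + math.log(sum(math.exp(s - peak) for s in others))
    return -(sims[j] - log_denominator)
```

The loss decreases when the prediction is more similar to its positive than to the remaining candidates, and the temperature `tau` controls how sharply the candidates are separated.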

Global Level Contrastive: ccMIM [268] (ATPM) employs an attention mechanism to rank each patch in the image $x$ and selects the more challenging patches as the masked set $\mathcal{M}$ for reconstruction. Subsequently, global-level contrastive learning is performed on the CLS token. CAN [161] (Basic Mask + Transformer + Pixel + Both Head, BTPB) adds Gaussian noise to the masked images. Building upon MAE, it performs pooling before reconstructing the image and computes a global-level contrastive loss.

Architecture of Contrastive Head: The Contrastive Head is consistent with the head in classic contrastive learning architectures, often consisting of multiple MLP or FFN layers. These layers typically have an appended BN layer, as seen in models like SimCLR [31] and BYOL [71]. A characteristic feature of these heads is that they often upscale the dimensions, having a larger number of channels. Some Contrastive Heads use a Transformer Decoder, in which case design considerations usually revolve around the depth and width of the Transformer blocks.

4.5 Theoretical Foundation

Supervised learning, often referred to as statistical learning methods, typically possesses profound mathematical theoretical guarantees, providing precise mathematical conditions under which learning is assuredly successful. Training and test datasets usually stem from the assumption of independent and identically distributed statistics. As the number of training iterations increases, one can often achieve lower training and test losses. This is because supervised learning is relatively straightforward. In contrast, unsupervised learning lacks the simple and intuitive theoretical guarantees present in supervised learning. Intuitively, we believe that the essence of unsupervised learning is a form of information compression. The compression algorithms learned from the training set represent the universal knowledge and structure inherent within the data. The way to evaluate these compression algorithms is to determine whether they extract all the knowledge from unlabeled data, i.e., whether they provide as much assistance as possible and yield the maximum benefit. We will elucidate and summarize the theoretical foundations of Masked modeling from three perspectives.

From Contrastive Learning: Layer Grafted [100] (Random + Transformer + Pixel + Contrastive Head, RTPC) finds that MIM and CL are suitable for lower and higher layers, respectively. Following [257], the authors design a gradient surgery experiment that computes the cosine similarity between the gradients of the two tasks, and verify that the MIM loss and the CL loss have different optimization targets. The cosine similarity can be defined as:

$$\boldsymbol{C}_{\text{MIM},\text{CL}}(x)=\frac{\nabla_{\theta}L_{\text{MIM}}(x)^{T}}{\left\|\nabla_{\theta}L_{\text{MIM}}(x)\right\|}\,\frac{\nabla_{\theta}L_{\text{CL}}(x)}{\left\|\nabla_{\theta}L_{\text{CL}}(x)\right\|}. \tag{25}$$
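Eq. (25) is the ordinary cosine similarity applied to two flattened gradient vectors. A minimal sketch, assuming both gradients are given as non-zero flat lists of floats (the function name is hypothetical):

```python
import math

def gradient_cosine(grad_mim, grad_cl):
    """Cosine similarity between two flattened gradient vectors.

    Assumes both gradients are non-zero lists of floats of equal length.
    """
    dot = sum(a * b for a, b in zip(grad_mim, grad_cl))
    norm_mim = math.sqrt(sum(a * a for a in grad_mim))
    norm_cl = math.sqrt(sum(b * b for b in grad_cl))
    return dot / (norm_mim * norm_cl)
```

A value near 1 means the two losses push the shared parameters in the same direction, a value near 0 means they are unrelated, and a negative value means they conflict.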

They propose a "sequential cascade" approach in which the early layers are first trained under the MIM loss, and the later layers then continue to be trained under the CL loss. That is:

$$\mathcal{L}_{\text{MIM}}\rightarrow\mathcal{L}_{\text{CL}}. \tag{26}$$

[264] demonstrates that the mask loss is lower-bounded by the align loss of contrastive learning, which suggests that minimizing the mask loss also implicitly aligns features as in contrastive learning:

$$\mathcal{L}_{\text{MAE}}\geq\frac{1}{2}\mathcal{L}_{\text{align}}-\epsilon+\text{const}. \tag{27}$$

Subsequently, a uniform loss, akin to that in contrastive learning, is incorporated into the mask loss.

From Masking: [107] models MIM as a hierarchical latent variable model. The objective of MIM is to recover the latent variable $z$ shared between visible patches and invisible patches based on the lower-level visible patches. This latent variable encapsulates the information shared between the visible patch and the invisible portions. Both a very low mask ratio and an extremely high mask ratio tend to make the model focus on recovering low-level latent variable information, making it challenging to learn higher-level semantic features. Therefore, the mask ratio in MAE, which lies between the two, can assist the model in capturing higher-level latent variable information, enhancing its representation capability.

From Empirical Study: Many studies have explored the characteristics of masked image modeling through extensive experiments and obtained valuable conclusions. [240] and [111] verified that, compared to other self-supervised methods such as jigsaw puzzles and image inpainting, masked image models demonstrate better transferability and superior performance on tasks like pose estimation, depth prediction, video object tracking, and object detection. [242] showed that masked models tend to underperform and are prone to overfitting on small datasets, while their performance improvement accelerates as the dataset grows larger. [108] suggested that the efficacy of masked image modeling stems largely from the masking operation itself, while different masking strategies contribute limited improvements.

We summarize some conclusions:

  1. From Contrastive Learning: Unlike contrastive learning, MIM tends to capture low-level features and exhibits a strong local bias, while contrastive learning leans towards high-level features. This naturally explains why the development of contrastive learning preceded MIM. Before the advent of ViT, the dominant architecture in the visual domain was CNN, which inherently has a strong local bias. This made it complementary to contrastive learning, enhancing each other’s strengths. However, both MIM and CNN share this pronounced local bias, leading to suboptimal performance of MIM on CNN architectures. With the rise of ViT, which emphasizes capturing global information, it pairs better with MIM, propelling MIM to the forefront of SSL algorithms.

  2. From Masking: Masking is the most fundamental and crucial technique in MIM. Compared to NLP, the visual domain typically employs a higher mask ratio. This is because, in contrast to language, image information is more redundant. A small masking ratio doesn’t significantly impact the overall semantic understanding of an image. Therefore, a larger mask ratio is used to obscure some of the image’s key information, increasing the difficulty of the reconstruction task and enabling the model to learn more robust representations.

  3. From Empirical Study: Models based on MIM exhibit certain characteristics and preferences. For instance, they rely more on large-scale data for training and tend to learn better representations with larger datasets. Masked modeling performs better on tasks that require more detailed visual information, such as video object tracking and pose estimation. These tasks demand the model’s ability to capture low-level information.

4.6 Auto-Regressive For Generation

The majority of MIM research is based on auto-encoding (AE) for generative SSL; however, auto-regressive (AR) modeling has always been one of the important methods in generative self-supervision. Consequently, a significant body of research combines autoregressive generation with MIM, achieving both representation learning and generative tasks. In this section, we introduce classic autoregressive generative model architectures and then discuss research paradigms that integrate representation learning with autoregression. Figure 11 provides a detailed comparison of these two research paradigms.

Figure 11: Research on autoregression (AR) for generation and pre-training can be summarized by this flowchart. Some studies focus on improving the quality and speed of image generation, while others combine pre-training with image generation, performing further operations in the latent space. The figure is reproduced based on [50, 255, 125].

4.6.1 VQ-Based Generation

Vector Quantization (VQ) is a significant technique in generative models: it quantizes the continuous feature representations output by the encoder into discrete vectors in a codebook. It was first introduced by VQ-VAE, and subsequent autoregressive generative models have been developed based on the VQ-VAE framework.

VQ-VAE [167] introduces a Generative Framework that encompasses both generation and training processes. During training, VQ-VAE encodes image pixels into feature vectors, searching for the token in the codebook that is closest to the feature vector. The image is then reconstructed through the decoder. Therefore, the training loss includes the quantization loss of the vectors and the reconstruction loss:

$$\mathcal{L}_{\text{VQ-VAE}}=\|x-g(v_{q})\|^{2}+\|sg[f(x)]-v_{q}\|^{2}+\beta\|f(x)-sg[v_{q}]\|^{2}. \tag{28}$$

where $\beta$ is a hyperparameter used to control the weights of the two losses. The generation process involves producing feature vectors through PixelCNN, followed by vector quantization of these feature vectors, and then generating new images via the decoder.
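The quantization step and the value of the loss in Eq. (28) can be sketched as follows. The stop-gradient operator $sg[\cdot]$ only changes which parameters receive gradients, so in a forward-only computation the second and third terms share the same squared distance. The function names, the list-based vectors, and the default `beta` are assumptions for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def nearest_code(feature, codebook):
    """Index of the codebook vector closest to the encoder feature f(x)."""
    return min(range(len(codebook)), key=lambda k: squared_distance(feature, codebook[k]))

def vq_vae_loss_value(x, x_rec, feature, code, beta=0.25):
    """Forward value of reconstruction + codebook + beta * commitment terms.

    x_rec stands for the decoder output g(v_q); code is the selected v_q.
    Stop-gradient has no effect on the value, only on backpropagation.
    """
    reconstruction = squared_distance(x, x_rec)
    codebook_term = squared_distance(feature, code)    # ||sg[f(x)] - v_q||^2
    commitment_term = squared_distance(feature, code)  # ||f(x) - sg[v_q]||^2
    return reconstruction + codebook_term + beta * commitment_term
```

In an autodiff framework the two identical-looking terms differ in what they update: the codebook term moves the code towards the feature, while the commitment term moves the encoder output towards the code.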

Subsequent research based on VQ-VAE has two main focuses: one is to improve the training process to enhance the quality of image generation, and the other is to improve the generation process to increase the speed of image generation.

Improve Generation Quality: VQ-GAN [50] builds on the VQ-VAE architecture and uses GPT-2 as the generator to produce discrete encodings. To enhance the reconstruction performance of the decoder, an adversarial loss is added to the reconstruction loss. The learning objective thus consists of the reconstruction loss and the adversarial loss, and can be formulated as:

$$\mathcal{Q}^{*} = \min_{f,g,\mathcal{V}} \max_{\mathcal{D}} \mathbb{E}_{x\sim p(x)}\Big[\mathcal{L}_{\textrm{VQ}}(f,g,\mathcal{V}) + \lambda\mathcal{L}_{\textrm{GAN}}(\{f,g,\mathcal{V}\},\mathcal{D})\Big], \tag{29}$$

The generation process is based on GPT-2 and can be formulated as:

$$\max_{\theta} p_{\theta}(\boldsymbol{v}) = \sum_{t=1}^{T}\log p_{\theta}(\boldsymbol{v}_{t}|\boldsymbol{v}_{1:t-1}). \tag{30}$$

Improve Generation Speed: Based on the VQ-VAE and VQ-GAN, MaskGIT [20] learns to predict randomly masked tokens by attending to tokens from all directions. In the inference stage, the model initially generates all tokens of the image simultaneously and subsequently refines the image iteratively based on prior generations. RandSAC [92] adopts a strategy of segmenting tokens into hierarchical sections. Within each section, it employs a parallel prediction mechanism akin to BERT [144], while between different sections, it utilizes a sequential prediction approach reminiscent of GPT [91]. By randomizing the sequencing of sections and leveraging parallel training, it significantly enhances computational efficiency.
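The iterative refinement described for MaskGIT can be sketched roughly as follows: all tokens start masked, every position is predicted in parallel at each step, the most confident predictions are committed, and the rest are re-masked. The cosine schedule, the confidence-based selection rule, and the names `cosine_schedule`, `iterative_decode`, and `toy_predict` are assumptions of this sketch; `predict` stands in for a bidirectional Transformer and is replaced here by a hypothetical toy function.

```python
import math
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder for a masked token


def cosine_schedule(step: int, total_steps: int, num_tokens: int) -> int:
    """Number of tokens that stay masked after `step` (1-indexed) steps."""
    ratio = math.cos(math.pi / 2 * step / total_steps)
    return int(math.floor(ratio * num_tokens))


def iterative_decode(
    predict: Callable[[List[Optional[int]]], List[Tuple[int, float]]],
    num_tokens: int,
    total_steps: int,
) -> List[Optional[int]]:
    """Parallel decoding: predict all positions, keep only confident ones."""
    tokens: List[Optional[int]] = [MASK] * num_tokens
    for step in range(1, total_steps + 1):
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        if not masked:
            break
        preds = predict(tokens)  # (token_id, confidence) for every position
        keep_masked = cosine_schedule(step, total_steps, num_tokens)
        # Commit at least one token per step so that decoding makes progress.
        keep_masked = min(keep_masked, len(masked) - 1)
        # Most confident predictions are committed; the rest stay masked.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[: len(masked) - keep_masked]:
            tokens[i] = preds[i][0]
    return tokens


def toy_predict(tokens: List[Optional[int]]) -> List[Tuple[int, float]]:
    """Hypothetical predictor: earlier positions are more confident."""
    return [(i % 4, 1.0 / (1 + i)) for i in range(len(tokens))]


result = iterative_decode(toy_predict, num_tokens=8, total_steps=4)
```

With 8 tokens and 4 steps, this sketch commits 1, 2, 2, and 3 tokens per step, instead of one token per step as in purely sequential decoding.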

4.6.2 Combining Pre-training with Image Generation

iGPT is one of the earliest models considered to perform both generation and pre-training. By predicting pixel values through the Transformer’s autoregressive approach, iGPT achieves image generation capabilities. Unsupervised learning on large-scale unlabeled data makes iGPT a pre-trained model, which can achieve good results on downstream tasks through fine-tuning. MAGE [125] first maps images to tokens in a discrete latent space using VQ-GAN, then performs masked image modeling by masking tokens in the latent space. The training objective is to reconstruct the masked tokens. In this way, MAGE is able to learn representations via masked image modeling in the latent space while achieving image generation. A contrastive loss is used in the latent space to improve the performance of the model. RCG [126] trains a representation generator by adding noise to the encoded representation and then removing it. Subsequently, it utilizes the generated representation within the MAGE architecture to achieve pixel generation, which unifies representation learning and image generation.

4.7 Vision Foundation Model

Current research in artificial intelligence and deep learning is increasingly oriented towards the integration of data from multiple modalities. Consequently, multimodal research has emerged as one of the most significant directions in the field of artificial intelligence. We categorize multimodal studies into three broad types. The first category involves the use of multimodal data for pre-training, where the focus is on extending visual network architectures to multimodal contexts and exploring the upper limits of model capabilities through scaling up, as summarized in Table VIII. The second category primarily concentrates on the generation of multimodal data, encompassing tasks such as text-to-image conversion, as summarized in Table IX. The third category represents a vision generalist model, aiming to unify various visual tasks under a single network architecture.

4.7.1 Pre-train With Multimodality

Masked Modeling Methods. VL-BERT [200] incorporates visual and linguistic inputs into a BERT-based architecture, allowing early and unrestricted interactions between modalities for joint representation learning. MaskVLM [112] applies masking to image-text pairs, and the masked images and masked texts are then fed separately into the image encoder and the text encoder. Furthermore, a multimodal encoder is designed to encode the masked text and image, followed by simultaneous reconstruction of both the image and the text. BEiT.v3 integrates MoE and multimodality to design specialized tokenizers for vision, language, and vision-language tasks and scales up the model.

Figure 12: Illustration of masked modeling with multimodality. (a) FLIP applies masking augmentations to the CLIP [176] framework for text-image alignment. (b) BEiT.v3 [220] designs a mixture-of-experts encoder for text-image data. The figures are reproduced from [220] and [129].

Contrastive Methods. A-CLIP [251] comprises an online-updated vision encoder and a language encoder. Images are masked after their feature maps are extracted; the model then performs vision-language contrastive learning and computes a loss against CLIP features. As shown in Figure 12, FLIP [129] passes the visible image patches and the text through separate encoders and computes a contrastive loss between them. MaskCLIP [48] incorporates textual encoding into the masked image modeling architecture and computes a contrastive loss between language and images to improve model performance.

Scaling up. Deep learning models often see substantial performance improvements when the number of model parameters reaches a certain scale. Models based on MAE also exhibit marked changes in behavior when their parameter count is expanded to a certain extent. A series of studies have scaled up the MAE parameters and tested their performance in various downstream tasks. Models such as EVA [55], EVA-02 [54], WSP [196], and others have achieved excellent results with large parameter counts. Table VIII summarizes the information and performance of this category of models.

4.7.2 Multimodality for Image Generation

Another significant research direction in computer vision for multimodal models involves using multimodality for image generation. This encompasses various tasks, including text-to-image generation and image generation. The study of image generation primarily falls into two approaches. The first employs an autoregressive method, predominantly based on vector quantization, and covers VQ-based algorithms such as DALLE [178]; we have delved further into this in Sec. 4.6.1. The second primarily utilizes diffusion in conjunction with multimodality for image generation. Common models in this category include DALLE-2 [177], DALLE-3 [15], Stable Diffusion [186], and GPT-4V [227], among others.

4.7.3 Vision Generalist Model

Vision Generalist Model unifies multiple tasks within a single model, selecting different tasks through prompt input and setting the model’s output to a specific target, thereby achieving the unification of various tasks. Painter [222] considers an image paired with its corresponding task output, such as text or features, as a sample pair. Such a pair can encompass multiple modalities. The corresponding task output of the image is masked, and then the image, serving as the task’s prompt, is fed into the encoder to reconstruct the corresponding task output. Through this approach, Painter captures rich contextual information and demonstrates impressive performance across various tasks. InstructDiffusion [68] and InstructCV [60] build upon the foundation of stable diffusion, using prompts and the original image to reconstruct different task objectives, achieving a unification of various task architectures. LVM [9] employs a VQ-GAN encoder to transform images into a sequence of tokens, which are then trained using an autoregressive Transformer architecture. The model flexibly generates outputs by constructing partial visual sentences defining specific application tasks. Moreover, the authors propose a large-scale dataset for in-context learning based on LAION-5B, introducing visual sentences as a unified unit of visual data. This approach enables scalable model training from diverse data sources, thus leveraging the vast diversity present in visual data for comprehensive and robust model development.

Figure 13: Illustration of various downstream tasks in computer vision. We summarize them by the label (task) types and data modalities. For example, tasks under recognition and detection utilize sample-level (e.g., classification) or sparse object-level labels (e.g., detection and OCR) on 2D images, while low-level vision tasks prefer pixel-level supervision.

5 Vision Downstream Task

In this section, we will introduce the specific applications of MIM in Vision downstream tasks. Broadly speaking, we categorize the applications of MIM in vision downstream tasks into four parts: recognition and detection, low-level vision, video representation, and 3D vision tasks. Figure 13 provides a classification of CV downstream tasks.

5.1 Video Representation

Research applying Masked Modeling to the video domain primarily focuses on adapting models for high-dimensional video data. This category of research can be divided into two parts: one part is based on the AE architecture, adapting video data to the MAE framework, and the other is based on the AR architecture, using autoregressive methods (such as VQ-VAE, VQ-GAN) to predict video data.

5.1.1 AE-Based

AE-based models usually aim for Video Reconstruction as the task objective to achieve the purpose of Representation Learning. However, videos have higher dimensionality compared to images. Therefore, the focus is on adapting video data to fit within architectures like MAE and BEiT. To apply the 2D MAE framework to videos, a common approach is to mask out space-time tubes instead of spatial patches. This treats the video as a sequence of 2D frames and masks contiguous patches across time. More advanced methods mask at the 3D voxel level for finer spatio-temporal masking. Additional modifications, like introducing a motion-specific encoder, can help capture temporal dynamics.
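To make the difference between space-time tube masking and per-frame patch masking concrete, the sketch below builds both kinds of masks on a grid of `num_frames` by `num_patches` tokens. The function names, the grid sizes, and the 0.75 ratio are hypothetical choices for illustration, not settings from any cited method.

```python
import random
from typing import List


def tube_mask(num_frames: int, num_patches: int, mask_ratio: float,
              seed: int = 0) -> List[List[bool]]:
    """Space-time tube masking: one spatial mask shared by all frames.

    Returns a [num_frames][num_patches] grid where True marks a masked patch.
    """
    rng = random.Random(seed)
    num_masked = round(mask_ratio * num_patches)
    masked_ids = set(rng.sample(range(num_patches), num_masked))
    frame_mask = [p in masked_ids for p in range(num_patches)]
    # Repeating the same spatial mask along the time axis forms the tubes.
    return [list(frame_mask) for _ in range(num_frames)]


def frame_wise_mask(num_frames: int, num_patches: int, mask_ratio: float,
                    seed: int = 0) -> List[List[bool]]:
    """Baseline for comparison: an independent random mask for every frame."""
    rng = random.Random(seed)
    num_masked = round(mask_ratio * num_patches)
    grid = []
    for _ in range(num_frames):
        masked_ids = set(rng.sample(range(num_patches), num_masked))
        grid.append([p in masked_ids for p in range(num_patches)])
    return grid


tubes = tube_mask(num_frames=4, num_patches=16, mask_ratio=0.75)
independent = frame_wise_mask(num_frames=4, num_patches=16, mask_ratio=0.75)
```

With frame-wise masking, a patch hidden in one frame may remain visible in an adjacent frame, which can make reconstruction easier; sharing the mask across time removes that shortcut.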

Based on the framework of MAE, VideoMAE [206] performs spatial-temporal masking during pre-training by randomly occluding cubic patches in spatiotemporal spaces.

Figure 14: Illustration of MIM on videos. Taking VideoMAE [206] as an example, it employs an asymmetric encoder-decoder architecture with random spatiotemporal cubic masks and reconstructs the missing ones. The figure is reproduced from [206].

Figure 14 shows the framework of VideoMAE. AdaMAE [11] adopts an adaptive sampling method that, based on semantic context, utilizes an auxiliary sampling network to sample visible tokens. It estimates a categorical distribution over spatiotemporal patch tokens and selects tokens that increase the expected reconstruction error as visible tokens. VideoMAE.v2 [215] introduces a dual-masking strategy where the encoder operates on a subset of video tokens, and the decoder deals with another subset of video tokens. MotionMAE [247] reconstructs masked video patches and predicts motion structure, leveraging an asymmetric MAE architecture to outperform existing baselines in action classification and video object segmentation by effectively capturing both static and dynamic information in videos. OmniMAE [70] uses masked autoencoding with spatiotemporal patches to train on both images and videos, achieving competitive results in downstream tasks by reconstructing missing patches and applying a pixel reconstruction loss. MAM2 [197] enhances self-supervised video transformer pre-training by separately decoding motion cues using the RGB difference as a prediction target, achieving competitive video recognition performance with fewer pre-training epochs.

5.1.2 AR-Based

AR-based models typically aim at video prediction or video generation tasks, often employing VQ or GPT architectures to model video data. Given that video information is more redundant and higher-dimensional compared to image information, autoregressive models usually predict sequentially along one dimension at a time. Therefore, it is necessary to convert video data into tokens. In AR-based models, the design of the tokenizer is often crucial. Typically, some methods break videos into 2D patches across space and time to get space-time tokens. More sophisticated tokenizers divide the video into 3D voxels and vector quantize these voxel features to obtain discrete visual tokens.

Different from existing methods that apply VQ encoders on super voxels (3D-VQ), MAGVIT [255] expands all 2D convolutions in VQ-GAN to 3D convolutions with a temporal axis and combines 3D-VQ with VQ-GAN to design a new 3D-VQGAN architecture. MaskViT [75] employs an MAE-based architecture for video prediction, utilizing spatial and spatiotemporal window attention to enhance memory and training efficiency. FMNet [223] predicts the depth of masked frames using adjacent frames, and by reconstructing the masked temporal features, it improves temporal consistency.

5.2 Detection And Recognition

5.2.1 General Detection

iTPN [205] enhances the pre-training phase by incorporating a feature pyramid, unifying the reconstruction and recognition neck, and supplementing MIM with masked feature modeling, providing multi-stage supervision.

MIMdet [56] finds that a MIM pre-trained vanilla ViT encoder can perform surprisingly well in challenging object-level recognition scenarios, even with randomly sampled partial observations. imTED [270] migrates a pre-trained Transformer encoder-decoder to a target detector, constructing a “fully pre-trained” feature extraction pathway to maximize the detector’s generalization capability while introducing a Multi-Scale Feature Modulator to enhance scale adaptability.

5.2.2 Downstream Classification

Face Recognition. In FaceMAE [214], randomly masked face images are used to train the reconstruction module. An instance relation matching module is tailored to minimize the distribution gap between real faces and FaceMAE reconstructed ones.

Knowledge Distillation. G2SD [98] introduces two knowledge distillation processes to enhance the potential of smaller ViT models. During the generic distillation phase, the smaller model’s decoder is encouraged to align its feature predictions with the hidden representations of the larger model, thereby transferring task-agnostic knowledge. In the specific distillation phase, the smaller model’s predictions are constrained to be consistent with the larger model’s predictions, transferring task-specific features that ensure task performance. DMAE [10] introduces a computationally efficient knowledge distillation framework that leverages MAE to align intermediate feature maps between teacher and student models, enabling robust knowledge transfer and improved performance with high masking ratios and limited visible patches.

Efficient Fine-tuning. Robust Fine-tuning [238] presents a technique that uses masked image patches for counterfactual sample generation, enhancing model robustness by breaking spurious correlations during fine-tuning of large pre-trained models. MAE-CT [119] employs Nearest Neighbor Contrastive Learning to refine the top layers of a pre-trained MAE, enabling it to form semantic clusters and improve performance on classification tasks without the need for labeled data. MAE-CIL [260] explores a bilateral MAE framework for Class Incremental Learning, enhancing image reconstruction quality and representation stability through a novel fusion of image-level and embedding-level learning.

5.2.3 Medical Image

SD-MAE [152] performs region masking and reconstruction on histology images to learn useful representations. Additionally, self-distillation is introduced by making the student model mimic the outputs of the teacher autoencoder via a hint loss. MedMAE [280] migrates MIM to medical images and appends task-specific Heads for specific tasks. It achieves commendable results in various tasks such as chest X-ray disease classification, abdominal CT multi-organ segmentation, and MRI brain tumor segmentation. FreMAE [221] explores the potential of using Fourier Transform for masked image modeling in medical image segmentation, integrating both global structural information and local details. This is achieved by leveraging the frequency domain and multi-stage supervision. GCMAE [175] employs MIM for representation learning in the computational pathology domain, effectively extracting both global and local features from pathological images.

5.2.4 OCR

DocMAE [141] proposes a self-supervised framework that leverages masked autoencoders to learn rectification models for document image correction without human annotation. MaskOCR [153] presents a novel pre-training approach that uses masked image modeling to learn robust encoder-decoder architectures for text recognition in a self-supervised manner without text annotations.

5.2.5 Remote Sensing

Based on MAE, SatMAE [41] incorporates a temporal embedding and independently masks image patches across time to harness the temporal information present in the data. This approach allows the model to learn from the changes in the data over time, providing a richer and more nuanced understanding of the imagery. CMID [163] is capable of learning both global semantic separable and local spatial perceptible representations by combining contrastive learning with MIM in a self-distillation manner. This approach addresses the limitations of existing RS SSL methods, which typically focus on either global or local representations, and is better suited to the varied and complex representations required for different RS downstream tasks.

5.2.6 Low-Level Vision

Deep learning models have achieved state-of-the-art results in various image tasks, but they often struggle to generalize across different noise distributions. MaskedDenoising [23] proposes a method that masks random pixels in the input image and reconstructs the missing information during training. Additionally, MaskedDenoising masks features in the self-attention layers to address inconsistencies between training and testing. The masked training approach introduced by MaskedDenoising enhances the generalization performance of denoising networks. DreamTeacher [120] employs two knowledge distillation methods for pre-training image backbones and performing image denoising: feature distillation and label distillation. Feature distillation transfers features from the generative model to the target backbone, while label distillation transfers task-specific labels to the target backbone.

5.3 3D Vision Task

5.3.1 Depth Estimation

Mesa [106] introduces a novel pre-training framework that synergizes masked, geometric, and supervised learning to enhance the representation of later layers in monocular depth estimation models. UniPAD [248] introduces an SSL paradigm that utilizes 3D volumetric differentiable rendering for encoding 3D space and reconstructing 3D shapes, significantly enhancing performance in autonomous driving tasks like 3D object detection and semantic segmentation.

5.3.2 3D Point Cloud

Research on 3D point clouds can primarily be divided into three categories: one applies the foundational architecture of MIM to 3D point cloud data, another combines it with contrastive learning, and the last category utilizes different network architectures based on the MIM framework.

Basic MIM. To adapt the 2D MAE framework to 3D point clouds, a common approach is voxelization - converting the irregular point cloud into a regular 3D voxel grid that can then be masked. One method masks contiguous 3D voxels to extend patch masking. Encoder architectures like sparse 3D CNNs help capture 3D spatial context. Alternately, some methods work directly on raw point clouds using specialized encoders. For tokenization, point clouds are often voxelized first before applying 3D convolutional autoencoders to learn discrete voxel tokens. Other approaches cluster point cloud features into visual words without voxelization. Hybrid tokenizers combine both voxel and raw point features. Choosing the right tokenizer is key to learning useful representations.

MAE-Based: Voxel-MAE [159] introduces a distance-based random masking strategy and an occupancy prediction pretext task, which helps the model predict the occluded occupancy structure of 3D scenes. PointMAE [265] divides the input point cloud into patches, randomly masks them, and uses a Transformer-based autoencoder to learn high-level latent features from unmasked patches. I2P-MAE [266] focuses on geometric feature reconstruction and identifies three self-supervised learning objectives specific to point clouds: centroid prediction, normal estimation, and curvature prediction. ACT [45] utilizes pre-trained 2D image or language Transformers as teachers for 3D representation learning, transferring their latent features to a 3D Transformer student through masked point modeling. MaskPoint [135] introduces a discriminative masked pre-training Transformer framework that represents point clouds as discrete occupancy values and performs binary classification between points of masked objects and sampled noise. GeoMAE [204] randomly masks a set of points, employs a Transformer-based point cloud encoder, and then uses a lightweight Transformer decoder to predict the centroid, normals, and curvature for each voxel in the point, enabling the model to infer the fine-grained geometric structure of the point cloud.

BEiT-Based: PointBERT [258] partitions point clouds into local point chunks and employs a point cloud Tokenizer with dVAE to generate discrete tokens. It randomly masks certain chunks of the input point cloud and trains the Transformer to recover the original point tokens at the masked positions, as shown in Figure 15.

Combined with Contrastive Learning. PointCMP [193] integrates the learning of both local and global spatiotemporal features using a two-branch structure. A mutual similarity-based augmentation module is introduced to generate hard samples at the feature level. The framework achieves state-of-the-art performance on benchmark datasets and demonstrates the superiority of learned representations across different datasets and tasks. ReCon [173] combines the merits of both contrastive and generative modeling paradigms through ensemble distillation. It trains a generative student to guide a contrastive student using an encoder-decoder style RECON-block that transfers knowledge through cross attention with stop-gradient. This approach avoids overfitting and pattern difference issues, achieving state-of-the-art results in 3D representation learning and improving performance on downstream tasks.

Different Architecture. Point-M2AE [265] redesigns the encoder and decoder into a pyramid structure to capture the spatial geometry and semantic information of 3D shapes. Additionally, a multi-scale masking strategy is introduced to generate consistently visible regions across different scales, and skip connections are employed to reconstruct from a global-to-local perspective.

Figure 15: Illustration of MIM on 3D Point Cloud. Taking PointBERT [258] as an example, PointBERT partitions point clouds into local point chunks and employs a point cloud Tokenizer with dVAE to generate discrete tokens. The figure is reproduced from [258].

6 Masked Modeling on Other Modalities

This section further extends masked modeling pre-training to other mainstream domains beyond CV and NLP and summarizes the essential design and applications.

6.1 Audio and Speech

Combining CL with Masked Modeling. The concept of applying the masked modeling mechanism for SSL can be expanded to audio signals. VQ-wav2vec [8] introduces BERT-style masked modeling as pre-training on top of wav2vec [7]. In wav2vec, the input audio signal is first mapped into dense latent representations by an encoder network. Aggregating latent representations from multiple time steps, the context network generates a contextualized representation. A contrastive loss, motivated by Contrastive Predictive Coding (CPC) [167], is adopted as the objective function. VQ-wav2vec [8] introduces a quantization module to replace the dense latent representations with discrete representations, similar to VQ-VAE. The resulting discretized audio representations facilitate a seamless application of the original BERT-style masked modeling, which requires a discrete vocabulary. In contrast to wav2vec, which uses CNNs for both networks, wav2vec 2.0 adopts a Transformer as the context network. The output from the convolutional encoder is randomly masked before being fed into the Transformer. InfoNCE is adopted to maximize the similarity between the contextualized representation at the masked time steps and the corresponding quantized version of the localized representation, where negative samples are drawn from other masked time steps. Instead of creating discrete inputs to BERT with a quantization module, Hidden Unit BERT (HuBERT) [88] discretizes the prediction target, using cluster labels obtained by applying K-means to the Mel Frequency Cepstral Coefficients (MFCC) of the input audio. HuBERT adopts the same architecture design as wav2vec 2.0, where a CNN is adopted as the encoder network and a Transformer as the BERT encoder. The categorical cross-entropy loss is employed to assess the hidden cluster assignment performance for masked and unmasked tokens, similar to a frame-level acoustic unit discovery problem.
It is essential to highlight that while the masking operation is a common element in VQ-wav2vec, wav2vec 2.0, and HuBERT, only VQ-wav2vec and HuBERT incorporate a BERT-style masked modeling approach, whereas wav2vec 2.0 employs the BERT-style masking operation as a means to enhance the performance of contrastive learning.
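For reference, the InfoNCE objective at a masked time step $t$ can be written as below. The notation is assumed here rather than taken from the survey: $c_{t}$ is the contextualized representation, $q_{t}$ is the quantized target at the same time step, $Q_{t}$ is the candidate set containing $q_{t}$ and the negatives drawn from other masked time steps, $\mathrm{sim}(\cdot,\cdot)$ is a similarity function such as cosine similarity, and $\kappa$ is a temperature.

$$\mathcal{L}_{\textrm{InfoNCE}} = -\log\frac{\exp\big(\mathrm{sim}(c_{t}, q_{t})/\kappa\big)}{\sum_{\tilde{q}\in Q_{t}}\exp\big(\mathrm{sim}(c_{t}, \tilde{q})/\kappa\big)}$$

Minimizing this loss raises the similarity of the matched pair relative to the negatives, so the objective is a classification over candidates and not a reconstruction of the masked input.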

Masked Audio Modeling as MIM. In contrast to the common practice in MIM, where the prediction task usually takes the form of regression regardless of whether the prediction target involves tokenizers, pixels, or features, VQ-wav2vec and HuBERT strictly adhere to classification. The pivotal connection uniting MIM and masked audio modeling (MAM) is the transformation from raw audio signals to a visual representation, either a spectrogram or a mel-spectrogram. Treating the spectrogram as a greyscale image, the problem of MAM can be naturally and directly transformed into the problem of MIM [134, 37, 27, 3, 38, 95]. The difference between these works again resides in the design of the modules for Mask, Target, Encoder, and Head. Since the spectrogram itself has already extracted features of the audio signal, the main difference is whether the masked patches are fed into the encoder. Only unmasked patches are fed into the encoder in Audio-MAE, while works like Mockingjay [134] and Audio ALBERT [37] pass both masked and unmasked patches into the encoder. Audio-MAE [95] explores different masking strategies: unstructured masking (random patch masking), time masking (column-wise masking), and frequency masking (row-wise masking). The framework of Audio-MAE is shown in Figure 16. Combining MAM and MIM, Audiovisual MAE [69] proposes that masked modeling can be applied simultaneously to audio and images for video pre-training.
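The three masking strategies mentioned for spectrogram patches can be sketched on a small patch grid as follows. The grid layout (rows as frequency bands, columns as time steps), the function names, and the 0.5 ratio are assumptions made for illustration and are not taken from the Audio-MAE implementation.

```python
import random
from typing import List

Grid = List[List[bool]]  # indexed as [frequency][time]; True = masked patch


def unstructured_mask(n_freq: int, n_time: int, ratio: float,
                      rng: random.Random) -> Grid:
    """Mask individual patches uniformly at random."""
    total = n_freq * n_time
    masked = set(rng.sample(range(total), round(ratio * total)))
    return [[(f * n_time + t) in masked for t in range(n_time)]
            for f in range(n_freq)]


def time_mask(n_freq: int, n_time: int, ratio: float,
              rng: random.Random) -> Grid:
    """Column-wise masking: hide all frequencies of the chosen time steps."""
    cols = set(rng.sample(range(n_time), round(ratio * n_time)))
    return [[t in cols for t in range(n_time)] for _ in range(n_freq)]


def frequency_mask(n_freq: int, n_time: int, ratio: float,
                   rng: random.Random) -> Grid:
    """Row-wise masking: hide all time steps of the chosen frequency bands."""
    rows = set(rng.sample(range(n_freq), round(ratio * n_freq)))
    return [[f in rows for _ in range(n_time)] for f in range(n_freq)]


rng = random.Random(0)
m_rand = unstructured_mask(4, 8, 0.5, rng)
m_time = time_mask(4, 8, 0.5, rng)
m_freq = frequency_mask(4, 8, 0.5, rng)
```

In an MAE-style encoder, only the patches whose entry is `False` would be passed on, which is what distinguishes it from designs that feed both masked and unmasked patches to the encoder.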

6.2 Graph

Graph data are ubiquitous in real-world practice, e.g., social networks. Masked modeling has also achieved notable success in graph data analysis. AttrMasking [90] first masks some proportion of nodes and edges within each graph and trains the GNN encoder to predict them. Analogously, GROVER [187] attempts to predict the masked subgraphs. Subsequently, GPT-GNN [91] proposes an autoregressive framework to perform node and edge reconstruction iteratively, which generates one masked node (atom) and its connected edges (bonds) and optimizes the likelihood of the node and edge generation in the next iteration. More recently, inspired by the huge success of MAE [78] in CV, GraphMAE [86] masks some input node features with special tokens and trains the graph autoencoder to reconstruct the masked ones. GraphMAE2 [85] argues that GraphMAE is usually vulnerable to disturbance in the features. To mitigate this issue, it designs multi-view random re-mask decoding and latent representation prediction to regularize the feature reconstruction. Similarly, MGAE [201] observes that a high masking ratio of the input graph edges could benefit the downstream tasks and proposes a tailored cross-correlation decoder to reconstruct the large number of masked edges. With the increasing attention paid to graph transformers, GMAE [267] designs an asymmetric graph transformer [160] architecture, where the encoder is a deep transformer and the decoder is a shallow transformer. Equipped with the masking mechanism, GMAE is more memory-efficient than conventional transformers. Despite the fruitful progress, the masking operations create an undesirable discrepancy between pre-training and fine-tuning because the masks do not appear in the downstream tasks. Tackling this issue remains a promising direction.
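A minimal sketch of node-feature masking in the spirit of the GraphMAE description above: selected node features are replaced by a mask token, and the reconstruction loss is evaluated on the masked nodes only. The GNN encoder and decoder are omitted, the names `mask_node_features` and `masked_mse` are hypothetical, and a plain mean squared error is used here as a stand-in for whatever reconstruction criterion a given method adopts.

```python
import random
from typing import List, Tuple


def mask_node_features(
    features: List[List[float]],
    mask_ratio: float,
    mask_token: List[float],
    rng: random.Random,
) -> Tuple[List[List[float]], List[int]]:
    """Replace the features of randomly chosen nodes with a mask token."""
    num_nodes = len(features)
    masked_ids = sorted(rng.sample(range(num_nodes), round(mask_ratio * num_nodes)))
    chosen = set(masked_ids)
    masked = [list(mask_token) if i in chosen else list(row)
              for i, row in enumerate(features)]
    return masked, masked_ids


def masked_mse(pred: List[List[float]], target: List[List[float]],
               masked_ids: List[int]) -> float:
    """Mean squared error computed over the masked nodes only."""
    total, count = 0.0, 0
    for i in masked_ids:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


# Toy graph with 5 nodes and 2-D features (hypothetical values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
mask_token = [0.0, 0.0]  # would be a learnable vector in practice
masked_feats, masked_ids = mask_node_features(
    features, mask_ratio=0.4, mask_token=mask_token, rng=random.Random(0)
)
```

Because the loss ignores unmasked nodes, the model is rewarded only for inferring hidden features from the graph context and not for copying its inputs.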

6.3 Biology and Chemistry

Masked modeling has recently been extended to various biological applications to accelerate biochemical experiments, especially for research on proteins and molecules.

Sequence Modeling for Protein. Considering an amino acid in the protein sequence as a word in a sentence, a number of self-supervised tasks proposed for natural language can be naturally extended to protein sequences. TAPE [179] proposes to predict the type of the next amino acid based on a set of masked sequence fragments. ESM-1b [183] randomly masks out a single amino acid or a set of contiguous amino acids and then predicts the masked amino acids from the remaining sequence. Unlike random masking, AC-MLM [157] combines adversarial training with masked language modeling and proposes to mask amino acids in a learnable and adversarial manner. Taking into account the dependence between masked amino acids, Pairwise MLM (PMLM) [82] proposes to model the probability of a pair of masked amino acids instead of predicting the probability of a single amino acid. Different from these generative methods, CPCProt [148] applies different masking transformations on the input sequences to generate different views and then applies InfoNCE to maximize the similarity of two jointly sampled pairs. The antibody is a special kind of protein, and ABGNN [62] enables pre-training of antibody sequences by masking the residues in the Complementarity Determining Regions (CDRs) and predicting the types of the masked residues.
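The contiguous-span masking described above can be sketched on a raw amino acid string as follows. The sequence, the `<mask>` token, and the helper names `mask_span` and `fill_in` are hypothetical, and the predictor is omitted; the example only shows how inputs and prediction targets are formed.

```python
import random
from typing import Dict, List, Tuple

MASK_TOKEN = "<mask>"


def mask_span(sequence: str, span_len: int,
              rng: random.Random) -> Tuple[List[str], Dict[int, str]]:
    """Mask one contiguous span of residues.

    Returns the token list fed to the model and a {position: residue}
    dictionary holding the prediction targets.
    """
    if not 1 <= span_len <= len(sequence):
        raise ValueError("span_len must be between 1 and len(sequence)")
    start = rng.randrange(len(sequence) - span_len + 1)
    tokens = list(sequence)
    targets: Dict[int, str] = {}
    for pos in range(start, start + span_len):
        targets[pos] = tokens[pos]
        tokens[pos] = MASK_TOKEN
    return tokens, targets


def fill_in(tokens: List[str], predictions: Dict[int, str]) -> str:
    """Write predicted residues back into the masked positions."""
    return "".join(predictions.get(i, tok) for i, tok in enumerate(tokens))


sequence = "MKTAYIAKQRQISFVK"  # hypothetical protein fragment
tokens, targets = mask_span(sequence, span_len=3, rng=random.Random(0))
recovered = fill_in(tokens, targets)  # a perfect predictor restores the input
```

Setting `span_len=1` corresponds to masking a single amino acid, the other case mentioned above.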

Sequence-Structure Co-modeling for Protein. The amino acid sequences of proteins can be folded into stable 3D structures in the real physicochemical world, forming a special kind of sequence-structure data. The concept of the masked modeling mechanism for SSL can also be expanded to protein structure pre-training. For example, GearNet [273] proposes multiview contrasting that randomly samples two sub-structures from each protein by masking, encodes them into two representations, and finally maximizes the similarity between representations from the same protein while minimizing the similarity between representations from different proteins. GraphComp [254] proposes graph completion, which takes as input a protein graph with partially masked residues and then makes predictions for those masked tokens. AlphaFold2 [103] takes masked language modeling as a pre-training task and full-atomic structure prediction as a downstream task. It was found by [89] that the representations from AlphaFold2’s Evoformer could work well on various protein-related downstream tasks, including fold classification, stability prediction, etc. Moreover, Masked Inverse Folding (MIF) [249] trains a model to reconstruct the original amino acids conditioned on the masked sequence and the masked backbone structure. Similar to MAGE in CV, more recently proposed pre-training methods [199, 65] like FoldSeek [208] first expand the codebook for amino acid sequences with VQ-VAE and then perform masked modeling for the latent Transformer encoder.

Graph Representation for Molecules. Most molecule data can be represented as SMILES sequences or 2D/3D graphs. Therefore, many methods developed for languages or graphs can also be directly transferred to molecules. AttrMasking [90] randomly masks the input node and edge attributes (e.g., atom type in the molecular graph) and applies GNNs to predict the masked attributes. For sequence-based masking, SMILES-BERT [219] and Molformer [188] randomly mask the characters in the SMILES sequences and then reconstruct them based on the output of the encoder. To alleviate the problem of imbalanced atom types in nature, Mole-BERT [235] designs a context-aware tokenizer that encodes atoms as chemically meaningful discrete codes for masked modeling.
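As a rough illustration of sequence-based masking on SMILES strings, the sketch below first splits a SMILES string into tokens and then hides a random subset. The regular expression is a deliberately simplified, hypothetical tokenizer (bracket atoms, the two-letter halogens Br and Cl, two-digit ring closures, then single characters); the tokenizers actually used by SMILES-BERT or Molformer may differ, and the 15% ratio and `[MASK]` token are assumptions.

```python
import random
import re

# Simplified, hypothetical SMILES tokenizer: bracket atoms, Br/Cl,
# two-digit ring closures (%NN), then any remaining single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens without dropping any character."""
    return SMILES_TOKEN.findall(smiles)


def mask_smiles(smiles, mask_ratio=0.15, seed=0, mask_token="[MASK]"):
    """Replace a random subset of tokens with mask_token.

    Returns (corrupted, targets), where targets maps each masked position to
    the original token the encoder output would be asked to reconstruct.
    """
    rng = random.Random(seed)
    tokens = tokenize_smiles(smiles)
    n_to_mask = max(1, round(mask_ratio * len(tokens)))
    positions = rng.sample(range(len(tokens)), n_to_mask)
    targets = {pos: tokens[pos] for pos in positions}
    corrupted = [mask_token if i in targets else tok for i, tok in enumerate(tokens)]
    return corrupted, targets
```

Tokenizing before masking keeps multi-character atoms such as `Cl` intact, so a masked position always corresponds to a chemically meaningful unit rather than half of an element symbol.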

Figure 16: Illustration of MIM on Audio. Taking Audio-MAE [96] as an example, it applies the MAE framework to audio directly. The figure is reproduced from [96].

7 Discussions and Future Directions

How to design an efficient MIM Model? Building on the main arguments of this survey, we offer recommendations and heuristic considerations for designing efficient Masked Image Modeling (MIM) models. The essence of Masked Modeling lies in reconstruction from masked data. In NLP, the masked tokens are often several consecutive tokens, an operation grounded in a critical principle: preventing information leakage so that the model works with minimal prior information, which increases the difficulty of the reconstruction task. Therefore, when designing a Masked Modeling method, the Mask component should adhere to the principle of preventing information leakage. The attention-based masking strategy both avoids information leakage and uses the least computational resources. Furthermore, as introduced in Section 3, the Masked Modeling task of reconstructing low-level features and details compensates for the inadequacies of Transformers in detail modeling. Coupled with the Transformer’s inherent global modeling capabilities, the combination of Masked Modeling and Transformers gives the model both low-level and global modeling abilities, thereby further raising the upper limit of model performance. The selection of the Head and Target components should depend on the specific task at hand. Different Targets induce different biases in the model and yield different effects on different tasks; feature maps, for example, are generally more suitable for detection tasks. Whether the Head should be combined with contrastive learning depends on the choice of Target. If the selected Target requires extracting a feature map, contrastive learning can be conveniently used to enhance model performance. Conversely, if the model uses Pixels as the Target, employing contrastive learning would not significantly improve performance and would incur substantial computational costs.
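To make the attention-based masking idea concrete, the sketch below selects which patches to hide from a list of per-patch attention scores. It is only an illustration under stated assumptions: the scores are assumed to come from some teacher or previous forward pass, the 50% ratio is arbitrary, and because attention-guided methods differ in whether they hide the most or the least attended patches, that choice is exposed as the hypothetical flag `mask_high`.

```python
def attention_guided_mask(attention, mask_ratio=0.5, mask_high=True):
    """Return a list of booleans, True where the patch is masked.

    attention: one score per patch (assumed to be supplied by a teacher model).
    mask_high: hide the most attended patches if True, the least attended otherwise.
    """
    n_mask = round(mask_ratio * len(attention))
    # Rank patch indices by attention score, highest first when mask_high is set.
    order = sorted(range(len(attention)), key=lambda i: attention[i], reverse=mask_high)
    masked = set(order[:n_mask])
    return [i in masked for i in range(len(attention))]


scores = [0.05, 0.40, 0.10, 0.30, 0.15, 0.00]
print(attention_guided_mask(scores))
# prints [False, True, False, True, True, False]
```

Hiding the most attended patches removes the regions that would otherwise make reconstruction easy, which is one way to read the information-leakage principle discussed above; the selection itself costs only a sort over the patch scores.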

Explainability of MIM. Compared to contrastive learning, Masked Modeling still lacks a more comprehensive explanation. The task of contrastive learning, utilizing the InfoNCE loss function, offers a complete loss function and a relatively unified architecture with clearer task objectives. In contrast, Masked Modeling involves complex processing techniques within its various modules and across different modalities. For Masked Modeling, employing different masking strategies and tokenization methods to compress data can result in significant structural and computational differences, making it challenging to develop a comprehensive and unified theoretical explanation. Currently, most theoretical explanations are specific to particular tasks or based on empirical studies, and they fail to generalize across various modalities. The prevailing explanatory approaches mainly unfold in three directions: interpretation based on hierarchical structures, explanations derived from the theoretical foundations of contrastive learning, and interpretations from the perspective of information compression. Although these research efforts provide a certain degree of interpretability to Masked Modeling, they still lack a profound theoretical basis. This makes the interpretability of Masked Modeling a challenging research direction.

Downstream Tasks. Current research on downstream tasks mainly focuses on applying the MAE architecture to task-specific model structures. However, with the robust growth of Masked Modeling, more complex technologies are gradually being introduced into these tasks. In video research, GPT and MAE are two critical backbones, but a growing number of studies combine VQ-based models with Masked Modeling. These studies employ VQ technology for more efficient data compression and tokenize data to achieve higher-quality reconstruction. We therefore expect research on 3D point clouds to follow the same trend, combining VQ-based models with Masked Modeling to achieve better information compression efficiency.

Beyond Vision. Multimodal research is currently a significant direction in artificial intelligence, and the application of Masked Modeling in multimodal contexts is one of the most promising future directions. Early multimodal research primarily employed contrastive learning, aligning different modalities and computing contrastive loss. With the advancement of diffusion techniques, studies aligning different modalities through diffusion are also increasing. Masked Modeling holds potential in multimodal applications. The current research paradigm mainly involves aligning different modalities after masking them, increasing task complexity. A new research paradigm is also emerging, where data from different modalities are aligned to a central modality, and then Masked Modeling is applied using the central modality’s data. Moreover, applying Masked Modeling to various modalities technically poses more challenges. Extending masking to 3D, 4D, or even higher-dimensional data and tokenizing higher-dimensional data are technical details that need attention and resolution when expanding Masked Modeling to higher dimensions. Therefore, integration with multimodal approaches will be an important research direction for Masked Modeling.

8 Conclusion

This survey, grounded in Computer Vision (CV), proposes a unified architecture for Masked Modeling, successfully integrating various technical details and data modalities within this framework. Additionally, we have meticulously organized and elucidated technologies related to Masked Modeling, such as contrastive learning, generative models, and autoregressive models, offering readers a more comprehensive perspective. This paper presents a complete exposition of Masked Modeling’s applications and theoretical aspects, detailing its use in various visual tasks as well as tasks beyond vision, and discussing the current theoretical achievements and progress in Masked Modeling. Based on this, we propose promising future directions for Masked Modeling, aligned with current hot research topics in the artificial intelligence community, such as multimodality and large models, providing readers with ideas for proposing new models and methods based on this article.

ACKNOWLEDGMENTS

This work was supported by the National Key R&D Program of China (No. 2022ZD0115100), the National Natural Science Foundation of China Project (No. U21A20427), and Project (No. WU2022A009) from the Center of Synthetic Biology and Integrated Bioengineering of Westlake University. This work was done by Luyuan Zhang and Zedong Wang during their internship at Westlake University.

Model Category Type Mask Encoder Target MIM Head CL Head Loss Publish
MAE [78] BTPM AE Random Transformer Pixel Transformer - MSE CVPR’2022
iGPT [30] BTTM AR AR Mask Transformer Offline Tokenizer Linear - CE ICML’2020
SimMIM [241] BTPM AE Random Transformer Pixel Linear - MSE CVPR’2022
iBOT [279] BTTM AE Random Transformer Tokenizer MLP - CE ICLR’2022
MST [130] ATPM AE Attention Transformer Feature, Pixel MLP - CE, MSE NIPS’2021
RePre [216] BTPM AE Random Transformer Pixel CNN Transformer - MSE arXiv’2022
ADIOS [194] ATPM AE Adversarial ResNet, Transformer Pixel MLP - MSE ICML’2022
MSN [1] BTFC AE Random Transformer Feature - Softmax CE arXiv’2022
AttMask [104] ATFM AE Attention Transformer Feature Transformer - CE ECCV’2022
CAE [32] BTFM AE Random Transformer Feature Transformer - CE, MSE IJCV’2023
UnMAE [128] ATPM AE Uniform Sampling Transformer Pixel Transformer - MSE arXiv’2022
SemMAE [121] ATPM AE Semantic Guided Transformer Pixel Transformer - MSE NIPS’2022
SIM [202] BTFM AE Random Transformer Feature Transformer - MSE arXiv’2022
ExtreMA [232] BTFC AE Random Transformer Feature - FC InfoNCE arXiv’2022
LoMaR [26] ATPM AE Local Mask Transformer Pixel Transformer - MSE arXiv’2022
CMAE [99] ATPC AE Local Mask Transformer Pixel - FC InfoNCE, MSE arXiv’2022
MaskCLIP [48] BTFB AE Random Transformer Feature Transformer FC InfoNCE, MSE arXiv’2022
BEiT [12] BTTM AE Random Transformer Offline Tokenizer Linear - CE ICLR’2022
BEiT.v2 [171] BTTM AE Random Transformer Offline Tokenizer Linear - CE arXiv’2022
BEiT.v3 [220] BTTM AE Random Transformer Tokenizer Linear - CE arXiv’2022
DMAE [231] BTPM AE Random Transformer Pixel Transformer - MSE arXiv’2022
MILAN [87] ATFM AE Attention Transformer Feature Transformer - MSE arXiv’2022
MimCo [59] BTFC AE Random Transformer Feature - FC InfoNCE arXiv’2022
dBOT [143] BTFM AE Random Transformer Feature Transformer - $\ell_1$ arXiv’2022
RC-MAE [118] BTPM AE Random Transformer Pixel Transformer - MSE arXiv’2022
MaskDistill [172] BTFM AE Random Transformer Feature Transformer - $\ell_1$, Cosine arXiv’2022
i-MAE [263] ATPM AE Mixture Transformer Pixel Transformer - MSE arXiv’2022
CAE.V2 [269] BTFM AE Random Transformer Feature FC - Cosine arXiv’2022
FastMIM [73] BTFM AE Random Transformer HOG Feature Transformer - MSE arXiv’2022
FLIP [129] BTFC AE Random Transformer Text, Feature - FC InfoNCE CVPR’2023
data2vec [6] BTFM AE Random Transformer Feature Linear - $\ell_1$ ICML’2022
data2vec2.0 [5] ATFM AE Multi-Masking Transformer Feature CNN - MSE ICML’2023
A-CLIP [251] ATFC AE Attention Transformer Feature - FC InfoNCE arXiv’2022
ConMIM [253] BTPC AE Random Transformer Pixel - FC InfoNCE ICLR’2023
Layer Grafted [100] BTPC AE Random Transformer Pixel - FC InfoNCE, MSE ICLR’2023
ccMIM [268] ATPM AE Attention Transformer Pixel Transformer - MSE ICLR’2023
AutoMAE [25] ATPM AE Adversarial Transformer Pixel Transformer - MSE arXiv’2023
HPM [212] ATPM AE Hard Sampling Transformer Pixel Transformer - MSE CVPR’2023
MaPeT [13] BTTM AE, AR Random Transformer Tokenizer Transformer - Likelihood arXiv’2023
R-MAE [164] BCPM AE Random Transformer Pixel Transformer - CE arXiv’2023
MFM [239] BCFM AE Random Transformer, CNN Fourier Feature Linear - Fourier Loss ICCV’2023
Hiera [190] BTPM AE Random Transformer Pixel Transformer - MSE ICML’2023
A2MIM [124] BCFM AE Random Transformer, CNN Fourier, HOG Feature Linear - $\ell_1$, Focal FFT ICML’2023
I-JEPA [2] ATPM AE Contextual Transformer Pixel Transformer - L2 CVPR’2023
RandSAC [92] BTTM AR Random Transformer Tokenizer Transformer - CE ICLR’2023
MAGE [125] BTTB AE, AR Random Transformer Tokenizer Transformer MLP CE, InfoNCE CVPR’2023
MaskGIT [20] BTTM AR Random Transformer Tokenizer Transformer - CE CVPR’2022
ConvNext.v2 [228] BCPM AE Random CNN Pixel CNN - MSE arXiv’2023
Spark [203] BCPM AE Random CNN Pixel CNN - MSE ICLR’2023
MixMIM [139] ATPM AE Mixture Transformer Pixel Transformer - MSE arXiv’2022
CIM [275] BCTM AE Random Transformer,CNN Tokenizer Transformer - CE ICLR’2023
MP3 [18] BTFM AE Random Transformer Feature Linear - MSE ICML’2022
mc-BEiT [127] BTTM AE Random Transformer Tokenizer MLP - CE ECCV’2022
BootMAE [47] BTPM AE Random Transformer Pixel, Feature Transformer - MSE ECCV’2022
Ge2AE [137] BTFB AE Random Transformer Fourier Feature Transformer FC Focal FFT, MSE AAAI’2023
SdAE [33] BTPM AE Random Transformer Pixel Transformer - Cosine ECCV’2022
MaskFeat [224] BTFM AE Random Transformer Feature Linear - MSE CVPR’2022
MultiMAE [4] BTFM AE Random Transformer Feature Transformer - MSE ECCV’2022
MVP [225] BTTM AE Random Transformer Token Linear - CE arXiv’2022
FD [226] BTFM AE Random Transformer Feature FC - $\ell_1$ arXiv’2022
TTT-MAE [61] BTPM AE Random Transformer Pixel Transformer - MSE NIPS’2022
ObjMAE [229] ATPM AE Hard Sampling Transformer Pixel Transformer - MSE arXiv’2022
MaskVLM [112] BTPM AE Random Transformer Pixel, Feature Transformer - MSE ICLR’2023
SDMAE [102] ATFB AE Contextual Transformer Pixel, Feature Transformer FC InfoNCE, MSE arXiv’2022
DMJD [154] ATFM AE Disjoint Transformer Feature Transformer - MSE arXiv’2023
LocalMAE [213] BTFM AE Random Transformer Feature Transformer - MSE CVPR’2023
MaskAlign [244] ATFM AE Attention Transformer Feature MLP - MSE CVPR’2023
MOMA [252] BTFC AE Random Transformer Feature - FC InfoNCE arXiv’2023
PixMIM [145] BTFM AE Random Transformer Feature Transformer - MSE arXiv’2023
MAE-Lite [218] BTPM AE Random Transformer Pixel Transformer - MSE ICML’2023
SparseMAE [276] BTPM AE Random Transformer Pixel Transformer - MSE ICCV’2023
RobustMAE [97] BTFM AE Random Transformer Feature Transformer - CE ICCV’2023
CAN [161] BTPB AE Random Transformer Pixel Transformer FC InfoNCE, MSE ICCV’2023
DILEMMA [191] BTFM AE Random Transformer Feature Transformer - CE AAAI’2023
TinyMIM [182] BTFM AE Random Transformer Feature Transformer - MSE arXiv’2023
PeCo [46] BTTM AE Random Transformer Token Linear - CE AAAI’2023
DropPos [211] BTFM AE Random Transformer Feature MLP - CE NIPS’2023
MSCN [101] BTFM AE Random Transformer Feature MLP - MSE arXiv’2023
Img2vec [168] BTFM AE Random Transformer Feature MLP - MSE arXiv’2023
TABLE IV: Detailed information of fundamental masked image modeling (MIM) methods (continued in Table V).
Model Category Type Mask Encoder Target MIM Head CL Head Loss Publish
CAE.v2 [269] BTFM AE Random Transformer Feature FC - Cosine arXiv’2022
GreenMIM [94] BTPM AE Random Transformer Pixel Transformer - MSE NIPS’2022
HiViT [271] BTPM AE Random Transformer Pixel Transformer - MSE ICLR’2023
ConvMAE [63] BCPM AE Random Transformer,CNN Pixel Transformer - MSE NIPS’2022
MFM [146] BTFM AE Random Transformer Feature Transformer - MSE ICCV’2023
SparseMAE [276] BTFM AE Random Transformer Pixel Transformer - MSE ICCV’2023
DeepMIM [181] BTFM AE Random Transformer Pixel, Feature Transformer - MSE arXiv’2023
D-iGPT [181] BTTB AE Random Transformer Tokenizer Transformer - CE arXiv’2023
RevColV2 [77] BTPM AE Random Transformer Pixel Transformer - MSE NIPS’2023
VL-GPT [281] BTTM AE Random Transformer Tokenizer Transformer - CE, MSE arXiv’2023
VL-BERT [200] BTTM AE Random Transformer Tokenizer Linear - CE ICLR’2020
Unified-IO [150] BTFM AE Binary Transformer Feature Transformer - InfoNCE arXiv’2022
LVM[9] BTTM AR AR Mask Transformer Tokenizer Transformer - CE arXiv’2023
TABLE V: Detailed information of fundamental masked image modeling (MIM) methods (continued from Table IV).
Model Task Type Category Mask Encoder Target Head Publication
MeshMAE [131] 3D Mesh AE BTPM Random Transformer Pixel MIM Head ECCV’2022
MIMDet [56] Detection AE RTTM Random Transformer Token MIM Head arXiv’2022
iTPN [205] Detection, Segmentation AE BTFM Random Transformer Feature MIM Head CVPR’2023
PiMAE [22] Detection AE BTFM Random Transformer Feature MIM Head ICCV’2023
imTED [270] Detection AE BTFM Random Transformer Feature MIM Head CVPR’2023
NXTP [259] Detection AR BTTM AR Mask Transformer Token MIM Head arXiv’2023
MRT [274] Detection AE ATFM Hard Sampling Transformer Feature MIM Head ICCV’2023
MKD [116] KD AE BTFM Random Transformer Feature MIM Head ICCV’2023
G2SD [98] KD AE BTFM Random Transformer Feature MIM Head CVPR’2023
MILES [66] Video AE ATFM Contextual Transformer Feature MIM Head arXiv’2022
VideoGPT [246] Video AR BTTM AR Mask Transformer Token MIM Head arXiv’2021
MAR [174] Video AE ATPM Cell Running Transformer Pixel MIM Head arXiv’2022
OmniMAE [70] Video AE BTPM Random Transformer Pixel MIM Head arXiv’2022
MaskViT [75] Video AE, AR BTTM Random Transformer Token MIM Head CVPR’2023
FMNet [223] Video AE BTFM Random Transformer Feature MIM Head ACMMM’2022
MAE [58] Video AE BTPM Random Transformer Pixel MIM Head NIPS’2022
VideoMAE [206] Video AE BTPM Random Transformer Pixel MIM Head NIPS’2022
VideoMAE.v2 [215] Video AE BTPM Random Transformer Pixel MIM Head CVPR’2023
MotionMAE [247] Video AE BTPM Random Transformer Pixel MIM Head arXiv’2022
MAM2 [197] Video AE BTTM Random Transformer Token MIM Head arXiv’2022
CMAE-V [149] Video AE BTPB Random Transformer Pixel CL & MIM Head arXiv’2023
DropMAE [230] Video AE BTPM Random Transformer Pixel MIM Head CVPR’2023
MAGVIT [255] Video AE, AR BTTM Random Transformer Token MIM Head CVPR’2023
AdaMAE [11] Video AE BTPM Random Transformer Pixel MIM Head CVPR’2023
SiamMAE [76] Video AE BTPM Random Transformer Pixel MIM Head arXiv’2023
MGMAE [93] Video AE BTFM Random Transformer Feature MIM Head ICCV’2023
Forecast-MAE [36] Video AE BTFM Random Transformer Feature MIM Head ICCV’2023
Traj-MAE [24] Video AE BTFM Random Transformer Feature MIM Head ICCV’2023
MGM [52] Video AE ATPM Motion Guided Transformer Pixel MIM Head ICCV’2023
HumanMAC [156] Video AE BTFM Random Transformer Feature MIM Head ICCV’2023
SkeletonMAE [245] Video AE ATFM Joint Mask Transformer Feature MIM Head ICCV’2023
MAMP [29] Video AE ATFM Motion Aware Transformer Feature MIM Head ICCV’2023
GeoMIM [140] Video AE BTFM Random Transformer Feature MIM Head ICCV’2023
SD-MAE [102] Medical Image AE BTPM Random Transformer Pixel MIM Head arXiv’2022
MedMAE [280] Medical Image AE BTPM Random Transformer Pixel MIM Head arXiv’2022
GCMAE [175] Medical Image AE BTPM Random Transformer Pixel MIM Head arXiv’2022
FreMAE [221] Medical Image AE BTFM Random Transformer Fourier Feature MIM Head arXiv’2023
MRM [250] Medical Image AE ATPM Relation Mask Transformer Pixel MIM Head ICCV’2023
DocMAE [141] OCR AE BTPM Random Transformer Pixel MIM Head ICME’2023
SatMAE [41] Remote Sensing AE BTPM Consistent Independent Transformer Pixel MIM Head arXiv’2022
CMID [163] Remote Sensing AE BTFB Random Transformer Fourier Feature CL & MIM Head TGRS’2023
Scale-MAE [180] Remote Sensing AE BTPM Random Transformer Pixel MIM Head ICCV’2023
MGViT [34] Few Shot AE BTPM Random Transformer Pixel MIM Head NIPS’2022
VoxelMAE [159] 3D Point AE BTFM Random Transformer Voxel MIM Head arXiv’2022
PointBERT [258] 3D Point AE BTTM Random Transformer Token MIM Head CVPR’2022
PointMAE [170] 3D Point AE BTFM Random Transformer Feature MIM Head ECCV’2022
MaskPoint [136] 3D Point AE BTFM Random Transformer Real & Fake MIM Head ECCV’2022
Point-M2AE [265] 3D Point AE BTPM Random Transformer Pixel MIM Head NIPS’2022
PointCMP [193] 3D Point AE BTTB Random Transformer Token CL & MIM Head CVPR’2023
I2P-MAE [266] 3D Point AE BTFM Random Transformer Feature MIM Head CVPR’2023
GeoMAE [204] 3D Point AE BTPM Random Transformer Pixel MIM Head CVPR’2023
ACT [45] 3D Point AE BTFM Random Transformer Feature MIM Head ICLR’2023
ReCon [173] 3D Point AE BTFB Random Transformer Feature CL & MIM Head ICML’2023
MGM [52] 3D Point AE BTPM Random Transformer Pixel MIM Head ICCV’2023
TABLE VI: Detailed information of MIM methods for vision downstream tasks.
Dataset Modality Type Pre-training Downstream Task Training Set Link
ImageNet-1K[189] CV Image CL MIM Classification 1,281,167 ImageNet
COCO 2014 Detection[133] CV Image CL MIM Detection, Segmentation 83000 COCO2014
COCO 2017 Detection[133] CV Image CL MIM Detection, Segmentation 118,000 COCO2017
PASCAL Context CV Image CL MIM Segmentation 4998 PASCAL Context
MNIST[217] CV Image - Classification 60,000 MNIST
Cityscapes[42] CV Image CL Segmentation 2975 Cityscapes
Kinetics700[105] CV Video CL, MIM Action Recognition 494,801 Kinetics
UCF101[198] CV Video CL, MIM Action Recognition 9,537 UCF-101
RareAct[158] CV Video CL MIM Action Recognition 7,607 RareAct
AID[234] CV Image CL, MIM Classification 10,000 AID
PASCAL VOC 2007 Classification[51] CV Image CL,MIM Classification 5011 PASCAL VOC
Oxford 102 Flowers [165] CV Image CL Classification 2040 Oxford 102 Flowers
SUN397[237] CV Image CL,MIM Classification 19,850 SUN397
Tiny-ImageNet[117] CV Image CL MIM Classification 100,000 TinyIN
CIFAR-10[110] CV Image CL Classification 50,000 CIFAR-10
CIFAR-100[110] CV Image CL Classification 50,000 CIFAR-100
STL-10[40] CV Image CL MIM Classification 1,000 STL
CUB-200-2011[210] CV Image CL MIM Classification 11,788 CUB-200-2011
FGVC-Aircraft[155] CV Image CL MIM Classification 6,770 Aircraft
StanfordCars[109] CV Image CL MIM Classification 8,144 StanfordCars
Places205[277] CV Image CL MIM Recognition 2,500,000 Places205
iNaturalist[84] CV Image CL MIM Classification 675,170 iNaturalist
AgeDB[162] CV Image MIM Age Estimation 16,488 AgeDB
Fashion-MNIST[236] CV Image MIM Classification 70,000 Fashion-MNIST
KITTI-360[132] CV 3D Point Cloud CL MIM Detection, Segmentation 43552 KITTI Vision
ShapeNet[19] CV 3D PointCloud CL MIM Recognition, Classification 220,000 ShapeNet
Caltech-101[57] CV Image CL MIM Classification 3060 Caltech-101
Charades[195] CV Video CL MIM Recognition 66,500 Charades
AVA[72] CV Video CL MIM Detection 211,000 AVA
LVIS [74] CV Image CL MIM Detection 118,000 LVIS
CC12M[21] CV, NLP Image, Text MM CL Classification 12,000,000 CC12M
LAION-5B[192] CV, NLP Image, Text MM CL Classification 400,000,000 LAION
Flickr30k [17] CV, NLP Image, Text MM CL Image-Text Retrieval 31783 Flickr30k
COCO Caption CV, NLP Image, Text MM CL Image-Text Retrieval 82783 COCO Caption
LSMDC[185] CV, NLP Video, Text MM CL Movie Description 118,081 LSMDC
ADE20K[278] CV, NLP Image, Text CL, MIM Scene Parsing 20,000 ADE-20K
TACoS[184] CV, NLP Text, Video CL, MM Detection 2,600 TACoS
RACE[114] NLP Text MLM Reading Comprehension 28,000 RACE
MS MARCO[16] NLP Text MLM Question Answering 1,000,000 MS MARCO
AudioSet[67] Audio, NLP Speech, Text MM, MLM Sound Classification 2,000,000 AudioSet
LibriSpeech[169] Audio Speech MLM Speech Recognition 1,789,621 LibriSpeech
TABLE VII: Summary of datasets for MIM pre-training and vision downstream tasks. Links to the dataset websites are provided.
EVA[55] EVA-02[54] WSP[196] Painter[222] ViT-G[262] MAE(ViT-L) [78] LVM[9] InternVL[35]
Layer 40 24 24 24 48 16 26 48
Attention Head 16 16 32 16 16 24 32 25
Parameters 1011M 304M 1.89B 307M 1.84B 307M 3B 5903M
Pre-training IN-21K, CC3M, IN-21K, CC3M, IN-1K, IN-Real, ADE20K IN-1K IN-1K UVD LAION-COCO,COYO
Dataset CC12M CC12M NYUv2 JFT-3B ADE20K CC12M
Downstream ADE, COCO, ADE, COCO, COCO, ObjectNet COCO, Rain, ObjectNet COCO IN-1K IN-1K
Dataset Object365, Kinetics Object365, Kinetics Kinetics SIDD Real Kinetics ADE20K
Segmentation 62.3 mIoU 63.8 mIoU 51.8 mIoU 49.9 mIoU - 53.6 mIoU - 58.9 mIoU
Detection 64.7 AP 65.9 AP 58.0 AP 72.2 AP - 53.3 AP - -
Video Recognition 89.8 acc - 86.0 acc - - - - 71.5 acc
Classification 84.0 acc 85.5 acc 90.9 acc - 84.86 acc 87.8 acc - 82.5 acc
TABLE VIII: Experimental details and results of vision foundation models. IN denotes ImageNet datasets. LVM only performs comparison experiments of visual prompting and lacks standard benchmark results.
Model Modality Pre-trained Method Pre-trained Dataset Downstream Task
BEiT.v3[220] CV, NLP MIM, MLM IN-1K, ADE20K, COCO, NLVR2 Classification, Detection, Segmentation
MaskVLM[112] CV, NLP MIM, MLM, CL CC, COCO, SBU, Flickr30K Image-Text Retrieval, Natural Language for Visual Reasoning, Visual Entailment, Visual Question Answering
FLIP[129] CV, NLP MIM, CL, MLM LAION-5B, IN-1K, COCO, Flickr30K Classification, Image-Text Retrieval, Image Captioning, Visual Question Answering
A-CLIP[251] CV, NLP MIM, CL IN-1K, YFCC100M, COCO, Flickr30K, Aircraft, MNIST Classification (Zero-shot), Image-Text Retrieval
VL-BERT[200] CV, NLP MLM, MIM COCO, RefCOCO+, VCR Classification, Segmentation, Visual Question Answering
MaskCLIP[48] CV, NLP MIM, MLM, CL IN-1K, ADE20K, COCO, Flickr30K Classification (Zero-shot), Detection, Segmentation
MaskGIT[20] CV, NLP MIM IN-1K Image-Text Generation
VL-GPT[281] CV, NLP MIM CC3M, LAION-COCO, MMC4 Image Generation, Text-to-Image Generation
DALLE[178] CV, NLP MIM, MLM IN-1K, CC, COCO, CUB200 Text-Image Generation
LQAE[138] CV, NLP MIM, MLM IN-1K Text-Image Alignment
SPAE[256] CV, NLP MLM, MIM IN-1K, Kinetics Text-Image Generation
InstructCV[60] CV, NLP MLM IN-1K, MSCOCO, ADE20K Text-Image Generation
TABLE IX: Details of MIM methods with both image and text data modalities.

References

  • [1] M. Assran, M. Caron, I. Misra, P. Bojanowski, F. Bordes, P. Vincent, A. Joulin, M. G. Rabbat, and N. Ballas. Masked siamese networks for label-efficient learning. In European Conference on Computer Vision, 2022.
  • [2] M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y. LeCun, and N. Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15619–15629, June 2023.
  • [3] A. Baade, P. Peng, and D. F. Harwath. Mae-ast: Masked autoencoding audio spectrogram transformer. ArXiv, abs/2203.16691, 2022.
  • [4] R. Bachmann, D. Mizrahi, A. Atanov, and A. R. Zamir. Multimae: Multi-modal multi-task masked autoencoders. ArXiv, abs/2204.01678, 2022.
  • [5] A. Baevski, A. Babu, W.-N. Hsu, and M. Auli. Efficient self-supervised learning with contextualized target representations for vision, speech and language. 2022.
  • [6] A. Baevski, W.-N. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli. data2vec: A general framework for self-supervised learning in speech, vision and language. In International Conference on Machine Learning, 2022.
  • [7] A. Baevski, S. Schneider, and M. Auli. vq-wav2vec: Self-supervised learning of discrete speech representations. ArXiv, abs/1910.05453, 2019.
  • [8] A. Baevski, H. Zhou, A. Mohamed, and M. Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. ArXiv, abs/2006.11477, 2020.
  • [9] Y. Bai, X. Geng, K. Mangalam, A. Bar, A. Yuille, T. Darrell, J. Malik, and A. A. Efros. Sequential modeling enables scalable learning for large vision models, 2023.
  • [10] Y. Bai, Z. Wang, J. Xiao, C. Wei, H. Wang, A. L. Yuille, Y. Zhou, and C. Xie. Masked autoencoders enable efficient knowledge distillers. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24256–24265, 2022.
  • [11] W. G. C. Bandara, N. Patel, A. Gholami, M. Nikkhah, M. Agrawal, and V. M. Patel. Adamae: Adaptive masking for efficient spatiotemporal learning with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14507–14517, June 2023.
  • [12] H. Bao, L. Dong, and F. Wei. Beit: Bert pre-training of image transformers. In International Conference on Learning Representations (ICLR), 2022.
  • [13] L. Baraldi, R. Amoroso, M. Cornia, A. Pilzer, and R. Cucchiara. Learning to mask and permute visual tokens for vision transformer pre-training. ArXiv, abs/2306.07346, 2023.
  • [14] D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba. Gan dissection: Visualizing and understanding generative adversarial networks. arXiv preprint arXiv:1811.10597, 2018.
  • [15] J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, W. Manassra, P. Dhariwal, C. Chu, Y. Jiao, and A. Ramesh. Improving image generation with better captions.
  • [16] D. F. Campos, T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, L. Deng, and B. Mitra. Ms marco: A human generated machine reading comprehension dataset. ArXiv, abs/1611.09268, 2016.
  • [17] J. Carreira, E. Noland, C. Hillier, and A. Zisserman. A short note on the kinetics-700 human action dataset. ArXiv, abs/1907.06987, 2019.
  • [18] S. Casas, A. Sadat, and R. Urtasun. Mp3: A unified model to map, perceive, predict and plan. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14398–14407, 2021.
  • [19] A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q.-X. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. Shapenet: An information-rich 3d model repository. ArXiv, abs/1512.03012, 2015.
  • [20] H. Chang, H. Zhang, L. Jiang, C. Liu, and W. T. Freeman. Maskgit: Masked generative image transformer. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11305–11315, 2022.
  • [21] S. Changpinyo, P. K. Sharma, N. Ding, and R. Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3557–3567, 2021.
  • [22] A. Chen, K. Zhang, R. Zhang, Z. Wang, Y. Lu, Y. Guo, and S. Zhang. Pimae: Point cloud and image interactive masked autoencoders for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5291–5301, June 2023.
  • [23] H. Chen, J. Gu, Y. Liu, S. A. Magid, C. Dong, Q. Wang, H. Pfister, and L. Zhu. Masked image training for generalizable deep image denoising. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1692–1703, 2023.
  • [24] H. Chen, J. Wang, K. Shao, F. Liu, J. Hao, C. Guan, G. Chen, and P.-A. Heng. Traj-mae: Masked autoencoders for trajectory prediction. ArXiv, abs/2303.06697, 2023.
  • [25] H. Chen, W. Zhang, Y. Wang, and X. Yang. Improving masked autoencoders by learning where to mask. ArXiv, abs/2303.06583, 2023.
  • [26] J. Chen, M. Hu, B. Li, and M. Elhoseiny. Efficient self-supervised vision pretraining with local masked reconstruction. arXiv preprint arXiv:2206.00790, 2022.
  • [27] J. Chen, M. Ma, R. Zheng, and L. Huang. Mam: Masked acoustic modeling for end-to-end speech-to-text translation. ArXiv, abs/2010.11445, 2020.
  • [28] K. Chen, Z. Liu, L. Hong, H. Xu, Z. Li, and D.-Y. Yeung. Mixed autoencoder for self-supervised visual representation learning. ArXiv, abs/2303.17152, 2023.
  • [29] L. Chen, J. Zhang, Y. rong Li, Y. Pang, X. Xia, and T. Liu. Humanmac: Masked motion completion for human motion prediction. ArXiv, abs/2302.03665, 2023.
  • [30] M. Chen, A. Radford, J. Wu, H. Jun, P. Dhariwal, D. Luan, and I. Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning, 2020.
  • [31] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
  • [32] X. Chen, M. Ding, X. Wang, Y. Xin, S. Mo, Y. Wang, S. Han, P. Luo, G. Zeng, and J. Wang. Context autoencoder for self-supervised representation learning. ArXiv, abs/2202.03026, 2022.
  • [33] Y. Chen, Y. Liu, D. Jiang, X. Zhang, W. Dai, H. Xiong, and Q. Tian. Sdae: Self-distillated masked autoencoder. In European Conference on Computer Vision, 2022.
  • [34] Y. Chen, Z. Xiao, L. Zhao, L. Zhang, H. Dai, D. Liu, Z. Wu, C. Li, T. Zhang, C. Li, D. Zhu, T. Liu, and X. Jiang. Mask-guided vision transformer (mg-vit) for few-shot learning. ArXiv, abs/2205.09995, 2022.
  • [35] Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, Z. Muyan, Q. Zhang, X. Zhu, L. Lu, B. Li, P. Luo, T. Lu, Y. Qiao, and J. Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. 2023.
  • [36] J. Cheng, X. Mei, and M.-Y. Liu. Forecast-mae: Self-supervised pre-training for motion forecasting with masked autoencoders. ArXiv, abs/2308.09882, 2023.
  • [37] P.-H. Chi, P.-H. Chung, T.-H. Wu, C.-C. Hsieh, Y.-H. Chen, S.-W. Li, and H.-y. Lee. Audio albert: A lite bert for self-supervised learning of audio representation. In 2021 IEEE Spoken Language Technology Workshop (SLT), pages 344–350. IEEE, 2021.
  • [38] D. Chong, H. Wang, P. Zhou, and Q.-J. Zeng. Masked spectrogram prediction for self-supervised audio pre-training. ArXiv, abs/2204.12768, 2022.
  • [39] Y.-A. Chung and J. R. Glass. Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech. ArXiv, abs/1803.08976, 2018.
  • [40] A. Coates, A. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics, 2011.
  • [41] Y. Cong, S. Khanna, C. Meng, P. Liu, E. Rozi, Y. He, M. Burke, D. Lobell, and S. Ermon. Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery. ArXiv, abs/2207.08051, 2022.
  • [42] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3213–3223, 2016.
  • [43] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [44] C. Doersch, A. K. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. 2015 IEEE International Conference on Computer Vision (ICCV), pages 1422–1430, 2015.
  • [45] R. Dong, Z. Qi, L. Zhang, J. Zhang, J. Sun, Z. Ge, L. Yi, and K. Ma. Autoencoders as cross-modal teachers: Can pretrained 2d image transformers help 3d representation learning? In The Eleventh International Conference on Learning Representations (ICLR), 2023.
  • [46] X. Dong, J. Bao, T. Zhang, D. Chen, W. Zhang, L. Yuan, D. Chen, F. Wen, and N. Yu. Peco: Perceptual codebook for bert pre-training of vision transformers. In AAAI Conference on Artificial Intelligence, 2021.
  • [47] X. Dong, J. Bao, T. Zhang, D. Chen, W. Zhang, L. Yuan, D. Chen, F. Wen, and N. Yu. Bootstrapped masked autoencoders for vision bert pretraining. In European Conference on Computer Vision, 2022.
  • [48] X. Dong, Y. Zheng, J. Bao, T. Zhang, D. Chen, H. Yang, M. Zeng, W. Zhang, L. Yuan, D. Chen, F. Wen, and N. Yu. Maskclip: Masked self-distillation advances contrastive language-image pretraining. ArXiv, abs/2208.12262, 2022.
  • [49] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ArXiv, abs/2010.11929, 2020.
  • [50] P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis, 2021.
  • [51] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. https://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
  • [52] D. Fan, J. Wang, S. Liao, Y. Zhu, V. Bhat, H. J. Santos-Villalobos, M. V. Rohith, and X. Li. Motion-guided masking for spatiotemporal representation learning. ArXiv, abs/2308.12962, 2023.
  • [53] Y. Fang, L. Dong, H. Bao, X. Wang, and F. Wei. Corrupted image modeling for self-supervised visual pre-training. arXiv preprint arXiv:2202.03382, 2022.
  • [54] Y. Fang, Q. Sun, X. Wang, T. Huang, X. Wang, and Y. Cao. Eva-02: A visual representation for neon genesis. ArXiv, abs/2303.11331, 2023.
  • [55] Y. Fang, W. Wang, B. Xie, Q.-S. Sun, L. Y. Wu, X. Wang, T. Huang, X. Wang, and Y. Cao. Eva: Exploring the limits of masked visual representation learning at scale. ArXiv, abs/2211.07636, 2022.
  • [56] Y. Fang, S. Yang, S. Wang, Y. Ge, Y. Shan, and X. Wang. Unleashing vanilla vision transformer with masked image modeling for object detection. ArXiv, abs/2204.02964, 2022.
  • [57] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pages 178–178, 2004.
  • [58] C. Feichtenhofer, H. Fan, Y. Li, and K. He. Masked autoencoders as spatiotemporal learners. ArXiv, abs/2205.09113, 2022.
  • [59] Q.-F. Zhou, C. Yu, H. Luo, Z. Wang, and H. Li. Mimco: Masked image modeling pre-training with contrastive teacher. Proceedings of the 30th ACM International Conference on Multimedia, 2022.
  • [60] Y. Gan, S. Park, A. Schubert, A. Philippakis, and A. M. Alaa. Instructcv: Instruction-tuned text-to-image diffusion models as vision generalists. ArXiv, abs/2310.00390, 2023.
  • [61] Y. Gandelsman, Y. Sun, X. Chen, and A. A. Efros. Test-time training with masked autoencoders. ArXiv, abs/2209.07522, 2022.
  • [62] K. Gao, L. Wu, J. Zhu, T. Peng, Y. Xia, L. He, S. Xie, T. Qin, H. Liu, K. He, et al. Pre-training antibody language models for antigen-specific computational antibody design. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 506–517, 2023.
  • [63] P. Gao, T. Ma, H. Li, J. Dai, and Y. J. Qiao. Convmae: Masked convolution meets masked autoencoders. ArXiv, abs/2205.03892, 2022.
  • [64] T. Gao, X. Yao, and D. Chen. Simcse: Simple contrastive learning of sentence embeddings. ArXiv, abs/2104.08821, 2021.
  • [65] Z. Gao, C. Tan, and S. Z. Li. Vqpl: Vector quantized protein language. ArXiv, abs/2310.04985, 2023.
  • [66] Y. Ge, Y. Ge, X. Liu, A. Wang, J. Wu, Y. Shan, X. Qie, and P. Luo. Miles: Visual bert pre-training with injected language semantics for video-text retrieval. ArXiv, abs/2204.12408, 2022.
  • [67] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 776–780, 2017.
  • [68] Z. Geng, B. Yang, T. Hang, C. Li, S. Gu, T. Zhang, J. Bao, Z. Zhang, H. Hu, D. Chen, and B. Guo. Instructdiffusion: A generalist modeling interface for vision tasks. ArXiv, abs/2309.03895, 2023.
  • [69] M.-I. Georgescu, E. Fonseca, R. T. Ionescu, M. Lucic, C. Schmid, and A. Arnab. Audiovisual masked autoencoders. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16144–16154, October 2023.
  • [70] R. Girdhar, A. El-Nouby, M. Singh, K. V. Alwala, A. Joulin, and I. Misra. Omnimae: Single model masked pretraining on images and videos. ArXiv, abs/2206.08356, 2022.
  • [71] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.
  • [72] C. Gu, C. Sun, S. Vijayanarasimhan, C. Pantofaru, D. A. Ross, G. Toderici, Y. Li, S. Ricco, R. Sukthankar, C. Schmid, and J. Malik. Ava: A video dataset of spatio-temporally localized atomic visual actions. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6047–6056, 2017.
  • [73] J. Guo, K. Han, H. Wu, Y. Tang, Y. Wang, and C. Xu. Fastmim: Expediting masked image modeling pre-training for vision. 2022.
  • [74] A. Gupta, P. Dollár, and R. B. Girshick. Lvis: A dataset for large vocabulary instance segmentation. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5351–5359, 2019.
  • [75] A. Gupta, S. Tian, Y. Zhang, J. Wu, R. Martín-Martín, and L. Fei-Fei. Maskvit: Masked visual pre-training for video prediction. ArXiv, abs/2206.11894, 2022.
  • [76] A. Gupta, J. Wu, J. Deng, and L. Fei-Fei. Siamese masked autoencoders. ArXiv, abs/2305.14344, 2023.
  • [77] Q. Han, Y. Cai, and X. Zhang. Revcolv2: Exploring disentangled representations in masked image modeling. ArXiv, abs/2309.01005, 2023.
  • [78] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.
  • [79] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
  • [80] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Proceedings of the International Conference on Computer Vision (ICCV), 2017.
  • [81] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [82] L. He, S. Zhang, L. Wu, H. Xia, F. Ju, H. Zhang, S. Liu, Y. Xia, J. Zhu, P. Deng, et al. Pre-training co-evolutionary protein representation via a pairwise masked language model. arXiv preprint arXiv:2110.15527, 2021.
  • [83] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. ArXiv, abs/2006.11239, 2020.
  • [84] G. V. Horn, O. M. Aodha, Y. Song, A. Shepard, H. Adam, P. Perona, and S. J. Belongie. The inaturalist challenge 2017 dataset. ArXiv, abs/1707.06642, 2017.
  • [85] Z. Hou, Y. He, Y. Cen, X. Liu, Y. Dong, E. Kharlamov, and J. Tang. Graphmae2: A decoding-enhanced masked self-supervised graph learner. In Proceedings of the ACM Web Conference 2023, pages 737–746, 2023.
  • [86] Z. Hou, X. Liu, Y. Cen, Y. Dong, H. Yang, C. Wang, and J. Tang. Graphmae: Self-supervised masked graph autoencoders. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 594–604, 2022.
  • [87] Z. Hou, F. Sun, Y.-K. Chen, Y. Xie, and S. Y. Kung. Milan: Masked image pretraining on language assisted representation. ArXiv, abs/2208.06049, 2022.
  • [88] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021.
  • [89] M. Hu, F. Yuan, K. K. Yang, F. Ju, J. Su, H. Wang, F. Yang, and Q. Ding. Exploring evolution-based &-free protein language models as protein function predictors. arXiv preprint arXiv:2206.06583, 2022.
  • [90] W. Hu, B. Liu, J. Gomes, M. Zitnik, P. Liang, V. Pande, and J. Leskovec. Strategies for pre-training graph neural networks. In ICLR, 2019.
  • [91] Z. Hu, Y. Dong, K. Wang, K.-W. Chang, and Y. Sun. Gpt-gnn: Generative pre-training of graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1857–1867, 2020.
  • [92] T. Hua, Y. Tian, S. Ren, H. Zhao, and L. Sigal. Self-supervision through random segments with autoregressive coding (randsac). ArXiv, abs/2203.12054, 2022.
  • [93] B. Huang, Z. Zhao, G. Zhang, Y. Qiao, and L. Wang. Mgmae: Motion guided masking for video masked autoencoding. ArXiv, abs/2308.10794, 2023.
  • [94] L. Huang, S. You, M. Zheng, F. Wang, C. Qian, and T. Yamasaki. Green hierarchical vision transformer for masked image modeling. ArXiv, abs/2205.13515, 2022.
  • [95] P.-Y. Huang, H. Xu, J. Li, A. Baevski, M. Auli, W. Galuba, F. Metze, and C. Feichtenhofer. Masked autoencoders that listen. Advances in Neural Information Processing Systems, 35:28708–28720, 2022.
  • [96] P.-Y. Huang, H. Xu, J. B. Li, A. Baevski, M. Auli, W. Galuba, F. Metze, and C. Feichtenhofer. Masked autoencoders that listen. ArXiv, abs/2207.06405, 2022.
  • [97] Q. Huang, X. Dong, D. Chen, Y. Chen, L. Yuan, G. Hua, W. Zhang, and N. H. Yu. Improving adversarial robustness of masked autoencoders via test-time frequency-domain prompting. ArXiv, abs/2308.10315, 2023.
  • [98] W. Huang, Z. Peng, L. Dong, F. Wei, J. Jiao, and Q. Ye. Generic-to-specific distillation of masked autoencoders. ArXiv, abs/2302.14771, 2023.
  • [99] Z. Huang, X. Jin, C. Lu, Q. Hou, M.-M. Cheng, D. Fu, X. Shen, and J. Feng. Contrastive masked autoencoders are stronger vision learners. ArXiv, abs/2207.13532, 2022.
  • [100] Z. Jiang, Y. Chen, M. Liu, D. Chen, X. Dai, L. Yuan, Z. Liu, and Z. Wang. Layer grafted pre-training: Bridging contrastive learning and masked image modeling for label-efficient representations. In The Eleventh International Conference on Learning Representations, 2023.
  • [101] L. Jing, J. Zhu, and Y. LeCun. Masked siamese convnets. ArXiv, abs/2206.07700, 2022.
  • [102] J.-J. Mao, H. Zhou, X. Yin, Y. Chang, B. Nie, and R. Xu. Masked autoencoders are effective solution to transformer data-hungry. ArXiv, abs/2212.05677, 2022.
  • [103] J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool, R. Bates, A. Žídek, A. Potapenko, et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021.
  • [104] I. Kakogeorgiou, S. Gidaris, B. Psomas, Y. Avrithis, A. Bursuc, K. Karantzalos, and N. Komodakis. What to hide from your students: Attention-guided masked image modeling. In Proceedings of the European Conference on Computer Vision (ECCV), 2022.
  • [105] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, A. Natsev, M. Suleyman, and A. Zisserman. The kinetics human action video dataset. ArXiv, abs/1705.06950, 2017.
  • [106] M. O. Khan, J. Liang, C.-K. Wang, S. Yang, and Y. Lou. Mesa: Masked, geometric, and supervised pre-training for monocular depth estimation. ArXiv, abs/2310.04551, 2023.
  • [107] L. Kong, M. Q. Ma, G. Chen, E. P. Xing, Y. Chi, L.-P. Morency, and K. Zhang. Understanding masked autoencoders via hierarchical latent variable models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7918–7928, June 2023.
  • [108] X. Kong and X. Zhang. Understanding masked image modeling via learning occlusion invariant feature. ArXiv, abs/2208.04164, 2022.
  • [109] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3d object representations for fine-grained categorization. In 2013 IEEE International Conference on Computer Vision Workshops, pages 554–561, 2013.
  • [110] A. Krizhevsky. Learning multiple layers of features from tiny images. 2009.
  • [111] G. K. Kumar, S. S. Mullappilly, and A. S. Gehlot. An empirical study of self-supervised learning approaches for object detection with transformers. ArXiv, abs/2205.05543, 2022.
  • [112] G. Kwon, Z. Cai, A. Ravichandran, E. Bas, R. Bhotika, and S. Soatto. Masked vision and language modeling for multi-modal representation learning. In International Conference on Learning Representations (ICLR), 2023.
  • [113] C.-I. Lai. Contrastive predictive coding based feature for automatic speaker verification. ArXiv, abs/1904.01575, 2019.
  • [114] G. Lai, Q. Xie, H. Liu, Y. Yang, and E. H. Hovy. Race: Large-scale reading comprehension dataset from examinations. ArXiv, abs/1704.04683, 2017.
  • [115] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.
  • [116] S. Lao, G. Song, B. Liu, Y. Liu, and Y. Yang. Masked autoencoders are stronger knowledge distillers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6384–6393, October 2023.
  • [117] S. H. Lee, S. Lee, and B. C. Song. Vision transformer for small-size datasets. ArXiv, abs/2112.13492, 2021.
  • [118] Y. Lee, J. Willette, J. Kim, J. Lee, and S. J. Hwang. Exploring the role of mean teachers in self-supervised masked auto-encoders. ArXiv, abs/2210.02077, 2022.
  • [119] J. Lehner, B. Alkin, A. Fürst, E. Rumetshofer, L. Miklautz, and S. Hochreiter. Contrastive tuning: A little help to make masked autoencoders forget. ArXiv, abs/2304.10520, 2023.
  • [120] D. Li, H. Ling, A. Kar, D. Acuna, S. W. Kim, K. Kreis, A. Torralba, and S. Fidler. Dreamteacher: Pretraining image backbones with deep generative models. ArXiv, abs/2307.07487, 2023.
  • [121] G. Li, H. Zheng, D. Liu, B. Su, and C. Zheng. Semmae: Semantic-guided masking for learning masked autoencoders. ArXiv, abs/2206.10207, 2022.
  • [122] J. Li, P. Zhou, C. Xiong, R. Socher, and S. C. H. Hoi. Prototypical contrastive learning of unsupervised representations. ArXiv, abs/2005.04966, 2020.
  • [123] S. Li, Z. Wang, Z. Liu, C. Tan, H. Lin, D. Wu, Z. Chen, J. Zheng, and S. Z. Li. Efficient multi-order gated aggregation network. ArXiv, abs/2211.03295, 2022.
  • [124] S. Li, D. Wu, F. Wu, Z. Zang, K. Wang, L. Shang, B. Sun, H. Li, and S. Z. Li. Architecture-agnostic masked image modeling - from vit back to cnn. In International Conference on Machine Learning, 2023.
  • [125] T. Li, H. Chang, S. K. Mishra, H. Zhang, D. Katabi, and D. Krishnan. Mage: Masked generative encoder to unify representation learning and image synthesis. arXiv preprint arXiv:2211.09117, 2022.
  • [126] T. Li, D. Katabi, and K. He. Self-conditioned image generation via generating representations. 2023.
  • [127] X. Li, Y. Ge, K. Yi, Z. Hu, Y. Shan, and L.-Y. Duan. mc-beit: Multi-choice discretization for image bert pre-training. In European Conference on Computer Vision, 2022.
  • [128] X. Li, W. Wang, L. Yang, and J. Yang. Uniform masking: Enabling mae pre-training for pyramid-based vision transformers with locality. ArXiv, abs/2205.10063, 2022.
  • [129] Y. Li, H. Fan, R. Hu, C. Feichtenhofer, and K. He. Scaling language-image pre-training via masking. ArXiv, abs/2212.00794, 2022.
  • [130] Z. Li, Z. Chen, F. Yang, W. Li, Y. Zhu, C. Zhao, R. Deng, L. Wu, R. Zhao, M. Tang, and J. Wang. Mst: Masked self-supervised transformer for visual representation. In Neural Information Processing Systems, 2021.
  • [131] Y. Liang, S. Zhao, B. Yu, J. Zhang, and F. He. Meshmae: Masked autoencoders for 3d mesh data analysis. In Proceedings of the European Conference on Computer Vision (ECCV), 2022.
  • [132] Y. Liao, J. Xie, and A. Geiger. Kitti-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45:3292–3310, 2021.
  • [133] T.-Y. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 2014.
  • [134] A. T. Liu, S.-w. Yang, P.-H. Chi, P.-c. Hsu, and H.-y. Lee. Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6419–6423. IEEE, 2020.
  • [135] B. Liu, D. Hsu, P. Ravikumar, and A. Risteski. Masked prediction tasks: a parameter identifiability view. ArXiv, abs/2202.09305, 2022.
  • [136] H. Liu, M. Cai, and Y. J. Lee. Masked discrimination for self-supervised learning on point clouds. In European Conference on Computer Vision, 2022.
  • [137] H. Liu, X. Jiang, X. Li, A. Guo, D. Jiang, and B. Ren. The devil is in the frequency: Geminated gestalt autoencoder for self-supervised visual pre-training. In AAAI Conference on Artificial Intelligence, 2022.
  • [138] H. Liu, W. Yan, and P. Abbeel. Language quantized autoencoders: Towards unsupervised text-image alignment. ArXiv, abs/2302.00902, 2023.
  • [139] J. Liu, X. Huang, Y. Liu, and H. Li. Mixmim: Mixed and masked image modeling for efficient visual representation learning. ArXiv, abs/2205.13137, 2022.
  • [140] J. Liu, T. Wang, B. Liu, Q. Zhang, Y. Liu, and H. Li. Towards better 3d knowledge transfer via masked image modeling for multi-view 3d understanding. ArXiv, abs/2303.11325, 2023.
  • [141] S. Liu, H. Feng, W.-G. Zhou, H. Li, C. Liu, and F. Wu. Docmae: Document image rectification via self-supervised representation learning. 2023.
  • [142] X. Liu, F. Zhang, Z. Hou, L. Mian, Z. Wang, J. Zhang, and J. Tang. Self-supervised learning: Generative or contrastive. IEEE transactions on knowledge and data engineering, 35(1):857–876, 2021.
  • [143] X. Liu, J. Zhou, T. Kong, X. Lin, and R. Ji. Exploring target representations for masked autoencoders. arXiv preprint arXiv:2209.03917, 2022.
  • [144] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • [145] Y. Liu, S. Zhang, J. Chen, K. Chen, and D. Lin. Pixmim: Rethinking pixel reconstruction in masked image modeling. ArXiv, abs/2303.02416, 2023.
  • [146] Y. Liu, S. Zhang, J. Chen, Z. Yu, K. Chen, and D. Lin. Improving pixel-based mim by reducing wasted modeling capability. ArXiv, abs/2308.00261, 2023.
  • [147] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie. A convnet for the 2020s. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • [148] A. X. Lu, H. Zhang, M. Ghassemi, and A. Moses. Self-supervised contrastive learning of protein representations by mutual information maximization. BioRxiv, 2020.
  • [149] C. Lu, X. Jin, Z. Huang, Q. Hou, M.-M. Cheng, and J. Feng. Cmae-v: Contrastive masked autoencoders for video action recognition. ArXiv, abs/2301.06018, 2023.
  • [150] J. Lu, C. Clark, R. Zellers, R. Mottaghi, and A. Kembhavi. Unified-io: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916, 2022.
  • [151] A. Lugmayr, M. Danelljan, A. Romero, F. Yu, R. Timofte, and L. V. Gool. Repaint: Inpainting using denoising diffusion probabilistic models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11451–11461, 2022.
  • [152] Y. Luo, Z. Chen, and X. Gao. Self-distillation augmented masked autoencoders for histopathological image classification. ArXiv, abs/2203.16983, 2022.
  • [153] P. Lyu, C. Zhang, S. Liu, M. Qiao, Y. Xu, L. Wu, K. Yao, J. Han, E. Ding, and J. Wang. Maskocr: Text recognition with masked encoder-decoder pretraining. ArXiv, abs/2206.00311, 2022.
  • [154] X. Ma, C.-S. Liu, C. Xie, L. Ye, Y. Deng, and X. Ji. Disjoint masking with joint distillation for efficient masked image modeling. ArXiv, abs/2301.00230, 2022.
  • [155] S. Maji, E. Rahtu, J. Kannala, M. B. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. ArXiv, abs/1306.5151, 2013.
  • [156] Y. Mao, J. Deng, W.-G. Zhou, Y. Fang, W. Ouyang, and H. Li. Masked motion predictors are strong 3d action representation learners. ArXiv, abs/2308.07092, 2023.
  • [157] M. McDermott, B. Yap, H. Hsu, D. Jin, and P. Szolovits. Adversarial contrastive pre-training for protein sequences. arXiv preprint arXiv:2102.00466, 2021.
  • [158] A. Miech, J.-B. Alayrac, I. Laptev, J. Sivic, and A. Zisserman. Rareact: A video dataset of unusual interactions. ArXiv, abs/2008.01018, 2020.
  • [159] C. Min, X. Xu, D. Zhao, L. Xiao, Y. Nie, and B. Dai. Voxel-mae: Masked autoencoders for pre-training large-scale point clouds. ArXiv, abs/2206.09900, 2022.
  • [160] E. Min, R. Chen, Y. Bian, T. Xu, K. Zhao, W. Huang, P. Zhao, J. Huang, S. Ananiadou, and Y. Rong. Transformer for graphs: An overview from architecture perspective. arXiv preprint arXiv:2202.08455, 2022.
  • [161] S. K. Mishra, J. Robinson, H. Chang, D. Jacobs, A. Sarna, A. Maschinot, and D. Krishnan. A simple, efficient and scalable contrastive masked autoencoder for learning visual representations. ArXiv, abs/2210.16870, 2022.
  • [162] S. Moschoglou, A. Papaioannou, C. Sagonas, J. Deng, I. Kotsia, and S. Zafeiriou. Agedb: The first manually collected, in-the-wild age database. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1997–2005, 2017.
  • [163] D. Muhtar, X.-L. Zhang, P. Xiao, Z. Li, and F. Gu. Cmid: A unified self-supervised learning framework for remote sensing image understanding. IEEE Transactions on Geoscience and Remote Sensing, 2023.
  • [164] D.-K. Nguyen, V. Aggarwal, Y. Li, M. R. Oswald, A. Kirillov, C. G. M. Snoek, and X. Chen. R-mae: Regions meet masked autoencoders. ArXiv, abs/2306.05411, 2023.
  • [165] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008.
  • [166] M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. ArXiv, abs/1603.09246, 2016.
  • [167] A. v. d. Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • [168] H. Pan, C. Liu, W. Wang, L. Yuan, H. Wang, Z. Li, and W. Liu. Img2vec: A teacher of high token-diversity helps masked autoencoders, 2023.
  • [169] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. Librispeech: An asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210, 2015.
  • [170] Y. Pang, W. Wang, F. E. H. Tay, W. Liu, Y. Tian, and L. Yuan. Masked autoencoders for point cloud self-supervised learning. In European Conference on Computer Vision, 2022.
  • [171] Z. Peng, L. Dong, H. Bao, Q. Ye, and F. Wei. Beit v2: Masked image modeling with vector-quantized visual tokenizers. ArXiv, abs/2208.06366, 2022.
  • [172] Z. Peng, L. Dong, H. Bao, Q. Ye, and F. Wei. A unified view of masked image modeling. 2022.
  • [173] Z. Qi, R. Dong, G. Fan, Z. Ge, X. Zhang, K. Ma, and L. Yi. Contrast with reconstruct: Contrastive 3d representation learning guided by generative pretraining. ArXiv, abs/2302.02318, 2023.
  • [174] Z. Qing, S. Zhang, Z. Huang, X. Wang, Y. Wang, Y. Lv, C. Gao, and N. Sang. Mar: Masked autoencoders for efficient action recognition. ArXiv, abs/2207.11660, 2022.
  • [175] H. Quan, X. Li, W. Chen, Q. Bai, M. Zou, R. Yang, T. Zheng, R. Qi, X. Gao, and X. Cui. Global contrast masked autoencoders are powerful pathological representation learners. ArXiv, abs/2205.09048, 2022.
  • [176] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021.
  • [177] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with clip latents. ArXiv, abs/2204.06125, 2022.
  • [178] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever. Zero-shot text-to-image generation. ArXiv, abs/2102.12092, 2021.
  • [179] R. Rao, N. Bhattacharya, N. Thomas, Y. Duan, P. Chen, J. Canny, P. Abbeel, and Y. Song. Evaluating protein transfer learning with tape. Advances in neural information processing systems, 32, 2019.
  • [180] C. Reed, R. Gupta, S. Li, S. Brockman, C. Funk, B. Clipp, S. Candido, M. Uyttendaele, and T. Darrell. Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning. ArXiv, abs/2212.14532, 2022.
  • [181] S. Ren, Z. Wang, H. Zhu, J. Xiao, A. Yuille, and C. Xie. Rejuvenating image-gpt as strong visual representation learners, 2023.
  • [182] S. Ren, F. Wei, Z. Zhang, and H. Hu. Tinymim: An empirical study of distilling mim pre-trained models. 2023.
  • [183] A. Rives, J. Meier, T. Sercu, S. Goyal, Z. Lin, J. Liu, D. Guo, M. Ott, C. L. Zitnick, J. Ma, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 118(15):e2016239118, 2021.
  • [184] A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and B. Schiele. Coherent multi-sentence video description with variable level of detail. In German Conference on Pattern Recognition, 2014.
  • [185] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele. A dataset for movie description. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3202–3212, 2015.
  • [186] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2021.
  • [187] Y. Rong, Y. Bian, T. Xu, W. Xie, Y. Wei, W. Huang, and J. Huang. Self-supervised graph transformer on large-scale molecular data. Advances in Neural Information Processing Systems, 33, 2020.
  • [188] J. Ross, B. Belgodere, V. Chenthamarakshan, I. Padhi, Y. Mroueh, and P. Das. Large-scale chemical language representations capture molecular structure and properties. Nature Machine Intelligence, 4(12):1256–1264, 2022.
  • [189] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and L. Fei-Fei. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115:211–252, 2014.
  • [190] C. K. Ryali, Y.-T. Hu, D. Bolya, C. Wei, H. Fan, P.-Y. Huang, V. Aggarwal, A. Chowdhury, O. Poursaeed, J. Hoffman, J. Malik, Y. Li, and C. Feichtenhofer. Hiera: A hierarchical vision transformer without the bells-and-whistles. In International Conference on Machine Learning, 2023.
  • [191] S. Sameni, S. Jenni, and P. Favaro. Representation learning by detecting incorrect location embeddings. Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
  • [192] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, and J. Jitsev. Laion-5b: An open large-scale dataset for training next generation image-text models. ArXiv, abs/2210.08402, 2022.
  • [193] Z. Shen, X. Sheng, L. Wang, Y. K. Guo, Q. Liu, and X. Zhou. Pointcmp: Contrastive mask prediction for self-supervised learning on point cloud videos. In CVPR, 2023.
  • [194] Y. Shi, N. Siddharth, P. Torr, and A. R. Kosiorek. Adversarial masking for self-supervised learning. In International Conference on Machine Learning, pages 20026–20040. PMLR, 2022.
  • [195] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. K. Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision, 2016.
  • [196] M. Singh, Q. Duval, K. V. Alwala, H. Fan, V. Aggarwal, A. B. Adcock, A. Joulin, P. Dollár, C. Feichtenhofer, R. B. Girshick, R. Girdhar, and I. Misra. The effectiveness of mae pre-pretraining for billion-scale pretraining. ArXiv, abs/2303.13496, 2023.
  • [197] Y. Song, M. Yang, W. Wu, D. He, F. Li, and J. Wang. It takes two: Masked appearance-motion modeling for self-supervised video transformer pre-training. ArXiv, abs/2210.05234, 2022.
  • [198] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. ArXiv, abs/1212.0402, 2012.
  • [199] J. Su, C. Han, Y. Zhou, J. Shan, X. Zhou, and F. Yuan. Saprot: Protein language modeling with structure-aware vocabulary. bioRxiv, 2023.
  • [200] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai. Vl-bert: Pre-training of generic visual-linguistic representations. ArXiv, abs/1908.08530, 2019.
  • [201] Q. Tan, N. Liu, X. Huang, R. Chen, S.-H. Choi, and X. Hu. Mgae: Masked autoencoders for self-supervised learning on graphs. arXiv preprint arXiv:2201.02534, 2022.
  • [202] C. Tao, X. Zhu, G. Huang, Y. Qiao, X. Wang, and J. Dai. Siamese image modeling for self-supervised vision representation learning. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2132–2141, 2022.
  • [203] K. Tian, Y. Jiang, Q. Diao, C. Lin, L. Wang, and Z. Yuan. Designing bert for convolutional networks: Sparse and hierarchical masked modeling. ArXiv, abs/2301.03580, 2023.
  • [204] X. Tian, H. Ran, Y. Wang, and H. Zhao. Geomae: Masked geometric target prediction for self-supervised point cloud pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13570–13580, June 2023.
  • [205] Y. Tian, L. Xie, Z. Wang, L. Wei, X. Zhang, J. Jiao, Y. Wang, Q. Tian, and Q. Ye. Integrally pre-trained transformer pyramid networks. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18610–18620, 2022.
  • [206] Z. Tong, Y. Song, J. Wang, and L. Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. ArXiv, abs/2203.12602, 2022.
  • [207] A. van den Oord, O. Vinyals, and K. Kavukcuoglu. Neural discrete representation learning. ArXiv, abs/1711.00937, 2017.
  • [208] M. van Kempen, S. S. Kim, C. Tumescheit, M. Mirdita, J. Söding, and M. Steinegger. Foldseek: fast and accurate protein structure search. 2022.
  • [209] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.
  • [210] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
  • [211] H. Wang, J. Fan, Y. Wang, K. Song, T. Wang, and Z. Zhang. Droppos: Pre-training vision transformers by reconstructing dropped positions. ArXiv, abs/2309.03576, 2023.
  • [212] H. Wang, K. Song, J. Fan, Y. Wang, J. Xie, and Z. Zhang. Hard patches mining for masked image modeling. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • [213] H. Wang, Y. Tang, Y. Wang, J. Guo, Z. Deng, and K. Han. Masked image modeling with local multi-scale reconstruction. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2122–2131, 2023.
  • [214] K. Wang, B. Zhao, X. Peng, Z. H. Zhu, J. Deng, X. Wang, H. Bilen, and Y. You. Facemae: Privacy-preserving face recognition via masked autoencoders. ArXiv, abs/2205.11090, 2022.
  • [215] L. Wang, B. Huang, Z. Zhao, Z. Tong, Y. He, Y. Wang, Y. Wang, and Y. Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2023.
  • [216] L. Wang, F. Liang, Y. Li, H. Zhang, W. Ouyang, and J. Shao. Repre: Improving self-supervised vision transformer with reconstructive pre-training. In International Joint Conference on Artificial Intelligence, 2022.
  • [217] M. Wang and W. Deng. Oracle-mnist: a realistic image dataset for benchmarking machine learning algorithms. ArXiv, abs/2205.09442, 2022.
  • [218] S. Wang, J. Gao, Z. Li, J. Sun, and W. Hu. A closer look at self-supervised lightweight vision transformers. ArXiv, abs/2205.14443, 2022.
  • [219] S. Wang, Y. Guo, Y. Wang, H. Sun, and J. Huang. Smiles-bert: large scale unsupervised pre-training for molecular property prediction. In Proceedings of the 10th ACM international conference on bioinformatics, computational biology and health informatics, pages 429–436, 2019.
  • [220] W. Wang, H. Bao, L. Dong, J. Bjorck, Z. Peng, Q. Liu, K. Aggarwal, O. Mohammed, S. Singhal, S. Som, and F. Wei. Image as a foreign language: Beit pretraining for all vision and vision-language tasks. ArXiv, abs/2208.10442, 2022.
  • [221] W. Wang, J. Wang, C. Chen, J. Jiao, L. Sun, Y. Cai, S. Song, and J. Li. Fremae: Fourier transform meets masked autoencoders for medical image segmentation. 2023.
  • [222] X. Wang, W. Wang, Y. Cao, C. Shen, and T. Huang. Images speak in images: A generalist painter for in-context visual learning. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6830–6839, 2022.
  • [223] Y. Wang, Z. Pan, X. Li, Z. Cao, K. Xian, and J. Zhang. Less is more: Consistent video depth estimation with masked frames modeling. ArXiv, abs/2208.00380, 2022.
  • [224] C. Wei, H. Fan, S. Xie, C. Wu, A. L. Yuille, and C. Feichtenhofer. Masked feature prediction for self-supervised visual pre-training. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14648–14658, 2021.
  • [225] L. Wei, L. Xie, W. Zhou, H. Li, and Q. Tian. Mvp: Multimodality-guided visual pre-training. ArXiv, abs/2203.05175, 2022.
  • [226] Y. Wei, H. Hu, Z. Xie, Z. Zhang, Y. Cao, J. Bao, D. Chen, and B. Guo. Contrastive learning rivals masked image modeling in fine-tuning via feature distillation. ArXiv, abs/2205.14141, 2022.
  • [227] L. Wen, X. Yang, D. Fu, X. Wang, P. Cai, X. Li, T. Ma, Y. Li, L. Xu, D. Shang, Z. Zhu, S. Sun, Y. Bai, X. Cai, M. Dou, S. Hu, B. Shi, and Y. Qiao. On the road with gpt-4v(ision): Early explorations of visual-language model on autonomous driving, 2023.
  • [228] S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I.-S. Kweon, and S. Xie. Convnext v2: Co-designing and scaling convnets with masked autoencoders. ArXiv, abs/2301.00808, 2023.
  • [229] J. Wu and S. Mo. Object-wise masked autoencoders for fast pre-training. ArXiv, abs/2205.14338, 2022.
  • [230] Q. Wu, T. Yang, Z. Liu, B. Wu, Y. Shan, and A. B. Chan. Dropmae: Masked autoencoders with spatial-attention dropout for tracking tasks. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14561–14571, 2023.
  • [231] Q. Wu, H. Ye, Y. Gu, H. Zhang, L. Wang, and D. He. Denoising masked autoencoders are certifiable robust vision learners. ArXiv, abs/2210.06983, 2022.
  • [232] Z. Wu, Z. Lai, X. Sun, and S. Lin. Extreme masking for learning instance and distributed visual representations. ArXiv, abs/2206.04667, 2022.
  • [233] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin. Unsupervised feature learning via non-parametric instance-level discrimination. ArXiv, abs/1805.01978, 2018.
  • [234] G.-S. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L. Zhang, and X. Lu. Aid: A benchmark data set for performance evaluation of aerial scene classification. IEEE Transactions on Geoscience and Remote Sensing, 55:3965–3981, 2016.
  • [235] J. Xia, C. Zhao, B. Hu, Z. Gao, C. Tan, Y. Liu, S. Li, and S. Z. Li. Mole-bert: Rethinking pre-training graph neural networks for molecules. In The Eleventh International Conference on Learning Representations, 2022.
  • [236] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. ArXiv, abs/1708.07747, 2017.
  • [237] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3485–3492, 2010.
  • [238] Y. Xiao, Z. Tang, P. Wei, C. Liu, and L. Lin. Masked images are counterfactual samples for robust fine-tuning. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20301–20310, 2023.
  • [239] J. Xie, W. Li, X. Zhan, Z. Liu, Y. S. Ong, and C. C. Loy. Masked frequency modeling for self-supervised visual pre-training. ArXiv, abs/2206.07706, 2022.
  • [240] Z. Xie, Z. Geng, J. Hu, Z. Zhang, H. Hu, and Y. Cao. Revealing the dark secrets of masked image modeling. ArXiv, abs/2205.13543, 2022.
  • [241] Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu. Simmim: a simple framework for masked image modeling. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9643–9653, 2021.
  • [242] Z. Xie, Z. Zhang, Y. Cao, Y. Lin, Y. Wei, Q. Dai, and H. Hu. On data scaling in masked image modeling. ArXiv, abs/2206.04664, 2022.
  • [243] H. Xu, S. Ding, X. Zhang, H. Xiong, and Q. Tian. Masked autoencoders are robust data augmentors. ArXiv, abs/2206.04846, 2022.
  • [244] H. Xue, P. Gao, H. Li, Y. J. Qiao, H. Sun, H. Li, and J. Luo. Stare at what you see: Masked image modeling without reconstruction. ArXiv, abs/2211.08887, 2022.
  • [245] H. Yan, Y. Liu, Y. Wei, Z. Li, G. Li, and L. Lin. Skeletonmae: Graph-based masked autoencoder for skeleton sequence pre-training. ArXiv, abs/2307.08476, 2023.
  • [246] W. Yan, Y. Zhang, P. Abbeel, and A. Srinivas. Videogpt: Video generation using vq-vae and transformers. ArXiv, abs/2104.10157, 2021.
  • [247] H. Yang, D. Huang, B. Wen, J. Wu, H. Yao, Y. Jiang, X. Zhu, and Z. Yuan. Self-supervised video representation learning with motion-aware masked autoencoders. 2022.
  • [248] H. Yang, S. Zhang, D. Huang, X. Wu, H. Zhu, T. He, S. Tang, H. Zhao, Q. Qiu, B. Lin, X. He, and W. Ouyang. Unipad: A universal pre-training paradigm for autonomous driving. ArXiv, abs/2310.08370, 2023.
  • [249] K. K. Yang, N. Zanichelli, and H. Yeh. Masked inverse folding with sequence transfer for protein representation learning. bioRxiv, 2022.
  • [250] Q. Yang, W. Li, B. Li, and Y. Yuan. Mrm: Masked relation modeling for medical image pre-training with genetics. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 21452–21462, October 2023.
  • [251] Y. Yang, W. Huang, Y. Wei, H. Peng, X. Jiang, H. Jiang, F. Wei, Y. Wang, H. Hu, L. Qiu, and Y. Yang. Attentive mask clip. 2022.
  • [252] Y. Yao, N. Desai, and M. S. Palaniswami. Moma: Distill from self-supervised teachers. ArXiv, abs/2302.02089, 2023.
  • [253] K. Yi, Y. Ge, X. Li, S. Yang, D. Li, J. Wu, Y. Shan, and X. Qie. Masked image modeling with denoising contrast. ArXiv, abs/2205.09616, 2022.
  • [254] Y. You and Y. Shen. Cross-modality and self-supervised protein embedding for compound–protein affinity and contact prediction. Bioinformatics, 38(Supplement_2):ii68–ii74, 2022.
  • [255] L. Yu, Y. Cheng, K. Sohn, J. Lezama, H. Zhang, H. Chang, A. G. Hauptmann, M.-H. Yang, Y. Hao, I. Essa, and L. Jiang. Magvit: Masked generative video transformer. ArXiv, abs/2212.05199, 2022.
  • [256] L. Yu, Y. Cheng, Z. Wang, V. Kumar, W. Macherey, Y. Huang, D. A. Ross, I. Essa, Y. Bisk, M. Yang, K. P. Murphy, A. G. Hauptmann, and L. Jiang. Spae: Semantic pyramid autoencoder for multimodal generation with frozen llms. ArXiv, abs/2306.17842, 2023.
  • [257] T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn. Gradient surgery for multi-task learning. ArXiv, abs/2001.06782, 2020.
  • [258] X. Yu, L. Tang, Y. Rao, T. Huang, J. Zhou, and J. Lu. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • [259] K. Yue, B.-C. Chen, J. Geiping, H. Li, T. Goldstein, and S.-N. Lim. Object recognition as next token prediction. 2023.
  • [260] J.-T. Zhai, X. Liu, A. D. Bagdanov, K.-C. Li, and M.-M. Cheng. Masked autoencoders are efficient class incremental learners. ArXiv, abs/2308.12510, 2023.
  • [261] S. Zhai, N. Jaitly, J. Ramapuram, D. Busbridge, T. Likhomanenko, J. Y. Cheng, W. A. Talbott, C. Huang, H. Goh, and J. M. Susskind. Position prediction as an effective pretraining strategy. In International Conference on Machine Learning, 2022.
  • [262] X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer. Scaling vision transformers. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1204–1213, 2021.
  • [263] K. Zhang and Z. Shen. i-mae: Are latent representations in masked autoencoders linearly separable? ArXiv, abs/2210.11470, 2022.
  • [264] Q. Zhang, Y. Wang, and Y. Wang. How mask matters: Towards theoretical understandings of masked autoencoders. ArXiv, abs/2210.08344, 2022.
  • [265] R. Zhang, Z. Guo, P. Gao, R. Fang, B. Zhao, D. Wang, Y. J. Qiao, and H. Li. Point-m2ae: Multi-scale masked autoencoders for hierarchical point cloud pre-training. ArXiv, abs/2205.14401, 2022.
  • [266] R. Zhang, L. Wang, Y. J. Qiao, P. Gao, and H. Li. Learning 3d representations from 2d pre-trained models via image-to-point masked autoencoders. In CVPR, 2023.
  • [267] S. Zhang, H. Chen, H. Yang, X. Sun, P. S. Yu, and G. Xu. Graph masked autoencoders with transformers. arXiv preprint arXiv:2202.08391, 2022.
  • [268] S. Zhang, F. Zhu, R. Zhao, and J. Yan. Contextual image masking modeling via synergized contrasting without view augmentation for faster and better visual pretraining. In The Eleventh International Conference on Learning Representations, 2023.
  • [269] X. Zhang, J. Chen, J. Yuan, Q. Chen, J. Wang, X. Wang, S. Han, X. Chen, J. Pi, K. Yao, J. Han, E. Ding, and J. Wang. Cae v2: Context autoencoder with clip target. ArXiv, abs/2211.09799, 2022.
  • [270] X. Zhang, F. Liu, Z. Peng, Z. Guo, F. Wan, X.-W. Ji, and Q. Ye. Integrally migrating pre-trained transformer encoder-decoders for visual object detection. 2022.
  • [271] X. Zhang, Y. Tian, W. Huang, Q. Ye, Q. Dai, L. Xie, and Q. Tian. Hivit: Hierarchical vision transformer meets masked image modeling. ArXiv, abs/2205.14949, 2022.
  • [272] Y. Zhang, K. Gong, K. Zhang, H. Li, Y. J. Qiao, W. Ouyang, and X. Yue. Meta-transformer: A unified framework for multimodal learning. ArXiv, abs/2307.10802, 2023.
  • [273] Z. Zhang, M. Xu, A. Jamasb, V. Chenthamarakshan, A. Lozano, P. Das, and J. Tang. Protein representation learning by geometric structure pretraining. In International Conference on Learning Representations (ICLR), 2023.
  • [274] Z. Zhao, S. Wei, Q. Chen, D. Li, Y. Yang, Y. Peng, and Y. Liu. Masked retraining teacher-student framework for domain adaptive object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19039–19049, October 2023.
  • [275] X. Zheng, X. Ma, and C. Wang. Cim: Constrained intrinsic motivation for sparse-reward continuous control. ArXiv, abs/2211.15205, 2022.
  • [276] A. Zhou, Y. Li, Z. Qin, J. Liu, J. Pan, R. Zhang, R. Zhao, P. Gao, and H. Li. Sparsemae: Sparse training meets masked autoencoders. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16176–16186, October 2023.
  • [277] B. Zhou, A. Khosla, À. Lapedriza, A. Torralba, and A. Oliva. Places: An image database for deep scene understanding. ArXiv, abs/1610.02055, 2016.
  • [278] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ade20k dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5122–5130, 2017.
  • [279] J. Zhou, C. Wei, H. Wang, W. Shen, C. Xie, A. Yuille, and T. Kong. ibot: Image bert pre-training with online tokenizer. International Conference on Learning Representations (ICLR), 2022.
  • [280] L. Zhou, H. Liu, J. Bae, J. He, D. Samaras, and P. Prasanna. Self pre-training with masked autoencoders for medical image analysis. ArXiv, abs/2203.05573, 2022.
  • [281] J. Zhu, X. Ding, Y. Ge, Y. Ge, S. Zhao, H. Zhao, X. Wang, and Y. Shan. Vl-gpt: A generative pre-trained transformer for vision and language understanding and generation. 2023.