RegionViT: Regional-to-Local Attention for Vision Transformers

Chen, Chun-Fu; Panda, Rameswar; Fan, Quanfu

arXiv:2106.02689 (cs)

[Submitted on 4 Jun 2021 (v1), last revised 31 Mar 2022 (this version, v3)]

Abstract: Vision transformer (ViT) has recently shown strong capability in achieving results comparable to convolutional neural networks (CNNs) on image classification. However, vanilla ViT simply inherits its architecture directly from natural language processing, which is often not optimized for vision applications. Motivated by this, we propose a new architecture that adopts a pyramid structure and employs a novel regional-to-local attention rather than global self-attention in vision transformers. More specifically, our model first generates regional tokens and local tokens from an image with different patch sizes, where each regional token is associated with a set of local tokens based on spatial location. The regional-to-local attention includes two steps: first, the regional self-attention extracts global information among all regional tokens; then, the local self-attention exchanges information between one regional token and its associated local tokens. Therefore, even though local self-attention confines its scope to a local region, it can still receive global information. Extensive experiments on four vision tasks, including image classification, object and keypoint detection, semantic segmentation, and action recognition, show that our approach outperforms or is on par with state-of-the-art ViT variants, including many concurrent works. Our source code and models are available at this https URL.
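
The two-step attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single attention head, identity query/key/value projections, and no residual connections, normalization, or positional information, and the function names `self_attention` and `regional_to_local_attention` are hypothetical. It only shows how information could flow from all regional tokens into each local group.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    """Single-head self-attention over the second-to-last axis.

    Assumption: identity Q/K/V projections, used here only to keep
    the sketch short; a real model would learn these projections.
    """
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens


def regional_to_local_attention(regional, local):
    """One regional-to-local step (illustrative only).

    regional: (R, D) array, one token per region.
    local:    (R, M, D) array, M local tokens per region.
    """
    # Step 1: regional self-attention among all R regional tokens,
    # so each regional token gathers global information.
    regional = self_attention(regional)  # (R, D)

    # Step 2: local self-attention within each region, over the
    # regional token together with its M associated local tokens.
    groups = np.concatenate([regional[:, None, :], local], axis=1)  # (R, 1+M, D)
    groups = self_attention(groups)

    # Split back into updated regional and local tokens.
    return groups[:, 0, :], groups[:, 1:, :]


rng = np.random.default_rng(0)
regional = rng.normal(size=(4, 8))    # 4 regions, 8-dim tokens (arbitrary sizes)
local = rng.normal(size=(4, 49, 8))   # 49 local tokens per region (arbitrary)
new_regional, new_local = regional_to_local_attention(regional, local)
print(new_regional.shape, new_local.shape)
```

In this sketch, the local tokens of one region never attend directly to tokens of another region; any dependence on other regions arrives only through the regional token updated in step 1.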

Comments:	add more results and link to codes and models. this https URL, formatted with ICLR style
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2106.02689 [cs.CV]
	(or arXiv:2106.02689v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2106.02689

Submission history

From: Chun-Fu (Richard) Chen
[v1] Fri, 4 Jun 2021 19:57:11 UTC (129 KB)
[v2] Thu, 16 Dec 2021 22:16:46 UTC (213 KB)
[v3] Thu, 31 Mar 2022 03:20:15 UTC (178 KB)
