CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

Dong, Xiaoyi; Bao, Jianmin; Chen, Dongdong; Zhang, Weiming; Yu, Nenghai; Yuan, Lu; Chen, Dong; Guo, Baining

arXiv:2107.00652v2 (cs)

[Submitted on 1 Jul 2021 (v1), revised 15 Jul 2021 (this version, v2), latest version 9 Jan 2022 (v3)]

Abstract: We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks. A challenging issue in Transformer design is that global self-attention is very expensive to compute, whereas local self-attention often limits the field of interactions of each token. To address this issue, we develop the Cross-Shaped Window self-attention mechanism, which computes self-attention in horizontal and vertical stripes in parallel; together these stripes form a cross-shaped window, and each stripe is obtained by splitting the input feature into stripes of equal width. We provide a detailed mathematical analysis of the effect of the stripe width and vary the stripe width across the layers of the Transformer network, which achieves strong modeling capability while limiting the computation cost. We also introduce Locally-enhanced Positional Encoding (LePE), which handles local positional information better than existing encoding schemes. LePE naturally supports arbitrary input resolutions and is thus especially effective and friendly for downstream tasks. Incorporating these designs and a hierarchical structure, CSWin Transformer demonstrates competitive performance on common vision tasks. Specifically, it achieves 85.4% Top-1 accuracy on ImageNet-1K without any extra training data or labels, 53.9 box AP and 46.4 mask AP on the COCO detection task, and 51.7 mIoU on the ADE20K semantic segmentation task, surpassing the previous state-of-the-art Swin Transformer backbone by +1.2, +2.0, +1.4, and +2.0 respectively under similar FLOPs settings. By further pretraining on the larger dataset ImageNet-21K, we achieve 87.5% Top-1 accuracy on ImageNet-1K and state-of-the-art segmentation performance on ADE20K with 55.7 mIoU. The code and models will be available at this https URL.
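
The stripe partition described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it only enumerates which token positions fall into the horizontal and vertical stripe of a given token, and it performs no attention computation. The function name `cross_shaped_window`, its parameters, and the requirement that the stripe width divide the feature-map size evenly are assumptions made for illustration.

```python
def cross_shaped_window(row, col, height, width, stripe_width):
    """Return (horizontal, vertical, cross) sets of (row, col) positions.

    Hypothetical helper: the feature map is split into stripes of equal
    width, and the cross-shaped window of a token is taken to be the union
    of the horizontal and vertical stripes that contain it.
    """
    # Assumption: equal-width stripes require an exact division.
    if height % stripe_width or width % stripe_width:
        raise ValueError("stripe_width must divide both height and width")

    # First row / first column of the stripes that contain the token.
    r0 = (row // stripe_width) * stripe_width
    c0 = (col // stripe_width) * stripe_width

    # Horizontal stripe: stripe_width rows, spanning the full width.
    horizontal = {(r, c)
                  for r in range(r0, r0 + stripe_width)
                  for c in range(width)}
    # Vertical stripe: stripe_width columns, spanning the full height.
    vertical = {(r, c)
                for r in range(height)
                for c in range(c0, c0 + stripe_width)}

    return horizontal, vertical, horizontal | vertical


if __name__ == "__main__":
    h, v, cross = cross_shaped_window(row=3, col=5, height=8, width=8,
                                      stripe_width=2)
    print(len(h), len(v), len(cross))
```

On an 8 x 8 feature map with stripe width 2, each stripe covers 16 positions and the two stripes overlap in a 2 x 2 block, so the cross-shaped window covers 28 of the 64 positions. Widening the stripe enlarges the region a token can attend to at a higher cost, which is the trade-off the abstract refers to when it varies the stripe width across layers.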

Comments:	The code repo is available at this https URL; SOTA performance on the ADE20K segmentation benchmark is updated
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2107.00652 [cs.CV]
	(or arXiv:2107.00652v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2107.00652

Submission history

From: Dongdong Chen
[v1] Thu, 1 Jul 2021 17:59:56 UTC (506 KB)
[v2] Thu, 15 Jul 2021 17:59:49 UTC (506 KB)
[v3] Sun, 9 Jan 2022 05:49:30 UTC (269 KB)
