Unofficial PyTorch Implementation of Exploring Plain Vision Transformer Backbones for Object Detection
Results | Updates | Usage | Todo | Acknowledge
This branch contains the unofficial PyTorch implementation of Exploring Plain Vision Transformer Backbones for Object Detection. Thanks to the authors for their wonderful work!
The models are trained on 4 A100 machines (8 GPUs each) with 2 images per GPU, giving a total batch size of 64 during training.
Model | Pretrain | Machine | Framework | Box mAP | Mask mAP | config | log | weight |
---|---|---|---|---|---|---|---|---|
ViT-Base | IN1K+MAE | TPU | Mask RCNN | 51.1 | 45.5 | config | log | OneDrive |
ViT-Base | IN1K+MAE | GPU | Mask RCNN | 51.1 | 45.4 | config | log | OneDrive |
ViTAE-Base | IN1K+MAE | GPU | Mask RCNN | 51.6 | 45.8 | config | log | OneDrive |
ViTAE-Small | IN1K+Sup | GPU | Mask RCNN | 45.6 | 40.1 | config | log | OneDrive |
[2022-04-18] Explore using small ImageNet-1K supervised models (20M parameters) for ViTDet (45.6 mAP). For comparison, the results with multi-stage backbones are 46.0 mAP for Swin-T and 47.8 mAP for ViTAEv2-S with Mask RCNN on COCO.
[2022-04-17] Release the pretrained weights and logs for ViT-B and ViTAE-B on MS COCO. The models are trained entirely with PyTorch on GPUs.
[2022-04-16] Release the initial unofficial implementation of ViTDet with the ViT-Base model! It obtains 51.1 box mAP and 45.5 mask mAP on COCO detection and instance segmentation, respectively. The weights and logs will be uploaded soon.
Applications of the ViTAE Transformer include: image classification | object detection | semantic segmentation | animal pose estimation | remote sensing | matting
We use PyTorch 1.9.0 or NGC docker 21.06, and mmcv 1.3.9 for the experiments.
git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
git checkout v1.3.9
MMCV_WITH_OPS=1 pip install -e .
cd ..
git clone https://github.com/ViTAE-Transformer/ViTDet.git
cd ViTDet
pip install -v -e .
After installing the two repos, install timm and einops, i.e.,
pip install timm==0.4.9 einops
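If everything installed correctly, a quick sanity check (a minimal sketch, nothing repo-specific) is to import the packages and print their versions:

```python
# Quick sanity check that the environment described above is importable.
# The versions in the comments are the ones this README recommends.
import torch
import mmcv
import mmdet
import timm
import einops

print("torch:", torch.__version__)    # 1.9.0 (or the NGC 21.06 build)
print("mmcv:", mmcv.__version__)      # 1.3.9
print("mmdet:", mmdet.__version__)
print("timm:", timm.__version__)      # 0.4.9
print("einops:", einops.__version__)
print("cuda available:", torch.cuda.is_available())
```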
Download the pretrained models from MAE or ViTAE, and then launch the experiments with:
# for single machine
bash tools/dist_train.sh <Config PATH> <NUM GPUs> --cfg-options model.pretrained=<Pretrained PATH>
# for multiple machines
python -m torch.distributed.launch --nnodes <Num Machines> --node_rank <Rank of Machine> --nproc_per_node <GPUs Per Machine> --master_addr <Master Addr> --master_port <Master Port> tools/train.py <Config PATH> --cfg-options model.pretrained=<Pretrained PATH> --launcher pytorch
This repo currently contains modifications including:
- using LN for the convolutions in RPN and heads
- using large scale jitter for augmentation (a config sketch is given after this list)
- using relative position embeddings (RPE) from MViT
- using a longer training schedule and 1024×1024 test size
- using global attention layers
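As a rough illustration of the large scale jitter item above, the sketch below shows how such a pipeline is typically written in mmdetection 2.x config style; it mirrors mmdetection's standard LSJ recipe and is not copied from this repo's actual configs, which may differ in details.

```python
# Illustrative large scale jitter (LSJ) training pipeline, mmdetection 2.x style.
# Values are a sketch of the common LSJ recipe, not this repo's exact config.
image_size = (1024, 1024)
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
    # Resize by a random factor in [0.1, 2.0] of the 1024x1024 target size
    dict(type='Resize',
         img_scale=image_size,
         ratio_range=(0.1, 2.0),
         multiscale_mode='range',
         keep_ratio=True),
    # Take a fixed 1024x1024 crop from the jittered image
    dict(type='RandomCrop',
         crop_type='absolute_range',
         crop_size=image_size,
         recompute_bbox=True,
         allow_negative_crop=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    # Pad small images back up to the fixed training size
    dict(type='Pad', size=image_size),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks']),
]
```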
There are other things to do:
- Implement the conv blocks for global information communication
- Tune the models for Cascade RCNN
- Train ViT models on the LVIS dataset
- Train ViTAE models with the ViTDet framework
We acknowledge the excellent implementations from mmdetection, MAE, MViT, and BEiT. If this repo is helpful, please consider citing the ViTDet paper:
@article{Li2022ExploringPV,
title={Exploring Plain Vision Transformer Backbones for Object Detection},
author={Yanghao Li and Hanzi Mao and Ross B. Girshick and Kaiming He},
journal={ArXiv},
year={2022},
volume={abs/2203.16527}
}
For ViTAE and ViTAEv2, please refer to:
@article{xu2021vitae,
title={ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias},
author={Xu, Yufei and Zhang, Qiming and Zhang, Jing and Tao, Dacheng},
journal={Advances in Neural Information Processing Systems},
volume={34},
year={2021}
}
@article{zhang2022vitaev2,
title={ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond},
author={Zhang, Qiming and Xu, Yufei and Zhang, Jing and Tao, Dacheng},
journal={arXiv preprint arXiv:2202.10108},
year={2022}
}