Faster Swin-Transformer

Faster Swin-Transformer contains an optimized implementation of the Swin-Transformer, a state-of-the-art vision transformer presented in Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. The abstract of the paper is the following:

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.

Our implementation is aligned with the official PyTorch implementation (Swin-Transformer GitHub).

Table of Contents

  • Swin-Transformer Computation Flow
  • Demo
  • Performance

Swin-Transformer Computation Flow

Fig. 1 Flowchart of FP16/FP32 Swin-Transformer.
Fig. 2 Flowchart of INT8 Swin-Transformer (with fused MHA and int8-mode=1).

Demo

In this demo, you can run Faster Swin-Transformer as a C++ program.

Requirements

  • CMake >= 3.13 for PyTorch
  • CUDA 11.0 or newer version
  • NCCL 2.10 or newer version
  • Python 3 is recommended because some features are not supported in Python 2
  • PyTorch: verified on 1.10.0; >= 1.5.0 should work.

We recommend using the image nvcr.io/nvidia/pytorch:22.09-py3.

Setup

  1. Start the docker container, making sure to mount the project directory into it. For example:
    docker run \
        -it \
        --shm-size 5g \
        --rm \
        --gpus=all \
        -v {YOUR_FASTER_TRANSFORMER_PROJECT_DIR_ON_HOST}:/workspace/FasterTransformer \
        --workdir /workspace/FasterTransformer \
        nvcr.io/nvidia/pytorch:22.09-py3 bash
    export WORKSPACE=/workspace/FasterTransformer

Here we use nvcr.io/nvidia/pytorch:22.09-py3. You can switch to another CUDA-enabled PyTorch container, as long as it meets the requirements above.

  2. Build FasterTransformer with C++:
    cd $WORKSPACE
    git submodule update --init
    mkdir -p build
    cd build
    cmake -DSM=xx -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_TRT=ON ..
    make -j12

Note: xx is the compute capability of your GPU. For example, 60 (P40), 61 (P4), 70 (V100), 75 (T4), or 80 (A100).
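If you are not sure which value to pass to -DSM, you can query the compute capability of the visible GPU from PyTorch inside the container. A minimal sketch using the standard torch.cuda API:

```python
# Print the compute capability of GPU 0 in the form expected by -DSM=xx.
# Assumes this is run inside the container above, where PyTorch can see the GPU.
import torch

major, minor = torch.cuda.get_device_capability(0)  # e.g. (8, 0) on A100
print(f"-DSM={major}{minor}")                        # e.g. -DSM=80
```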

Run

Run Swin-Transformer on C++

First, run ./bin/swin_gemm to search for the best GEMM configuration, then run ./bin/swin_example or ./bin/swin_int8_example. A small helper that assembles these command lines is sketched after the examples below.
For FP32/FP16 swin v1 & swin v2

# data_type=0 indicates FP32, data_type=1 indicates FP16
# model_type={0,1,2,3}
#   0: tiny
#   1: small
#   2: base
#   3: large
./bin/swin_gemm <batch_size> <image_width> <window_width> <head_number of the first block> <size_per_head> <data_type> <is_use_int8> 
# ./bin/swin_example version(1/2) data_type(0/1, fp32/fp16) model_type(0-3) window_size(7/8/12/16/24) img_size(192/224/256/384) batch_size
./bin/swin_example <version[1-2]> <data_type> <model_type[0-3]> <window_size> <img_size> <batch_size>

For INT8 swin v1 & swin v2

# For INT8 cases, set `is_use_int8=1`.
# `data_type` can be set to either 0 or 1; it currently has no effect in INT8 mode.
# model_type={0,1,2,3}
#   0: tiny
#   1: small
#   2: base
#   3: large
./bin/swin_gemm <batch_size> <image_width> <window_width> <head_number of the first block> <size_per_head> <data_type> <is_use_int8>
# ./bin/swin_int8_example version(1/2) int8_mode(1/2) model_type(0-3) window_size(7/8/12/16/24) img_size(192/224/256/384) batch_size
./bin/swin_int8_example <version[1-2]> <int8_mode> <model_type[0-3]> <window_size> <img_size> <batch_size>

Take swin-TINY with batch=32 as an example:

# Run Swin-Transformer(TINY) version 2 under FP32 on C++:
./bin/swin_gemm 32 256 8 3 32 0 0
./bin/swin_example 2 0 0 8 256 32 

# Run Swin-Transformer(TINY) version 2 under FP16 on C++
./bin/swin_gemm 32 256 8 3 32 1 0
./bin/swin_example 2 1 0 8 256 32 

# Run Swin-Transformer(TINY) version 2 under INT8 on C++
./bin/swin_gemm 32 256 8 3 32 0 1
./bin/swin_int8_example 2 1 0 8 256 32
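The GEMM-search and example binaries take positional arguments that depend on the model variant. The sketch below is a small, purely illustrative helper (not part of the repository) that assembles the two command lines from a model name; the first-block head counts follow the standard Swin configurations (3 for tiny/small, 4 for base, 6 for large), and size_per_head is 32 for all variants.

```python
# Hypothetical helper: assemble the ./bin/swin_gemm and example command lines.
# The head counts below follow the standard Swin configurations; adjust if your model differs.
MODEL_TYPE = {"tiny": 0, "small": 1, "base": 2, "large": 3}
FIRST_BLOCK_HEADS = {"tiny": 3, "small": 3, "base": 4, "large": 6}

def swin_commands(model="tiny", version=2, img_size=256, window_size=8,
                  batch_size=32, data_type=1, is_use_int8=0, int8_mode=1):
    heads = FIRST_BLOCK_HEADS[model]
    gemm = (f"./bin/swin_gemm {batch_size} {img_size} {window_size} "
            f"{heads} 32 {data_type} {is_use_int8}")
    if is_use_int8:
        run = (f"./bin/swin_int8_example {version} {int8_mode} {MODEL_TYPE[model]} "
               f"{window_size} {img_size} {batch_size}")
    else:
        run = (f"./bin/swin_example {version} {data_type} {MODEL_TYPE[model]} "
               f"{window_size} {img_size} {batch_size}")
    return gemm, run

# Reproduces the FP16 Swin-V2 TINY example above.
for cmd in swin_commands("tiny", version=2, data_type=1):
    print(cmd)
```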

Run with PyTorch op

Download checkpoint

cd $WORKSPACE/examples/pytorch/swin/Swin-Transformer-Quantization
wget https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_tiny_patch4_window7_224.pth
wget https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_tiny_patch4_window8_256.pth
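As a quick sanity check that the downloads are intact, you can open the .pth files with torch.load and look at the stored weights; a minimal sketch (the official Swin releases wrap the state dict in a "model" key):

```python
# Inspect the downloaded checkpoints: count tensors and parameters.
import torch

for ckpt_path in ["swin_tiny_patch4_window7_224.pth",
                  "swinv2_tiny_patch4_window8_256.pth"]:
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state = ckpt["model"] if "model" in ckpt else ckpt  # official releases use a "model" key
    n_params = sum(v.numel() for v in state.values() if torch.is_tensor(v))
    print(f"{ckpt_path}: {len(state)} tensors, {n_params / 1e6:.1f}M parameters")
```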

Run FP16/FP32 pytorch op of swin v1 and swin v2

cd $WORKSPACE/examples/pytorch/swin
pip install timm==0.4.12
pip install termcolor==1.1.0

#test with random input
bash -x run_test_v1.sh <batch_size> ##profile of FP16/FP32 model with version 1
bash -x run_test_v2.sh <batch_size> ##profile of FP16/FP32 model with version 2

#test with ImageNet dataset
bash -x run_test_fp32_accuracy.sh <path_to_imagenet_dataset>
bash -x run_test_fp16_accuracy.sh <path_to_imagenet_dataset>

Run INT8 pytorch op of swin v1

  1. Get calibrated checkpoint

    Refer to the Guide of Swin-Transformer Quantization Toolkit for installing dependencies and setting up datasets.

cd $WORKSPACE/examples/pytorch/swin/Swin-Transformer-Quantization
python -m torch.distributed.launch --nproc_per_node 1 \
  --master_port 12345 main.py \
  --calib \
  --cfg SwinTransformer/configs/swin/swin_tiny_patch4_window7_224.yaml \
  --resume swin_tiny_patch4_window7_224.pth \
  --data-path <imagenet-path> \
  --num-calib-batch 10 \
  --calib-batchsz 8 \
  --int8-mode 1 \
  --version 1 \
  --calib-output-path calib-checkpoint

NOTE: If you only want to use PTQ instead of QAT: --int8-mode 1 suffices when calibrating the TINY/SMALL/BASE models. When calibrating the LARGE model, you have to specify --int8-mode 2 instead of --int8-mode 1. The reason is that Swin-L is much harder to quantize, and more quantization nodes have to be disabled to obtain satisfactory PTQ accuracy.

(When int8_mode=1, all GEMMs are INT8-in, INT8-out. When int8_mode=2, the GEMMs of all fc2 layers and patchMerge are relaxed to INT8-in, INT32-out, while the other GEMMs keep INT8 I/O. A sketch of the difference is shown after these notes.)

If you insist on using --int8-mode 1 for the LARGE model (because mode 1 is much faster), we recommend using QAT to finetune the parameters of the LARGE checkpoint.

--int8-mode 2 is only developed for swin-v1-LARGE. For swin-v2, the CUDA implementation only provides --int8-mode 1, and setting --int8-mode 2 does not change the behavior.
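The difference between the two modes lies only in how the INT32 accumulator of the GEMM is handled. The numpy sketch below is purely illustrative (it is not the CUDA kernel, and the per-tensor scales are toy values): mode 1 requantizes the accumulator back to INT8 with an output scale, while mode 2 dequantizes the accumulator directly, which preserves more precision for the hard-to-quantize fc2 and patchMerge GEMMs.

```python
# Illustrative numpy sketch of INT8-in/INT8-out vs INT8-in/INT32-out GEMM.
# Toy per-tensor scales; real kernels fuse these steps on the GPU.
import numpy as np

def quantize(x, scale):
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

a_fp = np.random.randn(4, 8).astype(np.float32)    # activations
w_fp = np.random.randn(8, 16).astype(np.float32)   # weights
a_scale, w_scale, out_scale = 0.05, 0.02, 0.1

# Both modes share the same INT8 x INT8 -> INT32 accumulation.
acc_int32 = quantize(a_fp, a_scale).astype(np.int32) @ quantize(w_fp, w_scale).astype(np.int32)

# int8_mode=1: INT8-in, INT8-out -- requantize the accumulator to INT8.
y_int8 = np.clip(np.round(acc_int32 * (a_scale * w_scale) / out_scale), -128, 127).astype(np.int8)

# int8_mode=2 (fc2 / patchMerge GEMMs): INT8-in, INT32-out -- dequantize the accumulator directly.
y_fp = acc_int32.astype(np.float32) * (a_scale * w_scale)

print(y_int8.dtype, y_fp.dtype)  # int8 vs float32 outputs of the same accumulation
```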

  2. Run test
cd $WORKSPACE/examples/pytorch/swin
pip install timm==0.4.12
pip install termcolor==1.1.0

bash -x run_test_v1_int8.sh <batch_size> ##profile of swin-v1 INT8 model
bash -x run_test_v1_int8_accuracy.sh <batch_size> ##test accuracy of swin-v1 INT8 model

bash -x run_test_v2_int8.sh <batch_size> ##profile of swin-v2 INT8 model
bash -x run_test_v2_int8_accuracy.sh <batch_size> ##test accuracy of swin-v2 INT8 model

Note: When testing PTQ accuracy for INT8 swin-v1-LARGE, we have to specify --int8-mode 2 instead of --int8-mode 1 in run_test_int8.sh.

However, if you have finetuned a Swin-LARGE with QAT and --int8-mode 1, then you can also run inference with --int8-mode 1. In short, the int8 mode used for inference must be consistent with the one used for calibration or QAT.

  3. Accuracy loss of INT8 Post-Training Quantization (PTQ) compared with FP16

swin-v1

| name | pretrain | resolution | acc@1 | acc@5 | acc@1 (INT8 PTQ) | acc@5 (INT8 PTQ) |
|------|----------|------------|-------|-------|------------------|------------------|
| Swin-T | ImageNet-1K | 224x224 | 81.182 | 95.522 | 80.748 (-0.434) | 95.328 (-0.194) |
| Swin-S | ImageNet-1K | 224x224 | 83.214 | 96.242 | 82.904 (-0.288) | 96.196 (-0.036) |
| Swin-B | ImageNet-1K | 224x224 | 83.424 | 96.446 | 83.116 (-0.354) | 96.322 (-0.144) |
| Swin-B | ImageNet-1K | 384x384 | 84.474 | 96.956 | 84.034 (-0.440) | 96.782 (-0.174) |
| Swin-L | ImageNet-22K | 224x224 | 86.246 | 97.876 | 85.892 (-0.354) | 97.822 (-0.054) |
| Swin-L | ImageNet-22K | 384x384 | 87.246 | 98.248 | 86.916 (-0.330) | 98.174 (-0.074) |

swin-v2

| name | pretrain | resolution | window | acc@1 | acc@1 (INT8 PTQ) |
|------|----------|------------|--------|-------|------------------|
| SwinV2-T | ImageNet-1K | 256x256 | 8x8 | 81.8 | 81.15 (-0.65) |
| SwinV2-S | ImageNet-1K | 256x256 | 8x8 | 83.7 | 83.48 (-0.22) |
| SwinV2-B | ImageNet-1K | 256x256 | 8x8 | 84.2 | 84.17 (-0.03) |
| SwinV2-T | ImageNet-1K | 256x256 | 16x16 | 82.8 | 82.22 (-0.58) |
| SwinV2-S | ImageNet-1K | 256x256 | 16x16 | 84.1 | 83.90 (-0.30) |
| SwinV2-B | ImageNet-1K | 256x256 | 16x16 | 84.6 | 84.53 (-0.07) |
| SwinV2-B | ImageNet-22K | 256x256 | 16x16 | 86.2 | 85.61 (-0.59) |
| SwinV2-B | ImageNet-22K | 384x384 | 24x24 | 87.1 | 86.46 (-0.64) |
| SwinV2-L | ImageNet-22K | 256x256 | 16x16 | 86.9 | 86.35 (-0.55) |
| SwinV2-L | ImageNet-22K | 384x384 | 24x24 | 87.6 | 86.90 (-0.70) |

Run with TensorRT plugin

FP16/FP32 TensorRT plugin of swin v1 and swin v2

cd $WORKSPACE/examples/tensorrt/swin

#build FP32/FP16 trt engines
sh run_builder_fp32_v1.sh # build FP32 engine of swin v1
sh run_builder_fp32_v2.sh # build FP32 engine of swin v2

sh run_builder_fp16_v1.sh # build FP16 engine of swin v1
sh run_builder_fp16_v2.sh # build FP16 engine of swin v2

#infer
sh run_infer_fp32.sh <version> <batch_size>
sh run_infer_fp16.sh <version> <batch_size>

INT8 TensorRT plugin of swin v1 and swin v2

cd $WORKSPACE/examples/tensorrt/swin

#INT8 engine build
sh run_builder_int8_v1.sh # build engine of swin v1
sh run_builder_int8_v2.sh # build engine of swin v2

#infer
sh run_infer_int8.sh <version> <batch_size>
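Once an engine has been serialized by the builder scripts, it can also be loaded directly from Python with the standard TensorRT runtime API. The sketch below is only an outline: the plugin library path and engine file name are assumptions that depend on your build tree and the run_builder_*.sh script you used, and it assumes an engine built with static shapes. The custom plugin library must be loaded before deserialization so that the Swin plugin is registered.

```python
# Hedged sketch: deserialize a Swin TensorRT engine and run one inference.
# Plugin .so path and engine file name below are assumptions -- adjust to your build.
import ctypes
import tensorrt as trt
import torch

ctypes.CDLL("../../../build/lib/libswinTransformer_plugin.so")  # assumed path; registers the plugin
logger = trt.Logger(trt.Logger.WARNING)
trt.init_libnvinfer_plugins(logger, "")

with open("swin_fp16.engine", "rb") as f, trt.Runtime(logger) as runtime:  # assumed engine name
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()
buffers = []
for i in range(engine.num_bindings):  # one torch buffer per input/output binding
    shape = tuple(engine.get_binding_shape(i))
    dtype = torch.half if engine.get_binding_dtype(i) == trt.DataType.HALF else torch.float32
    buffers.append(torch.empty(shape, dtype=dtype, device="cuda"))

buffers[0].normal_()                                      # random image batch in the input binding
context.execute_v2([buf.data_ptr() for buf in buffers])   # synchronous execution
torch.cuda.synchronize()
print([tuple(b.shape) for b in buffers])
```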

Performance

Hardware settings:

  • T4 (with mclk 5000MHz, pclk 1590MHz) with Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
  • A100 (with mclk 1215MHz, pclk 1410MHz) with Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz

Software settings:

  • CUDA 11.4

Here we compare the performance of the original Swin-Transformer and FT Swin-Transformer on T4, A10 and A100, using Swin-TINY as an example. The hyper-parameters of the model are:

  • head_num = {3,6,12,24}
  • size_per_head = 32
  • num_of_blocks = {2,2,6,2}

Swin Performance on T4

Here, torch.jit.trace means using tracing to convert the PyTorch model to a TorchScript model and then profiling its performance.
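For reference, the sketch below shows how such a torch.jit.trace baseline can be measured with timm and CUDA events; the model name, batch size and iteration counts are illustrative, and the numbers in the tables below come from the benchmark scripts in this repository, not from this snippet.

```python
# Minimal sketch of the torch.jit.trace baseline: trace a timm Swin-Tiny model
# and time its forward pass with CUDA events.
import timm
import torch

model = timm.create_model("swin_tiny_patch4_window7_224", pretrained=False)
model = model.half().cuda().eval()
images = torch.randn(32, 3, 224, 224, dtype=torch.half, device="cuda")

with torch.no_grad():
    traced = torch.jit.trace(model, images)   # convert to TorchScript via tracing
    for _ in range(10):                       # warmup
        traced(images)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(100):
        traced(images)
    end.record()
    torch.cuda.synchronize()
    print(f"average latency: {start.elapsed_time(end) / 100:.2f} ms")
```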

Swin-V1 FP32

| Batch_size | torch.jit.trace (ms) | cpp (ms) | speedup | trt plugin (ms) | speedup | torch op (ms) | speedup |
|------------|----------------------|----------|---------|-----------------|---------|---------------|---------|
| 1 | 11.70 | 4.49 | 2.61 | 4.64 | 2.52 | 4.68 | 2.50 |
| 8 | 36.89 | 27.39 | 1.35 | 29.23 | 1.26 | 29.04 | 1.27 |
| 16 | 72.26 | 52.55 | 1.38 | 57.32 | 1.26 | 56.90 | 1.27 |
| 32 | 140.80 | 102.15 | 1.38 | 111.31 | 1.27 | 111.33 | 1.26 |

Swin-V1 FP16

| Batch_size | torch.jit.trace (ms) | cpp (ms) | speedup | trt plugin (ms) | speedup | torch op (ms) | speedup |
|------------|----------------------|----------|---------|-----------------|---------|---------------|---------|
| 1 | 10.49 | 1.43 | 7.34 | 1.59 | 6.60 | 2.12 | 4.95 |
| 8 | 18.73 | 6.60 | 2.84 | 7.47 | 2.51 | 7.03 | 2.66 |
| 16 | 37.60 | 13.12 | 2.87 | 14.84 | 2.53 | 13.87 | 2.71 |
| 32 | 74.85 | 25.25 | 2.96 | 28.74 | 2.60 | 27.64 | 2.71 |

Swin-V1 INT8

| Batch_size | cpp (ms) | speedup (vs FP16) | trt plugin (ms) | speedup (vs FP16) | torch op (ms) | speedup (vs FP16) |
|------------|----------|-------------------|-----------------|-------------------|---------------|-------------------|
| 1 | 1.18 | 1.21 | 1.40 | 1.14 | 1.59 | 1.33 |
| 8 | 4.26 | 1.55 | 5.26 | 1.42 | 4.87 | 1.44 |
| 16 | 8.57 | 1.53 | 10.58 | 1.40 | 9.80 | 1.42 |
| 32 | 16.81 | 1.50 | 20.48 | 1.40 | 19.36 | 1.43 |

INT8 vs. FP16 speedup on Swin-V1 TINY/SMALL/BASE/LARGE:

| Batch_size | TINY FP16 (ms) | TINY INT8 (ms) | Speedup | SMALL FP16 (ms) | SMALL INT8 (ms) | Speedup | BASE FP16 (ms) | BASE INT8 (ms) | Speedup | LARGE FP16 (ms) | LARGE INT8 (ms) | Speedup |
|------------|----------------|----------------|---------|-----------------|-----------------|---------|----------------|----------------|---------|-----------------|-----------------|---------|
| 1 | 1.43 | 1.18 | 1.21 | 2.90 | 2.15 | 1.34 | 3.41 | 2.60 | 1.31 | 5.98 | 3.67 | 1.63 |
| 8 | 6.60 | 4.26 | 1.55 | 11.02 | 6.63 | 1.66 | 16.62 | 9.78 | 1.70 | 30.81 | 16.95 | 1.82 |
| 16 | 13.12 | 8.57 | 1.53 | 21.85 | 13.59 | 1.61 | 31.03 | 19.12 | 1.62 | 59.76 | 33.86 | 1.76 |
| 32 | 25.25 | 16.81 | 1.50 | 42.41 | 26.87 | 1.58 | 60.99 | 38.21 | 1.60 | 115.07 | 70.23 | 1.64 |

Swin-V2 FP32

| Batch_size | torch.jit.trace (ms) | cpp (ms) | speedup | trt plugin (ms) | speedup | torch op (ms) | speedup |
|------------|----------------------|----------|---------|-----------------|---------|---------------|---------|
| 1 | 12.33 | 5.47 | 2.25 | 5.69 | 2.17 | 5.70 | 2.16 |
| 8 | 49.57 | 37.17 | 1.33 | 39.01 | 1.27 | 38.73 | 1.28 |
| 16 | 97.15 | 72.03 | 1.35 | 75.50 | 1.29 | 75.60 | 1.29 |
| 32 | 193.09 | 141.31 | 1.37 | 148.65 | 1.30 | 150.63 | 1.28 |

Swin-V2 FP16

| Batch_size | torch.jit.trace (ms) | cpp (ms) | speedup | trt plugin (ms) | speedup | torch op (ms) | speedup |
|------------|----------------------|----------|---------|-----------------|---------|---------------|---------|
| 1 | 16.38 | 1.66 | 9.86 | 1.86 | 8.80 | 2.09 | 7.84 |
| 8 | 24.51 | 9.10 | 2.69 | 10.20 | 2.40 | 9.60 | 2.55 |
| 16 | 48.93 | 17.64 | 2.77 | 19.75 | 2.48 | 19.08 | 2.56 |
| 32 | 98.46 | 34.67 | 2.84 | 39.58 | 2.48 | 38.76 | 2.54 |

Swin-V2 INT8

| Batch_size | cpp (ms) | speedup (vs FP16) | trt plugin (ms) | speedup (vs FP16) | torch op (ms) | speedup (vs FP16) |
|------------|----------|-------------------|-----------------|-------------------|---------------|-------------------|
| 1 | 1.32 | 1.25 | 1.52 | 1.22 | 1.61 | 1.30 |
| 8 | 5.81 | 1.57 | 7.12 | 1.43 | 6.61 | 1.45 |
| 16 | 11.84 | 1.49 | 14.35 | 1.38 | 13.36 | 1.43 |
| 32 | 23.59 | 1.47 | 28.23 | 1.40 | 26.52 | 1.46 |

INT8 vs. FP16 speedup of Swin-V2:

| Model | window_size | input_size | FP16 (ms) | INT8 (ms) | Speedup |
|-------|-------------|------------|-----------|-----------|---------|
| TINY | 8 | 256 | 34.67 | 23.59 | 1.47 |
| SMALL | 8 | 256 | 57.81 | 38.44 | 1.50 |
| BASE | 8 | 256 | 82.88 | 54.88 | 1.51 |
| TINY | 16 | 256 | 37.84 | 26.58 | 1.42 |
| SMALL | 16 | 256 | 63.65 | 43.11 | 1.48 |
| BASE | 16 | 256 | 90.40 | 60.82 | 1.49 |
| BASE | 24 | 384 | - | 596.62 | - |
| LARGE | 16 | 256 | 169.22 | 108.34 | 1.56 |
| LARGE | 24 | 384 | - | - | - |

Swin Performance on A10

Swin-V1 TF32

On GPUs with the Ampere architecture (such as A30 and A100), you can set export NVIDIA_TF32_OVERRIDE=1 to force the program to run with TF32; otherwise FP32 GEMM is used by default, which is much slower.

| Batch_size | torch.jit.trace (ms) | cpp (ms) | speedup | trt plugin (ms) | speedup | torch op (ms) | speedup |
|------------|----------------------|----------|---------|-----------------|---------|---------------|---------|
| 1 | 4.97 | 1.72 | 2.89 | 1.69 | 2.94 | 1.76 | 2.82 |
| 8 | 13.20 | 9.17 | 1.44 | 9.21 | 1.43 | 9.30 | 1.42 |
| 16 | 25.80 | 17.64 | 1.46 | 17.72 | 1.45 | 17.80 | 1.45 |
| 32 | 50.41 | 33.90 | 1.49 | 34.06 | 1.48 | 34.30 | 1.47 |

Swin-V1 FP16

| Batch_size | torch.jit.trace (ms) | cpp (ms) | speedup | trt plugin (ms) | speedup | torch op (ms) | speedup |
|------------|----------------------|----------|---------|-----------------|---------|---------------|---------|
| 1 | 5.04 | 0.92 | 5.48 | 0.97 | 5.20 | 1.39 | 3.63 |
| 8 | 7.53 | 3.20 | 2.24 | 3.58 | 2.10 | 3.28 | 2.30 |
| 16 | 14.80 | 6.10 | 2.43 | 6.73 | 2.20 | 6.40 | 2.31 |
| 32 | 29.27 | 11.89 | 2.46 | 12.99 | 2.25 | 12.31 | 2.38 |

Swin-V1 INT8

| Batch_size | cpp (ms) | speedup (vs FP16) | trt plugin (ms) | speedup (vs FP16) | torch op (ms) | speedup (vs FP16) |
|------------|----------|-------------------|-----------------|-------------------|---------------|-------------------|
| 1 | 1.01 | 0.91 | 1.22 | 0.80 | 1.02 | 1.36 |
| 8 | 2.15 | 1.49 | 2.62 | 1.36 | 2.36 | 1.39 |
| 16 | 3.96 | 1.54 | 4.74 | 1.42 | 4.37 | 1.46 |
| 32 | 7.40 | 1.61 | 9.09 | 1.43 | 8.49 | 1.45 |

INT8 vs. FP16 speedup on Swin-V1 TINY/SMALL/BASE/LARGE:

| Batch_size | TINY FP16 (ms) | TINY INT8 (ms) | Speedup | SMALL FP16 (ms) | SMALL INT8 (ms) | Speedup | BASE FP16 (ms) | BASE INT8 (ms) | Speedup | LARGE FP16 (ms) | LARGE INT8 (ms) | Speedup |
|------------|----------------|----------------|---------|-----------------|-----------------|---------|----------------|----------------|---------|-----------------|-----------------|---------|
| 1 | 0.91 | 1.01 | 0.91 | 1.67 | 1.90 | 0.88 | 1.86 | 2.17 | 0.86 | 2.93 | 2.74 | 1.07 |
| 8 | 3.20 | 2.15 | 1.49 | 5.38 | 3.57 | 1.51 | 7.54 | 4.62 | 1.63 | 14.56 | 7.89 | 1.85 |
| 16 | 6.10 | 3.96 | 1.54 | 10.32 | 6.41 | 1.61 | 15.10 | 8.73 | 1.73 | 27.94 | 15.25 | 1.83 |
| 32 | 11.89 | 7.40 | 1.61 | 19.90 | 11.99 | 1.66 | 28.60 | 16.94 | 1.69 | 52.56 | 29.82 | 1.76 |
| 64 | 23.20 | 15.28 | 1.52 | 39.33 | 24.98 | 1.57 | 56.91 | 35.59 | 1.60 | 106.81 | 64.29 | 1.66 |

Swin-V2 TF32

On GPUs with the Ampere architecture (such as A30 and A100), you can set export NVIDIA_TF32_OVERRIDE=1 to force the program to run with TF32; otherwise FP32 GEMM is used by default, which is much slower.

| Batch_size | torch.jit.trace (ms) | cpp (ms) | speedup | trt plugin (ms) | speedup | torch op (ms) | speedup |
|------------|----------------------|----------|---------|-----------------|---------|---------------|---------|
| 1 | 6.35 | 1.94 | 3.27 | 1.93 | 3.29 | 1.96 | 3.24 |
| 8 | 16.50 | 11.20 | 1.47 | 11.34 | 1.46 | 11.26 | 1.47 |
| 16 | 32.66 | 22.13 | 1.48 | 22.34 | 1.46 | 22.43 | 1.46 |
| 32 | 63.60 | 43.35 | 1.47 | 43.89 | 1.45 | 44.00 | 1.45 |

Swin-V2 FP16

| Batch_size | torch.jit.trace (ms) | cpp (ms) | speedup | trt plugin (ms) | speedup | torch op (ms) | speedup |
|------------|----------------------|----------|---------|-----------------|---------|---------------|---------|
| 1 | 6.14 | 1.00 | 6.14 | 1.10 | 5.58 | 1.38 | 4.45 |
| 8 | 10.83 | 4.17 | 2.60 | 4.65 | 2.33 | 4.42 | 2.45 |
| 16 | 20.88 | 8.06 | 2.59 | 8.86 | 2.36 | 8.34 | 2.50 |
| 32 | 40.93 | 15.99 | 2.56 | 17.48 | 2.34 | 16.04 | 2.55 |

Swin-V2 INT8

| Batch_size | cpp (ms) | speedup (vs FP16) | trt plugin (ms) | speedup (vs FP16) | torch op (ms) | speedup (vs FP16) |
|------------|----------|-------------------|-----------------|-------------------|---------------|-------------------|
| 1 | 1.10 | 0.91 | 1.28 | 0.86 | 1.10 | 1.25 |
| 8 | 2.76 | 1.51 | 3.38 | 1.38 | 3.04 | 1.45 |
| 16 | 5.29 | 1.52 | 6.34 | 1.40 | 5.75 | 1.45 |
| 32 | 10.42 | 1.54 | 12.17 | 1.44 | 11.37 | 1.41 |

INT8 vs. FP16 speedup of Swin-V2:

| Model | window_size | input_size | FP16 (ms) | INT8 (ms) | Speedup |
|-------|-------------|------------|-----------|-----------|---------|
| TINY | 8 | 256 | 15.99 | 10.42 | 1.54 |
| SMALL | 8 | 256 | 26.84 | 16.65 | 1.61 |
| BASE | 8 | 256 | 37.59 | 23.29 | 1.61 |
| TINY | 16 | 256 | 17.46 | 11.85 | 1.54 |
| SMALL | 16 | 256 | 29.34 | 19.01 | 1.54 |
| BASE | 16 | 256 | 40.84 | 26.31 | 1.55 |
| BASE | 24 | 384 | - | 225.99 | - |
| LARGE | 16 | 256 | 74.66 | 45.90 | 1.63 |
| LARGE | 24 | 384 | - | - | - |

Swin Performance on A100

Here, torch.jit.trace means using tracing to convert the PyTorch model to a TorchScript model and then profiling its performance.

Swin-V1 TF32

On GPUs with the Ampere architecture (such as A30 and A100), you can set export NVIDIA_TF32_OVERRIDE=1 to force the program to run with TF32; otherwise FP32 GEMM is used by default, which is much slower.

| Batch_size | torch.jit.trace (ms) | cpp (ms) | speedup | trt plugin (ms) | speedup | torch op (ms) | speedup |
|------------|----------------------|----------|---------|-----------------|---------|---------------|---------|
| 1 | 4.90 | 1.59 | 3.08 | 1.73 | 2.83 | 1.77 | 2.77 |
| 8 | 7.79 | 5.52 | 1.41 | 5.57 | 1.40 | 5.53 | 1.41 |
| 16 | 13.40 | 9.81 | 1.37 | 9.47 | 1.41 | 9.69 | 1.38 |
| 32 | 25.25 | 18.23 | 1.39 | 18.25 | 1.38 | 19.09 | 1.32 |

Swin-V1 FP16

| Batch_size | torch.jit.trace (ms) | cpp (ms) | speedup | trt plugin (ms) | speedup | torch op (ms) | speedup |
|------------|----------------------|----------|---------|-----------------|---------|---------------|---------|
| 1 | 5.01 | 1.23 | 4.07 | 1.47 | 3.40 | 1.65 | 3.04 |
| 8 | 5.47 | 1.94 | 2.81 | 2.79 | 1.96 | 1.92 | 2.85 |
| 16 | 8.97 | 3.17 | 2.83 | 3.11 | 2.88 | 2.87 | 3.13 |
| 32 | 16.62 | 5.46 | 3.04 | 6.24 | 2.66 | 5.73 | 2.90 |

Swin-V1 INT8

| Batch_size | cpp (ms) | speedup (vs FP16) | trt plugin (ms) | speedup (vs FP16) | torch op (ms) | speedup (vs FP16) |
|------------|----------|-------------------|-----------------|-------------------|---------------|-------------------|
| 1 | 1.07 | 1.14 | 1.45 | 1.01 | 1.20 | 1.38 |
| 8 | 1.62 | 1.20 | 1.88 | 1.48 | 1.85 | 1.04 |
| 16 | 2.63 | 1.21 | 2.80 | 1.11 | 2.61 | 1.10 |
| 32 | 4.46 | 1.23 | 5.05 | 1.24 | 3.88 | 1.48 |

INT8 vs. FP16 speedup on Swin-V1 TINY/SMALL/BASE/LARGE:

| Batch_size | TINY FP16 (ms) | TINY INT8 (ms) | Speedup | SMALL FP16 (ms) | SMALL INT8 (ms) | Speedup | BASE FP16 (ms) | BASE INT8 (ms) | Speedup | LARGE FP16 (ms) | LARGE INT8 (ms) | Speedup |
|------------|----------------|----------------|---------|-----------------|-----------------|---------|----------------|----------------|---------|-----------------|-----------------|---------|
| 1 | 1.23 | 1.07 | 1.14 | 1.88 | 2.06 | 0.91 | 2.00 | 3.74 | 0.53 | 2.17 | 2.96 | 0.73 |
| 8 | 1.94 | 1.62 | 1.20 | 3.28 | 3.21 | 1.02 | 3.98 | 3.96 | 1.01 | 6.15 | 4.58 | 1.34 |
| 16 | 3.17 | 2.63 | 1.21 | 5.40 | 4.39 | 1.23 | 7.30 | 5.33 | 1.37 | 11.76 | 8.11 | 1.45 |
| 32 | 5.46 | 4.46 | 1.23 | 8.83 | 6.56 | 1.34 | 11.07 | 9.08 | 1.22 | 20.88 | 16.28 | 1.28 |
| 64 | 10.35 | 8.17 | 1.27 | 17.15 | 13.22 | 1.29 | 21.93 | 16.39 | 1.34 | 41.78 | 29.29 | 1.43 |

Swin-V2 TF32

On GPUs with the Ampere architecture (such as A30 and A100), you can set export NVIDIA_TF32_OVERRIDE=1 to force the program to run with TF32; otherwise FP32 GEMM is used by default, which is much slower.

| Batch_size | torch.jit.trace (ms) | cpp (ms) | speedup | trt plugin (ms) | speedup | torch op (ms) | speedup |
|------------|----------------------|----------|---------|-----------------|---------|---------------|---------|
| 1 | 6.38 | 1.74 | 3.67 | 1.48 | 4.31 | 1.78 | 3.58 |
| 8 | 8.26 | 7.02 | 1.18 | 4.96 | 1.66 | 5.57 | 1.48 |
| 16 | 15.24 | 10.78 | 1.41 | 8.85 | 1.72 | 10.42 | 1.46 |
| 32 | 28.30 | 19.72 | 1.44 | 16.84 | 1.68 | 19.43 | 1.46 |

Swin-V2 FP16

| Batch_size | torch.jit.trace (ms) | cpp (ms) | speedup | trt plugin (ms) | speedup | torch op (ms) | speedup |
|------------|----------------------|----------|---------|-----------------|---------|---------------|---------|
| 1 | 6.33 | 1.05 | 6.03 | 1.34 | 4.72 | 1.35 | 4.69 |
| 8 | 7.64 | 2.40 | 3.18 | 2.67 | 2.86 | 2.40 | 3.18 |
| 16 | 12.91 | 3.43 | 3.76 | 4.36 | 2.96 | 4.12 | 3.13 |
| 32 | 24.84 | 7.29 | 3.40 | 8.46 | 2.94 | 7.94 | 3.12 |

Swin-V2 INT8

| Batch_size | cpp (ms) | speedup (vs FP16) | trt plugin (ms) | speedup (vs FP16) | torch op (ms) | speedup (vs FP16) |
|------------|----------|-------------------|-----------------|-------------------|---------------|-------------------|
| 1 | 1.15 | 0.91 | 1.55 | 0.86 | 1.21 | 1.11 |
| 8 | 1.95 | 1.23 | 2.54 | 1.05 | 2.27 | 1.06 |
| 16 | 2.98 | 1.15 | 3.81 | 1.14 | 3.41 | 1.21 |
| 32 | 5.17 | 1.41 | 6.65 | 1.27 | 6.22 | 1.28 |

INT8 vs. FP16 speedup of Swin-V2:

| Model | window_size | input_size | FP16 (ms) | INT8 (ms) | Speedup |
|-------|-------------|------------|-----------|-----------|---------|
| TINY | 8 | 256 | 7.29 | 5.17 | 1.41 |
| SMALL | 8 | 256 | 11.55 | 9.63 | 1.20 |
| BASE | 8 | 256 | 16.78 | 12.74 | 1.31 |
| TINY | 16 | 256 | 8.64 | 7.52 | 1.15 |
| SMALL | 16 | 256 | 13.86 | 11.66 | 1.19 |
| BASE | 16 | 256 | 19.27 | 14.97 | 1.29 |
| BASE | 24 | 384 | - | 135.38 | - |
| LARGE | 16 | 256 | 31.37 | 24.20 | 1.30 |
| LARGE | 24 | 384 | - | - | - |