rebenchmark for PySlowFast comparison (open-mmlab#28)

* add resolution for K400
* rebenchmark on 32G V100
* add new i3d config and i3d benchmark
* minor
* minor

Co-authored-by: linjintao <[email protected]>
sibozhang · Jul 19, 2020 · b96aab9
1 parent 3aefca8
Showing 3 changed files with 137 additions and 11 deletions.
diff --git a/README.md b/README.md
@@ -65,12 +65,13 @@ We compare with other popular codebases and the [results](https://mmaction2.read
 | Model | MMAction2 (s/iter) | MMAction (s/iter) | Temporal-Shift-Module (s/iter) | PySlowFast (s/iter) |
 | :--- | :---------------: | :--------------------: | :----------------------------: | :-----------------: |
 | [TSN](/configs/recognition/tsn/tsn_r50_1x1x3_100e_kinetics400_rgb.py) | **0.29** | 0.36 | 0.45 | x |
-| [I3D (setting1)](/configs/recognition/i3d/i3d_r50_32x2x1_100e_kinetics400_rgb.py) | **0.45** | 0.58 | x | x |
-| [I3D (setting2)](/configs/recognition/i3d/i3d_r50_8x8x1_100e_kinetics400_rgb.py) | **0.32** | x | x | 0.56 |
+| [I3D (video)](/configs/recognition/i3d/i3d_r50_video_8x8x1_100e_kinetics400_rgb.py) | **0.31** | x | x | 0.59 |
+| [I3D (rawframe)](/configs/recognition/i3d/i3d_r50_32x2x1_100e_kinetics400_rgb.py) | **0.45** | 0.58 | x | x |
 | [TSM](/configs/recognition/tsm/tsm_r50_1x1x8_50e_kinetics400_rgb.py) | **0.30** | x | 0.38 | x |
-| [Slowonly](/configs/recognition/slowonly/slowonly_r50_4x16x1_256e_kinetics400_rgb.py) | **0.30** | x | x | 1.03 |
-| [Slowfast](/configs/recognition/slowfast/slowfast_r50_4x16x1_256e_kinetics400_rgb.py) | **0.80** | x | x | 1.40 |
-| [R(2+1)D](/configs/recognition/r2plus1d/r2plus1d_r34_8x8x1_180e_kinetics400_rgb.py) | **0.48** | x | x | x |
+| [Slowonly](/configs/recognition/slowonly/slowonly_r50_video_4x16x1_256e_kinetics400_rgb.py) | **0.27** | x | x | 0.89 |
+| [Slowfast](/configs/recognition/slowfast/slowfast_r50_video_4x16x1_256e_kinetics400_rgb.py) | **0.68** | x | x | 1.07 |
+| [R(2+1)D](/configs/recognition/r2plus1d/r2plus1d_r34_video_8x8x1_180e_kinetics400_rgb.py) | **0.45** | x | x | x |
+
 
 Supported methods for action recognition:
 - [x] [TSN](configs/recognition/tsn/README.md)
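
Reviewer note: the relative speed implied by the updated table can be checked with a few lines of Python. This is only a sketch. The s/iter values are copied from the new table rows that have a PySlowFast entry, and the helper name `speedup` is hypothetical, not part of either codebase.

```python
# s/iter values copied from the updated benchmark table:
# (MMAction2, PySlowFast). Rows without a PySlowFast entry are omitted.
timings = {
    "I3D (video)": (0.31, 0.59),
    "Slowonly": (0.27, 0.89),
    "Slowfast": (0.68, 1.07),
}


def speedup(mmaction2_s_per_iter, pyslowfast_s_per_iter):
    """Ratio of iteration times; values above 1 mean MMAction2 is faster."""
    return pyslowfast_s_per_iter / mmaction2_s_per_iter


for name, (ours, theirs) in timings.items():
    print(f"{name}: {speedup(ours, theirs):.2f}x")
```

The ratio only compares seconds per training iteration as reported in the table; it says nothing about accuracy or total time to convergence.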

diff --git a/configs/recognition/i3d/i3d_r50_video_8x8x1_100e_kinetics400_rgb.py b/configs/recognition/i3d/i3d_r50_video_8x8x1_100e_kinetics400_rgb.py
@@ -0,0 +1,125 @@
+# model settings
+model = dict(
+    type='Recognizer3D',
+    backbone=dict(
+        type='ResNet3d',
+        pretrained2d=True,
+        pretrained='torchvision://resnet50',
+        depth=50,
+        conv_cfg=dict(type='Conv3d'),
+        norm_eval=False,
+        inflate=((1, 1, 1), (1, 0, 1, 0), (1, 0, 1, 0, 1, 0), (0, 1, 0)),
+        zero_init_residual=False),
+    cls_head=dict(
+        type='I3DHead',
+        num_classes=400,
+        in_channels=2048,
+        spatial_type='avg',
+        dropout_ratio=0.5,
+        init_std=0.01))
+# model training and testing settings
+train_cfg = None
+test_cfg = dict(average_clips=None)
+# dataset settings
+dataset_type = 'VideoDataset'
+data_root = 'data/kinetics400/videos_train'
+data_root_val = 'data/kinetics400/videos_val'
+ann_file_train = 'data/kinetics400/kinetics400_train_list_videos.txt'
+ann_file_val = 'data/kinetics400/kinetics400_val_list_videos.txt'
+ann_file_test = 'data/kinetics400/kinetics400_val_list_videos.txt'
+img_norm_cfg = dict(
+    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_bgr=False)
+train_pipeline = [
+    dict(type='DecordInit'),
+    dict(type='SampleFrames', clip_len=8, frame_interval=8, num_clips=1),
+    dict(type='DecordDecode'),
+    dict(type='Resize', scale=(-1, 256)),
+    dict(
+        type='MultiScaleCrop',
+        input_size=224,
+        scales=(1, 0.8),
+        random_crop=False,
+        max_wh_scale_gap=0),
+    dict(type='Resize', scale=(224, 224), keep_ratio=False),
+    dict(type='Flip', flip_ratio=0.5),
+    dict(type='Normalize', **img_norm_cfg),
+    dict(type='FormatShape', input_format='NCTHW'),
+    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
+    dict(type='ToTensor', keys=['imgs', 'label'])
+]
+val_pipeline = [
+    dict(type='DecordInit'),
+    dict(
+        type='SampleFrames',
+        clip_len=8,
+        frame_interval=8,
+        num_clips=1,
+        test_mode=True),
+    dict(type='DecordDecode'),
+    dict(type='Resize', scale=(-1, 256)),
+    dict(type='CenterCrop', crop_size=224),
+    dict(type='Flip', flip_ratio=0),
+    dict(type='Normalize', **img_norm_cfg),
+    dict(type='FormatShape', input_format='NCTHW'),
+    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
+    dict(type='ToTensor', keys=['imgs'])
+]
+test_pipeline = [
+    dict(type='DecordInit'),
+    dict(
+        type='SampleFrames',
+        clip_len=8,
+        frame_interval=8,
+        num_clips=10,
+        test_mode=True),
+    dict(type='DecordDecode'),
+    dict(type='Resize', scale=(-1, 256)),
+    dict(type='ThreeCrop', crop_size=256),
+    dict(type='Flip', flip_ratio=0),
+    dict(type='Normalize', **img_norm_cfg),
+    dict(type='FormatShape', input_format='NCTHW'),
+    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
+    dict(type='ToTensor', keys=['imgs'])
+]
+data = dict(
+    videos_per_gpu=8,
+    workers_per_gpu=4,
+    train=dict(
+        type=dataset_type,
+        ann_file=ann_file_train,
+        data_prefix=data_root,
+        pipeline=train_pipeline),
+    val=dict(
+        type=dataset_type,
+        ann_file=ann_file_val,
+        data_prefix=data_root_val,
+        pipeline=val_pipeline),
+    test=dict(
+        type=dataset_type,
+        ann_file=ann_file_val,
+        data_prefix=data_root_val,
+        pipeline=test_pipeline))
+# optimizer
+optimizer = dict(
+    type='SGD', lr=0.01, momentum=0.9,
+    weight_decay=0.0001)  # this lr is used for 8 gpus
+optimizer_config = dict(grad_clip=dict(max_norm=40, norm_type=2))
+# learning policy
+lr_config = dict(policy='step', step=[40, 80])
+total_epochs = 100
+checkpoint_config = dict(interval=5)
+evaluation = dict(
+    interval=5, metrics=['top_k_accuracy', 'mean_class_accuracy'], topk=(1, 5))
+log_config = dict(
+    interval=20,
+    hooks=[
+        dict(type='TextLoggerHook'),
+        # dict(type='TensorboardLoggerHook'),
+    ])
+# runtime settings
+dist_params = dict(backend='nccl')
+log_level = 'INFO'
+work_dir = './work_dirs/i3d_r50_video_8x8x1_100e_kinetics400_rgb/'
+load_from = None
+resume_from = None
+workflow = [('train', 1)]
diff --git a/docs/benchmark.md b/docs/benchmark.md
@@ -8,7 +8,7 @@ Here we compare our MMAction2 repo with other video understanding toolboxes in t
 by the training time per iteration. Here, we use
 - commit id [7f3490d](https://github.com/open-mmlab/mmaction/tree/7f3490d3db6a67fe7b87bfef238b757403b670e3)(1/5/2020) of MMAction
 - commit id [8d53d6f](https://github.com/mit-han-lab/temporal-shift-module/tree/8d53d6fda40bea2f1b37a6095279c4b454d672bd)(5/5/2020) of Temporal-Shift-Module
-- commit id [133e40f](https://github.com/facebookresearch/SlowFast/tree/133e40f8349ce37b0e6168639da0811a413579c8)(30/5/2020) of PySlowFast
+- commit id [8299c98](https://github.com/facebookresearch/SlowFast/tree/8299c9862f83a067fa7114ce98120ae1568a83ec)(7/7/2020) of PySlowFast
 - commit id [f13707f](https://github.com/wzmsltw/BSN-boundary-sensitive-network/tree/f13707fbc362486e93178c39f9c4d398afe2cb2f)(12/12/2018) of BSN(boundary sensitive network)
 - commit id [45d0514](https://github.com/JJBOY/BMN-Boundary-Matching-Network/tree/45d05146822b85ca672b65f3d030509583d0135a)(17/10/2019) of BMN(boundary matching network)
 
@@ -24,12 +24,12 @@ The training speed is measured in s/iter. The lower, the better.
 | Model | MMAction2 (s/iter) | MMAction (s/iter) | Temporal-Shift-Module (s/iter) | PySlowFast (s/iter) |
 | :--- | :---------------: | :--------------------: | :----------------------------: | :-----------------: |
 | [TSN](/configs/recognition/tsn/tsn_r50_1x1x3_100e_kinetics400_rgb.py) | **0.29** | 0.36 | 0.45 | x |
-| [I3D (setting1)](/configs/recognition/i3d/i3d_r50_32x2x1_100e_kinetics400_rgb.py) | **0.45** | 0.58 | x | x |
-| [I3D (setting2)](/configs/recognition/i3d/i3d_r50_8x8x1_100e_kinetics400_rgb.py) | **0.32** | x | x | 0.56 |
+| [I3D (video)](/configs/recognition/i3d/i3d_r50_video_8x8x1_100e_kinetics400_rgb.py) | **0.31** | x | x | 0.59 |
+| [I3D (rawframe)](/configs/recognition/i3d/i3d_r50_32x2x1_100e_kinetics400_rgb.py) | **0.45** | 0.58 | x | x |
 | [TSM](/configs/recognition/tsm/tsm_r50_1x1x8_50e_kinetics400_rgb.py) | **0.30** | x | 0.38 | x |
-| [Slowonly](/configs/recognition/slowonly/slowonly_r50_4x16x1_256e_kinetics400_rgb.py) | **0.30** | x | x | 1.03 |
-| [Slowfast](/configs/recognition/slowfast/slowfast_r50_4x16x1_256e_kinetics400_rgb.py) | **0.80** | x | x | 1.40 |
-| [R(2+1)D](/configs/recognition/r2plus1d/r2plus1d_r34_8x8x1_180e_kinetics400_rgb.py) | **0.48** | x | x | x |
+| [Slowonly](/configs/recognition/slowonly/slowonly_r50_video_4x16x1_256e_kinetics400_rgb.py) | **0.27** | x | x | 0.89 |
+| [Slowfast](/configs/recognition/slowfast/slowfast_r50_video_4x16x1_256e_kinetics400_rgb.py) | **0.68** | x | x | 1.07 |
+| [R(2+1)D](/configs/recognition/r2plus1d/r2plus1d_r34_video_8x8x1_180e_kinetics400_rgb.py) | **0.45** | x | x | x |
 
 ## Localizers