
Zero mAP without mask #11

Open
dereyly opened this issue May 17, 2019 · 11 comments

@dereyly

dereyly commented May 17, 2019

Hello.
Thank you for the nice work. I am trying to use non-local nets (GCNet) in practice.
This config is DCN + GCNet r4 + scale augmentation, without mask -- Cascade Faster R-CNN.
mAP = 0
Reading the log, it is strange that acc = 97.6621 from beginning to end -- maybe the model has collapsed to the trivial solution of always predicting background.

# model settings
model = dict(
    type='CascadeRCNN',
    num_stages=3,
    pretrained='modelzoo://resnet50',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        style='pytorch',
        ct=dict(
            insert_pos='after_1x1',
            ratio=1./4.,
        ),
        stage_with_ct=(False, True, True, True),
        dcn=dict(
            modulated=False,
            groups=32,
            deformable_groups=1,
            fallback_on_stride=False),
        stage_with_dcn=(False, True, True, True),
        normalize=dict(type='SyncBN', frozen=False),
        norm_eval=False,
    ),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        num_outs=5),
    rpn_head=dict(
        type='RPNHead',
        in_channels=256,
        feat_channels=256,
        anchor_scales=[8],
        anchor_ratios=[0.5, 1.0, 2.0],
        anchor_strides=[4, 8, 16, 32, 64],
        target_means=[.0, .0, .0, .0],
        target_stds=[1.0, 1.0, 1.0, 1.0],
        use_sigmoid_cls=True),
    bbox_roi_extractor=dict(
        type='SingleRoIExtractor',
        roi_layer=dict(type='RoIAlign', out_size=7, sample_num=2),
        out_channels=256,
        featmap_strides=[4, 8, 16, 32]),
    bbox_head=[
        dict(
            type='SharedFCBBoxHead',
            num_fcs=2,
            in_channels=256,
            fc_out_channels=1024,
            roi_feat_size=7,
            num_classes=81,
            target_means=[0., 0., 0., 0.],
            target_stds=[0.1, 0.1, 0.2, 0.2],
            reg_class_agnostic=True),
        dict(
            type='SharedFCBBoxHead',
            num_fcs=2,
            in_channels=256,
            fc_out_channels=1024,
            roi_feat_size=7,
            num_classes=81,
            target_means=[0., 0., 0., 0.],
            target_stds=[0.05, 0.05, 0.1, 0.1],
            reg_class_agnostic=True),
        dict(
            type='SharedFCBBoxHead',
            num_fcs=2,
            in_channels=256,
            fc_out_channels=1024,
            roi_feat_size=7,
            num_classes=81,
            target_means=[0., 0., 0., 0.],
            target_stds=[0.033, 0.033, 0.067, 0.067],
            reg_class_agnostic=True)
    ])
# model training and testing settings
train_cfg = dict(
    rpn=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.7,
            neg_iou_thr=0.3,
            min_pos_iou=0.3,
            ignore_iof_thr=-1),
        sampler=dict(
            type='RandomSampler',
            num=256,
            pos_fraction=0.5,
            neg_pos_ub=-1,
            add_gt_as_proposals=False),
        allowed_border=0,
        pos_weight=-1,
        smoothl1_beta=1 / 9.0,
        debug=False),
    rcnn=[
        dict(
            assigner=dict(
                type='MaxIoUAssigner',
                pos_iou_thr=0.5,
                neg_iou_thr=0.5,
                min_pos_iou=0.5,
                ignore_iof_thr=-1),
            sampler=dict(
                type='RandomSampler',
                num=512,
                pos_fraction=0.25,
                neg_pos_ub=-1,
                add_gt_as_proposals=True),
            pos_weight=-1,
            debug=False),
        dict(
            assigner=dict(
                type='MaxIoUAssigner',
                pos_iou_thr=0.6,
                neg_iou_thr=0.6,
                min_pos_iou=0.6,
                ignore_iof_thr=-1),
            sampler=dict(
                type='RandomSampler',
                num=512,
                pos_fraction=0.25,
                neg_pos_ub=-1,
                add_gt_as_proposals=True),
            pos_weight=-1,
            debug=False),
        dict(
            assigner=dict(
                type='MaxIoUAssigner',
                pos_iou_thr=0.7,
                neg_iou_thr=0.7,
                min_pos_iou=0.7,
                ignore_iof_thr=-1),
            sampler=dict(
                type='RandomSampler',
                num=512,
                pos_fraction=0.25,
                neg_pos_ub=-1,
                add_gt_as_proposals=True),
            pos_weight=-1,
            debug=False)
    ],
    stage_loss_weights=[1, 0.5, 0.25])
test_cfg = dict(
    rpn=dict(
        nms_across_levels=False,
        nms_pre=2000,
        nms_post=2000,
        max_num=2000,
        nms_thr=0.7,
        min_bbox_size=0),
    rcnn=dict(
        score_thr=0.05, nms=dict(type='nms', iou_thr=0.5), max_per_img=100),
    keep_all_stages=False)
# dataset settings
dataset_type = 'CocoDataset'
data_root = 'data/COCO/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
data = dict(
    imgs_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_train2017.json',
        img_prefix=data_root + 'train2017/',
        img_scale=[(1600, 400), (1600, 1400)],
        img_norm_cfg=img_norm_cfg,
        size_divisor=32,
        flip_ratio=0.5,
        with_mask=True,
        with_crowd=True,
        with_label=True),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/',
        img_scale=(1333, 800),
        img_norm_cfg=img_norm_cfg,
        size_divisor=32,
        flip_ratio=0,
        with_mask=True,
        with_crowd=True,
        with_label=True),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/',
        img_scale=(1333, 800),
        img_norm_cfg=img_norm_cfg,
        size_divisor=32,
        flip_ratio=0,
        with_mask=False,
        with_label=False,
        test_mode=True))
# optimizer
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
# learning policy
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=1.0 / 3,
    step=[8, 11])
checkpoint_config = dict(interval=1)
# yapf:disable
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook'),
        # dict(type='TensorboardLoggerHook')
    ])
# yapf:enable
# runtime settings
total_epochs = 12
dist_params = dict(backend='nccl')
log_level = 'INFO'
work_dir = '/media/HD2/nsergievskiy/models/cascde_gcnet_r50'
load_from = None
resume_from = None
workflow = [('train', 1)]

20190514_202710.log

@rainofmine

@dereyly I met the same problem. Looking at your training log, the loss becomes very large at around 500 iterations, and at the end loss_rpn_cls is still about 0.3, which means the model is not training correctly. My case is similar. Do you know the reason?

@xvjiarui
Owner

xvjiarui commented Jul 3, 2019

Sorry for the late reply.

As mentioned in #1 (comment), SyncBN has some stability issues. You are advised to train the SyncBN model with a total batch size of at least 16.

Also, in my opinion this is not the trivial solution. I suggest training a model without GC first to see whether the mAP still drops.
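
To run the ablation suggested above, one option is to keep the posted backbone config but switch the context blocks off. This is a sketch based on the field names in this issue's config, not a verified diff from the repo:

# Hypothetical GC ablation: same backbone as the config above, with the
# context blocks disabled so that only DCN remains active.
backbone=dict(
    type='ResNet',
    depth=50,
    num_stages=4,
    out_indices=(0, 1, 2, 3),
    frozen_stages=1,
    style='pytorch',
    # ct=dict(...) removed; no context block is built
    stage_with_ct=(False, False, False, False),
    dcn=dict(
        modulated=False,
        groups=32,
        deformable_groups=1,
        fallback_on_stride=False),
    stage_with_dcn=(False, True, True, True),
    normalize=dict(type='SyncBN', frozen=False),
    norm_eval=False,
),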

@FishLikeApple

Hi, thanks for your work. I'm dealing with this issue too; can GN be used to replace SyncBN?

@xvjiarui
Owner

xvjiarui commented Aug 5, 2019

> Hi, thanks for your work. I'm dealing with this issue too; can GN be used to replace SyncBN?

Hi, we haven't tried it yet. We may try it in the future; you are welcome to try it yourself.
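
For reference, swapping the norm layer in the posted config would look roughly like this. It is a sketch: whether this codebase's normalize dict accepts type='GN', and the num_groups key, are assumptions here, with 32 groups being the GN paper's default:

# Hypothetical GN variant of the backbone's norm setting; not verified
# against this repo. 32 groups is the default from the GN paper.
normalize=dict(type='GN', num_groups=32, frozen=False),
norm_eval=False,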

@FishLikeApple

Because I only have access to one GPU (P40), I can't use SyncBN. I have tried GN with a batch size of 1 and 1024×1024 image input, and the default normalization setting with a batch size of 6 and 512×512 image input (the small input is due to memory limitations). The latter worked fine, but the first gave no bbox predictions at all. By printing intermediate values, I found that with GN the feature map after the backbone collapses to a constant (like [-3.1547, -3.1547, -3.1547, ...]) after some training.
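
A generic way to catch this kind of feature collapse automatically is a forward hook that warns when the backbone output is (nearly) constant. This is a plain PyTorch sketch, not code from this repo, and model.backbone is an assumption about how the detector exposes its backbone:

import torch

def register_collapse_check(model):
    # Warn when the hooked module's output has (almost) zero variance,
    # i.e. every pixel carries the same value.
    def hook(module, inputs, output):
        feats = output[0] if isinstance(output, (tuple, list)) else output
        if torch.is_tensor(feats) and feats.float().std().item() < 1e-6:
            print(f'[collapse] {module.__class__.__name__} output is constant')
    # Assumes the detector exposes its backbone as `model.backbone`.
    return model.backbone.register_forward_hook(hook)

The returned handle can be removed with .remove() once training looks healthy.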

@FishLikeApple

By the way, the model is cascade_mask_rcnn_r16_gcb_dconv_c3-c5_x101_32x4d_fpn_syncbn_1x.

@FishLikeApple

I want to train a SyncBN model on a single P40; any suggestion is welcome. Thank you.

@xvjiarui
Owner

xvjiarui commented Aug 6, 2019

There is no need for SyncBN when only one GPU is available; you can use BN instead. The linear scaling rule should be applied to the learning rate according to your effective batch size.

I don't think a non-SyncBN setup would have the zero-mAP issue. The fixBN results can be found in the updated README.
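
As a concrete reading of the linear scaling rule (the lr=0.02 at batch size 16 baseline comes from the config posted above; the single-GPU numbers are just an example):

# Linear LR scaling sketch: the posted config assumes lr=0.02 at an
# effective batch size of 16 (8 GPUs x 2 imgs_per_gpu).
base_lr, base_batch = 0.02, 16
num_gpus, imgs_per_gpu = 1, 2               # example single-GPU setup
lr = base_lr * (num_gpus * imgs_per_gpu) / base_batch   # -> 0.0025

optimizer = dict(type='SGD', lr=lr, momentum=0.9, weight_decay=0.0001)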

@FishLikeApple

It turns out that a batch size of 1 works if I use a smaller learning rate, at the cost of longer training time. So I think this issue may have two causes: one is a learning rate that is too large, the other is some data input error (as mentioned in tensorflow/models#6273).
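
By the same linear scaling rule, a batch size of 1 would correspond to roughly lr = 0.02 * 1/16 = 0.00125; the values below are derived from that rule, not settings confirmed anywhere in this thread:

# Hypothetical batch-size-1 settings; lr follows the linear scaling rule,
# and the longer warmup is a guess to improve early-training stability.
optimizer = dict(type='SGD', lr=0.00125, momentum=0.9, weight_decay=0.0001)
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=4000,
    warmup_ratio=1.0 / 3,
    step=[8, 11])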

@FishLikeApple

In my machine learning course, I learned that small learning rates may lead to local optima, but is predicting nothing a local optimum for object detection tasks like this?
By the way, the paper https://arxiv.org/pdf/1803.08494.pdf shows the shortcomings of small batches without GN, and my course also taught me that a smaller batch gives worse results if the dataset is not large enough, so I want to use some method to deal with the small-batch problem. Thanks for your reply.

@FishLikeApple

By the way, the batch size of 1 didn't work very well during the first third of training. I'm still looking for better settings.
