Discobox evaluation failed with custom dataset #5

Closed
ameyparanjape opened this issue Jun 11, 2022 · 2 comments

Thanks to the authors for making the DiscoBox code public.
I am trying to fine-tune the COCO checkpoint on my custom COCO-style dataset.
Training seems to run fine, but as soon as the script gets into evaluation (during training), it gets stuck, and after about a minute's wait the process is terminated abruptly.
Here is the train command I use:

bash tools/dist_train.sh configs/discobox/custom_discobox_solov2_r50_fpn_3x.py 2

The error I get:

  [>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                     ] 166/279, 1.3 task/s, elapsed: 124s, ETA:    85s
Traceback (most recent call last):
  File "/opt/conda/envs/open-mmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/envs/open-mmlab/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/opt/conda/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/envs/open-mmlab/bin/python', '-u', 'tools/test.py', '--local_rank=0', 'configs/discobox/custom_solov2_r50_fpn_3x.py', 'work_dirs/roboflow_data/epoch_1.pth', '--launcher', 'pytorch', '--eval', 'bbox', 'segm']' died with <Signals.SIGKILL: 9>.

To investigate further, I also tried training on 1 GPU:

bash tools/dist_train.sh configs/discobox/custom_discobox_solov2_r50_fpn_3x.py 1

This failed too with the same error.

Another thing I thought was worth trying was a separate eval run:

bash tools/dist_test.sh configs/discobox/custom_solov2_r50_fpn_3x.py work_dirs/roboflow_data/epoch_1.pth 1 --eval bbox segm

Again, this resulted in the same error.

Has anyone come across this error? Please help, thanks!

ameyparanjape commented Jun 11, 2022

If it helps, here is the config file I'm using:

fp16 = dict(loss_scale=512.)
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
model = dict(
    type='DiscoBoxSOLOv2',
    pretrained='torchvision://resnet50',
    train_cfg=dict(),
    test_cfg=dict(
        nms_pre=500,
        score_thr=0.1,
        mask_thr=0.4,
        update_thr=0.05,
        kernel='gaussian',  # gaussian/linear
        sigma=2.0,
        max_per_img=100),
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3), # C2, C3, C4, C5
        frozen_stages=1,
        style='pytorch'),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        start_level=0,
        num_outs=5),
    bbox_head=dict(
        type='DiscoBoxSOLOv2Head',
        num_classes=6,
        in_channels=256,
        stacked_convs=4,
        seg_feat_channels=512,
        strides=[8, 8, 16, 32, 32],
        scale_ranges=((1, 96), (48, 192), (96, 384), (192, 768), (384, 2048)),
        sigma=0.2,
        num_grids=[40, 36, 24, 16, 12],
        ins_out_channels=256,
        loss_ins=dict(
            type='DiceLoss',
            use_sigmoid=True,
            loss_weight=1.0),
        loss_ts=dict(
            type='DiceLoss',
            momentum=0.999,
            use_ind_teacher=True,
            loss_weight=1.0,
            kernel=3,
            max_iter=10,
            alpha0=2.0,
            theta0=0.5,
            theta1=30.0,
            theta2=20.0,
            base=0.10,
            crf_height=28,
            crf_width=28,
        ),
        loss_cate=dict(
            type='FocalLoss',
            use_sigmoid=True,
            gamma=2.0,
            alpha=0.25,
            loss_weight=1.0),
        loss_corr=dict(
            type='InfoNCE',
            loss_weight=1.0,
            corr_exp=1.0,
            corr_eps=0.05,
            gaussian_filter_size=3,
            low_score=0.3,
            corr_num_iter=10,
            corr_num_smooth_iter=1,
            save_corr_img=False,
            dist_kernel=9,
            obj_bank=dict(
                img_norm_cfg=img_norm_cfg,
                len_object_queues=100,
                fg_iou_thresh=0.7,
                bg_iou_thresh=0.7,
                ratio_range=[0.9, 1.2],
                appear_thresh=0.7,
                min_retrieval_objs=2,
                max_retrieval_objs=5,
                feat_height=7,
                feat_width=7,
                mask_height=28,
                mask_width=28,
                img_height=200,
                img_width=200,
                min_size=32,
                num_gpu_bank=20,
            )
        )
    ),
    mask_feat_head=dict(
            type='DiscoBoxMaskFeatHead',
            in_channels=256,
            out_channels=128,
            start_level=0,
            end_level=3,
            num_classes=256,
            norm_cfg=dict(type='GN', num_groups=32, requires_grad=True)),
    )

# dataset settings
dataset_type = 'CocoDataset'
data_root = 'data/'
classes = ('a','b','c','d','e','f',)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='GenerateBoxMask'),
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks']),
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
data = dict(
    samples_per_gpu=8,
    workers_per_gpu=0,
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/train.json',
        img_prefix=data_root + 'train/',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/val.json',
        img_prefix=data_root + 'val/',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/test.json',
        img_prefix=data_root + 'test/',
        pipeline=test_pipeline))
# optimizer
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
# learning policy
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=2000,
    warmup_ratio=0.01,
    step=[8, 9])
checkpoint_config = dict(interval=1)
# yapf:disable
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook'),
        # dict(type='TensorboardLoggerHook')
    ])
# yapf:enable
# runtime settings
runner = dict(type='EpochBasedRunner', max_epochs=20)
evaluation = dict(interval=1, metric=['bbox', 'segm'])
device_ids = range(8)
dist_params = dict(backend='nccl')
log_level = 'INFO'
work_dir = './work_dirs/exp1'
load_from = None
resume_from = None
workflow = [('train', 1)]
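
For completeness, the category list in the annotation files can be cross-checked against num_classes=6 with pycocotools (installed above as mmpycocotools). A minimal sketch, using the paths from the data settings above:

from pycocotools.coco import COCO

# Load the training annotations and list their categories
# (paths mirror data_root + ann_file in the config above).
coco = COCO('data/annotations/train.json')
cats = coco.loadCats(coco.getCatIds())
names = [c['name'] for c in cats]
print('%d categories: %s' % (len(names), names))
# bbox_head.num_classes (6 here) must match this count.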

My environment specs:

Linux (Debian)
mmcv-full        1.3.17
mmdet            2.25.0
pytorch          1.6.0
openmim          0.1.5
mmpycocotools    12.0.3

@voidrank (Contributor)

Hi @ameyparanjape,

Thanks for your interest in our work.

I noticed the log says "died with <Signals.SIGKILL: 9>". This means the Linux kernel killed the job you were running. In my experience, this is usually caused by insufficient CPU memory (the kernel's OOM killer stepping in). You might want to check the memory usage during eval.
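
In case it helps, a minimal watchdog along these lines can confirm whether system RAM is being exhausted. This is only a sketch: it assumes the third-party psutil package is installed (pip install psutil) and is meant to run in a second terminal while evaluation is in progress:

# memwatch.py -- print system RAM and swap usage until interrupted
# (a sketch; assumes `pip install psutil`).
import time

import psutil

def watch_memory(interval_s=2.0):
    """Print RAM/swap usage every `interval_s` seconds."""
    while True:
        mem = psutil.virtual_memory()
        swap = psutil.swap_memory()
        print('RAM used: %.1f/%.1f GB (%.0f%%), swap: %.0f%%'
              % (mem.used / 1e9, mem.total / 1e9, mem.percent, swap.percent))
        time.sleep(interval_s)

if __name__ == '__main__':
    watch_memory()

If available memory drops toward zero right before the crash, dmesg will usually also show an "Out of memory: Killed process ..." entry from the kernel afterwards.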

Best,

Shiyi
