
RuntimeError: CUDA error: an illegal memory access was encountered when evaluating Beyond Bounding-Box? #190

Closed
Anm-pinellia opened this issue Apr 6, 2022 · 21 comments

@Anm-pinellia

Describe the bug
I am trying to train and test the clf model using the rotated_reppoints_r50_fpn_1x_dota_oc config. The training process finishes successfully, but an error occurs during evaluation. Here is the log:

File "E:/Experiment/目标检测实验/MMDet/OBB_Detectors/test.py", line 87, in predict
outputs = single_gpu_test(model, data_loader, args.show, args.show_dir, args.show_score_thr)
File "E:\Anaconda\envs\mmdet2\lib\site-packages\mmdet\apis\test.py", line 31, in single_gpu_test
result = model(return_loss=False, rescale=True, **data)
File "E:\Anaconda\envs\mmdet2\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "E:\Anaconda\envs\mmdet2\lib\site-packages\mmcv\parallel\data_parallel.py", line 50, in forward
return super().forward(*inputs, **kwargs)
File "E:\Anaconda\envs\mmdet2\lib\site-packages\torch\nn\parallel\data_parallel.py", line 165, in forward
return self.module(*inputs[0], **kwargs[0])
File "E:\Anaconda\envs\mmdet2\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "E:\Anaconda\envs\mmdet2\lib\site-packages\mmcv\runner\fp16_utils.py", line 109, in new_func
return old_func(*args, **kwargs)
File "E:\Anaconda\envs\mmdet2\lib\site-packages\mmdet\models\detectors\base.py", line 174, in forward
return self.forward_test(img, img_metas, **kwargs)
File "E:\Anaconda\envs\mmdet2\lib\site-packages\mmdet\models\detectors\base.py", line 147, in forward_test
return self.simple_test(imgs[0], img_metas[0], **kwargs)
File "E:\Anaconda\envs\mmdet2\lib\site-packages\mmrotate\models\detectors\single_stage.py", line 100, in simple_test
bbox_list = self.bbox_head.get_bboxes(*outs, img_metas, rescale=rescale)
File "E:\Anaconda\envs\mmdet2\lib\site-packages\mmcv\runner\fp16_utils.py", line 197, in new_func
return old_func(*args, **kwargs)
File "E:\Anaconda\envs\mmdet2\lib\site-packages\mmrotate\models\dense_heads\rotated_reppoints_head.py", line 1044, in get_bboxes
results = self._get_bboxes_single(cls_score_list, point_pred_list,
File "E:\Anaconda\envs\mmdet2\lib\site-packages\mmrotate\models\dense_heads\rotated_reppoints_head.py", line 1136, in _get_bboxes_single
mlvl_bboxes[..., :4] /= mlvl_bboxes[..., :4].new_tensor(scale_factor)
RuntimeError: CUDA error: an illegal memory access was encountered

Details

I have tried to find the reason for this error, and found that it is caused by a specific piece of data.
Specifically, the function min_area_polygons called in rotated_reppoints_head.py cannot handle the variable pts in the second iteration of processing.

The num_classes in the model config was changed to 16 in my test to fit DOTA v1.5.
The data that caused this error has been uploaded to Baidu Cloud in case you need to test it:
Link: https://pan.baidu.com/s/1CCnthEl-kzOIXfJnU-PxRQ?pwd=l13h
Extraction code: l13h

  1. What dataset did you use?
    DOTA v1.5, split with a patch size of 1024 and a gap of 500

Environment
Other images can be predicted correctly. The results of mmrotate/utils/collect_env.py are shown here:
sys.platform: win32
Python: 3.8.12 (default, Oct 12 2021, 03:01:40) [MSC v.1916 64 bit (AMD64)]
CUDA available: True
GPU 0: GeForce GTX 1660 Ti
CUDA_HOME: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2
NVCC: Cuda compilation tools, release 10.2, V10.2.89
GCC: n/a
PyTorch: 1.8.1
PyTorch compiling details: PyTorch built with:

C++ Version: 199711
MSVC 192829913
Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
OpenMP 2019
CPU capability usage: AVX2
CUDA Runtime 10.2
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
CuDNN 7.6.5
Magma 2.5.4
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=C:/cb/pytorch_1000000000000/work/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -DNDEBUG -DUSE_FBGEMM -DUSE_XNNPACK, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON,
TorchVision: 0.9.1
OpenCV: 4.5.5
MMCV: 1.4.5
MMCV Compiler: MSVC 192930136
MMCV CUDA Compiler: 10.2
MMRotate: 0.1.1+

@zytx121
Collaborator

zytx121 commented Apr 7, 2022

Please add CUDA_LAUNCH_BLOCKING=1 and paste the error log.
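
For readers hitting the same thing, a minimal sketch (assumed usage, not part of the original reply) of enabling this from Python before the script touches the GPU; setting the variable in the shell before launching the test script works just as well:

# Force synchronous kernel launches so the traceback points at the CUDA
# kernel that actually faults. Must run before the first CUDA call.
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch  # imported only after setting the variable, on purpose
print(torch.cuda.is_available())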

@Anm-pinellia
Author

Please add CUDA_LAUNCH_BLOCKING=1 and paste the error log.

The same error occurred with this setting:
load checkpoint from local path: E:\Experiment\目标检测实验\MMDet\OBB_Detectors\Beyond_BoundingBox\work_dir\latest.pth
[ ] 0/1, elapsed: 0s, ETA:Traceback (most recent call last):
File "E:/Experiment/目标检测实验/MMDet/OBB_Detectors/test.py", line 143, in

File "E:/Experiment/目标检测实验/MMDet/OBB_Detectors/test.py", line 87, in predict
outputs = single_gpu_test(model, data_loader, args.show, args.show_dir, args.show_score_thr)
File "E:\Anaconda\envs\mmdet2\lib\site-packages\mmdet\apis\test.py", line 29, in single_gpu_test
result = model(return_loss=False, rescale=True, **data)
File "E:\Anaconda\envs\mmdet2\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "E:\Anaconda\envs\mmdet2\lib\site-packages\mmcv\parallel\data_parallel.py", line 50, in forward
return super().forward(*inputs, **kwargs)
File "E:\Anaconda\envs\mmdet2\lib\site-packages\torch\nn\parallel\data_parallel.py", line 165, in forward
return self.module(*inputs[0], **kwargs[0])
File "E:\Anaconda\envs\mmdet2\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "E:\Anaconda\envs\mmdet2\lib\site-packages\mmcv\runner\fp16_utils.py", line 109, in new_func
return old_func(*args, **kwargs)
File "E:\Anaconda\envs\mmdet2\lib\site-packages\mmdet\models\detectors\base.py", line 174, in forward
return self.forward_test(img, img_metas, **kwargs)
File "E:\Anaconda\envs\mmdet2\lib\site-packages\mmdet\models\detectors\base.py", line 147, in forward_test
return self.simple_test(imgs[0], img_metas[0], **kwargs)
File "E:\Anaconda\envs\mmdet2\lib\site-packages\mmrotate\models\detectors\single_stage.py", line 100, in simple_test
bbox_list = self.bbox_head.get_bboxes(*outs, img_metas, rescale=rescale)
File "E:\Anaconda\envs\mmdet2\lib\site-packages\mmcv\runner\fp16_utils.py", line 197, in new_func
return old_func(*args, **kwargs)
File "E:\Anaconda\envs\mmdet2\lib\site-packages\mmrotate\models\dense_heads\rotated_reppoints_head.py", line 1044, in get_bboxes
results = self._get_bboxes_single(cls_score_list, point_pred_list,
File "E:\Anaconda\envs\mmdet2\lib\site-packages\mmrotate\models\dense_heads\rotated_reppoints_head.py", line 1125, in _get_bboxes_single
poly_pred = self.points2rotrect(points_pred, y_first=True)
File "E:\Anaconda\envs\mmdet2\lib\site-packages\mmrotate\models\dense_heads\rotated_reppoints_head.py", line 211, in points2rotrect
rotrect_pred = min_area_polygons(pts)
File "E:\Anaconda\envs\mmdet2\lib\site-packages\mmcv\ops\min_area_polygons.py", line 17, in min_area_polygons
ext_module.min_area_polygons(pointsets, polygons)
RuntimeError: CUDA error: an illegal memory access was encountered

@zytx121
Collaborator

zytx121 commented Apr 8, 2022

This error is caused by incorrect input to min_area_polygons. Could you print the pts in rotrect_pred = min_area_polygons(pts) and check its format?
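
For anyone debugging the same crash, a minimal sketch of the kind of check meant here (the helper below is hypothetical; the variable name follows rotated_reppoints_head.py). For RepPoints with 9 points per set, the op expects a contiguous, finite float CUDA tensor of shape (N, 18):

import torch

def inspect_pointsets(pts: torch.Tensor) -> None:
    # Print the properties min_area_polygons is sensitive to.
    print('shape      :', tuple(pts.shape))
    print('dtype      :', pts.dtype)
    print('device     :', pts.device)
    print('contiguous :', pts.is_contiguous())
    print('has nan    :', torch.isnan(pts).any().item())
    print('has inf    :', torch.isinf(pts).any().item())
    print('min / max  :', pts.min().item(), pts.max().item())

# usage inside points2rotrect, just before the op:
#     inspect_pointsets(pts)
#     rotrect_pred = min_area_polygons(pts)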

@zytx121 zytx121 added the bug Something isn't working label Apr 8, 2022
@Anm-pinellia
Author

This error is caused by incorrect input to min_area_polygons. Could you print the pts in rotrect_pred = min_area_polygons(pts) and check its format?

the first pts are shown here:
tensor([[-1.0003, -0.4358, 0.6460, ..., 1.0339, -0.3023, -0.3141],
[-0.3291, -1.7686, 0.5437, ..., 1.3618, -0.1748, -0.6429],
[-0.1825, -0.4383, 1.5624, ..., 0.8692, 0.1711, -0.2527],
...,
[ 0.0894, -0.0731, 1.9433, ..., 0.7824, 0.4239, -0.0639],
[-1.6038, -0.4877, 0.4154, ..., 3.1096, -0.4383, -0.1936],
[-1.1776, -0.4550, 0.5818, ..., 0.3581, -0.4686, -0.7048]],
device='cuda:0')


and the second pts are shown here:
tensor([[-1.2346, -3.3156, 1.1556, ..., 2.6584, -0.0107, -0.4711],
[-0.7851, -0.7225, 1.2545, ..., 0.6979, -0.0919, -0.3629],
[-1.1499, -2.4557, 1.1019, ..., 3.6952, -0.1184, -0.0699],
...,
[-1.8137, -2.7932, 2.7604, ..., 4.1961, 0.4080, 0.2027],
[-2.8470, -1.2553, 3.2741, ..., 3.0502, -0.2177, -0.1452],
[-1.8971, -0.9886, 2.0489, ..., 1.6850, -0.1780, -0.2618]],
device='cuda:0')


I tested both of them in min_area_polygons and found that the second pts causes this error.
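
In case it helps with reproducing, a rough isolation sketch under the assumption that the offending tensor is dumped from points2rotrect first (the file name is made up):

import torch
from mmcv.ops import min_area_polygons

# Inside points2rotrect, dump the suspect input once:
#     torch.save(pts.detach().cpu(), 'failing_pts.pt')

# Then try to reproduce the crash in isolation:
pts = torch.load('failing_pts.pt').cuda()
polygons = min_area_polygons(pts)  # crashes here for the bad point set
print(polygons.shape)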

@zytx121
Collaborator

zytx121 commented Apr 14, 2022

It looks like a bug in min_area_polygons; we will check it.

@qixiong-wang

I met the same error when testing the rotated RepPoints model.

@Anm-pinellia
Author

I met the same error when testing the rotated RepPoints model.

It seems to be a bug...

@TuanTNG

TuanTNG commented May 12, 2022

I met the same error when testing the rotated RepPoints model.

I also have the same error.

@19990101lrk

I have the same problem, is there a solution?

@yangxue0827
Collaborator

A successful solution: set a smaller nms_pre:

test_cfg=dict(
        nms_pre=1000,
        min_bbox_size=0,
        score_thr=0.05,
        nms=dict(iou_thr=0.4),
        max_per_img=2000))
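
For context, nms_pre caps how many top-scoring candidates per feature level are kept before the polygon conversion and NMS, so lowering it reduces the number of point sets fed into min_area_polygons. A rough sketch of applying the override programmatically with mmcv's Config (the config path and the test_cfg nesting under model are assumptions based on recent mmrotate layouts):

from mmcv import Config

cfg = Config.fromfile(
    'configs/rotated_reppoints/rotated_reppoints_r50_fpn_1x_dota_oc.py')
# Reduce the number of candidates kept before polygon conversion / NMS.
cfg.model.test_cfg.nms_pre = 1000
cfg.dump('rotated_reppoints_r50_fpn_1x_dota_oc_small_nms_pre.py')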

@austinmw

@yangxue0827 That change did not solve the issue for me as mentioned in #405 (comment)

@GisRookie

I also have the same problem after changing nms_pre.

@jiayuan666

jiayuan666 commented Dec 21, 2022

Same error as mine. I have tried on a Tesla V100-PCIE 32GB and an RTX 3090 (26G).
When I modified nms_pre to 20 it worked, but the same problem occurred when nms_pre > 30.
Here is the error log with CUDA_LAUNCH_BLOCKING=1:

[                           ] 232/16540, 19.4 task/s, elapsed: 12s, ETA:   840sTraceback (most recent call last):
  File "/home/jiayuan666/PycharmProjects/mmrotate/tools/test.py", line 257, in <module>
    main()
  File "/home/jiayuan666/PycharmProjects/mmrotate/tools/test.py", line 222, in main
    outputs = single_gpu_test(model, data_loader, args.show, args.show_dir,
  File "/home/jiayuan666/.conda/envs/mmrotate0.3.3/lib/python3.8/site-packages/mmdet/apis/test.py", line 29, in single_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/home/jiayuan666/.conda/envs/mmrotate0.3.3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jiayuan666/.conda/envs/mmrotate0.3.3/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 51, in forward
    return super().forward(*inputs, **kwargs)
  File "/home/jiayuan666/.conda/envs/mmrotate0.3.3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/jiayuan666/.conda/envs/mmrotate0.3.3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jiayuan666/.conda/envs/mmrotate0.3.3/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 119, in new_func
    return old_func(*args, **kwargs)
  File "/home/jiayuan666/.conda/envs/mmrotate0.3.3/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 174, in forward
    return self.forward_test(img, img_metas, **kwargs)
  File "/home/jiayuan666/.conda/envs/mmrotate0.3.3/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 147, in forward_test
    return self.simple_test(imgs[0], img_metas[0], **kwargs)
  File "/home/jiayuan666/PycharmProjects/mmrotate/mmrotate/models/detectors/single_stage.py", line 101, in simple_test
    bbox_list = self.bbox_head.get_bboxes(
  File "/home/jiayuan666/.conda/envs/mmrotate0.3.3/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 208, in new_func
    return old_func(*args, **kwargs)
  File "/home/jiayuan666/PycharmProjects/mmrotate/mmrotate/models/dense_heads/rotated_reppoints_head.py", line 1066, in get_bboxes
    results = self._get_bboxes_single(cls_score_list, point_pred_list,
  File "/home/jiayuan666/PycharmProjects/mmrotate/mmrotate/models/dense_heads/rotated_reppoints_head.py", line 1148, in _get_bboxes_single
    poly_pred = self.points2rotrect(points_pred, y_first=True)
  File "/home/jiayuan666/PycharmProjects/mmrotate/mmrotate/models/dense_heads/rotated_reppoints_head.py", line 211, in points2rotrect
    rotrect_pred = min_area_polygons(pts)
  File "/home/jiayuan666/.conda/envs/mmrotate0.3.3/lib/python3.8/site-packages/mmcv/ops/min_area_polygons.py", line 19, in min_area_polygons
    ext_module.min_area_polygons(pointsets, polygons)
RuntimeError: CUDA error: an illegal memory access was encountered
Exception raised from MinAreaPolygonsCUDAKernelLauncher at /tmp/mmcv/mmcv/ops/csrc/pytorch/cuda/min_area_polygons.cu:20 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5dae0c5497 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::CUDAError::Error(c10::SourceLocation, std::string) + 0x30 (0x7f5d6b111f84 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/lib/python3.8/site-packages/mmcv/_ext.cpython-38-x86_64-linux-gnu.so)
frame #2: MinAreaPolygonsCUDAKernelLauncher(at::Tensor, at::Tensor) + 0x17e (0x7f5d6b186b35 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/lib/python3.8/site-packages/mmcv/_ext.cpython-38-x86_64-linux-gnu.so)
frame #3: min_area_polygons_cuda(at::Tensor, at::Tensor) + 0x49 (0x7f5d6b14b9c9 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/lib/python3.8/site-packages/mmcv/_ext.cpython-38-x86_64-linux-gnu.so)
frame #4: auto Dispatch<DeviceRegistry<void (*)(at::Tensor, at::Tensor), &(min_area_polygons_impl(at::Tensor, at::Tensor))>, at::Tensor const&, at::Tensor&>(DeviceRegistry<void (*)(at::Tensor, at::Tensor), &(min_area_polygons_impl(at::Tensor, at::Tensor))> const&, char const*, at::Tensor const&, at::Tensor&) + 0xb7 (0x7f5d6b29b527 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/lib/python3.8/site-packages/mmcv/_ext.cpython-38-x86_64-linux-gnu.so)
frame #5: min_area_polygons(at::Tensor, at::Tensor) + 0x49 (0x7f5d6b29b3a9 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/lib/python3.8/site-packages/mmcv/_ext.cpython-38-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x2bb89d (0x7f5d6b2c089d in /home/jiayuan666/.conda/envs/mmrotate0.3.3/lib/python3.8/site-packages/mmcv/_ext.cpython-38-x86_64-linux-gnu.so)
frame #7: <unknown function> + 0x2d12db (0x7f5d6b2d62db in /home/jiayuan666/.conda/envs/mmrotate0.3.3/lib/python3.8/site-packages/mmcv/_ext.cpython-38-x86_64-linux-gnu.so)
frame #8: PyCFunction_Call + 0x52 (0x4dfd82 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #9: _PyObject_MakeTpCall + 0x3eb (0x4d0c5b in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #10: _PyEval_EvalFrameDefault + 0x5265 (0x4cc005 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #11: _PyFunction_Vectorcall + 0x106 (0x4d9d16 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #12: _PyEval_EvalFrameDefault + 0x907 (0x4c76a7 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #13: _PyEval_EvalCodeWithName + 0x1f5 (0x4c5c45 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #14: /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python() [0x4e8224]
frame #15: _PyEval_EvalFrameDefault + 0x172a (0x4c84ca in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #16: _PyEval_EvalCodeWithName + 0x1f5 (0x4c5c45 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #17: _PyFunction_Vectorcall + 0x19c (0x4d9dac in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #18: /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python() [0x4e8197]
frame #19: PyObject_Call + 0x5e (0x4ec16e in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #20: _PyEval_EvalFrameDefault + 0x2051 (0x4c8df1 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #21: _PyEval_EvalCodeWithName + 0x1f5 (0x4c5c45 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #22: _PyFunction_Vectorcall + 0x19c (0x4d9dac in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #23: PyObject_Call + 0x5e (0x4ec16e in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x2051 (0x4c8df1 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #25: _PyEval_EvalCodeWithName + 0x1f5 (0x4c5c45 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #26: _PyFunction_Vectorcall + 0x19c (0x4d9dac in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #27: /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python() [0x4e8197]
frame #28: PyObject_Call + 0x5e (0x4ec16e in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #29: _PyEval_EvalFrameDefault + 0x2051 (0x4c8df1 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #30: _PyEval_EvalCodeWithName + 0x1f5 (0x4c5c45 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #31: _PyFunction_Vectorcall + 0x19c (0x4d9dac in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #32: /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python() [0x4e8197]
frame #33: PyObject_Call + 0x5e (0x4ec16e in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x2051 (0x4c8df1 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #35: _PyEval_EvalCodeWithName + 0x1f5 (0x4c5c45 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #36: _PyFunction_Vectorcall + 0x19c (0x4d9dac in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #37: /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python() [0x4e8197]
frame #38: PyObject_Call + 0x5e (0x4ec16e in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #39: _PyEval_EvalFrameDefault + 0x2051 (0x4c8df1 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #40: _PyEval_EvalCodeWithName + 0x1f5 (0x4c5c45 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #41: _PyFunction_Vectorcall + 0x19c (0x4d9dac in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #42: PyObject_Call + 0x5e (0x4ec16e in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #43: _PyEval_EvalFrameDefault + 0x2051 (0x4c8df1 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #44: _PyEval_EvalCodeWithName + 0x1f5 (0x4c5c45 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #45: _PyFunction_Vectorcall + 0x19c (0x4d9dac in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #46: /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python() [0x4e8197]
frame #47: PyObject_Call + 0x5e (0x4ec16e in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #48: _PyEval_EvalFrameDefault + 0x2051 (0x4c8df1 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #49: _PyEval_EvalCodeWithName + 0x1f5 (0x4c5c45 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #50: _PyFunction_Vectorcall + 0x19c (0x4d9dac in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #51: _PyObject_FastCallDict + 0x25f (0x4d028f in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #52: _PyObject_Call_Prepend + 0x60 (0x4e4720 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #53: /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python() [0x541f07]
frame #54: PyObject_Call + 0x272 (0x4ec382 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #55: _PyEval_EvalFrameDefault + 0x2051 (0x4c8df1 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #56: _PyEval_EvalCodeWithName + 0x1f5 (0x4c5c45 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #57: _PyFunction_Vectorcall + 0x19c (0x4d9dac in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #58: /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python() [0x4e8197]
frame #59: PyObject_Call + 0x5e (0x4ec16e in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #60: _PyEval_EvalFrameDefault + 0x2051 (0x4c8df1 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #61: _PyEval_EvalCodeWithName + 0x1f5 (0x4c5c45 in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #62: _PyFunction_Vectorcall + 0x19c (0x4d9dac in /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python)
frame #63: /home/jiayuan666/.conda/envs/mmrotate0.3.3/bin/python() [0x4e8197]

@sunny-sjj

I want to know whether this bug has been solved.

@pphgood

pphgood commented Mar 20, 2023

I also have the same problem after changing nms_pre. Has this bug been solved?

@pphgood

pphgood commented Mar 20, 2023

I want to know whether this bug has been solved.

A successful solution: set a smaller nms_pre:

test_cfg=dict(
        nms_pre=1000,
        min_bbox_size=0,
        score_thr=0.05,
        nms=dict(iou_thr=0.4),
        max_per_img=2000))

I also have the same problem after changing nms_pre. I hit this error when testing one particular image, but not the others. Has this bug been solved?

@freshn

freshn commented Jun 16, 2023

Same here.

@ToneZe

ToneZe commented Oct 14, 2023

The same problem.
sys.platform: linux
Python: 3.8.13 (default, Apr 19 2022, 00:53:22) [GCC 7.5.0]
CUDA available: True
GPU 0: Tesla V100-PCIE-16GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.2, V10.2.8
GCC: x86_64-linux-gnu-gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.8.1+cu102
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 10.2
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70
  • CuDNN 7.6.5
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.9.1+cu102
OpenCV: 4.8.0
MMCV: 1.6.1
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.2
MMRotate: 0.3.4+

File "./tools/train.py", line 192, in
if name == 'main':
File "./tools/train.py", line 181, in main
model.CLASSES = datasets[0].CLASSES
File "/hy-tmp/LSKNet/mmrotate/apis/train.py", line 141, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
epoch_runner(data_loaders[i], **kwargs)
File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/epoch_based_runner.py", line 58, in train
self.call_hook('after_train_epoch')
File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/base_runner.py", line 317, in call_hook
getattr(hook, fn_name)(self)
File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/hooks/evaluation.py", line 271, in after_train_epoch
self._do_evaluate(runner)
File "/usr/local/lib/python3.8/dist-packages/mmdet/core/evaluation/eval_hooks.py", line 126, in _do_evaluate
results = multi_gpu_test(
File "/usr/local/lib/python3.8/dist-packages/mmdet/apis/test.py", line 109, in multi_gpu_test
result = model(return_loss=False, rescale=True, **data)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/fp16_utils.py", line 116, in new_func
return old_func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/mmdet/models/detectors/base.py", line 174, in forward
return self.forward_test(img, img_metas, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/mmdet/models/detectors/base.py", line 147, in forward_test
return self.simple_test(imgs[0], img_metas[0], **kwargs)
File "/hy-tmp/LSKNet/mmrotate/models/detectors/single_stage.py", line 101, in simple_test
bbox_list = self.bbox_head.get_bboxes(
File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/fp16_utils.py", line 205, in new_func
return old_func(*args, **kwargs)
File "/hy-tmp/LSKNet/mmrotate/models/dense_heads/sam_reppoints_head.py", line 734, in get_bboxes
results = self._get_bboxes_single(cls_score_list, point_pred_list,
File "/hy-tmp/LSKNet/mmrotate/models/dense_heads/sam_reppoints_head.py", line 828, in _get_bboxes_single
mlvl_bboxes[..., :4] /= mlvl_bboxes[..., :4].new_tensor(
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f76727062f2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f767270367b in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7f767295e1f9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f76726ee3a4 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x6e43ca (0x7f76226593ca in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x6e4461 (0x7f7622659461 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #6: /usr/bin/python() [0x53f4c3]
frame #7: /usr/bin/python() [0x58d35c]
frame #8: /usr/bin/python() [0x57945b]
frame #9: /usr/bin/python() [0x53de5c]
frame #10: /usr/bin/python() [0x5abb4c]
frame #11: /usr/bin/python() [0x5bcee8]
frame #12: /usr/bin/python() [0x5bce4e]
frame #13: /usr/bin/python() [0x5bce4e]
frame #14: /usr/bin/python() [0x5bce4e]
frame #15: /usr/bin/python() [0x5bce4e]
frame #16: /usr/bin/python() [0x5bce4e]
frame #17: /usr/bin/python() [0x5bce4e]
frame #18: /usr/bin/python() [0x5bce4e]
frame #19: /usr/bin/python() [0x5bce4e]
frame #20: /usr/bin/python() [0x5bce4e]
frame #21: /usr/bin/python() [0x5bce4e]
frame #22: /usr/bin/python() [0x56ed56]
frame #23: PyDict_SetItemString + 0x50 (0x5751a0 in /usr/bin/python)
frame #24: PyImport_Cleanup + 0x76 (0x64e5a6 in /usr/bin/python)
frame #25: Py_FinalizeEx + 0x6e (0x6407ce in /usr/bin/python)
frame #26: Py_RunMain + 0xf9 (0x671c79 in /usr/bin/python)
frame #27: Py_BytesMain + 0x29 (0x672009 in /usr/bin/python)
frame #28: __libc_start_main + 0xe7 (0x7f7684b63bf7 in /lib/x86_64-linux-gnu/libc.so.6)
frame #29: _start + 0x2a (0x5e201a in /usr/bin/python)

Hitting this error at epoch 50 of training is very frustrating.
Killing subprocess 506
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 340, in
main()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)

@sgtojd

sgtojd commented Nov 29, 2023

I wonder, has this bug been solved?

@walkerinrain

I found that this bug results from some bad labels.
When an image is split into small patches, some bounding boxes are cut apart, and extremely narrow bounding boxes are generated near the new patch borders. After forward propagation through the network, these bounding boxes cause the network to generate prediction tensors with particularly large dimensions, and the training process is terminated because it exceeds the computational capacity.
Once we know the reason, the solution is easy to find. Here are two options (a rough filtering sketch follows below):
1. Find the image and label being processed when the training process is terminated, then delete them;
2. Find bad labels like the one in the screenshot below and delete them.
[Screenshot 2024-03-12 161210: example of a bad, extremely narrow bounding-box label]
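
To make the second option concrete, a rough sketch of scanning split DOTA-style annotation files for degenerate (extremely narrow) polygons. The directory layout and the 2-pixel threshold are assumptions, not part of the original comment:

import glob
import os

import numpy as np

MIN_SIDE = 2.0  # pixels; boxes with a shorter edge than this are flagged (assumed threshold)

def polygon_min_side(poly):
    # poly: (4, 2) array of corner points; returns the shortest edge length.
    edges = np.roll(poly, -1, axis=0) - poly
    return float(np.linalg.norm(edges, axis=1).min())

for txt in glob.glob(os.path.join('split_ss_dota/annfiles', '*.txt')):
    with open(txt) as f:
        lines = f.readlines()
    bad = []
    for line in lines:
        parts = line.split()
        if len(parts) < 8:
            continue  # skip header / metadata lines
        poly = np.array(parts[:8], dtype=np.float64).reshape(4, 2)
        if polygon_min_side(poly) < MIN_SIDE:
            bad.append(line.strip())
    if bad:
        print(f'{os.path.basename(txt)}: {len(bad)} suspicious box(es)')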

@Anm-pinellia
Author

Thanks for your kind suggestion. This is a good way to solve this problem.
