
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) #1618

Open

deepakkupanda opened this issue May 26, 2022 · 6 comments

deepakkupanda commented May 26, 2022

I am trying to run the BEiT algorithm on the DUTS dataset:

tools/dist_train.sh configs/beit/upernet_beit-base_640x640_80k_duts_ms.py 1 --work-dir work_dirs/upernet_beit-base_640x640_80k_duts/ --deterministic

2022-05-26 12:51:23,588 - mmseg - INFO - Checkpoints will be saved to /mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/work_dirs/upernet_beit-base_640x640_80k_duts by HardDiskBackend.
2022-05-26 12:54:05,915 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
Traceback (most recent call last):
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/tools/train.py", line 240, in
main()
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/tools/train.py", line 229, in main
train_segmentor(
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/apis/train.py", line 191, in train_segmentor
runner.run(data_loaders, cfg.workflow)
File "/anaconda/envs/open-mmlab/lib/python3.10/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
iter_runner(iter_loaders[i], **kwargs)
File "/anaconda/envs/open-mmlab/lib/python3.10/site-packages/mmcv/runner/iter_based_runner.py", line 61, in train
outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
File "/anaconda/envs/open-mmlab/lib/python3.10/site-packages/mmcv/parallel/distributed.py", line 59, in train_step
output = self.module.train_step(*inputs[0], **kwargs[0])
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/segmentors/base.py", line 138, in train_step
losses = self(**data_batch)
File "/anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/anaconda/envs/open-mmlab/lib/python3.10/site-packages/mmcv/runner/fp16_utils.py", line 110, in new_func
return old_func(*args, **kwargs)
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/segmentors/base.py", line 108, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 143, in forward_train
loss_decode = self._decode_head_forward_train(x, img_metas,
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 86, in _decode_head_forward_train
loss_decode = self.decode_head.forward_train(x, img_metas,
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 204, in forward_train
losses = self.losses(seg_logits, gt_semantic_seg)
File "/anaconda/envs/open-mmlab/lib/python3.10/site-packages/mmcv/runner/fp16_utils.py", line 198, in new_func
return old_func(*args, **kwargs)
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 264, in losses
loss['acc_seg'] = accuracy(
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/losses/accuracy.py", line 49, in accuracy
correct = correct[:, target != ignore_index]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1646755897462/work/c10/cuda/CUDACachingAllocator.cpp:1230 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f79ad35d1bd in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0x1f037 (0x7f79df9aa037 in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x23a (0x7f79df9ae3ea in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x2ecd68 (0x7f7a303a3d68 in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f79ad343fb5 in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #5: + 0x1db609 (0x7f7a30292609 in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x4c671c (0x7f7a3057d71c in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x292 (0x7f7a3057da22 in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #8: + 0x13e79b (0x564387c4079b in /anaconda/envs/open-mmlab/bin/python)
frame #9: + 0x13de78 (0x564387c3fe78 in /anaconda/envs/open-mmlab/bin/python)
frame #10: + 0x13dd53 (0x564387c3fd53 in /anaconda/envs/open-mmlab/bin/python)
frame #11: + 0x13e0fc (0x564387c400fc in /anaconda/envs/open-mmlab/bin/python)
frame #12: + 0x13ec11 (0x564387c40c11 in /anaconda/envs/open-mmlab/bin/python)
frame #13: + 0x13ebf2 (0x564387c40bf2 in /anaconda/envs/open-mmlab/bin/python)
frame #14: + 0x13ebf2 (0x564387c40bf2 in /anaconda/envs/open-mmlab/bin/python)
frame #15: + 0x13ebf2 (0x564387c40bf2 in /anaconda/envs/open-mmlab/bin/python)
frame #16: + 0x13ebf2 (0x564387c40bf2 in /anaconda/envs/open-mmlab/bin/python)
frame #17: + 0x13ebf2 (0x564387c40bf2 in /anaconda/envs/open-mmlab/bin/python)
frame #18: + 0x15673e (0x564387c5873e in /anaconda/envs/open-mmlab/bin/python)
frame #19: PyDict_SetItemString + 0x64 (0x564387ca0e04 in /anaconda/envs/open-mmlab/bin/python)
frame #20: + 0x28d46d (0x564387d8f46d in /anaconda/envs/open-mmlab/bin/python)
frame #21: Py_FinalizeEx + 0x175 (0x564387d8f9c5 in /anaconda/envs/open-mmlab/bin/python)
frame #22: Py_RunMain + 0x1af (0x564387d9440f in /anaconda/envs/open-mmlab/bin/python)
frame #23: Py_BytesMain + 0x39 (0x564387d947d9 in /anaconda/envs/open-mmlab/bin/python)
frame #24: __libc_start_main + 0xe7 (0x7f7a68a07bf7 in /lib/x86_64-linux-gnu/libc.so.6)
frame #25: + 0x2125d4 (0x564387d145d4 in /anaconda/envs/open-mmlab/bin/python)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 12206) of binary: /anaconda/envs/open-mmlab/bin/python
tools/dist_train.sh: line 19: 12194 Segmentation fault (core dumped) python -m torch.distributed.launch --nnodes=$NNODES --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR --nproc_per_node=$GPUS --master_port=$PORT $(dirname "$0")/train.py $CONFIG --seed 0 --launcher pytorch ${@:3}

@deepakkupanda (Author)

I am able to run PSPNet on the ADE20K dataset:
tools/dist_train.sh configs/pspnet/pspnet_r101-d8_512x512_80k_ade20k.py 1

@deepakkupanda (Author)

@xiaoachen98 Please help me solve this error.

@deepakkupanda (Author)

@donglixp Please help me solve this error.

MeowZheng assigned MengzhangLI and unassigned xiaoachen98 on May 27, 2022
@MeowZheng (Collaborator)

Based on your error log:

"/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/losses/accuracy.py", line 49, in accuracy
correct = correct[:, target != ignore_index]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.

I think something might be wrong with ignore_index. What is the ignore_index in your config?
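
A common cause of this particular illegal memory access is ground-truth label values that fall outside [0, num_classes) and are not equal to ignore_index. Below is a minimal sketch for sanity-checking the annotation masks on CPU; the mask path and NUM_CLASSES are placeholders, not values taken from this issue's config:

# Sanity check: every label value in the ground-truth masks should be
# either < NUM_CLASSES or equal to IGNORE_INDEX; anything else can make
# label-indexed CUDA kernels read or write out of bounds.
import glob

import numpy as np
from PIL import Image

NUM_CLASSES = 2        # placeholder: must match decode_head.num_classes
IGNORE_INDEX = 255     # mmseg decode heads default to 255

for path in glob.glob('data/DUTS/annotations/**/*.png', recursive=True):
    labels = np.unique(np.array(Image.open(path)))
    bad = [int(v) for v in labels if v >= NUM_CLASSES and v != IGNORE_INDEX]
    if bad:
        print(f'{path}: unexpected label values {bad}')

If the masks contain raw grayscale values such as 128 or 255 for foreground, they need to be remapped to consecutive class indices before training.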

@deepakkupanda
Copy link
Author

@MeowZheng Can you tell me where exactly to look for ignore_index? I am not able to locate it.

@deepakkupanda
Copy link
Author

@MeowZheng Gentle reminder: where can I locate ignore_index?
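
For reference, in mmsegmentation ignore_index is normally an argument of the decode head (and of the auxiliary head, if one is used) and defaults to 255; the head forwards it to the loss and to the accuracy computation in decode_head.py. A hedged sketch of where it would sit in a model config, using library defaults rather than the actual values from the upernet_beit config in this issue:

# Excerpt of an mmseg model config: ignore_index lives on the heads and
# defaults to 255 when it is not written out explicitly.
model = dict(
    decode_head=dict(
        type='UPerHead',
        num_classes=150,      # must match the dataset's label range
        ignore_index=255,     # labels equal to this value are skipped by the loss
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
    ),
    auxiliary_head=dict(
        type='FCNHead',
        num_classes=150,
        ignore_index=255,
    ),
)

If the DUTS masks are binary, num_classes and the values stored in the masks have to agree with these settings; a mismatch would be consistent with the out-of-bounds access reported in accuracy.py above.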
