
Non-reproducible CUDA c10::Error when evaluating segmentation models #395

Closed
ant0nsc opened this issue Feb 12, 2021 · 4 comments · Fixed by #411
Labels: bug (Something isn't working), triaged (An item on Azure Boards has been created and prioritized)

ant0nsc commented Feb 12, 2021

Karl is reporting errors when running evaluation on segmentation models. They appear in a multi-process loop that iterates over the raw GPU model outputs and computes evaluation metrics. At this point the GPU should no longer be in use, yet we still get a GPU exception.
This happens on datasets with relatively small volumes (16 x 512 x 512).

2021-02-10T02:43:56Z INFO     Evaluating predictions for patient 81
2021-02-10T02:43:56Z INFO     Evaluating predictions for patient 78
2021-02-10T02:43:56Z INFO     Evaluating predictions for patient 91
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: initialization error Exception raised from insert_events at /opt/conda/conda-bld/pytorch_1595629427478/work/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7fb79cf9177d in /azureml-envs/azureml_d3cd66f6cb9fc8a62aad64d8c6792150/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1130 (0x7fb79d1e2370 in /azureml-envs/azureml_d3cd66f6cb9fc8a62aad64d8c6792150/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fb79cf7db1d in /azureml-envs/azureml_d3cd66f6cb9fc8a62aad64d8c6792150/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x53f0ea (0x7fb7daaf50ea in /azureml-envs/azureml_d3cd66f6cb9fc8a62aad64d8c6792150/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x1ad98f (0x55c23070f98f in /azureml-envs/azureml_d3cd66f6cb9fc8a62aad64d8c6792150/bin/python)

[SNIP]
frame #63: _PyFunction_FastCallDict + 0x400 (0x55c230679800 in /azureml-envs/azureml_d3cd66f6cb9fc8a62aad64d8c6792150/bin/python)

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: initialization error Exception raised from insert_events at /opt/conda/conda-bld/pytorch_1595629427478/work/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7fb79cf9177d in /azureml-envs/azureml_d3cd66f6cb9fc8a62aad64d8c6792150/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1130 (0x7fb79d1e2370 in /azureml-envs/azureml_d3cd66f6cb9fc8a62aad64d8c6792150/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fb79cf7db1d in /azureml-envs/azureml_d3cd66f6cb9fc8a62aad64d8c6792150/lib/python3.7/site-packages/torch/lib/libc10.so)

AB#3786
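
A note on the likely mechanism (my reading, not confirmed): this particular "CUDA error: initialization error" raised from CUDACachingAllocator::insert_events typically shows up when a CUDA tensor is released inside a forked child process that never initialized its own CUDA context. A minimal sketch of the two standard mitigations, assuming a plain multiprocessing pool over per-patient outputs (function and variable names here are illustrative, not the InnerEye code):

import multiprocessing as mp
import torch

def compute_metrics(cpu_output: torch.Tensor) -> float:
    # Workers only ever see CPU tensors, so they never need a CUDA context of their own.
    return cpu_output.float().mean().item()

def evaluate(outputs_on_gpu):
    # Mitigation 1: detach and copy the model outputs to CPU before any worker touches them,
    # so no CUDA storage is ever freed inside a child process.
    cpu_outputs = [o.detach().cpu() for o in outputs_on_gpu]
    # Mitigation 2: use the "spawn" start method so children do not inherit CUDA state via fork.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=4) as pool:
        return pool.map(compute_metrics, cpu_outputs)

Either change alone is usually sufficient in similar cases; moving outputs to CPU first also avoids holding GPU memory while metrics are computed.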

ant0nsc commented Feb 15, 2021

When running the code in PR #396: "If I submit with --num_nodes=4 then, both with use_mixed_precision=False and with use_mixed_precision=True, the child jobs all fail while evaluating predictions for the test data, and all fail after printing out “evaluating predictions for patient n” for 24 of the 39 patients in the test set. The stack trace is the same as in the GitHub issue that you submitted."
Suspicion: The multi-node communication could be the culprit. It is still set up to talk to the other nodes, but we only use it on one.
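
One way to test that suspicion would be a guard of the following shape before the metrics loop (illustrative only, using the public torch.distributed API, not a reference to existing InnerEye code):

import torch.distributed as dist

# Hypothetical guard: if a NCCL/Gloo process group is still initialized from training,
# shut it down before computing metrics so stale communication state cannot interfere.
if dist.is_available() and dist.is_initialized():
    dist.destroy_process_group()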

@ant0nsc ant0nsc added the bug Something isn't working label Feb 15, 2021
@ant0nsc ant0nsc added this to Planned in InnerEye via automation Feb 15, 2021
@ant0nsc ant0nsc moved this from Planned to Bugs & Feature Parity in InnerEye Feb 15, 2021
ant0nsc commented Feb 23, 2021

Update: This issue also pops up with single-node jobs, so multi-node communication is not the culprit.

ant0nsc commented Mar 4, 2021

Example job: HD_757afc23-96f1-402f-9fc8-046ad2d5f0fd_1 in antonsc_back_to_daily_build in the radiomics workspace fails with a c10::Error on the Prostate daily build.

@ant0nsc ant0nsc changed the title Non-reproducible error when evaluating segmentation models Non-reproducible CUDA error c10::Error when evaluating segmentation models Mar 5, 2021
@ant0nsc ant0nsc changed the title Non-reproducible CUDA error c10::Error when evaluating segmentation models Non-reproducible CUDA c10::Error when evaluating segmentation models Mar 5, 2021
@ant0nsc ant0nsc self-assigned this Mar 6, 2021
@ant0nsc ant0nsc added the triaged An item on Azure Boards has been created and prioritized label Mar 9, 2021
ant0nsc commented Mar 16, 2021

This appears to have been fixed with the PyTorch 1.8 upgrade in #411.

@ant0nsc ant0nsc closed this as completed Mar 16, 2021
InnerEye automation moved this from Bugs & Feature Parity to Done Mar 16, 2021
@ant0nsc ant0nsc linked a pull request Mar 16, 2021 that will close this issue