
Non-reproducible CUDA c10::Error when evaluating segmentation models #395

Closed
ant0nsc opened this issue Feb 12, 2021 · 4 comments · Fixed by #411
Labels: bug (Something isn't working), triaged (An item on Azure Boards has been created and prioritized)

ant0nsc commented Feb 12, 2021

Karl is reporting errors when running evaluation on segmentation models. They appear in a multi-process loop that iterates over the raw GPU model outputs and computes evaluation metrics. At this point the GPU should no longer be in use, yet we still get a GPU exception.
This happens on datasets with relatively small volumes (16 x 512 x 512).

2021-02-10T02:43:56Z INFO     Evaluating predictions for patient 81
2021-02-10T02:43:56Z INFO     Evaluating predictions for patient 78
2021-02-10T02:43:56Z INFO     Evaluating predictions for patient 91
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: initialization error Exception raised from insert_events at /opt/conda/conda-bld/pytorch_1595629427478/work/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7fb79cf9177d in /azureml-envs/azureml_d3cd66f6cb9fc8a62aad64d8c6792150/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1130 (0x7fb79d1e2370 in /azureml-envs/azureml_d3cd66f6cb9fc8a62aad64d8c6792150/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fb79cf7db1d in /azureml-envs/azureml_d3cd66f6cb9fc8a62aad64d8c6792150/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x53f0ea (0x7fb7daaf50ea in /azureml-envs/azureml_d3cd66f6cb9fc8a62aad64d8c6792150/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x1ad98f (0x55c23070f98f in /azureml-envs/azureml_d3cd66f6cb9fc8a62aad64d8c6792150/bin/python)

[SNIP]
frame #63: _PyFunction_FastCallDict + 0x400 (0x55c230679800 in /azureml-envs/azureml_d3cd66f6cb9fc8a62aad64d8c6792150/bin/python)

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: initialization error Exception raised from insert_events at /opt/conda/conda-bld/pytorch_1595629427478/work/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7fb79cf9177d in /azureml-envs/azureml_d3cd66f6cb9fc8a62aad64d8c6792150/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1130 (0x7fb79d1e2370 in /azureml-envs/azureml_d3cd66f6cb9fc8a62aad64d8c6792150/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fb79cf7db1d in /azureml-envs/azureml_d3cd66f6cb9fc8a62aad64d8c6792150/lib/python3.7/site-packages/torch/lib/libc10.so)

AB#3786
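
A note on the likely mechanism (my reading, not confirmed): this particular "CUDA error: initialization error" raised from CUDACachingAllocator::insert_events typically shows up when a CUDA tensor is released inside a forked child process that never initialized its own CUDA context. A minimal sketch of the two standard mitigations, assuming a plain multiprocessing pool over per-patient outputs (function and variable names here are illustrative, not the InnerEye code):

import multiprocessing as mp
import torch

def compute_metrics(cpu_output: torch.Tensor) -> float:
    # Workers only ever see CPU tensors, so they never need a CUDA context of their own.
    return cpu_output.float().mean().item()

def evaluate(outputs_on_gpu):
    # Mitigation 1: detach and copy the model outputs to CPU before any worker touches them,
    # so no CUDA storage is ever freed inside a child process.
    cpu_outputs = [o.detach().cpu() for o in outputs_on_gpu]
    # Mitigation 2: use the "spawn" start method so children do not inherit CUDA state via fork.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=4) as pool:
        return pool.map(compute_metrics, cpu_outputs)

Either change alone is usually sufficient in similar cases; moving outputs to CPU first also avoids holding GPU memory while metrics are computed.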

ant0nsc commented Feb 15, 2021

When running the code in PR #396: "If I submit with --num_nodes=4 then, both with use_mixed_precision=False and with use_mixed_precision=True, the child jobs all fail while evaluating predictions for the test data, and all fail after printing out “evaluating predictions for patient n” for 24 of the 39 patients in the test set. The stack trace is the same as in the GitHub issue that you submitted."
Suspicion: The multi-node communication could be the culprit. It is still set up to talk to the other nodes, but we only use it on one.
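
One way to test that suspicion would be a guard of the following shape before the metrics loop (illustrative only, using the public torch.distributed API, not a reference to existing InnerEye code):

import torch.distributed as dist

# Hypothetical guard: if a NCCL/Gloo process group is still initialized from training,
# shut it down before computing metrics so stale communication state cannot interfere.
if dist.is_available() and dist.is_initialized():
    dist.destroy_process_group()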

@ant0nsc ant0nsc added the bug Something isn't working label Feb 15, 2021
@ant0nsc ant0nsc added this to Planned in InnerEye via automation Feb 15, 2021
@ant0nsc ant0nsc moved this from Planned to Bugs & Feature Parity in InnerEye Feb 15, 2021
ant0nsc commented Feb 23, 2021

Update: This issue also pops up with single-node jobs, so multi-node communication is not the culprit.

ant0nsc commented Mar 4, 2021

Example job: HD_757afc23-96f1-402f-9fc8-046ad2d5f0fd_1 in antonsc_back_to_daily_build in the radiomics workspace fails with a c10::Error on the Prostate daily build.

@ant0nsc ant0nsc changed the title Non-reproducible error when evaluating segmentation models Non-reproducible CUDA error c10::Error when evaluating segmentation models Mar 5, 2021
@ant0nsc ant0nsc changed the title Non-reproducible CUDA error c10::Error when evaluating segmentation models Non-reproducible CUDA c10::Error when evaluating segmentation models Mar 5, 2021
@ant0nsc ant0nsc self-assigned this Mar 6, 2021
@ant0nsc ant0nsc added the triaged An item on Azure Boards has been created and prioritized label Mar 9, 2021
ant0nsc commented Mar 16, 2021

This appears to have been fixed with the PyTorch 1.8 upgrade in #411.

@ant0nsc ant0nsc closed this as completed Mar 16, 2021
InnerEye automation moved this from Bugs & Feature Parity to Done Mar 16, 2021
@ant0nsc ant0nsc linked a pull request Mar 16, 2021 that will close this issue