Non-reproducible CUDA c10::Error when evaluating segmentation models
#395
Comments
When running the code in PR #396: "If I submit with --num_nodes=4 then, both with use_mixed_precision=False and with use_mixed_precision=True, the child jobs all fail while evaluating predictions for the test data, and all fail after printing out “evaluating predictions for patient n” for 24 of the 39 patients in the test set. The stack trace is the same as in the GitHub issue that you submitted."
Update: This issue also occurs with single-node jobs, so multi-node communication is not the cause.
Example job:
This appears to have been fixed with the PyTorch 1.8 upgrade in #411.
Karl is reporting errors when running evaluation on segmentation models. The errors appear while a multi-process loop iterates over the raw model outputs produced on the GPU and computes evaluation metrics. At this point the GPU should no longer be in use, yet we still get a CUDA exception.
This happens on datasets that have relatively small volumes (16 x 512 x 512).
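For illustration only, here is a minimal sketch of the evaluation pattern described above. It is not the code from this repository and not the fix from #411. It assumes the raw outputs are torch tensors (or arrays) that can be converted to NumPy on the CPU before they are handed to worker processes, so the workers themselves never touch CUDA. The names `to_cpu_array`, `dice_score`, and `evaluate_in_parallel` are hypothetical, and Dice is just a stand-in metric.

```python
import multiprocessing as mp

import numpy as np


def to_cpu_array(output):
    """Return a plain NumPy array so worker processes never receive a CUDA tensor.

    Assumption: GPU outputs are torch tensors, which expose detach() and cpu().
    Anything else is passed through np.asarray.
    """
    if hasattr(output, "detach"):
        output = output.detach().cpu().numpy()
    return np.asarray(output)


def dice_score(pair):
    """Dice overlap between a binary prediction and a binary label (stand-in metric)."""
    prediction, label = pair
    prediction = prediction.astype(bool)
    label = label.astype(bool)
    denominator = prediction.sum() + label.sum()
    if denominator == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(prediction, label).sum() / denominator


def evaluate_in_parallel(predictions, labels, processes=2):
    """Compute one Dice score per patient in a pool of CPU-only workers."""
    # Convert in the parent process, before anything is sent to the workers.
    pairs = [(to_cpu_array(p), to_cpu_array(l)) for p, l in zip(predictions, labels)]
    with mp.Pool(processes=processes) as pool:
        return pool.map(dice_score, pairs)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Small stand-in volumes; the datasets in this issue are 16 x 512 x 512.
    predictions = [rng.integers(0, 2, size=(16, 32, 32)) for _ in range(4)]
    labels = [rng.integers(0, 2, size=(16, 32, 32)) for _ in range(4)]
    print(evaluate_in_parallel(predictions, labels))
```

If a CUDA error still shows up with a loop of this shape, the workers are probably inheriting GPU state from the parent process rather than touching the GPU through the metric code itself.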
AB#3786