Training a model with these parameters: --model=HeadAndNeck --train=True --cluster=training-nd40-v2 --run_recovery_id=jaalvare_bigcluster:jaalvare_bigcluster_1607291058_70ac48b7 --start_epoch=120
Results in:
2021-01-05T13:13:20Z ERROR Model training/testing failed. Exception: CUDA out of memory. Tried to allocate 12.74 GiB (GPU 0; 31.75 GiB total capacity; 12.10 GiB already allocated; 8.44 GiB free; 22.06 GiB reserved in total by PyTorch)
Traceback (most recent call last):
File "/mnt/batch/tasks/shared/LS_root/jobs/radiomicsnn/azureml/jaalvare_bigcluster_1609851623_cabf55b6/mounts/workspaceblobstore/azureml/jaalvare_bigcluster_1609851623_cabf55b6/innereye-deeplearning/InnerEye/ML/runner.py", line 384, in run_in_situ
self.create_ml_runner().run()
File "/mnt/batch/tasks/shared/LS_root/jobs/radiomicsnn/azureml/jaalvare_bigcluster_1609851623_cabf55b6/mounts/workspaceblobstore/azureml/jaalvare_bigcluster_1609851623_cabf55b6/innereye-deeplearning/InnerEye/ML/run_ml.py", line 263, in run
model_train(self.model_config, checkpoint_handler)
File "/mnt/batch/tasks/shared/LS_root/jobs/radiomicsnn/azureml/jaalvare_bigcluster_1609851623_cabf55b6/mounts/workspaceblobstore/azureml/jaalvare_bigcluster_1609851623_cabf55b6/innereye-deeplearning/InnerEye/ML/model_training.py", line 145, in model_train
train_epoch_results = train_or_validate_epoch(training_steps)
File "/mnt/batch/tasks/shared/LS_root/jobs/radiomicsnn/azureml/jaalvare_bigcluster_1609851623_cabf55b6/mounts/workspaceblobstore/azureml/jaalvare_bigcluster_1609851623_cabf55b6/innereye-deeplearning/InnerEye/ML/model_training.py", line 285, in train_or_validate_epoch
sample, batch_index, train_val_params.epoch)
File "/mnt/batch/tasks/shared/LS_root/jobs/radiomicsnn/azureml/jaalvare_bigcluster_1609851623_cabf55b6/mounts/workspaceblobstore/azureml/jaalvare_bigcluster_1609851623_cabf55b6/innereye-deeplearning/InnerEye/ML/model_training_steps.py", line 661, in forward_and_backward_minibatch
allow_multiple_classes_for_each_pixel=True).cpu().numpy()
File "/mnt/batch/tasks/shared/LS_root/jobs/radiomicsnn/azureml/jaalvare_bigcluster_1609851623_cabf55b6/mounts/workspaceblobstore/azureml/jaalvare_bigcluster_1609851623_cabf55b6/innereye-deeplearning/InnerEye/ML/metrics.py", line 428, in compute_dice_across_patches
intersection = 2.0 * torch.sum(one_hot_segmentation & ground_truth, dim=-1).float()
RuntimeError: CUDA out of memory. Tried to allocate 12.74 GiB (GPU 0; 31.75 GiB total capacity; 12.10 GiB already allocated; 8.44 GiB free; 22.06 GiB reserved in total by PyTorch)
Starting the daemon thread to refresh tokens in background for process with pid = 82
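The traceback shows the failing allocation inside compute_dice_across_patches: the bitwise AND of the one-hot segmentation and the one-hot ground truth materialises a boolean tensor covering every class and voxel of the crop before it is summed, which is where a single request on the order of the reported 12.74 GiB can come from. Below is a minimal sketch of how that peak could be bounded by summing over voxel chunks, assuming boolean inputs shaped [..., num_voxels] as in the traceback; the dice_intersection_chunked helper, its chunk_size default, and the example shapes are illustrative assumptions, not InnerEye code.

```python
import torch

def dice_intersection_chunked(one_hot_segmentation: torch.Tensor,
                              ground_truth: torch.Tensor,
                              chunk_size: int = 4_000_000) -> torch.Tensor:
    """Hypothetical helper: 2 * sum(seg & gt) over the last dimension, computed
    in voxel chunks so the boolean intermediate never spans the whole patch at
    once. Assumes both tensors are boolean and shaped [..., num_voxels]."""
    total = torch.zeros(one_hot_segmentation.shape[:-1], dtype=torch.float32,
                        device=one_hot_segmentation.device)
    num_voxels = one_hot_segmentation.shape[-1]
    for start in range(0, num_voxels, chunk_size):
        chunk = slice(start, min(start + chunk_size, num_voxels))
        total += torch.sum(one_hot_segmentation[..., chunk] & ground_truth[..., chunk],
                           dim=-1).float()
    return 2.0 * total

# Example: 2 patches, 10 classes, 2**21 voxels each (roughly a 128^3 crop).
seg = torch.rand(2, 10, 2 ** 21) > 0.5
gt = torch.rand(2, 10, 2 ** 21) > 0.5
intersection = dice_intersection_chunked(seg, gt)
```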
javier-alvarez changed the title from "Restarting from checkpoint on large model goes out of memory" to "Restarting training from checkpoint on large model goes out of memory" on Jan 5, 2021.
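Since the failure only appears when restarting from a recovery checkpoint (--run_recovery_id with --start_epoch=120), one possible contributor is the checkpoint being loaded directly onto the GPU, so the restored weights and optimizer state briefly coexist with the freshly constructed model in device memory. A minimal sketch of the usual mitigation, mapping the checkpoint to CPU before moving the model to the GPU, is shown below; it assumes a plain PyTorch checkpoint layout, and the file name, dictionary keys, and the Conv3d stand-in model are assumptions, not InnerEye's checkpoint format.

```python
import torch
import torch.nn as nn

# Minimal sketch, not the InnerEye code path. The idea: load the recovery
# checkpoint onto the CPU first, so a second full copy of the weights is never
# allocated on the GPU while the state dict is being restored.
model = nn.Conv3d(1, 8, kernel_size=3)  # hypothetical stand-in for the HeadAndNeck model
optimizer = torch.optim.Adam(model.parameters())

# Stand-in for the recovery checkpoint written by the earlier run (file and key
# names are assumptions).
torch.save({"model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict()},
           "recovery_checkpoint.pth")

# Restore: map everything to CPU, load into the model, then move it to the GPU.
checkpoint = torch.load("recovery_checkpoint.pth", map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
model = model.cuda()

# Rebuild the optimizer from the GPU parameters before restoring its state;
# Optimizer.load_state_dict casts the saved state to the parameters' device.
optimizer = torch.optim.Adam(model.parameters())
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
```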
AB#3907