This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Restarting training from checkpoint on large model goes out of memory #351

Open
javier-alvarez opened this issue Jan 5, 2021 · 1 comment
@javier-alvarez
Contributor

javier-alvarez commented Jan 5, 2021

Training a model with these parameters: --model=HeadAndNeck --train=True --cluster=training-nd40-v2 --run_recovery_id=jaalvare_bigcluster:jaalvare_bigcluster_1607291058_70ac48b7 --start_epoch=120

Results in:

2021-01-05T13:13:20Z ERROR Model training/testing failed. Exception: CUDA out of memory. Tried to allocate 12.74 GiB (GPU 0; 31.75 GiB total capacity; 12.10 GiB already allocated; 8.44 GiB free; 22.06 GiB reserved in total by PyTorch)
Traceback (most recent call last):
File "/mnt/batch/tasks/shared/LS_root/jobs/radiomicsnn/azureml/jaalvare_bigcluster_1609851623_cabf55b6/mounts/workspaceblobstore/azureml/jaalvare_bigcluster_1609851623_cabf55b6/innereye-deeplearning/InnerEye/ML/runner.py", line 384, in run_in_situ
self.create_ml_runner().run()
File "/mnt/batch/tasks/shared/LS_root/jobs/radiomicsnn/azureml/jaalvare_bigcluster_1609851623_cabf55b6/mounts/workspaceblobstore/azureml/jaalvare_bigcluster_1609851623_cabf55b6/innereye-deeplearning/InnerEye/ML/run_ml.py", line 263, in run
model_train(self.model_config, checkpoint_handler)
File "/mnt/batch/tasks/shared/LS_root/jobs/radiomicsnn/azureml/jaalvare_bigcluster_1609851623_cabf55b6/mounts/workspaceblobstore/azureml/jaalvare_bigcluster_1609851623_cabf55b6/innereye-deeplearning/InnerEye/ML/model_training.py", line 145, in model_train
train_epoch_results = train_or_validate_epoch(training_steps)
File "/mnt/batch/tasks/shared/LS_root/jobs/radiomicsnn/azureml/jaalvare_bigcluster_1609851623_cabf55b6/mounts/workspaceblobstore/azureml/jaalvare_bigcluster_1609851623_cabf55b6/innereye-deeplearning/InnerEye/ML/model_training.py", line 285, in train_or_validate_epoch
sample, batch_index, train_val_params.epoch)
File "/mnt/batch/tasks/shared/LS_root/jobs/radiomicsnn/azureml/jaalvare_bigcluster_1609851623_cabf55b6/mounts/workspaceblobstore/azureml/jaalvare_bigcluster_1609851623_cabf55b6/innereye-deeplearning/InnerEye/ML/model_training_steps.py", line 661, in forward_and_backward_minibatch
allow_multiple_classes_for_each_pixel=True).cpu().numpy()
File "/mnt/batch/tasks/shared/LS_root/jobs/radiomicsnn/azureml/jaalvare_bigcluster_1609851623_cabf55b6/mounts/workspaceblobstore/azureml/jaalvare_bigcluster_1609851623_cabf55b6/innereye-deeplearning/InnerEye/ML/metrics.py", line 428, in compute_dice_across_patches
intersection = 2.0 * torch.sum(one_hot_segmentation & ground_truth, dim=-1).float()
RuntimeError: CUDA out of memory. Tried to allocate 12.74 GiB (GPU 0; 31.75 GiB total capacity; 12.10 GiB already allocated; 8.44 GiB free; 22.06 GiB reserved in total by PyTorch)
Starting the daemon thread to refresh tokens in background for process with pid = 82

AB#3907
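
The allocation fails inside compute_dice_across_patches, where the boolean AND of the one-hot segmentation and the ground truth is materialised in full before the sum reduction. Below is a minimal sketch of a lower-memory alternative; it is not the InnerEye implementation, and the tensor names, shapes, and chunk size are assumptions taken from the traceback.

import torch

def chunked_dice_intersection(one_hot_segmentation: torch.Tensor,
                              ground_truth: torch.Tensor,
                              chunk_size: int = 1_000_000) -> torch.Tensor:
    # Computes 2 * sum(seg & gt, dim=-1) without materialising the full
    # boolean AND tensor; only one chunk-sized intermediate lives on the GPU
    # at any time. chunk_size is an illustrative value, not a tuned one.
    total = torch.zeros(one_hot_segmentation.shape[:-1],
                        dtype=torch.float32,
                        device=one_hot_segmentation.device)
    num_voxels = one_hot_segmentation.shape[-1]
    for start in range(0, num_voxels, chunk_size):
        end = min(start + chunk_size, num_voxels)
        total += torch.sum(one_hot_segmentation[..., start:end]
                           & ground_truth[..., start:end], dim=-1).float()
    return 2.0 * total

Since the traceback shows the result is moved off the GPU anyway (.cpu().numpy()), computing the Dice intersection on CPU would be another option, at the cost of a device transfer.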

@javier-alvarez javier-alvarez added the bug (Something isn't working) label Jan 5, 2021
@javier-alvarez javier-alvarez changed the title from "Restarting from checkpoint on large model goes out of memory" to "Restarting training from checkpoint on large model goes out of memory" Jan 5, 2021
@ant0nsc ant0nsc added this to Backlog in InnerEye via automation Feb 3, 2021
@ant0nsc ant0nsc added and then removed the no changelog needed (CHANGELOG.md does not need to be updated in this PR) label Apr 9, 2021
@peterhessey
Contributor

This needs more investigation; I will test that the run_recovery flag still works as expected before looking into this specific case.

@peterhessey peterhessey self-assigned this May 19, 2022