This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Restarting training from checkpoint on large model goes out of memory #351

Open
javier-alvarez opened this issue Jan 5, 2021 · 1 comment
@javier-alvarez
Contributor

javier-alvarez commented Jan 5, 2021

Training a model with these parameters: --model=HeadAndNeck --train=True --cluster=training-nd40-v2 --run_recovery_id=jaalvare_bigcluster:jaalvare_bigcluster_1607291058_70ac48b7 --start_epoch=120

Results in:

2021-01-05T13:13:20Z ERROR Model training/testing failed. Exception: CUDA out of memory. Tried to allocate 12.74 GiB (GPU 0; 31.75 GiB total capacity; 12.10 GiB already allocated; 8.44 GiB free; 22.06 GiB reserved in total by PyTorch)
Traceback (most recent call last):
File "/mnt/batch/tasks/shared/LS_root/jobs/radiomicsnn/azureml/jaalvare_bigcluster_1609851623_cabf55b6/mounts/workspaceblobstore/azureml/jaalvare_bigcluster_1609851623_cabf55b6/innereye-deeplearning/InnerEye/ML/runner.py", line 384, in run_in_situ
self.create_ml_runner().run()
File "/mnt/batch/tasks/shared/LS_root/jobs/radiomicsnn/azureml/jaalvare_bigcluster_1609851623_cabf55b6/mounts/workspaceblobstore/azureml/jaalvare_bigcluster_1609851623_cabf55b6/innereye-deeplearning/InnerEye/ML/run_ml.py", line 263, in run
model_train(self.model_config, checkpoint_handler)
File "/mnt/batch/tasks/shared/LS_root/jobs/radiomicsnn/azureml/jaalvare_bigcluster_1609851623_cabf55b6/mounts/workspaceblobstore/azureml/jaalvare_bigcluster_1609851623_cabf55b6/innereye-deeplearning/InnerEye/ML/model_training.py", line 145, in model_train
train_epoch_results = train_or_validate_epoch(training_steps)
File "/mnt/batch/tasks/shared/LS_root/jobs/radiomicsnn/azureml/jaalvare_bigcluster_1609851623_cabf55b6/mounts/workspaceblobstore/azureml/jaalvare_bigcluster_1609851623_cabf55b6/innereye-deeplearning/InnerEye/ML/model_training.py", line 285, in train_or_validate_epoch
sample, batch_index, train_val_params.epoch)
File "/mnt/batch/tasks/shared/LS_root/jobs/radiomicsnn/azureml/jaalvare_bigcluster_1609851623_cabf55b6/mounts/workspaceblobstore/azureml/jaalvare_bigcluster_1609851623_cabf55b6/innereye-deeplearning/InnerEye/ML/model_training_steps.py", line 661, in forward_and_backward_minibatch
allow_multiple_classes_for_each_pixel=True).cpu().numpy()
File "/mnt/batch/tasks/shared/LS_root/jobs/radiomicsnn/azureml/jaalvare_bigcluster_1609851623_cabf55b6/mounts/workspaceblobstore/azureml/jaalvare_bigcluster_1609851623_cabf55b6/innereye-deeplearning/InnerEye/ML/metrics.py", line 428, in compute_dice_across_patches
intersection = 2.0 * torch.sum(one_hot_segmentation & ground_truth, dim=-1).float()
RuntimeError: CUDA out of memory. Tried to allocate 12.74 GiB (GPU 0; 31.75 GiB total capacity; 12.10 GiB already allocated; 8.44 GiB free; 22.06 GiB reserved in total by PyTorch)
Starting the daemon thread to refresh tokens in background for process with pid = 82

AB#3907
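
The allocation fails inside compute_dice_across_patches, where the boolean AND of the one-hot segmentation and the ground truth is materialised in full before the sum reduction. Below is a minimal sketch of a lower-memory alternative; it is not the InnerEye implementation, and the tensor names, shapes, and chunk size are assumptions taken from the traceback.

import torch

def chunked_dice_intersection(one_hot_segmentation: torch.Tensor,
                              ground_truth: torch.Tensor,
                              chunk_size: int = 1_000_000) -> torch.Tensor:
    # Computes 2 * sum(seg & gt, dim=-1) without materialising the full
    # boolean AND tensor; only one chunk-sized intermediate lives on the GPU
    # at any time. chunk_size is an illustrative value, not a tuned one.
    total = torch.zeros(one_hot_segmentation.shape[:-1],
                        dtype=torch.float32,
                        device=one_hot_segmentation.device)
    num_voxels = one_hot_segmentation.shape[-1]
    for start in range(0, num_voxels, chunk_size):
        end = min(start + chunk_size, num_voxels)
        total += torch.sum(one_hot_segmentation[..., start:end]
                           & ground_truth[..., start:end], dim=-1).float()
    return 2.0 * total

Since the traceback shows the result is moved off the GPU anyway (.cpu().numpy()), computing the Dice intersection on CPU would be another option, at the cost of a device transfer.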

@javier-alvarez javier-alvarez added the bug (Something isn't working) label Jan 5, 2021
@javier-alvarez javier-alvarez changed the title from "Restarting from checkpoint on large model goes out of memory" to "Restarting training from checkpoint on large model goes out of memory" Jan 5, 2021
@ant0nsc ant0nsc added this to Backlog in InnerEye via automation Feb 3, 2021
@ant0nsc ant0nsc added and then removed the no changelog needed (CHANGELOG.md does not need to be updated in this PR) label Apr 9, 2021
@peterhessey
Contributor

This needs more investigation; I will test that the run_recovery flag still works as expected before looking into this specific case.

@peterhessey peterhessey self-assigned this May 19, 2022