
Run recovery of a typical PR model fails with a cuda/cpu error #198

Closed
ant0nsc opened this issue Sep 4, 2020 · 0 comments · Fixed by #259
Labels: bug (Something isn't working)

Comments

ant0nsc (Contributor) commented Sep 4, 2020

Running training with run recovery on a BasicModel2Epochs model fails with:

2020-09-04T21:10:54Z ERROR    Model training/testing failed. Exception: expected device cpu but got device cuda:0
Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/runner.py", line 313, in run_in_situ
    self.create_ml_runner().run()
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/run_ml.py", line 199, in run
    model_train(self.model_config, run_recovery)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/model_training.py", line 147, in model_train
    train_epoch_results = train_or_validate_epoch(training_steps)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/model_training.py", line 283, in train_or_validate_epoch
    sample, batch_index, train_val_params.epoch)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/model_training_steps.py", line 610, in forward_and_backward_minibatch
    mask=mask)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/pipelines/forward_pass.py", line 101, in forward_pass_patches
    result = self._forward_pass_with_anomaly_detection(patches=patches, mask=mask, labels=labels)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/pipelines/forward_pass.py", line 119, in _forward_pass_with_anomaly_detection
    return self._forward_pass(patches, mask, labels)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/pipelines/forward_pass.py", line 144, in _forward_pass
    single_optimizer_step(self.config, loss, self.optimizer)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/pipelines/forward_pass.py", line 188, in single_optimizer_step
    optimizer.step(closure=None)
  File "/azureml-envs/azureml_25e24d323549738d2dc10e4c7fc4746d/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 51, in wrapper
    return wrapped(*args, **kwargs)
  File "/azureml-envs/azureml_25e24d323549738d2dc10e4c7fc4746d/lib/python3.7/site-packages/apex/amp/_initialize.py", line 242, in new_step
    output = old_step(*args, **kwargs)
  File "/azureml-envs/azureml_25e24d323549738d2dc10e4c7fc4746d/lib/python3.7/site-packages/torch/optim/adam.py", line 95, in step
    exp_avg.mul_(beta1).add_(1 - beta1, grad)
  File "/azureml-envs/azureml_25e24d323549738d2dc10e4c7fc4746d/lib/python3.7/site-packages/apex/amp/wrap.py", line 101, in wrapper
    return orig_fn(arg0, *args, **kwargs)
RuntimeError: expected device cpu but got device cuda:0
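
The failure appears to come from the order of operations during run recovery: the optimizer state is restored while the model parameters still live on the CPU, and the model is only moved to the GPU afterwards, so Adam's exp_avg/exp_avg_sq buffers stay on the CPU while the gradients are on cuda:0. A minimal sketch of that presumed failure mode (standalone PyTorch, without apex/amp; the model and tensor shapes are placeholders, not InnerEye code):

import torch
from torch import nn

# Sketch of the presumed failure mode: the optimizer checkpoint is restored
# while the model is still on the CPU, and the model is moved to the GPU
# only afterwards.
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters())

# Take one step so that Adam has state (exp_avg, exp_avg_sq) to checkpoint.
model(torch.randn(1, 4)).sum().backward()
optimizer.step()
checkpoint = optimizer.state_dict()

# "Run recovery": fresh CPU model and optimizer, state restored on the CPU ...
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters())
optimizer.load_state_dict(checkpoint)   # exp_avg / exp_avg_sq end up on the CPU

# ... and only then is the model moved to the GPU.
model.cuda()
model(torch.randn(1, 4, device="cuda")).sum().backward()
optimizer.step()   # RuntimeError: expected device cpu but got device cuda:0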
ant0nsc added the "bug" (Something isn't working) label on Sep 4, 2020
ant0nsc added this to "To do" in InnerEye on Sep 10, 2020
Shruthi42 added a commit that referenced this issue Sep 14, 2020
Fixes issue #198 by moving the optimizer parameters to cuda.
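
The commit message suggests the immediate fix is to move the restored optimizer state onto the GPU before training resumes. A sketch of that pattern (the helper name move_optimizer_state_to_device is illustrative, not the actual InnerEye function):

import torch

def move_optimizer_state_to_device(optimizer: torch.optim.Optimizer,
                                    device: torch.device) -> None:
    # Move every tensor held in the optimizer state (e.g. Adam's exp_avg and
    # exp_avg_sq buffers) onto the given device.
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to(device)

# Example: call this after the model has been moved to the GPU and the
# optimizer checkpoint has been loaded.
# move_optimizer_state_to_device(optimizer, torch.device("cuda:0"))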
ant0nsc moved this from Backlog to Current Sprint in InnerEye on Sep 22, 2020
InnerEye automation moved this from Current Sprint to Done Oct 2, 2020
ant0nsc pushed a commit that referenced this issue Oct 2, 2020
- Separates the logic that determines which checkpoint/checkpoint path to recover from.
- Separates model creation and model checkpoint loading from optimizer creation and optimizer checkpoint loading, and keeps all of this in the ModelAndInfo class.
- Optimizers are now created after the model has been moved to the GPU. Fixes #198. (See the sketch after this list.)
- Adds a test to train_via_submodule.yml that continues training from a previous run using run recovery.
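
The key change is the ordering: if the optimizer is created only after the model has been moved to the GPU, Optimizer.load_state_dict restores the checkpointed state onto the parameters' device, so no CPU tensors are left behind. A sketch of that recovery order (the function name, placeholder model, and checkpoint keys are illustrative, not the ModelAndInfo API):

import torch
from torch import nn

def recover_model_and_optimizer(checkpoint_path: str, device: torch.device):
    # Illustrative recovery order: move the model to the target device first,
    # then create the optimizer from the moved parameters, then restore its
    # state. load_state_dict casts the restored state onto the parameters'
    # device, so exp_avg/exp_avg_sq end up on the GPU together with the grads.
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    model = nn.Linear(4, 2)                             # placeholder for the real model
    model.load_state_dict(checkpoint["model"])
    model.to(device)                                    # move the model first ...
    optimizer = torch.optim.Adam(model.parameters())    # ... then create the optimizer
    optimizer.load_state_dict(checkpoint["optimizer"])  # state lands on `device`
    return model, optimizer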