
Run recovery of a typical PR model fails with a cuda/cpu error #198

Closed
ant0nsc opened this issue Sep 4, 2020 · 0 comments · Fixed by #259
Labels: bug (Something isn't working)

Comments

ant0nsc (Contributor) commented Sep 4, 2020

Running training with run recovery on a BasicModel2Epochs model fails with:

2020-09-04T21:10:54Z ERROR    Model training/testing failed. Exception: expected device cpu but got device cuda:0
Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/runner.py", line 313, in run_in_situ
    self.create_ml_runner().run()
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/run_ml.py", line 199, in run
    model_train(self.model_config, run_recovery)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/model_training.py", line 147, in model_train
    train_epoch_results = train_or_validate_epoch(training_steps)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/model_training.py", line 283, in train_or_validate_epoch
    sample, batch_index, train_val_params.epoch)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/model_training_steps.py", line 610, in forward_and_backward_minibatch
    mask=mask)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/pipelines/forward_pass.py", line 101, in forward_pass_patches
    result = self._forward_pass_with_anomaly_detection(patches=patches, mask=mask, labels=labels)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/pipelines/forward_pass.py", line 119, in _forward_pass_with_anomaly_detection
    return self._forward_pass(patches, mask, labels)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/pipelines/forward_pass.py", line 144, in _forward_pass
    single_optimizer_step(self.config, loss, self.optimizer)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/pipelines/forward_pass.py", line 188, in single_optimizer_step
    optimizer.step(closure=None)
  File "/azureml-envs/azureml_25e24d323549738d2dc10e4c7fc4746d/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 51, in wrapper
    return wrapped(*args, **kwargs)
  File "/azureml-envs/azureml_25e24d323549738d2dc10e4c7fc4746d/lib/python3.7/site-packages/apex/amp/_initialize.py", line 242, in new_step
    output = old_step(*args, **kwargs)
  File "/azureml-envs/azureml_25e24d323549738d2dc10e4c7fc4746d/lib/python3.7/site-packages/torch/optim/adam.py", line 95, in step
    exp_avg.mul_(beta1).add_(1 - beta1, grad)
  File "/azureml-envs/azureml_25e24d323549738d2dc10e4c7fc4746d/lib/python3.7/site-packages/apex/amp/wrap.py", line 101, in wrapper
    return orig_fn(arg0, *args, **kwargs)
RuntimeError: expected device cpu but got device cuda:0
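
The failure appears to come from the order of operations during run recovery: the optimizer state is restored while the model parameters still live on the CPU, and the model is only moved to the GPU afterwards, so Adam's exp_avg/exp_avg_sq buffers stay on the CPU while the gradients are on cuda:0. A minimal sketch of that presumed failure mode (standalone PyTorch, without apex/amp; the model and tensor shapes are placeholders, not InnerEye code):

import torch
from torch import nn

# Sketch of the presumed failure mode: the optimizer checkpoint is restored
# while the model is still on the CPU, and the model is moved to the GPU
# only afterwards.
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters())

# Take one step so that Adam has state (exp_avg, exp_avg_sq) to checkpoint.
model(torch.randn(1, 4)).sum().backward()
optimizer.step()
checkpoint = optimizer.state_dict()

# "Run recovery": fresh CPU model and optimizer, state restored on the CPU ...
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters())
optimizer.load_state_dict(checkpoint)   # exp_avg / exp_avg_sq end up on the CPU

# ... and only then is the model moved to the GPU.
model.cuda()
model(torch.randn(1, 4, device="cuda")).sum().backward()
optimizer.step()   # RuntimeError: expected device cpu but got device cuda:0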
ant0nsc added the "bug" (Something isn't working) label on Sep 4, 2020
ant0nsc added this to "To do" in InnerEye on Sep 10, 2020
Shruthi42 added a commit that referenced this issue Sep 14, 2020
Fixes issue #198 by moving the optimizer parameters to cuda.
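
The commit message suggests the immediate fix is to move the restored optimizer state onto the GPU before training resumes. A sketch of that pattern (the helper name move_optimizer_state_to_device is illustrative, not the actual InnerEye function):

import torch

def move_optimizer_state_to_device(optimizer: torch.optim.Optimizer,
                                    device: torch.device) -> None:
    # Move every tensor held in the optimizer state (e.g. Adam's exp_avg and
    # exp_avg_sq buffers) onto the given device.
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to(device)

# Example: call this after the model has been moved to the GPU and the
# optimizer checkpoint has been loaded.
# move_optimizer_state_to_device(optimizer, torch.device("cuda:0"))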
ant0nsc moved this from Backlog to Current Sprint in InnerEye on Sep 22, 2020
InnerEye automation moved this from Current Sprint to Done Oct 2, 2020
ant0nsc pushed a commit that referenced this issue Oct 2, 2020
- Separates the logic that determines which checkpoint/checkpoint path to recover from.
- Separates model creation and model checkpoint loading from optimizer creation and optimizer checkpoint loading, and keeps all of this in the ModelAndInfo class.
- Optimizers are now created after the model has been moved to the GPU. Fixes #198. (See the sketch after this list.)
- Adds a test to train_via_submodule.yml that continues training from a previous run using run recovery.
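
The key change is the ordering: if the optimizer is created only after the model has been moved to the GPU, Optimizer.load_state_dict restores the checkpointed state onto the parameters' device, so no CPU tensors are left behind. A sketch of that recovery order (the function name, placeholder model, and checkpoint keys are illustrative, not the ModelAndInfo API):

import torch
from torch import nn

def recover_model_and_optimizer(checkpoint_path: str, device: torch.device):
    # Illustrative recovery order: move the model to the target device first,
    # then create the optimizer from the moved parameters, then restore its
    # state. load_state_dict casts the restored state onto the parameters'
    # device, so exp_avg/exp_avg_sq end up on the GPU together with the grads.
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    model = nn.Linear(4, 2)                             # placeholder for the real model
    model.load_state_dict(checkpoint["model"])
    model.to(device)                                    # move the model first ...
    optimizer = torch.optim.Adam(model.parameters())    # ... then create the optimizer
    optimizer.load_state_dict(checkpoint["optimizer"])  # state lands on `device`
    return model, optimizer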