Running training recovery on a BasicModel2Epochs model fails with:
```
2020-09-04T21:10:54Z ERROR Model training/testing failed. Exception: expected device cpu but got device cuda:0
Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/runner.py", line 313, in run_in_situ
    self.create_ml_runner().run()
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/run_ml.py", line 199, in run
    model_train(self.model_config, run_recovery)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/model_training.py", line 147, in model_train
    train_epoch_results = train_or_validate_epoch(training_steps)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/model_training.py", line 283, in train_or_validate_epoch
    sample, batch_index, train_val_params.epoch)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/model_training_steps.py", line 610, in forward_and_backward_minibatch
    mask=mask)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/pipelines/forward_pass.py", line 101, in forward_pass_patches
    result = self._forward_pass_with_anomaly_detection(patches=patches, mask=mask, labels=labels)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/pipelines/forward_pass.py", line 119, in _forward_pass_with_anomaly_detection
    return self._forward_pass(patches, mask, labels)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/pipelines/forward_pass.py", line 144, in _forward_pass
    single_optimizer_step(self.config, loss, self.optimizer)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/pipelines/forward_pass.py", line 188, in single_optimizer_step
    optimizer.step(closure=None)
  File "/azureml-envs/azureml_25e24d323549738d2dc10e4c7fc4746d/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 51, in wrapper
    return wrapped(*args, **kwargs)
  File "/azureml-envs/azureml_25e24d323549738d2dc10e4c7fc4746d/lib/python3.7/site-packages/apex/amp/_initialize.py", line 242, in new_step
    output = old_step(*args, **kwargs)
  File "/azureml-envs/azureml_25e24d323549738d2dc10e4c7fc4746d/lib/python3.7/site-packages/torch/optim/adam.py", line 95, in step
    exp_avg.mul_(beta1).add_(1 - beta1, grad)
  File "/azureml-envs/azureml_25e24d323549738d2dc10e4c7fc4746d/lib/python3.7/site-packages/apex/amp/wrap.py", line 101, in wrapper
    return orig_fn(arg0, *args, **kwargs)
RuntimeError: expected device cpu but got device cuda:0
```
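The bottom frames point at the cause: inside Adam's `step`, the optimizer state tensor `exp_avg` lives on the CPU while the gradient is on `cuda:0`. That happens when the optimizer state is created or restored while the model parameters are still on the CPU, and the model is only moved to the GPU afterwards. A minimal sketch (hypothetical names, not the InnerEye code; requires a CUDA device, and the exact error string depends on the PyTorch version) that reproduces the mismatch:

```python
import torch
from torch import nn

# Illustrative stand-in for the real model.
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters())

# One CPU step so Adam materializes its state tensors (exp_avg, exp_avg_sq).
model(torch.randn(1, 4)).sum().backward()
optimizer.step()
checkpoint = optimizer.state_dict()

# Run recovery: the optimizer checkpoint is restored while the new model's
# parameters still live on the CPU, so the state tensors are cast to the CPU.
recovered_model = nn.Linear(4, 2)
recovered_optimizer = torch.optim.Adam(recovered_model.parameters())
recovered_optimizer.load_state_dict(checkpoint)

# Only afterwards is the model moved to the GPU. Gradients now live on
# cuda:0, but exp_avg is still a CPU tensor, so Adam's
# exp_avg.mul_(beta1).add_(...) fails with the device mismatch above.
recovered_model.cuda()
recovered_model(torch.randn(1, 4, device="cuda")).sum().backward()
recovered_optimizer.step()  # RuntimeError: expected device cpu but got device cuda:0
```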
The fix:
- Separates the logic used to determine which checkpoint/checkpoint path to recover from.
- Separates model creation and model checkpoint loading from optimizer creation and optimizer checkpoint loading, and keeps all of this inside the ModelAndInfo class.
- Creates optimizers only after the model has been moved to the GPU (see the sketch after this list). Fixes #198
- Adds a test to train_via_submodule.yml that continues training from a previous run using run recovery.
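A minimal sketch of the resulting order of operations, assuming a checkpoint dict with hypothetical `model_state_dict`/`optimizer_state_dict` keys (the real logic lives in ModelAndInfo):

```python
import torch
from torch import nn

def create_model_and_optimizer(checkpoint_path: str, device: torch.device):
    """Restore model and optimizer in the order that avoids the device mismatch."""
    checkpoint = torch.load(checkpoint_path, map_location=device)

    model = nn.Linear(4, 2)  # stand-in for the real model
    model.load_state_dict(checkpoint["model_state_dict"])  # hypothetical key
    model.to(device)  # move the model to the GPU *before* ...

    # ... creating the optimizer: its state is tied to, and cast to the
    # device of, parameters that already live on the GPU.
    optimizer = torch.optim.Adam(model.parameters())
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])  # hypothetical key
    return model, optimizer
```

With this ordering, `load_state_dict` casts `exp_avg` and `exp_avg_sq` to `cuda:0`, so they match the gradients during the next `optimizer.step()`.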