This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Memory errors & Focal Loss error with lung segmentation model #406

Closed
shreyasingh1 opened this issue Feb 23, 2021 · 7 comments

shreyasingh1 commented Feb 23, 2021

Hello,

I've been having some trouble getting the sample lung segmentation experiment to complete successfully. Submitting the job to STANDARD_DS3_V2 and STANDARD_DS12_V2 (CPU) compute targets yielded a dataloader error (shown below).
ValueError: At least one component of the runner failed: Training failed: DataLoader worker (pid(s) 275) exited unexpectedly

--num_dataload_workers defaults to 8, so I lowered it by passing --num_dataload_workers=0 and also set --train_batch_size=1. This then yielded a memory error (shown below) on either CPU target.

ValueError: At least one component of the runner failed: Training failed: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 2774532096 bytes. Error code 12 (Cannot allocate memory)

Then I tried running it on the STANDARD_NC6_GPU. With --num_dataload_workers=0 and --train_batch_size=1, I received the error below, which looks to be raised here in InnerEye's code. I've attached the driver_log for the run that produced this error as well.
ValueError: At least one component of the runner failed: Training failed: Focal loss is supported only for one-hot encoded targets

Also note that if there are too many workers or the batch size is too high, even the STANDARD_NC6_GPU will produce a CUDA out-of-memory error (shown below).
Training failed: CUDA out of memory. Tried to allocate 392.00 MiB (GPU 0; 11.17 GiB total capacity; 10.77 GiB already allocated; 57.81 MiB free; 10.79 GiB reserved in total by PyTorch)
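
For reference, the full submission command with the lowered settings looked roughly like this (a sketch; the runner path and model name are from my local setup and may differ for others):
python InnerEyeLocal/ML/runner.py --azureml=True --model=LungExt --train=True --num_dataload_workers=0 --train_batch_size=1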

Is there a particular compute target that should be used to avoid these memory errors? And is there a way to get around the focal loss error?

AB#3881

ant0nsc (Contributor) commented Feb 23, 2021

Standard_ND24s are the machines we normally use; they fit the models as they are in the repository. If you want or need to use VMs with smaller GPUs, I'd suggest lowering the batch size. I would not recommend running those models on a CPU.

Data load workers should not have an impact on memory consumption. Outside of unit tests, I would not recommend --num_dataload_workers=0 (this runs data loading in the main thread); use 1 or 2 instead.
If focal loss fails, it can mean that your structures are not mutually exclusive. At present, the code does NOT check that this is the case; see #339.
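
To illustrate what "mutually exclusive" means here, a minimal sketch of a pre-training check (this is not InnerEye's own validation code; it assumes the ground-truth structure masks are available as binary numpy arrays of the same spatial shape):

import numpy as np

def check_structures_mutually_exclusive(masks):
    """Raise if any voxel is covered by more than one structure mask.

    masks: list of binary numpy arrays, all with the same spatial shape.
    Focal loss needs a valid one-hot encoding, which only exists when
    structures do not overlap.
    """
    stacked = np.stack(masks, axis=0)      # shape: (num_structures, Z, Y, X)
    overlap = stacked.sum(axis=0) > 1      # voxels claimed by two or more structures
    if overlap.any():
        raise ValueError(f"{int(overlap.sum())} voxels belong to more than one structure; "
                         "a one-hot target cannot be built, so focal loss will fail.")

Running a check like this over your dataset before training would surface the overlapping structures that trigger the focal loss error.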

shreyasingh1 (Author) commented:

Thanks for the suggestions! Upon resubmitting the job to the new compute target, I came across a new error in the AML driver log: ModuleNotFoundError: No module named 'InnerEye'. After that, it hangs at this line initializing DDP (GLOBAL_RANK: 0, MEMBER: 1/4) until the run is manually terminated.

I've installed the repo in my local environment (pip install -e .), and cleared my environment/recloned the repo a few times as well. Any suggestions?

ant0nsc (Contributor) commented Feb 25, 2021

That's odd. I'm surprised that you did not hit that error on your first runs.
How are you calling the InnerEye runner? Did you fork the codebase and submit, or did you choose the alternative route of having your own codebase that uses InnerEye as a submodule? Can you let me know what is in your snapshot (the code that is running in AML)? In AML, for the failing run, go to the "Snapshot" tab and list the names of the top-level directories.

shreyasingh1 (Author) commented:

Yeah, agreed. I had only seen the error in my command line initially (never in AML), and it went away after pip install -e . I call the InnerEye runner like this: python InnerEyeLocal/ML/runner.py --azureml=True --model=LungExt --train=True. I didn't fork the repo; I just made a copy of InnerEye called InnerEyeLocal. My snapshot has these top-level directories: .github, .idea, azure-pipelines, docs, InnerEye, innereye.egg-info, InnerEyeLocal, sphinx-docs, Tests, TestsOutsidePackage, and TestSubmodule.
The driver log is here as well.

ant0nsc (Contributor) commented Feb 26, 2021

OK, the issue is that the "pip install -e ." is effectively not happening in AzureML. The simplest solution is to work directly with the runner in InnerEye/ML/runner.py and make all changes (adding models) there.
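
Concretely, after moving your model config under InnerEye/ML/configs, the submission would look something like this (a sketch based on the command you posted above):
python InnerEye/ML/runner.py --azureml=True --model=LungExt --train=True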

shreyasingh1 (Author) commented:

Awesome, thank you! I was able to get back to the earlier Training failed: Focal loss is supported only for one-hot encoded targets error, which looks to be caused by the issue you referenced. I'll take a stab at #339 and open a PR if I'm successful!

ant0nsc (Contributor) commented Apr 6, 2021

Closing because the remainder of the issue is covered in #339.

ant0nsc closed this as completed Apr 6, 2021