This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Memory errors & Focal Loss error with lung segmentation model #406

Closed
shreyasingh1 opened this issue Feb 23, 2021 · 7 comments

shreyasingh1 commented Feb 23, 2021

Hello,

I've been having some trouble getting the sample lung segmentation experiment to complete successfully. Submitting the job to STANDARD_DS3_V2 and STANDARD_DS12_V2 (CPU) compute targets yielded a dataloader error (shown below).
ValueError: At least one component of the runner failed: Training failed: DataLoader worker (pid(s) 275) exited unexpectedly

--num_dataload_workers defaults to 8, so I lowered it by passing --num_dataload_workers=0 and also set --train_batch_size=1. This then yielded a memory error (shown below) on either CPU target.

ValueError: At least one component of the runner failed: Training failed: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 2774532096 bytes. Error code 12 (Cannot allocate memory)

Then I tried running it on the STANDARD_NC6_GPU. With --num_dataload_workers=0 and --train_batch_size=1, I received the error below, which looks to be raised here in InnerEye's code. I've attached the driver_log for the run that produced this error as well.
ValueError: At least one component of the runner failed: Training failed: Focal loss is supported only for one-hot encoded targets

Also note that if there are too many workers or the batch size is too high, even the STANDARD_NC6_GPU will produce a CUDA out-of-memory error (shown below).
Training failed: CUDA out of memory. Tried to allocate 392.00 MiB (GPU 0; 11.17 GiB total capacity; 10.77 GiB already allocated; 57.81 MiB free; 10.79 GiB reserved in total by PyTorch)
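
For reference, the full submission command with the lowered settings looked roughly like this (a sketch; the runner path and model name are from my local setup and may differ for others):
python InnerEyeLocal/ML/runner.py --azureml=True --model=LungExt --train=True --num_dataload_workers=0 --train_batch_size=1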

Is there a particular compute target that should be used to avoid these memory errors? And is there a way to get around the focal loss error?

AB#3881

ant0nsc (Contributor) commented Feb 23, 2021

Standard_ND24s are the machines we normally use; they fit the models as they are in the repository. If you want or need to use VMs with smaller GPUs, I'd suggest lowering the batch size. I would not recommend running those models on a CPU.

Data load workers should not have an impact on memory consumption. Outside of unit tests, I would not recommend --num_dataload_workers=0 (this runs data loading in the main thread); use 1 or 2 instead.
If focal loss fails, it can mean that your structures are not mutually exclusive. At present, the code does NOT check that this is the case; see #339.
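
To illustrate what "mutually exclusive" means here, a minimal sketch of a pre-training check (this is not InnerEye's own validation code; it assumes the ground-truth structure masks are available as binary numpy arrays of the same spatial shape):

import numpy as np

def check_structures_mutually_exclusive(masks):
    """Raise if any voxel is covered by more than one structure mask.

    masks: list of binary numpy arrays, all with the same spatial shape.
    Focal loss needs a valid one-hot encoding, which only exists when
    structures do not overlap.
    """
    stacked = np.stack(masks, axis=0)      # shape: (num_structures, Z, Y, X)
    overlap = stacked.sum(axis=0) > 1      # voxels claimed by two or more structures
    if overlap.any():
        raise ValueError(f"{int(overlap.sum())} voxels belong to more than one structure; "
                         "a one-hot target cannot be built, so focal loss will fail.")

Running a check like this over your dataset before training would surface the overlapping structures that trigger the focal loss error.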

shreyasingh1 (Author) commented:

Thanks for the suggestions! Upon resubmitting the job to the new compute target, I came across a new error in the AML driver log: ModuleNotFoundError: No module named 'InnerEye'. After that, it hangs at this line initializing DDP (GLOBAL_RANK: 0, MEMBER: 1/4) until the run is manually terminated.

I've installed the repo in my local environment (pip install -e .), and cleared my environment/recloned the repo a few times as well. Any suggestions?

ant0nsc (Contributor) commented Feb 25, 2021

That's odd. I'm surprised that you did not hit that error on your first runs.
How are you calling the InnerEye runner? Did you fork the codebase and submit, or did you choose the alternative route of having your own codebase that uses InnerEye as a submodule? Can you let me know what is in your snapshot (the code that is running in AML)? In AML, for the failing run, go to the "Snapshot" tab and list the names of the top-level directories.

shreyasingh1 (Author) commented:

Yeah, agreed. I had only seen the error in my command line initially (never in AML), and it went away after pip install -e . I call the InnerEye runner like this: python InnerEyeLocal/ML/runner.py --azureml=True --model=LungExt --train=True. I didn't fork the repo; I just made a copy of InnerEye called InnerEyeLocal. My snapshot has these top-level directories: .github, .idea, azure-pipelines, docs, InnerEye, innereye.egg-info, InnerEyeLocal, sphinx-docs, Tests, TestsOutsidePackage, and TestSubmodule.
The driver log is here as well.

ant0nsc (Contributor) commented Feb 26, 2021

OK, the issue is that the "pip install -e ." is effectively not happening in AzureML. The simplest solution is to work directly with the runner in InnerEye/ML/runner.py and make all changes (adding models) there.
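
Concretely, after moving your model config under InnerEye/ML/configs, the submission would look something like this (a sketch based on the command you posted above):
python InnerEye/ML/runner.py --azureml=True --model=LungExt --train=True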

shreyasingh1 (Author) commented:

Awesome, thank you! I was able to get back to the earlier Training failed: Focal loss is supported only for one-hot encoded targets error, which looks to be caused by the issue you referenced. I'll take a stab at #339 and open a PR if I'm successful!

ant0nsc (Contributor) commented Apr 6, 2021

Closing because the remainder of the issue is covered in #339.

ant0nsc closed this as completed Apr 6, 2021