Memory errors & Focal Loss error with lung segmentation model #406
Comments
Data load workers should not have an impact on memory consumption. Outside of unit tests, I would not recommend …
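(For context, a minimal plain-PyTorch sketch, not InnerEye-specific: `num_workers` controls how many subprocesses the DataLoader spawns, and the "worker exited unexpectedly" error reported below usually means one of those subprocesses died, e.g. was killed for memory. Setting `num_workers=0` loads data in the main process instead.)

```python
# Plain-PyTorch sketch (not InnerEye code): num_workers controls how many subprocesses
# prefetch batches. With num_workers=0 everything runs in the main process, which avoids
# "DataLoader worker exited unexpectedly" but does not change how much memory a batch needs.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(16, 1, 32, 32, 32))  # stand-in for 3D image crops
loader = DataLoader(dataset, batch_size=1, num_workers=0)

for (batch,) in loader:
    pass  # training step would go here
```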
Thanks for the suggestions! Upon trying to resubmit the tasks to the new compute target, I came across a new error within the AML Driver Log. I've installed the repo in my local environment (pip install -e .), and cleared my environment/recloned the repo a few times as well. Any suggestions?
That's odd. I'm surprised that you did not hit that error on your first runs.
Yeah, agreed. I had only seen the error in my command line initially (never in AML), and it went away after …
OK, the issue is that the "pip install -e ." is effectively not happening in AzureML. The simplest solution is to work directly with the runner in …
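(As an illustration only: working with the runner directly could look like the sketch below. The script path `InnerEye/ML/runner.py` and the `--model` / `--azureml` flags are assumptions based on the repository's documented entry point, not something confirmed in this thread; check the repo README for the exact invocation.)

```python
# Hypothetical sketch: launching the InnerEye runner script directly from a checkout of
# the repository. The path and flags below are assumptions; consult the repo README.
import subprocess
import sys

subprocess.run(
    [
        sys.executable, "InnerEye/ML/runner.py",
        "--model=Lung",
        "--azureml=True",
        "--num_dataload_workers=0",
        "--train_batch_size=1",
    ],
    check=True,  # raise CalledProcessError if the runner exits with a non-zero status
)
```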
Awesome, thank you! I was able to get back to the earlier …
Closing because the remainder of the issue is covered in #339
Hello,
I've been having some trouble getting the sample lung segmentation experiment to complete successfully. Submitting the task to STANDARD_DS3_V2 and STANDARD_DS12_V2 CPU targets yielded a dataloader error (shown below).
ValueError: At least one component of the runner failed: Training failed: DataLoader worker (pid(s) 275) exited unexpectedly
`--num_dataload_workers` is set to 8 by default, so I lowered it by passing `--num_dataload_workers=0` and `--train_batch_size=1`. This then yielded a memory error (shown below) when I ran it on either CPU target.

ValueError: At least one component of the runner failed: Training failed: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 2774532096 bytes. Error code 12 (Cannot allocate memory)
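(For a rough sense of where an allocation that large can come from, here is a back-of-envelope sketch; it is not InnerEye code, and the crop dimensions and class count are illustrative assumptions.)

```python
# Back-of-envelope sketch: memory for one float32 tensor of shape
# [batch, channels, depth, height, width]. One-hot label volumes multiply the
# per-sample cost by the number of classes, so even batch_size=1 adds up quickly
# once images, labels, and intermediate copies are held at the same time.
def tensor_bytes(batch: int, channels: int, depth: int, height: int, width: int) -> int:
    return batch * channels * depth * height * width * 4  # 4 bytes per float32

# Example with assumed values: a 6-class one-hot label crop of 64 x 224 x 224 voxels.
print(tensor_bytes(1, 6, 64, 224, 224) / 2**20, "MiB")  # roughly 73 MiB per label crop
```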
Then, I tried running it on the STANDARD_NC6_GPU. Here, with parameters `--num_dataload_workers=0` and `--train_batch_size=1`, I received the error below. It looks to be raised here in InnerEye's code. I've attached the driver_log for the run that produced this error as well.

ValueError: At least one component of the runner failed: Training failed: Focal loss is supported only for one-hot encoded targets
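(The message suggests the loss expects one-hot encoded targets. Purely as a generic PyTorch illustration of what "one-hot encoded targets" means for a segmentation label map, not InnerEye's actual implementation:)

```python
# Generic PyTorch sketch (not InnerEye's code): converting an integer class map into a
# one-hot volume with shape [batch, classes, depth, height, width].
import torch
import torch.nn.functional as F

num_classes = 6                                        # assumed class count, for illustration
labels = torch.randint(0, num_classes, (1, 8, 8, 8))   # integer label map [batch, D, H, W]

one_hot = F.one_hot(labels, num_classes)               # [batch, D, H, W, classes]
one_hot = one_hot.permute(0, 4, 1, 2, 3).float()       # [batch, classes, D, H, W]
print(one_hot.shape)                                   # torch.Size([1, 6, 8, 8, 8])
```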
Also note that if there are too many workers or the batch size is too high, even the STANDARD_NC6_GPU will produce a CUDA out-of-memory error (shown below).
Training failed: CUDA out of memory. Tried to allocate 392.00 MiB (GPU 0; 11.17 GiB total capacity; 10.77 GiB already allocated; 57.81 MiB free; 10.79 GiB reserved in total by PyTorch)
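(When tuning these settings on a fixed-size GPU such as the NC6's ~11 GiB card, it can help to measure the peak memory of a single step before scaling up. A minimal sketch using standard PyTorch utilities, with made-up tensor sizes:)

```python
# Sketch only: measure peak GPU memory for one toy forward/backward pass. The tensor
# shape below is an arbitrary example, not the Lung model's actual crop size.
import torch

if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(1, 1, 64, 224, 224, device="cuda", requires_grad=True)
    loss = (x * 2).sum()
    loss.backward()
    print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
```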
Is there a particular compute target that should be used to avoid these memory errors? And is there a way to get around the focal loss error?
AB#3881