
Continuing Training from ResEnc presets is slower and utilizes less GPU memory #2294

AustinTapp opened this issue Jun 14, 2024 · 2 comments

@AustinTapp

As recently suggested in https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/resenc_presets.md, I am using nnUNetPlannerResEncL for training. I reconfigured the planner with nnUNetv2_plan_experiment -d 602 -pl nnUNetPlannerResEncL -gpu_memory_target 24 -overwrite_plans_name nnUNetResEncUNetPlans_24G. The approach was working as expected: training utilized 24 GB of GPU memory (per nvidia-smi) and each epoch lasted around 238 seconds (InitialTraining.txt).

During the training process, my system lost power. I restarted training with nnUNetv2_train 602 3d_fullres 1 --npz -p nnUNetResEncUNetPlans_24G --c. However, GPU utilization is now only 22 GB and the epoch time has increased by about 60 seconds (ContinueTraining.txt). Given the network must train for another 300 epochs, this adds roughly 5 hours to the training process.

Why is GPU utilization lower after restarting training with the --c flag?

Thanks for your support!

FabianIsensee self-assigned this Jun 14, 2024
@FabianIsensee
Member

Hey, this is a bit difficult to debug because we don't see this behavior on our end. If your GPU only has 24GB (so VRAM is rather tight), cudnn.benchmark (which we use) can sometimes do funny things. This is beyond our power to address.
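As a side note, cudnn.benchmark is a standard PyTorch switch. A minimal sketch of what it does (illustration only, not nnU-Net's actual trainer code):

```python
import torch

# With benchmark enabled, cuDNN profiles several convolution algorithms the
# first time each input shape is seen and caches the fastest one. Which
# algorithm wins can depend on how much free VRAM is available at that moment,
# so two runs on a nearly full 24 GB card may settle on different algorithms
# (and therefore different speed and memory footprints).
torch.backends.cudnn.benchmark = True
```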
My recommendation would be to restart again and see if that fixes it. Before you do, make sure nothing else runs on the GPU. If the GPU is in your workstation, please make sure to close all unnecessary programs before you restart.
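One way to double-check that nothing else is holding VRAM before relaunching (a sketch using PyTorch's public API, run from the same environment; it is not something nnU-Net prints itself):

```python
import torch

# Report free vs. total memory on the current CUDA device before training starts.
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"free: {free_bytes / 1024**3:.1f} GiB of {total_bytes / 1024**3:.1f} GiB")
```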
240s is pretty slow for a nnU-Net ResEnc L epoch. What GPU are you using?
Best,
Fabian

@AustinTapp
Author

Hi Fabian,

Thanks for your reply. My GPU has 24 GB of VRAM; it is an A5000. I am running nnUNet through Ubuntu 22.04 on WSL2. Nothing else was running on the GPU and no other programs were open (this was the same configuration as when training was first started). When training a new fold, for example nnUNetv2_train 602 3d_fullres 2 --npz -p nnUNetResEncUNetPlans_24G, all 24 GB (24217MiB / 24564MiB) are used.

Debug.json, nnUNetResEncUNetPlans.json, and nnUNetResEncUNetPlans_24G.json are attached for your convenience. The latter two appear identical.

Best regards,
Austin

debug.json
nnUNetResEncUNetPlans_24G.json
nnUNetResEncUNetLPlans.json
