
Continuing Training from ResEnc presets is slower and utilizes less GPU memory #2294

AustinTapp opened this issue Jun 14, 2024 · 2 comments

@AustinTapp

As recently suggested in https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/resenc_presets.md, I am using nnUNetPlannerResEncL for training. I reconfigured the planner with nnUNetv2_plan_experiment -d 602 -pl nnUNetPlannerResEncL -gpu_memory_target 24 -overwrite_plans_name nnUNetResEncUNetPlans_24G. The approach was working as expected: training utilized 24 GB of GPU memory (per nvidia-smi) and each epoch lasted around 238 seconds (InitialTraining.txt).

During the training process, my system lost power. I restarted training with nnUNetv2_train 602 3d_fullres 1 --npz -p nnUNetResEncUNetPlans_24G --c. However, GPU utilization is now only 22 GB and the epoch time has increased by about 60 seconds (ContinueTraining.txt). Given the network must train for another 300 epochs, this adds roughly 5 hours to the training process.

Why is GPU utilization lower after restarting training with the --c flag?

Thanks for your support!

FabianIsensee self-assigned this Jun 14, 2024
@FabianIsensee
Member

Hey, this is a bit difficult to debug because we don't see this behavior on our end. If your GPU only has 24GB (so VRAM is rather tight), cudnn.benchmark (which we use) can sometimes do funny things. This is beyond our power to address.
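As a side note, cudnn.benchmark is a standard PyTorch switch. A minimal sketch of what it does (illustration only, not nnU-Net's actual trainer code):

```python
import torch

# With benchmark enabled, cuDNN profiles several convolution algorithms the
# first time each input shape is seen and caches the fastest one. Which
# algorithm wins can depend on how much free VRAM is available at that moment,
# so two runs on a nearly full 24 GB card may settle on different algorithms
# (and therefore different speed and memory footprints).
torch.backends.cudnn.benchmark = True
```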
My recommendation would be to restart again and see if that fixes it. Before you do, make sure nothing else runs on the GPU. If the GPU is in your workstation, please make sure to close all unnecessary programs before you restart.
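One way to double-check that nothing else is holding VRAM before relaunching (a sketch using PyTorch's public API, run from the same environment; it is not something nnU-Net prints itself):

```python
import torch

# Report free vs. total memory on the current CUDA device before training starts.
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"free: {free_bytes / 1024**3:.1f} GiB of {total_bytes / 1024**3:.1f} GiB")
```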
240s is pretty slow for a nnU-Net ResEnc L epoch. What GPU are you using?
Best,
Fabian

@AustinTapp
Author

Hi Fabian,

Thanks for your reply. My GPU has 24 GB of VRAM; it is an A5000. I am running nnUNet through Ubuntu 22.04 on WSL2. Nothing else was running on the GPU and no other programs were open (this was the same configuration as when training was first started). When training a new fold, for example nnUNetv2_train 602 3d_fullres 2 --npz -p nnUNetResEncUNetPlans_24G, all 24 GB (24217MiB / 24564MiB) are used.

Debug.json, nnUNetResEncUNetPlans.json, and nnUNetResEncUNetPlans_24G.json are attached for your convenience. The latter two appear identical.

Best regards,
Austin

debug.json
nnUNetResEncUNetPlans_24G.json
nnUNetResEncUNetLPlans.json
