Continuing Training from ResEnc presets is slower and utilizes less GPU memory #2294
Hey, this is a bit difficult to debug because we don't see this behavior on our end. If your GPU has only 24 GB (so VRAM is rather tight), cudnn.benchmark (which we use) can sometimes do funny things. This is beyond our power to address.
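For context on the maintainer's point: cudnn.benchmark is a standard PyTorch switch that lets cuDNN autotune a convolution algorithm per input shape. A minimal sketch (purely illustrative, outside nnU-Net) of disabling it:

```python
import torch

# cudnn.benchmark makes cuDNN time several convolution algorithms on the
# first occurrence of each input shape and cache the fastest one. The
# winning algorithm (and its workspace size) can differ between runs,
# which can change both epoch time and peak memory use on a tight VRAM
# budget. Disabling it trades peak speed for consistent kernel selection.
torch.backends.cudnn.benchmark = False
```

Note that nnU-Net enables this flag internally, so changing it would require editing the trainer; this is only meant to illustrate the mechanism Fabian refers to.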
Hi Fabian,

Thanks for your reply. My GPU has 24 GB of VRAM; it is an A5000. I am running nnUNet through Ubuntu 22.04 on WSL2. Nothing else was running on the GPU and no other programs were open (this was the same configuration as when the training was started), as when training a new fold, for example. Debug.json, nnUNetResEncUNetPlans.json, and nnUNetResEncUNetPlans_24G.json are attached for your convenience; the latter two appear identical.

Best regards

debug.json
As recently suggested in https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/resenc_presets.md, I am using nnUNetPlannerResEncL for training. I reconfigured the planner using
nnUNetv2_plan_experiment -d 602 -pl nnUNetPlannerResEncL -gpu_memory_target 24 -overwrite_plans_name nnUNetResEncUNetPlans_24G
The approach was working as expected, utilizing 24 GB of GPU memory (per nvidia-smi), and each epoch lasted around 238 seconds (InitialTraining.txt). During the training process, my system lost power. I restarted training using the following:
nnUNetv2_train 602 3d_fullres 1 --npz -p nnUNetResEncUNetPlans_24G --c
However, my current GPU utilization is 22 GB and the epoch time has increased by about 60 seconds (ContinueTraining.txt). Given that the network must train for another 300 epochs, this adds 5 hours to the training process. Why is GPU utilization lower after restarting training with the --c flag?
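The 5-hour figure follows directly from the numbers reported above (~60 extra seconds per epoch over 300 remaining epochs):

```python
# Sanity check of the reported slowdown, using the figures from the
# issue text: ~60 extra seconds per epoch, 300 epochs remaining.
extra_seconds_per_epoch = 60
remaining_epochs = 300
added_hours = extra_seconds_per_epoch * remaining_epochs / 3600
print(added_hours)  # → 5.0
```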
Thanks for your support!