This repository has been archived by the owner on Mar 17, 2021. It is now read-only.
I am trying to use multiple GPUs to train my network, and I observed peculiar behavior that I suspect is due to an initialization issue.
My network and application run fine on a single GPU, but as soon as I use multiple GPUs I start getting NaNs randomly throughout the network and the loss function (checked with tf.check_numerics after each layer). From that I concluded the gradients must be the problem, so I applied tf.clip_by_global_norm and observed the following output trend:
Learning_rate=9.999204849009402e-06
Global_norm=45138432.0
L1_reconstruction_loss=2676455.0
L2_reconstruction_loss=1177.85400390625
L1_Image_gradient_loss=0.0
L2_Image_gradient_loss=0.0
L2_6_VQ_loss=5526.724609375
L2_4_VQ_loss=2.0605289831494675e+25
L2_2_VQ_loss=0.01820339821279049
Total_loss=2.0605289831494675e+25
Global_norm_1=31815641333760.0
As can be seen, the global norms of the two GPUs are not even in the same ballpark. This leads me to believe that the copy of the network on the other GPU is not being updated correctly. It is most likely not an initialization problem on my side: I already use Fixup + He initialization, so the variance is small enough to train without BatchNorm, and, as I said, the network runs just fine on one GPU.
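For reference, tf.clip_by_global_norm scales every gradient by clip_norm / max(global_norm, clip_norm), where the global norm is the L2 norm taken over all gradients together. A minimal NumPy sketch of that semantics (the function and the toy gradients are illustrative, not from my actual model):

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """NumPy sketch of tf.clip_by_global_norm semantics."""
    # Global norm: sqrt of the sum of squared entries across ALL gradients.
    global_norm = np.sqrt(sum(np.sum(np.square(g)) for g in grads))
    # Every gradient gets the same scale factor, so relative magnitudes
    # between variables are preserved; if global_norm <= clip_norm the
    # scale is 1 and nothing changes.
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0]), np.array([4.0])]
clipped, norm = clip_by_global_norm(grads, clip_norm=1.0)
# norm is 5.0; the clipped gradients are [0.6] and [0.8]
```

One consequence of this semantics: if each tower computes and clips its own global norm independently (as the two Global_norm values above suggest), each tower takes a differently scaled step; clipping the averaged gradients once avoids that.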
Could you please help?
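For anyone reproducing this, the per-layer NaN check mentioned above behaves like the following sketch (a NumPy stand-in for tf.check_numerics; the helper name and the example message are just illustrative):

```python
import numpy as np

def check_numerics(tensor, message):
    """NumPy stand-in for tf.check_numerics: fail fast on NaN or Inf."""
    # np.isfinite is False for both NaN and +/-Inf entries.
    if not np.all(np.isfinite(tensor)):
        raise ValueError(f"{message}: tensor contains NaN or Inf")
    return tensor

# Passes finite values through unchanged; raises at the first bad layer,
# which is how the NaNs were localized in the network above.
ok = check_numerics(np.array([1.0, 2.0]), "layer_1 activations")
```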