This repository has been archived by the owner on Mar 17, 2021. It is now read-only.
I am trying to use multiple GPUs to train my network, and I observed peculiar behavior that I suspect is due to an initialization issue.
My network and application run fine on a single GPU, but as soon as I use multiple GPUs I start getting NaNs randomly throughout the network and the loss function (checked with tf.check_numerics after each layer). From that I concluded the gradients must be the problem, so I applied tf.clip_by_global_norm and observed the following output trend:
Learning_rate=9.999204849009402e-06
Global_norm=45138432.0
L1_reconstruction_loss=2676455.0
L2_reconstruction_loss=1177.85400390625
L1_Image_gradient_loss=0.0
L2_Image_gradient_loss=0.0
L2_6_VQ_loss=5526.724609375
L2_4_VQ_loss=2.0605289831494675e+25
L2_2_VQ_loss=0.01820339821279049
Total_loss=2.0605289831494675e+25
Global_norm_1=31815641333760.0
As can be seen, the global norms of the two GPUs are not even in the same ballpark. This leads me to believe that the copy of the network on the other GPU is not being updated correctly. It is most likely not an initialization problem on my side: I already use Fixup + He initialization, so the variance is small enough to train without BatchNorm, and, as I said, the network runs just fine on one GPU.
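For reference, tf.clip_by_global_norm scales every gradient by clip_norm / max(global_norm, clip_norm), where the global norm is the L2 norm taken over all gradients together. A minimal NumPy sketch of that semantics (the function and the toy gradients are illustrative, not from my actual model):

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """NumPy sketch of tf.clip_by_global_norm semantics."""
    # Global norm: sqrt of the sum of squared entries across ALL gradients.
    global_norm = np.sqrt(sum(np.sum(np.square(g)) for g in grads))
    # Every gradient gets the same scale factor, so relative magnitudes
    # between variables are preserved; if global_norm <= clip_norm the
    # scale is 1 and nothing changes.
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0]), np.array([4.0])]
clipped, norm = clip_by_global_norm(grads, clip_norm=1.0)
# norm is 5.0; the clipped gradients are [0.6] and [0.8]
```

One consequence of this semantics: if each tower computes and clips its own global norm independently (as the two Global_norm values above suggest), each tower takes a differently scaled step; clipping the averaged gradients once avoids that.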
Could you please help?
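For anyone reproducing this, the per-layer NaN check mentioned above behaves like the following sketch (a NumPy stand-in for tf.check_numerics; the helper name and the example message are just illustrative):

```python
import numpy as np

def check_numerics(tensor, message):
    """NumPy stand-in for tf.check_numerics: fail fast on NaN or Inf."""
    # np.isfinite is False for both NaN and +/-Inf entries.
    if not np.all(np.isfinite(tensor)):
        raise ValueError(f"{message}: tensor contains NaN or Inf")
    return tensor

# Passes finite values through unchanged; raises at the first bad layer,
# which is how the NaNs were localized in the network above.
ok = check_numerics(np.array([1.0, 2.0]), "layer_1 activations")
```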