
Multi-GPU initialisation issue #461

Closed
danieltudosiu opened this issue Nov 6, 2019 · 0 comments

I am trying to use multiple GPUs to train my network, and I have observed some peculiar behavior that I think is due to an initialization issue.

My network and application run just fine on a single GPU, but as soon as I switch to multiple GPUs I start getting NaNs at random points throughout the network and the loss function (verified with tf.check_numerics after each layer). This led me to suspect the gradients, so I wrapped them in tf.clip_by_global_norm and logged the norms, which produced the following output:

Learning_rate=9.999204849009402e-06
Global_norm=45138432.0
L1_reconstruction_loss=2676455.0
L2_reconstruction_loss=1177.85400390625
L1_Image_gradient_loss=0.0
L2_Image_gradient_loss=0.0
L2_6_VQ_loss=5526.724609375
L2_4_VQ_loss=2.0605289831494675e+25
L2_2_VQ_loss=0.01820339821279049
Total_loss=2.0605289831494675e+25
Global_norm_1=31815641333760.0
As can be seen, the global norms on the two GPUs (Global_norm and Global_norm_1) are not even in the same ballpark. This leads me to believe that the copy of the network on the other GPU is most likely not being updated: I already use FixUp + He initialization, so the variance is small enough to train without BatchNorm, and, as I said, the network trains just fine on one GPU.
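
For reference, this is roughly how the per-tower norms above are obtained (a minimal TF 1.x sketch, not the actual NiftyNet multi-GPU code; build_network and loss_fn are placeholders for my model and loss):

```python
import tensorflow as tf

def tower_grads_and_norm(optimizer, inputs, build_network, loss_fn, tower_id):
    """Build one GPU tower, return its clipped gradients and global norm.

    build_network and loss_fn are stand-ins for the model and the loss;
    the variable scope is reused so that both towers share the same weights.
    """
    with tf.device('/gpu:%d' % tower_id):
        with tf.variable_scope('model', reuse=tf.AUTO_REUSE):
            outputs = build_network(inputs)
        # Fail fast if any activation already contains NaN/Inf on this tower.
        outputs = tf.check_numerics(outputs, 'NaN/Inf in tower %d' % tower_id)
        loss = loss_fn(outputs)
        grads_and_vars = optimizer.compute_gradients(loss)
        grads = [g for g, v in grads_and_vars if g is not None]
        variables = [v for g, v in grads_and_vars if g is not None]
        # Clip and report this tower's global gradient norm -- the quantity
        # that differs by several orders of magnitude between the two GPUs.
        clipped, global_norm = tf.clip_by_global_norm(grads, clip_norm=5.0)
        tf.summary.scalar('global_norm/tower_%d' % tower_id, global_norm)
        return list(zip(clipped, variables)), global_norm
```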

Could you please help?
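
This is the kind of sanity check I have in mind for the shared-weights hypothesis (again a generic TF 1.x sketch, not the actual tower-building code): if the second tower reuses the first tower's variables, the trainable-variable count should not grow when it is built.

```python
import tensorflow as tf

with tf.variable_scope('model', reuse=tf.AUTO_REUSE):
    # ... build tower 0 here ...
    n_vars_tower_0 = len(tf.trainable_variables())

with tf.variable_scope('model', reuse=tf.AUTO_REUSE):
    # ... build tower 1 here ...
    n_vars_tower_1 = len(tf.trainable_variables())

# If tower 1 created its own copies of the weights, those copies are never
# touched by the optimizer, which would match the symptoms above.
assert n_vars_tower_0 == n_vars_tower_1, 'towers do not share variables'
```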
