
Fix multi-node bug in PL 1.2.8 #437

Merged
merged 9 commits into main from melanibe/fix-multi-node-pl-bug on Apr 16, 2021

Conversation

melanibe
Contributor

melanibe commented Apr 16, 2021

This PR fixes a bug in the DDPPlugin of PL v1.2.8: _call_children_scripts asserts that global_rank == 0, but the function should be called on every node, i.e. it should assert local_rank == 0.
(Tested on 1 node with 4 GPUs, 2 nodes with 1 GPU each, and 2 nodes with 2 GPUs each.)

Re-adds the end-to-end test for 2-node training to the build pipeline.

Closes #436
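
For illustration, here is a minimal, self-contained sketch of the rank logic described above. The helper `should_call_children_scripts` is hypothetical, not part of PyTorch Lightning or this PR; the example layout mirrors one of the tested configurations (2 nodes with 2 GPUs each).

```python
def should_call_children_scripts(global_rank: int, local_rank: int) -> bool:
    """Hypothetical helper illustrating the fix described above.

    The child-launching script must run once per node, so the gate has to
    look at the node-local rank. Gating on ``global_rank == 0`` (the PL
    1.2.8 behaviour) only launches the extra processes on the first node,
    which breaks multi-node training.
    """
    return local_rank == 0


# Example: 2 nodes with 2 GPUs each -> global ranks 0..3, local ranks 0..1.
for node in range(2):
    for local_rank in range(2):
        global_rank = node * 2 + local_rank
        launches = should_call_children_scripts(global_rank, local_rank)
        print(f"node={node} global_rank={global_rank} "
              f"local_rank={local_rank} launches_children={launches}")
```

With the local-rank check, each node's rank-0 process launches its own children; the old global-rank check would print `launches_children=True` only on node 0.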

melanibe marked this pull request as ready for review on April 16, 2021, 12:59
javier-alvarez previously approved these changes Apr 16, 2021
ozan-oktay previously approved these changes Apr 16, 2021
javier-alvarez previously approved these changes Apr 16, 2021
melanibe dismissed stale reviews from javier-alvarez and ozan-oktay via 940cb1c on April 16, 2021, 14:08
melanibe enabled auto-merge (squash) on April 16, 2021, 14:32
melanibe merged commit a155946 into main on Apr 16, 2021
melanibe deleted the melanibe/fix-multi-node-pl-bug branch on April 16, 2021, 18:54
Successfully merging this pull request may close these issues:

Multiple node training is broken with PL 1.2.7