Fix ZeRO 2 + Pipelining #677

Merged 4 commits on Jan 20, 2021
leogao2 commented Jan 16, 2021

ZeRO 2 + Pipelining is broken. After hunting down the issue here (EleutherAI/gpt-neox#62), I've made a patch that seems to solve it, though it might introduce inefficiencies elsewhere.

Since this class turns enable_backward_allreduce off, self.overlapping_partition_gradients_reduce_epilogue(), defined in the DeepSpeedEngine class, never actually runs. I suspect the flag is turned off for efficiency reasons: get_flat_partition in stage2.py might do something expensive memory-wise (a quick glance at the code certainly gives that impression), and someone will have to look into that later. In the meantime, this fixes ZeRO 2 + Pipelining enough to run a demo.
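
For readers without the diff at hand, here is a minimal sketch of the failure mode described above. It uses toy stand-in classes, not the real DeepSpeed code. Only `enable_backward_allreduce` and `overlapping_partition_gradients_reduce_epilogue` come from the description. The class names, the `pending_grads` and `reduced_grads` attributes, `reduce_grads`, and the `patched` switch are hypothetical. The shape of the fix (calling the epilogue explicitly) is an assumption about what the patch does.

```python
class ToyEngine:
    """Hypothetical stand-in for the base engine; not the DeepSpeed API."""

    def __init__(self):
        self.enable_backward_allreduce = True
        self.pending_grads = []   # gradients produced by backward()
        self.reduced_grads = []   # gradients that step() is allowed to consume

    def overlapping_partition_gradients_reduce_epilogue(self):
        # Toy epilogue: publish pending gradients so step() can see them.
        self.reduced_grads = list(self.pending_grads)
        self.pending_grads.clear()

    def backward(self, grads):
        self.pending_grads.extend(grads)
        # The epilogue is only reached through the allreduce path.
        if self.enable_backward_allreduce:
            self.overlapping_partition_gradients_reduce_epilogue()

    def step(self):
        if not self.reduced_grads:
            raise RuntimeError("no reduced gradients available")
        applied, self.reduced_grads = self.reduced_grads, []
        return applied


class ToyPipelineEngine(ToyEngine):
    """Hypothetical pipeline engine that switches the allreduce path off."""

    def __init__(self, patched):
        super().__init__()
        self.enable_backward_allreduce = False
        self.patched = patched  # hypothetical switch: apply the assumed fix or not

    def reduce_grads(self):
        # Assumed fix: run the epilogue explicitly, since backward() skips it.
        if self.patched:
            self.overlapping_partition_gradients_reduce_epilogue()


def run(patched):
    engine = ToyPipelineEngine(patched)
    engine.backward([0.1, 0.2])
    engine.reduce_grads()
    try:
        return engine.step()
    except RuntimeError as err:
        return str(err)
```

In this toy model, `run(False)` reaches `step()` with nothing to apply, while `run(True)` applies the gradients. Whether the explicit call is costly in the real implementation is the open question raised above.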

@ShadenSmith

This is great, thank you!

ShadenSmith commented Jan 20, 2021

Hi @leogao2, it looks like our auto-formatter wants to make some changes. Can you set up pre-commit with the updated formatting? Thanks!

ShadenSmith merged commit 34c83a5 into microsoft:master on Jan 20, 2021
@ShadenSmith

Merged, thanks @leogao2!

Labels: bug