Bf16 #330

Merged
merged 24 commits into from
Jun 22, 2021
@sdtblck sdtblck commented May 14, 2021

bf16 currently works with everything except partitioned checkpointed activations and pipeline parallelism. Those fixes will land in DeepSpeed anyway, so imo this is mostly ready to merge.

All NCCL communication for bf16 now happens in fp32, so it's a little slower if you're communication-bound. Untested with the ZeRO optimizer as of right now.
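For intuition on the cost, here is a minimal pure-Python sketch of the "upcast to fp32, reduce, cast back" pattern. It is a simulation, not the code in this PR: the helper names (`fp32_to_bf16_bits`, `bf16_bits_to_fp32`, `simulated_all_reduce_fp32`, `payload_bytes`) are made up for illustration, and the "ranks" are plain lists rather than NCCL processes. It shows two things that follow from the formats themselves: a bf16 value converts to fp32 exactly, and the fp32 payload is twice the size of the bf16 one.

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Round a finite float to bf16 (round-to-nearest-even); return the 16-bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def bf16_bits_to_fp32(b: int) -> float:
    """Upcast bf16 to fp32: bf16 is the top 16 bits of an fp32, so this is exact."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]

def to_fp32(x: float) -> float:
    """Round a Python float (a double) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def payload_bytes(num_elements: int, dtype: str) -> int:
    """Bytes sent per tensor: bf16 is 2 bytes per element, fp32 is 4."""
    return num_elements * {"bf16": 2, "fp32": 4}[dtype]

def simulated_all_reduce_fp32(per_rank_bf16):
    """Sum element-wise across ranks in fp32, then cast the result back to bf16.

    per_rank_bf16 is a list (one entry per simulated rank) of lists of bf16 bit patterns.
    """
    num_elements = len(per_rank_bf16[0])
    result = []
    for i in range(num_elements):
        acc = 0.0
        for rank_values in per_rank_bf16:
            acc = to_fp32(acc + bf16_bits_to_fp32(rank_values[i]))
        result.append(fp32_to_bf16_bits(acc))
    return result

if __name__ == "__main__":
    ranks = [[0x3F80, 0x4000], [0x3F80, 0x4000]]  # two ranks holding [1.0, 2.0]
    reduced = simulated_all_reduce_fp32(ranks)
    print([bf16_bits_to_fp32(b) for b in reduced])
    print(payload_bytes(1_000_000, "fp32") / payload_bytes(1_000_000, "bf16"))
```

The doubled payload is where the slowdown comes from when a run is communication-bound; the casts themselves are cheap by comparison.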

For now, you'll need to install https://github.com/EleutherAI/DeeperSpeed/tree/bf16 for it to work.

TODO:

  • merge DeeperSpeed bf16 branch to main
  • test with ZeRO optimizer
  • get partition activations working
  • get pipeline parallel working

@sdtblck sdtblck requested a review from a team as a code owner May 14, 2021 13:31
@sdtblck sdtblck marked this pull request as draft May 14, 2021 13:49
sdtblck commented May 14, 2021

Update: it doesn't work with ZeRO 😆

sdtblck commented May 14, 2021

OK, now working for ZeRO 1 + 2.

sdtblck commented May 14, 2021

aaand now pipeline parallel + activation checkpointing too!
Aside from ZeRO 3, everything works. I'll wait until we have ZeRO 3 working in our codebase before looking at that, since I suspect it will be a fair bit of work.

@sdtblck sdtblck marked this pull request as ready for review May 14, 2021 19:43
@sdtblck sdtblck linked an issue May 14, 2021 that may be closed by this pull request
1 task
@StellaAthena StellaAthena merged commit 43f6330 into main Jun 22, 2021
@StellaAthena StellaAthena deleted the _bf16 branch June 22, 2021 16:12
Development

Successfully merging this pull request may close these issues.

Implement Bf16
2 participants