
Pipeline Parallel QoL Fixes #63

Merged
StellaAthena merged 37 commits into main from stella on Jan 17, 2021
Conversation

StellaAthena (Member)

This PR contributes the `train_parallel.py` file, which implements GPT-3 Small training with pipeline parallelism.
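
For readers unfamiliar with DeepSpeed's pipeline engine, here is a minimal sketch of the kind of setup a file like `train_parallel.py` performs. The layer stack, stage count, and config path below are illustrative assumptions, not the file's actual contents:

```python
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule, LayerSpec

# Illustrative only: a toy stack of transformer blocks, declared lazily
# with LayerSpec so each pipeline stage materializes only its own layers.
layers = [
    LayerSpec(nn.TransformerEncoderLayer, d_model=768, nhead=12)
    for _ in range(12)
]

# Split the stack across two pipeline stages (hypothetical stage count).
model = PipelineModule(layers=layers, num_stages=2)

# deepspeed.initialize returns a PipelineEngine when handed a PipelineModule;
# "ds_config.json" is a placeholder for the DeepSpeed config file.
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)

# The pipeline engine drives micro-batching itself: one call consumes a
# full batch of micro-batches from the iterator.
# loss = engine.train_batch(data_iter=train_iter)
```

A script like this runs under the `deepspeed` launcher, so that distributed state is initialized before `PipelineModule` is constructed.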

Additionally, several quality-of-life improvements have been made. Most notably, there are now separate configs and scripts for ZeRO 1 and ZeRO 2. Several files have also been renamed to describe their contents more accurately.
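
For context on the ZeRO 1 / ZeRO 2 split: the stage is a single field in the DeepSpeed config, so the separate configs presumably differ mainly in that value. A hedged sketch, with placeholder values rather than the contents of the new config files:

```python
# Illustrative DeepSpeed configs written as Python dicts; the actual
# configs in the repo are JSON files and set more fields than this.
zero1_config = {
    "train_batch_size": 256,            # placeholder value
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 1},  # ZeRO 1: shard optimizer states
}

zero2_config = {
    "train_batch_size": 256,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},  # ZeRO 2: also shard gradients
}
```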

Finally, I want to merge back into the main branch to avoid irreconcilable divergences. The longer this branch remains separate, the more likely that becomes.

As of this PR, both Pipeline Parallelism and Activation Checkpointing work individually, but not together. See Issue #62 for the bug report.
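
For reference, the failing combination is easy to express: in DeepSpeed, activation checkpointing on a pipeline model is typically enabled through `PipelineModule`'s `activation_checkpoint_interval` argument. A sketch of the two features together (assumed setup, not the exact code on this branch):

```python
from deepspeed.pipe import PipelineModule

# Each feature works on its own as of this PR; combining a multi-stage
# pipeline with a nonzero checkpoint interval triggers the failure
# tracked in Issue #62.
model = PipelineModule(
    layers=layers,                     # layer specs as in the sketch above
    num_stages=2,                      # pipeline parallelism
    activation_checkpoint_interval=1,  # checkpoint activations every layer
)
```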

StellaAthena requested a review from a team as a code owner on January 14, 2021
StellaAthena and others added 9 commits January 14, 2021 04:16
Minor tweak to `forward` to align better with the demo code
Forgot about the open PR and pushed dev code to this branch.
* Pipeline + Checkpoint - ZeRO 2

This illustrates the third combination: pipeline parallelism, activation checkpoints, and ZeRO Stage 1. Again, this works. Upgrading ZeRO to Stage 2 causes it to fail.

Co-authored-by: Shivanshu Purohit <[email protected]>
StellaAthena merged commit 755181b into main on Jan 17, 2021
StellaAthena deleted the stella branch on January 17, 2021
Development

Successfully merging this pull request may close these issues.

Implement Pipeline Parallelism