
Default of output_layer_parallelism = "row" is broken for model-parallel training #905

Closed
cbcase opened this issue Apr 25, 2023 · 4 comments
Labels: bug (Something isn't working)

@cbcase commented Apr 25, 2023

Describe the bug
By default, output_layer_parallelism = "row", which splits the output layer along the hidden dimension, so each rank produces a partial sum at the full vocab size. But the loss function mpu.vocab_parallel_cross_entropy assumes input split along the vocab dimension. Remarkably, it doesn't fail on a shape mismatch but instead runs cleanly and silently computes the wrong loss. I noticed that all the provided configs explicitly set output_layer_parallelism = "column".
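
To see why nothing trips a shape check: the loss kernel infers each rank's vocab range from the last dimension of whatever it is handed, so a row-parallel output (a full-vocab partial sum on every rank) passes straight through. A toy sketch of that inference step (a paraphrase for illustration, not the actual mpu code):

```python
import torch

# Hypothetical toy sizes for illustration.
batch, seq, vocab, mp = 2, 4, 16, 2

# Column-parallel output layer: each rank holds a vocab-dimension *shard*.
col_logits = torch.randn(batch, seq, vocab // mp)

# Row-parallel output layer (parallel output, i.e. no all-reduce): each rank
# holds a *partial sum* over its hidden shard, at the FULL vocab size.
row_logits = torch.randn(batch, seq, vocab)

# vocab_parallel_cross_entropy infers each rank's vocab range from the last
# dimension of its input, roughly like this:
def inferred_vocab_range(logits, rank):
    partition_vocab_size = logits.size(-1)
    start = rank * partition_vocab_size
    return start, start + partition_vocab_size

# Correct: rank 1 of 2 owns token ids [8, 16).
print(inferred_vocab_range(col_logits, rank=1))  # (8, 16)

# Silently wrong: rank 1 believes it owns token ids [16, 32), a range no
# target ever falls in, so its contribution to the loss is garbage -- but
# because the shard size was inferred, no shape check ever fires.
print(inferred_vocab_range(row_logits, rank=1))  # (16, 32)
```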

To Reproduce
If you train the same model with model_parallel_size > 1 and compare row vs. column parallelism, you will get worse results with the default setting of row.

Expected behavior
See above.

Proposed solution
I'm not sure why row-parallelism on the output is supported, but I think any of the following are reasonable:

  • Get rid of row-parallelism on the output
  • Make column the default
  • Add shape checks to VocabParallelCrossEntropy (see the sketch below)
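
On the third bullet, a minimal sketch of what such a check could look like (hypothetical: the function would need the global vocab size passed in, since today it infers the shard size from the input itself):

```python
import torch

def check_vocab_parallel_shapes(vocab_parallel_logits: torch.Tensor,
                                global_vocab_size: int,
                                model_parallel_size: int) -> None:
    """Hypothetical guard for VocabParallelCrossEntropy: reject inputs that
    are not an even vocab-dimension shard, e.g. row-parallel partial sums
    arriving at the full vocab size."""
    expected = global_vocab_size // model_parallel_size
    actual = vocab_parallel_logits.size(-1)
    if actual != expected:
        raise ValueError(
            f"vocab_parallel_cross_entropy expects logits sharded along the "
            f"vocab dimension ({expected} per rank for vocab size "
            f"{global_vocab_size} at model_parallel_size "
            f"{model_parallel_size}), but got last dim {actual}. Was the "
            f"output layer row-parallel?"
        )
```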
@Quentin-Anthony (Member) commented

Nice catch! Looks like this issue has snuck through for a while. I'm going to make column the default and add an error for row-parallelism for now, and if you're able to add shape checks that would be much appreciated!

@Quentin-Anthony (Member) commented

Created a temporary hotfix in #915 and started tracking the permanent fix in #916

@StellaAthena (Member) commented

> Nice catch! Looks like this issue has snuck through for a while. I'm going to make column the default and add an error for row-parallelism for now, and if you're able to add shape checks that would be much appreciated!

Is there a reason to support row parallelism at all? I'm struggling to think of why a user might care other than compatibility with loss functions, but, as noted, mpu's CE loss is column-parallel. Is there another library we support that does it row-parallel?

@cbcase (Author) commented May 2, 2023

Fwiw, I think getting rid of the row-parallel final linear as an option makes the most sense -- it's not clear to me how you could write a loss function that operates on row-parallel outputs, since the output activations are partial sums over the hidden ("reduction") dimension rather than shards along any tensor axis.

Furthermore, in the simple case where you perform an immediate reduce of the parallel output (rather than have a parallel loss function), column parallel is faster since the reduction is an AllGather, not an AllReduce.
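
For concreteness, the two recovery paths look roughly like this in torch.distributed (a sketch; mp_group stands in for the model-parallel process group):

```python
import torch
import torch.distributed as dist

# Column parallel: each rank holds a vocab shard of the logits. Recovering
# the full logits is an all-gather (pure concatenation, no arithmetic).
def gather_column_parallel(shard, world_size, mp_group):
    shards = [torch.empty_like(shard) for _ in range(world_size)]
    dist.all_gather(shards, shard, group=mp_group)
    return torch.cat(shards, dim=-1)

# Row parallel: each rank holds a full-vocab *partial sum* over its hidden
# shard. Recovering the logits requires an all-reduce (summation), which
# moves roughly twice the data of an all-gather for the same result size.
def reduce_row_parallel(partial_logits, mp_group):
    dist.all_reduce(partial_logits, op=dist.ReduceOp.SUM, group=mp_group)
    return partial_logits
```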
