
Default of output_layer_parallelism = "row" is broken for model-parallel training #905

Closed
cbcase opened this issue Apr 25, 2023 · 4 comments
Labels: bug (Something isn't working)

@cbcase commented Apr 25, 2023

Describe the bug
By default, output_layer_parallelism = "row", which splits the output layer along the hidden dimension, so each rank produces a partial sum at the full vocab size. But the loss function mpu.vocab_parallel_cross_entropy assumes input split along the vocab dimension. Remarkably, it doesn't fail on a shape mismatch but instead runs cleanly and silently computes the wrong loss. I noticed that all the provided configs explicitly set output_layer_parallelism = "column".
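
To see why nothing trips a shape check: the loss kernel infers each rank's vocab range from the last dimension of whatever it is handed, so a row-parallel output (a full-vocab partial sum on every rank) passes straight through. A toy sketch of that inference step (a paraphrase for illustration, not the actual mpu code):

```python
import torch

# Hypothetical toy sizes for illustration.
batch, seq, vocab, mp = 2, 4, 16, 2

# Column-parallel output layer: each rank holds a vocab-dimension *shard*.
col_logits = torch.randn(batch, seq, vocab // mp)

# Row-parallel output layer (parallel output, i.e. no all-reduce): each rank
# holds a *partial sum* over its hidden shard, at the FULL vocab size.
row_logits = torch.randn(batch, seq, vocab)

# vocab_parallel_cross_entropy infers each rank's vocab range from the last
# dimension of its input, roughly like this:
def inferred_vocab_range(logits, rank):
    partition_vocab_size = logits.size(-1)
    start = rank * partition_vocab_size
    return start, start + partition_vocab_size

# Correct: rank 1 of 2 owns token ids [8, 16).
print(inferred_vocab_range(col_logits, rank=1))  # (8, 16)

# Silently wrong: rank 1 believes it owns token ids [16, 32), a range no
# target ever falls in, so its contribution to the loss is garbage -- but
# because the shard size was inferred, no shape check ever fires.
print(inferred_vocab_range(row_logits, rank=1))  # (16, 32)
```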

To Reproduce
If you train the same model with model_parallel_size > 1 and compare row vs. column parallelism, you will get worse results with the default setting of row.

Expected behavior
See above.

Proposed solution
I'm not sure why row-parallelism on the output is supported, but I think any of the following are reasonable:

  • Get rid of row-parallelism on the output
  • Make column the default
  • Add shape checks to VocabParallelCrossEntropy (see the sketch below)
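
On the third bullet, a minimal sketch of what such a check could look like (hypothetical: the function would need the global vocab size passed in, since today it infers the shard size from the input itself):

```python
import torch

def check_vocab_parallel_shapes(vocab_parallel_logits: torch.Tensor,
                                global_vocab_size: int,
                                model_parallel_size: int) -> None:
    """Hypothetical guard for VocabParallelCrossEntropy: reject inputs that
    are not an even vocab-dimension shard, e.g. row-parallel partial sums
    arriving at the full vocab size."""
    expected = global_vocab_size // model_parallel_size
    actual = vocab_parallel_logits.size(-1)
    if actual != expected:
        raise ValueError(
            f"vocab_parallel_cross_entropy expects logits sharded along the "
            f"vocab dimension ({expected} per rank for vocab size "
            f"{global_vocab_size} at model_parallel_size "
            f"{model_parallel_size}), but got last dim {actual}. Was the "
            f"output layer row-parallel?"
        )
```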
@Quentin-Anthony (Member) commented

Nice catch! Looks like this issue has snuck through for a while. I'm going to make column the default and add an error for row-parallelism for now, and if you're able to add shape checks that would be much appreciated!

@Quentin-Anthony (Member) commented

Created a temporary hotfix in #915 and started tracking the permanent fix in #916

@StellaAthena (Member) commented

> Nice catch! Looks like this issue has snuck through for a while. I'm going to make column the default and add an error for row-parallelism for now, and if you're able to add shape checks that would be much appreciated!

Is there a reason to support row parallelism at all? I'm struggling to think of why a user might care other than compatibility with loss functions, but, as noted, mpu's CE loss is column-parallel. Is there another library we support that does it row-parallel?

@cbcase (Author) commented May 2, 2023

Fwiw, I think getting rid of the row-parallel final linear as an option makes the most sense -- it's not clear to me how you could write a loss function that operates on row-parallel outputs, since the output activations are partial sums over the hidden ("reduction") dimension rather than shards along any tensor axis.

Furthermore, in the simple case where you perform an immediate reduce of the parallel output (rather than have a parallel loss function), column parallel is faster since the reduction is an AllGather, not an AllReduce.
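
For concreteness, the two recovery paths look roughly like this in torch.distributed (a sketch; mp_group stands in for the model-parallel process group):

```python
import torch
import torch.distributed as dist

# Column parallel: each rank holds a vocab shard of the logits. Recovering
# the full logits is an all-gather (pure concatenation, no arithmetic).
def gather_column_parallel(shard, world_size, mp_group):
    shards = [torch.empty_like(shard) for _ in range(world_size)]
    dist.all_gather(shards, shard, group=mp_group)
    return torch.cat(shards, dim=-1)

# Row parallel: each rank holds a full-vocab *partial sum* over its hidden
# shard. Recovering the logits requires an all-reduce (summation), which
# moves roughly twice the data of an all-gather for the same result size.
def reduce_row_parallel(partial_logits, mp_group):
    dist.all_reduce(partial_logits, op=dist.ReduceOp.SUM, group=mp_group)
    return partial_logits
```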
