
change norm sharding #623

Merged 1 commit into main from lizhiyu/change_norm_sharding on Apr 26, 2024
Conversation

@ZhiyuLi-goog (Collaborator) commented Apr 26, 2024

This change switches the norm layer's scale sharding from "embed" (FSDP) to "activation_embed" (tensor parallelism), so that the scale's sharding matches the activation's sharding in the element-wise multiplication.

  • boosts multislice MFU by roughly 10% with tensor parallelism enabled, by avoiding the unexpected cross-slice DCN collective-permute that the sharding mismatch triggered in experiments
  • verified in the HLOs below
# Setup [data_dcn, fsdp_ici, tensor_ici] corresponds to [2,16,4] 

# norm weight (scale) was sharded by "embed"/fsdp, i.e. 16-way FSDP sharding

reshape.371 = bf16[12288]{0} reshape(add.49), sharding={devices=[16,8]<=[2,16,4]T(1,0,2) last_tile_dim_replicate}, metadata={op_name="jit(train_step)/jit(main)/transpose(jvp(Transformer))/decoder/while/body/checkpoint/rematted_computation/layers/pre_self_attention_norm/add" source_file="/app/maxtext/MaxText/layers/gpt3.py" source_line=91}


# the activation is sharded as ((data, fsdp), None, tensor); to match it, the sharding mismatch in the broadcast
# triggers expensive all-gather and collective-permute communication: [16(fsdp), 8(None)] -> [32(data x fsdp), 1(None), 4(tp)]

broadcast.1326 = bf16[128,2048,12288]{2,1,0} broadcast(reshape.371), dimensions={2}, sharding={devices=[32,1,4]<=[128]}, metadata={op_name="jit(train_step)/jit(main)/transpose(jvp(Transformer))/decoder/while/body/checkpoint/rematted_computation/layers/pre_self_attention_norm/mul" source_file="/app/maxtext/MaxText/layers/gpt3.py" source_line=91}
multiply.1327 = bf16[128,2048,12288]{2,1,0} multiply(multiply.1320, broadcast.1326), sharding={devices=[32,1,4]<=[128]}, metadata={op_name="jit(train_step)/jit(main)/transpose(jvp(Transformer))/decoder/while/body/checkpoint/rematted_computation/layers/pre_self_attention_norm/mul" source_file="/app/maxtext/MaxText/layers/gpt3.py" source_line=91}
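For illustration, here is a minimal Flax sketch (not the actual MaxText code) of the kind of annotation change involved: the norm scale's logical partitioning axis is switched from "embed" (weight/FSDP sharding) to "activation_embed" so it follows the activation sharding in the element-wise multiply. The RMSNorm module below is hypothetical; only the logical axis names come from this PR.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn


class RMSNorm(nn.Module):
    """Hypothetical RMSNorm, used only to illustrate the sharding-annotation change."""
    features: int
    epsilon: float = 1e-6

    @nn.compact
    def __call__(self, x):
        scale = self.param(
            'scale',
            # Before: the scale followed the weight sharding,
            #   nn.with_logical_partitioning(nn.initializers.ones, ('embed',))
            # After: it follows the activation sharding, so the element-wise
            # multiply below needs no resharding of the scale.
            nn.with_logical_partitioning(nn.initializers.ones, ('activation_embed',)),
            (self.features,),
            jnp.float32,
        )
        var = jnp.mean(jnp.square(x.astype(jnp.float32)), axis=-1, keepdims=True)
        y = x * jax.lax.rsqrt(var + self.epsilon)
        return (y * scale).astype(x.dtype)
```

With logical-to-mesh rules mapping "activation_embed" to the tensor-parallel mesh axis, the scale is laid out the same way as the last dimension of the activations, which is what removes the cross-slice collective-permute seen in the HLO above.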

@rwitten removed their assignment Apr 26, 2024
fix lint

Revert "fix lint"

This reverts commit d8dc450.

fix lint
@copybara-service bot merged commit 6570445 into main Apr 26, 2024
8 checks passed
@copybara-service bot deleted the lizhiyu/change_norm_sharding branch April 26, 2024 21:04