
Training without Pipeline Parallelism #5

Closed
kshitijkg opened this issue May 30, 2023 · 3 comments
Labels
bug Something isn't working

kshitijkg (Member) commented May 30, 2023

When training without pipeline parallelism, the sequential wrapper is used: https://github.com/floatingsnake/gpt-neox/blob/magma/megatron/training.py#L461. The code for to_sequential is here: https://github.com/floatingsnake/gpt-neox/blob/magma/megatron/model/gpt2_model.py#L343

However, all the adapters that were added are lost when this conversion is done.

This is probably because the model is rebuilt from self.specs, which wasn't updated when the adapters were added.
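
For illustration, here is a toy sketch of why rebuilding from a stored spec list drops modules that were patched in after construction. This is not the actual GPT-NeoX code; the class and variable names below are hypothetical.

```python
import torch.nn as nn

class ToyPipeModel(nn.Module):
    """Toy stand-in for a pipeline model built from a list of layer specs."""

    def __init__(self, specs):
        super().__init__()
        self.specs = specs  # layer constructors, kept around for later rebuilds
        self.layers = nn.ModuleList(spec() for spec in specs)

    def to_sequential(self):
        # Rebuilds from self.specs, so anything patched onto self.layers
        # after construction (e.g. an adapter wrapper) is lost.
        return nn.Sequential(*[spec() for spec in self.specs])

specs = [lambda: nn.Linear(512, 512)]
model = ToyPipeModel(specs)

# "Add an adapter" by wrapping the built layer in place, without touching specs.
model.layers[0] = nn.Sequential(model.layers[0], nn.Linear(512, 512))

seq = model.to_sequential()
print(type(model.layers[0]).__name__)  # Sequential -> adapter present
print(type(seq[0]).__name__)           # Linear     -> adapter lost
```

If that is what is happening, either updating self.specs when adapters are injected or rebuilding the sequential model from the live modules would presumably avoid the loss.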

kshitijkg added the bug label on May 30, 2023
floatingbigcat (Collaborator) commented May 31, 2023

Hi, I have tested on a small model with pp=1, mp=1, and the output of the model looks fine.
Did you change this line? Maybe our code no longer converts the model to sequential:
https://github.com/floatingsnake/gpt-neox/blob/73cdd8692be8a2c579444434e60a01450a8c9a3c/megatron/neox_arguments/arguments.py#L992

https://github.com/floatingsnake/gpt-neox/blob/magma/mytests/test_model_build.py
https://github.com/floatingsnake/gpt-neox/blob/magma/configs/summit-70m-openclipH.yml#L16-L17

Part of the output:

    )
    (6): ParallelTransformerLayerPipe(
      (input_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (attention): AdapterWrapper(
        (adapter): Sequential(
          (0): Linear(in_features=512, out_features=64, bias=True)
          (1): ReLU()
          (2): Linear(in_features=64, out_features=512, bias=True)
        )
        (attn_block): ParallelSelfAttention(
          (query_key_value): ColumnParallelLinear()
          (rotary_emb): RotaryEmbedding()
          (scale_mask_softmax): FusedScaleMaskSoftmax()
          (attention_dropout): Dropout(p=0, inplace=False)
          (dense): RowParallelLinear()
        )
      )
      (post_attention_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (mlp): AdapterWrapper(
        (adapter): Sequential(
          (0): Linear(in_features=512, out_features=64, bias=True)
          (1): ReLU()
          (2): Linear(in_features=64, out_features=512, bias=True)
        )
        (attn_block): ParallelMLP(
          (dense_h_to_4h): ColumnParallelLinear()
          (dense_4h_to_h): RowParallelLinear()
        )
      )
    )
    (7): ParallelTransformerLayerPipe(
      (input_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (attention): AdapterWrapper(
        (adapter): Sequential(
          (0): Linear(in_features=512, out_features=64, bias=True)
          (1): ReLU()
          (2): Linear(in_features=64, out_features=512, bias=True)
        )
        (attn_block): ParallelSelfAttention(
          (query_key_value): ColumnParallelLinear()
          (rotary_emb): RotaryEmbedding()
          (scale_mask_softmax): FusedScaleMaskSoftmax()
          (attention_dropout): Dropout(p=0, inplace=False)
          (dense): RowParallelLinear()
        )
      )
      (post_attention_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (mlp): AdapterWrapper(
        (adapter): Sequential(
          (0): Linear(in_features=512, out_features=64, bias=True)
          (1): ReLU()
          (2): Linear(in_features=64, out_features=512, bias=True)
        )
        (attn_block): ParallelMLP(
          (dense_h_to_4h): ColumnParallelLinear()
          (dense_4h_to_h): RowParallelLinear()
        )
      )
    )
    (9): NormPipe(
      (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    )
    (10): ParallelLinearPipe(
      (final_linear): ColumnParallelLinear()
    )
  )
)
Current GPU memory usage: 9.01 GB
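
For reference, here is a minimal sketch of the wrapper pattern shown in the printout above. The class and attribute names (AdapterWrapper, adapter, attn_block) and the bottleneck sizes come from the output; the forward pass (a residual adapter applied to the wrapped block's output) is an assumption, not the repository's exact implementation.

```python
import torch
import torch.nn as nn

class AdapterWrapper(nn.Module):
    """Wraps an attention or MLP block with a bottleneck adapter."""

    def __init__(self, block, hidden_size=512, bottleneck=64):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Linear(hidden_size, bottleneck),
            nn.ReLU(),
            nn.Linear(bottleneck, hidden_size),
        )
        self.attn_block = block  # the wrapped attention or MLP module

    def forward(self, x, *args, **kwargs):
        out = self.attn_block(x, *args, **kwargs)
        # Assumed residual adapter applied to the block's output.
        return out + self.adapter(out)

# Example: wrap a plain linear stand-in for the block and run a forward pass.
wrapped = AdapterWrapper(nn.Linear(512, 512))
y = wrapped(torch.randn(2, 512))
print(y.shape)  # torch.Size([2, 512])
```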

floatingbigcat (Collaborator) commented May 31, 2023

Since we have abandoned the sequential wrapper, and mp=1, pp=1 works well without it, I'm closing this. We can reopen the issue when it is needed.

kshitijkg (Member, Author) commented

Yes, I had changed that line to test the sequential wrapper. But yeah, solving this is not a high priority for now, since we are moving away from the sequential wrapper :)

kshitijkg added this to the Robin V0 milestone on Jun 16, 2023