
Training without Pipeline Parallelism #5

Closed
kshitijkg opened this issue May 30, 2023 · 3 comments
Labels
bug Something isn't working

kshitijkg (Member) commented May 30, 2023

When training without pipeline parallelism, the sequential wrapper is used: https://github.com/floatingsnake/gpt-neox/blob/magma/megatron/training.py#L461. The code for to_sequential is here: https://github.com/floatingsnake/gpt-neox/blob/magma/megatron/model/gpt2_model.py#L343

However, all the adapters that were added are lost when this conversion is done.

This is probably because the model is rebuilt from self.specs, which wasn't updated when the adapters were added.
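
For illustration, here is a toy sketch of why rebuilding from a stored spec list drops modules that were patched in after construction. This is not the actual GPT-NeoX code; the class and variable names below are hypothetical.

```python
import torch.nn as nn

class ToyPipeModel(nn.Module):
    """Toy stand-in for a pipeline model built from a list of layer specs."""

    def __init__(self, specs):
        super().__init__()
        self.specs = specs  # layer constructors, kept around for later rebuilds
        self.layers = nn.ModuleList(spec() for spec in specs)

    def to_sequential(self):
        # Rebuilds from self.specs, so anything patched onto self.layers
        # after construction (e.g. an adapter wrapper) is lost.
        return nn.Sequential(*[spec() for spec in self.specs])

specs = [lambda: nn.Linear(512, 512)]
model = ToyPipeModel(specs)

# "Add an adapter" by wrapping the built layer in place, without touching specs.
model.layers[0] = nn.Sequential(model.layers[0], nn.Linear(512, 512))

seq = model.to_sequential()
print(type(model.layers[0]).__name__)  # Sequential -> adapter present
print(type(seq[0]).__name__)           # Linear     -> adapter lost
```

If that is what is happening, either updating self.specs when adapters are injected or rebuilding the sequential model from the live modules would presumably avoid the loss.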

kshitijkg added the bug label on May 30, 2023
floatingbigcat (Collaborator) commented May 31, 2023

Hi, I have tested on a small model with pp=1, mp=1, and the output of the model looks fine.
Did you change this line? Maybe our code no longer converts the model to sequential:
https://github.com/floatingsnake/gpt-neox/blob/73cdd8692be8a2c579444434e60a01450a8c9a3c/megatron/neox_arguments/arguments.py#L992

https://github.com/floatingsnake/gpt-neox/blob/magma/mytests/test_model_build.py
https://github.com/floatingsnake/gpt-neox/blob/magma/configs/summit-70m-openclipH.yml#L16-L17

Part of the output:

    )
    (6): ParallelTransformerLayerPipe(
      (input_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (attention): AdapterWrapper(
        (adapter): Sequential(
          (0): Linear(in_features=512, out_features=64, bias=True)
          (1): ReLU()
          (2): Linear(in_features=64, out_features=512, bias=True)
        )
        (attn_block): ParallelSelfAttention(
          (query_key_value): ColumnParallelLinear()
          (rotary_emb): RotaryEmbedding()
          (scale_mask_softmax): FusedScaleMaskSoftmax()
          (attention_dropout): Dropout(p=0, inplace=False)
          (dense): RowParallelLinear()
        )
      )
      (post_attention_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (mlp): AdapterWrapper(
        (adapter): Sequential(
          (0): Linear(in_features=512, out_features=64, bias=True)
          (1): ReLU()
          (2): Linear(in_features=64, out_features=512, bias=True)
        )
        (attn_block): ParallelMLP(
          (dense_h_to_4h): ColumnParallelLinear()
          (dense_4h_to_h): RowParallelLinear()
        )
      )
    )
    (7): ParallelTransformerLayerPipe(
      (input_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (attention): AdapterWrapper(
        (adapter): Sequential(
          (0): Linear(in_features=512, out_features=64, bias=True)
          (1): ReLU()
          (2): Linear(in_features=64, out_features=512, bias=True)
        )
        (attn_block): ParallelSelfAttention(
          (query_key_value): ColumnParallelLinear()
          (rotary_emb): RotaryEmbedding()
          (scale_mask_softmax): FusedScaleMaskSoftmax()
          (attention_dropout): Dropout(p=0, inplace=False)
          (dense): RowParallelLinear()
        )
      )
      (post_attention_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (mlp): AdapterWrapper(
        (adapter): Sequential(
          (0): Linear(in_features=512, out_features=64, bias=True)
          (1): ReLU()
          (2): Linear(in_features=64, out_features=512, bias=True)
        )
        (attn_block): ParallelMLP(
          (dense_h_to_4h): ColumnParallelLinear()
          (dense_4h_to_h): RowParallelLinear()
        )
      )
    )
    (9): NormPipe(
      (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    )
    (10): ParallelLinearPipe(
      (final_linear): ColumnParallelLinear()
    )
  )
)
Current GPU memory usage: 9.01 GB
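
For reference, here is a minimal sketch of the wrapper pattern shown in the printout above. The class and attribute names (AdapterWrapper, adapter, attn_block) and the bottleneck sizes come from the output; the forward pass (a residual adapter applied to the wrapped block's output) is an assumption, not the repository's exact implementation.

```python
import torch
import torch.nn as nn

class AdapterWrapper(nn.Module):
    """Wraps an attention or MLP block with a bottleneck adapter."""

    def __init__(self, block, hidden_size=512, bottleneck=64):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Linear(hidden_size, bottleneck),
            nn.ReLU(),
            nn.Linear(bottleneck, hidden_size),
        )
        self.attn_block = block  # the wrapped attention or MLP module

    def forward(self, x, *args, **kwargs):
        out = self.attn_block(x, *args, **kwargs)
        # Assumed residual adapter applied to the block's output.
        return out + self.adapter(out)

# Example: wrap a plain linear stand-in for the block and run a forward pass.
wrapped = AdapterWrapper(nn.Linear(512, 512))
y = wrapped(torch.randn(2, 512))
print(y.shape)  # torch.Size([2, 512])
```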

floatingbigcat (Collaborator) commented May 31, 2023

Since we have abandoned the sequential wrapper, and mp=1, pp=1 works well without it, I'm closing this. We can reopen the issue when it is needed.

kshitijkg (Member, Author) commented

Yes, I had changed that line to test the sequential wrapper. But yeah, solving this is not a high priority for now, since we are moving away from the sequential wrapper :)

kshitijkg added this to the Robin V0 milestone on Jun 16, 2023