Implement Pipeline Parallelism #45

Closed
sdtblck opened this issue Jan 5, 2021 · 6 comments · Fixed by #63

sdtblck (Contributor) commented Jan 5, 2021

Should be fairly easy, as our net is already expressed in terms of layers.
https://www.deepspeed.ai/tutorials/pipeline/

@StellaAthena StellaAthena added the feature request New feature or request label Jan 6, 2021
@StellaAthena StellaAthena added this to To do in 1T or BUST via automation Jan 6, 2021
StellaAthena (Member) commented:

According to the DeepSpeed documentation,

The key constraint that enables pipeline parallelism is the representation of the forward pass as a sequence of layers and the enforcement of a simple interface between them. The forward pass is implicitly defined by the module layers. The key assumption is that the output of each layer can be directly fed as input to the next, like a torch.nn.Sequential. The forward pass is implicitly:

    x = inputs
    for layer in self.layers:
        x = layer(x)
    return x

Although our code is defined in terms of layers, our layers do not have this feedforward structure. It shouldn’t be much work to rearrange things, but it could get fiddly, especially with the token and positional embeddings. Since we never write any forward-pass code ourselves in pipeline-parallel mode, we may have to create a token embedding layer and a positional embedding layer.
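As a rough sketch of what such an embedding layer might look like (the class and argument names here are hypothetical, not from the repo), assuming the standard PyTorch pattern of a module whose output feeds straight into the next layer:

    # Hypothetical sketch: wrap the token and positional embeddings in one
    # module so the model becomes a plain sequence of layers (x = layer(x)).
    import torch
    import torch.nn as nn

    class EmbeddingPipe(nn.Module):
        def __init__(self, vocab_size, hidden_dim, max_seq_len):
            super().__init__()
            self.token_emb = nn.Embedding(vocab_size, hidden_dim)
            self.pos_emb = nn.Embedding(max_seq_len, hidden_dim)

        def forward(self, input_ids):
            # input_ids: (batch, seq_len) -> hidden states: (batch, seq_len, hidden_dim)
            positions = torch.arange(input_ids.size(1), device=input_ids.device)
            return self.token_emb(input_ids) + self.pos_emb(positions)[None, :, :]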

I should have time to try this out tomorrow.

@StellaAthena StellaAthena moved this from To do to In progress in 1T or BUST Jan 6, 2021
@StellaAthena StellaAthena self-assigned this Jan 6, 2021
sdtblck (Contributor, Author) commented Jan 6, 2021

Hey @StellaAthena, I think @anthony-dipofi is already working on this; apologies, I meant to assign him. Maybe you can check in on his progress.

anthony-dipofi (Contributor) commented:

Still working on this, but I pushed what I have currently to https://github.com/EleutherAI/gpt-neox/tree/pipeline_parrallel. The main change was to create a new model class for generating the LayerSpecs, but I tried to keep it as similar to the original model as possible.
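For context, the general DeepSpeed pattern is to describe the model as a list of deepspeed.pipe.LayerSpec entries and hand them to PipelineModule. A minimal sketch, assuming placeholder layer classes (EmbeddingPipe, TransformerBlock, LMHead) with the sequential interface quoted above; this is not the actual code on that branch:

    # Minimal LayerSpec/PipelineModule sketch; EmbeddingPipe, TransformerBlock,
    # and LMHead are placeholder nn.Module classes, not the repo's actual names.
    from deepspeed.pipe import PipelineModule, LayerSpec

    def build_pipeline_model(num_layers, vocab_size, hidden_dim, max_seq_len, num_stages):
        specs = [LayerSpec(EmbeddingPipe, vocab_size, hidden_dim, max_seq_len)]
        specs += [LayerSpec(TransformerBlock, hidden_dim) for _ in range(num_layers)]
        specs.append(LayerSpec(LMHead, hidden_dim, vocab_size))
        # LayerSpec delays construction so each pipeline stage only builds its own layers.
        return PipelineModule(layers=specs, num_stages=num_stages)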

StellaAthena (Member) commented Jan 8, 2021

> Still working on this, but I pushed what I have currently to https://github.com/EleutherAI/gpt-neox/tree/pipeline_parrallel. The main change was to create a new model class for generating the LayerSpecs, but I tried to keep it as similar to the original model as possible.

Running

    NCCL_SHM_DISABLE=1 NCCL_DEBUG=info MASTER_ADDR=127.0.0.1 MASTER_PORT=2000 deepspeed train_enwik8_pipeline.py --deepspeed --deepspeed_config configs/base_deepspeed.json

works for me on the CW server, but

    NCCL_SHM_DISABLE=1 NCCL_DEBUG=info MASTER_ADDR=127.0.0.1 MASTER_PORT=2000 deepspeed train.py --deepspeed --deepspeed_config configs/base_deepspeed.json

doesn't. Is that where things stand on your end as well?

anthony-dipofi (Contributor) commented:

Yes, the pipelining requires some changes to the training loop and to how the data is represented by the Dataset class. I wasn't sure how to integrate that with the other changes being made to data loading, so I just got it working with enwik8, which is what's in the train_enwik8_pipeline.py file. What should be required is making changes in train.py similar to those in train_enwik8_pipeline.py, the main one being using loss = model_engine.train_batch() instead of separate forward/backward passes and loss calculations. If the other data-loading changes are ready, I can look at updating train.py too, or take another approach if you'd prefer.
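Roughly, the loop change being described, assuming a deepspeed.initialize() setup and a data iterator that yields (inputs, labels) pairs; the variable names are illustrative and the repo's actual arguments may differ:

    # Sketch of the pipeline training loop: train_batch() pulls data from the
    # iterator and runs the pipelined forward, backward, and optimizer step.
    import deepspeed

    model_engine, _, _, _ = deepspeed.initialize(args=args, model=model,
                                                 model_parameters=model.parameters())
    data_iter = iter(train_loader)  # assumed to yield (inputs, labels) tuples

    for step in range(num_steps):
        # Replaces the separate forward/backward/step calls used in train.py.
        loss = model_engine.train_batch(data_iter=data_iter)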

StellaAthena (Member) commented Jan 14, 2021

This sorta works, to the point where I am going to declare it done. However, there are some problems; see #62.

1T or BUST automation moved this from In progress to Done Jan 14, 2021
@StellaAthena StellaAthena linked a pull request Jan 14, 2021 that will close this issue