Implement Pipeline Parallelism #45

Closed
sdtblck opened this issue Jan 5, 2021 · 6 comments · Fixed by #63

sdtblck (Contributor) commented Jan 5, 2021

Should be fairly easy, as our net is already expressed in terms of layers.
https://www.deepspeed.ai/tutorials/pipeline/

@StellaAthena StellaAthena added the feature request New feature or request label Jan 6, 2021
@StellaAthena StellaAthena added this to To do in 1T or BUST via automation Jan 6, 2021
StellaAthena (Member) commented:

According to the DeepSpeed documentation,

The key constraint that enables pipeline parallelism is the representation of the forward pass as a sequence of layers and the enforcement of a simple interface between them. The forward pass is implicitly defined by the module layers. The key assumption is that the output of each layer can be directly fed as input to the next, like a torch.nn.Sequential. The forward pass is implicitly:

    x = inputs
    for layer in self.layers:
        x = layer(x)
    return x

Although our code is defined in terms of layers, our layers do not have this feedforward structure. It shouldn’t be much work to rearrange things, but it could get fiddly, especially with the token and positional embeddings. Since we never write any forward-pass code ourselves in pipeline-parallel mode, we may have to create a token embedding layer and a positional embedding layer.
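As a rough sketch of what such an embedding layer might look like (the class and argument names here are hypothetical, not from the repo), assuming the standard PyTorch pattern of a module whose output feeds straight into the next layer:

    # Hypothetical sketch: wrap the token and positional embeddings in one
    # module so the model becomes a plain sequence of layers (x = layer(x)).
    import torch
    import torch.nn as nn

    class EmbeddingPipe(nn.Module):
        def __init__(self, vocab_size, hidden_dim, max_seq_len):
            super().__init__()
            self.token_emb = nn.Embedding(vocab_size, hidden_dim)
            self.pos_emb = nn.Embedding(max_seq_len, hidden_dim)

        def forward(self, input_ids):
            # input_ids: (batch, seq_len) -> hidden states: (batch, seq_len, hidden_dim)
            positions = torch.arange(input_ids.size(1), device=input_ids.device)
            return self.token_emb(input_ids) + self.pos_emb(positions)[None, :, :]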

I should have time to try this out tomorrow.

@StellaAthena StellaAthena moved this from To do to In progress in 1T or BUST Jan 6, 2021
@StellaAthena StellaAthena self-assigned this Jan 6, 2021
sdtblck (Contributor, Author) commented Jan 6, 2021

Hey @StellaAthena, I think @anthony-dipofi is already working on this; apologies, I meant to assign him. Maybe you can check in on his progress.

anthony-dipofi (Contributor) commented:

Still working on this, but I pushed what I have currently to https://github.com/EleutherAI/gpt-neox/tree/pipeline_parrallel. The main change was to create a new model class for generating the LayerSpecs, but I tried to keep it as similar to the original model as possible.
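For context, the general DeepSpeed pattern is to describe the model as a list of deepspeed.pipe.LayerSpec entries and hand them to PipelineModule. A minimal sketch, assuming placeholder layer classes (EmbeddingPipe, TransformerBlock, LMHead) with the sequential interface quoted above; this is not the actual code on that branch:

    # Minimal LayerSpec/PipelineModule sketch; EmbeddingPipe, TransformerBlock,
    # and LMHead are placeholder nn.Module classes, not the repo's actual names.
    from deepspeed.pipe import PipelineModule, LayerSpec

    def build_pipeline_model(num_layers, vocab_size, hidden_dim, max_seq_len, num_stages):
        specs = [LayerSpec(EmbeddingPipe, vocab_size, hidden_dim, max_seq_len)]
        specs += [LayerSpec(TransformerBlock, hidden_dim) for _ in range(num_layers)]
        specs.append(LayerSpec(LMHead, hidden_dim, vocab_size))
        # LayerSpec delays construction so each pipeline stage only builds its own layers.
        return PipelineModule(layers=specs, num_stages=num_stages)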

StellaAthena (Member) commented Jan 8, 2021

> Still working on this, but I pushed what I have currently to https://github.com/EleutherAI/gpt-neox/tree/pipeline_parrallel. The main change was to create a new model class for generating the LayerSpecs, but I tried to keep it as similar to the original model as possible.

Running

    NCCL_SHM_DISABLE=1 NCCL_DEBUG=info MASTER_ADDR=127.0.0.1 MASTER_PORT=2000 deepspeed train_enwik8_pipeline.py --deepspeed --deepspeed_config configs/base_deepspeed.json

works for me on the CW server, but

    NCCL_SHM_DISABLE=1 NCCL_DEBUG=info MASTER_ADDR=127.0.0.1 MASTER_PORT=2000 deepspeed train.py --deepspeed --deepspeed_config configs/base_deepspeed.json

doesn't. Is that where things stand on your end as well?

anthony-dipofi (Contributor) commented:

Yes, the pipelining requires some changes to the training loop and to how the data is represented by the Dataset class. I wasn't sure how to integrate that with the other changes being made to data loading, so I just got it working with enwik8, which is what's in the train_enwik8_pipeline.py file. What should be required is making changes in train.py similar to those in train_enwik8_pipeline.py, the main one being using loss = model_engine.train_batch() instead of separate forward/backward passes and loss calculations. If the other data-loading changes are ready, I can look at updating train.py too, or take another approach if you'd prefer.
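Roughly, the loop change being described, assuming a deepspeed.initialize() setup and a data iterator that yields (inputs, labels) pairs; the variable names are illustrative and the repo's actual arguments may differ:

    # Sketch of the pipeline training loop: train_batch() pulls data from the
    # iterator and runs the pipelined forward, backward, and optimizer step.
    import deepspeed

    model_engine, _, _, _ = deepspeed.initialize(args=args, model=model,
                                                 model_parameters=model.parameters())
    data_iter = iter(train_loader)  # assumed to yield (inputs, labels) tuples

    for step in range(num_steps):
        # Replaces the separate forward/backward/step calls used in train.py.
        loss = model_engine.train_batch(data_iter=data_iter)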

StellaAthena (Member) commented Jan 14, 2021

This sorta works, to the point where I am going to declare it done. However, there are some problems; see #62.

1T or BUST automation moved this from In progress to Done Jan 14, 2021
@StellaAthena StellaAthena linked a pull request Jan 14, 2021 that will close this issue