Add Deepspeed Transformer Kernel #43

Closed · sdtblck opened this issue Jan 5, 2021 · 4 comments

Labels: feature request (New feature or request), good first issue (Good for newcomers)

Comments
@sdtblck (Contributor) commented Jan 5, 2021

DeepSpeed has implemented a transformer kernel that invokes the CUDA kernel only once for the Q, K, and V projections, as opposed to three times (one invocation each for Q, K, and V), resulting in a 3% to 5% performance improvement in end-to-end training.

https://www.deepspeed.ai/tutorials/bert-pretraining/#enabling-deepspeeds-transformer-kernel

It would be good to integrate this into our model.
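A minimal sketch of what enabling the kernel might look like, based on the config and layer names described in the tutorial linked above; the sizes and the forward-pass shapes below are placeholders, not our real configuration, and this is untested:

```python
# Sketch only (untested): names and arguments follow the DeepSpeed BERT
# pre-training tutorial; the concrete sizes below are placeholders.
import torch
from deepspeed.ops.transformer import DeepSpeedTransformerConfig, DeepSpeedTransformerLayer

config = DeepSpeedTransformerConfig(
    batch_size=8,              # micro-batch size the kernel is built for
    hidden_size=1024,
    intermediate_size=4096,
    heads=16,
    attn_dropout_ratio=0.1,
    hidden_dropout_ratio=0.1,
    num_hidden_layers=24,
    initializer_range=0.02,
    local_rank=0,              # assumes a single-GPU run; normally args.local_rank
    seed=1234,
    fp16=False,
    pre_layer_norm=True,
)

# One fused layer; the Q, K, and V projections are handled in a single CUDA kernel.
layer = DeepSpeedTransformerLayer(config).cuda()

# BERT-style forward interface: hidden states plus an additive attention mask
# (0 for visible positions, a large negative value for masked ones).
hidden_states = torch.randn(8, 128, 1024, device="cuda")
attention_mask = torch.zeros(8, 1, 1, 128, device="cuda")
output = layer(hidden_states, attention_mask)
```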

@StellaAthena StellaAthena added the feature request New feature or request label Jan 6, 2021
@StellaAthena StellaAthena added this to To do in 1T or BUST via automation Jan 6, 2021
@StellaAthena StellaAthena added the good first issue Good for newcomers label Jan 6, 2021
@StellaAthena (Member)

I created a branch, stella-kernel, to work on this. However, I quickly realized that it might not be as trivial as it sounds, as it requires using a particular Module defined by DeepSpeed. This appears to clash with the approach @anthony-dipofi is taking in #45.

There’s also a chance that this makes our lives easier by obviating the need to write the transformer ourselves. I’m not totally sure, though, in part because my cursory search didn’t turn up any example code that uses both the DeepSpeedTransformerLayer and pipeline parallelism. I’ll look into this more this weekend and see if I can’t get a better understanding of how they fit together.

See here for further details about the DeepSpeedTransformerLayer.

@anthony-dipofi (Contributor)

I'll see if I can get the DeepSpeedTransformerLayer working with LayerSpec; it may just be a matter of replacing the other blocks with it.
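A rough, untested sketch of what that swap might look like, assuming LayerSpec can defer construction of DeepSpeedTransformerLayer the same way it does for our current blocks (class names are from DeepSpeed's pipeline API; sizes are illustrative):

```python
# Untested sketch: replace our existing block specs with DeepSpeedTransformerLayer
# specs inside a PipelineModule. Assumes a distributed launch (e.g. via
# deepspeed.init_distributed()).
import deepspeed
from deepspeed.pipe import PipelineModule, LayerSpec
from deepspeed.ops.transformer import DeepSpeedTransformerConfig, DeepSpeedTransformerLayer

deepspeed.init_distributed()

transformer_config = DeepSpeedTransformerConfig(
    batch_size=8, hidden_size=1024, intermediate_size=4096, heads=16,
    attn_dropout_ratio=0.1, hidden_dropout_ratio=0.1,
    num_hidden_layers=24, initializer_range=0.02,
    local_rank=0, seed=1234, fp16=False, pre_layer_norm=True,
)

# LayerSpec defers construction until layers are assigned to pipeline stages,
# so each rank only builds the layers it actually owns.
specs = [LayerSpec(DeepSpeedTransformerLayer, transformer_config) for _ in range(24)]

model = PipelineModule(layers=specs, num_stages=4)
```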

@anthony-dipofi (Contributor)

Maybe I am missing something obvious, but I think the DeepSpeedTransformerLayer may only be intended for non-autoregressive models like BERT. All of the examples I've found are in the context of BERT training, and I don't see any input arguments for enabling causal attention or passing in a causal mask. There is some more information at https://deepspeed.readthedocs.io/en/latest/kernel.html#deepspeed-transformer-config, and the source is at https://deepspeed.readthedocs.io/en/latest/_modules/deepspeed/ops/transformer/transformer.html#DeepSpeedTransformerConfig
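For context, the kind of causal masking an autoregressive model needs looks roughly like the plain-PyTorch sketch below (illustrative only, not part of the DeepSpeed API):

```python
import torch

def causal_additive_mask(seq_len: int, device="cpu", dtype=torch.float32):
    # Upper-triangular positions (future tokens) get -inf so softmax assigns
    # them ~zero attention weight; everything else is 0 (fully visible).
    mask = torch.full((seq_len, seq_len), float("-inf"), device=device, dtype=dtype)
    return torch.triu(mask, diagonal=1)

# causal_additive_mask(4) ->
# tensor([[0., -inf, -inf, -inf],
#         [0.,   0., -inf, -inf],
#         [0.,   0.,   0., -inf],
#         [0.,   0.,   0.,   0.]])
```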

@StellaAthena (Member)

@anthony-dipofi Per this issue on the DeepSpeed repo, it appears that you are correct. They have other kernels in internal testing that will be out "very soon," and I'm very excited given that they found 30% and 40% speed-ups on the forward and backward passes respectively! That's going to be huge. Until that's out, though, I'll close this issue.

1T or BUST automation moved this from To do to Done Jan 10, 2021