Add Deepspeed Transformer Kernel #43

Closed · sdtblck opened this issue Jan 5, 2021 · 4 comments

Labels: feature request (New feature or request), good first issue (Good for newcomers)

Comments
@sdtblck (Contributor) commented Jan 5, 2021

DeepSpeed has implemented a transformer kernel that invokes the CUDA kernel only once for the Q, K, and V projections, as opposed to three times (one invocation each for Q, K, and V), resulting in a 3% to 5% performance improvement in end-to-end training.

https://www.deepspeed.ai/tutorials/bert-pretraining/#enabling-deepspeeds-transformer-kernel

It would be good to integrate this into our model.
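A minimal sketch of what enabling the kernel might look like, based on the config and layer names described in the tutorial linked above; the sizes and the forward-pass shapes below are placeholders, not our real configuration, and this is untested:

```python
# Sketch only (untested): names and arguments follow the DeepSpeed BERT
# pre-training tutorial; the concrete sizes below are placeholders.
import torch
from deepspeed.ops.transformer import DeepSpeedTransformerConfig, DeepSpeedTransformerLayer

config = DeepSpeedTransformerConfig(
    batch_size=8,              # micro-batch size the kernel is built for
    hidden_size=1024,
    intermediate_size=4096,
    heads=16,
    attn_dropout_ratio=0.1,
    hidden_dropout_ratio=0.1,
    num_hidden_layers=24,
    initializer_range=0.02,
    local_rank=0,              # assumes a single-GPU run; normally args.local_rank
    seed=1234,
    fp16=False,
    pre_layer_norm=True,
)

# One fused layer; the Q, K, and V projections are handled in a single CUDA kernel.
layer = DeepSpeedTransformerLayer(config).cuda()

# BERT-style forward interface: hidden states plus an additive attention mask
# (0 for visible positions, a large negative value for masked ones).
hidden_states = torch.randn(8, 128, 1024, device="cuda")
attention_mask = torch.zeros(8, 1, 1, 128, device="cuda")
output = layer(hidden_states, attention_mask)
```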

@StellaAthena StellaAthena added the feature request New feature or request label Jan 6, 2021
@StellaAthena StellaAthena added this to To do in 1T or BUST via automation Jan 6, 2021
@StellaAthena StellaAthena added the good first issue Good for newcomers label Jan 6, 2021
@StellaAthena (Member)

I created a branch, stella-kernel, to work on this. However, I quickly realized that it might not be as trivial as it sounds, as it requires using a particular Module defined by DeepSpeed. This appears to clash with the approach @anthony-dipofi is taking in #45.

There’s also a chance that this makes our lives easier by obviating the need to write the transformer ourselves. I’m not totally sure, though, in part because my cursory search didn’t turn up any example code that uses both the DeepSpeedTransformerLayer and pipeline parallelism. I’ll look into this more this weekend and see if I can’t get a better understanding of how they fit together.

See here for further details about the DeepSpeedTransformerLayer.

@anthony-dipofi (Contributor)

I'll see if I can get the DeepSpeedTransformerLayer working with LayerSpec; it may just be a matter of replacing the other blocks with it.
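A rough, untested sketch of what that swap might look like, assuming LayerSpec can defer construction of DeepSpeedTransformerLayer the same way it does for our current blocks (class names are from DeepSpeed's pipeline API; sizes are illustrative):

```python
# Untested sketch: replace our existing block specs with DeepSpeedTransformerLayer
# specs inside a PipelineModule. Assumes a distributed launch (e.g. via
# deepspeed.init_distributed()).
import deepspeed
from deepspeed.pipe import PipelineModule, LayerSpec
from deepspeed.ops.transformer import DeepSpeedTransformerConfig, DeepSpeedTransformerLayer

deepspeed.init_distributed()

transformer_config = DeepSpeedTransformerConfig(
    batch_size=8, hidden_size=1024, intermediate_size=4096, heads=16,
    attn_dropout_ratio=0.1, hidden_dropout_ratio=0.1,
    num_hidden_layers=24, initializer_range=0.02,
    local_rank=0, seed=1234, fp16=False, pre_layer_norm=True,
)

# LayerSpec defers construction until layers are assigned to pipeline stages,
# so each rank only builds the layers it actually owns.
specs = [LayerSpec(DeepSpeedTransformerLayer, transformer_config) for _ in range(24)]

model = PipelineModule(layers=specs, num_stages=4)
```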

@anthony-dipofi (Contributor)

Maybe I am missing something obvious, but I think the DeepSpeedTransformerLayer may only be intended for non-autoregressive models like BERT. All of the examples I've found are in the context of BERT training, and I don't see any input arguments for enabling causal attention or passing in a causal mask. There is some more information at https://deepspeed.readthedocs.io/en/latest/kernel.html#deepspeed-transformer-config, and the source is at https://deepspeed.readthedocs.io/en/latest/_modules/deepspeed/ops/transformer/transformer.html#DeepSpeedTransformerConfig
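For context, the kind of causal masking an autoregressive model needs looks roughly like the plain-PyTorch sketch below (illustrative only, not part of the DeepSpeed API):

```python
import torch

def causal_additive_mask(seq_len: int, device="cpu", dtype=torch.float32):
    # Upper-triangular positions (future tokens) get -inf so softmax assigns
    # them ~zero attention weight; everything else is 0 (fully visible).
    mask = torch.full((seq_len, seq_len), float("-inf"), device=device, dtype=dtype)
    return torch.triu(mask, diagonal=1)

# causal_additive_mask(4) ->
# tensor([[0., -inf, -inf, -inf],
#         [0.,   0., -inf, -inf],
#         [0.,   0.,   0., -inf],
#         [0.,   0.,   0.,   0.]])
```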

@StellaAthena (Member)

@anthony-dipofi Per this issue on the DeepSpeed repo, it appears that you are correct. They have other kernels in internal testing that will be out "very soon," and I'm very excited given that they found 30% and 40% speed-ups on the forward and backward passes respectively! That's going to be huge. Until that's out, though, I'll close this issue.

1T or BUST automation moved this from To do to Done Jan 10, 2021