Add Deepspeed Transformer Kernel #43
I created a branch. There's also a chance that this makes our lives easier by obviating the need for writing the transformer ourselves. I'm not totally sure though, in part because I wasn't able to find example code that used both the [...]. See here for further details about the [...].
I'll see if I can get the DeepSpeedTransformerLayer working with LayerSpec; it may just be a matter of replacing the other blocks with it.
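A rough sketch of what that swap might look like, assuming our pipeline is built from LayerSpecs. All the dimensions below are placeholders, and the DeepSpeedTransformerLayer constructor signature has varied across DeepSpeed releases, so treat this as illustrative rather than exact:

```python
from deepspeed.pipe import PipelineModule, LayerSpec
from deepspeed.ops.transformer import (
    DeepSpeedTransformerConfig,
    DeepSpeedTransformerLayer,
)

# Hypothetical model dimensions, just for illustration.
config = DeepSpeedTransformerConfig(
    batch_size=8,
    hidden_size=1024,
    intermediate_size=4096,
    heads=16,
    attn_dropout_ratio=0.1,
    hidden_dropout_ratio=0.1,
    num_hidden_layers=24,
    initializer_range=0.02,
    local_rank=-1,
    fp16=True,
    pre_layer_norm=True,
)

# Replace each of our existing transformer-block LayerSpecs with the
# DeepSpeed kernel; the embedding and head specs (not shown) stay as-is.
specs = [LayerSpec(DeepSpeedTransformerLayer, config)
         for _ in range(config.num_hidden_layers)]
model = PipelineModule(layers=specs, num_stages=2)
```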
Maybe I am missing something obvious, but I think the DeepSpeedTransformerLayer may only be for non-autoregressive usages like BERT. All of the examples I've found have been in the context of training it for BERT, and I don't see any input arguments for using causal attention or passing in a mask. I found some more information at https://deepspeed.readthedocs.io/en/latest/kernel.html#deepspeed-transformer-config and the source is at https://deepspeed.readthedocs.io/en/latest/_modules/deepspeed/ops/transformer/transformer.html#DeepSpeedTransformerConfig
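One quick way to check this against the linked docs and source is to print the kernel's constructor and forward signatures and look for any causal-attention or mask arguments (a sketch; the output will vary by DeepSpeed version):

```python
import inspect
from deepspeed.ops.transformer import (
    DeepSpeedTransformerConfig,
    DeepSpeedTransformerLayer,
)

# Inspect the config options and the layer's forward() arguments.
print(inspect.signature(DeepSpeedTransformerConfig.__init__))
print(inspect.signature(DeepSpeedTransformerLayer.forward))
```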
@anthony-dipofi Per this issue on the DeepSpeed repo, it appears that you are correct. They have other kernels in testing internally which will be out "very soon," and I'm very excited, given that they found a 30% and 40% speed-up on the forward and backward passes respectively! That's going to be huge. Until that's out, though, I'll close this issue.
DeepSpeed implements a transformer kernel that invokes the CUDA kernel only once for the Q, K, and V values, as opposed to three times (one invocation each for Q, K, and V), resulting in a 3% to 5% performance improvement in end-to-end training.
https://www.deepspeed.ai/tutorials/bert-pretraining/#enabling-deepspeeds-transformer-kernel
Would be good to integrate this into our model.
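To illustrate the fusion being described, here is plain PyTorch pseudocode for the idea (not DeepSpeed's actual CUDA kernel; all dimensions are made up):

```python
import torch
import torch.nn as nn

hidden = 1024
x = torch.randn(8, 128, hidden)  # (batch, seq_len, hidden)

# Unfused: three separate projections, i.e. three kernel invocations.
q_proj = nn.Linear(hidden, hidden)
k_proj = nn.Linear(hidden, hidden)
v_proj = nn.Linear(hidden, hidden)
q, k, v = q_proj(x), k_proj(x), v_proj(x)

# Fused: one projection (one kernel invocation) producing Q, K, and V
# together, then split along the last dimension.
qkv_proj = nn.Linear(hidden, 3 * hidden)
q, k, v = qkv_proj(x).chunk(3, dim=-1)
```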