
Why [seq, batch, heads, hdim] instead of [batch, seq, heads, hdim]? #271

Closed
richarddwang opened this issue Apr 28, 2021 · 1 comment

@richarddwang

I have read the post Rotary Embeddings: A Relative Revolution on the EleutherAI blog; it is very detailed and well written!

In the post, there is a note "The layout of the queries and keys in GPT-NeoX, following Megatron, is [seq, batch, heads, hdim], in order to avoid memory-intensive transpose operations. The code will need to be modified to work with the conventional layout of [batch, seq, heads, hdim]".

Could you provide a brief explanation of why or when [batch, seq, heads, hdim] results in memory-intensive transpose operations? Or where can I learn more about it?

Thanks in advance!
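As a rough illustration of the two layouts in question (a numpy sketch with made-up shapes; the real code operates on PyTorch CUDA tensors, but the stride/contiguity behavior is analogous):

```python
import numpy as np

# Hypothetical shapes, for illustration only.
batch, seq, heads, hdim = 2, 16, 4, 8

# Conventional layout: [batch, seq, heads, hdim].
q = np.zeros((batch, seq, heads, hdim), dtype=np.float32)

# For the per-head Q @ K^T matmul, the heads axis must sit next to batch
# so both can be folded into the batched-matmul dimension. That requires
# a transpose, which yields a non-contiguous strided view:
q_t = q.transpose(0, 2, 1, 3)              # [batch, heads, seq, hdim]
print(q_t.flags['C_CONTIGUOUS'])           # False: only a strided view
q_c = np.ascontiguousarray(q_t)            # materializing it copies all data
print(q_c.flags['C_CONTIGUOUS'])           # True

# Megatron-style layout: [seq, batch, heads, hdim]. Here batch and heads
# are already adjacent, so folding them is a plain reshape with no copy:
q2 = np.zeros((seq, batch, heads, hdim), dtype=np.float32)
q2_folded = q2.reshape(seq, batch * heads, hdim)
print(np.shares_memory(q2, q2_folded))     # True: still a view, no copy
```

The gist: merging adjacent axes is free, while moving an axis past another forces either strided (slow) access or an explicit copy of the whole tensor.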

@richarddwang (Author)

An explanation can be found in Section 4.2 of this Megatron paper: https://arxiv.org/pdf/2104.04473.pdf
I believe it matters in the matmul of the queries and keys.
We can also experiment with the PyTorch profiler to measure the CUDA memory consumption of each operation.
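A minimal sketch of that profiling approach (run on CPU here for portability; the shapes and the permute-then-matmul pattern are illustrative assumptions, not GPT-NeoX's actual code — on GPU you would profile `ProfilerActivity.CUDA` instead):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Hypothetical shapes, for illustration only.
batch, seq, heads, hdim = 2, 64, 4, 32

# Conventional layout: [batch, seq, heads, hdim].
q = torch.randn(batch, seq, heads, hdim)
k = torch.randn(batch, seq, heads, hdim)

# Profile the attention-score matmul, including the permute + contiguous
# copies that the [batch, seq, heads, hdim] layout forces beforehand.
with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    qh = q.permute(0, 2, 1, 3).contiguous()  # [batch, heads, seq, hdim], copies
    kh = k.permute(0, 2, 1, 3).contiguous()
    scores = torch.matmul(qh, kh.transpose(-2, -1))  # [batch, heads, seq, seq]

# Per-operator memory usage, sorted by CPU memory consumed.
print(prof.key_averages().table(sort_by="cpu_memory_usage", row_limit=10))
```

The `aten::contiguous` rows in the table show the extra allocations caused by the layout; repeating the experiment with a [seq, batch, heads, hdim] tensor folded by `view` should show them disappear.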
