I have read the post Rotary Embeddings: A Relative Revolution - EleutherAI Blog; it is very detailed and nice!
In the post, there is a note: "The layout of the queries and keys in GPT-NeoX, following Megatron, is [seq, batch, heads, hdim], in order to avoid memory-intensive transpose operations. The code will need to be modified to work with the conventional layout of [batch, seq, heads, hdim]".
Could you provide a brief explanation of why or when [batch, seq, heads, hdim] results in memory-intensive transpose operations? Or where can I learn more about it?
Thanks in advance!
An explanation can be found in section 4.2 of this Megatron paper: https://arxiv.org/pdf/2104.04473.pdf
It matters in the matmul of query and key, I believe.
And we can experiment with the PyTorch profiler to measure the CUDA memory consumption of each operation.
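A minimal sketch of the issue, assuming PyTorch semantics for `permute`/`transpose`: the query–key matmul is typically computed per (batch, head) pair, so the tensor must be reshaped so that `seq` and `hdim` are the trailing dimensions. Starting from `[batch, seq, heads, hdim]`, that requires a permutation that leaves the tensor non-contiguous, and materializing it (via `.contiguous()` or an implicit copy inside `view`/`reshape`) is the memory-intensive step. The shapes below are arbitrary illustration values.

```python
import torch

seq, batch, heads, hdim = 128, 4, 8, 64

# Conventional layout: [batch, seq, heads, hdim]
q = torch.randn(batch, seq, heads, hdim)

# To batch the q @ k^T matmul over (batch, heads), we need
# [batch, heads, seq, hdim]. permute() only swaps strides, so the
# result is a non-contiguous view of the same storage:
q_t = q.permute(0, 2, 1, 3)
print(q_t.is_contiguous())  # False

# Materializing the permuted layout copies the whole tensor in memory;
# this copy is the "memory-intensive transpose" the note refers to:
q_c = q_t.contiguous()
print(q_c.is_contiguous())  # True
```

With the Megatron/GPT-NeoX layout `[seq, batch, heads, hdim]`, the needed rearrangement to `[seq, batch*heads, hdim]` is just a `view` over already-contiguous memory, so no such copy is required.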