I have read the post Rotary Embeddings: A Relative Revolution - EleutherAI Blog; it is very detailed and nice!
In the post, there is a note: "The layout of the queries and keys in GPT-NeoX, following Megatron, is [seq, batch, heads, hdim], in order to avoid memory-intensive transpose operations. The code will need to be modified to work with the conventional layout of [batch, seq, heads, hdim]".
Could you provide a brief explanation of why or when [batch, seq, heads, hdim] results in memory-intensive transpose operations? Or where can I learn more about it?
Thanks in advance!
An explanation can be found in section 4.2 of this Megatron paper: https://arxiv.org/pdf/2104.04473.pdf
It matters in the matmul of query and key, I believe.
And we can experiment with the PyTorch profiler to measure the CUDA memory consumption of each operation.
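A minimal sketch of the issue, assuming PyTorch semantics for `permute`/`transpose`: the query–key matmul is typically computed per (batch, head) pair, so the tensor must be reshaped so that `seq` and `hdim` are the trailing dimensions. Starting from `[batch, seq, heads, hdim]`, that requires a permutation that leaves the tensor non-contiguous, and materializing it (via `.contiguous()` or an implicit copy inside `view`/`reshape`) is the memory-intensive step. The shapes below are arbitrary illustration values.

```python
import torch

seq, batch, heads, hdim = 128, 4, 8, 64

# Conventional layout: [batch, seq, heads, hdim]
q = torch.randn(batch, seq, heads, hdim)

# To batch the q @ k^T matmul over (batch, heads), we need
# [batch, heads, seq, hdim]. permute() only swaps strides, so the
# result is a non-contiguous view of the same storage:
q_t = q.permute(0, 2, 1, 3)
print(q_t.is_contiguous())  # False

# Materializing the permuted layout copies the whole tensor in memory;
# this copy is the "memory-intensive transpose" the note refers to:
q_c = q_t.contiguous()
print(q_c.is_contiguous())  # True
```

With the Megatron/GPT-NeoX layout `[seq, batch, heads, hdim]`, the needed rearrangement to `[seq, batch*heads, hdim]` is just a `view` over already-contiguous memory, so no such copy is required.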