add flash_attn_kvpacked #862
Conversation
Thanks so much for the PR! This looks great and I'm testing it today. Though @satpalsr can you merge main into your branch and resolve the conflicts against the recent changes that add Flash Attention Triton support?
Signed-off-by: Dashiell Stander <[email protected]>
I find this commit relevant to my issue (#883).
@dashstander I've removed the conflicts. Please test it out now.
Signed-off-by: Dashiell Stander <[email protected]>
megatron/model/transformer.py (Outdated)

# Combined k/v into [b * sk, 2, np, hn].
kv = torch.concat([key_layer, value_layer], dim=1)

output = self.flash_attn_unpadded_kvpacked_func(
this needs to be self.flash_attention_function
Will run the inference and push any changes as needed.
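For reference, a minimal sketch of the kv-packed call pattern the diff above uses, assuming the flash-attn v1 unpadded interface (flash_attn_unpadded_kvpacked_func, which was renamed in later releases); the tensor names and shapes mirror the comment in the snippet, and all sizes are illustrative.

import torch
from flash_attn.flash_attn_interface import flash_attn_unpadded_kvpacked_func

b, sq, sk, np_, hn = 2, 16, 16, 8, 64   # batch, q length, k length, heads, head dim

# Unpadded layout: queries flattened to [b * sq, np, hn]; key/value each kept
# as [b * sk, 1, np, hn] so that concatenating along dim 1 packs them into a
# kv tensor of shape [b * sk, 2, np, hn], matching the comment in the diff.
query_layer = torch.randn(b * sq, np_, hn, device="cuda", dtype=torch.float16)
key_layer = torch.randn(b * sk, 1, np_, hn, device="cuda", dtype=torch.float16)
value_layer = torch.randn(b * sk, 1, np_, hn, device="cuda", dtype=torch.float16)
kv = torch.concat([key_layer, value_layer], dim=1)

# Cumulative sequence lengths mark where each sequence starts in the flattened batch.
cu_seqlens_q = torch.arange(0, (b + 1) * sq, sq, dtype=torch.int32, device="cuda")
cu_seqlens_k = torch.arange(0, (b + 1) * sk, sk, dtype=torch.int32, device="cuda")

output = flash_attn_unpadded_kvpacked_func(
    query_layer, kv, cu_seqlens_q, cu_seqlens_k, sq, sk,
    0.0,                 # dropout_p
    softmax_scale=None,
    causal=True,
)                        # output: [b * sq, np, hn]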
Signed-off-by: Dashiell Stander <[email protected]>
Signed-off-by: Dashiell Stander <[email protected]>
There are a few outstanding bugs that I have pointed out, though the general gist is ensuring that this works for both training and inference and that the code switches between those modes correctly. Some of that is actually not an issue with this PR and seems to be an issue with the
Signed-off-by: Dashiell Stander <[email protected]>
Just had a productive conversation with @satpalsr. Currently the implementation relies on the
Where does this happen? I was expecting it to be in
@StellaAthena It's an attribute built into
Edit: Nevermind, this is wrong. The Triton function we import has the Q / K / V matrices split up.
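To make the packed-vs-split distinction concrete, here is a small pure-PyTorch sketch of pulling separate Q / K / V tensors out of a fused projection output, which is the form a kernel taking q, k, v separately consumes. The [sq, b, np, 3 * hn] layout is illustrative only and not necessarily the exact layout NeoX uses.

import torch

sq, b, np_, hn = 16, 2, 8, 64            # seq len, batch, heads, head dim
mixed_x_layer = torch.randn(sq, b, np_, 3 * hn)   # fused QKV projection output

# Split the last dimension into three equal chunks: Q, K, V, each [sq, b, np, hn].
query_layer, key_layer, value_layer = torch.split(mixed_x_layer, hn, dim=-1)
assert query_layer.shape == (sq, b, np_, hn)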
Signed-off-by: Dashiell Stander <[email protected]>
Signed-off-by: Dashiell Stander <[email protected]>
I tested and confirmed this all worked with these changes.
Signed-off-by: Dashiell Stander <[email protected]>
Signed-off-by: Dashiell Stander <[email protected]>
@satpalsr added me as a collaborator on their fork, so I made the changes I was requesting. Looks good to me from here.
Dash has tested and approved this code and I don't see anything that stands out as problematic. You should be good to go ahead and merge it.
Thanks @dashstander for completing this.
Issue
Inference with flash attention gives an error due to improper qkv shapes, since the key and value are updated from the layer past.
To Reproduce:
python deepy.py generate.py configs/70M-deduped.yml -i input_prompt.txt -o prompt_out.txt
with some text in input_prompt.txt.
This PR separates the query from the packed qkv matrix to resolve the issue.
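A hedged sketch of the fix described above: keep the query separate from the key/value so that the cached key/value from the layer past can grow during generation without breaking the shapes flash attention expects. The name layer_past and the [s, b, np, hn] layout are illustrative, not the exact NeoX code.

import torch

def update_kv_with_layer_past(query_layer, key_layer, value_layer, layer_past=None):
    # query_layer: [sq, b, np, hn] for the new tokens only
    # key_layer / value_layer: [sk, b, np, hn] for the new tokens only
    if layer_past is not None:
        past_key, past_value = layer_past
        # Prepend the cached keys/values along the sequence dimension, so the
        # key/value length sk grows while the query length sq can stay at 1
        # during incremental decoding.
        key_layer = torch.cat([past_key, key_layer], dim=0)
        value_layer = torch.cat([past_value, value_layer], dim=0)
    present = (key_layer, value_layer)
    return query_layer, key_layer, value_layer, present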