Switch to using Cuda Flash Attn for Alibi #1183

Merged 2 commits into main on Mar 13, 2024

Conversation

haileyschoelkopf (Contributor)

Since 2.4.0.post1, flash-attn has supported `alibi_slopes` in its CUDA kernels, removing the need for the Triton backend. This PR switches to the CUDA flash functions rather than Triton for compatible flash-attn versions, but retains compatibility for ALiBi + flash-attn < 2.4.0 via Triton for now, with a warning.
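
Below is a minimal sketch (not the actual PR diff) of the version-gated dispatch described above: call the CUDA `flash_attn_func` with `alibi_slopes` when flash-attn >= 2.4.0.post1 is installed, and otherwise warn and fall back to the Triton path. The fallback helper `triton_flash_attn_with_bias` is a hypothetical stand-in for the older Triton-based code path, not a real flash-attn function.

```python
import warnings

from packaging import version

import flash_attn
from flash_attn import flash_attn_func  # CUDA kernels; alibi_slopes landed in 2.4.0.post1


def flash_alibi_attention(q, k, v, alibi_slopes, causal=True):
    """q, k, v: (batch, seqlen, nheads, headdim); alibi_slopes: (nheads,) fp32 tensor."""
    if version.parse(flash_attn.__version__) >= version.parse("2.4.0.post1"):
        # The CUDA kernel applies the ALiBi bias internally, so no explicit
        # bias tensor (and no Triton backend) is needed.
        return flash_attn_func(q, k, v, causal=causal, alibi_slopes=alibi_slopes)
    warnings.warn(
        "flash-attn < 2.4.0.post1 lacks CUDA ALiBi support; "
        "falling back to the Triton backend."
    )
    # Hypothetical stand-in for the pre-2.4.0 Triton path, which takes an
    # explicit ALiBi bias rather than per-head slopes.
    return triton_flash_attn_with_bias(q, k, v, alibi_slopes)
```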

Future PRs could add the updated Triton implementation from the Triton repo, which could be faster on H100s; it is distinct from the Triton implementation in the flash-attention repo.

@Quentin-Anthony merged commit 03186de into main on Mar 13, 2024 (2 checks passed).
@Quentin-Anthony deleted the flash-alibi-cuda branch on March 13, 2024 at 01:33.