# cuda-core
Here are 2 public repositories matching this topic...
Decoding Attention is specifically optimized for multi-head attention (MHA) using CUDA cores for the decoding stage of LLM inference; a minimal CUDA-core kernel sketch follows this entry.
Topics: gpu, cuda, inference, nvidia, mha, multi-head-attention, llm, large-language-model, flash-attention, cuda-core, decoding-attention, flashinfer
Updated Nov 5, 2024 - C++
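
To illustrate why decode-stage MHA is a natural fit for CUDA cores, here is a minimal sketch of a single-query attention kernel. It is not the repository's actual implementation; the kernel name, memory layout, and dimensions are assumptions. The point it shows is that at decode time the query is a single token, so the score computation degenerates to GEMV-like dot products over the KV cache, which plain FP32 FMAs on CUDA cores handle well (no tensor cores / WMMA needed).

```cuda
// Minimal sketch of decode-stage attention for one query token.
// Hypothetical layout: one block per head, HEAD_DIM threads per block.
// q:       [num_heads, HEAD_DIM]           current query token
// k_cache: [num_heads, seq_len, HEAD_DIM]  cached keys
// v_cache: [num_heads, seq_len, HEAD_DIM]  cached values
// out:     [num_heads, HEAD_DIM]           attention output
#include <cuda_runtime.h>
#include <math.h>

#define HEAD_DIM 128   // assumed head dimension
#define THREADS  128   // one thread per output element

__global__ void decode_attention_kernel(const float* __restrict__ q,
                                        const float* __restrict__ k_cache,
                                        const float* __restrict__ v_cache,
                                        float* __restrict__ out,
                                        int seq_len) {
    int head = blockIdx.x;
    int tid  = threadIdx.x;

    const float* q_h = q + head * HEAD_DIM;
    const float* k_h = k_cache + (size_t)head * seq_len * HEAD_DIM;
    const float* v_h = v_cache + (size_t)head * seq_len * HEAD_DIM;

    extern __shared__ float scores[];        // seq_len attention scores
    float scale = rsqrtf((float)HEAD_DIM);

    // 1) scores[t] = (q . k_t) * scale -- each thread owns some positions t,
    //    computed with plain CUDA-core FMAs.
    for (int t = tid; t < seq_len; t += blockDim.x) {
        float dot = 0.f;
        const float* k_t = k_h + (size_t)t * HEAD_DIM;
        for (int d = 0; d < HEAD_DIM; ++d)
            dot = fmaf(q_h[d], k_t[d], dot);
        scores[t] = dot * scale;
    }
    __syncthreads();

    // 2) Softmax statistics over the score row (redundant per thread, kept
    //    simple for the sketch).
    float max_s = -INFINITY;
    for (int t = 0; t < seq_len; ++t) max_s = fmaxf(max_s, scores[t]);
    float sum = 0.f;
    for (int t = 0; t < seq_len; ++t) sum += expf(scores[t] - max_s);

    // 3) out[d] = sum_t softmax(scores)[t] * v_t[d] -- one output element per thread.
    if (tid < HEAD_DIM) {
        float acc = 0.f;
        for (int t = 0; t < seq_len; ++t) {
            float w = expf(scores[t] - max_s) / sum;
            acc = fmaf(w, v_h[(size_t)t * HEAD_DIM + tid], acc);
        }
        out[head * HEAD_DIM + tid] = acc;
    }
}

// Launch sketch: one block per head, dynamic shared memory for the score row.
// decode_attention_kernel<<<num_heads, THREADS, seq_len * sizeof(float)>>>(
//     d_q, d_k_cache, d_v_cache, d_out, seq_len);
```

Production kernels such as Decoding Attention or FlashInfer split long KV caches across blocks, keep running softmax statistics, and use vectorized loads, but the arithmetic remains scalar/FMA work on CUDA cores rather than tensor-core GEMMs.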