Happy needs no error!
Stars
4 stars written in CUDA
How to optimize some algorithms in CUDA.
🎉 Modern CUDA Learn Notes with PyTorch: fp32, fp16, bf16, fp8/int8, flash_attn, sgemm, sgemv, warp/block reduce, dot, elementwise, softmax, layernorm, rmsnorm.
Flash Attention in ~100 lines of CUDA (forward pass only)