This repo contains the code and LaTeX files for the Transformer Tricks papers.

Flash normalization:
- arXiv paper: https://arxiv.org/abs/2407.09577
- Notebook:
- HuggingFace repo:
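A minimal sketch of the flash normalization idea, under the assumption that the paper's trick is folding RMSNorm's elementwise weights into the linear layer that follows it, so inference only needs the division by the RMS. The names `rmsnorm` and `flashnorm_merge` are illustrative, not from the repo:

```python
import numpy as np

def rmsnorm(x, g, eps=1e-6):
    # standard RMSNorm: divide by the RMS of x, then scale elementwise by g
    return x / np.sqrt(np.mean(x**2) + eps) * g

def flashnorm_merge(W, g):
    # fold the RMSNorm weights g into the following linear layer W:
    # (x/rms(x) * g) @ W  ==  (x/rms(x)) @ (diag(g) @ W)
    return np.diag(g) @ W

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
g = rng.standard_normal(8)
W = rng.standard_normal((8, 4))

eps = 1e-6
y_ref = rmsnorm(x, g, eps) @ W                          # normalize, scale, project
W_merged = flashnorm_merge(W, g)                        # done once, offline
y_fast = (x / np.sqrt(np.mean(x**2) + eps)) @ W_merged  # no per-token scaling by g
assert np.allclose(y_ref, y_fast)
```

The merge happens once at load time, so the per-token cost of the normalization weights disappears; see the paper for the exact formulation.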

Approximate attention [work in progress]:

Removing weights for skipless transformers:
- arXiv paper: https://arxiv.org/abs/2404.12362
- Notebook:

Precomputing the first layer:
- arXiv paper: https://arxiv.org/abs/2402.13388
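A toy sketch of the precomputation idea: since the first layer's linear projections see only the token embedding (assuming positional information, e.g. RoPE, is applied after the projection), their outputs can be computed once per vocabulary entry and stored as a lookup table. All sizes and names below are illustrative:

```python
import numpy as np

vocab, d, d_head = 100, 16, 16
rng = np.random.default_rng(1)
E = rng.standard_normal((vocab, d))    # embedding table
Wq = rng.standard_normal((d, d_head))  # first-layer query projection (illustrative)

# Precompute the first-layer queries for every vocabulary entry once, offline.
Q_table = E @ Wq

# At inference, a table lookup replaces the first-layer matmul for this projection.
token = 42
assert np.allclose(E[token] @ Wq, Q_table[token])
```

The same lookup applies to the other first-layer projections; the trade-off is extra table storage versus skipping those matmuls per token, as detailed in the paper.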