Release tinygrad 0.9.0 · tinygrad/tinygrad

Close to the new line limit of 8000 lines, sitting at 7958 lines.
tinygrad is much more usable now.

Just over 1200 commits since 0.8.0.

Release Highlights

New documentation: https://docs.tinygrad.org
gpuctypes has been brought in tree and is no longer an external dependency. [#3253]
Experimental AMD=1 and NV=1 backends that do not require any userspace runtime components like ROCm or CUDA.
- These backends should reduce the time spent in Python, especially in multi-GPU use cases.
PTX=1 for rendering directly to PTX instead of CUDA. [#3139] [#3623] [#3775]
Nvidia tensor core support. [#3544]
THREEFRY=1 for numpy-less random number generation using threefry2x32. [#2601] [#3785]
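To illustrate the threefry2x32 generator named above, here is a minimal pure-Python sketch of the 20-round variant, following the published Threefry design (rotation constants and key-schedule parity constant from Random123). It is not tinygrad's implementation: the function names, the counter layout, the round count, and the float conversion are all assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF
# Rotation constants for Threefry-2x32, applied in alternating groups of four rounds.
ROTATIONS = ((13, 15, 26, 6), (17, 29, 16, 24))
KS_PARITY = 0x1BD11BDA  # key-schedule parity constant for 32-bit words

def rotl32(x, r):
    """Rotate a 32-bit word left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32

def threefry2x32(key, counter):
    """Map a (key, counter) pair of two 32-bit words each to two 32-bit words."""
    k0, k1 = key
    ks = (k0, k1, k0 ^ k1 ^ KS_PARITY)
    x0 = (counter[0] + ks[0]) & MASK32
    x1 = (counter[1] + ks[1]) & MASK32
    for i in range(5):  # 5 groups of 4 rounds = 20 rounds
        for r in ROTATIONS[i % 2]:
            x0 = (x0 + x1) & MASK32
            x1 = rotl32(x1, r) ^ x0
        # Key injection after every group of four rounds.
        x0 = (x0 + ks[(i + 1) % 3]) & MASK32
        x1 = (x1 + ks[(i + 2) % 3] + i + 1) & MASK32
    return x0, x1

def uniform_floats(key, n):
    """Hypothetical helper: n floats in [0, 1) from the top 24 bits of each word."""
    out = []
    for ctr in range((n + 1) // 2):
        for word in threefry2x32(key, (ctr, 0)):
            out.append((word >> 8) / (1 << 24))
    return out[:n]
```

Because the generator is counter-based, the same key and counter always produce the same words, with no generator state to carry around and no numpy dependency.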
A more stable multi-tensor API.
- With ring all-reduce: [#3000] [#3852]
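For readers unfamiliar with ring all-reduce, the sketch below simulates the textbook algorithm on plain Python lists: a reduce-scatter phase followed by an all-gather phase, each taking N-1 steps around the ring. This is a generic illustration, not tinygrad's multi-tensor code; `ring_all_reduce` and its list-of-lists "device buffers" are hypothetical names, and the sketch assumes the buffer length is divisible by the number of devices.

```python
def ring_all_reduce(buffers):
    """Return per-device copies of the elementwise sum of all device buffers."""
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "sketch assumes the buffer splits evenly into n chunks"
    chunk = size // n
    bufs = [list(b) for b in buffers]  # work on copies, leave the inputs untouched

    def seg(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Reduce-scatter: at step s, device i sends chunk (i - s) % n to its right
    # neighbour, which accumulates it. Afterwards device i owns the fully
    # reduced chunk (i + 1) % n.
    for s in range(n - 1):
        sent = [bufs[i][seg((i - s) % n)] for i in range(n)]  # snapshot: sends are simultaneous
        for i in range(n):
            dst, c = (i + 1) % n, (i - s) % n
            bufs[dst][seg(c)] = [a + b for a, b in zip(bufs[dst][seg(c)], sent[i])]

    # All-gather: at step s, device i forwards chunk (i + 1 - s) % n to its right
    # neighbour, which overwrites its stale copy with the reduced one.
    for s in range(n - 1):
        sent = [bufs[i][seg((i + 1 - s) % n)] for i in range(n)]
        for i in range(n):
            dst, c = (i + 1) % n, (i + 1 - s) % n
            bufs[dst][seg(c)] = sent[i]
    return bufs
```

Each device sends and receives only one chunk per step, so the per-device traffic stays roughly constant as the number of devices grows, which is the usual motivation for the ring layout.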
Core tinygrad has been refactored into 4 pieces. Read more about it here.
The linearizer and codegen now support generating kernels with multiple outputs.
Lots of progress towards greater kernel fusion in the scheduler.
- Fusing of ReduceOps with their elementwise children. This trains MNIST and GPT-2 with ~20% fewer kernels and makes Llama inference faster.
- New LoadOps.ASSIGN allows fusing optimizer updates with grad.
- Schedule kernels in BFS order. This improves ResNet and Llama speed.
- W.I.P. for fusing multiple reduces: [#4259] [#4208]
MLPerf ResNet and BERT, with a W.I.P. UNet3D.
Llama 3 support with a new llama3.py that provides an OpenAI-compatible API. [#4576]
NF4 quantization support in Llama examples. [#4540]
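As background on NF4, the sketch below shows the general idea of blockwise 4-bit NormalFloat quantization: each block is scaled by its absolute maximum and every value is mapped to the nearest of 16 fixed code points. It is not the code used in the Llama examples. The code table is the commonly published NF4 table rounded to four decimals, and the function names and the default block size of 64 are assumptions for illustration.

```python
# NF4 code points (rounded to 4 decimals); index 7 is exactly zero.
NF4_CODES = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]

def nf4_quantize(values, block_size=64):
    """Return (4-bit code indices, per-block absmax scales) for a flat list of floats."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) or 1.0  # avoid dividing by zero on all-zero blocks
        scales.append(scale)
        for v in block:
            x = v / scale  # normalised into [-1, 1]
            codes.append(min(range(16), key=lambda i: abs(NF4_CODES[i] - x)))
    return codes, scales

def nf4_dequantize(codes, scales, block_size=64):
    """Reconstruct approximate floats from code indices and per-block scales."""
    return [NF4_CODES[c] * scales[i // block_size] for i, c in enumerate(codes)]
```

In a real implementation two 4-bit codes would be packed into each byte; the sketch keeps one index per list entry for readability.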
label_smoothing has been added to sparse_categorical_crossentropy. [#3568]
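For reference, label smoothing in its standard formulation replaces the one-hot target with a mix of the one-hot target and a uniform distribution over classes. The pure-Python sketch below computes that loss for a single example. It illustrates the general technique only: `smoothed_sparse_ce` is a hypothetical name, and tinygrad's own method may differ in details such as reduction and ignored labels.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def smoothed_sparse_ce(logits, label, label_smoothing=0.0):
    """Cross-entropy against (1 - eps) * one_hot(label) + eps * uniform."""
    logp = log_softmax(logits)
    nll = -logp[label]                     # usual sparse cross-entropy term
    uniform = -sum(logp) / len(logp)       # cross-entropy against the uniform target
    return (1.0 - label_smoothing) * nll + label_smoothing * uniform
```

With `label_smoothing=0.0` this reduces to ordinary sparse categorical cross-entropy; a positive value penalises very confident predictions, even correct ones.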

Using tinygrad in a conda env on macOS is known to cause problems with the METAL backend. See #2226.