This is the current state of my FP8 branch. It's far from ready, but it's at the point where you could take a look if you're curious! The last functionally correct version was f7c53e3 from several hours ago, but that was missing some key optimisations and refactoring, so I'd focus on this one instead (3b286d7).
In terms of performance, this version should be representative of what's possible even though it isn't functionally correct yet, and it's currently >33% faster than BF16 on 1xH100 (plus huge memory savings)! 🚀 And that's still with BF16 attention (which also limits which inputs/outputs can be FP8), so there's plenty more performance left on the table.
There are a number of things that need discussion and refactoring/rearchitecting, such as absmax_history (especially how tensors are referenced via pointers, "associated tensors", etc.), CudaScratchAllocator, and how to handle checkpointing and determinism.
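For anyone unfamiliar with the idea behind an absmax history, here is a minimal Python sketch of delayed scaling as it is commonly described for FP8 training. It is an illustration only, not the code in this branch: the class name `AbsmaxHistory`, the `window` parameter, and the choice of a plain max over the window are all assumptions made for the example.

```python
from collections import deque

# Largest finite magnitude representable in the FP8 E4M3 format.
FP8_E4M3_MAX = 448.0


class AbsmaxHistory:
    """Hypothetical rolling window of absmax values for a single tensor."""

    def __init__(self, window=16):
        # Only the most recent `window` absmax values are kept.
        self.values = deque(maxlen=window)

    def record(self, tensor_values):
        # Store the largest absolute value seen in this step's tensor.
        self.values.append(max(abs(v) for v in tensor_values))

    def scale(self):
        # Scale so the largest recent value maps onto the FP8 maximum.
        # Fall back to 1.0 when there is no history or it is all zeros.
        amax = max(self.values, default=0.0)
        return FP8_E4M3_MAX / amax if amax > 0.0 else 1.0


history = AbsmaxHistory(window=2)
history.record([1.0, -2.0])
scale_for_next_step = history.scale()
```

The point of keeping a history is that the scale for the current step can be derived from earlier steps, so the tensor does not need an extra pass to find its absmax before being converted to FP8.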
I'll write down some more explanations tomorrow, but I'll be away until Tuesday, so I won't make any further progress on the code until then! 🙂 Also keep in mind that ~50% of the extra code is in /dev/cuda/ (including experimental kernels I didn't end up using), so it's a very big change, but not quite as big as the line count implies.