Add T5 using the abstract toolkit (factory) pattern. T5 has a custom LN impl and no biases anywhere. It also doesn't
scale the multi-headed attention, and it uses a relative position bias instead of learned positional embeddings.
It's a pre-layer-norm model and otherwise fairly vanilla.
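
For reviewers unfamiliar with these differences, here is a rough pure-Python sketch. It is not the code in this PR. It assumes the custom LN is the RMS-style, scale-only norm found in public T5 implementations, and the names `t5_layer_norm`, `unscaled_attention_scores`, and `rel_bias` are made up for illustration.

```python
import math

def t5_layer_norm(x, weight, eps=1e-6):
    """Scale-only norm (assumed form): no mean subtraction and no bias term."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for w, v in zip(weight, x)]

def unscaled_attention_scores(q, k, rel_bias):
    """Dot-product scores with no 1/sqrt(d) factor, plus an additive bias.

    q: list of query vectors, k: list of key vectors,
    rel_bias[i][j]: hypothetical precomputed bias for query i and key j.
    """
    return [
        [sum(a * b for a, b in zip(qi, kj)) + rel_bias[i][j]
         for j, kj in enumerate(k)]
        for i, qi in enumerate(q)
    ]

normed = t5_layer_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
scores = unscaled_attention_scores([[1.0, 2.0]], [[3.0, 4.0], [0.0, 1.0]], [[0.5, -1.0]])
```

The scores would then go through softmax as usual; only the scaling and the bias term differ from a standard MHA.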
To make this work, I needed to change the FFN impl, MHA impl and the LN impl. The implementation so far can
load and run a T5 checkpoint.
I also refactored the encoder-decoder so that the encoder output is pre-computed once. This is much more efficient than recomputing
the encoder embeddings at every step of greedy decode. This affects the BART completer example, which was refactored
accordingly.
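
To illustrate the idea, a minimal sketch of greedy decode with the encoder hoisted out of the loop might look like the following. It is a toy under stated assumptions, not the refactored code: `greedy_decode`, `toy_encode`, and `toy_decode_step` are hypothetical names, and the "model" just copies the source tokens.

```python
def greedy_decode(encode, decode_step, src, bos, eos, max_len=20):
    """Greedy decode that runs the encoder once and reuses its output."""
    memory = encode(src)  # computed once, outside the step loop
    out = [bos]
    for _ in range(max_len):
        scores = decode_step(memory, out)  # decoder only; no re-encoding
        next_tok = max(range(len(scores)), key=scores.__getitem__)
        out.append(next_tok)
        if next_tok == eos:
            break
    return out

# Toy stand-ins (hypothetical) so the sketch runs end to end.
calls = {"encode": 0}

def toy_encode(src):
    calls["encode"] += 1
    return list(src)

def toy_decode_step(memory, prefix):
    vocab_size, eos = 6, 1
    pos = len(prefix) - 1
    target = memory[pos] if pos < len(memory) else eos
    return [1.0 if tok == target else 0.0 for tok in range(vocab_size)]

result = greedy_decode(toy_encode, toy_decode_step, [3, 4, 5], bos=0, eos=1)
```

Here `calls["encode"]` stays at 1 however many tokens are generated, whereas calling the full encoder-decoder at each step would run the encoder once per generated token.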