Experimental T5 Pre-Trained Model Checkpoints

Below are some pointers to checkpoints for experimental models we have trained after writing our paper. We have found that these models can produce better performance in some cases. These checkpoints are not officially supported - use at your own risk!

t5.1.1.*

Similar to the models described in our paper, with the following improvements:

  • GEGLU activation in the feed-forward hidden layer, rather than ReLU - see https://arxiv.org/abs/2002.05202 (a sketch follows this list).

  • Dropout was turned off in pre-training (quality win). Dropout should be re-enabled during fine-tuning.

  • Pre-trained on C4 only without mixing in the downstream tasks.

  • No parameter sharing between the embedding and classifier layers.

  • "xl" and "xxl" replace "3B" and "11B". The model shapes are a bit different - larger d_model and smaller num_heads and d_ff.

The checkpoints are located here:

LM-Adapted: t5.1.1.lm100k

These "LM adapted" models are initialized from t5.1.1 (above) and train for an additional 100K steps on the LM objective discussed in the T5 paper. This adaptation improves the ability of the model to be used for prompt tuning.

Talking Heads: t5.1.th.*

Variation on the t5.1.1 models using talking-heads attention (https://arxiv.org/abs/2003.02436).
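For reference, a minimal NumPy sketch of the talking-heads idea: learned projections mix attention logits and attention weights across the heads dimension, before and after the softmax. The head counts (h_k, h, h_v) and shapes follow the paper's description, but this is an illustrative sketch rather than the released Mesh TensorFlow implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def talking_heads_attention(q, k, v, p_logits, p_weights):
    """q, k: [batch, h_k, len, d_k]; v: [batch, h_v, len, d_v];
    p_logits: [h_k, h]; p_weights: [h, h_v]."""
    logits = np.einsum("bhqd,bhkd->bhqk", q, k) / np.sqrt(q.shape[-1])
    logits = np.einsum("bhqk,hj->bjqk", logits, p_logits)     # mix heads before softmax
    weights = softmax(logits, axis=-1)
    weights = np.einsum("bjqk,jv->bvqk", weights, p_weights)  # mix heads after softmax
    return np.einsum("bvqk,bvkd->bvqd", weights, v)

# Toy usage with made-up head counts and dimensions.
b, L, d, h_k, h, h_v = 2, 6, 4, 3, 5, 3
rng = np.random.default_rng(0)
q, k = rng.normal(size=(2, b, h_k, L, d))
v = rng.normal(size=(b, h_v, L, d))
out = talking_heads_attention(q, k, v, rng.normal(size=(h_k, h)), rng.normal(size=(h, h_v)))
# out.shape == (b, h_v, L, d)
```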

First Layers Narrow: t5.1.n4w10.*

Variation on the t5.1.1 models. Each of the encoder and decoder consists of 14 layer groups, with the last ten twice as "wide" as the first four (double d_ff and num_heads). Parameter count and computation are kept similar to the corresponding t5.1.1 models. For the base model, this increases the number of layers, resulting in better quality. For the large and xl models, it decreases the number of layers from 24 to 14, which hurts quality but also reduces the amount of communication needed for model parallelism.
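As a rough picture of that layout, here is a small sketch of the per-stack layer-group widths. The base d_ff and num_heads values are made up for illustration; only the 4-narrow / 10-wide arrangement is taken from the description above:

```python
# Hypothetical base width for the narrow groups; not the released model dimensions.
NARROW = {"d_ff": 1024, "num_heads": 8}
WIDE = {"d_ff": 2 * NARROW["d_ff"], "num_heads": 2 * NARROW["num_heads"]}

# 14 layer groups per stack (encoder and decoder each): 4 narrow, then 10 wide.
layer_groups = [NARROW] * 4 + [WIDE] * 10
assert len(layer_groups) == 14
```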