
What's an example where two different tensors would have different values of width_mult? #5

Answered by thegregyang
davisyoshida asked this question in Q&A

Hi @davisyoshida, one example: double the d_model of a Transformer but quadruple the d_ffn (where the MLP dimensions are d_model -> d_ffn -> d_model). Because we calculate width_mult from the fan-in dimension, the first nn.Linear.weight in the MLP layer would have width_mult=2, while the second nn.Linear.weight would have width_mult=4. Nevertheless, as we demonstrate in our paper, we should expect hyperparameters to transfer even in this case.
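To make the arithmetic concrete, here is a minimal sketch in plain Python. It assumes only that nn.Linear.weight has shape (out_features, in_features), so the fan-in is the last dimension. The helper names (`fan_in_width_mult`, `mlp_weight_shapes`), the parameter names `fc1`/`fc2`, and the specific sizes are hypothetical and chosen for illustration; this is not the library's own implementation.

```python
# Hypothetical helper (not part of any library API): width_mult taken as the
# ratio of fan-in dimensions between a target model and a base model.
def fan_in_width_mult(base_shape, target_shape):
    # nn.Linear.weight is stored as (out_features, in_features),
    # so the fan-in is the last entry of the shape.
    return target_shape[-1] / base_shape[-1]


def mlp_weight_shapes(d_model, d_ffn):
    # Weight shapes for an MLP block mapping d_model -> d_ffn -> d_model.
    return {
        "fc1.weight": (d_ffn, d_model),  # fan-in = d_model
        "fc2.weight": (d_model, d_ffn),  # fan-in = d_ffn
    }


# Illustrative sizes: d_model is doubled, d_ffn is quadrupled.
base = mlp_weight_shapes(d_model=256, d_ffn=1024)
target = mlp_weight_shapes(d_model=512, d_ffn=4096)

width_mults = {
    name: fan_in_width_mult(base[name], target[name]) for name in base
}

for name, mult in width_mults.items():
    print(name, mult)
# fc1.weight 2.0
# fc2.weight 4.0
```

The two weights in the same MLP block end up with different multipliers because each one is scaled by its own fan-in, not by a single model-wide width.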

Answer selected by davisyoshida