Hi, thanks for this cool work.
I'm trying to understand why the SGD learning rate scaling given in Table 3 is $\Theta(1)$. In J.2.1, the reasoning given is that for hidden $W$, "the gradient of $W$ has $\Theta(1/n)$ coordinates, so the $\Theta(1)$ SGD LR suffices..."

After the first update to $W$, we need $(W + G)x$ to remain $\Theta(1)$, where $G$ is the gradient update. From Table 14 we'd expect $Gx$ to be $\Theta(1)$ if $G$ is a tensor product (rather than a standard Gaussian). Why do we consider $G$ to be a tensor product? It seems to me that the $\Theta(1/n)$ coordinates in $G$ come from the random Gaussian initialization of the output weights.
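To make my reasoning concrete, here is the single-hidden-layer calculation I have in mind (my own notation, which may not match the paper's setup exactly: $h = Wx$ is the pre-activation, $\delta = \partial \mathcal{L} / \partial h$ the backpropagated error, $\eta$ the learning rate):

$$
G \;=\; \frac{\partial \mathcal{L}}{\partial W} \;=\; \delta\, x^\top,
\qquad
(W - \eta\, G)\,x \;=\; Wx \;-\; \eta\,\delta\,(x^\top x).
$$

Since $x^\top x = \Theta(n)$ when $x$ has $\Theta(1)$ coordinates, each coordinate of $Wx$ moves by $\eta\,\delta_i\,(x^\top x)$, which is $\Theta(1)$ exactly when $\eta\,\delta_i = \Theta(1/n)$. My understanding is that $\delta$ already has $\Theta(1/n)$ coordinates because of the output-weight initialization (as in my last sentence above), so $\eta = \Theta(1)$ suffices. Is this outer-product structure the sense in which $G$ is treated as a tensor product in Table 14, or is the argument relying on something else?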