Hi, thanks for this cool work.
I'm trying to understand why the SGD learning rate scaling given in Table 3 is $\Theta(1)$. In J.2.1, the reasoning given is that for hidden $W$, "the gradient of $W$ has $\Theta(1/n)$ coordinates, so the $\Theta(1)$ SGD LR suffices..."

After the first update to $W$, we need $(W + G)x$ to remain $\Theta(1)$, where $G$ is the gradient update. From Table 14 we'd expect $Gx$ to be $\Theta(1)$ if $G$ is a tensor product (rather than a standard Gaussian). Why do we consider $G$ to be a tensor product? It seems to me that the $\Theta(1/n)$ coordinates in $G$ come from the random Gaussian initialization of the output weights.
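To make my reasoning concrete, here is the single-hidden-layer calculation I have in mind (my own notation, which may not match the paper's setup exactly: $h = Wx$ is the pre-activation, $\delta = \partial \mathcal{L} / \partial h$ the backpropagated error, $\eta$ the learning rate):

$$
G \;=\; \frac{\partial \mathcal{L}}{\partial W} \;=\; \delta\, x^\top,
\qquad
(W - \eta\, G)\,x \;=\; Wx \;-\; \eta\,\delta\,(x^\top x).
$$

Since $x^\top x = \Theta(n)$ when $x$ has $\Theta(1)$ coordinates, each coordinate of $Wx$ moves by $\eta\,\delta_i\,(x^\top x)$, which is $\Theta(1)$ exactly when $\eta\,\delta_i = \Theta(1/n)$. My understanding is that $\delta$ already has $\Theta(1/n)$ coordinates because of the output-weight initialization (as in my last sentence above), so $\eta = \Theta(1)$ suffices. Is this outer-product structure the sense in which $G$ is treated as a tensor product in Table 14, or is the argument relying on something else?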