Question about accumulated gradients metric #21

Hambaobao · 2023-08-05T10:24:58Z

Dear author,

Hello, I have read your paper and code. UPop uses the cumulative gradient of mask as metric to evaluate the weight importance. However, I don't understand why UPop prunes the parts with large cumulative gradients. Does it mean that the parts with larger cumulative gradients are less important? Is there any related research supporting this, or is it based on intuition? Could you please provide some clarification?

Thank you.

Hambaobao · 2023-08-05T10:55:57Z

By the way, could you please explain the 'compression_weight' in the code? Why is 'compression_weight' for attention set to 36? Does this number have any special significance?

sdc17 · 2023-08-06T08:12:34Z

Hi, @Hambaobao

> I don't understand why UPop prunes the parts with large cumulative gradients. Does it mean that the parts with larger cumulative gradients are less important?

UPop prunes the parts whose learnable masks $\zeta$ have large cumulative gradients. These masks are initialized to ones, and their $l_{1}$-norms are added to the loss as extra terms that drive them smaller:

$$ \mathcal{L} = \mathcal{L}_{O} + w_a\sum\nolimits_{\zeta_{i} \in \zeta_a} \lVert \zeta_{i} \rVert_{1} + w_m\sum\nolimits_{\zeta_{i} \in \zeta_m} \lVert \zeta_{i} \rVert_{1} $$

With a regular optimizer, this loss makes the masks $\zeta$ corresponding to the unimportant parts smaller. However, a regular optimizer does not meet our requirement, i.e., to freely control the mask values at each iteration $t$. To this end, we update the masks with a custom rule, so the mask values can no longer serve as a metric of importance: they are determined by the custom rule itself. Their gradients, however, are still computed normally by PyTorch's autograd engine, so it is natural to use the gradients as the metric of importance instead.
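To make the link between mask values and mask gradients concrete, here is a minimal toy sketch. It is not UPop's code. It assumes plain SGD on the masks (the "regular optimizer" case), a made-up four-position model with frozen weights, and a hand-derived gradient so that it runs without PyTorch; the names `weights`, `l1_coeff`, `mask_grads`, and `run` are hypothetical. Under plain SGD the final mask equals $1 - \text{lr} \times$ (accumulated gradient), so the positions with the largest accumulated gradients are exactly the ones whose masks would have shrunk the most.

```python
# Toy illustration only: all values below are assumptions, not taken from UPop.
weights = [2.0, 1.0, 0.1, 0.0]  # frozen weights; larger magnitude = more important
lr, l1_coeff, steps = 0.1, 0.1, 50


def mask_grads(zeta):
    """Hand-derived gradient of mean_n (zeta_n*w_n - w_n)^2 + l1_coeff * sum_n |zeta_n|.

    Sample n activates only position n, and its target is the unmasked output w_n.
    """
    n = len(weights)
    grads = []
    for z, w in zip(zeta, weights):
        task = (2.0 / n) * (z - 1.0) * w * w  # task-loss term
        sign = 1.0 if z > 0 else (-1.0 if z < 0 else 0.0)  # subgradient of |z|
        grads.append(task + l1_coeff * sign)
    return grads


def run():
    zeta = [1.0] * len(weights)  # masks are initialized to ones
    acc = [0.0] * len(weights)   # accumulated mask gradients
    for _ in range(steps):
        g = mask_grads(zeta)
        acc = [a + gi for a, gi in zip(acc, g)]
        zeta = [z - lr * gi for z, gi in zip(zeta, g)]  # plain SGD step
    return zeta, acc


zeta, acc = run()
by_grad = sorted(range(len(acc)), key=lambda i: acc[i], reverse=True)  # largest gradient first
by_mask = sorted(range(len(zeta)), key=lambda i: zeta[i])              # smallest mask first
print(by_grad, by_mask)
```

Both rankings put the zero-weight position first and the largest-weight position last. Once the mask values are set by a custom rule rather than by SGD, only the accumulated gradients still carry this ranking.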

> Is there any related research supporting this, or is it based on intuition?

For the use of gradients as a metric of importance, you may refer to this paper, although its motivation and its specific use of gradients are quite different from ours.

> Could you please explain the compression_weight in the code? Why is compression_weight for attention set to 36?

compression_weight is used to rank different structures in a unified way. The scale factor 36 is determined by the shapes of the masks for the different structures, i.e., their granularity. More specifically, compression_weight for attention is set to 36 because each position in the learnable mask $\zeta_a$ corresponds to 36 rows/columns in the attention weights:

$$ 36 = 12 \text{ (number of heads)} \times [1 \text{ (weights of query)} + 1 \text{ (weights of key)} + 1 \text{ (weights of value)}] $$

whereas each position in the learnable mask $\zeta_m$ corresponds to 1 row/column in the FFN weights.

Thanks for your questions. We will add some comments about this to the code.
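As a rough illustration of how such a per-position weight can enter a unified ranking, the sketch below sorts attention and FFN mask positions together by score and charges each position its weight (36 or 1) against a shared pruning budget. This is an assumption about one plausible use, not the repository's actual implementation; `select_pruned`, `ATTN_WEIGHT`, `FFN_WEIGHT`, and the example scores are all hypothetical.

```python
NUM_HEADS = 12
QKV = 3  # one weight matrix each for query, key, and value
ATTN_WEIGHT = NUM_HEADS * QKV  # 36 rows/columns per attention mask position
FFN_WEIGHT = 1                 # 1 row/column per FFN mask position


def select_pruned(attn_scores, ffn_scores, ratio):
    """Pick the highest-scoring positions until `ratio` of the total weight is used.

    Scores stand in for accumulated mask gradients (larger = pruned first).
    """
    entries = [(s, ATTN_WEIGHT, ("attn", i)) for i, s in enumerate(attn_scores)]
    entries += [(s, FFN_WEIGHT, ("ffn", i)) for i, s in enumerate(ffn_scores)]
    entries.sort(key=lambda e: e[0], reverse=True)

    budget = ratio * sum(weight for _, weight, _ in entries)
    pruned, used = [], 0
    for _, weight, position in entries:
        if used + weight > budget:
            break  # the next position would exceed the budget
        pruned.append(position)
        used += weight
    return pruned, used


pruned, used = select_pruned([0.9, 0.1], [0.8, 0.7, 0.2, 0.05], ratio=0.5)
print(pruned, used)
```

In this example the total weight is 2 × 36 + 4 × 1 = 76, so a ratio of 0.5 gives a budget of 38: one attention position (36) plus two FFN positions (1 each) are selected. Without the weighting, an attention position would count the same as an FFN position even though it removes 36 times as many rows/columns.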

Hambaobao · 2023-08-06T08:28:02Z

Thank you very much for your detailed reply, it has truly been a great help to me.

sdc17 linked a pull request Aug 6, 2023 that will close this issue

add comments on compression_weight #22

Merged

sdc17 closed this as completed Aug 6, 2023
