Question about accumulated gradients metric #21

Closed
Hambaobao opened this issue Aug 5, 2023 · 3 comments · Fixed by #22

Comments

@Hambaobao

Dear author,

Hello, I have read your paper and code. UPop uses the cumulative gradient of each mask as the metric for weight importance. However, I don't understand why UPop prunes the parts with large cumulative gradients. Does it mean that the parts with larger cumulative gradients are less important? Is there any related research supporting this, or is it based on intuition? Could you please provide some clarification?

Thank you.

@Hambaobao
Author

By the way, could you please explain the 'compression_weight' in the code? Why is 'compression_weight' for attention set to 36? Does this number have any special significance?

@sdc17
Owner

sdc17 commented Aug 6, 2023

Hi, @Hambaobao

> I don't understand why UPop prunes the parts with large cumulative gradients. Does it mean that the parts with larger cumulative gradients are less important?

UPop prunes the parts whose learnable masks $\zeta$ have large cumulative gradients. These masks are initialized to ones, and their $l_{1}$-norms are added to the loss as extra terms to drive them smaller:

$$ \mathcal{L} = \mathcal{L}_{O} + w_a\sum\nolimits_{\zeta_{i} \in \zeta_a} \lVert \zeta_{i} \rVert_{1} + w_m\sum\nolimits_{\zeta_{i} \in \zeta_m} \lVert \zeta_{i} \rVert_{1} $$

With a regular optimizer, this loss makes the masks $\zeta$ corresponding to the unimportant parts smaller. However, that does not satisfy our requirement, i.e., to freely control the mask values at each iteration $t$. To this end, we update the masks with a custom rule, so their values can no longer serve as a metric of importance because they are determined by that rule. Their gradients, however, are still computed normally by PyTorch's autograd engine, so it is natural to use the gradients as the metric of importance.
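
One way to see why a large accumulated gradient marks a less important position: under plain gradient descent, a mask that starts at 1 ends at 1 minus the learning rate times its summed gradients, so the position with the largest accumulated gradient is the one a regular optimizer would have shrunk the most. The sketch below checks this on made-up numbers. The gradient values, the learning rate, and the helper names are all hypothetical and not taken from the UPop code, and it assumes both rankings see the same gradient sequence.

```python
def accumulate(grads_per_step):
    """Sum per-step mask gradients position by position."""
    acc = [0.0] * len(grads_per_step[0])
    for step in grads_per_step:
        for i, g in enumerate(step):
            acc[i] += g
    return acc


def sgd_masks(grads_per_step, lr):
    """Mask values a plain SGD optimizer would reach, starting from ones."""
    masks = [1.0] * len(grads_per_step[0])
    for step in grads_per_step:
        masks = [m - lr * g for m, g in zip(masks, step)]
    return masks


# Hypothetical gradients dL/dzeta for 4 mask positions over 3 iterations.
grads = [
    [0.30, -0.10, 0.05, 0.20],
    [0.25, -0.05, 0.10, 0.15],
    [0.35, -0.15, 0.00, 0.25],
]

acc = accumulate(grads)
masks = sgd_masks(grads, lr=0.1)

# Largest accumulated gradient first (candidates to prune first).
order_by_grad = sorted(range(len(acc)), key=lambda i: acc[i], reverse=True)
# Smallest SGD-updated mask first (what mask values would have indicated).
order_by_mask = sorted(range(len(masks)), key=lambda i: masks[i])

print(order_by_grad, order_by_mask)
```

Both orderings agree here, which matches the reply's point that gradients can stand in for mask values once the masks themselves are set by a custom rule.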

> Is there any related research supporting this, or is it based on intuition?

For using gradients as a metric of importance, you may refer to this paper, but their motivations and specific uses of gradients are quite different.

> could you please explain the compression_weight in the code? Why is compression_weight for attention set to 36

`compression_weight` enables a unified ranking across different structures. The scale factor 36 is determined by the shape of the masks for the different structures, i.e., their granularity. More specifically, `compression_weight` for attention is set to 36 because each position in the learnable mask $\zeta_a$ corresponds to 36 rows/columns in the attention weights:

$$ 36 = 12 \ \text{(number of heads)} \times [\, 1 \ \text{(weights of query)} + 1 \ \text{(weights of key)} + 1 \ \text{(weights of value)} \,] $$

, while each position in the learnable mask $\zeta_m$ corresponds to 1 row/column in the FFN weights. Thanks for your questions. We will add some comments about this to the code.
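
The granularity arithmetic above can be written out as a small sketch. Only the numbers 12, 3 (query, key, value), 36, and 1 come from the reply; the constant and function names are hypothetical, and how the repository actually applies `compression_weight` during ranking is not shown here.

```python
NUM_HEADS = 12        # number of attention heads, as stated in the reply
QKV_MATRICES = 3      # query, key, and value weights

# Rows/columns covered by one mask position in each structure.
ATTN_ROWS_PER_POSITION = NUM_HEADS * QKV_MATRICES   # 36
FFN_ROWS_PER_POSITION = 1


def rows_removed(n_attn_positions, n_ffn_positions):
    """Total weight rows/columns removed for a given number of pruned positions."""
    return (n_attn_positions * ATTN_ROWS_PER_POSITION
            + n_ffn_positions * FFN_ROWS_PER_POSITION)


# Pruning 2 attention positions and 10 FFN positions (illustrative counts).
total = rows_removed(2, 10)
print(ATTN_ROWS_PER_POSITION, total)
```

Under this accounting, one pruned attention position removes as many rows/columns as 36 pruned FFN positions, which is why the two mask types need different weights when ranked together.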

@Hambaobao
Author

Thank you very much for your detailed reply, it has truly been a great help to me.

@sdc17 sdc17 linked a pull request Aug 6, 2023 that will close this issue
@sdc17 sdc17 closed this as completed Aug 6, 2023