A few parts of the Norm Tweaking paper were ambiguous to me and are currently implemented with my best guess. I've listed them below and linked to the relevant code. If you think there's a more correct implementation, let me know! PRs welcome too!
Loss Function
The loss function from the paper is:

$$\mathcal{L}_{lkd} = \sum_{i=1}^{C} \left( \left\| \mu_f^i - \mu_q^i \right\|_2 + \left\| \left(\sigma_f^i\right)^2 - \left(\sigma_q^i\right)^2 \right\|_2 \right)$$

where $\mu$ and $\sigma^2$ are the mean and variance of the float ($f$) and quantized ($q$) activations for channel $i$.
Two things that I suspect might be wrong:
The combination of summing over the channels and taking a 2-norm doesn't make sense to me, since I think each normed quantity is a single scalar, and the 2-norm of a scalar is just its absolute value. I've implemented the norm as an absolute value instead.
The paper refers to $\sigma$ as the variance, but the equation uses $\sigma^2$. I think this is a typo, since $\sigma$ usually denotes the standard deviation, so I've implemented it as if that were the case ($\sigma$ = standard deviation, $\sigma^2$ = variance).
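To make those two guesses concrete, here's a minimal sketch of how I read the loss (the function and argument names are illustrative, not the repo's actual code):

```python
import torch

def norm_tweaking_loss(float_acts, quant_acts):
    # Sum over channels of |mean diff| + |variance diff|; the "2-norm" of
    # each scalar difference reduces to an absolute value.
    loss = torch.tensor(0.0)
    for f, q in zip(float_acts, quant_acts):  # one tensor per channel
        mu_f, mu_q = f.mean(), q.mean()
        var_f, var_q = f.var(), q.var()  # sigma^2, treating sigma as std dev
        loss = loss + (mu_f - mu_q).abs() + (var_f - var_q).abs()
    return loss
```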
Batch Size
The paper says to use Adam as the optimizer, but also to use the equation $lr_i = lr_0 \cdot (1 + scale \cdot (i/L))$ to set a learning rate for layer $i$. I think this implies splitting the 128 samples into smaller batches for each layer so that the optimizer actually does something. I've chosen to run each sample through individually, but it's unclear to me whether a batch size of 2-128 would be helpful or matter at all.
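The schedule itself is straightforward; a sketch (the function name and the `scale` default are mine, not from the paper):

```python
def layer_lr(lr0, i, L, scale=2.0):
    # lr_i = lr_0 * (1 + scale * (i / L)): deeper layers get a larger rate
    return lr0 * (1 + scale * (i / L))
```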
Identifying the Language of Tokens
This repo currently implements LLM-QAT's synthetic data generation, not Norm Tweaking's evolution of it. It's unclear to me how to map each token to a language. As a simple example, the token "a" is a word in many languages. Even if it were possible to map each token to a language, it's not clear to me how to determine the proportion of each language in the training data. How would the paper's table of per-language proportions be generated for an arbitrary tokenizer vocab?
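For illustration, a crude Unicode-script heuristic (hypothetical, not in the repo) shows why this is hard: it can separate scripts, but not languages that share one:

```python
import unicodedata

def guess_script(token):
    # Bucket a token by the Unicode script of its letters. This separates
    # e.g. Latin from CJK or Cyrillic, but cannot distinguish English from
    # French -- the ambiguity described above.
    scripts = set()
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            scripts.add(name.split()[0])  # e.g. "LATIN", "CYRILLIC", "CJK"
    return scripts
```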
Implementation of Layer Freezing
I'm not sure if the way I've frozen/unfrozen each layer for tweaking is correct. It looks correct when I examine requires_grad, but it's possible there's a nuance of PyTorch that I have missed.
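For reference, a minimal sketch of the approach (assuming the norm layers are `nn.LayerNorm`; the helper name is mine):

```python
import torch.nn as nn

def tweak_only_norms(layer):
    # Freeze every parameter in the block, then unfreeze only the norm
    # parameters so Adam updates nothing else.
    for p in layer.parameters():
        p.requires_grad = False
    for m in layer.modules():
        if isinstance(m, nn.LayerNorm):
            for p in m.parameters():
                p.requires_grad = True
```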