Read in / have an additional column in the training data #865
Is the loss mask known a priori? In particular, in your example it’s completely independent of the data and is a purely geometric property. Is that typical for your uses?
So, I generate the loss mask during tokenization, and it corresponds to structures in the original text data, in this case the first and second halves of the sentences: {"input_ids": [3010, 312, 260, 36230, 200, 200, 586, 1351], "loss_mask": [1, 1, 1, 1, 2, 2, 2, 2]}. I have implemented this and trained smaller models using HuggingFace, and now I would like to do the same thing with a larger model using this repository. Thanks for the help :)
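A minimal sketch of how such a mask could be built during tokenization (the helper name `build_example` is hypothetical, not from the poster's code): the first half of each sequence is tagged 1 and the second half 2, mirroring the JSON record above.

```python
# Hypothetical sketch: attach a half-sentence loss mask to tokenized data.
# Assumes the "first half" is simply the first len // 2 tokens.

def build_example(token_ids):
    """Return a record pairing token ids with a half-sentence loss mask."""
    half = len(token_ids) // 2
    loss_mask = [1] * half + [2] * (len(token_ids) - half)
    return {"input_ids": token_ids, "loss_mask": loss_mask}

example = build_example([3010, 312, 260, 36230, 200, 200, 586, 1351])
# example["loss_mask"] == [1, 1, 1, 1, 2, 2, 2, 2]
```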
The function that handles attention and loss masking is defined here. In the current implementation we mask out the loss for EOT tokens by setting the value to zero, but there’s nothing that I can see that would prevent you from storing an arbitrary multiplier there. If all the cases you care about look like your example, you should be able to hard-code it. Otherwise, you can modify this code to define the mask as a function of the input and pass the weight values at runtime.
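The repository's actual masking function isn't shown in this thread, but as an illustration of the "arbitrary multiplier" idea, a weighted per-token cross-entropy could look like the following self-contained sketch (pure Python for clarity; a real implementation would use framework tensors):

```python
import math

def weighted_token_loss(logits, labels, loss_mask):
    """Per-token cross-entropy scaled by a multiplier mask.

    A mask value of 0 drops a token entirely (as done for EOT tokens);
    any other value re-weights that token's contribution.
    """
    total, weight_sum = 0.0, 0.0
    for row, label, w in zip(logits, labels, loss_mask):
        m = max(row)  # stabilize log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        nll = log_z - row[label]  # negative log-likelihood of the label
        total += w * nll
        weight_sum += w
    return total / weight_sum
```

Normalizing by the mask sum (rather than the token count) keeps the loss scale comparable when weights other than 1 are used.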
Thanks for the help. I wrote a new data class with two indexed datasets: one stores the text property and the other stores the loss masks.
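The poster didn't share their code, but the described approach (two aligned indexed datasets) might be sketched like this; the class name and structure here are hypothetical:

```python
# Hypothetical sketch of the two-dataset approach described above:
# one indexed dataset holds tokenized text, a parallel one holds the
# loss masks, and __getitem__ returns both aligned by index.

class MaskedIndexedDataset:
    def __init__(self, text_dataset, mask_dataset):
        assert len(text_dataset) == len(mask_dataset), "datasets must align"
        self.text = text_dataset
        self.masks = mask_dataset

    def __len__(self):
        return len(self.text)

    def __getitem__(self, idx):
        return {"input_ids": self.text[idx], "loss_mask": self.masks[idx]}
```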
Hi @davidvblumenthal , would you mind explaining a little bit more about your solution? Do you have any code to share? I am trying to build something very similar to your problem.
Hi @ikergarcia1996 |
I have a regular tokenized dataset and a corresponding column that represents a loss mask, where each token in the regular dataset has a corresponding value: for example, 1 if a token belongs to the first half of a sentence and 2 if it belongs to the second half.
What I want to do is multiply the per-token loss by that loss mask, so that I can, for example, weight the second half of a sentence more strongly.
What would be the best way to implement that?
Thanks in advance :)