read in/have an additional column in the training data #865

Closed
davidvblumenthal opened this issue Mar 31, 2023 · 6 comments
Labels
feature request New feature or request

Comments

@davidvblumenthal

I have a regular tokenized dataset and a corresponding column that represents a loss mask, where each token in the dataset has a corresponding value. For example: 1 if a token belongs to the first half of a sentence and 2 if it belongs to the second half.

What I want to do is multiply the per-token loss by that loss mask, so that I can, for example, weight the second half of a sentence more strongly.
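Concretely, the computation I have in mind is something like this in plain PyTorch (a rough sketch, independent of any framework; the function name and the normalization choice are just illustrative):

```python
import torch
import torch.nn.functional as F

def weighted_lm_loss(logits, labels, loss_mask):
    """Per-token cross-entropy scaled by an arbitrary per-token weight.

    logits:    [batch, seq, vocab] (assumed already aligned/shifted vs. labels)
    labels:    [batch, seq]
    loss_mask: [batch, seq] weights, e.g. 1 and 2 as in the example above
    """
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)),  # [batch*seq, vocab]
        labels.view(-1),                   # [batch*seq]
        reduction="none",
    ).view_as(labels)
    weighted = per_token * loss_mask.float()
    # Normalize by the total weight so the loss scale stays comparable
    return weighted.sum() / loss_mask.float().sum()
```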

What would be the best way to implement that?

Thanks in advance :)

davidvblumenthal added the feature request label Mar 31, 2023
@StellaAthena
Member

Is the loss mask known a priori? In particular, in your example it's completely independent of the data and is a purely geometric property. Is that typical of your use cases?

@davidvblumenthal
Author

So, I generate the loss mask during tokenization, and it corresponds to structures in the original text data - in this case, the first and second halves of each sentence.
A tokenized sample containing one sentence would look like this:

{"input_ids":[3010,312,260,36230,200,200,586,1351], "loss_mask": [1,1,1,1,2,2,2,2]}

I have implemented this and trained smaller models using HuggingFace, and now I would like to train a larger model the same way with this repository.
But I am unsure of the best way to implement it given how the data logic works in this repository.
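For the simple half-sentence case, producing such a record during tokenization looks roughly like this (a sketch assuming a HuggingFace tokenizer; the helper name is made up):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer works here

def tokenize_with_half_sentence_mask(sentence):
    """Tokenize one sentence, weighting its second half with 2."""
    input_ids = tokenizer(sentence)["input_ids"]
    half = len(input_ids) // 2
    loss_mask = [1] * half + [2] * (len(input_ids) - half)
    return {"input_ids": input_ids, "loss_mask": loss_mask}
```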

Thanks for the help :)

@StellaAthena
Member

StellaAthena commented Apr 3, 2023

The function that handles attention and loss masking is defined here. In the current implementation we mask out the loss for EOT tokens by setting the value to zero, but there's nothing that I can see that would prevent you from storing an arbitrary multiplier there. If all the cases you care about look like your example, you should be able to hard-code it. Otherwise, you can modify this code to define the mask as a function that takes an input, and then pass the weight values at runtime.
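Schematically, the pattern is something like the following (a simplified sketch of the idea, not the repository's actual code; weight_fn is a hypothetical hook):

```python
import torch

def build_loss_mask(data, eod_token, weight_fn=None):
    """Build a per-token loss mask for a batch of token ids [batch, seq]."""
    loss_mask = torch.ones(data.size(), dtype=torch.float, device=data.device)
    loss_mask[data == eod_token] = 0.0  # the existing EOT masking
    if weight_fn is not None:
        # e.g. weight_fn returns 2.0 for tokens in the second half of a sentence
        loss_mask = loss_mask * weight_fn(data)
    return loss_mask
```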

@davidvblumenthal
Author

Thanks for the help. I wrote a new data class with two indexed datasets, where one stores the text and the other stores the loss masks.
Works just fine!
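For anyone curious, the rough shape of that class is as follows (a sketch; the actual indexed-dataset API in megatron/data differs in its details):

```python
import torch

class TextWithLossMaskDataset(torch.utils.data.Dataset):
    """Pairs two indexed datasets built in the same order:
    one holding token ids, one holding per-token loss weights."""

    def __init__(self, text_dataset, loss_mask_dataset):
        assert len(text_dataset) == len(loss_mask_dataset)
        self.text_dataset = text_dataset
        self.loss_mask_dataset = loss_mask_dataset

    def __len__(self):
        return len(self.text_dataset)

    def __getitem__(self, idx):
        return {
            "text": self.text_dataset[idx],
            "loss_mask": self.loss_mask_dataset[idx],
        }
```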

@ikergarcia1996

Hi @davidvblumenthal, would you mind explaining your solution in a little more detail? Do you have any code to share? I am trying to build something very similar.

@davidvblumenthal
Author

Hi @ikergarcia1996,
you can find the code here: GitHub
The changes are mostly in the preprocessing script, training.py, and in megatron/data.
