read in/have an additional column in the training data #865

Closed
davidvblumenthal opened this issue Mar 31, 2023 · 6 comments
Labels
feature request New feature or request

Comments

@davidvblumenthal

I have a regular tokenized dataset and a corresponding column that represents a loss mask, where each token in the dataset has a corresponding value. For example: 1 if a token belongs to the first half of a sentence and 2 if it belongs to the second half.

What I want to do is multiply the per-token loss by that loss mask, so that I can, for example, weight the second half of a sentence more strongly.
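Concretely, the computation I have in mind is something like this in plain PyTorch (a rough sketch, independent of any framework; the function name and the normalization choice are just illustrative):

```python
import torch
import torch.nn.functional as F

def weighted_lm_loss(logits, labels, loss_mask):
    """Per-token cross-entropy scaled by an arbitrary per-token weight.

    logits:    [batch, seq, vocab] (assumed already aligned/shifted vs. labels)
    labels:    [batch, seq]
    loss_mask: [batch, seq] weights, e.g. 1 and 2 as in the example above
    """
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)),  # [batch*seq, vocab]
        labels.view(-1),                   # [batch*seq]
        reduction="none",
    ).view_as(labels)
    weighted = per_token * loss_mask.float()
    # Normalize by the total weight so the loss scale stays comparable
    return weighted.sum() / loss_mask.float().sum()
```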

What would be the best way to implement that?

Thanks in advance :)

davidvblumenthal added the feature request label Mar 31, 2023
@StellaAthena
Member

Is the loss mask known a priori? In particular, in your example it's completely independent of the data and is a purely geometric property. Is that typical of your use cases?

@davidvblumenthal
Author

So, I generate the loss mask during tokenization, and it corresponds to structures in the original text data - in this case, the first and second halves of each sentence.
A tokenized sample containing one sentence would look like this:

{"input_ids":[3010,312,260,36230,200,200,586,1351], "loss_mask": [1,1,1,1,2,2,2,2]}

I have implemented this and trained smaller models using HuggingFace, and now I would like to train a larger model the same way with this repository.
But I am unsure of the best way to implement it given how the data logic works in this repository.
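For the simple half-sentence case, producing such a record during tokenization looks roughly like this (a sketch assuming a HuggingFace tokenizer; the helper name is made up):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer works here

def tokenize_with_half_sentence_mask(sentence):
    """Tokenize one sentence, weighting its second half with 2."""
    input_ids = tokenizer(sentence)["input_ids"]
    half = len(input_ids) // 2
    loss_mask = [1] * half + [2] * (len(input_ids) - half)
    return {"input_ids": input_ids, "loss_mask": loss_mask}
```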

Thanks for the help :)

@StellaAthena
Member

StellaAthena commented Apr 3, 2023

The function that handles attention and loss masking is defined here. In the current implementation we mask out the loss for EOT tokens by setting the value to zero, but there's nothing that I can see that would prevent you from storing an arbitrary multiplier there. If all the cases you care about look like your example, you should be able to hard-code it. Otherwise, you can modify this code to define the mask as a function that takes an input, and then pass the weight values at runtime.
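Schematically, the pattern is something like the following (a simplified sketch of the idea, not the repository's actual code; weight_fn is a hypothetical hook):

```python
import torch

def build_loss_mask(data, eod_token, weight_fn=None):
    """Build a per-token loss mask for a batch of token ids [batch, seq]."""
    loss_mask = torch.ones(data.size(), dtype=torch.float, device=data.device)
    loss_mask[data == eod_token] = 0.0  # the existing EOT masking
    if weight_fn is not None:
        # e.g. weight_fn returns 2.0 for tokens in the second half of a sentence
        loss_mask = loss_mask * weight_fn(data)
    return loss_mask
```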

@davidvblumenthal
Author

Thanks for the help. I wrote a new data class with two indexed datasets, where one stores the text and the other stores the loss masks.
Works just fine!
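For anyone curious, the rough shape of that class is as follows (a sketch; the actual indexed-dataset API in megatron/data differs in its details):

```python
import torch

class TextWithLossMaskDataset(torch.utils.data.Dataset):
    """Pairs two indexed datasets built in the same order:
    one holding token ids, one holding per-token loss weights."""

    def __init__(self, text_dataset, loss_mask_dataset):
        assert len(text_dataset) == len(loss_mask_dataset)
        self.text_dataset = text_dataset
        self.loss_mask_dataset = loss_mask_dataset

    def __len__(self):
        return len(self.text_dataset)

    def __getitem__(self, idx):
        return {
            "text": self.text_dataset[idx],
            "loss_mask": self.loss_mask_dataset[idx],
        }
```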

@ikergarcia1996

Hi @davidvblumenthal, would you mind explaining your solution in a little more detail? Do you have any code to share? I am trying to build something very similar.

@davidvblumenthal
Author

Hi @ikergarcia1996,
you can find the code here: GitHub
The changes are mostly in the preprocessing script, training.py, and in megatron/data.
