Adding the possibility of passing a label dataset #958
Conversation
This seems reasonable. We can also use this to preserve metadata throughout the training pipeline, which someone was asking about recently.
This PR looks great and can be merged if need be now, thank you very much @honglu2875! I'll test this tomorrow before approving to doubly confirm though.
LGTM! |
LGTM, but
Addressed via comment block! If this documentation seems deserving of being in the main README instead, happy to move it there. |
This is really great, but I still want to know how to edit the yml file (local_setup.yml) in this way: `"data_path": "./data_text_document"`. But I get this error
Would it work if you put down
I followed your instructions, but a new problem occurred. The key question is how to pass the labels of the valid/test sets to the arguments.
Oh, I found the solution: in megatron/data/data_utils.py, in the `build_weighted_datasets` function, pass the `label_prefix` argument.
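To illustrate the commenter's point, here is a minimal hedged sketch of the idea: the function and argument names (`build_weighted_datasets`, `label_prefix`) come from this PR, but the body is illustrative, not the actual gpt-neox code. The point is that valid/test label prefixes are passed the same way as the train label prefix, and must stay in sync with the data prefixes:

```python
# Illustrative sketch only -- not the actual gpt-neox implementation.
def build_weighted_datasets(data_prefixes, label_prefixes=None):
    """Return (data_prefix, label_prefix) pairs for each split.

    label_prefix is None when no label dataset is supplied, in which
    case behavior stays unchanged (no loss mask from labels).
    """
    if label_prefixes is None:
        label_prefixes = [None] * len(data_prefixes)
    # The PR assumes label datasets are perfectly in sync with the data.
    assert len(label_prefixes) == len(data_prefixes), (
        "label datasets must stay in sync with the data datasets"
    )
    return list(zip(data_prefixes, label_prefixes))

# Usage: valid/test labels are passed alongside their data paths
# exactly like the train labels (paths here are hypothetical).
pairs = build_weighted_datasets(
    ["./train_text_document", "./valid_text_document"],
    ["./train_label_document", "./valid_label_document"],
)
```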
As discussed with @haileyschoelkopf in DM, I'm cleaning up my personal fork and putting the major difference out here in case it is a useful feature. It allows taking an optional `label_data_paths` argument (which is assumed to be perfectly in sync with the training data). If there is anything that does not fit with the current design, feel free to close this PR.
Why do I need it
It was used when finetuning diff models, where a custom loss mask needs to be applied to the dataset. Doing this at the dataset level makes it very controllable and easy to double-check by poking into the generated masks. I also used this in some experiments finetuning Pythia for text repairing, and it works well.
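A toy sketch of the loss-mask idea in plain Python, illustrative only: positions marked with an ignore index are excluded from the loss. The `-100` convention and the `masked_nll` helper are assumptions for illustration, not part of this PR:

```python
IGNORE_INDEX = -100  # common convention for masked-out positions (assumption)

def masked_nll(token_logprobs, labels):
    """Mean negative log-likelihood over unmasked positions only.

    token_logprobs[i] is the model's log-probability of labels[i] at
    position i; positions labeled IGNORE_INDEX contribute nothing.
    """
    kept = [-lp for lp, y in zip(token_logprobs, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept) if kept else 0.0

# Only the last two positions contribute to the loss here;
# the first two are masked out by the label dataset.
loss = masked_nll(
    [-0.1, -0.2, -0.3, -0.4],
    [IGNORE_INDEX, IGNORE_INDEX, 7, 8],
)
```

Because the mask lives in the label dataset rather than being computed on the fly, you can inspect the generated masks directly, which is the controllability the PR description refers to.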
How it works
- If `label_data_paths` is not passed, nothing should change.
- The label data is built into the dataset alongside the training data (`build_the_dataset` function), so that the dataset yields an extra "label" item;
- the "label" item is then used as the target when computing the loss (`_get_batch` function).

Besides the label, there is also a minor fix in `setup_for_inference_or_eval`. Not sure if it's an issue with a newer version of DeepSpeed, but `no_load_optim` wasn't enough to prevent the optimizer from being loaded when I do inference and use functions like `generate_samples_unconditional`.
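The batching behavior described in the PR can be sketched roughly as follows, in plain Python rather than the actual tensor code of `_get_batch`; the dict keys mirror the PR's description, but the function body is an illustrative assumption:

```python
def get_batch(sample):
    """Illustrative sketch, not the actual _get_batch code.

    If the dataset yielded an extra "label" item, use it as the
    training target; otherwise fall back to the usual shifted-by-one
    next-token targets.
    """
    tokens = sample["text"][:-1]
    if "label" in sample:
        labels = sample["label"][1:]  # labels kept in sync with the data
    else:
        labels = sample["text"][1:]   # default: next-token prediction
    return tokens, labels

# With a label item, the targets come from the label dataset:
tokens, labels = get_batch({"text": [1, 2, 3, 4], "label": [0, 9, 9, 9]})
```

Without the `"label"` key, `get_batch` reduces to ordinary language-model batching, which matches the PR's claim that nothing changes when `label_data_paths` is not passed.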