Train/valid/test split #102
Ah yes, that’s a hack we used because we were being lazy, and it was probably a misleading thing to have posted. There are official Pile validation and test sets that you can download from pile.eleuther.ai and evaluate models on using our evaluation codebase. These two datasets are deduplicated against the rest of the Pile and against each other. Details on their construction can be found in the Pile paper. We didn’t actually have them downloaded on the cluster we trained the models on, so we filled in those values with the training set path and just ignored the results.
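For anyone who wants to check their own config for the same situation, here is a minimal sketch that reports which validation or test paths coincide with a training path. It assumes the config has already been loaded into a dict and uses `train-data-paths`, `valid-data-paths`, and `test-data-paths` as keys; the helper name `find_split_overlap` and the example paths are hypothetical placeholders, not the values from the Pythia configs.

```python
def find_split_overlap(config):
    """Return the valid/test data paths that also appear as training paths.

    Assumes `config` is a dict with list-valued path keys; the key names
    below are an assumption about the config layout.
    """
    train_paths = set(config.get("train-data-paths", []))
    overlap = {}
    for key in ("valid-data-paths", "test-data-paths"):
        shared = sorted(train_paths & set(config.get(key, [])))
        if shared:
            overlap[key] = shared
    return overlap


# Hypothetical placeholder paths, for illustration only.
example_config = {
    "train-data-paths": ["/data/pile_train"],
    "valid-data-paths": ["/data/pile_train"],  # reuses the training path
    "test-data-paths": ["/data/pile_test"],
}

print(find_split_overlap(example_config))
```

A non-empty result means the reported validation or test numbers for that split are computed on training data and should be ignored, as was the case here.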
Hello,
I was wondering if the validation and test sets were separated from the train set, or sampled from the same distribution.
According to here: https://github.com/EleutherAI/pythia/blob/main/models/70M/pythia-70m.yml#L91
and how the datasets are formed here: https://github.com/EleutherAI/gpt-neox/blob/main/megatron/data/data_utils.py#L332
it seems like the valid and test sets overlap with the train set.