Is there access to the deduplicated version of the data with meta info? #92
Comments
+1
Thanks for your interest!!
I definitely agree that further experimentation is needed on the topic of deduplication in other settings though!
@haileyschoelkopf I believe you wanted to stress that the updated Pythia paper will have benchmark scores with values immediately before the second epoch, not immediately after. @Jason3900 while we can and will add more data about this, Figures 8-14 in the appendix have a lot of the information that I think you are looking for.
Thanks for the excellent work! Can we have more information about how the deduplication was conducted? The paper only mentions "applying near-deduplication with MinHashLSH and a threshold of 0.87". Could you share the source code for the deduplication process? Thanks!
+1 on the deduplication code part. In #75, deduplicate-text-datasets was mentioned, but that repo does not contain any near-deduplication implementation due to an issue. It would be very helpful if such code could be shared. Thanks in advance!
Hi, the deduplication code originally used to dedupe the Pile is now uploaded here: https://github.com/EleutherAI/pile_dedupe. Hope this helps! I have not worked with this code before, but I'm happy to familiarize myself with it and help with any issues that arise when running it.
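For anyone who just wants a feel for what near-deduplication with MinHash and LSH involves, here is a minimal standard-library sketch. It is not the code in pile_dedupe and makes no claim to match how the Pile was actually processed. Only the 0.87 threshold comes from the paper quote above; the 5-word shingles, the 128 hash functions, the 8x16 banding, and every function name (`shingles`, `minhash_signature`, `near_dedup`) are assumptions made for illustration.

```python
import hashlib
import re

# Illustrative settings (assumptions, except THRESHOLD which is quoted in the thread).
NUM_PERM = 128          # number of hash functions in a signature
BANDS, ROWS = 8, 16     # BANDS * ROWS must equal NUM_PERM
THRESHOLD = 0.87        # estimated Jaccard similarity at or above which a doc is dropped
# With this banding, the LSH S-curve midpoint (1 / BANDS) ** (1 / ROWS) is about 0.88.


def shingles(text, n=5):
    """Return the set of lowercase word n-grams of a document."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(token, seed):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens):
    """MinHash signature: the minimum hash value under each seeded hash function."""
    return [min(_hash(t, seed) for t in tokens) for seed in range(NUM_PERM)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def near_dedup(docs):
    """Return indices of documents kept after greedy near-deduplication."""
    buckets = {}   # (band index, band values) -> indices of kept docs
    kept_sigs = {}
    kept = []
    for idx, text in enumerate(docs):
        sig = minhash_signature(shingles(text))
        keys = [(b, tuple(sig[b * ROWS:(b + 1) * ROWS])) for b in range(BANDS)]
        # Candidates are kept docs sharing at least one identical band.
        candidates = {j for k in keys for j in buckets.get(k, ())}
        if any(estimated_jaccard(sig, kept_sigs[j]) >= THRESHOLD for j in candidates):
            continue  # near-duplicate of an earlier document
        kept_sigs[idx] = sig
        kept.append(idx)
        for k in keys:
            buckets.setdefault(k, []).append(idx)
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river bank",
    "language models are trained on large text corpora scraped from the web",
]
kept_indices = near_dedup(docs)
```

The banding step keeps the comparison from being all-pairs: a new document is only checked against earlier documents that share at least one identical band of its signature. A real pipeline at Pile scale would also need sharding and a much faster hashing scheme, which this sketch does not attempt.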
Thanks for your excellent work. After reading the paper, I still have two questions: