
Is there access to the deduplicated version of the data with meta info? #92

Closed
Jason3900 opened this issue Apr 20, 2023 · 6 comments

Comments

@Jason3900

Thanks for your excellent work. After reading the paper, two questions remain:

  1. Is there a way to access the deduplicated version of the Pile while preserving the source/subset domain? We commonly need to set a mixing ratio between subsets, given their varying quality.
  2. In the experiments, I think it would be fairer to compare the raw and deduplicated models at a token count matching the deduplicated version. In the current setting, the deduplicated data is effectively used as duplicated data, since the model sees it about 1.5 times. So I'm confused by the observation that the two show no significant difference in performance, as both are in effect "duplicated".
@stabilize-ai

+1

@haileyschoelkopf
Collaborator

Thanks for your interest!!

  1. Re: point one, I will look into it! Our version of this dataset unfortunately did not preserve metadata during the deduplication process, and this has been an issue for us as well.

  2. I will re-upload a version of the paper that contains benchmark scores after ~207B tokens for both model suites, as well as scores at the end of training for both. Your note makes sense; however, we see the same result when looking at performance right after epoch 1 of our deduplicated training ends (that is, equi-token models perform the same with and without dedup in our case).

I definitely agree that further experimentation is needed on the topic of deduplication in other settings though!

@StellaAthena
Member

@haileyschoelkopf I believe you wanted to stress that the updated Pythia paper will have benchmark scores from immediately before the second epoch, not immediately after.

@Jason3900 While we can and will add more data about this, Figures 8-14 in the appendix contain a lot of the information I think you are looking for.

@dongdeng

dongdeng commented May 4, 2023

Thanks for the excellent work! Can we have more information about how the deduplication was conducted? The paper only mentions "applying near-deduplication with MinHashLSH and a threshold of 0.87".

Can we have the source code for the deduplicating process? Thanks!
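For reference while the source code is pending, the "MinHashLSH with a threshold of 0.87" description can be sketched in pure Python. Everything below other than the 0.87 threshold is an illustrative assumption, not the paper's actual configuration: the word 5-gram shingling, the 128 permutations, and the 8x16 banding (chosen so the LSH candidate threshold (1/b)^(1/r) ≈ 0.88 roughly matches 0.87) are all placeholders.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 128      # hash permutations per MinHash signature (assumed, not from the paper)
BANDS, ROWS = 8, 16  # 8 bands x 16 rows = 128; LSH threshold ~ (1/8)**(1/16) ~ 0.88

def shingles(text, n=5):
    # Word n-gram shingles; the actual shingling used for the Pile is not stated here.
    toks = text.split()
    if len(toks) <= n:
        return {" ".join(toks)}
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def minhash(shingle_set, num_perm=NUM_PERM):
    # One seeded hash function per "permutation"; keep the minimum value of each.
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "big")
            for s in shingle_set
        ))
    return tuple(sig)

def near_duplicate_pairs(docs, threshold=0.87):
    # LSH banding: documents sharing any full band of their signature become
    # candidates; candidates are then verified via estimated Jaccard similarity
    # (fraction of matching signature positions).
    sigs = {doc_id: minhash(shingles(text)) for doc_id, text in docs.items()}
    buckets = defaultdict(set)
    for doc_id, sig in sigs.items():
        for b in range(BANDS):
            buckets[(b, sig[b * ROWS:(b + 1) * ROWS])].add(doc_id)
    pairs = set()
    for members in buckets.values():
        for a, b in combinations(sorted(members), 2):
            est = sum(x == y for x, y in zip(sigs[a], sigs[b])) / NUM_PERM
            if est >= threshold:
                pairs.add((a, b))
    return pairs
```

To deduplicate, one would keep a single representative per connected component of the returned pairs. A production pipeline would also parallelize the signature pass; this sketch only shows the shape of the technique.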

@ChenghaoMou

+1 on the deduplication code. In #75, deduplicate-text-datasets was mentioned, but that repo does not contain a near-deduplication implementation due to an issue.

It would be very helpful if such code could be shared. Thanks in advance!

@haileyschoelkopf
Collaborator

Hi, the deduplication code originally used to dedupe the Pile is now uploaded here: https://github.com/EleutherAI/pile_dedupe !

Hope this helps! I have not worked with this code before, but I'm happy to familiarize myself and help with any issues that arise when running it!

6 participants