
Is there access to the deduplicated version of the data with meta info? #92

Closed
Jason3900 opened this issue Apr 20, 2023 · 6 comments

Comments

@Jason3900

Thanks for your excellent work. After reading the paper, two questions remain:

  1. Is there a way to access the deduplicated version of the Pile while preserving the source/subset domain? We commonly need to set a mixing ratio between subsets, given their varying quality.
  2. In the experiments, I think it would be fairer to compare the raw and deduplicated models at a token count matching the deduplicated version. In the current setting, the deduplicated data is effectively used as duplicated data, since the model sees it about 1.5 times. So I'm confused by the observation that the two show no significant difference in performance, as both are in effect "duplicated".
@stabilize-ai

+1

@haileyschoelkopf
Collaborator

Thanks for your interest!!

  1. Re: point one, I will look into it! Our version of this dataset unfortunately did not preserve metadata during the deduplication process, and this has been an issue for us as well.

  2. I will re-upload a version of the paper that contains benchmark scores after ~207B tokens for both model suites, as well as scores at the end of training for both. Your note makes sense; however, we see the same result when looking at performance right after epoch 1 of our deduplicated training ends (that is, equi-token models perform the same with and without dedup in our case).

I definitely agree that further experimentation is needed on the topic of deduplication in other settings though!

@StellaAthena
Member

@haileyschoelkopf I believe you wanted to stress that the updated Pythia paper will have benchmark scores from immediately before the second epoch, not immediately after.

@Jason3900 While we can and will add more data about this, Figures 8-14 in the appendix contain a lot of the information I think you are looking for.

@dongdeng

dongdeng commented May 4, 2023

Thanks for the excellent work! Can we have more information about how the deduplication was conducted? The paper only mentions "applying near-deduplication with MinHashLSH and a threshold of 0.87".

Can we have the source code for the deduplicating process? Thanks!
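For reference while the source code is pending, the "MinHashLSH with a threshold of 0.87" description can be sketched in pure Python. Everything below other than the 0.87 threshold is an illustrative assumption, not the paper's actual configuration: the word 5-gram shingling, the 128 permutations, and the 8x16 banding (chosen so the LSH candidate threshold (1/b)^(1/r) ≈ 0.88 roughly matches 0.87) are all placeholders.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 128      # hash permutations per MinHash signature (assumed, not from the paper)
BANDS, ROWS = 8, 16  # 8 bands x 16 rows = 128; LSH threshold ~ (1/8)**(1/16) ~ 0.88

def shingles(text, n=5):
    # Word n-gram shingles; the actual shingling used for the Pile is not stated here.
    toks = text.split()
    if len(toks) <= n:
        return {" ".join(toks)}
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def minhash(shingle_set, num_perm=NUM_PERM):
    # One seeded hash function per "permutation"; keep the minimum value of each.
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "big")
            for s in shingle_set
        ))
    return tuple(sig)

def near_duplicate_pairs(docs, threshold=0.87):
    # LSH banding: documents sharing any full band of their signature become
    # candidates; candidates are then verified via estimated Jaccard similarity
    # (fraction of matching signature positions).
    sigs = {doc_id: minhash(shingles(text)) for doc_id, text in docs.items()}
    buckets = defaultdict(set)
    for doc_id, sig in sigs.items():
        for b in range(BANDS):
            buckets[(b, sig[b * ROWS:(b + 1) * ROWS])].add(doc_id)
    pairs = set()
    for members in buckets.values():
        for a, b in combinations(sorted(members), 2):
            est = sum(x == y for x, y in zip(sigs[a], sigs[b])) / NUM_PERM
            if est >= threshold:
                pairs.add((a, b))
    return pairs
```

To deduplicate, one would keep a single representative per connected component of the returned pairs. A production pipeline would also parallelize the signature pass; this sketch only shows the shape of the technique.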

@ChenghaoMou

+1 on the deduplication code. In #75, deduplicate-text-datasets was mentioned, but that repo does not contain a near-deduplication implementation due to an issue.

It would be very helpful if such code could be shared. Thanks in advance!

@haileyschoelkopf
Collaborator

Hi, the deduplication code originally used to dedupe the Pile is now uploaded here: https://github.com/EleutherAI/pile_dedupe !

Hope this helps! I have not worked with this code before, but I'm happy to familiarize myself and help with any issues that arise when running it!

6 participants