How was deduplication done? #75

theblackcat102 · 2023-03-14T04:18:57Z

Specifically, what method or library was used to deduplicate the Pile?

I have searched previous issues and this repo and see no other mention of the deduplication methodology.

haileyschoelkopf · 2023-03-20T14:58:18Z

Hi! More detail will be included in the paper, which we hope to upload soon. Apologies that this is missing from the repository.

This deduplicated copy of the dataset was created using the code from https://github.com/google-research/deduplicate-text-datasets, with MinHashLSH and a Jaccard similarity threshold of 0.87, I believe. Let me know whether this clarifies things or if you have further questions!
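
For anyone unfamiliar with the technique, here is a rough, standard-library-only sketch of what MinHash LSH with a Jaccard threshold generally looks like. It is not the code that produced the deduplicated Pile. Only the 0.87 threshold comes from this thread; the 5-word shingles, 128 hash functions, 16x8 banding, and all function names are assumptions made for illustration.

```python
import hashlib
from itertools import combinations

THRESHOLD = 0.87  # Jaccard threshold quoted in this thread
NUM_PERM = 128    # signature length (assumed for illustration)


def shingles(text, n=5):
    """Return the set of word n-grams in `text` (n=5 is an assumption)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed, item):
    """Seeded 64-bit hash; each seed acts as one 'permutation'."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_perm=NUM_PERM):
    """MinHash signature: the minimum hash of the set under each seed."""
    return [min(_hash(seed, it) for it in items) for seed in range(num_perm)]


def lsh_candidate_pairs(signatures, bands=16, rows=8):
    """Split each signature into bands; documents that agree on any
    whole band fall into the same bucket and become candidate pairs."""
    buckets = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for x, y in combinations(sorted(ids), 2):
            pairs.add((x, y))
    return pairs


def near_duplicates(docs, threshold=THRESHOLD):
    """Return pairs of doc ids whose shingle sets have Jaccard >= threshold.
    LSH proposes candidates; exact Jaccard then confirms each one."""
    sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    sigs = {doc_id: minhash_signature(s) for doc_id, s in sets.items()}
    return {
        (x, y)
        for x, y in lsh_candidate_pairs(sigs)
        if jaccard(sets[x], sets[y]) >= threshold
    }
```

Confirming candidates with exact Jaccard is a choice made in this sketch to rule out false positives from banding; a pipeline at the Pile's scale may instead rely on the signature-based estimate.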

theblackcat102 · 2023-03-27T23:13:20Z

Okay, I'm excited to see the paper soon!

theblackcat102 closed this as completed Mar 27, 2023

ChenghaoMou mentioned this issue May 21, 2023

Is there an access to the deduplicated version of the data with meta info? #92

Closed
