Difference between LFS and HuggingFace datasets? #112

eric-mitchell · 2023-06-28T07:46:54Z

In order to regenerate the ordering of the examples used during training, the README suggests downloading the dataset from LFS. I'm having issues with this process because LFS downloads two copies of the data, and I can fit only one, not two, on my hard drive. However, according to the discussion here, it seems like the datasets on HuggingFace also preserve the training order. If I just want to see the ordering of the samples, is there any reason not to just use the HuggingFace data?

Thanks!

haileyschoelkopf · 2023-07-03T13:46:35Z

Ahh, I meant to follow up and amend that README discussion: the HF datasets https://huggingface.co/datasets/EleutherAI/the_pile_deduplicated and https://huggingface.co/datasets/EleutherAI/raw_deduplicated_pile are NOT in the same shuffle order as the LFS pretokenized dataset.

However, those datasets should be in the same order as the data that was used with GPT-NeoX's preprocess_data.py (https://huggingface.co/datasets/EleutherAI/raw_deduplicated_pile/tree/main). One feasible solution, then, would be to tokenize that data and confirm via some checks that it does indeed end up in the same order as the pretokenized files, but this would still require having 2 copies of the data on your disk.
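As a rough illustration of the "confirm via some checks" step, here is a minimal sketch that compares two streams of tokenized documents position by position. The function name `first_mismatch` and the idea of feeding it plain iterables of token-ID sequences are assumptions made for this example; how you actually read documents out of the pretokenized files or the freshly tokenized data is not shown and depends on your tooling.

```python
from itertools import islice, zip_longest
from typing import Iterable, Optional, Sequence


def first_mismatch(
    docs_a: Iterable[Sequence[int]],
    docs_b: Iterable[Sequence[int]],
    limit: int = 1000,
) -> Optional[int]:
    """Return the index of the first document that differs, or None.

    Only the first `limit` positions are checked. Both inputs are
    hypothetical iterables that yield one token-ID sequence per document.
    """
    pairs = islice(zip_longest(docs_a, docs_b, fillvalue=None), limit)
    for i, (a, b) in enumerate(pairs):
        # One stream ended early: treat that position as a mismatch.
        if a is None or b is None:
            return i
        if list(a) != list(b):
            return i
    return None
```

A `None` result over a prefix only shows that the prefix matches; it does not prove the full orderings agree, so it may be worth also spot-checking positions deeper in the data.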

I could look into another way for us to distribute the data, so that you don't need to have 2 copies of the .bin file on disk in order to concatenate all the pieces. HuggingFace's 50GB limit per file is what we ran up against here, unfortunately.
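Until a different distribution format exists, one possible workaround for the disk-space problem is to append each downloaded shard to the output file and delete it right away, so the extra space needed is roughly the size of one shard rather than a full second copy. The sketch below assumes the shards are plain byte-wise pieces of a single .bin file that only need to be joined in order; the function name `merge_shards` and any shard paths you pass to it are hypothetical, and this is not the repository's own merge script.

```python
import os
import shutil
from typing import Sequence


def merge_shards(shard_paths: Sequence[str], out_path: str, delete: bool = True) -> None:
    """Append shards to out_path in the given order, optionally deleting each.

    Assumption: the shards are raw byte slices of one file. This sketch is
    not crash-safe: if it is interrupted mid-shard, out_path holds a
    partial copy of that shard and must be repaired by hand.
    """
    with open(out_path, "ab") as out:
        for path in shard_paths:
            with open(path, "rb") as src:
                # Copy in 16 MiB chunks to keep memory use small.
                shutil.copyfileobj(src, out, length=16 * 1024 * 1024)
            # Make sure the bytes are on disk before removing the source.
            out.flush()
            os.fsync(out.fileno())
            if delete:
                os.remove(path)
```

Because the output is opened in append mode, running the function twice on the same shards would duplicate data, so start from a fresh output path and keep the shard list in the correct order.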

haileyschoelkopf mentioned this issue Jul 3, 2023

What tool do you use for your data preprocessing/binarization? #69

Closed

haileyschoelkopf closed this as completed Jul 21, 2023
