Provide the shuffled index_mapping npy files for ease of reproducing training data #153

ziqi-zhang · 2024-03-14T17:05:56Z

Hi,

Could you provide the index_mapping files that are generated by GPT2Dataset? From the construction of GPT2Dataset here, I can see that there are three npy index files:

    doc_idx_filename = _filename + "_doc_idx.npy"
    sample_idx_filename = _filename + "_sample_idx.npy"
    shuffle_idx_filename = _filename + "_shuffle_idx.npy"

Could you provide a copy of these files so that I don't need to regenerate them?

I am asking because I want to study the influence of the original training data chunk by chunk. I have prepared the pythia-dedup dataset, but I failed to build the environment needed to regenerate the index files. After reading the code of GPT2Dataset, I found that with these index files I can reproduce the original training data of Pythia.
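To make the request concrete, here is a minimal sketch of how the three files could be used to recover one training sample. It assumes the Megatron-style layout that GPT2Dataset appears to follow: `shuffle_idx` maps a training position to a sample number, `sample_idx` stores one `(document position, token offset)` boundary per sample plus a final one, and `doc_idx` maps document positions to document ids. The function names and the toy data are hypothetical and only illustrate the lookup; the real code may differ in details.

```python
def load_index_files(prefix):
    """Load the three index arrays stored next to the dataset (needs numpy)."""
    import numpy as np  # only required for reading the .npy files

    doc_idx = np.load(prefix + "_doc_idx.npy", mmap_mode="r")
    sample_idx = np.load(prefix + "_sample_idx.npy", mmap_mode="r")
    shuffle_idx = np.load(prefix + "_shuffle_idx.npy", mmap_mode="r")
    return doc_idx, sample_idx, shuffle_idx


def get_training_sample(i, documents, doc_idx, sample_idx, shuffle_idx):
    """Return the tokens of the i-th sample in (assumed) training order.

    `documents` is any indexable collection of token-id sequences; it is a
    stand-in for the tokenized dataset, not the real dataset class.
    """
    s = int(shuffle_idx[i])  # shuffled position -> sample number
    doc_f, off_f = (int(x) for x in sample_idx[s])      # first boundary
    doc_l, off_l = (int(x) for x in sample_idx[s + 1])  # last boundary

    if doc_f == doc_l:
        # The whole sample lies inside a single document.
        return list(documents[int(doc_idx[doc_f])][off_f:off_l + 1])

    # Otherwise: tail of the first document, all documents in between,
    # then the head of the last document.
    tokens = list(documents[int(doc_idx[doc_f])][off_f:])
    for d in range(doc_f + 1, doc_l):
        tokens.extend(documents[int(doc_idx[d])])
    tokens.extend(documents[int(doc_idx[doc_l])][:off_l + 1])
    return tokens


# Toy data (hypothetical): three documents, sequence length 3, so each
# sample holds 4 tokens and shares its last token with the next sample.
documents = [[10, 11, 12], [20, 21], [30, 31, 32, 33]]
doc_idx = [2, 0, 1]                       # shuffled document order
sample_idx = [(0, 0), (0, 3), (1, 2)]     # boundaries for two samples
shuffle_idx = [1, 0]                      # shuffled sample order

first = get_training_sample(0, documents, doc_idx, sample_idx, shuffle_idx)
second = get_training_sample(1, documents, doc_idx, sample_idx, shuffle_idx)
```

With the real files, `load_index_files` would replace the toy arrays, and `documents` would be the tokenized pythia-dedup dataset. This is why having the original index files matters: without them, the shuffled order cannot be matched exactly.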

I noticed that you provide batch_viewer.py to check the unshuffled data, but that data still seems to differ from the actual training data fed to the model during training.

Thanks

ziqi-zhang changed the title ~~Provide the index_mapping npy files for ease of reproducing training data~~ Provide the shuffled index_mapping npy files for ease of reproducing training data Mar 14, 2024
