Provide the shuffled index_mapping npy files for ease of reproducing training data #153

Open
ziqi-zhang opened this issue Mar 14, 2024 · 0 comments


ziqi-zhang commented Mar 14, 2024

Hi,

Could you provide the index_mapping files that are generated by GPT2Dataset? From the construction of GPT2Dataset here, I can see that there are three .npy index files:

    doc_idx_filename = _filename + "_doc_idx.npy"
    sample_idx_filename = _filename + "_sample_idx.npy"
    shuffle_idx_filename = _filename + "_shuffle_idx.npy"

Could you share a copy of these files so that I don't need to regenerate them?

I am asking because I want to study the influence of the original training data by chunk. I have prepared the pythia-dedup dataset, but I failed to build the environment. After reading the GPT2Dataset code, I found that with these index files I can reproduce the original Pythia training data.
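A minimal sketch of the lookup these files would enable, under assumptions that should be checked against the GPT2Dataset code: shuffle_idx maps a position in training order to a sample index; sample_idx has one more row than there are samples, each row being (position in doc_idx, token offset), with consecutive samples sharing one boundary token; doc_idx lists document ids in epoch order. The names `load_index_files`, `lookup_sample`, and the toy `docs` list (standing in for the tokenized dataset) are hypothetical and not part of the repository.

```python
import numpy as np


def load_index_files(prefix):
    """Memory-map the three index arrays for a given filename prefix.

    The suffixes are the ones quoted in this issue; `prefix` is hypothetical.
    """
    doc_idx = np.load(prefix + "_doc_idx.npy", mmap_mode="r")
    sample_idx = np.load(prefix + "_sample_idx.npy", mmap_mode="r")
    shuffle_idx = np.load(prefix + "_shuffle_idx.npy", mmap_mode="r")
    return doc_idx, sample_idx, shuffle_idx


def lookup_sample(i, doc_idx, sample_idx, shuffle_idx, docs):
    """Return the tokens of the i-th sample in training (shuffled) order.

    ASSUMPTION: Megatron-style index layout, as described in the lead-in.
    `docs[d]` is the token array of document d.
    """
    s = int(shuffle_idx[i])
    doc_f, off_f = (int(x) for x in sample_idx[s])      # first document, start offset
    doc_l, off_l = (int(x) for x in sample_idx[s + 1])  # last document, end offset
    if doc_f == doc_l:
        # The whole sample lies inside a single document.
        return np.asarray(docs[doc_idx[doc_f]][off_f:off_l + 1])
    # Otherwise stitch: tail of the first document, any whole documents
    # in between, then the head of the last document.
    parts = [docs[doc_idx[doc_f]][off_f:]]
    for d in range(doc_f + 1, doc_l):
        parts.append(docs[doc_idx[d]])
    parts.append(docs[doc_idx[doc_l]][:off_l + 1])
    return np.concatenate(parts)


# Toy data: three documents, epoch order 2, 0, 1, sequence length 3
# (so each sample holds 3 + 1 tokens).
docs = [np.array([0, 1, 2]), np.array([3, 4, 5, 6]), np.array([7, 8])]
doc_idx = np.array([2, 0, 1])
sample_idx = np.array([[0, 0], [1, 1], [2, 1]])
shuffle_idx = np.array([1, 0])

first = lookup_sample(0, doc_idx, sample_idx, shuffle_idx, docs).tolist()
second = lookup_sample(1, doc_idx, sample_idx, shuffle_idx, docs).tolist()
print(first)   # [1, 2, 3, 4]
print(second)  # [7, 8, 0, 1]
```

With the real files, the arrays returned by `load_index_files` would replace the toy arrays, and `docs` would be replaced by whatever reader exposes the tokenized pythia-dedup documents.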

I noticed that you provide batch_viewer.py to check the unshuffled data, but this data still seems to differ from the actual training data fed into the model during training.

Thanks

ziqi-zhang changed the title from "Provide the index_mapping npy files for ease of reproducing training data" to "Provide the shuffled index_mapping npy files for ease of reproducing training data" Mar 14, 2024