Reshape error in batch viewer #158
Comments
It looks like the right dataset to use there is EleutherAI/pile-deduped-pythia-preshuffled, which gives uniform 2049-sized samples throughout.
Thank you for the great project!
I have successfully been able to merge all the shards from EleutherAI/pythia_deduped_pile_idxmaps.
However, while trying to get batches out of utils/batch_viewer.py, I get the following error:
Each sample here seems to be of uneven length, so it makes sense that this code would fail.
Would you be able to help me (or just point me to a code reference) so that I can chunk the documents into 2049-sized chunks? For context, I only want to run evaluations on a subset of the training data. I want the chunks to be constructed exactly as they were during training, so that I can put them in a dataloader and simply subsample on top (perhaps with something like torch.utils.data.Subset).
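To make the request concrete, here is a rough sketch of the kind of chunking I have in mind. It is only an illustration: it assumes each document is a plain list of token ids, concatenates them in order, and cuts the stream into 2049-sized pieces, dropping the leftover tail. The function name `pack_into_chunks` and the toy documents are made up for this example, and I do not know whether this matches how the training pipeline actually builds its samples.

```python
import random
from typing import Iterable, Iterator, List

SEQ_LEN = 2049  # chunk size discussed in this issue


def pack_into_chunks(
    docs: Iterable[List[int]], seq_len: int = SEQ_LEN
) -> Iterator[List[int]]:
    """Concatenate token lists in order and yield chunks of exactly seq_len tokens.

    Hypothetical helper: trailing tokens that do not fill a whole chunk are
    dropped. This is NOT claimed to reproduce the training-time sample layout.
    """
    buffer: List[int] = []
    for doc in docs:
        buffer.extend(doc)
        # Emit as many full chunks as the buffer currently holds.
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]


# Toy documents of uneven length (made-up token ids).
docs = [[1] * 3000, [2] * 500, [3] * 1000]
chunks = list(pack_into_chunks(docs))

# Pick a reproducible subset of chunk indices for evaluation.
rng = random.Random(0)
subset_indices = sorted(rng.sample(range(len(chunks)), k=1))
subset = [chunks[i] for i in subset_indices]
```

If the chunks were wrapped in a PyTorch dataset instead of a list, `torch.utils.data.Subset(dataset, subset_indices)` could play the role of the last line.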