Preconfigured Datasets are not available #949
Labels: bug
Comments
wget reports a certificate error for data.deepai.org: the certificate has expired.
I opened the link in Chrome to download it.
Clicking the link that you say is down downloads the file for me as well. Perhaps they had a short outage?
Describe the bug
The step:
"Several preconfigured datasets are available, including most components from the Pile, as well as the Pile train set itself, for straightforward tokenization using the prepare_data.py entry point.
E.G, to download and tokenize the enwik8 dataset with the GPT2 Tokenizer, saving them to ./data you can run:
"
produces a download error: https://data.deepai.org/enwik8.zip is down or otherwise unavailable.
To Reproduce
Steps to reproduce the behavior:
python prepare_data.py -d ./data
Expected behavior
The dataset downloads successfully.
Proposed solution
Consider providing an alternative download source, or instructions on how to obtain such a dataset by other means.
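One way the "another source" idea could look is a downloader that tries several URLs in order and keeps certificate verification enabled. This is only a rough sketch and not how prepare_data.py works: `download_first_available` and `fetch_with_urllib` are hypothetical helper names, and any mirror URL passed in is a placeholder, since no working mirror is known from this issue.

```python
import urllib.request
from pathlib import Path
from typing import Callable, Iterable


def fetch_with_urllib(url: str, timeout: float = 30.0) -> bytes:
    """Download a URL with the default (verifying) TLS settings."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_first_available(
    urls: Iterable[str],
    destination: Path,
    fetch: Callable[[str], bytes] = fetch_with_urllib,
) -> str:
    """Try each URL in order and save the first successful download.

    Hypothetical helper: returns the URL that worked, or raises
    RuntimeError listing why every source failed.
    """
    failures = []
    for url in urls:
        try:
            payload = fetch(url)
        except OSError as error:
            # urllib.error.URLError and ssl.SSLError (including expired
            # certificate errors) are both subclasses of OSError.
            failures.append(f"{url}: {error}")
            continue
        destination.parent.mkdir(parents=True, exist_ok=True)
        destination.write_bytes(payload)
        return url
    raise RuntimeError("all sources failed:\n" + "\n".join(failures))
```

It could be called with the URL from this report first and a mirror second, for example `download_first_available(["https://data.deepai.org/enwik8.zip", "https://mirror.example.org/enwik8.zip"], Path("./data/enwik8.zip"))`, where the second URL is a made-up placeholder. An expired certificate on the first host would then be recorded as a failure and the next source tried, rather than aborting the whole step.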