redpajama_v1/book jsonl file is empty #2

Open
YilunKuang opened this issue May 6, 2024 · 3 comments

Comments

@YilunKuang

Hi,

I was trying to download redpajama_v1/book, but the following link corresponding to this dataset is unavailable:

https://data.together.xyz/redpajama-data-1T/v1.0.0/book/book.jsonl
@TianhuaTao
Contributor

Hi. I just checked the latest URL list at https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt . It seems they removed the book subset from it, and the original URL for book is indeed no longer available. This looks like an issue on the RedPajama side, possibly related to copyright.
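
For anyone who wants to repeat this check, here is a minimal sketch that looks for a subset in the URL list. It assumes urls.txt is plain text with one URL per line and that the subset name appears as a directory in each path (as in .../v1.0.0/book/book.jsonl); the helper names `subset_urls` and `fetch_url_list` are made up for illustration.

```python
from urllib.parse import urlparse
from urllib.request import urlopen

# URL list mentioned above; assumed to contain one URL per line.
URL_LIST = "https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt"


def subset_urls(url_list_text, subset):
    """Return the URLs whose path has `subset` as a directory component."""
    matches = []
    for line in url_list_text.splitlines():
        url = line.strip()
        if not url:
            continue  # skip blank lines
        parts = urlparse(url).path.split("/")
        # Ignore the last component (the file name); match directories only.
        if subset in parts[:-1]:
            matches.append(url)
    return matches


def fetch_url_list(url=URL_LIST, timeout=30):
    """Download the URL list as text (hypothetical helper, needs network)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `subset_urls(fetch_url_list(), "book")` should return an empty list if the book subset has been dropped from the list, which is what I saw when checking by hand.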

@TianhuaTao
Contributor

Yeah, it's a copyright issue for sure. The dataset card at https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T says:

Gutenberg and Books3
Defunct: The 'book' config is defunct and no longer accessible due to reported copyright infringement for the Book3 dataset contained in this config.

@YilunKuang
Author

I see. Thanks for looking into it :) Feel free to close this issue.
