Allow Dataset to handle bzip2 compressed downloads #815

Closed
bencwallace opened this issue Jun 8, 2020 · 4 comments

Comments

@bencwallace

🚀 Feature

A minor extension to the torchtext Dataset class, allowing it to extract bzip2 compressed files after downloading them.

Motivation

bzip2 might not be as common as zip or gzip, but it's certainly not unheard of. Moreover, Dataset already uses the tarfile module to extract gzip-compressed tar archives, so this would be a very minor change.

My particular use case involves the WikiSmall and WikiLarge datasets (see here).

Pitch

I'm happy to implement this feature myself. I would just add an extra branch to this conditional statement.
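As a rough illustration of what such a branch might look like, here is a minimal sketch using only the standard library. The helper name `extract_bz2` and its extension checks are hypothetical and are not taken from torchtext; the actual conditional in Dataset.download may be organized differently.

```python
import bz2
import os
import shutil
import tarfile


def extract_bz2(path):
    """Hypothetical helper: extract a bzip2-compressed download in place.

    Returns the path of the extracted file, or the directory that a
    tar archive was unpacked into.
    """
    root, ext = os.path.splitext(path)
    _, ext_inner = os.path.splitext(root)
    dest_dir = os.path.dirname(path) or "."

    if ext in (".tbz", ".tbz2") or (ext == ".bz2" and ext_inner == ".tar"):
        # bzip2-compressed tar archive: tarfile handles both layers.
        with tarfile.open(path, "r:bz2") as tar:
            tar.extractall(dest_dir)
        return dest_dir
    if ext == ".bz2":
        # Single compressed file: stream it out and drop the .bz2 suffix.
        with bz2.open(path, "rb") as src, open(root, "wb") as dst:
            shutil.copyfileobj(src, dst)
        return root
    raise ValueError(f"not a bzip2 download: {path}")
```

The `r:bz2` mode mirrors the `r:gz` mode that tarfile offers for gzip-compressed archives, which is why the change is expected to be small.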

Other thoughts

Not sure if this should be a separate issue, but perhaps the conditional linked to above should have an else clause that warns the user if their data is left un-extracted.
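A minimal sketch of the fallback idea, assuming a hypothetical helper `extract_or_warn` (not part of torchtext). Note that it detects archives by content using `zipfile.is_zipfile` and `tarfile.is_tarfile`, whereas the conditional discussed above checks file extensions; the point here is only the final `else` branch.

```python
import os
import tarfile
import warnings
import zipfile


def extract_or_warn(path):
    """Hypothetical helper: extract a known archive type, else warn.

    Returns True if the file was extracted and False if it was left as-is.
    """
    dest_dir = os.path.dirname(path) or "."
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            zf.extractall(dest_dir)
        return True
    elif tarfile.is_tarfile(path):
        # The default mode lets tarfile detect gzip/bzip2/xz compression itself.
        with tarfile.open(path) as tar:
            tar.extractall(dest_dir)
        return True
    else:
        # Tell the user instead of silently leaving the download un-extracted.
        warnings.warn(f"{path} was not recognized as an archive and was left un-extracted")
        return False
```

Returning a boolean in addition to warning lets calling code decide whether an un-extracted download should be treated as an error.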

@zhangguanheng66
Contributor

There should also be a patch in extract_archive here

@bencwallace
Author

Thanks, I hadn't seen that before. I can't look at it carefully right now, but it seems to me the entire extraction procedure in Dataset.download should simply call (a patched version of) extract_archive.

@zhangguanheng66
Contributor

Dataset.download will be retired soon, once the new dataset abstraction (link) comes in, but the extract_archive function will continue to be used.

@parmeet
Contributor

parmeet commented Jun 23, 2022

We have migrated to torchdata datapipes, which support reading multiple types of compressed files.

@parmeet parmeet closed this as completed Jun 23, 2022