Allow Dataset to handle bzip2 compressed downloads #815

bencwallace · 2020-06-08T02:43:57Z

🚀 Feature

A minor extension to the torchtext Dataset class, allowing it to extract bzip2 compressed files after downloading them.

Motivation

Basically, bzip2 might not be as common as zip or gzip, but it's certainly not unheard of. Moreover, Dataset is already making use of the tarfile module to extract gzip files, so this would be a very minor change.

My particular use case involves the WikiSmall and WikiLarge datasets (see here).

Pitch

I'm happy to implement this feature myself. I would just add an extra branch to this conditional statement.
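For illustration, here is a rough sketch of what such a branch could look like. It is not the actual torchtext code: `extract_download`, its arguments, and the list of recognized suffixes are hypothetical names chosen for this example. It relies only on the standard library, where `tarfile` can already open bzip2 compressed tarballs with the `r:bz2` mode.

```python
import bz2
import os
import shutil
import tarfile
import zipfile


def extract_download(path, dest):
    """Hypothetical helper (not torchtext API): extract `path` into `dest`.

    Returns True if the file was extracted, False if the format
    was not recognized and the file was left as-is.
    """
    name = os.path.basename(path)
    if name.endswith(".zip"):
        with zipfile.ZipFile(path) as zf:
            zf.extractall(dest)
    elif name.endswith((".tgz", ".tar.gz")):
        with tarfile.open(path, "r:gz") as tar:
            tar.extractall(dest)
    elif name.endswith((".tar.bz2", ".tbz2")):
        # The proposed extra branch: bzip2 compressed tarballs.
        with tarfile.open(path, "r:bz2") as tar:
            tar.extractall(dest)
    elif name.endswith(".bz2"):
        # A single bzip2 compressed file that is not a tarball.
        out = os.path.join(dest, name[: -len(".bz2")])
        with bz2.open(path, "rb") as src, open(out, "wb") as dst:
            shutil.copyfileobj(src, dst)
    else:
        return False
    return True
```

The `.tar.bz2` check has to come before the plain `.bz2` check, since both suffixes match a bzip2 compressed tarball.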

Other thoughts

Not sure if this should be a separate issue, but perhaps the conditional linked to above should have an else clause that warns the user if their data is left un-extracted.
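A minimal sketch of the kind of warning I have in mind, assuming the download code knows which suffixes it can extract. The name `warn_if_unextracted` and the `KNOWN_SUFFIXES` tuple are made up for this example and do not exist in torchtext.

```python
import warnings

# Assumption: the suffixes the extraction code knows how to handle.
KNOWN_SUFFIXES = (".zip", ".tgz", ".tar.gz", ".tar.bz2", ".bz2")


def warn_if_unextracted(filename):
    """Hypothetical helper: warn when a download will be left un-extracted.

    Returns True if a warning was issued, False otherwise.
    """
    if filename.endswith(KNOWN_SUFFIXES):
        return False
    warnings.warn(
        f"{filename} was downloaded but not extracted: "
        "unrecognized archive format",
        stacklevel=2,
    )
    return True
```

Using `warnings.warn` rather than raising keeps the download usable for files that were never meant to be extracted, while still telling the user that nothing was unpacked.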

zhangguanheng66 · 2020-06-08T03:01:54Z

There should also be a patch in extract_archive here

bencwallace · 2020-06-08T03:06:54Z

Thanks, I hadn't seen that before. I can't look at it carefully right now, but it seems to me the entire extraction procedure in Dataset.download should simply call (a patched version of) extract_archive.

zhangguanheng66 · 2020-06-08T13:31:58Z

Dataset.download will be retired soon, once the new dataset abstraction (link) comes in. The extract_archive function, however, will still be used going forward.

parmeet · 2022-06-23T22:07:31Z

We have migrated to torchdata datapipes which support multiple types of compressed file reading.

zhangguanheng66 added the help wanted and feature request labels Jun 8, 2020

zhangguanheng66 added the legacy label Jun 8, 2020

adamjstewart mentioned this issue Jun 22, 2021

Add bzip2 file compression support pytorch/vision#4097

Merged

parmeet closed this as completed Jun 23, 2022
