gzipstream

gzipstream allows Python to process multi-part gzip files from a streaming source. The library was originally intended for use with the Python warc library for processing Common Crawl and other web archive data.

Installation

If you are using pip, run pip install -e git+https://github.com/commoncrawl/gzipstream.git#egg=gzipstream. You can also install with python setup.py install if you prefer.

Usage

As an example of usage, examples/streaming_commoncrawl_from_s3.py shows how gzipstream can be used to incrementally process a gzipped web archive (WARC) file. The file is almost a gigabyte in size, selected randomly from the 2014-15 Common Crawl dataset and hosted on Amazon S3. Without gzipstream, processing the file would only be possible after fully downloading it. This is highly inefficient as (a) a gzipped WARC file is composed of multiple independent gzip files, each of which can be decompressed on its own, and (b) the WARC file is hundreds of megabytes in size.

For minimal usage, however:

from gzipstream import GzipStreamFile
# Open in binary mode: gzip data is bytes, not text
f = open('huge_file.gz', 'rb') # Any streaming file object that supports `read`
gz = GzipStreamFile(f)
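
To illustrate why a file made of multiple independent gzip members can be processed incrementally, here is a rough sketch using only the Python standard library. It is not gzipstream's actual implementation; the function name iter_multimember_gzip, the make_sample helper, and the chunk sizes are hypothetical and chosen for illustration only. The sketch assumes Python 3.5 or later.

```python
import gzip
import io
import zlib


def iter_multimember_gzip(stream, chunk_size=16384):
    """Yield decompressed bytes from a stream of concatenated gzip members.

    Hypothetical helper for illustration; `stream` only needs a `read` method.
    """
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        while chunk:
            data = decomp.decompress(chunk)
            if data:
                yield data
            if decomp.eof:
                # This member has ended. Any leftover bytes belong to the
                # next member, so start a fresh decompressor for them.
                chunk = decomp.unused_data
                decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
            else:
                # All input was consumed; go read more from the stream.
                chunk = b""


def make_sample():
    """Build an in-memory stand-in for a multi-member gzip file."""
    members = [gzip.compress(b"record %d\n" % i) for i in range(3)]
    return io.BytesIO(b"".join(members))


# A tiny chunk size forces member boundaries to fall mid-chunk.
result = b"".join(iter_multimember_gzip(make_sample(), chunk_size=7))
print(result.decode("ascii"), end="")
```

Because each member is decompressed as soon as its bytes arrive, nothing requires the whole file to be downloaded first. This is the property that makes streaming a WARC file practical.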

License

MIT License, as per LICENSE
