Incorporated into cc_img_dl!

See: https://github.com/kingoflolz/cc_img_dl

Common Crawl Filter

Some quick and dirty code for downloading Common Crawl WAT files and parsing out images that are Creative Commons (CC) licensed. The output format is gzip-compressed JSONL.
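To make the filtering step concrete, here is a minimal Python sketch of the idea. It is not the Rust code in this repo. It assumes the commonly documented WAT layout, where each record's JSON lists page links under Envelope > Payload-Metadata > HTTP-Response-Metadata > HTML-Metadata > Links, and it assumes a page counts as CC licensed when any of its links points at creativecommons.org/licenses/. The function name `extract_cc_images` and the output fields are hypothetical.

```python
import json

# Assumption: a link containing this string marks the page as CC licensed.
LICENSE_MARKER = "creativecommons.org/licenses/"


def extract_cc_images(record):
    """Return image entries from one WAT JSON record if the page links to a CC license.

    The key path below is an assumption about the WAT layout; check it
    against real data before relying on it.
    """
    envelope = record.get("Envelope", {})
    links = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Links", [])
    )
    licenses = [link["url"] for link in links if LICENSE_MARKER in link.get("url", "")]
    if not licenses:
        return []  # no license link found, so skip the whole page
    page_url = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
    return [
        {"page": page_url, "image": link["url"], "alt": link.get("alt"), "license": licenses[0]}
        for link in links
        if link.get("path", "").startswith("IMG@") and "url" in link
    ]


# Hand-written sample record, not real crawl data.
sample = {
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "https://example.com/post"},
        "Payload-Metadata": {"HTTP-Response-Metadata": {"HTML-Metadata": {"Links": [
            {"path": "A@/href", "url": "https://creativecommons.org/licenses/by/4.0/"},
            {"path": "IMG@/src", "url": "https://example.com/cat.jpg", "alt": "a cat"},
        ]}}},
    }
}
print(json.dumps(extract_cc_images(sample)))
```

Links in real WAT records may be relative, so a fuller version would resolve them against the page URL before writing them out.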

This does not use asynchronous IO, but you don't need very many streams to saturate bandwidth or CPU.

Build and run instructions (single file)

RUSTFLAGS="-C target-cpu=native" cargo build --release
cp target/release/commoncrawl_filter .
./commoncrawl_filter https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2021-04/segments/1610703495901.0/wat/CC-MAIN-20210115134101-20210115164101-00000.warc.wat.gz CC-MAIN-20210115134101-20210115164101-00000.warc.wat.jsonl.gz
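Since the output is gzip-compressed JSONL, it can be inspected with a few lines of Python. This is a small sketch using only the standard library; the helper name `read_jsonl_gz` is hypothetical, and nothing is assumed about the fields inside each record.

```python
import gzip
import json


def read_jsonl_gz(path):
    """Yield one parsed JSON object per non-empty line of a gzip-compressed JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Iterating over `read_jsonl_gz("CC-MAIN-20210115134101-20210115164101-00000.warc.wat.jsonl.gz")` after the command above finishes should give one dict per output line.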

Run instructions (all WATs)

A Python helper script is also provided to run multiple instances of the downloader in parallel.

# get urls for all WATs
python3 download_warc_urls.py

# download and process everything
# arguments are <threads> <url list> <output path>
# e.g.
python3 download_cc.py 8 indexes_1614468564_warc_urls.txt out_dir
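
The source of download_cc.py is not reproduced here. As a rough sketch of how such a driver could work, the code below starts one blocking `commoncrawl_filter` process per URL and uses a thread pool to cap how many run at once, which is enough because each worker only waits on its subprocess. The function names, the default `./commoncrawl_filter` path, and the output naming rule (replace the trailing `.gz` with `.jsonl.gz`, as in the single-file example) are all assumptions.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def output_path_for(url, out_dir):
    """Map a WAT URL to an output file: x.warc.wat.gz -> out_dir/x.warc.wat.jsonl.gz."""
    name = url.rstrip("/").rsplit("/", 1)[-1]
    if name.endswith(".gz"):
        name = name[: -len(".gz")]
    return os.path.join(out_dir, name + ".jsonl.gz")


def run_one(url, out_dir, binary):
    """Run one blocking filter process and return (url, exit code)."""
    result = subprocess.run([binary, url, output_path_for(url, out_dir)])
    return url, result.returncode


def run_all(urls, out_dir, threads, binary="./commoncrawl_filter"):
    """Process every URL with at most `threads` filter processes running at once."""
    os.makedirs(out_dir, exist_ok=True)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(lambda u: run_one(u, out_dir, binary), urls))
```

With the URL list read into a Python list, `run_all(urls, "out_dir", 8)` would mirror the example invocation above and return an exit code per URL, so failed downloads can be retried.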

