CommonCrawl/Creative Commons Image Downloader

Scripts for extracting image URLs and license information from Common Crawl WATs, and for downloading and resizing the referenced images.

Compile instructions

To compile the Rust components (commoncrawl_filter and img_dl), run ./compile.sh from the root of the repo.

Run instructions

Note: all estimates are very rough and could easily be off by a factor of 2, but they should be the right order of magnitude.

First, extract the relevant data from the Common Crawl WAT files (~1.2PB ingress, ~500GB output, ~100 CPU days):

# get urls for all WATs
python3 download_warc_urls.py

# download and process all WATs
# Usage:
# python3 download_cc.py <threads> <url list> <output path>
python3 download_cc.py 8 indexes_1614468564_warc_urls.txt out_dir
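
For reference, the sketch below shows the kind of per-record filtering this stage performs, written in Python rather than the Rust commoncrawl_filter component that download_cc.py actually drives. It assumes the warcio library, and the license check (any creativecommons.org link with rel="license") is a simplified stand-in for the repository's exact logic.

import json
from warcio.archiveiterator import ArchiveIterator

def extract_cc_images(wat_path):
    # Iterate over the metadata records of a (gzipped) WAT file
    with open(wat_path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'metadata':
                continue
            data = json.loads(record.content_stream().read())
            try:
                html = data['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']['HTML-Metadata']
                links = html['Links']
            except KeyError:
                continue
            # Keep only pages that declare a Creative Commons license link
            licenses = [l.get('url', '') for l in links
                        if l.get('rel') == 'license' and 'creativecommons.org' in l.get('url', '')]
            if not licenses:
                continue
            # Collect the <img src=...> URLs recorded in the WAT link list
            images = [l['url'] for l in links if l.get('path') == 'IMG@/src' and 'url' in l]
            page_url = data['Envelope']['WARC-Header-Metadata'].get('WARC-Target-URI')
            if images:
                yield {'page': page_url, 'license': licenses[0], 'images': images}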

Then use dump_urls.py to create image-level metadata from the page-level metadata (~500GB input, ~250GB output, ~10 CPU days):

# usage:
# python3 dump_urls.py <threads> <input dir> <output dir (created automatically)>
python3 dump_urls.py 8 crawl urls
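
The core of this step is flattening each page record into one row per image. A minimal sketch of that transformation is below, assuming JSON-lines input and illustrative field names (url/page/license) rather than the repository's actual schema.

import json, sys

def flatten(page_record):
    # Emit one image-level record per image URL, carrying the page URL and license along
    for img_url in page_record['images']:
        yield {'url': img_url,
               'page': page_record['page'],
               'license': page_record['license']}

if __name__ == '__main__':
    for line in sys.stdin:
        for row in flatten(json.loads(line)):
            print(json.dumps(row))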

Use sort_dedup.py to perform URL-level deduplication (~250GB input, ~400GB output, ~500GB scratch space, ~15 CPU days):

# usage:
# python3 sort_dedup.py <threads> <input dir> <temp working dir> <output dir>
python3 sort_dedup.py 8 urls hash_clustered deduped_urls
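
The scratch directory holds hash-clustered intermediate files: records are bucketed by a hash of the image URL so that all duplicates of a URL land in the same bucket, and each bucket can then be deduplicated independently with bounded memory. A rough sketch of that idea follows; the bucket count and field names are assumptions, not the script's actual parameters.

import hashlib, json, os

N_BUCKETS = 256  # assumption; the real script may partition differently

def bucket_of(url):
    return int(hashlib.md5(url.encode()).hexdigest(), 16) % N_BUCKETS

def cluster(input_path, tmp_dir):
    # Scatter records into per-hash-bucket files in the scratch directory
    os.makedirs(tmp_dir, exist_ok=True)
    buckets = [open(os.path.join(tmp_dir, 'bucket_%03d.jsonl' % i), 'w')
               for i in range(N_BUCKETS)]
    with open(input_path) as f:
        for line in f:
            buckets[bucket_of(json.loads(line)['url'])].write(line)
    for b in buckets:
        b.close()

def dedup_bucket(bucket_path, out_file):
    # Duplicates of a URL always share a bucket, so a per-bucket set suffices
    seen = set()
    with open(bucket_path) as f:
        for line in f:
            url = json.loads(line)['url']
            if url not in seen:
                seen.add(url)
                out_file.write(line)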

Use download_images.py (which calls img_dl) to actually download the images (~500TB ingress, ~500TB output, ~200 CPU days):

# usage:
# python3 download_images.py <threads> <input dir> <error dir> <image output dir>
python3 download_images.py 8 deduped_urls errors images

# to retry failed downloads after a complete run, use the errors directory as the new input dir, e.g.
python3 download_images.py 8 errors new_errors images

# and again later perhaps
python3 download_images.py 8 new_errors new_new_errors images

# and so on; repeat until satisfied
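
The retry workflow works because failed URLs are written to the error directory in the same format as the input, so the error output of one pass can be fed straight back in as the input of the next. A simplified Python sketch of that loop is below; the real downloads go through the Rust img_dl component, and the requests-based fetch, file naming, and field names here are assumptions.

import hashlib, json, os, requests

def download_batch(url_file, error_file, out_dir, timeout=10):
    os.makedirs(out_dir, exist_ok=True)
    with open(url_file) as urls, open(error_file, 'w') as errors:
        for line in urls:
            rec = json.loads(line)
            name = hashlib.md5(rec['url'].encode()).hexdigest()
            try:
                r = requests.get(rec['url'], timeout=timeout)
                r.raise_for_status()
                with open(os.path.join(out_dir, name), 'wb') as f:
                    f.write(r.content)
            except Exception:
                # Failed records are written back out unchanged so a later pass can retry them
                errors.write(line)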

Use file_convert.py to convert all images to JPEG, resizing any that are too large and discarding any that are too small (~500TB input, ~200TB output, ~100 CPU days):

# usage:
# python3 file_convert.py <threads> <downloaded images> <deduped URL dir> <image output> <label output dir>
python3 file_convert.py 8 images deduped_urls converted_images labels
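
A minimal Pillow-based sketch of the per-image convert/resize/discard decision is below; the size thresholds and JPEG quality are illustrative assumptions, not the values used by file_convert.py.

import os
from PIL import Image

MIN_SIDE = 64   # assumed threshold: discard images smaller than this on their shorter side
MAX_SIDE = 512  # assumed threshold: downscale images larger than this on their longer side

def convert(src_path, dst_dir):
    os.makedirs(dst_dir, exist_ok=True)
    try:
        img = Image.open(src_path).convert('RGB')
    except Exception:
        return None  # unreadable or corrupt file; skip it
    if min(img.size) < MIN_SIDE:
        return None  # too small; discard
    if max(img.size) > MAX_SIDE:
        img.thumbnail((MAX_SIDE, MAX_SIDE))  # resize in place, preserving aspect ratio
    out_path = os.path.join(dst_dir,
                            os.path.splitext(os.path.basename(src_path))[0] + '.jpg')
    img.save(out_path, 'JPEG', quality=90)
    return out_path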

TODOs

  • Additional filtering
