ELTE-DH/cc_news_index

CommonCrawl NEWS dataset indexer

Creates a CDXJ index for the CommonCrawl NEWS dataset (there is no official index server for it).

Usage

  1. Set AWS API key and secret in boto.cfg (see example: example_boto.cfg)

  2. Set GNU parallel nodefile (see example: example_nodefile)

    • Copy this directory to the same path on all machines
  3. Set parameters as environment variables:

    • PYTHON (default: python3)
    • OUTPUT_DIR (default: $(PWD)/output)
    • BOTO_CFG (default: $(PWD)/boto.cfg)
    • NO_OF_THREADS (default: 80)
    • NICEVALUE (default: 10)
  4. Set the languages to collect in languages_to_collect.txt. Each entry has the format "[LANGUAGE NAME AS IN LINGUA]": (including the quotes and the trailing colon), because the entries are grepped directly from a JSONL file for speed.
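
To illustrate why each entry carries the quotes and the trailing colon, here is a minimal sketch of fixed-string matching against JSONL lines. The sample records, the "languages" field, and the URLs are hypothetical; the actual JSONL layout used by the pipeline may differ.

```python
import json

# Hypothetical contents of languages_to_collect.txt: each pattern keeps
# the surrounding quotes and the trailing colon.
patterns = ['"HUNGARIAN":', '"ENGLISH":']

# Hypothetical JSONL records; the real field layout is an assumption.
records = [
    {"url": "https://example.com/a", "languages": {"HUNGARIAN": 0.98}},
    {"url": "https://example.com/b", "languages": {"GERMAN": 0.95}},
    # The language name appears here only inside a URL, not as a JSON key.
    {"url": "https://example.com/HUNGARIAN", "languages": {"FRENCH": 0.91}},
]
lines = [json.dumps(record) for record in records]


def matches(line, patterns):
    """Plain substring test, similar in spirit to grep -F: no JSON parsing."""
    return any(pattern in line for pattern in patterns)


selected = [line for line in lines if matches(line, patterns)]
naive = [line for line in lines if "HUNGARIAN" in line]

print(len(selected), "line(s) matched with the quoted pattern")
print(len(naive), "line(s) matched with the bare language name")
```

Under these assumptions, the quoted pattern only hits lines where the language name is a JSON key, while the bare name would also hit the record whose URL happens to contain it.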

Run make to execute the whole process, or consult the Makefile for the individual steps.
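
As a rough sketch, the parameters can also be set programmatically before invoking make. The helper names build_env and run_make are hypothetical and not part of this repository; the defaults mirror the list above, with $(PWD) approximated by the current working directory, and the sketch assumes the Makefile reads these values from the environment.

```python
import os
import subprocess


def build_env(overrides=None):
    """Return a copy of the environment with the documented defaults filled in."""
    cwd = os.getcwd()  # stands in for $(PWD)
    defaults = {
        "PYTHON": "python3",
        "OUTPUT_DIR": os.path.join(cwd, "output"),
        "BOTO_CFG": os.path.join(cwd, "boto.cfg"),
        "NO_OF_THREADS": "80",
        "NICEVALUE": "10",
    }
    env = dict(os.environ)
    # Values already present in the environment take precedence over defaults.
    for name, value in defaults.items():
        env.setdefault(name, value)
    # Explicit overrides take precedence over everything else.
    env.update(overrides or {})
    return env


def run_make(overrides=None):
    """Run make in the current directory with the prepared environment."""
    return subprocess.run(["make"], env=build_env(overrides), check=True)


# Example call (run it from inside the repository directory):
# run_make({"NO_OF_THREADS": "16", "OUTPUT_DIR": "/data/cc_news_index"})
```

The example path /data/cc_news_index is only a placeholder. All values are passed as strings because environment variables cannot hold other types.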

License

This code is licensed under the GPL 3.0 license.
