Cirruswiki (CirrusSearch Wikipedia Processor) is a Python script designed to handle CirrusSearch Wikipedia dumps efficiently.
It downloads the dumps, extracts their data, processes it with a cleaner heavily inspired by WikiExtractor and a tokenizer, and finally indexes the results in Elasticsearch.
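For context, Cirrus dumps are newline-delimited JSON in Elasticsearch bulk format: each page is a metadata line followed by a source line. Below is a minimal sketch of reading one, assuming the gzipped file from the first example further down has already been downloaded; field names such as "title" and "text" are assumptions based on the usual cirrussearch-content dump layout:

import gzip
import json

# Sketch: iterate over (metadata, source) line pairs in a Cirrus dump.
with gzip.open("dewiki-20230515-cirrussearch-content.json.gz", "rt", encoding="utf-8") as f:
    while True:
        meta_line = f.readline()
        source_line = f.readline()
        if not source_line:
            break  # end of file
        meta = json.loads(meta_line)      # e.g. {"index": {"_id": ...}}
        page = json.loads(source_line)    # page fields such as "title" and "text"
        print(meta.get("index", {}).get("_id"), page.get("title"))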
usage: python cirrus_extractor.py [-h] [--link LINK] [--lang LANG]
                                  [--latest | --no-latest] [--process | --no-process]
                                  [--output OUTPUT] [--index INDEX]
                                  [--debug | --no-debug] [--verbose | --no-verbose]

options:
  -h, --help            show this help message and exit
  --link LINK           Download link
  --lang LANG           Language code
  --latest, --no-latest
                        Download latest dump
  --process, --no-process
                        Process the dump
  --output OUTPUT       Output directory
  --index INDEX         Index name to store the data in Elasticsearch
  --debug, --no-debug   Debug output
  --verbose, --no-verbose
                        Verbose output
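When --latest is combined with --lang, the script resolves the newest dump for that language on its own. The hypothetical sketch below shows one way such resolution could work, assuming the standard layout of Wikimedia's cirrussearch current/ directory; the find_latest_dump helper is illustrative, not part of the script:

import re
import requests

BASE = "https://dumps.wikimedia.org/other/cirrussearch/current/"

def find_latest_dump(lang: str) -> str:
    """Hypothetical helper: locate the current content dump for a language."""
    listing = requests.get(BASE, timeout=30).text
    # File names look like frwiki-20230515-cirrussearch-content.json.gz
    match = re.search(rf"{lang}wiki-\d{{8}}-cirrussearch-content\.json\.gz", listing)
    if match is None:
        raise ValueError(f"no content dump found for language {lang!r}")
    return BASE + match.group(0)

print(find_latest_dump("fr"))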
Here are a couple of examples demonstrating how to use Cirruswiki effectively:
- Downloading a specific Cirrus dump (e.g., the German Wikipedia dump from 2023-05-15), processing it, and indexing it in the Elasticsearch index dewiki:
python cirrus_extractor.py \
--link https://dumps.wikimedia.org/other/cirrussearch/current/dewiki-20230515-cirrussearch-content.json.gz \
--process \
--index dewiki \
--output output \
--verbose
- Downloading the latest Cirrus dump (e.g., the French Wikipedia dump), processing it, and indexing it in the Elasticsearch index frwiki (with debugging enabled):
python cirrus_extractor.py \
--lang fr \
--latest \
--process \
--index frwiki \
--output output \
--debug
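After either run, you can sanity-check the result with the official Elasticsearch Python client. This is a sketch assuming a local cluster on the default port; the index name matches the second example, and the "text" field name is an assumption about the processed document layout:

from elasticsearch import Elasticsearch

# Assumes Elasticsearch is reachable at the default local address.
es = Elasticsearch("http://localhost:9200")

# How many documents did the frwiki run index?
print(es.count(index="frwiki")["count"])

# Spot-check a full-text query against the indexed pages.
hits = es.search(index="frwiki", query={"match": {"text": "Paris"}}, size=3)
for hit in hits["hits"]["hits"]:
    print(hit["_id"], hit["_score"])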