Arkanosis/wikiextractor (forked from attardi/wikiextractor)

This repository has been archived by the owner on May 26, 2018. It is now read-only.

A tool for extracting plain text from Wikipedia dumps

wikiextractor

WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump.

The tool is written in Python and requires no additional libraries.

For further information, see the project Home Page or the Wiki.

The current beta version of WikiExtractor.py can perform template expansion to some extent.

Usage

The script is invoked with a Wikipedia dump file as an argument. The output is stored in a number of files of similar size in a chosen directory. Each file contains several documents in the format shown below.
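As a sketch of that format, based on the upstream attardi/wikiextractor output (the id, url, and title values here are illustrative; the url is built from the base URL given with --base):

 <doc id="12" url="https://en.wikipedia.org/wiki/Anarchism" title="Anarchism">
 Anarchism

 Anarchism is a political philosophy ...
 </doc>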

This is a beta version that performs template expansion by preprocessing the whole dump and extracting template definitions.

Usage:
 WikiExtractor.py [options] xml-dump-file

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        output directory
  -b n[KMGTPEZY], --bytes n[KMGTPEZY]
                        put specified bytes per output file (default is 1M)
  -B BASE, --base BASE  base URL for the Wikipedia pages
  -c, --compress        compress output files using bzip
  -l, --links           preserve links
  -ns ns1,ns2, --namespaces ns1,ns2
                        accepted namespaces
  -q, --quiet           suppress reporting progress info
  --debug               print debug info
  -s, --sections        preserve sections
  -a, --article         analyze a file containing a single article
  --templates TEMPLATES
                        use or create file containing templates
  --no-templates        Do not expand templates
  --threads THREADS     Number of threads to use (default 8)
  -v, --version         print program version
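
For example, the following invocation (the dump file name is illustrative) writes compressed output files of roughly 500 kilobytes each to a directory named extracted:

 WikiExtractor.py -o extracted -b 500K -c enwiki-latest-pages-articles.xml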

Saving templates to a file speeds up extraction the next time, assuming template definitions have not changed.

The --no-templates option significantly speeds up the extractor by avoiding the cost of expanding MediaWiki templates.
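
For instance, the first run below creates the template file if it does not already exist, and subsequent runs reuse it instead of rescanning the dump (the name templates.txt is illustrative):

 WikiExtractor.py --templates templates.txt -o extracted enwiki-latest-pages-articles.xml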
