WikiExtractor

WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump.

The tool is written in Python and requires Python 2.7 or Python 3.3+; it needs no additional libraries.

For further information, see the project Home Page or the Wiki.

Details

WikiExtractor performs template expansion by preprocessing the whole dump and extracting template definitions.

In order to speed up processing:

multiprocessing is used for dealing with articles in parallel
a cache is kept of parsed templates (only useful for repeated extractions).
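
The parallel step can be pictured with a minimal sketch. This is not the script's actual code: `clean_article` is a hypothetical stand-in for the real per-article cleaning, and the example assumes Python 3 (where `Pool` can be used as a context manager).

```python
from multiprocessing import Pool


def clean_article(wikitext):
    """Hypothetical stand-in for per-article cleaning.

    Only strips the '' (italic) and ''' (bold) quote markup.
    """
    return wikitext.replace("'''", "").replace("''", "")


def clean_all(articles, processes=2):
    """Clean a list of articles using a pool of worker processes."""
    with Pool(processes) as pool:
        # map() distributes the articles over the workers and
        # returns the results in the original order.
        return pool.map(clean_article, articles)


if __name__ == "__main__":
    sample = ["'''Python''' is a ''language''.", "Plain text."]
    for cleaned in clean_all(sample):
        print(cleaned)
```

Because articles are independent of one another once the template definitions are known, this kind of work splits naturally across processes.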

Installation

The script can be invoked directly without installation. To install it, run:

(sudo) python setup.py install

Usage

The script is invoked with a Wikipedia dump file as an argument. The output is stored in several files of similar size in a given directory. Each file contains several documents in the document format described below.

usage: WikiExtractor.py [-h] [-o OUTPUT] [-b n[KMG]] [-c] [--json] [--html]
                        [-l] [-s] [--lists] [-ns ns1,ns2]
                        [--templates TEMPLATES] [--no-templates] [-r]
                        [--min_text_length MIN_TEXT_LENGTH]
                        [--filter_disambig_pages] [-it abbr,b,big]
                        [-de gallery,timeline,noinclude] [--keep_tables]
                        [--processes PROCESSES] [-q] [--debug] [-a] [-v]
                        input

Wikipedia Extractor:
Extracts and cleans text from a Wikipedia database dump and stores output in a
number of files of similar size in a given directory.
Each file will contain several documents in the format:

    <doc id="" revid="" url="" title="">
        ...
        </doc>

If the program is invoked with the --json flag, then each file will
contain several documents formatted as json objects, one per line, with
the following structure

    {"id": "", "revid": "", "url":"", "title": "", "text": "..."}

Template expansion requires preprocessing first the whole dump and
collecting template definitions.

positional arguments:
  input                 XML wiki dump file

optional arguments:
  -h, --help            show this help message and exit
  --processes PROCESSES
                        Number of processes to use (default 1)

Output:
  -o OUTPUT, --output OUTPUT
                        directory for extracted files (or '-' for dumping to
                        stdout)
  -b n[KMG], --bytes n[KMG]
                        maximum bytes per output file (default 1M)
  -c, --compress        compress output files using bzip
  --json                write output in json format instead of the default one

Processing:
  --html                produce HTML output, subsumes --links
  -l, --links           preserve links
  -s, --sections        preserve sections
  --lists               preserve lists
  -ns ns1,ns2, --namespaces ns1,ns2
                        accepted namespaces in links
  --templates TEMPLATES
                        use or create file containing templates
  --no-templates        Do not expand templates
  -r, --revision        Include the document revision id (default=False)
  --min_text_length MIN_TEXT_LENGTH
                        Minimum expanded text length required to write
                        document (default=0)
  --filter_disambig_pages
                        Remove pages from output that contain disambiguation
                        markup (default=False)
  -it abbr,b,big, --ignored_tags abbr,b,big
                        comma separated list of tags that will be dropped,
                        keeping their content
  -de gallery,timeline,noinclude, --discard_elements gallery,timeline,noinclude
                        comma separated list of elements that will be removed
                        from the article text
  --keep_tables         Preserve tables in the output article text
                        (default=False)

Special:
  -q, --quiet           suppress reporting progress info
  --debug               print debug info
  -a, --article         analyze a file containing a single article (debug
                        option)
  -v, --version         print program version
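
As an illustration of consuming the --json output described above, here is a minimal sketch that reads every document back from an output directory. It assumes Python 3, UTF-8 output, and that files written with -c carry a `.bz2` suffix; the function name `iter_documents` is hypothetical, and the directory is walked recursively because the help text does not specify the file layout.

```python
import bz2
import json
import os


def iter_documents(output_dir):
    """Yield one dict per extracted document found under output_dir."""
    for root, _dirs, files in os.walk(output_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            # Assumption: compressed output files end in ".bz2".
            if name.endswith(".bz2"):
                handle = bz2.open(path, "rt", encoding="utf-8")
            else:
                handle = open(path, "r", encoding="utf-8")
            with handle:
                for line in handle:
                    line = line.strip()
                    if line:
                        # Each non-empty line is one JSON object.
                        yield json.loads(line)


if __name__ == "__main__":
    # "extracted" is a placeholder for the directory passed to -o.
    for doc in iter_documents("extracted"):
        print(doc["id"], doc["title"], len(doc["text"]))
```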

Saving templates to a file will speed up performing extraction the next time, assuming template definitions have not changed.

Option --no-templates significantly speeds up the extractor, avoiding the cost of expanding MediaWiki templates.
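
The two template-related options can be combined into a small driver, sketched here under some assumptions: `WikiExtractor.py` is in the current directory, and `dump.xml`, `extracted` and `templates.txt` are placeholder names. Only flags listed in the usage text above are used.

```python
import subprocess
import sys


def build_command(dump_path, output_dir, templates_file=None,
                  expand_templates=True, json_output=False, processes=1):
    """Assemble an argument list for WikiExtractor.py."""
    cmd = [sys.executable, "WikiExtractor.py",
           "-o", output_dir,
           "--processes", str(processes)]
    if json_output:
        cmd.append("--json")
    if not expand_templates:
        # Fastest option: skip template expansion entirely.
        cmd.append("--no-templates")
    elif templates_file is not None:
        # Use or create the file containing template definitions.
        cmd.extend(["--templates", templates_file])
    cmd.append(dump_path)
    return cmd


def run_extraction(*args, **kwargs):
    """Run the extractor and raise CalledProcessError if it fails."""
    subprocess.check_call(build_command(*args, **kwargs))


if __name__ == "__main__":
    # Print the command instead of running it; the paths are placeholders.
    print(" ".join(build_command("dump.xml", "extracted",
                                 templates_file="templates.txt",
                                 processes=4)))
```

Passing the same `templates_file` on a later run is what lets the saved definitions be reused.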

For further information, visit the documentation.
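
For the default (non-JSON) output, the `<doc ...> ... </doc>` blocks shown in the usage text can be split apart with a couple of regular expressions. This is a rough sketch rather than a robust parser: it assumes attribute values contain no double quotes or `>` characters, and `parse_docs` is a hypothetical helper name.

```python
import re

# One <doc ...> ... </doc> block; DOTALL lets the body span several lines.
DOC_RE = re.compile(r"<doc\s+([^>]*)>(.*?)</doc>", re.DOTALL)
# A single name="value" attribute inside the opening tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_docs(text):
    """Return a list of dicts holding each document's attributes and text."""
    docs = []
    for match in DOC_RE.finditer(text):
        doc = dict(ATTR_RE.findall(match.group(1)))
        doc["text"] = match.group(2).strip()
        docs.append(doc)
    return docs


if __name__ == "__main__":
    sample = (
        '<doc id="12" revid="34" url="https://example.org/?curid=12" title="First">\n'
        "First\n\nBody of the first document.\n"
        "</doc>\n"
    )
    for doc in parse_docs(sample):
        print(doc["id"], doc["title"])
```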

Installation and usage of OpenCC on Windows

For installation, refer to https://blog.csdn.net/helihongzhizhuo/article/details/47251935.

For OpenCC usage, refer to https://pypi.python.org/pypi/opencc-python/0.1.

Installation and usage of pyltp

For installation, refer to https://github.com/HIT-SCIR/ltp/blob/master/doc/install.rst.

For usage, refer to https://pyltp.readthedocs.io/zh_CN/latest/.

