
Syntactic Parsing of Wikipedia

This repository contains scripts for easily parsing Wikipedia (and other wikis).

It is based on the following tools:

  • mtg parser: outputs constituency trees, labelled dependency trees and morphological analyses.
  • A forked version of Benoît Crabbé's tokenizer, for French.
  • The tokenizer of the Stanford parser, for other languages.
  • Giuseppe Attardi's wikiextractor.

The parser is described in Chapter 8 of my dissertation and trained on the SPMRL dataset (Seddah et al. 2013) and the discontinuous Penn Treebank (Evang and Kallmeyer, 2011) for English.

If you use this data in the context of a publication, please cite:

@phdthesis{coavoux-phd-thesis,
    author = {Coavoux, Maximin},
    school = {Univ Paris Diderot, Sorbonne Paris Cit\'{e}},
    title = {Discontinuous Constituency Parsing of Morphologically Rich Languages},
    year = 2017
}

Data

Download the data

You can download the parsed data at this URL: https://www.llf.cnrs.fr/wikiparse/.

Here is what is available:

Language Constituency trees Discontinuous constituency trees Labelled dependency trees POS tags Morphological analysis
French X X X X
English X X
German X X X
Basque X X X
Polish X X X X
Swedish X X X
Hungarian X X X

Data description

For each wiki page, we provide two files:

  • ID.txt.tok.conll: dependency trees in CoNLL format, including morphological analysis.
  • ID.txt.tok.discbracket: constituency trees in discbracket format.

where ID is an identifier for the wiki page. The Wikipedia article corresponding to the file ID.txt.tok.conll is accessible at the URL https://fr.wikipedia.org/?curid=ID. For example, https://fr.wikipedia.org/?curid=1750 yields the article Linguistique. The parse trees for this article are in 1750.txt.tok.conll and 1750.txt.tok.discbracket.
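For instance, here is a minimal Python sketch (not part of the repository) that maps an article ID to its Wikipedia URL and to its two parse files; the data directory parsed/frwiki is only a placeholder:

from pathlib import Path

def article_files(data_dir, article_id, lang="fr"):
    # Build the article URL and the paths of the two parse files for this ID.
    url = f"https://{lang}.wikipedia.org/?curid={article_id}"
    conll = Path(data_dir) / f"{article_id}.txt.tok.conll"
    discbracket = Path(data_dir) / f"{article_id}.txt.tok.discbracket"
    return url, conll, discbracket

print(article_files("parsed/frwiki", 1750))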

The morphological analysis for a token is provided as a set of attribute-value pairs (see the CoNLL example below). The attributes include:

  • French: gender (g), number (n), tense (t), mood (m), subcategory (s), person (p), multiword expression (mwehead and pred).
  • German: case, number, gender, degree, tense, mood, person.

See the documentation of the SPMRL dataset release for more information about morphological annotations.
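As an illustration, here is a minimal Python sketch for turning the morphology field into a dictionary, assuming it always uses the key=value|key=value layout shown below, with "_" marking the absence of features:

def parse_feats(feats):
    # "_" means no morphological features.
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

print(parse_feats("g=f|n=s|s=def"))  # {'g': 'f', 'n': 's', 's': 'def'}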

Examples

Discbracket tree:

(ROOT (SENT (NP (DET 0=La) (NC 1=fin)  (PP (P 2=de) (NP (DET 3=la) (NC 4=mission) (NPP+ (NPP 5=STS) (PONCT 6=-) (ADJ 7=114))))) (VN (V 8=est) (VPP 9=prévue)) (PP (P 10=pour) (NP (DET 11=le) (ADJ 12=7) (NC 13=août))) (PONCT 14=.)))

Corresponding constituency tree (drawn with discodop):
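In the discbracket format, each terminal is written as index=token, so the tokens of a discontinuous constituent may appear out of linear order. Here is a minimal Python sketch (not the discodop reader) that restores the sentence order from the terminals; the tree variable is a shortened copy of the example above:

import re

tree = "(ROOT (SENT (VN (V 8=est) (VPP 9=prévue)) (NP (DET 0=La) (NC 1=fin))))"
terminals = re.findall(r"\((\S+) (\d+)=([^()\s]+)\)", tree)
tokens = [form for _, idx, form in sorted(terminals, key=lambda t: int(t[1]))]
print(" ".join(tokens))  # La fin est prévue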

CoNLL tree:

1	La	_	DET	DET	g=f|n=s|s=def	2	det	_	_
2	fin	_	NC	NC	g=f|n=s|s=c	10	suj	_	_
3	de	_	P	P	_	2	dep	_	_
4	la	_	DET	DET	g=f|n=s|s=def	5	det	_	_
5	mission	_	NC	NC	g=f|n=s|s=c	3	obj.p	_	_
6	STS	_	NPP	NPP	mwehead=NPP+|s=p|pred=y	5	mod	_	_
7	-	_	PONCT	PONCT	s=w|pred=y	6	dep_cpd	_	_
8	114	_	ADJ	ADJ	g=f|s=card|pred=y	6	dep_cpd	_	_
9	est	_	V	V	m=ind|n=s|p=3|t=pst	10	aux.pass	_	_
10	prévue	_	VPP	VPP	g=f|m=part|n=s|t=past	0	root	_	_
11	pour	_	P	P	_	10	mod	_	_
12	le	_	DET	DET	g=m|n=s|s=def	14	det	_	_
13	7	_	ADJ	ADJ	g=m|n=s|s=card	14	mod	_	_
14	août	_	NC	NC	g=m|n=s|s=c	11	obj.p	_	_
15	.	_	PONCT	PONCT	s=s	10	ponct	_	_

Corresponding dependency tree (drawn with ginger):
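To read such a file back, here is a minimal Python sketch (assuming the standard 10-column CoNLL layout shown above, with sentences separated by blank lines) that yields (dependent, relation, head) triples:

def sentence_arcs(rows):
    forms = {row[0]: row[1] for row in rows}
    forms["0"] = "ROOT"  # artificial root node
    for row in rows:
        yield row[1], row[7], forms[row[6]]  # dependent, relation, head

def read_arcs(path):
    with open(path, encoding="utf8") as f:
        rows = []
        for line in f:
            line = line.rstrip("\n")
            if line:
                rows.append(line.split("\t"))
            elif rows:  # blank line ends a sentence
                yield from sentence_arcs(rows)
                rows = []
        if rows:
            yield from sentence_arcs(rows)

for dep, rel, head in read_arcs("1750.txt.tok.conll"):
    print(dep, rel, head)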

Some stats (Wikipedia only)

Language Number of articles Number of sentences Number of tokens
English 5,490,659 210,149,524 3,641,031,044
Basque 284,192 3,423,803 53,630,159
German 2,109,141 72,803,332 1,225,851,095
French 1,917,621 48,675,094 1,303,811,180
Hungarian 418,216 12,962,605 178,171,115
Polish 1,244,308 27,580,861 393,371,595
Swedish 3,789,290 32,218,499 500,037,343

Reparse

Setup

Instructions for downloading and compiling these tools are in setup.sh. To run it, you need Boost, g++, clang++, and Java 8 (for the Stanford parser).

Source Data

The scripts need CirrusSearch dumps as input; they can be found at https://dumps.wikimedia.org/other/cirrussearch/.

To download and extract the data for a specific language, run:

bash download_extract_lang.sh <date> <language code>

where date is the timestamp of a Wikipedia dump (see the available dumps at the URL above) and language code is the identifier used by Wikipedia for the language.

For example:

bash download_extract_lang.sh 20171009 fr # download the French wiki dump of 9 October 2017
bash download_extract_lang.sh 20171009 ko # download the Korean wiki dump

Parse

Use the script parse_wiki.py to parse the data.

python3 parse_wiki.py --help
# python3 parse_wiki.py <parser exe> <parsing model> <path to tokenizer> <wiki root> --threads <num of threads> --beam <size of beam>

python3 parse_wiki.py ./mtg2_parser FRENCH ./tokenizer_fr extracted_texts/frwiki --threads 20
# or
python3 parse_wiki.py "./mtg2_parser -p " FRENCH ./tokenizer_fr extracted_texts/frwiki --threads 20

The -p option precomputes and caches character-based word embeddings (higher initialization time but faster parsing).

For French, each thread should use less than 1 GB of memory.

Pipeline:

  1. Print each article to <ID>.txt, where ID is an identifier for the article (e.g. https://fr.wikisource.org/?curid=1026462 yields the Wikisource page for Du côté de chez Swann).
  2. Call the tokenizer (sentence segmentation, tokenization) and do some preprocessing to match the input format of the parser (essentially, replace parentheses by -LRB- / -RRB-; see the sketch after this list). This outputs <ID>.txt.tok for each <ID>.txt file.
  3. Call the parser. The parser outputs:
    • <ID>.txt.tok.conll: a CoNLL file containing labelled dependency trees and morphological analyses.
    • <ID>.txt.tok.discbracket: a discbracket file containing constituency trees.
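Here is a minimal Python sketch of the bracket replacement mentioned in step 2; the exact substitutions performed by the real preprocessing script may differ:

def escape_brackets(token):
    # The parser's input format uses Penn Treebank escapes for parentheses.
    return {"(": "-LRB-", ")": "-RRB-"}.get(token, token)

print([escape_brackets(t) for t in ["Swann", "(", "1913", ")"]])
# ['Swann', '-LRB-', '1913', '-RRB-']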

References

  • Maximin Coavoux. Discontinuous Constituency Parsing of Morphologically Rich Languages. PhD dissertation, Université Paris Diderot, Université Sorbonne Paris Cité (USPC), 2017. [bib]
