An aligned subset of the Parallel Universal Dependencies

This repository contains word-alignment annotations for several language pairs from the Parallel Universal Dependencies (PUD) corpora. At the moment, English–French, English–Russian, English–Chinese, English–Japanese, and English–Korean are available. The analysis of the morphosyntactic divergences in these language pairs is reported in the following paper:

@inproceedings{nikolaevetal2020clmd,
	title="Fine-Grained Analysis of Cross-Linguistic Syntactic Divergences",
	author="Nikolaev, Dmitry and Arviv, Ofir and Karidi, Taelin and Kenneth, Neta and Mitnik, Veronika and Saeboe, Lilja Maria and Abend, Omri",
	booktitle="Proceedings of the 2020 {C}onference of the {A}ssociation for {C}omputational
		{L}inguistics",
	year="2020",
	pages="forthcoming"
}

ArXiv version: https://arxiv.org/abs/2005.03436

Alignments are stored in the alignments directory in subdirectories corresponding to PUD corpora. Each subdirectory contains three files:

en.conllu: the original CoNLL-U records from the English PUD corpus.
target.conllu: CoNLL-U records from the target corpus.
alignment.json: the alignments.

The .conllu files contain only the records for which alignments are available (999 sentences for En–Fr, 995 for En–Ru, 999 for En–Jp, 999 for En–Zh, and 884 for En–Ko). At the moment, these records are identical to the ones provided in the parent repo, but differences may accrue over time.
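As a rough illustration of the layout described above, the three files of one language pair could be loaded as follows. This is a minimal sketch, not code from this repository: the function names `read_conllu_sentences` and `load_pair` are hypothetical, and the CoNLL-U reader only splits the file on blank lines without parsing the columns.

```python
import json
import re
from pathlib import Path


def read_conllu_sentences(path):
    """Return the sentence blocks of a CoNLL-U file as lists of lines.

    Sentences in CoNLL-U are separated by blank lines; comment lines
    (starting with '#') are kept as part of each block.
    """
    text = Path(path).read_text(encoding="utf-8").strip()
    blocks = [block.splitlines() for block in re.split(r"\n{2,}", text)]
    return [block for block in blocks if block]


def load_pair(pair_dir):
    """Load en.conllu, target.conllu, and alignment.json from one subdirectory.

    `pair_dir` is whatever subdirectory of `alignments` you want to read;
    no particular subdirectory name is assumed here.
    """
    pair_dir = Path(pair_dir)
    en_sentences = read_conllu_sentences(pair_dir / "en.conllu")
    target_sentences = read_conllu_sentences(pair_dir / "target.conllu")
    with open(pair_dir / "alignment.json", encoding="utf-8") as handle:
        alignments = json.load(handle)
    return en_sentences, target_sentences, alignments
```

Comparing `len(en_sentences)`, `len(target_sentences)`, and `len(alignments)` against the sentence counts listed above is a quick sanity check. Whether the i-th alignment object corresponds to the i-th sentence in both .conllu files is an assumption worth verifying before relying on it.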

Data format

Alignments are stored as JSON files with lists of objects of the following form:

{
  "7": ["3"],
  "6": ["4"],
  "9": ["9", "10"],
  "X": ["34", "5", "6", "19", "29", "17"], 
  "33": ["X"], 
  "3": ["X"]
}

Keys are the IDs of nodes in the original UD tree that correspond to content words (see AlignmentManual.md for a discussion of the distinction between content and function words) or to function words that are in a many-to-one relationship with a target-side content word. Many-to-one relationships appear as several source-side keys mapping to the same one-element list ("7": ["3"], "12": ["3"]). One-to-many relationships are represented by a key-value pair with several IDs in the value list ("9": ["9", "10"]). "X" marks unaligned content words on the source side ("33": ["X"], "3": ["X"]) or on the target side ("X": ["34", "5", "6"]). Many-to-many relationships are prohibited by the annotation manual. In the case of aligned multiword expressions where no connections can be established between individual words, the headwords were aligned.
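The conventions above can be made concrete with a small helper that takes one alignment object and separates aligned pairs from unaligned words. This is only a sketch under the format described in this section: the function name `summarize_alignment` and the shape of its return value are hypothetical and do not come from the code in src.

```python
from collections import defaultdict

UNALIGNED = "X"  # marker for unaligned content words, as described above


def summarize_alignment(sentence_alignment):
    """Split one alignment object into pairs and unaligned IDs.

    Keys are treated as source-side IDs and values as lists of
    target-side IDs, following the description of the data format.
    """
    pairs = []
    unaligned_source = []
    unaligned_target = []
    for source_id, target_ids in sentence_alignment.items():
        if source_id == UNALIGNED:
            # "X": [...] lists target-side words with no source counterpart.
            unaligned_target.extend(target_ids)
        elif target_ids == [UNALIGNED]:
            # "33": ["X"] is a source-side word with no target counterpart.
            unaligned_source.append(source_id)
        else:
            pairs.extend((source_id, target_id) for target_id in target_ids)

    # Many-to-one: several source keys pointing at the same target ID.
    sources_by_target = defaultdict(list)
    for source_id, target_id in pairs:
        sources_by_target[target_id].append(source_id)
    many_to_one = {t: s for t, s in sources_by_target.items() if len(s) > 1}

    # One-to-many: a single source key with several target IDs.
    one_to_many = {
        s: list(t)
        for s, t in sentence_alignment.items()
        if s != UNALIGNED and len(t) > 1
    }
    return {
        "pairs": pairs,
        "unaligned_source": unaligned_source,
        "unaligned_target": unaligned_target,
        "many_to_one": many_to_one,
        "one_to_many": one_to_many,
    }
```

Calling it on an object such as `{"7": ["3"], "12": ["3"], "9": ["9", "10"], "33": ["X"], "X": ["34", "5"]}` reports `"3"` as the target of a many-to-one relationship, `"9"` as the source of a one-to-many relationship, and `"33"`, `"34"`, and `"5"` as unaligned.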
