Estonian L2 Grammatical Error Correction Corpus (EstGEC-L2)

The subset of the Estonian Grammatical Error Correction Corpus (EstGEC) that contains L2 learner writing, error-annotated in the M2 format.

This subcorpus currently consists of 263 texts and 3,790 sentences retrieved from the Estonian Interlanguage Corpus compiled at Tallinn University. The texts include narrative/descriptive and argumentative writings as well as informal and formal letters, and they represent various proficiency levels. The EstGEC-L2 material has been divided into a test set and a development (dev) set that can be used for evaluating and improving Estonian automated correction tools. The test set comprises 2,029 sentences and the dev set 1,761 sentences, distributed across the proficiency levels as follows:

A2 – 937 (495 in test set);
B1 – 963 (504 in test set);
B2 – 1,091 (534 in test set);
C1 – 796 (495 in test set).

Previously, the texts had been manually error-tagged in the CoNLL-U format, with the error type, scope, and correction recorded in the field for miscellaneous token attributes. The annotation has been converted to the M2 format (the conversion script can be found here) using an adapted version of the ERRANT tagset. Whereas the previous format was limited to one error annotation per sentence, up to two new annotation versions have now been added. Counting both annotation phases, each text has been reviewed by at least three annotators.
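
As a rough illustration of how M2 annotations can be read, the sketch below parses one sentence block in the standard M2 layout (`S` line with tokens, `A` lines with `start end|||type|||correction|||REQUIRED|||-NONE-|||annotator`). The `Edit` class, the `parse_m2_block` function and the sample block are assumptions made for this example; the sample is built from the M:LEX example in Table 1 and is not copied from the corpus files.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Edit:
    start: int        # index of the first token in the error span
    end: int          # index one past the last token in the span
    error_type: str   # e.g. "R:SPELL" or "M:LEX"
    correction: str   # replacement text; empty string for deletions
    annotator: int    # id of the annotator / annotation version


def parse_m2_block(block: str) -> Tuple[List[str], List[Edit]]:
    """Parse one M2 sentence block into (tokens, edits)."""
    tokens: List[str] = []
    edits: List[Edit] = []
    for line in block.strip().splitlines():
        if line.startswith("S "):
            tokens = line[2:].split()
        elif line.startswith("A "):
            fields = line[2:].split("|||")
            start, end = map(int, fields[0].split())
            edits.append(
                Edit(start, end, fields[1], fields[2], int(fields[-1]))
            )
    return tokens, edits


# Hypothetical block (not taken from the corpus): the word "on" is missing
# before token 1, so the span is empty (start == end).
sample = """S See väga ilus linn .
A 1 1|||M:LEX|||on|||REQUIRED|||-NONE-|||0"""

tokens, edits = parse_m2_block(sample)
print(tokens)
print(edits[0])
```

A whole file can be handled the same way by splitting its contents on blank lines and passing each block to `parse_m2_block`.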

There are 12 main and 18 combined error types in the error classification (see Tables 1 and 2). The prefix indicates whether a word, phrase or punctuation mark should be replaced ('R:'), is missing ('M:') or is unnecessary ('U:'). In our tagset, we do not distinguish the part of speech (POS) of the replaced, added or deleted word. For example, all word choice errors are indicated by the tag 'R:LEX'. This has helped to reduce the complexity of the error categorization, while allowing us to classify all errors and avoid the 'OTHER' tag. Distinguishing POS would otherwise produce numerous edit and POS combinations, since the edit types often overlap (e.g., spelling errors co-occur with inflection and word choice errors) and the POS of the original word and its replacement can differ.
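
The tag structure described above can be unpacked mechanically. The following is a minimal sketch, assuming that every tag is a prefix followed by colon-separated categories and that 'NOM:FORM' and 'VERB:FORM' are the only two-part category names; the `split_tag` helper is hypothetical and not part of the corpus tooling.

```python
from typing import List, Tuple

OPERATIONS = {"R": "replace", "M": "missing", "U": "unnecessary"}
# Category names that occupy two colon-separated parts (an assumption
# based on the tags listed in Tables 1 and 2).
TWO_PART = {("NOM", "FORM"), ("VERB", "FORM")}


def split_tag(tag: str) -> Tuple[str, List[str]]:
    """Split an error tag into its operation and its error categories."""
    prefix, *parts = tag.split(":")
    if prefix not in OPERATIONS or not parts:
        raise ValueError(f"unexpected tag: {tag!r}")
    categories: List[str] = []
    i = 0
    while i < len(parts):
        if tuple(parts[i:i + 2]) in TWO_PART:
            # Keep 'NOM:FORM' / 'VERB:FORM' together as one category.
            categories.append(":".join(parts[i:i + 2]))
            i += 2
        else:
            categories.append(parts[i])
            i += 1
    return OPERATIONS[prefix], categories


print(split_tag("R:LEX"))
print(split_tag("R:WS:NOM:FORM:SPELL"))
```

Splitting combined tags this way makes it possible to count, for instance, all edits that involve spelling, regardless of which other categories they are combined with.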

Another important difference from the English M2 annotation is that we allow overlapping error scopes if a token-level error occurs within a word order error, e.g., when one of the reordered words also contains a spelling error. Therefore, it is possible to detect token-level corrections even if the word order has not been edited.
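
To illustrate the overlapping scopes, the sketch below picks out token-level edits whose span lies inside a word order edit of the same sentence. It assumes edits are given as `(start, end, error_type)` tuples with M2-style token offsets and treats any tag containing the 'WO' category as a word order edit; the function name and the sample edits are hypothetical.

```python
from typing import List, Tuple

Edit = Tuple[int, int, str]  # (start, end, error_type)


def is_word_order(error_type: str) -> bool:
    """True if the tag contains the WO category, e.g. 'R:WO' or 'R:LEX:WO'."""
    return "WO" in error_type.split(":")


def nested_in_word_order(edits: List[Edit]) -> List[Edit]:
    """Return non-WO edits whose span is contained in some WO edit's span."""
    wo_spans = [(s, e) for s, e, t in edits if is_word_order(t)]
    nested: List[Edit] = []
    for s, e, t in edits:
        if is_word_order(t):
            continue
        if any(ws <= s and e <= we for ws, we in wo_spans):
            nested.append((s, e, t))
    return nested


# Hypothetical edits for one sentence: a word order error over tokens 2-3,
# a spelling error on token 3 inside it, and an unrelated punctuation error.
sample_edits = [(2, 4, "R:WO"), (3, 4, "R:SPELL"), (5, 6, "R:PUNCT")]
print(nested_in_word_order(sample_edits))
```

When applying corrections automatically, such nested edits need to be handled separately from their enclosing word order edit, because both refer to the same original tokens.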

Furthermore, orthography errors have been divided into capitalization and whitespace errors. Inflection errors are marked as nominal (noun, adjective, pronoun and numeral) or verb form errors without further distinction, i.e., these categories cover case, number, agreement, tense, mood and other errors in the choice of inflected form.

Table 1. Main error types

Error tag	Meaning	Example
R:SPELL	Spelling error	soobib -> sobib
R:CASE	Capitalization error	Juuli -> juuli
R:WS	Whitespace error	igalpool -> igal pool
R:NOM:FORM	Nominal form error	kallis -> kallid (Sing -> Plur)
R:VERB:FORM	Verb form error	tegeleb -> tegeles (Pres -> Past)
R:LEX	Word choice error	ilusasti -> ilus (ADV -> ADJ)
R:PUNCT	Punctuation choice error	Kohtumiseni. -> Kohtumiseni!
R:WO	Word order error	üldse polnud -> polnud üldse
M:LEX	Missing word(s)	See väga ilus linn -> See on väga ilus linn
U:LEX	Unnecessary word(s)	auto välimus on punane -> auto on punane
U:PUNCT	Unnecessary punctuation	laupäeval, kell 10 -> laupäeval kell 10

Table 2. Combined error types

Error tag	Meaning	Example
R:SPELL:CASE	Spelling and capitalization error	Vannalinnas -> vanalinnas
R:WS:SPELL	Whitespace and spelling error	liimik koht -> lemmikkoht
R:WS:CASE	Whitespace and capitalization error	Kontserdi majas -> kontserdimajas
R:WS:NOM:FORM	Whitespace and nominal form error	kogupäev -> kogu päeva (Nom -> Gen)
R:WS:NOM:FORM:SPELL	Whitespace, nominal form and spelling error	politika uudiseid -> poliitikauudised (Par -> Nom)
R:WS:NOM:FORM:CASE	Whitespace, nominal form and capitalization error	cv online -> CV-Online’i (Nom -> Gen)
R:NOM:FORM:SPELL	Nominal form and spelling error	ekskursioni ~ ekskursiooni -> ekskursioonile (Gen/Par -> All)
R:NOM:FORM:CASE	Nominal form and capitalization error	tartu -> Tartut (Nom -> Par)
R:NOM:FORM:SPELL:CASE	Nominal form, spelling and capitalization error	Sobrad ~ Sõbrad -> sõpradega (Nom -> Com)
R:VERB:FORM:SPELL	Verb form and spelling error	kaisin ~ käisin -> käin (Past -> Pres)
R:VERB:FORM:SPELL:CASE	Verb form, spelling and capitalization error	jstume ~ istume -> Istusime (Pres -> Past)
R:LEX:SPELL	Word choice and spelling error	laksin ~ läksin -> käisin
R:LEX:CASE	Word choice and capitalization error	võimalikult -> Võimalik (ADV -> ADJ)
R:LEX:NOM:FORM	Word choice and nominal form error	muusikaid -> muusikastiilid (Par -> Nom)
R:LEX:VERB:FORM	Word choice and verb form error	(mina) oli -> (mina) käisin (3rd person -> 1st person)
R:LEX:WO	Word choice error affects word order	läbi interneti -> interneti kaudu
R:LEX:WS	Word choice and whitespace error	oma teist -> teineteist
R:WO:NOM:FORM	Word order error affects the choice of nominal form	pealinn Islandil -> Islandi pealinn (Ade -> Gen)

References

The dataset has been used to evaluate the GEC toolkit developed jointly by the language technology groups of the University of Tartu and Tallinn University. The L1 subset of the EstGEC corpus is being annotated at the University of Tartu.
The M2 Scorer adapted for EstGEC can be found here.
Conference presentations:
- 8th Estonian Digital Humanities Conference, October 5-7, 2022, Tallinn
- 20th Annual Conference of Applied Linguistics, April 27-28, 2023, Tallinn (in Estonian)
