Estonian Grammatical Error Correction (GEC) test and development corpus that contains L2 learner texts error-annotated in the M2 format.


tlu-dt-nlp/EstGEC-L2-Corpus


Estonian L2 Grammatical Error Correction Corpus (EstGEC-L2)

EstGEC-L2 is the subset of the Estonian Grammatical Error Correction Corpus (EstGEC) that contains L2 learner writings error-annotated in the M2 format.

This subcorpus currently consists of 263 texts and 3,790 sentences retrieved from the Estonian Interlanguage Corpus compiled at Tallinn University. The texts include narrative/descriptive and argumentative writings as well as informal and formal letters, and they represent various proficiency levels. The EstGEC-L2 material has been divided into a test set and a development set that can be used for evaluating and improving automated correction tools for Estonian. The test set comprises 2,029 sentences and the dev set 1,761 sentences, distributed between the proficiency levels as follows:

  • A2 – 937 (495 in test set);
  • B1 – 963 (504 in test set);
  • B2 – 1,091 (534 in test set);
  • C1 – 796 (495 in test set).

The texts were previously error-tagged by hand in the CoNLL-U format, with the error type, scope, and correction recorded in the field for miscellaneous token attributes. That annotation has been converted to the M2 format (the conversion script can be found here) using an adapted version of the ERRANT tagset. Whereas the previous format was limited to one error annotation per sentence, up to two new annotation versions have been added. Across the two annotation phases, each text has been reviewed by at least three annotators.
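For readers unfamiliar with M2, the following is a minimal sketch of how such a file could be read in Python. It assumes the standard M2 layout: an `S` line with the tokenized sentence, followed by `A` lines of the form `start end|||type|||correction|||REQUIRED|||-NONE-|||annotator`. The names `Edit`, `Sentence`, `parse_m2` and `apply_edits` are illustrative and are not part of this repository, and the sample sentence is built from the `M:LEX` example in Table 1 rather than copied from the corpus files.

```python
from dataclasses import dataclass, field


@dataclass
class Edit:
    start: int        # index of the first source token covered by the edit
    end: int          # index one past the last covered token
    error_type: str   # e.g. "M:LEX"
    correction: str   # replacement text, empty for deletions
    annotator: int    # id of the annotator (annotation version)


@dataclass
class Sentence:
    tokens: list
    edits: list = field(default_factory=list)


def parse_m2(text):
    """Parse M2-formatted text into a list of Sentence objects."""
    sentences = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("S "):
            sentences.append(Sentence(tokens=line[2:].split()))
        elif line.startswith("A ") and sentences:
            fields = line[2:].split("|||")
            start, end = (int(i) for i in fields[0].split())
            sentences[-1].edits.append(
                Edit(start, end, fields[1], fields[2], int(fields[-1]))
            )
    return sentences


def apply_edits(sentence, annotator=0):
    """Return the corrected tokens for one annotator.

    Assumes the selected edits do not overlap; "noop" edits with
    negative offsets are skipped.
    """
    tokens = list(sentence.tokens)
    edits = [e for e in sentence.edits
             if e.annotator == annotator and e.start >= 0]
    # Apply from right to left so that earlier offsets stay valid.
    for e in sorted(edits, key=lambda e: (e.start, e.end), reverse=True):
        tokens[e.start:e.end] = e.correction.split()
    return tokens


# Hypothetical sample built from the M:LEX example in Table 1.
SAMPLE = """S See väga ilus linn
A 1 1|||M:LEX|||on|||REQUIRED|||-NONE-|||0
"""

parsed = parse_m2(SAMPLE)
print(" ".join(apply_edits(parsed[0])))  # See on väga ilus linn
```

The right-to-left application in `apply_edits` is only safe when edit spans do not overlap, so sentences with nested annotations would need extra handling.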

There are 12 main and 18 combined error types in the error classification (see tables 1 and 2). The prefix indicates whether a word, phrase or punctuation mark should be replaced ('R:'), is missing ('M:') or unnecessary ('U:'). In our tagset, we do not distinguish the part-of-speech (POS) of the replaced, added or deleted word. For example, all word choice errors are indicated by the tag 'R:LEX'. This has helped to reduce the complexity of the error categorization, while allowing us to classify all errors and avoid the 'OTHER' tag. There are numerous edit and POS combinations, since the edit types often overlap (e.g., spelling errors co-occur with inflection and word choice errors) and the POS of the original word and its replacement can differ.
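Because combined tags are built by chaining main categories after the prefix, a tag can be decomposed mechanically. The sketch below is an illustration inferred from the tag names in tables 1 and 2, not a utility shipped with the corpus; `parse_tag`, `OPERATIONS` and `COMPOUND` are hypothetical names. It treats `NOM:FORM` and `VERB:FORM` as single two-part categories.

```python
OPERATIONS = {"R": "replace", "M": "missing", "U": "unnecessary"}
# Two-part category names that must stay together when a tag is split.
COMPOUND = {("NOM", "FORM"), ("VERB", "FORM")}


def parse_tag(tag):
    """Split an error tag into (operation, list of categories)."""
    prefix, *parts = tag.split(":")
    if prefix not in OPERATIONS or not parts:
        raise ValueError(f"unexpected tag: {tag!r}")
    categories = []
    i = 0
    while i < len(parts):
        if tuple(parts[i:i + 2]) in COMPOUND:
            categories.append(":".join(parts[i:i + 2]))
            i += 2
        else:
            categories.append(parts[i])
            i += 1
    return OPERATIONS[prefix], categories


print(parse_tag("R:LEX"))                # ('replace', ['LEX'])
print(parse_tag("R:WS:NOM:FORM:SPELL"))  # ('replace', ['WS', 'NOM:FORM', 'SPELL'])
```

Splitting tags this way makes it possible to count, for instance, every edit that involves a spelling component regardless of what it is combined with.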

Another important difference from the English M2 annotation is that we allow overlapping error scopes when a token-level error occurs within a word order error, e.g., when one of the reordered words also contains a spelling error. Therefore, it is possible to detect token-level corrections even if the word order has not been edited.
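Tools that assume non-overlapping edits may need to treat such nested annotations separately. The following sketch shows one way to find them. It assumes that the edits of one sentence and one annotator are available as `(start, end, error_type)` tuples with M2-style token offsets; the function name `nested_in_word_order` and the example offsets are hypothetical.

```python
def nested_in_word_order(edits):
    """Pair each word order edit with the edits lying inside its span."""
    # A tag counts as a word order edit if "WO" is one of its parts,
    # which covers R:WO as well as R:LEX:WO and R:WO:NOM:FORM.
    word_order = [e for e in edits if "WO" in e[2].split(":")]
    result = []
    for wo in word_order:
        inner = [
            e for e in edits
            if e is not wo and wo[0] <= e[0] and e[1] <= wo[1]
        ]
        result.append((wo, inner))
    return result


# Hypothetical offsets: a spelling error inside a reordered two-token span.
edits = [(2, 4, "R:WO"), (3, 4, "R:SPELL"), (5, 6, "R:PUNCT")]
for wo, inner in nested_in_word_order(edits):
    print(wo, inner)  # (2, 4, 'R:WO') [(3, 4, 'R:SPELL')]
```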

Furthermore, orthography errors have been divided into capitalization and whitespace errors. Inflection errors are marked as nominal (noun, adjective, pronoun and numeral) or verb form errors without a further distinction, i.e., these include case, number, agreement, tense, mood and other errors in the choice of inflected form.

Table 1. Main error types

| Error tag | Meaning | Example |
| --- | --- | --- |
| R:SPELL | Spelling error | soobib -> sobib |
| R:CASE | Capitalization error | Juuli -> juuli |
| R:WS | Whitespace error | igalpool -> igal pool |
| R:NOM:FORM | Nominal form error | kallis -> kallid (Sing -> Plur) |
| R:VERB:FORM | Verb form error | tegeleb -> tegeles (Pres -> Past) |
| R:LEX | Word choice error | ilusasti -> ilus (ADV -> ADJ) |
| R:PUNCT | Punctuation choice error | Kohtumiseni. -> Kohtumiseni! |
| R:WO | Word order error | üldse polnud -> polnud üldse |
| M:LEX | Missing word(s) | See väga ilus linn -> See on väga ilus linn |
| U:LEX | Unnecessary word(s) | auto välimus on punane -> auto on punane |
| U:PUNCT | Unnecessary punctuation | laupäeval, kell 10 -> laupäeval kell 10 |

Table 2. Combined error types

| Error tag | Meaning | Example |
| --- | --- | --- |
| R:SPELL:CASE | Spelling and capitalization error | Vannalinnas -> vanalinnas |
| R:WS:SPELL | Whitespace and spelling error | liimik koht -> lemmikkoht |
| R:WS:CASE | Whitespace and capitalization error | Kontserdi majas -> kontserdimajas |
| R:WS:NOM:FORM | Whitespace and nominal form error | kogupäev -> kogu päeva (Nom -> Gen) |
| R:WS:NOM:FORM:SPELL | Whitespace, nominal form and spelling error | politika uudiseid -> poliitikauudised (Par -> Nom) |
| R:WS:NOM:FORM:CASE | Whitespace, nominal form and capitalization error | cv online -> CV-Online’i (Nom -> Gen) |
| R:NOM:FORM:SPELL | Nominal form and spelling error | ekskursioni ~ ekskursiooni -> ekskursioonile (Gen/Par -> All) |
| R:NOM:FORM:CASE | Nominal form and capitalization error | tartu -> Tartut (Nom -> Par) |
| R:NOM:FORM:SPELL:CASE | Nominal form, spelling and capitalization error | Sobrad ~ Sõbrad -> sõpradega (Nom -> Com) |
| R:VERB:FORM:SPELL | Verb form and spelling error | kaisin ~ käisin -> käin (Past -> Pres) |
| R:VERB:FORM:SPELL:CASE | Verb form, spelling and capitalization error | jstume ~ istume -> Istusime (Pres -> Past) |
| R:LEX:SPELL | Word choice and spelling error | laksin ~ läksin -> käisin |
| R:LEX:CASE | Word choice and capitalization error | võimalikult -> Võimalik (ADV -> ADJ) |
| R:LEX:NOM:FORM | Word choice and nominal form error | muusikaid -> muusikastiilid (Par -> Nom) |
| R:LEX:VERB:FORM | Word choice and verb form error | (mina) oli -> (mina) käisin (3rd person -> 1st person) |
| R:LEX:WO | Word choice error affects word order | läbi interneti -> interneti kaudu |
| R:LEX:WS | Word choice and whitespace error | oma teist -> teineteist |
| R:WO:NOM:FORM | Word order error affects the choice of nominal form | pealinn Islandil -> Islandi pealinn (Ade -> Gen) |
