Corpws ar gyfer meincnodi tagwyr rhannau ymadrodd Cymraeg | A corpus for benchmarking Welsh part-of-speech taggers

techiaith/corpws-meincnodi-rhannau-ymadrodd

Corpws Meincnodi Tagwyr Rhan Ymadrodd

English text below

Mae hwn yn gorpws o bron i 25,000 o eiriau o destun ar ffurf 1,500 o frawddegau a godwyd o amryw o ffynonellau gwahanol gyda'r bwriad o greu prawf da a chytbwys o allu unrhyw dagiwr rhan ymadrodd Cymraeg.

Cynnwys

Fe gynlluniwyd y Corpws Meincnodi i gynnwys cynrychioliad eang o fathau gwahanol o Gymraeg cyfoes mewn orgraff fodern. Cynhwysir ynddo amrywiad o destunau er mwyn gwobrwyo'r gallu i gyffredinoli i ffurfiau ac orgraff lai safonol, ond Cymraeg fel y caiff ei gynhyrchu ar ffurf testun heddiw oedd yr hyn a ganolbwyntiwyd arno yn bennaf wrth lunio'r corpws. O ran cynnwys y brawddegau, ymdrechwyd i sicrhau bod ynddynt amrywiaeth o ran cywair, arddull, tafodiaith a phwnc, ac fe gyfeiriwyd at y fframweithiau a ddefnyddwyd gan CEG a CorCenCC wrth wneud hynny. Yn ogystal â sicrhau amrywiaeth o ran mathau'r testunau ffynhonnell, ceisiwyd hefyd sicrhau bod amrywiaeth o ran amser a pherson o fewn cystrawennau'r brawddegau.

Tagio a Gwerthuso

Ein bwriad dros y misoedd nesaf, ar gais Llywodraeth Cymru, yw tagio'r brawddegau hyn mewn modd a fydd yn caniatáu gwerthuso a chymharu gwahanol dagwyr rhan ymadrodd Cymraeg. Y nifer o eiriau o gorpws y mae'n bosib eu tagio yn gywir yn ystod y cyfnod nesaf fydd y gwir derfyn ar faint y Corpws Meincnodi llawn. Gan fod y gwahanol dagwyr yn defnyddio gwahanol setiau o dagiau rhan ymadrodd, byddwn yn datblygu set gyffredinol o dagiau cyfryngol i hwyluso'r gymhariaeth honno, yn ogystal â fframwaith i alluogi'r cymharu a'r gwerthuso.

Trwydded

Yn wahanol i'n data hyfforddi, a drwyddedir o dan drwydded CC0, trwyddedir y corpws hwn o dan drwydded fwy caethiwus, sef CC-BY-SA. Y rheswm am hynny yw bod defnyddio CC-BY-SA yn ein galluogi i godi enghreifftiau o ffynonellau pwysig megis Wicipedia, deunyddiau gan Goleg Cymraeg Cenedlaethol Cymru a chorpora CorCenCC a Chorpws Siarad. Gellir dosbarthu'r corpws meincnodi hwn yn rhydd cyhyd â bod y gofynion o ran cydnabyddiaeth a 'rhannu cyffelyb' (hynny yw, 'sharealike') y drwydded CC-BY-SA yn cael eu parchu.

Cydnabod ein gwaith

Os defnyddiwch chi'r adnodd hwn, gofynnwn yn garedig i chi gydnabod a chyfeirio at ein gwaith. Mae cydnabyddiaeth o'r fath yn gymorth i ni sicrhau cyllid yn y dyfodol i greu rhagor o adnoddau defnyddiol i'w rhannu.

Cydnabyddiaeth

Defnyddwyd testunau o'r adnoddau canlynol yn y Corpws Meincnodi:

Corpws Siarad 2014, Corpws:Siarad, Deuchar, M., Davies, P. & Donnelly, K., Cyrchwyd ar 03/12/2020 https://bangortalk.org.uk/speakers.php?c=siarad

Ellis, N. C., O'Dochartaigh, C., Hicks, W., Morgan, M., & Laporte, N. (2001). Cronfa Electroneg o Gymraeg (CEG): A 1 million word lexical database and frequency count for Welsh. [On-line]

Gwales.com 2020, Llyfrau, gwales.com, Cyrchwyd ar 03/12/2020 https://www.gwales.com/books/?tsid=15

Hwb Cymru 2020, Dysgu ac addysgu i Gymru, Hwb Cymru, Cyrchwyd ar 03/12/2020 https://hwb.gov.wales/

James, E. W. 2018, Williams, William (Pantycelyn), James, E. W, Cyrchwyd ar 03/12/2020 https://orca.cf.ac.uk/128971/1/Williams%2C%20William%20%28Pantycelyn%29.pdf

Knight D, Morris S, Fitzpatrick T, et al. (2020). CorCenCC: Corpws Cenedlaethol Cymraeg Cyfoes – the National Corpus of Contemporary Welsh (Version 1.0.0). Cardiff University. https://doi.org/10.17035/d.2020.0119878310

Meddwl.org 2020, hafan, meddwl.org, Cyrchwyd ar 03/12/2020 https://meddwl.org/

Porth Coleg Cymraeg Cenedlaethol 2020, Hafan, Coleg Cymraeg Cenedlaethol, Cyrchwyd ar 03/12/2020 https://wici.porth.ac.uk/index.php/Hafan

Raspberrypi.org 2020, Hello, Raspberrypi.org, Cyrchwyd ar 03/12/2020 https://projects.raspberrypi.org/

Wici Pobol y Cwm 2020, Home, Wici Pobol y Cwm, Cyrchwyd ar 03/12/2020 https://pobol-y-cwm.fandom.com/cy/wiki/Main_Page

Wiki Y Cyfryngau Cymraeg 2020, Home, Wiki Y Cyfryngau Cymraeg, Cyrchwyd ar 03/12/2020 https://y-cyfryngau-cymraeg.fandom.com/cy/wiki/Main_Page

Wicipedia 2020, Croeso i Wicipedia, Sefydliad Wikimedia, Cyrchwyd ar 03/12/2020 https://cy.wikipedia.org/wiki/Hafan

Ymddiriolaeth Adeiladu Cymru 2019, Rhestr gyfeirio cynllunio digwyddiadau, Ymddiriolaeth Adeiladu Cymru, Cyrchwyd ar 03/12/2020 https://www.yac.cymru/uploads/resources/2019-03-13-24-2-bct-event-planning-checklist-c.pdf


Welsh Part-of-Speech Tagger Benchmarking Corpus

Benchmark Corpus

This is a corpus of approximately 25,000 words, in the form of 1,500 sentences drawn from a variety of sources, intended as a good, balanced test of how accurately Welsh part-of-speech (POS) taggers tag Welsh-language text.

Contents

The Benchmark Corpus is designed to include a broad representation of different types of contemporary Welsh in modern orthography. To reward the ability to generalize to less standard forms and orthographical conventions, the corpus contains a range of different texts. However, the main focus during corpus construction was Welsh-language text as it is produced today.

Efforts were made to ensure that the sentences included in the corpus vary in register, style, dialect and subject matter, with reference to the frameworks used by CEG and CorCenCC. As well as ensuring variety in the types of source text, we also sought variety of tense and person within the sentence structures.

Tagging and Evaluation

Our intention over the coming months, at the request of the Welsh Government, is to tag these sentences in a way that will allow the evaluation and comparison of different Welsh POS taggers. The true limit on the size of the full Benchmark Corpus will be the number of words from the current corpus that can be correctly tagged during this time. As the different POS taggers use different tagsets, we will develop a general, intermediate tagset to facilitate that comparison, as well as a framework to enable comparison and evaluation.
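As a rough illustration of the comparison described above, the following minimal Python sketch maps the output of two taggers onto a shared intermediate tagset and scores each against gold tags. Everything in it is an assumption made for illustration: the tag names, the two mapping tables, the `to_common` and `accuracy` helpers and the sample data are hypothetical and are not taken from this corpus or from any real tagger.

```python
from typing import Dict, List

# Hypothetical mappings from each tagger's native tags to a shared
# intermediate tagset. These tag names are illustrative assumptions only.
TAGGER_A_TO_COMMON: Dict[str, str] = {
    "NN": "NOUN", "VB": "VERB", "JJ": "ADJ", "PRP": "PRON",
}
TAGGER_B_TO_COMMON: Dict[str, str] = {
    "E": "NOUN", "B": "VERB", "Ans": "ADJ", "Rha": "PRON",
}


def to_common(tags: List[str], mapping: Dict[str, str], unknown: str = "X") -> List[str]:
    """Translate native tags to the intermediate tagset; unmapped tags become `unknown`."""
    return [mapping.get(tag, unknown) for tag in tags]


def accuracy(predicted: List[str], gold: List[str]) -> float:
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same number of tokens")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical gold tags for a four-token sentence, in the intermediate tagset.
gold = ["PRON", "VERB", "NOUN", "ADJ"]

# Hypothetical native output from two taggers for the same four tokens.
tagger_a_output = ["PRP", "VB", "NN", "NN"]
tagger_b_output = ["Rha", "B", "E", "Ans"]

score_a = accuracy(to_common(tagger_a_output, TAGGER_A_TO_COMMON), gold)
score_b = accuracy(to_common(tagger_b_output, TAGGER_B_TO_COMMON), gold)
print(f"Tagger A: {score_a:.2f}, Tagger B: {score_b:.2f}")
```

Scoring in the intermediate tagset rather than in either tagger's native tagset means a tagger is not penalized merely for using different labels, though distinctions that the intermediate tagset does not encode are necessarily ignored.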

Licence

Unlike our training data, which is licensed under the CC0 licence, this corpus is licensed under the more restrictive CC-BY-SA licence. This is because using CC-BY-SA allows us to collect examples from important sources such as Wikipedia, materials from Coleg Cymraeg Cenedlaethol Cymru, and the Siarad and CorCenCC corpora. This benchmark corpus can be freely distributed as long as the attribution and sharealike requirements of the CC-BY-SA licence are respected.

Acknowledging our work

If you use this resource, we kindly ask you to acknowledge and reference our work. Doing so helps us secure future funding to create more useful resources to share.

Acknowledgements

Texts from the following sources were used in this resource:

Corpws Siarad 2014, Corpws:Siarad, Deuchar, M., Davies, P. & Donnelly, K., accessed on 03/12/2020 https://bangortalk.org.uk/speakers.php?c=siarad

Ellis, N. C., O'Dochartaigh, C., Hicks, W., Morgan, M., & Laporte, N. (2001). Cronfa Electroneg o Gymraeg (CEG): A 1 million word lexical database and frequency count for Welsh. [On-line]

Gwales.com 2020, Llyfrau, gwales.com, accessed on 03/12/2020 https://www.gwales.com/books/?tsid=15

Hwb Cymru 2020, Dysgu ac addysgu i Gymru, Hwb Cymru, accessed on 03/12/2020 https://hwb.gov.wales/

James, E. W. 2018, Williams, William (Pantycelyn), James, E. W, accessed on 03/12/2020 https://orca.cf.ac.uk/128971/1/Williams%2C%20William%20%28Pantycelyn%29.pdf

Knight D, Morris S, Fitzpatrick T, et al. (2020). CorCenCC: Corpws Cenedlaethol Cymraeg Cyfoes – the National Corpus of Contemporary Welsh (Version 1.0.0). Cardiff University. https://doi.org/10.17035/d.2020.0119878310

Meddwl.org 2020, hafan, meddwl.org, accessed on 03/12/2020 https://meddwl.org/

Porth Coleg Cymraeg Cenedlaethol 2020, Hafan, Coleg Cymraeg Cenedlaethol, accessed on 03/12/2020 https://wici.porth.ac.uk/index.php/Hafan

Raspberrypi.org 2020, Hello, Raspberrypi.org, accessed on 03/12/2020 https://projects.raspberrypi.org/

Wici Pobol y Cwm 2020, Home, Wici Pobol y Cwm, accessed on 03/12/2020 https://pobol-y-cwm.fandom.com/cy/wiki/Main_Page

Wiki Y Cyfryngau Cymraeg 2020, Home, Wiki Y Cyfryngau Cymraeg, accessed on 03/12/2020 https://y-cyfryngau-cymraeg.fandom.com/cy/wiki/Main_Page

Wicipedia 2020, Croeso i Wicipedia, Sefydliad Wikimedia, accessed on 03/12/2020 https://cy.wikipedia.org/wiki/Hafan

Ymddiriolaeth Adeiladu Cymru 2019, Rhestr gyfeirio cynllunio digwyddiadau, Ymddiriolaeth Adeiladu Cymru, accessed on 03/12/2020 https://www.yac.cymru/uploads/resources/2019-03-13-24-2-bct-event-planning-checklist-c.pdf
