Annotated Country-Level Dialectal Arabic Corpus: An Unsupervised Approach

Maha-J-Althobaiti/Twt15DA

Twt15DA

The annotated dialectal Arabic corpus (Twt15DA) was collected from Twitter and consists of 311,785 tweets containing 3,858,459 words in total. The unsupervised approach used to build the corpus is an iterative procedure with three main steps: automatic creation of dialectal word lists, selection of seed words, and collection of dialectal sentences. The Pointwise Mutual Information (PMI) association measure, together with the geographical frequency of word occurrence online, was used to classify dialectal words. The poor performance of a Modern Standard Arabic (MSA) POS tagger on dialectal Arabic content was exploited to extract the dialectal words.
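As a rough illustration of the PMI association measure mentioned above, the sketch below scores (word, country) pairs with the standard definition PMI(w, c) = log2(P(w, c) / (P(w) P(c))). It is not the procedure from the paper: the function name `pmi_scores`, the toy tokens, and the country codes are all hypothetical, and the geographical-frequency component is not modelled.

```python
import math
from collections import Counter


def pmi_scores(labelled_tokens):
    """Return PMI for every observed (word, country) pair.

    labelled_tokens: list of (word, country) tuples, one per token occurrence.
    This input format is an assumption made for the example.
    """
    pair_counts = Counter(labelled_tokens)
    word_counts = Counter(word for word, _ in labelled_tokens)
    country_counts = Counter(country for _, country in labelled_tokens)
    total = len(labelled_tokens)

    scores = {}
    for (word, country), joint in pair_counts.items():
        # P(w, c) / (P(w) * P(c)) simplifies to joint * total / (count_w * count_c)
        ratio = joint * total / (word_counts[word] * country_counts[country])
        scores[(word, country)] = math.log2(ratio)
    return scores


# Toy data with placeholder words and country codes (hypothetical).
toy_tokens = [("w1", "A"), ("w1", "A"), ("w2", "A"), ("w2", "B")]
toy_scores = pmi_scores(toy_tokens)
```

A positive score means the word occurs with that country more often than chance would predict; a negative score means less often. In the toy data, `w2` is the only word seen with country `B`, so that pair receives the highest score.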

The Twt15DA corpus is distributed in a manner similar to the TREC Microblog Track (McCreadie et al., 2012): only User ID and Tweet ID pairs are released, along with their annotations. The User ID and Tweet ID can be used to retrieve the corresponding tweets from Twitter.
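The sketch below shows one way the released ID pairs might be loaded before retrieving the tweets. The file layout is an assumption, not something stated here: it supposes one tab-separated record per line holding a User ID, a Tweet ID, and an annotation label, and the sample IDs and labels are made up. The helper name `read_id_pairs` is hypothetical. The URL pattern `https://twitter.com/i/web/status/<tweet id>` is used only to point at a tweet; actual retrieval would still need a crawler or API client.

```python
import csv
import io


def read_id_pairs(handle):
    """Parse tab-separated (user_id, tweet_id, label) rows from a text handle.

    The three-column layout is an assumption made for this example.
    """
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        user_id, tweet_id, label = row[0], row[1], row[2]
        records.append(
            {
                "user_id": user_id,
                "tweet_id": tweet_id,
                "label": label,
                # Link that resolves a tweet from its ID alone
                "url": f"https://twitter.com/i/web/status/{tweet_id}",
            }
        )
    return records


# Made-up sample rows standing in for the released file.
sample = io.StringIO("111\t222\tSA\n333\t444\tEG\n")
records = read_id_pairs(sample)
```

IDs are kept as strings so that large tweet IDs are never truncated or reformatted on the way to a crawler.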

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).

https://creativecommons.org/licenses/by-nc-nd/4.0/

You are free to:
	Share — copy and redistribute the material in any medium or format 

Under the following terms:
	Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
	NonCommercial — You may not use the material for commercial purposes.
	NoDerivatives — If you remix, transform, or build upon the material, you may not distribute the modified material. 

Please cite our paper in any published work using this resource:

@article{althobaiti2021creation,
  title={Creation of annotated country-level dialectal Arabic resources: An unsupervised approach},
  author={Althobaiti, Maha J},
  journal={Natural Language Engineering},
  pages={1--42},
  year={2021},
  publisher={Cambridge University Press}
}
