ANETAC: Arabic Named Entity Transliteration and Classification Dataset

Description

ANETAC is an English-Arabic named entity transliteration and classification dataset (https://arxiv.org/abs/1907.03110) built from freely available parallel translation corpora. The dataset contains 79,924 English-Arabic named entities, each labeled with one of three classes: Person, Location, or Organization.

Example instances from the dataset are shown in the table below:

Usage

The first results obtained with this EN-AR transliteration data (the data in the EN-AR Translit folder) have already been published by Hadj Ameur et al. in "Arabic Machine Transliteration using an Attention-based Encoder-decoder Model".
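
This README does not describe the on-disk layout of the files, so the following is only a minimal sketch of how the entity/class records could be read and tallied. It assumes a hypothetical format of one record per line, with tab-separated fields in the order English, Arabic, class; the `Entity` type, the function names, and the `PER`/`LOC`/`ORG` tags are illustrative assumptions, not part of the dataset. Check the actual files in the EN-AR NE folder and adjust the field order and separator accordingly.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Entity(NamedTuple):
    """One named entity record (field layout is an assumption)."""
    english: str
    arabic: str
    label: str  # assumed class tag, e.g. PER, LOC, or ORG


def parse_entities(lines: Iterable[str]) -> List[Entity]:
    """Parse lines in the ASSUMED format: english<TAB>arabic<TAB>label.

    Blank lines and lines without exactly three fields are skipped.
    """
    entities = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        parts = [p.strip() for p in line.split("\t")]
        if len(parts) != 3:
            continue
        entities.append(Entity(*parts))
    return entities


def class_counts(entities: Iterable[Entity]) -> Counter:
    """Count how many entities fall into each class."""
    return Counter(e.label for e in entities)


if __name__ == "__main__":
    # Illustrative lines only; these are not taken from the dataset.
    sample = ["London\tلندن\tLOC", "Ahmed\tأحمد\tPER"]
    for label, count in class_counts(parse_entities(sample)).items():
        print(label, count)
```

To read a real file, pass an open file handle instead of the sample list, for example `parse_entities(open(path, encoding="utf-8"))`, where `path` is whichever file you are loading. Opening with an explicit UTF-8 encoding matters here because the Arabic side would otherwise depend on the platform default.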

Citations

If you use the ANETAC dataset, please cite the following arXiv paper:

@article{ameur2019anetac,
  title={ANETAC: Arabic Named Entity Transliteration and Classification Dataset},
  author={Ameur, Mohamed Seghir Hadj and Meziane, Farid and Guessoum, Ahmed},
  journal={arXiv preprint arXiv:1907.03110},
  year={2019}
}

Baseline Results

The baseline results obtained with ANETAC are reported in the following publication (you are welcome to compare your own results with our baseline transliteration models):

@article{HADJAMEUR2017287,
title = "Arabic Machine Transliteration using an Attention-based Encoder-decoder Model",
journal = "Procedia Computer Science",
volume = "117",
pages = "287 - 297",
year = "2017",
note = "Arabic Computational Linguistics",
issn = "1877-0509",
doi = "https://doi.org/10.1016/j.procs.2017.10.120",
url = "https://www.sciencedirect.com/science/article/pii/S1877050917321774",
author = "Mohamed Seghir Hadj Ameur and Farid Meziane and Ahmed Guessoum",
keywords = "Natural Language Processing, Arabic Language, Arabic Transliteration, Deep Learning, Sequence-to-sequence Models, Encoder-decoder Architecture, Recurrent Neural Networks",
abstract = "Transliteration is the process of converting words from a given source language alphabet to a target language alphabet, in a way that best preserves the phonetic and orthographic aspects of the transliterated words. Even though an important effort has been made towards improving this process for many languages such as English, French and Chinese, little research work has been accomplished with regard to the Arabic language. In this work, an attention-based encoder-decoder system is proposed for the task of Machine Transliteration between the Arabic and English languages. Our experiments proved the efficiency of our proposal approach in comparison to some previous research developed in this area."
}

Contacts:

For all questions please contact [email protected]

