indic language identification

Language identification tools are poor at supporting Indian languages, and worse at supporting Romanized variants.

This library addresses this particular use-case. The model is trained using weak labels obtained from an unsupervised method [Palakodety, KhudaBukhsh & Carbonell, 2020].

The model is a fasttext model.

Install

First install fasttext: pip install fasttext.
Download the model: link

Predict

You can now use the model like you would any fasttext model:

/path/to/fasttext predict fasttext_indic_merged_supervised_model_final.ftz -

(base) ➜  indic_language_identification git:(master) ✗ ~/fastText/fasttext predict /path/to/fasttext_indic_merged_supervised_model_final.ftz -
yeh kaun si bhasha hai
__label__hi_romanized

Cite

If you use this work, do cite us:

@inproceedings{palakodety-khudabukhsh-2020-annotation,
    title = "Annotation Efficient Language Identification from Weak Labels",
    author = "Palakodety, Shriphani  and
      KhudaBukhsh, Ashiqur",
    booktitle = "Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.wnut-1.24",
    pages = "181--192",
}

(c) Onai 2020

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
README.md		README.md
clean_text.py		clean_text.py
indic_test_set.txt		indic_test_set.txt
quantize.py		quantize.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

indic language identification

Install

Predict

Cite

About

Releases

Packages

Languages

onai/indic-language-identification

Folders and files

Latest commit

History

Repository files navigation

indic language identification

Install

Predict

Cite

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages