This repository is dedicated to the development of code-mixed language resources.

l3cube-pune/code-mixed-nlp

L3Cube-HingCorpus

L3Cube-HingCorpus is the first large-scale real Hindi-English code-mixed dataset in Roman script. It consists of 52.93M sentences and 1.04B tokens scraped from Twitter. We also present HingBERT, HingMBERT, HingRoBERTa, and HingGPT. The BERT models have been pre-trained on the code-mixed HingCorpus. The evaluation details are given in our paper (arXiv:2204.08398).

The full HingCorpus (roman) is shared here.

Hing BERT models and Hing Fast Text model

| Model | Description | Link |
| --- | --- | --- |
| HingBERT | Base-BERT | roman |
| HingRoBERTa | RoBERTa | roman, roman + devanagari |
| HingMBERT | mBERT | roman, roman + devanagari |
| HingGPT | GPT2 | roman, devanagari |
| HingFT | Fast Text | link |
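
As a rough illustration of how the BERT-style models listed above might be used, the sketch below loads one of them with the Hugging Face `transformers` library and asks it to fill a masked token. The Hub identifiers in `ASSUMED_MODEL_IDS`, the helper names, and the example sentence are assumptions made for this sketch and are not taken from this README; check the model links above for the actual locations.

```python
# Minimal sketch, assuming the models are hosted on the Hugging Face Hub.
# The identifiers below are ASSUMED for illustration, not stated in this README.
ASSUMED_MODEL_IDS = {
    "HingBERT": "l3cube-pune/hing-bert",
    "HingRoBERTa": "l3cube-pune/hing-roberta",
    "HingMBERT": "l3cube-pune/hing-mbert",
}


def load_masked_lm(name):
    """Return (tokenizer, model) for one of the assumed identifiers above."""
    # Imported lazily so the mapping can be inspected without transformers installed.
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    model_id = ASSUMED_MODEL_IDS[name]
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMaskedLM.from_pretrained(model_id)
    return tokenizer, model


def top_predictions(name, template, k=5):
    """Fill the '{mask}' slot in a Roman-script sentence; return top-k (token, score) pairs."""
    from transformers import pipeline

    tokenizer, model = load_masked_lm(name)
    fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
    # Use the tokenizer's own mask token, since BERT and RoBERTa spell it differently.
    text = template.format(mask=tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill(text, top_k=k)]
```

For example, `top_predictions("HingBERT", "mujhe yeh movie bahut {mask} lagi")` would download the weights on first use and return candidate tokens with their scores. Taking the mask token from the tokenizer keeps the same helper usable across the BERT and RoBERTa variants.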

L3Cube-HingLID

L3Cube-HingLID is a Hindi-English code-mixed language identification dataset. It consists of 31,756 train, 6,420 test, and 6,279 validation samples. The dataset is shared in the folder L3Cube-HingLID/. The HingBERT-LID model is shared here.
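
The split sizes quoted above can be turned into a total and per-split proportions with a few lines of Python. This is only a sanity-check sketch: the counts come from this README, while the names `SPLIT_SIZES` and `split_fractions` are made up for the example.

```python
# Split sizes as stated in this README.
SPLIT_SIZES = {"train": 31756, "test": 6420, "validation": 6279}


def split_fractions(sizes):
    """Return each split's share of the total number of samples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total = sum(SPLIT_SIZES.values())
fractions = split_fractions(SPLIT_SIZES)

print(f"total samples: {total}")  # 44455
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%}")  # train: 71.4%, test: 14.4%, validation: 14.1%
```

The dataset therefore follows roughly a 71/14/14 train/test/validation split.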

L3Cube-MeCorpus

L3Cube-MeCorpus is a first-of-its-kind large code-mixed Marathi-English (Mr-En) corpus with 10 million social media sentences, released with the accompanying paper. The dataset details and code-mixed MeBERT models are shared in the MarathiNLP repo.

L3Cube-MeSent, MeHate, and MeLID

MeSent, MeHate, and MeLID are the first code-mixed Marathi-English Sentiment Analysis, Hate Speech Identification, and Language Identification datasets, respectively, released with the accompanying paper. The datasets are shared here.

License

L3Cube-HingCorpus and L3Cube-HingLID are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Citing

@article{nayak2022l3cube,
  title={L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models},
  author={Nayak, Ravindra and Joshi, Raviraj},
  journal={arXiv preprint arXiv:2204.08398},
  year={2022}
}

This project is co-ordinated and mentored by Raviraj Joshi under L3Cube Pune. For any queries, contact [email protected].
