(alsii) Automated Language detection in Social Interactions on the Internet

hi-im-buggy/alsii

Language and Society Project

Group:

  • Pratyaksh Gautam (2020114002)
  • Nukit Tailor (2020114012)

The original code is in the code_release/ directory.

Data

The Facebook, Twitter and WhatsApp data were all downloaded from: https://amitavadas.com/Code-Mixing.html

Resources

  1. The English word list "resources/EN.words.txt" was downloaded from: https://wordlist.aspell.net/
  2. The Hindi transliteration word list "resources/HI.trans.fire2013.txt" was downloaded from: https://web.archive.org/web/20160312153954/https://cse.iitkgp.ac.in/resgrp/cnerg/qa/fire13translit/
  3. The Hindi word list was compiled by Gupta et al. (2012): https://www.lrec-conf.org/proceedings/lrec2012/pdf/365_Paper.pdf

Running the code

The main annotation script is "process.py". It should be run as follows:

    python3 process.py <src_file> [-top_n int] -out <out_file>

where <src_file> is the input text file in CoNLL format (one token per line) and <out_file> is the name of the output file that will be generated. The optional -top_n flag controls how much of the manually created word list is used to classify tokens. By default, the whole word list is used.
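To illustrate what truncating a word list with a top_n value can do to token labels, here is a minimal sketch. It is not the logic of process.py: the helper names load_word_list and classify, the assumption that the list is ordered so that keeping the first top_n entries is meaningful, the lookup order, and the fallback label are all hypothetical. Only the label names (en, hi, univ) come from the results in this README.

```python
def load_word_list(lines, top_n=None):
    """Return the first top_n non-empty entries as a set (all entries if top_n is None)."""
    words = [line.strip().lower() for line in lines if line.strip()]
    if top_n is not None:
        words = words[:top_n]
    return set(words)


def classify(token, en_words, hi_words, default="en"):
    """Label a token by plain list lookup (hypothetical rule, for illustration only)."""
    key = token.lower()
    if not any(ch.isalpha() for ch in key):
        return "univ"      # punctuation, digits, and similar tokens
    if key in hi_words:
        return "hi"
    if key in en_words:
        return "en"
    return default         # arbitrary fallback chosen for this sketch


# Toy data standing in for the real resource files.
hi_lines = ["hai", "nahi", "kya"]
en_words = {"the", "is", "movie"}
tokens = ["kya", "movie", "hai", "!"]

labels_full = [classify(t, en_words, load_word_list(hi_lines)) for t in tokens]
labels_top1 = [classify(t, en_words, load_word_list(hi_lines, top_n=1)) for t in tokens]
print(labels_full)
print(labels_top1)
```

With the full toy list, "kya" is found and labelled hi. With top_n=1 only "hai" is kept, so "kya" falls through to the fallback label.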

To check the scores, run the following command:

    python3 scorer.py -hyp <out_file> -ref <ref_file> [-v]

where <out_file> is the output file generated by process.py and <ref_file> is the reference file. The -v flag is optional and will print the scores.
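As a rough picture of what comparing a hypothesis file against a reference file involves, the sketch below pairs up the labels token by token and counts each (predicted, reference) combination. It is not scorer.py. The helper names are hypothetical, and it assumes each non-empty line ends with the label as its last whitespace-separated field, which may not match the actual file layout.

```python
from collections import Counter


def read_labels(lines):
    """Assumed layout: the label is the last whitespace-separated field of each line."""
    return [line.split()[-1] for line in lines if line.strip()]


def confusion_counts(hyp_labels, ref_labels):
    """Count (predicted, reference) label pairs over aligned tokens."""
    if len(hyp_labels) != len(ref_labels):
        raise ValueError("hypothesis and reference must contain the same number of tokens")
    return Counter(zip(hyp_labels, ref_labels))


# Toy lines standing in for <out_file> and <ref_file>.
hyp_lines = ["kya\thi", "movie\ten", "hai\ten", "!\tuniv"]
ref_lines = ["kya\thi", "movie\ten", "hai\thi", "!\tuniv"]

counts = confusion_counts(read_labels(hyp_lines), read_labels(ref_lines))
for (hyp, ref), n in sorted(counts.items()):
    print(f"predicted={hyp} reference={ref} count={n}")
```

The length check matters because a token-level comparison is only meaningful when both files tokenize the text identically.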

Results

With our modifications to the source, F1 scores for en and hi improved slightly over the original code on all three datasets. The univ F1 improved on Facebook and dropped marginally on WhatsApp and Twitter:

--------
WHATSAPP
--------
Confusion matrix (rows: predicted label, columns: reference label)

        en      hi      univ
en      294     420     30
hi      32      1988    37
univ    37      131     249

        Old-scores                              New-scores

CLASS   P       R       F1              CLASS   P       R       F1
en      39.516  80.992  53.117          en      39.783  80.992  53.358
hi      96.646  78.299  86.51           hi      96.65   78.417  86.584
univ    59.712  78.797  67.94           univ    59.427  78.797  67.75                       

--------
FACEBOOK
--------
Confusion matrix (rows: predicted label, columns: reference label)

        en      hi      univ
en      12997   397     530
hi      127     2446    173
univ    90      14      3841

        Old-scores                              New-scores

CLASS   P       R       F1              CLASS   P       R       F1
en      93.335  98.35   95.777          en      93.342  98.358  95.785
hi      89.043  85.614  87.295          hi      89.075  85.614  87.31
univ    97.363  84.507  90.481          univ    97.364  84.529  90.494   

--------
TWITTER
--------
Confusion matrix (rows: predicted label, columns: reference label)

        en      hi      univ
en      3038    1047    227
hi      575     8034    243
univ    119     698     3330

        Old-scores                              New-scores

CLASS   P       R       F1              CLASS   P       R       F1
en      70.255  81.324  75.385          en      70.455  81.404  75.535
hi      90.721  82.084  86.187          hi      90.759  82.156  86.243
univ    80.352  87.605  83.822          univ    80.299  87.632  83.805 
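
The P, R and F1 columns can be recomputed from a confusion matrix. The sketch below uses the standard definitions and reads rows as predicted labels and columns as reference labels, which is the orientation the reported numbers are consistent with. The function name prf_from_matrix is hypothetical and this is not the code of scorer.py. Applied to the FACEBOOK matrix above, it yields values that agree with the New-scores column to within rounding.

```python
def prf_from_matrix(matrix, labels):
    """matrix[i][j] = tokens predicted as labels[i] whose reference label is labels[j]."""
    scores = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        predicted = sum(matrix[i])                # row sum: everything predicted as this label
        actual = sum(row[i] for row in matrix)    # column sum: everything truly this label
        p = 100.0 * tp / predicted if predicted else 0.0
        r = 100.0 * tp / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        scores[label] = (round(p, 3), round(r, 3), round(f1, 3))
    return scores


labels = ["en", "hi", "univ"]
facebook = [
    [12997, 397, 530],
    [127, 2446, 173],
    [90, 14, 3841],
]

scores = prf_from_matrix(facebook, labels)
for label in labels:
    p, r, f1 = scores[label]
    print(f"{label}\t{p}\t{r}\t{f1}")
```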
