Persian Named Entity Classification from Arman Rayan Sharif corpus with deep learning

AminTaheri23/Deep-Persian-NER

Deep Persian NER

Persian Named Entity Classification from Arman Rayan Sharif corpus with deep learning

About

In this project, I built and trained a model that recognizes named entities in a sentence by assigning a tag to each word, a classic Natural Language Processing task. The main file is here

To replicate the results, download the PersianNER dataset (linked below under ArmanPersoNERCorpus) and update the dataset path in the ipynb file. You also need a pretrained fastText model. My fastText model is here. You can add it to your drive and point the notebook to the correct path. Running this ipynb file in Google Colab is recommended.

Named Entity Recognition (NER)

Named Entity Recognition is a process in which an algorithm takes a string of text (a sentence or paragraph) as input and identifies the relevant nouns (people, places, organizations, and so on) mentioned in that string. Here is an example:

John  went to New   York  to interview with Microsoft 
B-PER O    O  B-LOC I-LOC O  O         O    B-ORG
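
To make the tagging above concrete, here is a minimal Python sketch that groups per-token tags back into whole entities. The function name `iob_to_spans` is hypothetical and not part of this repository. Dropping an `I-` tag that does not continue an open entity is an assumption made here for simplicity.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; close any open one first.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            # (assumption: such stray I- tags are simply dropped).
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = "John went to New York to interview with Microsoft".split()
tags = "B-PER O O B-LOC I-LOC O O O B-ORG".split()
spans = iob_to_spans(tokens, tags)
print(spans)
```

Here "New" and "York" are merged into a single LOC entity because the second token carries an `I-` tag of the same type.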

ArmanPersoNERCorpus

https://github.com/HaniehP/PersianNER

This dataset includes 250,015 tokens and 7,682 Persian sentences in total. Each file contains one token per line, along with its manually annotated named-entity tag. Sentences are separated by a blank line. The NER tags are in IOB format.

The IOB format (short for inside, outside, beginning) is a common format for tagging tokens in chunking tasks in computational linguistics (e.g. named entity recognition).

An example with IOB format:

John B-PER
lives O
in O
New B-LOC
York I-LOC
. O
 
This O
is O
another O
sentence O
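
A file in this layout can be read with a few lines of Python. The sketch below is not the code from the notebook. The function name `read_iob` is hypothetical, and it assumes each non-blank line holds exactly one token and one tag separated by whitespace.

```python
def read_iob(text):
    """Split IOB-formatted text into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last run of whitespace so the tag is the final field.
        # Assumption: every non-blank line has both a token and a tag;
        # a line with a single field would raise ValueError here.
        token, tag = line.rsplit(maxsplit=1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


sample = """John B-PER
lives O
in O
New B-LOC
York I-LOC
. O

This O
is O
another O
sentence O
"""
sentences = read_iob(sample)
print(len(sentences), sum(len(s) for s in sentences))
```

For the real corpus, the same function could be applied to the contents of a file opened with `encoding="utf-8"`, since the tokens are Persian text.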

Dataset Structure

In ArmanPersoNERCorpus, named entities (NEs) are categorized into six classes, and all remaining tokens are tagged as other:

  1. person: B-pers, I-pers
  2. organization: B-org, I-org (such as banks, ministries, embassies, teams, nationalities, networks and publishers)
  3. location: B-loc, I-loc (such as cities, villages, rivers, seas, gulfs, deserts and mountains)
  4. facility: B-fac, I-fac (such as schools, universities, research centers, airports, railways, bridges, roads, harbors, stations, hospitals, parks, zoos and cinemas)
  5. product: B-pro, I-pro (such as books, newspapers, TV shows, movies, airplanes, ships, cars, theories, laws, agreements and religions)
  6. event: B-event, I-event (such as wars, earthquakes, national holidays, festivals and conferences)
  7. other: O (the remaining tokens)
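
Before training, the tags listed above usually have to be mapped to integer ids. The sketch below shows one way to build that mapping. The names `label2id`, `id2label`, `encode_tags` and `decode_ids` are hypothetical, and the id order is an arbitrary assumption, not necessarily the encoding used in the notebook.

```python
# The six entity classes of ArmanPersoNERCorpus, as listed above.
ENTITY_TYPES = ["pers", "org", "loc", "fac", "pro", "event"]

# "O" plus a B- and an I- tag per class: 13 labels in total.
# Assumption: "O" gets id 0; the rest follow the order of ENTITY_TYPES.
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]

label2id = {label: idx for idx, label in enumerate(LABELS)}
id2label = {idx: label for label, idx in label2id.items()}


def encode_tags(tags):
    """Map a sequence of tag strings to integer ids."""
    return [label2id[tag] for tag in tags]


def decode_ids(ids):
    """Map a sequence of integer ids back to tag strings."""
    return [id2label[i] for i in ids]


example_tags = ["B-pers", "I-pers", "O", "B-loc"]
example_ids = encode_tags(example_tags)
print(len(LABELS), example_ids)
```

Decoding the ids with `decode_ids` returns the original tags, which is useful when turning model predictions back into IOB labels.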

Misc.

This was the final project for FanAsa Academy's DeepNLP course, which was held in the summer of 1398 (2019).

Instructors: Reza Vasefi - Fatemeh Mashhadi
