
Arabic Sequence Labeling: Part Of Speech Tagging NLP task

This project implements Arabic part-of-speech tagging as part of the "Natural Language Processing" course of my master's degree. The project uses the Arabic PUD dataset from Universal Dependencies and implements:

  1. A deep learning model (BiLSTM) for sequential labeling classification
  2. A pre-deep-learning model (KNN) for multi-class classification

Table of contents

  • Arabic PUD Dataset
  • Arabic Word Embedding
  • Structure of the BiLSTM sequential labeling classification model
  • Results
  • Requirements
  • References and Resources

Arabic PUD Dataset

During preprocessing, the following steps are applied (a minimal sketch follows the list):

  1. Remove tanween and tashkeel (Arabic diacritics)
  2. Remove sentences that contain non-Arabic words (i.e., English characters)
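
A minimal sketch of these two steps, assuming the conllu and re packages listed under Requirements; the file name and exact diacritic ranges are illustrative, not taken from the repository code:

```python
import re
from conllu import parse

# Illustrative file name for the Arabic PUD treebank from Universal Dependencies.
with open("ar_pud-ud-test.conllu", encoding="utf-8") as f:
    sentences = parse(f.read())

# Tashkeel (short-vowel marks, shadda, sukun) and tanween code points.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

def strip_diacritics(word):
    return DIACRITICS.sub("", word)

def has_non_arabic(sentence):
    # Flag sentences containing Latin characters (e.g. English words).
    return any(re.search(r"[A-Za-z]", token["form"]) for token in sentence)

cleaned = [
    [(strip_diacritics(tok["form"]), tok["upos"]) for tok in sent]
    for sent in sentences
    if not has_non_arabic(sent)
]
```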
The distribution of tags within the dataset is then visualized as a bar chart: the majority of words in the dataset (5553 words) are associated with the NOUN tag, while the least common tag in the dataset is X. Each tag symbolizes a part of speech; refer to the image below for a description of each tag.
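
A rough sketch of how such a bar chart can be produced from the cleaned sentences of the previous snippet (purely illustrative; the repository may plot it differently):

```python
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns

# Count UPOS tags over all (word, tag) pairs in the cleaned sentences.
tag_counts = Counter(tag for sent in cleaned for _, tag in sent)
tags, counts = zip(*tag_counts.most_common())

sns.barplot(x=list(tags), y=list(counts))
plt.xticks(rotation=45)
plt.xlabel("UPOS tag")
plt.ylabel("word count")
plt.tight_layout()
plt.show()
```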

Arabic Word Embedding

Word embedding provides a dense representation of words and their relative meanings.
The word embedding technique used in this project is the N-gram Word2Vec skip-gram model from the AraVec project, trained on Twitter data with a vector size of 300.
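
A sketch of loading the pre-trained AraVec vectors with gensim and building an embedding matrix for the Keras Embedding layer; the model file name is an assumption (AraVec ships several Twitter skip-gram releases), and word_index is the vocabulary produced by the Keras Tokenizer:

```python
import numpy as np
from gensim.models import Word2Vec

EMBED_DIM = 300

# Illustrative AraVec file name; use whichever Twitter skip-gram model was downloaded.
w2v = Word2Vec.load("full_grams_sg_300_twitter.mdl")

def build_embedding_matrix(word_index):
    # Row 0 is reserved for padding; out-of-vocabulary words keep an all-zero vector.
    matrix = np.zeros((len(word_index) + 1, EMBED_DIM))
    for word, idx in word_index.items():
        if word in w2v.wv:
            matrix[idx] = w2v.wv[word]
    return matrix
```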

Structure of the BiLSTM sequential labeling classification model
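
A minimal sketch of the layer stack, assuming a fixed padded length MAX_LEN, the embedding matrix from the previous section, and NUM_TAGS distinct tags; the LSTM size and other hyperparameters are illustrative, not the repository's exact values:

```python
from keras.models import Sequential
from keras.layers import InputLayer, Embedding, Bidirectional, LSTM, TimeDistributed, Dense

MAX_LEN = 100    # padded sentence length (illustrative)
NUM_TAGS = 16    # number of distinct UPOS tags found in the data (illustrative)

model = Sequential([
    InputLayer(input_shape=(MAX_LEN,)),
    # Initialised with the frozen AraVec embedding matrix built above.
    Embedding(embedding_matrix.shape[0], 300,
              weights=[embedding_matrix], trainable=False),
    Bidirectional(LSTM(128, return_sequences=True)),
    # A softmax over the tag set at every time step.
    TimeDistributed(Dense(NUM_TAGS, activation="softmax")),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```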

Results

The dataset is split into 70% for training and 30% for testing.
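
With scikit-learn's train_test_split (listed under Requirements) this corresponds to something like the following, where X and y are the padded, label-encoded sequences; the random_state is illustrative:

```python
from sklearn.model_selection import train_test_split

# 70% training / 30% testing, as stated above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```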

BiLSTM sequential labeling classification model
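
A sketch of how the BiLSTM from the previous section can be trained on the 70% split and scored on the 30% split (epoch count and batch size are illustrative):

```python
# Train on the training split and report word-level accuracy on the test split.
model.fit(X_train, y_train, batch_size=32, epochs=10, validation_split=0.1)
loss, acc = model.evaluate(X_test, y_test)
print(f"test accuracy: {acc:.3f}")
```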

KNN multi-class classification model
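
The KNN baseline treats each word independently as one multi-class sample. A rough sketch of training and evaluating it, where each word is represented by its AraVec vector (k=5 is illustrative, and train_words/train_tags and test_words/test_tags are flat word/tag lists obtained by unpacking the (word, tag) pairs of each split):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

def vectorize(words):
    # One 300-dimensional AraVec vector per word; zeros for OOV words.
    return np.array([w2v.wv[w] if w in w2v.wv else np.zeros(300) for w in words])

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(vectorize(train_words), train_tags)

pred = knn.predict(vectorize(test_words))
print(classification_report(test_tags, pred))
```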


Requirements

Preprocessing and visualization

  • conllu
  • matplotlib.pyplot
  • pandas
  • re
  • seaborn
  • numpy
  • tensorflow (Tokenizer, pad_sequences)
  • sklearn (preprocessing.LabelEncoder, model_selection.train_test_split)
Word Embedding
  • gensim
Classification model
  • tensorflow
  • keras.models.Sequential
  • keras.layers (Dense, Embedding, Bidirectional, LSTM, TimeDistributed, InputLayer)
  • sklearn.neighbors.KNeighborsClassifier
Model Evaluation
  • sklearn.metrics

References and Resources

  • Reading and parsing the dataset: link
  • Processing input data: link
  • AraVec for the word embedding model: link
  • Keras Embedding layer: link1, link2, link3