Skip to content

The dataset was first scraped from an Arabic book library website, then cleansed of non-Arabic words, numerals, and tags before going through multiple preprocessing steps. Second, after applying the TF-IDF vectorizer, we used the Random Forest Classifier (RFC) algorithm to be trained on the data. Finally, two Question-Answer datasets were create…

Notifications You must be signed in to change notification settings

aehabV/Arabic-Book-Recommendation-and-Question-Answering-System

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

24 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Arabic NLP

In this project we used data scrapped from https://www.arab-books.com/

First, We scrapped books data from the website like Book Name, Author, Category, Pages’ number, Dar El-Nashr, Book Size, and Description.

Second, We cleaned the data from English words, numbers, and tags.

Third, We used an autocorrect Library "Hunspell-ar" to correct mismatcched words.

Fourth, we stemmed and lemmatized the data using Farassa, We tokenized, removed the stop words, lemmatized and stemmed our data.

Note: We have tried different lemmatizers and stemmers (e.g. Qalsadi, Snowball stemmer, ISRI stemmer and Madamira) but we found that Farassa has outperformed all of them on Arabic words.

Fifth, we train the model using RandomForestClassifier (RFC) on the features(description column) and labels(category column) after applying TF-IDF vectorizer then evaluated and tested the model on 30% of the dataset using new unseen data and achieved an accuracy of 73%.

Sixth, we created two Question-Answer Datasets, then we matched the input query with the questions in the two Datasets and retrieved the most relevant answer.

Seventh, we weighted all the words of All Descriptions of the books, we weighted the important keywords by calculating Tf-Idf for each one. Note: We removed all stop words from the Descriptions.

About

The dataset was first scraped from an Arabic book library website, then cleansed of non-Arabic words, numerals, and tags before going through multiple preprocessing steps. Second, after applying the TF-IDF vectorizer, we used the Random Forest Classifier (RFC) algorithm to be trained on the data. Finally, two Question-Answer datasets were create…

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Jupyter Notebook 100.0%