Author identification is the problem of identifying the author of a test document from a group of potential authors.

EN555/Author-Identification

Author Identification

This project was carried out in collaboration with Nir Son, Eviatar Nachshoni, and Yosef Danan, under the supervision of Prof. Lee-Ad Gottlieb.

Table of contents

Installation

https://minikube.sigs.k8s.io/docs/start/

pip install -r requirements.txt
pip install -r requirements-dev.txt
make fmt
make lint

Project Goal

Author identification is the problem of identifying the author of a test document from a group of potential authors. Our research focuses on distinguishing between different authors when they write about similar topics.

Data

We use the C50 dataset, which consists of 2,500 texts by 50 different authors (50 texts per author) for training, and the same number for testing. The texts are not particularly long: the average length is around 500 words.
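As a rough illustration of how such a split could be read into memory, the sketch below walks a directory tree and uses each sub-directory name as the author label. The layout (one folder per author containing `.txt` files) and the helper name `load_split` are assumptions made for this example, not something this README specifies.

```python
from pathlib import Path


def load_split(root):
    """Return (texts, labels) for one split of the corpus.

    Assumed layout (hypothetical): root/<author_name>/<document>.txt
    """
    texts, labels = [], []
    for author_dir in sorted(Path(root).iterdir()):
        if not author_dir.is_dir():
            continue  # skip stray files at the top level
        for path in sorted(author_dir.glob("*.txt")):
            # errors="replace" keeps loading robust to odd bytes in the corpus
            texts.append(path.read_text(encoding="utf-8", errors="replace"))
            labels.append(author_dir.name)
    return texts, labels
```

Calling `load_split` once on the training folder and once on the test folder would give two parallel lists per split, ready to pass to a feature extractor.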

pre-commit install

Task Challenge

We approached the problem as a classic classification problem: that is, we try to predict the most likely author of an anonymous text out of the 50 authors in the original set only. We tried to solve this problem using both stylistic features and content features, and with a variety of machine learning models.

Baseline Models

We first distinguish between two fundamental kinds of features:

  1. Style: We used the baseline from Stanford with three style features (average sentence length, average word length, and hapax dislegomena (lexical diversity)) and an XGBoost model for training. This yielded an accuracy of 0.12.
  2. Content: We started with a relatively simple attempt, a bag-of-words representation with Naive Bayes. This model yielded an accuracy of 0.58.
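The three style features named in the first baseline can be sketched as follows. This is a minimal illustration, not the Stanford baseline itself: the regex-based sentence and word splitting, the function name `style_features`, and the choice to normalize the hapax dislegomena count (words occurring exactly twice) by vocabulary size are all assumptions.

```python
import re
from collections import Counter


def style_features(text):
    """Compute three simple stylistic features for one document."""
    # Naive sentence split on ., ! and ? (assumption; a real tokenizer is more robust)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words or not sentences:
        return {
            "avg_sentence_length": 0.0,
            "avg_word_length": 0.0,
            "hapax_dislegomena_ratio": 0.0,
        }
    counts = Counter(words)
    # Hapax dislegomena: word types that occur exactly twice in the document
    dislegomena = sum(1 for c in counts.values() if c == 2)
    return {
        "avg_sentence_length": len(words) / len(sentences),  # words per sentence
        "avg_word_length": sum(len(w) for w in words) / len(words),  # characters per word
        "hapax_dislegomena_ratio": dislegomena / len(counts),  # share of the vocabulary
    }
```

The resulting dictionary values can be stacked into a three-column feature matrix and passed to any classifier; with so few features, a low accuracy such as the one reported for this baseline is not surprising.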

Our Approach

We first encode each sentence using GloVe50 embeddings, then apply average pooling over the whole sentence.

From this we obtain a vector containing a probability for each of the authors.
We compare the maximum probability to a threshold; if it is lower than the threshold, we extract the 10 authors with the highest probabilities.

We then determine the correct author using a pure style model, for which we extract complex and simple features from each document. The accuracy we obtained is 83.2%.
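The pooling and candidate-selection steps described above might look roughly like the sketch below. The helper names, the toy embedding table, and the default threshold of 0.5 are hypothetical; the README does not state the threshold value, and accepting the top prediction directly when it clears the threshold is an assumption about the unspecified branch.

```python
def average_pool(tokens, embeddings, dim):
    """Average the vectors of all tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim  # no known token: fall back to a zero vector
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def select_candidates(probs, threshold=0.5, top_k=10):
    """Pick the author(s) to pass on, given a mapping author -> probability.

    If the best probability reaches the threshold, only that author is kept
    (assumption). Otherwise the top_k most probable authors are returned.
    """
    ranked = sorted(probs, key=probs.get, reverse=True)
    if probs[ranked[0]] >= threshold:
        return ranked[:1]
    return ranked[:top_k]
```

With real data, `embeddings` would map words to 50-dimensional vectors and `probs` would come from whichever classifier is trained on the pooled vectors; the shortlist returned in the low-confidence case is what a second-stage model would then choose from.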
