Author Identification

This project was carried out in collaboration with Nir Son, Eviatar Nachshoni, and Yosef Danan, under the supervision of Prof. Lee-Ad Gottlieb.

Table of contents

Installation
Project Goal
Data
Task Challenge
Baseline Models
Our Approach

Installation

Install minikube: https://minikube.sigs.k8s.io/docs/start/

Install the runtime dependencies:

pip install -r requirements.txt

For development, install the dev dependencies and the pre-commit hooks, then format and lint the code:

pip install -r requirements-dev.txt
pre-commit install
make fmt
make lint

Project Goal

Author identification is the problem of identifying the author of a given document from a group of potential authors. Our research focuses on distinguishing between different authors when they write about similar topics.

Data

We use the C50 data set, which consists of 2,500 texts by 50 different authors (50 texts per author) for training, and the same number for testing. The texts are not particularly long: the average length is around 500 words.

Task Challenge

We approached the problem as a classic classification problem: that is, predicting the most likely author of an anonymous text out of (only) the 50 authors in the original set. We tried to solve this problem using both stylistic features and content features, and with a variety of machine learning models.

Baseline Models

We first distinguish between two fundamental types of features:

Style: We used the baseline from Stanford with three style features (average sentence length, average word length, and hapax dislegomena, a measure of lexical diversity) and an XGBoost model for training. This yielded an accuracy of 0.12.
Content: We started with a relatively simple attempt: a bag-of-words representation with Naive Bayes. This model yielded an accuracy of 0.58.
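
The three style features named above can be sketched in a few lines of Python. This is only an illustration, not the project's implementation: the function name `style_features`, the naive regex-based sentence and word splitting, and the choice to normalize hapax dislegomena (words that occur exactly twice) by the number of distinct words are all assumptions.

```python
import re
from collections import Counter


def style_features(text):
    """Compute three simple stylometric features for one document."""
    # Naive sentence split on '.', '!' or '?' (assumption; a real tokenizer may differ).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased alphabetic tokens, keeping apostrophes inside words.
    words = re.findall(r"[a-z']+", text.lower())
    if not sentences or not words:
        return {
            "avg_sentence_length": 0.0,
            "avg_word_length": 0.0,
            "hapax_dislegomena_ratio": 0.0,
        }

    counts = Counter(words)
    # Hapax dislegomena: distinct words that occur exactly twice.
    dislegomena = sum(1 for c in counts.values() if c == 2)
    return {
        # Average number of words per sentence.
        "avg_sentence_length": len(words) / len(sentences),
        # Average number of characters per word.
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Share of distinct words that occur exactly twice (assumed normalization).
        "hapax_dislegomena_ratio": dislegomena / len(counts),
    }


features = style_features("The cat sat. The dog ran fast.")
print(features)
```

A feature dictionary like this, computed per document, could then be turned into a row of a feature matrix for a gradient-boosted classifier.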

Our Approach

We first encode each sentence using GloVe50 and then apply average pooling over all the sentences.
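
Average pooling can be illustrated with a small pure-Python sketch. It assumes that a sentence vector is the mean of its word vectors and that the document vector is the mean of its sentence vectors; the source does not spell this out. The helper names and the tiny 2-dimensional `toy_embeddings` table, which stands in for the 50-dimensional GloVe vectors, are hypothetical.

```python
def mean_vector(vectors, dim):
    """Element-wise mean of equal-length vectors; zeros if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def embed_document(sentences, embeddings, dim):
    """Average word vectors per sentence, then average the sentence vectors."""
    sentence_vectors = []
    for sentence in sentences:
        # Words missing from the embedding table are skipped (out-of-vocabulary).
        word_vectors = [
            embeddings[w] for w in sentence.lower().split() if w in embeddings
        ]
        if word_vectors:
            sentence_vectors.append(mean_vector(word_vectors, dim))
    return mean_vector(sentence_vectors, dim)


# Toy 2-dimensional table standing in for real 50-dimensional GloVe vectors.
toy_embeddings = {
    "stocks": [1.0, 0.0],
    "fell": [0.0, 1.0],
    "sharply": [1.0, 1.0],
}

doc_vector = embed_document(
    ["Stocks fell", "Stocks fell sharply today"], toy_embeddings, dim=2
)
print(doc_vector)
```

The result is a fixed-length vector regardless of document length, which is what makes it usable as input to a standard classifier.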

From this representation we obtain a vector containing a probability for each of the authors.
We compare the maximum probability to a threshold; if it is lower than the threshold, we extract the 10 authors with the highest probabilities.
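
The thresholding step might look roughly like the sketch below. The function name `select_candidates`, the author labels, and the threshold value of 0.5 are hypothetical; the source does not state the threshold that was used. The example uses `k=2` only to keep the toy data small, whereas the text above uses 10.

```python
def select_candidates(probabilities, threshold, k=10):
    """Return one author if the top probability reaches the threshold,
    otherwise the k most probable authors for a second-stage model."""
    if not probabilities:
        return []
    # Author labels sorted from most to least probable.
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    best = ranked[0]
    if probabilities[best] >= threshold:
        return [best]
    return ranked[:k]


# Confident prediction: the top probability clears the (assumed) threshold.
confident = {"author_a": 0.7, "author_b": 0.2, "author_c": 0.1}
print(select_candidates(confident, threshold=0.5))

# Uncertain prediction: fall back to the top-k candidates.
uncertain = {"author_a": 0.4, "author_b": 0.35, "author_c": 0.25}
print(select_candidates(uncertain, threshold=0.5, k=2))
```

Returning a list in both cases keeps the calling code uniform: a single-element list means the first-stage prediction is accepted, and a longer list is the candidate set for the next stage.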

We then determine the author using a pure style model, for which we extract both complex and simple features from each document. The accuracy we obtained was 83.2%.
