This project was carried out in collaboration with Nir Son, Eviatar Nachshoni, and Yosef Danan, under the supervision of Prof. Lee-Ad Gottlieb.
https://minikube.sigs.k8s.io/docs/start/
pip install -r requirements.txt
pip install -r requirements-dev.txt
make fmt
make lint
The author identification problem is to identify the author of a given document from a group of candidate authors. Our research focuses on distinguishing between different authors when they write about similar topics.
We use the C50 dataset, which consists of 2,500 texts by 50 different authors (50 texts per author) for training, and the same number for testing. The texts are not particularly long: the average length is around 500 words.
- https://archive.ics.uci.edu/ml/datasets/Reuter_50_50#
- https://drive.google.com/file/d/1UnTLPc0pnxDZUso-ruCu_egOnHHkJ0sh/view?usp=sharing
- https://archive.ics.uci.edu/ml/machine-learning-databases/00217/C50.zip (download it and put the extracted files in Data\C50)
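Once the archive is unpacked, the texts can be read into memory with a small loader. The following is only a sketch: it assumes each split directory (for example Data/C50/C50train) contains one subfolder per author holding plain .txt files, and the function name `load_c50_split` is hypothetical and not part of this repository.

```python
from pathlib import Path


def load_c50_split(split_dir):
    """Return (texts, authors) for a split laid out as <author>/<file>.txt.

    The directory layout is an assumption about the unpacked archive.
    """
    texts, authors = [], []
    for author_dir in sorted(Path(split_dir).iterdir()):
        if not author_dir.is_dir():
            continue  # skip stray files next to the author folders
        for text_file in sorted(author_dir.glob("*.txt")):
            # errors="replace" keeps loading robust to odd bytes in the corpus
            texts.append(text_file.read_text(encoding="utf-8", errors="replace"))
            authors.append(author_dir.name)  # folder name doubles as the label
    return texts, authors
```

Calling it once for the training directory and once for the test directory would give two parallel lists of texts and author labels per split.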
pre-commit install
We approached the problem as a classic classification problem: that is, we try to predict the most likely author of an anonymous text out of the 50 authors in the original set only. We tried to solve it using both stylistic features and content features, with a variety of machine learning models.
We first distinguish between two fundamental types of features:
- Style: We used the baseline from Stanford with three style features (average sentence length, average word length, and hapax dislegomena, a measure of lexical diversity) and an XGBoost model for training. This yielded an accuracy of 0.12.
- Content: We started with a relatively simple attempt, a bag-of-words representation with Naive Bayes. This model yielded an accuracy of 0.58.
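To make the three style features concrete, here is a minimal sketch of how they could be computed for one document. The tokenization rules (splitting sentences on ., ! and ?, and treating runs of letters as words) and the normalization of hapax dislegomena by vocabulary size are assumptions for illustration, not necessarily what the baseline does. The function name `style_features` is hypothetical.

```python
import re
from collections import Counter


def style_features(text):
    """Return [avg sentence length in words, avg word length, dislegomena ratio]."""
    # Naive sentence split on terminal punctuation (an assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    if not sentences or not words:
        return [0.0, 0.0, 0.0]
    counts = Counter(words)
    # Hapax dislegomena: word types that occur exactly twice in the text.
    dislegomena = sum(1 for c in counts.values() if c == 2)
    return [
        len(words) / len(sentences),
        sum(len(w) for w in words) / len(words),
        dislegomena / len(counts),
    ]
```

The resulting three-element vectors, one per document, are the kind of input a gradient-boosted classifier such as XGBoost could then be trained on.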
We first encode each sentence using GloVe50, then apply average pooling over all the sentences.
From this we obtain a vector containing a probability for each of the authors.
We compare the maximum probability to a threshold; if it is lower than the threshold, we extract the 10 authors with the highest probabilities.
We then determine the correct author among them using a pure style model, extracting both complex and simple features for each document. The accuracy we obtained is 83.2%.
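The pooling and thresholding steps described above can be sketched as follows. This is an illustration under stated assumptions: the threshold value 0.5 is a placeholder (the actual value is not given here), accepting the top prediction directly when it clears the threshold is our reading of the description, and the names `average_pool` and `shortlist_authors` are hypothetical.

```python
def average_pool(vectors):
    """Element-wise mean of a non-empty list of equally sized vectors."""
    if not vectors:
        raise ValueError("cannot pool an empty list of vectors")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def shortlist_authors(probabilities, threshold=0.5, k=10):
    """Return candidate authors from a {author: probability} mapping.

    If the top probability reaches the threshold, only that author is
    returned; otherwise the k most probable authors are returned so that
    a second (style-based) model can choose among them.
    """
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    if probabilities[ranked[0]] >= threshold:
        return ranked[:1]
    return ranked[:k]
```

For example, `shortlist_authors({"a": 0.4, "b": 0.35, "c": 0.25}, threshold=0.5, k=2)` falls below the threshold and so returns the two most probable authors for the style model to decide between.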