Experiments with Convolutional Neural Networks for Multi-Label Authorship Attribution

Dainis Boumber, Yifan Zhang, Arjun Mukherjee

University of Houston

Publication pending review.

Description

We explore the use of Convolutional Neural Networks (CNNs) for multi-label Authorship Attribution (AA) problems and propose a CNN specifically designed for such tasks. By treating smaller documents as sentences and, for longer documents, averaging the author probability distributions at the sentence level, our design adapts to single-label datasets and various document sizes while retaining the capabilities of a traditional CNN. As part of this work, we also create and make publicly available a multi-label Authorship Attribution dataset (MLPA-400), consisting of 400 scientific publications by 20 authors from the field of Machine Learning. Experimental results demonstrate that our method outperforms several state-of-the-art models on the proposed task.
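The sentence-level averaging described above can be illustrated with a minimal sketch. It is not the code from this repository: the function names, the example probabilities, and the 0.5 decision threshold are all assumptions made for illustration, and the per-sentence probabilities are taken as given rather than produced by a CNN.

```python
from typing import List, Sequence


def average_author_probabilities(
    sentence_probs: Sequence[Sequence[float]],
) -> List[float]:
    """Average per-sentence author probabilities into one document-level vector.

    Each inner sequence holds one probability per candidate author for a
    single sentence (hypothetical input; a model would normally produce it).
    """
    if not sentence_probs:
        raise ValueError("need at least one sentence")
    n_authors = len(sentence_probs[0])
    if any(len(row) != n_authors for row in sentence_probs):
        raise ValueError("every sentence needs one probability per author")
    n_sentences = len(sentence_probs)
    return [
        sum(row[a] for row in sentence_probs) / n_sentences
        for a in range(n_authors)
    ]


def predict_authors(doc_probs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Return indices of authors whose averaged probability reaches the threshold.

    The threshold of 0.5 is an assumption, not a value taken from the paper.
    """
    return [i for i, p in enumerate(doc_probs) if p >= threshold]


# Made-up probabilities for a three-sentence document and three authors.
sentence_probs = [
    [0.9, 0.2, 0.6],
    [0.7, 0.1, 0.8],
    [0.8, 0.3, 0.4],
]
doc_probs = average_author_probabilities(sentence_probs)
predicted = predict_authors(doc_probs)
print(doc_probs, predicted)
```

Because each author is scored independently, more than one index can pass the threshold, which is what makes the prediction multi-label. A short document treated as a single sentence is the special case of one row.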

Multi-label CNN

Prerequisites: scikit-learn, TensorFlow v1.0, Python 3

Run python aa.py for a sample set of experiments. Place any additional data under datahelpers/data (the default location, which can be changed).

MLPA-400 dataset

The Machine Learning Papers' Authorship (MLPA-400) dataset contains 20 publications by each of the top 20 authors in ML, for a total of 400.

The data is located in the ./ml_dataset directory. You can also obtain it as a tarball from

https://drive.google.com/open?id=0B_LjdXSWGw1YR2dIek95bGFfZEE

Labels.csv contains the ground truths. Each row consists of a plain-text first field followed by 20 comma-separated values, <author_1>,...,<author_20>, where <author_n> is a digit, 0 or 1, indicating whether that author is one of the co-authors. The first row is the header row.
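As a rough sketch of how a file in this layout might be read, the snippet below parses a small in-memory sample with Python's standard csv module. The function name read_labels, the column names, and the paper identifiers are hypothetical, and the sample uses three authors instead of 20 to stay short; the real header and first-field values in Labels.csv may differ.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple


def read_labels(handle: TextIO) -> Tuple[List[str], Dict[str, List[int]]]:
    """Parse a label file: a header row, then one row of 0/1 flags per paper.

    Returns the author column names and a mapping from the plain-text first
    field of each row to its list of integer flags.
    """
    reader = csv.reader(handle)
    header = next(reader)      # the first row is the header row
    authors = header[1:]       # every column after the first is an author
    labels: Dict[str, List[int]] = {}
    for row in reader:
        if not row:            # skip blank lines
            continue
        labels[row[0]] = [int(value) for value in row[1:]]
    return authors, labels


# Hypothetical sample in the same layout, with three authors instead of 20.
sample = io.StringIO(
    "paper,author_a,author_b,author_c\n"
    "paper_001.txt,1,0,1\n"
    "paper_002.txt,0,1,0\n"
)
authors, labels = read_labels(sample)

# Names of the co-authors flagged with 1 for the first sample paper.
coauthors = [name for name, flag in zip(authors, labels["paper_001.txt"]) if flag]
print(coauthors)
```

To apply this to the dataset itself, one would pass an open handle on Labels.csv in place of the in-memory sample, assuming the file matches the layout described above.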

See MLPA-400 for more details.

