Modified Blog Authorship Dataset

This data is sourced from the "Blog Authorship Corpus", available here. The original dataset was tokenized and split into sentences using spaCy. Sentences with fewer than 5 tokens or more than 30 tokens were discarded. Number-like tokens were replaced by "<#>". All tokens outside the 9,999 most common were replaced by a single out-of-vocabulary placeholder token, giving a vocabulary of 10,000 words. Each sentence was tagged with the author's gender (0 for male, 1 for female) and age bracket (0 for teens, 1 for 20s, 2 for 30s), and the results were placed into a pandas DataFrame.
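The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration, not the code used to build the dataset: it swaps spaCy for plain whitespace tokenization and a simple regular expression for number detection so that it runs with the standard library only. The names `tokenize`, `preprocess`, `NUM_RE`, and the `"<unk>"` placeholder string are assumptions made for the example; only `"<#>"`, the length limits, the vocabulary size, and the label encodings come from the description.

```python
import re
from collections import Counter

NUM_TOKEN = "<#>"    # number placeholder named in the description
UNK_TOKEN = "<unk>"  # assumption: the real out-of-vocabulary string is not given
MIN_LEN, MAX_LEN = 5, 30

# Rough stand-in for a "number-like" check (assumption, not spaCy's rule).
NUM_RE = re.compile(r"^[+-]?\d[\d,]*(\.\d+)?$")


def tokenize(sentence):
    """Whitespace-tokenize and replace number-like tokens with NUM_TOKEN."""
    return [NUM_TOKEN if NUM_RE.match(tok) else tok for tok in sentence.split()]


def preprocess(records, vocab_size=10000):
    """Filter by length, cap the vocabulary, and keep the labels.

    records: iterable of (sentence, gender, age_bracket) tuples, where
    gender is 0/1 and age_bracket is 0/1/2 as in the description.
    Returns (rows, vocab): rows is a list of dicts, vocab a set of tokens.
    """
    kept = []
    for sentence, gender, age in records:
        tokens = tokenize(sentence)
        # Discard sentences shorter than 5 or longer than 30 tokens.
        if MIN_LEN <= len(tokens) <= MAX_LEN:
            kept.append((tokens, gender, age))

    # Keep the (vocab_size - 1) most common tokens; the rest become UNK_TOKEN.
    counts = Counter(tok for tokens, _, _ in kept for tok in tokens)
    common = {tok for tok, _ in counts.most_common(vocab_size - 1)}

    rows = [
        {
            "tokens": [t if t in common else UNK_TOKEN for t in tokens],
            "gender": gender,
            "age_bracket": age,
        }
        for tokens, gender, age in kept
    ]
    return rows, common | {UNK_TOKEN}
```

The returned `rows` list is a list of dictionaries, so it could be passed to `pandas.DataFrame(rows)` to get a table with one sentence per row.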

Files in the repository:

.gitignore
README.md
__init__.py
blogs.h5_0
blogs.h5_1
blogs.h5_2
blogs_vocab.pickle
