Skip to content

spitis/blogs_data

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Modified Blog Authorship Dataset

This data is sourced from the "Blog Authorship Corpus", available here. The original dataset was tokenized and split into sentences using spacy. Sentences with less than 5 tokens and sentences with more than 30 tokens were discarded. Number-like tokens were replaced by "<#>". Tokens other than the 9999 most common tokens were replaced by "", for a vocabulary of 10000 words. Sentences were tagged with the gender (0 for male, 1 for female) and age bracket (0 for teens, 1 for 20s, 2 for 30s) and placed into a pandas dataframe.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages