Code base for collecting, munging, pre- and post-processing, modeling, and analyzing my dissertation data.

joelchan/openideo-data-processing-pipeline

openideo-data-processing-pipeline

This folder holds the code I currently use to pre-process and post-process my data, along with some code for data collection (e.g., downloading HTML files and extracting comments and genealogies). I'm still working out the structure of the pipeline, but the code here handles:

  • extracting genealogies from pairwise citation data
  • extracting comments from HTML files
  • tokenization of text
  • feature selection and input file preparation for semantic models
  • searching the feature space (using gensim) for LSA and LDA
  • similarity queries for semantic models
  • miscellaneous R code for post-processing and exploring the data
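
To give a rough idea of what the first item in the list above involves, here is a minimal sketch of extracting a genealogy from pairwise citation data. It is not the code in this repository. It assumes each pair is `(citing_id, cited_id)` and that a "genealogy" means every directly or indirectly cited source together with its distance from the starting item. The function names and the example IDs are hypothetical.

```python
from collections import defaultdict


def build_citation_graph(pairs):
    """Map each item to the set of items it cites directly.

    Assumption: `pairs` is an iterable of (citing_id, cited_id) tuples.
    """
    graph = defaultdict(set)
    for citing, cited in pairs:
        graph[citing].add(cited)
    return graph


def genealogy(graph, item):
    """Return {ancestor_id: depth} for every source reachable from `item`.

    Depth 1 is a direct citation, depth 2 is a source of a source, and so on.
    A breadth-first walk records the shortest distance to each ancestor and
    tolerates cycles in the citation data.
    """
    depths = {}
    frontier = [item]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for node in frontier:
            for parent in graph.get(node, ()):
                if parent != item and parent not in depths:
                    depths[parent] = depth
                    next_frontier.append(parent)
        frontier = next_frontier
    return depths


# Hypothetical example data: concept c3 builds on c2 and i1, and so on.
example_pairs = [("c3", "c2"), ("c2", "c1"), ("c3", "i1"), ("c1", "i1")]
example_graph = build_citation_graph(example_pairs)
example_result = genealogy(example_graph, "c3")
```

With the example pairs, `c2` and `i1` are direct sources of `c3`, and `c1` is reached one step further back through `c2`. Recording the shortest depth is a design choice in this sketch; a real analysis might instead want every path.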

More to come...
