dportabella/cleaning_data_tutorial


Cleaning data

Simple examples about cleaning text data

See the source code, output, and comments in the Scala files:

  • Word freq
  • Special chars, identify and clean
  • Word stemmer
  • Natural Language Processing
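As a rough illustration of the first two topics (word frequency and special-character cleaning), here is a minimal sketch. It is not the repository's code: the object name `CleaningSketch`, its helper functions, and the sample sentence are all hypothetical, and it relies only on the JDK's `java.text.Normalizer`.

```scala
import java.text.Normalizer

// Hypothetical sketch, not taken from the repository.
object CleaningSketch {
  // Decompose accented characters (NFD), then drop the combining marks.
  def stripAccents(s: String): String =
    Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "")

  // Strip accents, lower-case, and split on anything that is not a letter or digit.
  def tokenize(text: String): Seq[String] =
    stripAccents(text).toLowerCase
      .split("[^\\p{L}\\p{N}]+")
      .filter(_.nonEmpty)
      .toSeq

  // Count how often each cleaned token occurs.
  def wordFreq(text: String): Map[String, Int] =
    tokenize(text).groupBy(identity).map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    val text = "École Polytechnique; ecole polytechnique -- EPFL!"
    // Print the most frequent words first, ties in alphabetical order.
    wordFreq(text).toSeq
      .sortBy { case (word, n) => (-n, word) }
      .foreach { case (word, n) => println(s"$word\t$n") }
  }
}
```

Normalizing before counting is what lets "École" and "ecole" land in the same bucket; without that step they would be counted as two different words.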

## notebook/regexs.ipynb, regexs.pdf

  • Regexes
  • Stop words
  • Find patterns in tokens
  • Querying PATSTAT outside the SQL relational model
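The notebook topics above can be sketched in Scala as follows. This is only an illustration under stated assumptions: the stop-word list, the `shape` helper, the `PatentNumber` regex, and the sample numbers are made up for this example and are not taken from the notebook or from PATSTAT.

```scala
// Hypothetical sketch, not taken from the repository or the notebook.
object TokenPatternSketch {
  // Tiny illustrative stop-word list (an assumption, not the list used in the project).
  val stopWords: Set[String] = Set("the", "a", "of", "and", "for")

  def removeStopWords(tokens: Seq[String]): Seq[String] =
    tokens.filterNot(t => stopWords.contains(t.toLowerCase))

  // Reduce a token to its "shape": letters become 'A', digits become '9',
  // every other character is kept as-is.
  def shape(token: String): String =
    token.map {
      case c if c.isLetter => 'A'
      case c if c.isDigit  => '9'
      case c               => c
    }

  // Assumed format for this sketch: a two-letter office code followed by digits.
  val PatentNumber = """([A-Z]{2})(\d+)""".r

  def parse(s: String): Option[(String, String)] = s match {
    case PatentNumber(office, number) => Some((office, number))
    case _                            => None
  }

  def main(args: Array[String]): Unit = {
    println(removeStopWords(Seq("The", "cleaning", "of", "text", "data")))

    // Made-up sample values; group them by shape to see which formats occur.
    val samples = Seq("EP1234567", "US7654321", "WO2015/123456", "EP7654321")
    samples.groupBy(shape).foreach { case (pattern, xs) => println(s"$pattern\t${xs.size}") }

    println(parse("EP1234567"))
  }
}
```

Grouping tokens by shape is a cheap way to discover which formats exist in a column before writing a regex for each of them.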

Next time

  • comparing text, text distance, alignment, disambiguation, Google Refine…
  • regex vs CFG, web scraping, table stats, validation, data curation workflow

Requirements

How to run a Scala example

$ sbt "runMain application.TextCleanExample"
$ sbt "runMain application.StanfordNLPExample"
$ export dbUrl="jdbc:mysql://example.com/patstat_2015a?user=__USER__&password=__PASSWORD__&useSSL=false"
$ sbt "runMain application.RemoveStopWordsExample $dbUrl"
$ sbt "runMain application.PatentNumbersPatterns $dbUrl"
$ sbt "runMain application.EPFLPatentsProject $dbUrl"

How to run Jupyter with the regexs example

docker run -it --rm -p 8888:8888 -v $PWD/notebook:/home/jovyan/work jupyter/all-spark-notebook start-notebook.sh

Do you have other use cases or questions?

Contact me at [email protected]
