Data Quality Assessment

Project Description

Open data often has quality issues. These can negatively impact the results of analyses or data-driven models based on the data.

In this project, our goal is to identify quality issues in datasets from NYC Open Data. In particular, we develop methods to automatically identify suspicious/anomalous values in tables. These values may be explicit null values (e.g., N/A), disguised null values (e.g., 999-999-9999 in a phone number column), syntactic outliers (e.g., alpha-numeric strings in a column consisting of alphabetic last names), semantic outliers (e.g., a state name in a column of city names), misspellings (e.g., brokklyn vs brooklyn), etc. One important question to explore is: can we define quality measures for datasets to help users make more informed decisions when they are selecting datasets?

Purpose

develop (semi-)automated techniques that classify column values into one of the following four categories:

Valid value
Misspelling/Abbreviation of a valid value
Invalid value
NULL value For values in class 2 list the valid value. For invalid values, flag them as semantic outliers (e.g., values that are valid in a different column/domain) if appropriate.

NULL values can be represented in many different ways (e.g., NULL, n/a, 999, 999-999-9999) and there are different types of outliers, therefore, manual inspection will be required to obtain this information. You can use existing profiling tools to help you better understand the data, e.g., the Datamart profiler (https://docs.auctus.vida-nyu.org/python/datamart-profiler.html) and data-profile-viewer, both available as Python packages: https://pypi.org/project/datamart-profiler https://github.com/soniacq/DataProfileVis

You should also try to answer the following questions:

Are there patterns for how NULL values/outliers are represented?
What is the precision and recall of the techniques you designed/implemented?
When and why does your approach fail?

Architecture

Method

NULL Value
1. Explicit Null
  - Pandas isNull()
  - Regular expression matching
2. Disguised MAR/MCAR(Missing At Random/Missing Completely At Random) values
  - FAHES(Qahtan et al., 2018)
  - MCAR: The missingness of data is entirely unrelated to any data.
  - MAR: The missingness of data is related to some of the observed data but not to the missing data itself.
3. Repetitive pattern
  - Modified source code of FAHES(Qahtan et al., 2018)
  - Standard deviation of character difference
Misspell & Abbreviation
- SpellChecker in Python Library
- GloVe(Global Vectors for Word Representation): a method based on semantic similarity.
Invalid
1. Type Invalid
  - Numeric data: check character and range
  - Date format: regular expression
2. Semantic Invalid
  - Invalid country: pycountry (a python library)
3. Outliers
  
  Algorithms:
  
  KNN(using largest, mean, median distance respectively), Isolation Forest, ABOD, HBOS, ECOD, Feature Bagging.

Team

Result Output

output

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
.ipynb_checkpoints		.ipynb_checkpoints
Invalid		Invalid
Misspelling_Abbreviation		Misspelling_Abbreviation
NULL		NULL
dataset		dataset
img		img
output		output
.gitignore		.gitignore
Final_report_Data_Quality_Assessment.pdf		Final_report_Data_Quality_Assessment.pdf
README.md		README.md
top_module.ipynb		top_module.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data Quality Assessment

Project Description

Purpose

Architecture

Method

Team

Result Output

References & Resource

About

Releases

Packages

Contributors 3

Languages

Iris-Song/PerHapS

Folders and files

Latest commit

History

Repository files navigation

Data Quality Assessment

Project Description

Purpose

Architecture

Method

Team

Result Output

References & Resource

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages