Data Validation

This program provides a general framework for testing CSV files against a configuration stored in a JSON file.

The program can be run with python3 main.py

Configuration

file_schema.json contains a list of files to be tested.

file_schema.json is in the following format:

contains a "data_dir" key/value pair for the directory where the files are located
contains a "report_dir" key/value pair for the directory where the reports should be written
contains a "files" key/value pair for the list of files to be checked
- each file entry contains the filename of the file to be tested
- each file entry contains the list of columns to be checked
  - each column contains the column name
  - each column contains a regex that the column's values are tested against
  - each column array can contain a "req" argument to mark that column as required
    - this factors into the report
    - a row either has every required field matching its regex or it does not; the number of rows in which all required fields pass is given in the report
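
To make the format above more concrete, here is a minimal sketch of what a configuration and the required-row count might look like. Only "data_dir", "report_dir", "files" and "req" come from the description above; the key names "filename" and "columns", the exact layout of each column array, the sample data, and the helper count_passing_rows are assumptions for illustration and may not match main.py. The real column arrays also end with a number, which this sketch leaves out.

```python
import csv
import io
import json
import re

# Hypothetical configuration: the key names "filename" and "columns" and the
# order of items inside each column array are assumptions, not the real schema.
CONFIG_TEXT = """
{
  "data_dir": "data",
  "report_dir": "reports",
  "files": [
    {
      "filename": "users.csv",
      "columns": [
        ["id", "^[0-9]+$", "req"],
        ["email", "^[^@]+@[^@]+$", "req"],
        ["nickname", "^[a-z]*$"]
      ]
    }
  ]
}
"""


def count_passing_rows(rows, columns):
    """Count rows in which every required column matches its regex."""
    # Keep only the columns flagged with "req" and compile their patterns once.
    required = [
        (col[0], re.compile(col[1])) for col in columns if "req" in col[2:]
    ]
    passing = 0
    for row in rows:
        # A missing cell is treated as an empty string.
        if all(pattern.search(row.get(name) or "") for name, pattern in required):
            passing += 1
    return passing


config = json.loads(CONFIG_TEXT)
file_entry = config["files"][0]

# In-memory stand-in for the CSV file named in the configuration.
sample = io.StringIO(
    "id,email,nickname\n"
    "1,a@example.com,al\n"
    "x,b@example.com,bo\n"
    "3,,cy\n"
)
rows = list(csv.DictReader(sample))
passing = count_passing_rows(rows, file_entry["columns"])
print(f"{passing} of {len(rows)} rows pass all required columns")
```

In this sample only the first row passes: the second has a non-numeric id and the third has an empty email, and both of those columns are required.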

Design considerations

The program should be extensible to other types of files and other file schemas
The design should be modular
Functions should be small and do only one thing
The file schema config should be easy to understand and modify

I started out thinking I should iterate through the rows, checking each value against its regex and keeping a running total of empty rows and empty columns. This would make the checks easy to verify.

The program should generate a report that includes the number of empty columns, empty rows, and bad values.

I also considered keeping a record of all the bad rows and bad values, or providing a way to generate such a list.
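
The idea in the paragraphs above could be sketched roughly as follows. This is not the code in main.py; the function name summarize, its arguments, and the shape of the returned dictionary are all hypothetical, and the choice to skip regex checks on completely empty rows is an assumption.

```python
import re


def summarize(rows, patterns):
    """Tally empty rows, empty columns, and values that fail their regex.

    rows: list of dicts mapping column name to cell text.
    patterns: dict mapping column name to a regex string.
    """
    compiled = {name: re.compile(rx) for name, rx in patterns.items()}
    empty_rows = 0
    non_empty_seen = {name: False for name in compiled}
    bad_values = []  # (row number, column name, offending value)

    for number, row in enumerate(rows, start=1):
        cells = {name: (row.get(name) or "").strip() for name in compiled}
        if not any(cells.values()):
            # Every checked cell is blank: count the row and skip regex checks.
            empty_rows += 1
            continue
        for name, value in cells.items():
            if value:
                non_empty_seen[name] = True
            if not compiled[name].search(value):
                bad_values.append((number, name, value))

    # A column is empty if no row had a non-blank value in it.
    empty_columns = [name for name, seen in non_empty_seen.items() if not seen]
    return {
        "empty_rows": empty_rows,
        "empty_columns": empty_columns,
        "bad_values": bad_values,
    }


example_rows = [
    {"id": "1", "note": ""},
    {"id": "", "note": ""},
    {"id": "abc", "note": ""},
]
example_report = summarize(example_rows, {"id": "^[0-9]+$", "note": "^.*$"})
print(example_report)
```

Recording the row number and column name alongside each bad value keeps the list of bad rows recoverable from the same pass that produces the counts.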

I ended up sticking with the regex approach because it allows a great deal of flexibility in configuring the project without having to change the project code. I chose a JSON configuration file because JSON is a standard format that is easy to work with, and it is easy to modify in a way that allows the program to be extended.

Room for improvement

This project could use unit tests
Support could be implemented for kinds of files other than .csv files
Type hints could be added
check_csv_file could be split into two or three smaller functions, increasing readability and modularity
There could be a function to automatically generate a default configuration file, or to generate one from an existing file
The numbers at the end of the column arrays in the config could be removed, since I don't use them anywhere
If performance is a concern, the checks could be parallelized
This project could be turned into an actual Python module
A command line argument could specify the location of the configuration file to be used
A percentage threshold field could be added for each file, applied to the required fields, to determine whether the file is okay or not
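
As a rough sketch of the last two items, the configuration path could be read with argparse and the threshold applied to the count of passing rows. The --config flag, the function names parse_args and file_is_ok, and the rule that a file with no rows fails are all assumptions for illustration; none of them exist in the project today.

```python
import argparse


def parse_args(argv=None):
    """Parse command line arguments; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(
        description="Validate CSV files against a JSON schema."
    )
    parser.add_argument(
        "--config",
        default="file_schema.json",
        help="path to the configuration file (hypothetical flag)",
    )
    return parser.parse_args(argv)


def file_is_ok(passing_rows, total_rows, threshold_percent):
    """Return True if enough rows pass all required fields.

    A file with no rows is treated as failing, which is an assumption.
    """
    if total_rows == 0:
        return False
    return 100 * passing_rows / total_rows >= threshold_percent


# Passing an explicit list keeps the example independent of the real command line.
args = parse_args(["--config", "other_schema.json"])
print(args.config)
print(file_is_ok(passing_rows=9, total_rows=10, threshold_percent=90))
```

With no arguments the sketch falls back to file_schema.json, so the current way of running the program would keep working.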

Duderichy/Basic-Data-Validator
