An open source project from the Data to AI Lab at MIT.
Data Lineage Tracing Library
- License: MIT
- Development Status: Pre-Alpha
- Homepage: https://github.com/data-dev/DataTracer
DataTracer is a Python library for solving Data Lineage problems using statistical methods, machine learning techniques, and hand-crafted heuristics.
Currently, the DataTracer library implements discovery of the following properties:
- Primary Key: Identify which column is the primary key in each table.
- Foreign Key: Find which relationships exist between the tables.
- Column Mapping: Given a field in a table, deduce which other fields, from the same table or from other tables, are most related to it or contributed the most to generating it.
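To make the first two properties concrete, here is a minimal sketch of two naive heuristics in plain Python. It treats a column as a primary key candidate when its values are unique and non-null, and as a foreign key candidate when all of its values appear in a primary key candidate of another table. The function names and the toy tables are hypothetical, and this is not how the DataTracer solvers are implemented; it only illustrates what the problems look like.

```python
# Illustrative heuristics only; NOT the DataTracer implementation.
# Tables are modelled as {column_name: list_of_values} dicts, an assumption
# made to keep the sketch free of third-party dependencies.

def primary_key_candidates(table):
    """Return the columns whose values are all unique and non-null."""
    candidates = []
    for column, values in table.items():
        if None not in values and len(set(values)) == len(values):
            candidates.append(column)
    return candidates


def foreign_key_candidates(tables):
    """Return (table, field, ref_table, ref_field) tuples where every value
    of `field` is contained in a primary key candidate of another table."""
    found = []
    for ref_name, ref_table in tables.items():
        for ref_field in primary_key_candidates(ref_table):
            ref_values = set(ref_table[ref_field])
            for name, table in tables.items():
                if name == ref_name:
                    continue  # only look for references across tables
                for field, values in table.items():
                    if values and set(values) <= ref_values:
                        found.append((name, field, ref_name, ref_field))
    return found


# Toy data, made up for this example.
toy_tables = {
    "customers": {"customerNumber": [1, 2, 3], "city": ["NYC", "NYC", "Paris"]},
    "orders": {"orderNumber": [10, 11, 12, 13], "customerNumber": [1, 1, 3, 2]},
}

print(primary_key_candidates(toy_tables["customers"]))
print(foreign_key_candidates(toy_tables))
```

Heuristics this simple break down quickly on real data, for example when unrelated integer columns happen to overlap, which is one reason for combining them with statistical and machine learning methods.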
The DataTracer library also incorporates a REST API that enables interaction with the DataTracer Solvers via HTTP communication.
DataTracer has been developed and tested on Python 3.5, 3.6, and 3.7.
Also, although it is not strictly required, the usage of a virtualenv is highly recommended in order to avoid interfering with other software installed in the system where DataTracer is run.
The easiest and recommended way to install DataTracer is using pip:
pip install datatracer
This will pull and install the latest stable release from PyPI.
If you want to install from source or contribute to the project please read the Contributing Guide.
The DataTracer library is prepared to work using datasets, which are a collection of tables loaded as pandas.DataFrames and a MetaData JSON which provides information about the dataset structure.
You can find more information about the MetaData format in the MetaData repository.
DataTracer also includes a few demo datasets which you can easily download to your computer using the datatracer.get_demo_data function:
from datatracer import get_demo_data
get_demo_data()
This will create a folder called datatracer_demo in your working directory with a few datasets ready to use inside it.
In this short tutorial we will guide you through a series of steps that will help you get started with DataTracer.
The first step will be to load the data in the format expected by DataTracer.
For this, we can use the datatracer.load_dataset function, passing the path to the dataset folder.
For example, if we want to use the classicmodels dataset included in the demo folder that we just created, we can load it using:
from datatracer import load_dataset
metadata, tables = load_dataset('datatracer_demo/classicmodels')
This will return a tuple which contains:
- A MetaData instance with details about the dataset.
- A dict with all the tables of the dataset, each loaded as a pandas.DataFrame.
In the DataTracer project, the different Data Lineage problems are solved using what we call solvers.
We can see the list of available solvers using the get_solvers function:
from datatracer import get_solvers
get_solvers()
which will return a list with their names:
['datatracer.column_map',
'datatracer.foreign_key.basic',
'datatracer.foreign_key.standard',
'datatracer.primary_key.basic']
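The names in this list appear to follow a dotted pattern in which the second component identifies the problem being solved (column_map, foreign_key, primary_key). Assuming that pattern holds, a small helper such as the hypothetical group_solvers below can organize the names by problem. The list is copied from the output above rather than obtained by calling get_solvers, so the sketch runs without DataTracer installed.

```python
from collections import defaultdict

# Copied from the output shown above.
solver_names = [
    'datatracer.column_map',
    'datatracer.foreign_key.basic',
    'datatracer.foreign_key.standard',
    'datatracer.primary_key.basic',
]


def group_solvers(names):
    """Group solver names by their second dotted component.

    Hypothetical helper; the naming pattern is inferred from the list above.
    """
    groups = defaultdict(list)
    for name in names:
        parts = name.split('.')
        problem = parts[1] if len(parts) > 1 else name
        groups[problem].append(name)
    return dict(groups)


for problem, names in group_solvers(solver_names).items():
    print(problem, names)
```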
In order to use the selected solver you will need to load it using the DataTracer class.
In this example, we will try to figure out the relationships between the tables in our dataset by using the solver datatracer.foreign_key.standard.
from datatracer import DataTracer
# Load the Solver
solver = DataTracer.load('datatracer.foreign_key.standard')
# Solve the Data Lineage problem
foreign_keys = solver.solve(tables)
The result will be a list of dictionaries, one for each foreign key candidate:
[{'table': 'products',
'field': 'productLine',
'ref_table': 'productlines',
'ref_field': 'productLine'},
{'table': 'payments',
'field': 'customerNumber',
'ref_table': 'customers',
'ref_field': 'customerNumber'},
{'table': 'orders',
'field': 'customerNumber',
'ref_table': 'customers',
'ref_field': 'customerNumber'},
{'table': 'orderdetails',
'field': 'productCode',
'ref_table': 'products',
'ref_field': 'productCode'},
{'table': 'orderdetails',
'field': 'orderNumber',
'ref_table': 'orders',
'ref_field': 'orderNumber'},
{'table': 'employees',
'field': 'officeCode',
'ref_table': 'offices',
'ref_field': 'officeCode'}]
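Because the output shown above is a plain list of dictionaries, it can be post-processed with ordinary Python. As a minimal sketch, the hypothetical helper below groups the candidates by the table they reference. The input is a subset of the output above, hard-coded so that the example does not depend on DataTracer being installed.

```python
from collections import defaultdict

# A subset of the candidates shown above, hard-coded for illustration.
foreign_keys = [
    {'table': 'payments', 'field': 'customerNumber',
     'ref_table': 'customers', 'ref_field': 'customerNumber'},
    {'table': 'orders', 'field': 'customerNumber',
     'ref_table': 'customers', 'ref_field': 'customerNumber'},
    {'table': 'orderdetails', 'field': 'orderNumber',
     'ref_table': 'orders', 'ref_field': 'orderNumber'},
]


def children_by_parent(candidates):
    """Map each referenced table to the 'table.field' entries pointing at it.

    Hypothetical helper, not part of the DataTracer API.
    """
    groups = defaultdict(list)
    for fk in candidates:
        groups[fk['ref_table']].append('{}.{}'.format(fk['table'], fk['field']))
    return dict(groups)


for parent, children in children_by_parent(foreign_keys).items():
    print(parent, '<-', ', '.join(children))
```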
You can learn more about the DataTracer features in the notebook tutorials.
Also don't forget to have a look at the DataTracer REST API.