luigi_gdb_pipeline_demo

An example to illustrate using Luigi to manage a data science workflow in Greenplum Database. For more info see: https://engineering.pivotal.io/post/luigi-data-science/

Use Case Description

The use case for this demo is a simple unsupervised approach for identifying anomalous or highly variant user network behavior based on activity patterns. The main idea is to create 24 matrices, one for each hour of the day. The rows of each matrix are indexed by network users and the columns by days of the week. Entry (i,j) of the matrix corresponding to hour k counts the number of events from user i during hour k of day j. The reason for creating a separate matrix for each hour is to account for differing volumes of traffic throughout the day. We then run principal component analysis (PCA) on each of these matrices and examine the large entries of the principal component directions, which correspond to 'erratic' users who drive most of the variance in the data. In later iterations we can extend the pipeline to use PCA for dimension reduction, clustering, and other more sophisticated methods.
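To make the matrix construction and PCA step concrete, below is a minimal NumPy sketch for a single hour's matrix on synthetic data. The real pipeline builds these matrices inside Greenplum and runs PCA in-database with MADlib; the synthetic data, array shapes, and use of a plain SVD here are assumptions for illustration only.

    import numpy as np

    # Toy stand-in for one hour's flow count matrix: rows are users, columns
    # are days of the week. (The real pipeline builds this table in GPDB.)
    rng = np.random.default_rng(0)
    n_users, n_days = 100, 7
    counts = rng.poisson(lam=5, size=(n_users, n_days)).astype(float)
    counts[3] += 40  # make user 3 behave erratically

    # Center the columns and take the leading singular vector in user space;
    # large-magnitude entries flag the users driving most of the variance.
    centered = counts - counts.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    outlier_score = np.abs(u[:, 0])

    top = np.argsort(outlier_score)[::-1][:5]
    print("most variant users:", top)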

We break up our workflow into 5 types of Luigi tasks, as follows (a minimal task sketch appears after the list):

  1. Initialize all the necessary schemas and user-defined functions for the pipeline.
  2. Create a table counting the number of flows for each user during each hour of the data set.
  3. For each of the 24 hours, use the output of step 2 to create a table holding a sparse matrix representation of that hour's flow count matrix.
  4. Run the MADlib PCA function to compute the principal direction for each of the 24 flow count matrices. The MADlib PCA algorithm is computed in parallel across the GPDB nodes.
  5. For each of the 24 principal directions, identify users with large entries and write a list of these users, with an outlier score (the magnitude of the entry), to a file for investigation.
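Below is a minimal sketch of how one such step (step 2, the hourly flow count table) could be written as a Luigi task using psycopg2 and the GPDB_* environment variables described in the setup steps. The class name, schema, table, and column names here are hypothetical; they are not the repository's actual code.

    import os
    import luigi
    import psycopg2

    class HourlyFlowCounts(luigi.Task):
        """Sketch of step 2: count flows per user, hour of day, and day of week."""

        def output(self):
            # Local marker file recording that the table has been built.
            return luigi.LocalTarget("target/hourly_flow_counts.done")

        def run(self):
            conn = psycopg2.connect(
                dbname=os.environ["GPDB_DATABASE"],
                user=os.environ["GPDB_USER"],
                password=os.environ["GPDB_PASSWORD"],
                host=os.environ["GPDB_HOST"],
                port=os.environ["GPDB_PORT"],
            )
            with conn, conn.cursor() as cur:
                # Hypothetical source table demo.events(user_id, event_time).
                cur.execute("""
                    CREATE TABLE demo.hourly_flow_counts AS
                    SELECT user_id,
                           EXTRACT(hour FROM event_time)::int AS hour_of_day,
                           EXTRACT(dow  FROM event_time)::int AS day_of_week,
                           COUNT(*) AS flow_count
                    FROM demo.events
                    GROUP BY 1, 2, 3
                    DISTRIBUTED BY (user_id);
                """)
            conn.close()
            with self.output().open("w") as f:
                f.write("done")

Downstream tasks (the per-hour sparse matrix tables and the MADlib PCA call) would then declare this task in their requires() so Luigi runs the steps in dependency order.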

Steps to run the demo:

  1. Install the Pivotal Greenplum Sandbox Virtual Machine, available as a free download from the Pivotal Network (https://network.pivotal.io/products/pivotal-gpdb#/releases/567/file_groups/337). Once the Sandbox is started with VirtualBox or VMware Fusion, starting GPDB is a two-step process:
     • Log in as gpadmin with the password provided at startup.
     • Run ./start_all.sh to start the GPDB instance.
  2. Create an ssh tunnel to make the GPDB instance available to a psycopg2 connection on localhost at port 9912:
     • $ ssh gpadmin@<VM IP address> -L 9912:127.0.0.1:5432
  3. Set environment variables so the scripts can connect to GPDB:
     • export GPDB_DATABASE=gpadmin
     • export GPDB_USER=gpadmin
     • export GPDB_PASSWORD=<vm_password>
     • export GPDB_HOST=localhost
     • export GPDB_PORT=9912
  4. Run write_sample_data_to_db.py to populate the database with fake data.

  5. Run the Luigi pipeline from inside pca_pipeline/:
     • $ luigid --background
     • $ python -m luigi --module pipeline PipelineTask

We can view the status of the pipeline and its dependency graph by opening a browser to http://localhost:8082. After the pipeline runs, the output files are written to pca_pipeline/target by default.
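Before running the pipeline it can be useful to confirm the ssh tunnel and environment variables are set up correctly. The snippet below is a hypothetical sanity check, not part of the repository; it also assumes MADlib is installed in the madlib schema, which the PCA step requires.

    import os
    import psycopg2

    conn = psycopg2.connect(
        dbname=os.environ["GPDB_DATABASE"],
        user=os.environ["GPDB_USER"],
        password=os.environ["GPDB_PASSWORD"],
        host=os.environ["GPDB_HOST"],
        port=os.environ["GPDB_PORT"],
    )
    with conn.cursor() as cur:
        cur.execute("SELECT version();")          # is GPDB reachable through the tunnel?
        print(cur.fetchone()[0])
        cur.execute("SELECT madlib.version();")   # is MADlib available for the PCA step?
        print(cur.fetchone()[0])
    conn.close()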
