Eric Kimbrel edited this page Mar 19, 2014 · 5 revisions

See Why Correlation Approximation for background information.

Prerequisites

This project requires the following

Process

The correlation approximation system runs a two-step process.

  1. Training Phase - loading your data.

    A large set of time series data (or any numeric vectors) is read into the system and reduced to several smaller projections of the data. K-means centroids are found for each projection. The projections, reduced vectors, and centroids are cached for use in the next phase. For a complete description of the algorithms, see The Google White Paper. The number of projections, the dimensions of each projection, and the number of centroids to calculate are all configurable.

  2. Test Phase - testing a new vector against the cached data

    In this phase the system loads the reduced vectors, projection data, and centroids from the training phase and uses them to quickly find the top N (default: 100) most highly correlated vectors from your data set.
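The two phases can be illustrated with a small, single-machine sketch. This is not the project's implementation or the exact algorithm from the white paper: the Gaussian random projections, the plain Lloyd-style k-means, the candidate-bucket lookup, and every function and parameter name below (`train`, `query`, `num_projections`, `proj_dim`, `k`) are assumptions chosen only to show the general shape of the process.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def normalize(v):
    """Center and scale to unit length so that nearby points are highly correlated."""
    m = sum(v) / len(v)
    centered = [x - m for x in v]
    norm = math.sqrt(sum(x * x for x in centered)) or 1.0
    return [x / norm for x in centered]

def project(v, matrix):
    """Reduce v to len(matrix) dimensions (one dot product per matrix row)."""
    return [sum(r * x for r, x in zip(row, v)) for row in matrix]

def nearest(p, centroids):
    """Index of the centroid closest to p (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))

def kmeans(points, k, iters, rng):
    """Plain Lloyd iterations; an empty cluster keeps its previous centroid."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        centroids = [[sum(col) / len(g) for col in zip(*g)] if g else c
                     for g, c in zip(groups, centroids)]
    return centroids

def train(data, num_projections=4, proj_dim=2, k=2, seed=0):
    """Training phase: build projections, reduced vectors, and centroids."""
    rng = random.Random(seed)
    keys = list(data)
    dim = len(data[keys[0]])
    normed = {key: normalize(data[key]) for key in keys}
    model = []
    for _ in range(num_projections):
        matrix = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(proj_dim)]
        reduced = {key: project(normed[key], matrix) for key in keys}
        centroids = kmeans(list(reduced.values()), k, 10, rng)
        buckets = {}
        for key in keys:
            buckets.setdefault(nearest(reduced[key], centroids), []).append(key)
        model.append((matrix, centroids, buckets))
    return model

def query(vector, data, model, top_n=100):
    """Test phase: collect candidates near the query, then rank them exactly."""
    q = normalize(vector)
    candidates = set()
    for matrix, centroids, buckets in model:
        bucket = nearest(project(q, matrix), centroids)
        candidates.update(buckets.get(bucket, []))
    scored = [(pearson(vector, data[key]), key) for key in candidates]
    return sorted(scored, reverse=True)[:top_n]
```

In this sketch, `train` returns an in-memory list where the real system caches its results, and `query` only computes exact correlations for vectors that share a cluster with the query in at least one projection. Using several projections is what gives a highly correlated vector more than one chance to end up in the candidate set.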

Training Phase Input

We currently take a text file (local or hdfs) for input. The text must be two tab-separated columns, where the first column is a string key and the second column is a vector representing your time series (as a comma-separated list of Doubles). For an example, see test_data.tsv. All vectors must be of equal length.
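As a rough illustration of this format, the following sketch parses lines of the form `key<TAB>v1,v2,...` and enforces the equal-length rule. The helper names `parse_line` and `load_vectors` and the sample rows in the usage note are hypothetical; they are not part of the project and are not taken from test_data.tsv.

```python
def parse_line(line):
    """Split one 'key<TAB>v1,v2,...' line into (key, list of floats)."""
    key, _, values = line.rstrip("\n").partition("\t")
    if not values:
        raise ValueError("expected two tab-separated columns: %r" % line)
    return key, [float(x) for x in values.split(",")]

def load_vectors(lines):
    """Parse every non-blank line and check that all vectors share one length."""
    vectors = {}
    length = None
    for number, line in enumerate(lines, 1):
        if not line.strip():
            continue  # skip blank lines
        key, vector = parse_line(line)
        if length is None:
            length = len(vector)
        elif len(vector) != length:
            raise ValueError("line %d: expected %d values, got %d"
                             % (number, length, len(vector)))
        vectors[key] = vector
    return vectors
```

For example, `load_vectors(["series_a\t1.0,2.5,3.0\n", "series_b\t0.5,0.5,4.0\n"])` yields a dictionary with two three-element vectors, while a row with a different number of values raises a `ValueError`.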

Training Phase Output

Output data from the training phase is written as object files (not human readable) to local files or to hdfs.

Test Phase Input and Output

Bulk Mode

Bulk mode is a method to test the system's performance and accuracy by correlating all the vectors in the system against each other. No additional input is required; the system uses the original data from the training phase. Output is written to a local or hdfs file.
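One way to read "accuracy" here is to compare approximate results with an exact, brute-force ranking. The sketch below is an assumption about how such a check could look, not a description of what bulk mode actually writes out: `exact_top_n` and `recall` are hypothetical helpers, and excluding the query vector from its own result list is a choice made for this example.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def exact_top_n(key, data, n):
    """Brute force: rank every other vector by its correlation with data[key]."""
    scored = [(pearson(data[key], v), k) for k, v in data.items() if k != key]
    return [k for _, k in sorted(scored, reverse=True)[:n]]

def recall(approx_keys, exact_keys):
    """Fraction of the exact top-N keys that the approximate result also found."""
    if not exact_keys:
        return 1.0
    return len(set(approx_keys) & set(exact_keys)) / len(exact_keys)
```

Averaging `recall` over every key in the data set gives a single accuracy figure, at the cost of a quadratic number of exact correlations, so it is only practical on small samples.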

Interactive Mode

Interactive Mode is a simple command line program. You specify some configuration information on the command line, and you can then enter time series data as a comma-separated list of doubles. For each time series you enter, the program returns the most highly correlated vectors from the training set.

Batch Mode (coming soon)

A command line tool for correlating all vectors in a given input file (local or hdfs) and writing the results to an output file (local or hdfs).

What's next?

Try it out with a simple example