GitHub - DAINTINESS-Group/Pythia: A data set profiler

Pythia

Java library that produces an automated statistical profile of an input dataset.

A standard dataset is a plain text file in which each line is a record whose fields are delimited by a separator (e.g. tab, comma, or pipe). After you register a dataset and declare which data analysis methods should be executed, the system produces a fully automatic statistical profile of the dataset and generates reports of its findings.
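
To make the expected input format concrete, here is a minimal sketch that splits one record into its fields using a separator. It is only an illustration of the file layout described above and is not part of Pythia; the class name `RecordSketch` and the method `splitRecord` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of the input format; not part of the Pythia API.
class RecordSketch {

    // Splits one line (a record) into its fields using a literal separator.
    // Pattern.quote treats separators such as "|" literally, and the -1 limit
    // keeps trailing empty fields instead of silently dropping them.
    static List<String> splitRecord(String line, String separator) {
        return Arrays.asList(line.split(Pattern.quote(separator), -1));
    }

    public static void main(String[] args) {
        String header = "name\tage\tmoney";
        String record = "Michael\t25\t20";
        System.out.println(splitRecord(header, "\t"));
        System.out.println(splitRecord(record, "\t"));
    }
}
```

Quoting the separator matters for pipe-delimited files, because `String.split` otherwise interprets `|` as a regular-expression operator.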

Setup

IntelliJ IDEA Installation Requirements

Install IntelliJ IDEA (the Community edition is free).
Import the project as a Maven project; it runs out of the box.

Eclipse Installation Requirements

Install Eclipse
Import the project as a Maven project.

Maven

The project uses the Maven Wrapper, so there is no need to install Maven on your system, as long as the JAVA_HOME environment variable points to your Java 8 installation folder.
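
If the wrapper fails to start, it can help to confirm what the environment reports. The following is a minimal sketch, assuming a hypothetical helper class `JavaHomeCheck` that is not part of the repository. It prints JAVA_HOME and the version of the JVM that runs the check. Note that this JVM is whichever `java` is on your PATH, which may differ from the installation JAVA_HOME points to.

```java
// Hypothetical helper; not part of the Pythia repository.
class JavaHomeCheck {

    // Java 8 reports its version in the legacy "1.8.x" form (e.g. "1.8.0_292").
    static boolean isJava8(String version) {
        return version != null && version.startsWith("1.8");
    }

    public static void main(String[] args) {
        String javaHome = System.getenv("JAVA_HOME");
        String version = System.getProperty("java.version");
        System.out.println("JAVA_HOME    = " + (javaHome == null ? "(not set)" : javaHome));
        System.out.println("java.version = " + version);
        System.out.println("Is Java 8    = " + isJava8(version));
    }
}
```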

🛠️ Build with Maven

Navigate to the root folder of the repo and run:

./mvnw clean install

Then wait for the build to finish.

After that, there should be a folder called target that includes two jar files:

Pythia-x.y.z-all-deps.jar and Pythia-x.y.z.jar

The difference is that the all-deps jar is an uber jar with all dependencies embedded, so you can import Pythia into a project and run it out of the box.

If you use the plain jar instead, you need to declare the Pythia dependencies in your own pom.xml file.

To run the driver Main method, navigate to the root folder of the repo and run:

java -jar target/Pythia-x.y.z-all-deps.jar

🧪 Run tests

Navigate to the root folder of the repo and run:

./mvnw test

Code Formatter

This project complies with Google's Java coding style and is formatted using the official Google Java formatter. You can follow the installation guide in the official GitHub repo to install it in your editor.

Note: Consider installing and running it so that the project keeps a consistent coding style.

To format all Java files from the command line, run the following in the root folder of the project:

java -jar google-java-format-x.y.z-all-deps.jar -i $(find . -type f -name "*.java")

Note: The formatter needs a Java 11 installation to run from the command line.

Usage

Suppose we want to generate a statistical profile of the following file:

name	age	money
Michael	25	20
John	21	15
Andy	30	1000
Justin	65	10000

Below is a sample Main class that showcases API usage in simple steps for the above dataset:

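import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// StructType, StructField, DataTypes, Metadata and AnalysisException are
// assumed here to be the Spark SQL classes of the same names.
import org.apache.spark.sql.AnalysisException;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// The imports for the Pythia classes used below (IDatasetProfiler, Rule, RuleSet, etc.)
// are omitted; add them from the Pythia packages.

public class Main {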
    public static void main(String[] args) throws AnalysisException, IOException {
		
        // 1. Initialize a DatasetProfiler object (this is the main engine interface of Pythia).
        IDatasetProfiler datasetProfiler = new IDatasetProfilerFactory().createDatasetProfiler();

        // 2. Specify the schema, an alias and the path of the input dataset.
        StructType schema =
                new StructType(
                        new StructField[]{
                                new StructField("name", DataTypes.StringType, true, Metadata.empty()),
                                new StructField("age", DataTypes.IntegerType, true, Metadata.empty()),
                                new StructField("money", DataTypes.IntegerType, true, Metadata.empty()),
                        });
        String alias = "people";
        String path = String.format(
                "src%stest%sresources%sdatasets%speople.json",
                File.separator, File.separator, File.separator, File.separator);
        
        // 3. Register the input dataset specified in step 2 into Pythia.
        datasetProfiler.registerDataset(alias, path, schema);

        // 4. Specify labeling rules for a column and a name for the new labeled column.
        List<Rule> rules =
                new ArrayList<Rule>(
                        Arrays.asList(
                                new Rule("money", LabelingSystemConstants.LEQ, 20, "poor"),
                                new Rule("money", LabelingSystemConstants.LEQ, 1000, "mid"),
                                new Rule("money", LabelingSystemConstants.GT, 1000, "rich")));
        String newColumnName = "money_labeled";
        
        // 5. Create a RuleSet object and compute the new labeled column
        // (steps 4 & 5 can be repeated multiple times).
        RuleSet ruleSet = new RuleSet(newColumnName, rules);
        datasetProfiler.computeLabeledColumn(ruleSet);
        
        // 6. Specify the DominanceColumnSelectionMode and (optionally) a list of 
        // measurement & coordinate columns used in dominance pattern identification.
        DominanceColumnSelectionMode mode = DominanceColumnSelectionMode.USER_SPECIFIED_ONLY;
        String[] measurementColumns = new String[] { "money", "age" };
        String[] coordinateColumns =  new String[] { "name" };
        
        // 7. Declare the specified dominance parameters into Pythia
        // (steps 6 & 7 are optional, however, they are a prerequisite for highlight patterns identification).
        datasetProfiler.declareDominanceParameters(mode, measurementColumns, coordinateColumns);

        // 8. Specify the auxiliary data output directory and the desired parts of the analysis procedure
        // that should get executed for the computation of the dataset profile.
        String auxiliaryDataOutputDirectory = "results";
        boolean shouldRunDescriptiveStats = true;
        boolean shouldRunHistograms = true;
        boolean shouldRunAllPairsCorrelations = true;
        boolean shouldRunDecisionTrees = true;
        boolean shouldRunHighlightPatterns = true;
        
        // 9. Create a DatasetProfilerParameters object with the parameters specified in step 8
        // and compute the profile of the dataset (this will take a while for big datasets).
        DatasetProfilerParameters parameters =  new DatasetProfilerParameters(
                auxiliaryDataOutputDirectory,
                shouldRunDescriptiveStats,
                shouldRunHistograms,
                shouldRunAllPairsCorrelations,
                shouldRunDecisionTrees,
                shouldRunHighlightPatterns);
        datasetProfiler.computeProfileOfDataset(parameters);

        // 10. (Optionally) specify an output directory path for the generated reports
        // (unspecified output directory path means that the reports will be generated under the 
        // auxiliary data output directory specified in step 8).
        String outputDirectoryPath = "";
        
        // 11. Generate a report in plain text and markdown format.
        datasetProfiler.generateReport(ReportGeneratorConstants.TXT_REPORT, outputDirectoryPath);
        datasetProfiler.generateReport(ReportGeneratorConstants.MD_REPORT, outputDirectoryPath);
    }
}
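
To illustrate what the labeling rules in step 4 express, the sketch below applies the same three thresholds to the `money` values of the sample dataset in plain Java. It does not use or reproduce Pythia's implementation. The class `LabelingSketch` and the method `label` are hypothetical, and the sketch assumes that rules are checked in declaration order with the first match winning, which this README does not confirm.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the step 4 rules; not Pythia's implementation.
class LabelingSketch {

    // Assumption: rules are checked in declaration order and the first match wins.
    static String label(int money) {
        if (money <= 20) {
            return "poor"; // mirrors Rule("money", LEQ, 20, "poor")
        }
        if (money <= 1000) {
            return "mid";  // mirrors Rule("money", LEQ, 1000, "mid")
        }
        return "rich";     // mirrors Rule("money", GT, 1000, "rich")
    }

    public static void main(String[] args) {
        // The name and money columns of the sample dataset, in file order.
        Map<String, Integer> people = new LinkedHashMap<>();
        people.put("Michael", 20);
        people.put("John", 15);
        people.put("Andy", 1000);
        people.put("Justin", 10000);

        // Print each person with the value a "money_labeled" column would hold.
        people.forEach((name, money) ->
                System.out.println(name + "\t" + money + "\t" + label(money)));
    }
}
```

Under that assumption, Michael and John are labeled poor, Andy is labeled mid, and Justin is labeled rich. The order matters because a value such as 15 satisfies both LEQ rules.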

Contributors

Alexiou Alexandros
Charisis Alexandros (Youtube)
Christodoulos Antoniou
Dimos Gkitsakis
George Karathanos
Lampros Vlachopoulos
Panos Vassiliadis
