Backported docs
tovbinm committed Aug 22, 2018
1 parent 93b7f0f commit afad44c
Showing 50 changed files with 2,241 additions and 1,151 deletions.
2 changes: 2 additions & 0 deletions .gitignore
Expand Up @@ -27,3 +27,5 @@ gradlew.bat
derby.log
metastore_db/
*.bak

docs/_build
20 changes: 20 additions & 0 deletions docs/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
SPHINXPROJ = TransmogrifAI
SOURCEDIR = .
BUILDDIR = _build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
25 changes: 17 additions & 8 deletions docs/README.md
@@ -1,10 +1,19 @@
# TransmogrifAI Docs
# Docs

- `git clone https://github.com/salesforce/TransmogrifAI.git` - clone TransmogrifAI repo
- `cd ./TransmogrifAI` - go to cloned directory
- `./gradlew docs:buildDocs` - build documentation files
- `./gradlew docs:serve` - run a web server to serve the docs (Ctrl-C to stop).
- `open http://localhost:3000` or visit http://localhost:3000 in your browser
[Sphinx](http://www.sphinx-doc.org) based docs site hosted on [ReadTheDocs](https://readthedocs.org/projects/transmogrifai).

You can also run `./gradlew docs:buildDocs --continuous` in one terminal to automatically rebuild the docs when
something changes, then run `./gradlew docs:serve` in another terminal to run the web server.
## Running locally

If you wish to run the docs locally, install the following dependencies:
```bash
pip install sphinx sphinx-autobuild recommonmark sphinx_rtd_theme
```

Then simply run:
```bash
cd docs
make html
sphinx-autobuild . _build/html
```

Browse to http://localhost:8000
Empty file.
76 changes: 76 additions & 0 deletions docs/abstractions/index.md
@@ -0,0 +1,76 @@
# Abstractions

TransmogrifAI is designed to simplify the creation of machine learning workflows. To this end, it provides an abstraction for creating and running these workflows. The abstraction is made up of Features, Stages, Workflows, and Readers, which interact as shown in the diagram below.

![TransmogrifAI Abstractions](https://github.com/salesforce/TransmogrifAI/raw/master/resources/AbstractionDiagram-cropped.png)

## Features

The primary abstraction introduced in TransmogrifAI is that of a Feature. A Feature is essentially a type-safe pointer to a column in a DataFrame and contains all information about that column -- its name, the type of data to be found in it, and lineage information about how it was derived. Features are defined using FeatureBuilders:

```scala
val name: Feature[Text] = FeatureBuilder.Text[Passenger].extract(_.name.toText).asPredictor
val age: Feature[RealNN] = FeatureBuilder.RealNN[Passenger].extract(_.age.toRealNN).asPredictor
```

The above lines of code define two ```Features``` of type ```Text``` and ```RealNN``` called ```name``` and ```age``` that are extracted from data of type ```Passenger``` by applying the stated extract methods.

One can also define Features that are the result of complex time-series aggregates. Take a look at this [example](../examples/Time-Series-Aggregates-and-Joins.html) and this [page](../developer-guide#aggregate-data-readers) for more advanced reading on FeatureBuilders.
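
As a rough sketch of what such an aggregate definition can look like (the `fare` field, the `SumReal` aggregator, and the joda-time window below are illustrative assumptions, not part of the Passenger example above):

```scala
// Hypothetical aggregate feature: total fare per passenger over a trailing 7-day window.
// Assumes a `fare` field on Passenger; aggregator and window names may differ by version.
import com.salesforce.op.aggregators.SumReal
import org.joda.time.Duration

val totalFare: Feature[Real] = FeatureBuilder.Real[Passenger]
  .extract(_.fare.toReal)
  .aggregate(SumReal)
  .window(Duration.standardDays(7))
  .asPredictor
```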

Features can then be manipulated using Stages to produce new Features. In TransmogrifAI, as in SparkML, there are two types of Stages -- Transformers and Estimators.

## Stages

### Transformers

Transformers specify functions for transforming one or more Features to one or more *new* Features. Here is an example of applying a tokenizing Transformer to the ```name``` Feature defined above:

```scala
val nameTokens = new TextTokenizer[Text]().setAutoDetectLanguage(true).setInput(name).getOutput()
```

The output ```nameTokens``` is a new Feature of type ```TextList```. Because Features are strongly typed, it is also possible to create shortcuts for these Transformers and create a Feature operation syntax. The above line could alternatively have been written as:

```scala
val nameTokens = name.tokenize()
```
TransmogrifAI provides an easy way to wrap all Spark Transformers, and additionally provides many Transformers of its own. For more reading about creating new Transformers and shortcuts, follow the links [here](../developer-guide#transformers) and [here](../developer-guide#creating-shortcuts-for-transformers-and-estimators).

### Estimators

Estimators specify algorithms that can be applied to one or more Features to produce Transformers that in turn produce new Features. Think of Estimators as learning algorithms that need to be fit to the data in order to then be able to transform it. Users of TransmogrifAI do not need to worry about fitting these algorithms; this happens automatically behind the scenes when a TransmogrifAI workflow is trained. Below we see an example of a bucketizing estimator that determines the buckets that maximize information gain when fit to the data, and then transforms the Feature ```age``` to a new bucketized Feature of type ```OPVector```:

```scala
val bucketizedAge = new DecisionTreeNumericBucketizer[Double, Real]().setInput(label, age).getOutput()
```

Similar to Transformers above, one can easily create shortcuts for Estimators, and so the line of code above could have been alternatively written as:

```scala
val bucketizedAge = age.autoBucketize(label = label)
```
TransmogrifAI provides an easy way to wrap all Spark Estimators, and additionally provides many Estimators of its own. For more reading about creating new Estimators follow the link [here](../developer-guide#estimators).

## Workflows and Readers

Once all the Features and Feature transformations have been defined, actual data can be materialized by adding the desired Features to a TransmogrifAI Workflow and feeding it a DataReader. When the Workflow is trained, it infers the entire DAG of Features, Transformers, and Estimators that is needed to materialize the result Features. It then prepares this DAG by passing the data specified by the DataReader through it and fitting all of the intermediate Estimators so that they become Transformers.

In the example below, we would like to materialize ```bucketizedAge``` and ```nameTokens```. So we set these two Features as the result Features for a new Workflow:

```scala
val workflow = new OPWorkflow().setResultFeatures(bucketizedAge, nameTokens).setReader(PassengerReader)
```

The PassengerReader is a DataReader that essentially specifies a ```read``` method that can be used for loading the Passenger data. When we train this workflow, it reads the Passenger data and fits the bucketization estimator by determining the optimal buckets for ```age```:

```scala
val workflowModel = workflow.train()
```
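
For illustration, a reader like ```PassengerReader``` could be defined roughly as follows (a sketch assuming the data lives in CSV files with an `id` field to key on; the path and field name are hypothetical):

```scala
// Hypothetical reader definition: the path and the `id` key field are illustrative assumptions.
val PassengerReader = DataReaders.Simple.csvCase[Passenger](
  path = Option("/path/to/passenger/data"), // where the training data lives (hypothetical)
  key = _.id.toString                       // unique key for each record
)
```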

The workflowModel now has a prepped DAG of Transformers. By calling the ```score``` method on the workflowModel, we can transform any data of type Passenger into a DataFrame with two columns for ```bucketizedAge``` and ```nameTokens```:

```scala
val dataFrame = workflowModel.setReader(OtherPassengerReader).score()
```

WorkflowModels can be saved and loaded. For more advanced reading on topics like stacking workflows, aggregate DataReaders for time-series data, or joins for DataReaders, follow our links to [Workflows](../developer-guide#workflows) and [Readers](../developer-guide#datareaders).
69 changes: 69 additions & 0 deletions docs/automl-capabilities/index.md
@@ -0,0 +1,69 @@
# AutoML Capabilities

## Vectorizers and Transmogrification

This is the Stage that automates the feature engineering step in the machine learning pipeline.

The TransmogrifAI [transmogrifier](https://github.com/salesforce/TransmogrifAI/blob/master/core/src/main/scala/com/salesforce/op/stages/impl/feature/Transmogrifier.scala) (shortcut ```.transmogrify()```) takes in a sequence of features, automatically applies default transformations to them based on feature types (e.g. imputation, null value tracking, one-hot encoding, tokenization, splitting emails and pivoting out the top K domains) and combines them into a single vector.

```scala
val features = Seq(email, phone, age, subject, zipcode).transmogrify()
```


If you want to do feature engineering at the level of a single feature, you can do so in combination with automatic type-specific transformations. Each feature type has an associated ```.vectorize(....)``` method that will transform the feature into a feature vector given some input parameters. Each ```.vectorize(....)``` method behaves differently according to the type of feature being transformed.

```scala
val emailFeature = email.vectorize()
val features = Seq(emailFeature, phone, age, subject, zipcode).transmogrify()
```

For advanced users, you can also completely [customize automatic feature engineering](../developer-guide#transmogrification).

## Feature Validation

#### SanityChecker

This is the Stage that automates the feature selection step in the machine learning pipeline.

The SanityChecker is an Estimator that can analyze a particular dataset for obvious issues prior to fitting a model on it. It applies a variety of statistical tests to the data based on Feature types and discards predictors that are indicative of [label leakage](http://machinelearningmastery.com/data-leakage-machine-learning/) or that show little to no predictive power. In addition to flagging and fixing data issues, the SanityChecker also outputs statistics about the data for diagnostics and insight generation further down the ML pipeline.

The SanityChecker can be instantiated as follows:

```scala
// Add sanity checker estimator
val checkedFeatures = new SanityChecker().setRemoveBadFeatures(true).setInput(label, features).getOutput()
```
For advanced users, check out how to [customize default parameters](../developer-guide#sanitychecker) and peek into the SanityChecker metadata using model insights.

#### RawFeatureFilter

One of the fundamental assumptions of machine learning is that the data you are using to train your model reflects the data that you wish to score. In the real world, this assumption is often not true. TransmogrifAI has an optional stage after data reading that allows you to check that your features do not violate this assumption and remove any features that do. This stage is called the [RawFeatureFilter](https://github.com/salesforce/TransmogrifAI/blob/master/core/src/main/scala/com/salesforce/op/filters/RawFeatureFilter.scala), and to use it you call the method `withRawFeatureFilter(Option(trainReader), Option(scoreReader),...)` on your [Workflows](../developer-guide#workflows). This method takes the training and scoring data readers as inputs.

```scala
// Add raw feature filter estimator
val workflow =
  new OpWorkflow()
    .setResultFeatures(survived, rawPrediction, prob, prediction)
    .withRawFeatureFilter(Option(trainReader), Option(scoreReader), None)
```

It will load the training and scoring data and exclude individual features based on their fill rate, relative fill rates between training and scoring, or differences in the distribution of data between training and scoring. This stage can eliminate many issues, such as leakage of information that is only filled out after the label is known and changes in data collection practices, before they affect your model.
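
As a sketch, the exclusion thresholds can be tuned when adding the filter (the parameter names below, such as `minFillRate` and `maxFillDifference`, are assumptions and may differ between versions; consult the developer guide for the exact signature):

```scala
// A sketch with assumed parameter names for tuning feature exclusion.
val filteredWorkflow =
  new OpWorkflow()
    .setResultFeatures(survived, rawPrediction, prob, prediction)
    .withRawFeatureFilter(
      Option(trainReader), Option(scoreReader),
      minFillRate = 0.01,       // drop features filled in less than 1% of training rows
      maxFillDifference = 0.90, // drop features whose fill rate differs too much between training and scoring
      maxJSDivergence = 0.90    // drop features whose distributions diverge between training and scoring
    )
```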

For advanced users, check out how to set [optional parameters](../developer-guide#rawfeaturefilter) for when to exclude features.


## ModelSelectors

This is the Stage that automates the model selection step in the machine learning pipeline.

TransmogrifAI will select the best model and hyper-parameters for you based on the class of modeling you are doing (e.g. Classification, Regression, etc.).
Smart model selection and comparison give the next layer of improvements over traditional ML workflows.

```scala
val (pred, raw, prob) = BinaryClassificationModelSelector().setInput(label, features).getOutput()
```

The ModelSelector is an Estimator that uses data to find the best model. BinaryClassificationModelSelector is for binary classification tasks, while multiclass classification tasks can be handled using MultiClassificationModelSelector. Regression models are selected through the RegressionModelSelector. Currently the possible classification models that can be applied in the selector are `LogisticRegression`, `DecisionTrees`, `RandomForest` and `NaiveBayes`. The possible regression models are `LinearRegression`, `DecisionTrees`, `RandomForest` and `GBTrees`. The best model is selected via CrossValidation or TrainingSplit, by picking the best SparkML model and wrapping it. It is also possible to perform hyperparameter tuning for each model through a grid search.
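
For example, a sketch of supplying cross validation settings when constructing the selector (the `withCrossValidation` factory and the `numFolds`/`seed` parameter names are assumptions and may vary by version):

```scala
// A sketch: factory and parameter names are assumed and may differ across versions.
val (pred, raw, prob) = BinaryClassificationModelSelector
  .withCrossValidation(numFolds = 3, seed = 42L)
  .setInput(label, features)
  .getOutput()
```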

For advanced users, check out how to set CrossValidation parameters, balance datasets and customize hyperparameter-tuning [here](../developer-guide#modelselector).
58 changes: 0 additions & 58 deletions docs/build.gradle

This file was deleted.

