Backported docs
tovbinm committed Aug 22, 2018
1 parent 93b7f0f commit afad44c
Showing 50 changed files with 2,241 additions and 1,151 deletions.
2 changes: 2 additions & 0 deletions .gitignore
Expand Up @@ -27,3 +27,5 @@ gradlew.bat
derby.log
metastore_db/
*.bak

docs/_build
20 changes: 20 additions & 0 deletions docs/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
SPHINXPROJ = TransmogrifAI
SOURCEDIR = .
BUILDDIR = _build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
25 changes: 17 additions & 8 deletions docs/README.md
@@ -1,10 +1,19 @@
# TransmogrifAI Docs
# Docs

- `git clone https://github.com/salesforce/TransmogrifAI.git` - clone TransmogrifAI repo
- `cd ./TransmogrifAI` - go to cloned directory
- `./gradlew docs:buildDocs` - build documentation files
- `./gradlew docs:serve` - run a web server to serve the docs (Ctrl-C to stop).
- `open http://localhost:3000` or visit http://localhost:3000 in your browser
[Sphinx](http://www.sphinx-doc.org) based docs site hosted on [ReadTheDocs](https://readthedocs.org/projects/transmogrifai).

You can also run `./gradlew docs:buildDocs --continuous` in one terminal to automatically rebuild the docs when
something changes, then run `./gradlew docs:serve` in another terminal to run the web server.
## Running locally

If you wish to run the docs locally, install the following dependencies:
```bash
pip install sphinx sphinx-autobuild recommonmark sphinx_rtd_theme
```

Then simply run:
```bash
cd docs
make html
sphinx-autobuild . _build/html
```

Browse to http://localhost:8000
Empty file.
76 changes: 76 additions & 0 deletions docs/abstractions/index.md
@@ -0,0 +1,76 @@
# Abstractions

TransmogrifAI is designed to simplify the creation of machine learning workflows. To this end, it provides an abstraction for creating and running these workflows. The abstraction is made up of Features, Stages, Workflows, and Readers, which interact as shown in the diagram below.

![TransmogrifAI Abstractions](https://github.com/salesforce/TransmogrifAI/raw/master/resources/AbstractionDiagram-cropped.png)

## Features

The primary abstraction introduced in TransmogrifAI is that of a Feature. A Feature is essentially a type-safe pointer to a column in a DataFrame and contains all information about that column -- its name, the type of data to be found in it, and lineage information about how it was derived. Features are defined using FeatureBuilders:

```scala
val name: Feature[Text] = FeatureBuilder.Text[Passenger].extract(_.name.toText).asPredictor
val age: Feature[RealNN] = FeatureBuilder.RealNN[Passenger].extract(_.age.toRealNN).asPredictor
```

The above lines of code define two ```Features``` of type ```Text``` and ```RealNN``` called ```name``` and ```age``` that are extracted from data of type ```Passenger``` by applying the stated extract methods.

One can also define Features that are the result of complex time-series aggregates. Take a look at this [example](../examples/Time-Series-Aggregates-and-Joins.html) and this [page](../developer-guide#aggregate-data-readers) for more advanced reading on FeatureBuilders.
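
As a rough sketch of what such an aggregate definition can look like (the `fare` field, the `SumReal` aggregator, and the joda-time window below are illustrative assumptions, not part of the Passenger example above):

```scala
// Hypothetical aggregate feature: total fare per passenger over a trailing 7-day window.
// Assumes a `fare` field on Passenger; aggregator and window names may differ by version.
import com.salesforce.op.aggregators.SumReal
import org.joda.time.Duration

val totalFare: Feature[Real] = FeatureBuilder.Real[Passenger]
  .extract(_.fare.toReal)
  .aggregate(SumReal)
  .window(Duration.standardDays(7))
  .asPredictor
```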

Features can then be manipulated using Stages to produce new Features. In TransmogrifAI, as in SparkML, there are two types of Stages -- Transformers and Estimators.

## Stages

### Transformers

Transformers specify functions for transforming one or more Features to one or more *new* Features. Here is an example of applying a tokenizing Transformer to the ```name``` Feature defined above:

```scala
val nameTokens = new TextTokenizer[Text]().setAutoDetectLanguage(true).setInput(name).getOutput()
```

The output ```nameTokens``` is a new Feature of type ```TextList```. Because Features are strongly typed, it is also possible to create shortcuts for these Transformers and create a Feature operation syntax. The above line could alternatively have been written as:

```scala
val nameTokens = name.tokenize()
```
TransmogrifAI provides an easy way to wrap all Spark Transformers, and additionally provides many Transformers of its own. For more reading about creating new Transformers and shortcuts, follow the links [here](../developer-guide#transformers) and [here](../developer-guide#creating-shortcuts-for-transformers-and-estimators).

### Estimators

Estimators specify algorithms that can be applied to one or more Features to produce Transformers that in turn produce new Features. Think of Estimators as learning algorithms that need to be fit to the data in order to then be able to transform it. Users of TransmogrifAI do not need to worry about fitting these algorithms; this happens automatically behind the scenes when a TransmogrifAI workflow is trained. Below we see an example of a bucketizing estimator that determines the buckets that maximize information gain when fit to the data, and then transforms the Feature ```age``` to a new bucketized Feature of type ```OPVector```:

```scala
val bucketizedAge = new DecisionTreeNumericBucketizer[Double, Real]().setInput(label, age).getOutput()
```

Similar to Transformers above, one can easily create shortcuts for Estimators, and so the line of code above could have been alternatively written as:

```scala
val bucketizedAge = age.autoBucketize(label = label)
```
TransmogrifAI provides an easy way to wrap all Spark Estimators, and additionally provides many Estimators of its own. For more reading about creating new Estimators follow the link [here](../developer-guide#estimators).

## Workflows and Readers

Once all the Features and Feature transformations have been defined, actual data can be materialized by adding the desired Features to a TransmogrifAI Workflow and feeding it a DataReader. When the Workflow is trained, it infers the entire DAG of Features, Transformers, and Estimators that is needed to materialize the result Features. It then prepares this DAG by passing the data specified by the DataReader through it and fitting all of the intermediate Estimators so that they become Transformers.

In the example below, we would like to materialize ```bucketizedAge``` and ```nameTokens```. So we set these two Features as the result Features for a new Workflow:

```scala
val workflow = new OPWorkflow().setResultFeatures(bucketizedAge, nameTokens).setReader(PassengerReader)
```

The PassengerReader is a DataReader that essentially specifies a ```read``` method that can be used for loading the Passenger data. When we train this workflow, it reads the Passenger data and fits the bucketization estimator by determining the optimal buckets for ```age```:

```scala
val workflowModel = workflow.train()
```
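
For illustration, a reader like ```PassengerReader``` could be defined roughly as follows (a sketch assuming the data lives in CSV files with an `id` field to key on; the path and field name are hypothetical):

```scala
// Hypothetical reader definition: the path and the `id` key field are illustrative assumptions.
val PassengerReader = DataReaders.Simple.csvCase[Passenger](
  path = Option("/path/to/passenger/data"), // where the training data lives (hypothetical)
  key = _.id.toString                       // unique key for each record
)
```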

The workflowModel now has a prepped DAG of Transformers. By calling the ```score``` method on the workflowModel, we can transform any data of type Passenger into a DataFrame with two columns for ```bucketizedAge``` and ```nameTokens```:

```scala
val dataFrame = workflowModel.setReader(OtherPassengerReader).score()
```

WorkflowModels can be saved and loaded. For more advanced reading on topics like stacking workflows, aggregate DataReaders for time-series data, or joins for DataReaders, follow our links to [Workflows](../developer-guide#workflows) and [Readers](../developer-guide#datareaders).
69 changes: 69 additions & 0 deletions docs/automl-capabilities/index.md
@@ -0,0 +1,69 @@
# AutoML Capabilities

## Vectorizers and Transmogrification

This is the Stage that automates the feature engineering step in the machine learning pipeline.

The TransmogrifAI [transmogrifier](https://github.com/salesforce/TransmogrifAI/blob/master/core/src/main/scala/com/salesforce/op/stages/impl/feature/Transmogrifier.scala) (shortcut ```.transmogrify()```) takes in a sequence of features, automatically applies default transformations to them based on feature types (e.g. imputation, null value tracking, one-hot encoding, tokenization, splitting emails and pivoting out the top K domains) and combines them into a single vector.

```scala
val features = Seq(email, phone, age, subject, zipcode).transmogrify()
```


If you want to do feature engineering at the level of a single feature, you can do so in combination with automatic type-specific transformations. Each feature type has an associated ```.vectorize(....)``` method that will transform the feature into a feature vector given some input parameters. Each ```.vectorize(....)``` method behaves differently according to the type of feature being transformed.

```scala
val emailFeature = email.vectorize()
val features = Seq(emailFeature, phone, age, subject, zipcode).transmogrify()
```

For advanced users, you can also completely [customize automatic feature engineering](../developer-guide#transmogrification).

## Feature Validation

#### SanityChecker

This is the Stage that automates the feature selection step in the machine learning pipeline.

The SanityChecker is an Estimator that can analyze a particular dataset for obvious issues prior to fitting a model on it. It applies a variety of statistical tests to the data based on Feature types and discards predictors that are indicative of [label leakage](http://machinelearningmastery.com/data-leakage-machine-learning/) or that show little to no predictive power. In addition to flagging and fixing data issues, the SanityChecker also outputs statistics about the data for diagnostics and insight generation further down the ML pipeline.

The SanityChecker can be instantiated as follows:

```scala
// Add sanity checker estimator
val checkedFeatures = new SanityChecker().setRemoveBadFeatures(true).setInput(label, features).getOutput()
```
For advanced users, check out how to [customize default parameters](../developer-guide#sanitychecker) and peek into the SanityChecker metadata using model insights.

#### RawFeatureFilter

One of the fundamental assumptions of machine learning is that the data you are using to train your model reflects the data that you wish to score. In the real world, this assumption is often not true. TransmogrifAI has an optional stage after data reading that allows you to check that your features do not violate this assumption and remove any features that do. This stage is called the [RawFeatureFilter](https://github.com/salesforce/TransmogrifAI/blob/master/core/src/main/scala/com/salesforce/op/filters/RawFeatureFilter.scala), and to use it you call the method `withRawFeatureFilter(Option(trainReader), Option(scoreReader),...)` on your [Workflows](../developer-guide#workflows). This method takes the training and scoring data readers as inputs.

```scala
// Add raw feature filter estimator
val workflow =
  new OpWorkflow()
    .setResultFeatures(survived, rawPrediction, prob, prediction)
    .withRawFeatureFilter(Option(trainReader), Option(scoreReader), None)
```

It will load the training and scoring data and exclude individual features based on their fill rate, relative fill rates between training and scoring, or differences in the distribution of data between training and scoring. This stage can eliminate many issues, such as leakage of information that is only filled out after the label is known and changes in data collection practices, before they affect your model.
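
As a sketch, the exclusion thresholds can be tuned when adding the filter (the parameter names below, such as `minFillRate` and `maxFillDifference`, are assumptions and may differ between versions; consult the developer guide for the exact signature):

```scala
// A sketch with assumed parameter names for tuning feature exclusion.
val filteredWorkflow =
  new OpWorkflow()
    .setResultFeatures(survived, rawPrediction, prob, prediction)
    .withRawFeatureFilter(
      Option(trainReader), Option(scoreReader),
      minFillRate = 0.01,       // drop features filled in less than 1% of training rows
      maxFillDifference = 0.90, // drop features whose fill rate differs too much between training and scoring
      maxJSDivergence = 0.90    // drop features whose distributions diverge between training and scoring
    )
```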

For advanced users, check out how to set [optional parameters](../developer-guide#rawfeaturefilter) for when to exclude features.


## ModelSelectors

This is the Stage that automates the model selection step in the machine learning pipeline.

TransmogrifAI will select the best model and hyper-parameters for you based on the class of modeling you are doing (e.g. Classification, Regression, etc.).
Smart model selection and comparison give the next layer of improvements over traditional ML workflows.

```scala
val (pred, raw, prob) = BinaryClassificationModelSelector().setInput(label, features).getOutput()
```

The ModelSelector is an Estimator that uses data to find the best model. BinaryClassificationModelSelector is for binary classification tasks, while multiclass classification tasks can be handled using MultiClassificationModelSelector. Regression models are selected through the RegressionModelSelector. Currently the possible classification models that can be applied in the selector are `LogisticRegression`, `DecisionTrees`, `RandomForest` and `NaiveBayes`. The possible regression models are `LinearRegression`, `DecisionTrees`, `RandomForest` and `GBTrees`. The best model is selected via CrossValidation or TrainingSplit, by picking the best SparkML model and wrapping it. It is also possible to perform hyperparameter tuning for each model through a grid search.
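
For example, a sketch of supplying cross validation settings when constructing the selector (the `withCrossValidation` factory and the `numFolds`/`seed` parameter names are assumptions and may vary by version):

```scala
// A sketch: factory and parameter names are assumed and may differ across versions.
val (pred, raw, prob) = BinaryClassificationModelSelector
  .withCrossValidation(numFolds = 3, seed = 42L)
  .setInput(label, features)
  .getOutput()
```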

For advanced users, check out how to set CrossValidation parameters, balance datasets and customize hyperparameter-tuning [here](../developer-guide#modelselector).
58 changes: 0 additions & 58 deletions docs/build.gradle

This file was deleted.

