Merge branch 'master' into ec/standardMetadata
erica-chiu committed Aug 1, 2019
2 parents 7459f31 + b505ff7 commit 007467e
Showing 92 changed files with 2,640 additions and 720 deletions.
2 changes: 1 addition & 1 deletion .circleci/config.yml
@@ -1,4 +1,4 @@
-version: 2
+version: 2.1

machine-config: &machine-config
machine: true
40 changes: 40 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,45 @@
# Changelog

## 0.6.0

Bug fixes:
- Quick Fix Alias Type Names [#346](https://github.com/salesforce/TransmogrifAI/pull/346)
- Forecast Evaluator - fixes SMAPE, adds MASE and Seasonal Error metrics [#342](https://github.com/salesforce/TransmogrifAI/pull/342)

New features / updates:
- Aggregate LOCOs of DateToUnitCircleTransformer. [#349](https://github.com/salesforce/TransmogrifAI/pull/349)
- Convert lambda functions into concrete classes to allow compatibility with Scala 2.12 [#357](https://github.com/salesforce/TransmogrifAI/pull/357)
- Replace mapValues with immutable Map where applicable [#363](https://github.com/salesforce/TransmogrifAI/pull/363)
- Aggregate spark metrics during run time instead of post processing by default [#358](https://github.com/salesforce/TransmogrifAI/pull/358)
- Allow customizing serialization for FeatureGenerator extract function [#352](https://github.com/salesforce/TransmogrifAI/pull/352)
- Update helloworld examples to be simple [#351](https://github.com/salesforce/TransmogrifAI/pull/351)
- Adding `key` ctor field in all RawFeatureFilter results [#348](https://github.com/salesforce/TransmogrifAI/pull/348)
- Forecast evaluator + SMAPE metric [#337](https://github.com/salesforce/TransmogrifAI/pull/337)
- Local scoring for model with features of all types [#340](https://github.com/salesforce/TransmogrifAI/pull/340)
- Remove local runner + update docs [#335](https://github.com/salesforce/TransmogrifAI/pull/335)
- Added missing test for java conversions [#334](https://github.com/salesforce/TransmogrifAI/pull/334)
- Get rid of scalaj-collections [#333](https://github.com/salesforce/TransmogrifAI/pull/333)
- Workflow independent model loading [#274](https://github.com/salesforce/TransmogrifAI/pull/274)
- Aggregated LOCOs of SmartTextVectorizer outputs [#308](https://github.com/salesforce/TransmogrifAI/pull/308)
- Added community projects docs section [#326](https://github.com/salesforce/TransmogrifAI/pull/326)
- Add FeatureBuilder.fromSchema [#325](https://github.com/salesforce/TransmogrifAI/pull/325)
- Improve WeekOfMonth in date transformers [#323](https://github.com/salesforce/TransmogrifAI/pull/323)
- Improved datetime unit transformer shortcuts - Part 2 [#319](https://github.com/salesforce/TransmogrifAI/pull/319)
- Correctly pass main class for CLI sub project [#321](https://github.com/salesforce/TransmogrifAI/pull/321)
- Serialize blacklisted map keys with the model + updated access on workflow/model members [#320](https://github.com/salesforce/TransmogrifAI/pull/320)
- Improved datetime unit transformer shortcuts [#316](https://github.com/salesforce/TransmogrifAI/pull/316)
- Improved OpScalarStandardScalerTest [#317](https://github.com/salesforce/TransmogrifAI/pull/317)
- Improved PercentileCalibratorTest [#318](https://github.com/salesforce/TransmogrifAI/pull/318)
- Added concrete wrappers for HashingTF, NGram and StopWordsRemover [#314](https://github.com/salesforce/TransmogrifAI/pull/314)
- Avoid singleton random generators [#312](https://github.com/salesforce/TransmogrifAI/pull/312)
- Remove free function aggregation with feature builders [#311](https://github.com/salesforce/TransmogrifAI/pull/311)
- Added util methods to create class/object by name + retrieve type tag by type name [#310](https://github.com/salesforce/TransmogrifAI/pull/310)

Dependency updates:
- Bump shadowjar plugin to 5.0.0 [#306](https://github.com/salesforce/TransmogrifAI/pull/306)
- Bump Apache Tika to 1.21 [#331](https://github.com/salesforce/TransmogrifAI/pull/331)
- Enable CircleCI version 2.1 [#353](https://github.com/salesforce/TransmogrifAI/pull/353)

## 0.5.3

Bug fixes:
14 changes: 7 additions & 7 deletions README.md
@@ -1,6 +1,6 @@
# TransmogrifAI

-[![Maven Central](https://img.shields.io/maven-central/v/com.salesforce.transmogrifai/transmogrifai-core_2.11.svg?colorB=blue)](https://search.maven.org/search?q=g:com.salesforce.transmogrifai) [![Download](https://api.bintray.com/packages/salesforce/maven/TransmogrifAI/images/download.svg)](https://bintray.com/salesforce/maven/TransmogrifAI/_latestVersion) [![Javadocs](https://www.javadoc.io/badge/com.salesforce.transmogrifai/transmogrifai-core_2.11/0.5.3.svg?color=blue)](https://www.javadoc.io/doc/com.salesforce.transmogrifai/transmogrifai-core_2.11/0.5.3) [![Spark version](https://img.shields.io/badge/spark-2.3-brightgreen.svg)](https://spark.apache.org/downloads.html) [![Scala version](https://img.shields.io/badge/scala-2.11-brightgreen.svg)](https://www.scala-lang.org/download/2.11.12.html) [![License](http:https://img.shields.io/:license-BSD--3-blue.svg)](./LICENSE) [![Chat](https://badges.gitter.im/salesforce/TransmogrifAI.svg)](https://gitter.im/salesforce/TransmogrifAI?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
+[![Maven Central](https://img.shields.io/maven-central/v/com.salesforce.transmogrifai/transmogrifai-core_2.11.svg?colorB=blue)](https://search.maven.org/search?q=g:com.salesforce.transmogrifai) [![Download](https://api.bintray.com/packages/salesforce/maven/TransmogrifAI/images/download.svg)](https://bintray.com/salesforce/maven/TransmogrifAI/_latestVersion) [![Javadocs](https://www.javadoc.io/badge/com.salesforce.transmogrifai/transmogrifai-core_2.11/0.6.0.svg?color=blue)](https://www.javadoc.io/doc/com.salesforce.transmogrifai/transmogrifai-core_2.11/0.6.0) [![Spark version](https://img.shields.io/badge/spark-2.3-brightgreen.svg)](https://spark.apache.org/downloads.html) [![Scala version](https://img.shields.io/badge/scala-2.11-brightgreen.svg)](https://www.scala-lang.org/download/2.11.12.html) [![License](http:https://img.shields.io/:license-BSD--3-blue.svg)](./LICENSE) [![Chat](https://badges.gitter.im/salesforce/TransmogrifAI.svg)](https://gitter.im/salesforce/TransmogrifAI?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

[![TravisCI Build Status](https://travis-ci.com/salesforce/TransmogrifAI.svg?token=Ex9czVEUD7AzPTmVh6iX&branch=master)](https://travis-ci.com/salesforce/TransmogrifAI) [![CircleCI Build Status](https://circleci.com/gh/salesforce/TransmogrifAI.svg?&style=shield&circle-token=e84c1037ae36652d38b49207728181ee85337e0b)](https://circleci.com/gh/salesforce/TransmogrifAI) [![Documentation Status](https://readthedocs.org/projects/transmogrifai/badge/?version=stable)](https://docs.transmogrif.ai/en/stable/?badge=stable) [![CII Best Practices](https://bestpractices.coreinfrastructure.org/projects/2557/badge)](https://bestpractices.coreinfrastructure.org/projects/2557) [![Codecov](https://codecov.io/gh/salesforce/TransmogrifAI/branch/master/graph/badge.svg)](https://codecov.io/gh/salesforce/TransmogrifAI) [![CodeFactor](https://www.codefactor.io/repository/github/salesforce/transmogrifai/badge)](https://www.codefactor.io/repository/github/salesforce/transmogrifai)

@@ -128,8 +128,8 @@ Start by picking TransmogrifAI version to match your project dependencies from t

| TransmogrifAI Version | Spark Version | Scala Version | Java Version |
|-------------------------------------------------|:-------------:|:-------------:|:------------:|
-| 0.6.0 (unreleased, master)                     | 2.3           | 2.11          | 1.8          |
-| **0.5.3 (stable)**, 0.5.2, 0.5.1, 0.5.0        | **2.3**       | **2.11**      | **1.8**      |
+| 0.6.1 (unreleased, master)                     | 2.4           | 2.11          | 1.8          |
+| **0.6.0 (stable)**, 0.5.3, 0.5.2, 0.5.1, 0.5.0 | **2.3**       | **2.11**      | **1.8**      |
| 0.4.0, 0.3.4 | 2.2 | 2.11 | 1.8 |

For Gradle in `build.gradle` add:
@@ -140,10 +140,10 @@ repositories {
}
dependencies {
// TransmogrifAI core dependency
-compile 'com.salesforce.transmogrifai:transmogrifai-core_2.11:0.5.3'
+compile 'com.salesforce.transmogrifai:transmogrifai-core_2.11:0.6.0'
// TransmogrifAI pretrained models, e.g. OpenNLP POS/NER models etc. (optional)
-// compile 'com.salesforce.transmogrifai:transmogrifai-models_2.11:0.5.3'
+// compile 'com.salesforce.transmogrifai:transmogrifai-models_2.11:0.6.0'
}
```

@@ -154,10 +154,10 @@ scalaVersion := "2.11.12"
resolvers += Resolver.jcenterRepo

// TransmogrifAI core dependency
-libraryDependencies += "com.salesforce.transmogrifai" %% "transmogrifai-core" % "0.5.3"
+libraryDependencies += "com.salesforce.transmogrifai" %% "transmogrifai-core" % "0.6.0"

// TransmogrifAI pretrained models, e.g. OpenNLP POS/NER models etc. (optional)
-// libraryDependencies += "com.salesforce.transmogrifai" %% "transmogrifai-models" % "0.5.3"
+// libraryDependencies += "com.salesforce.transmogrifai" %% "transmogrifai-models" % "0.6.0"
```

Then import TransmogrifAI into your code:
22 changes: 11 additions & 11 deletions build.gradle
@@ -1,15 +1,16 @@
buildscript {
repositories {
maven { url "https://plugins.gradle.org/m2/" }
mavenCentral()
jcenter()
maven { url "https://plugins.gradle.org/m2/" }
}
dependencies {
classpath 'org.github.ngbinh.scalastyle:gradle-scalastyle-plugin_2.11:1.0.1'
classpath 'com.commercehub.gradle.plugin:gradle-avro-plugin:0.16.0'
}
}

plugins {
id 'com.commercehub.gradle.plugin.avro' version '0.8.0'
id 'org.scoverage' version '2.5.0'
id 'net.minecrell.licenser' version '0.4.1'
id 'com.github.jk1.dependency-license-report' version '0.5.0'
@@ -57,14 +58,13 @@ configure(allProjs) {
scalaVersionRevision = '12'
scalaTestVersion = '3.0.5'
scalaCheckVersion = '1.14.0'
-junitVersion = '4.11'
-avroVersion = '1.7.7'
-sparkVersion = '2.3.2'
-sparkAvroVersion = '4.0.0'
+junitVersion = '4.12'
+avroVersion = '1.8.2'
+sparkVersion = '2.4.3'
scalaGraphVersion = '1.12.5'
scalafmtVersion = '1.5.1'
hadoopVersion = 'hadoop2'
-json4sVersion = '3.2.11' // matches Spark dependency version
+json4sVersion = '3.5.3' // matches Spark dependency version
jodaTimeVersion = '2.9.4'
jodaConvertVersion = '1.8.1'
algebirdVersion = '0.13.4'
@@ -75,20 +75,20 @@ configure(allProjs) {
googleLibPhoneNumberVersion = '8.8.5'
googleGeoCoderVersion = '2.82'
googleCarrierVersion = '1.72'
-chillVersion = '0.8.4'
+chillVersion = '0.9.3'
reflectionsVersion = '0.9.11'
collectionsVersion = '3.2.2'
optimaizeLangDetectorVersion = '0.0.1'
tikaVersion = '1.21'
-sparkTestingBaseVersion = '2.3.1_0.10.0'
+sparkTestingBaseVersion = '2.4.3_0.12.0'
sourceCodeVersion = '0.1.3'
pegdownVersion = '1.4.2'
commonsValidatorVersion = '1.6'
commonsIOVersion = '2.6'
scoveragePluginVersion = '1.3.1'
-xgboostVersion = '0.81'
+xgboostVersion = '0.90'
akkaSlf4jVersion = '2.3.11'
-mleapVersion = '0.13.0'
+mleapVersion = '0.14.0'
memoryFilesystemVersion = '2.1.0'
}

6 changes: 1 addition & 5 deletions cli/build.gradle
@@ -69,20 +69,16 @@ task copyTemplates(type: Copy) {
fileName.replace(".gradle.template", ".gradle")
}
expand([
-databaseHostname: 'db.company.com',
version: scalaVersion,
scalaVersion: scalaVersion,
scalaVersionRevision: scalaVersionRevision,
scalaTestVersion: scalaTestVersion,
junitVersion: junitVersion,
sparkVersion: sparkVersion,
avroVersion: avroVersion,
-sparkAvroVersion: sparkAvroVersion,
hadoopVersion: hadoopVersion,
collectionsVersion: collectionsVersion,
-transmogrifaiVersion: version,
-buildNumber: (int)(Math.random() * 1000),
-date: new Date()
+transmogrifaiVersion: version
])
}

@@ -138,7 +138,7 @@ case class AutomaticSchema(recordClassName: String)(dataFile: File) extends Sche
case Some(actualType) =>
val newSchema = Schema.create(actualType)
val schemaField =
-new Schema.Field(field.name, newSchema, "auto-generated", orgSchemaField.defaultValue)
+new Schema.Field(field.name, newSchema, "auto-generated", orgSchemaField.defaultVal())
AvroField.from(schemaField)
}
} else field
@@ -69,7 +69,7 @@ class AvroFieldTest extends FlatSpec with TestCommon with Assertions {
val allSchemas = (enum::unions)++simpleSchemas // NULL does not work

val fields = allSchemas.zipWithIndex map {
-case (s, i) => new Schema.Field("x" + i, s, "Who", null)
+case (s, i) => new Schema.Field("x" + i, s, "Who", null: Object)
}

val expected = List(
@@ -86,7 +86,7 @@ class AvroFieldTest extends FlatSpec with TestCommon with Assertions {

an[IllegalArgumentException] should be thrownBy {
val nullSchema = Schema.create(Schema.Type.NULL)
-val nullField = new Schema.Field("xxx", null, "Nobody", null)
+val nullField = new Schema.Field("xxx", null, "Nobody", null: Object)
AvroField from nullField
}

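The `null: Object` ascriptions in the test diff above exist because newer Avro versions give `Schema.Field` constructor overloads whose default-value parameters both accept `null`, and a bare `null` resolves to the most specific overload rather than the intended `Object` one. A minimal Scala analogue of that resolution rule (names hypothetical, no Avro dependency):

```scala
// Two overloads whose parameter types differ only in specificity:
// Integer is more specific than Object, so a bare `null` selects it,
// while the `null: Object` ascription forces the general overload.
object OverloadDemo {
  def field(default: Object): String = "Object overload"
  def field(default: Integer): String = "Integer overload"
}

val bare = OverloadDemo.field(null)           // most-specific overload wins
val forced = OverloadDemo.field(null: Object) // type ascription disambiguates
```

The same ascription trick is what keeps the test code compiling against the bumped Avro version without picking a deprecated constructor.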
78 changes: 74 additions & 4 deletions core/src/main/scala/com/salesforce/op/ModelInsights.scala
Expand Up @@ -484,10 +484,12 @@ case object ModelInsights {
s" to fill in model insights"
)

+val labelSummary = getLabelSummary(label, checkerSummary)

ModelInsights(
-label = getLabelSummary(label, checkerSummary),
+label = labelSummary,
features = getFeatureInsights(vectorInput, checkerSummary, model, rawFeatures,
-blacklistedFeatures, blacklistedMapKeys, rawFeatureFilterResults),
+blacklistedFeatures, blacklistedMapKeys, rawFeatureFilterResults, labelSummary),
selectedModelInfo = getModelInfo(model),
trainingParams = trainingParams,
stageInfo = RawFeatureFilterConfig.toStageInfo(rawFeatureFilterResults.rawFeatureFilterConfig)
@@ -537,7 +539,8 @@ case object ModelInsights {
rawFeatures: Array[features.OPFeature],
blacklistedFeatures: Array[features.OPFeature],
blacklistedMapKeys: Map[String, Set[String]],
-rawFeatureFilterResults: RawFeatureFilterResults = RawFeatureFilterResults()
+rawFeatureFilterResults: RawFeatureFilterResults = RawFeatureFilterResults(),
+label: LabelSummary
): Seq[FeatureInsights] = {
val featureInsights = (vectorInfo, summary) match {
case (Some(v), Some(s)) =>
@@ -557,6 +560,42 @@
case _ => None
}
val keptIndex = indexInToIndexKept.get(h.index)
val featureStd = math.sqrt(getIfExists(h.index, s.featuresStatistics.variance).getOrElse(1.0))
val sparkFtrContrib = keptIndex
.map(i => contributions.map(_.applyOrElse(i, (_: Int) => 0.0))).getOrElse(Seq.empty)
val defaultLabelStd = 1.0
val labelStd = label.distribution match {
case Some(Continuous(_, _, _, variance)) =>
if (variance == 0) {
log.warn("The standard deviation of the label is zero, " +
"so the coefficients and intercepts of the model will be zeros, training is not needed.")
defaultLabelStd
}
else math.sqrt(variance)
case Some(Discrete(domain, prob)) =>
// mean = sum (x_i * p_i)
val mean = (domain zip prob).foldLeft(0.0) {
case (weightSum, (d, p)) => weightSum + d.toDouble * p
}
// variance = sum (x_i - mu)^2 * p_i
val discreteVariance = (domain zip prob).foldLeft(0.0) {
case (sqweightSum, (d, p)) => sqweightSum + (d.toDouble - mean) * (d.toDouble - mean) * p
}
if (discreteVariance == 0) {
log.warn("The standard deviation of the label is zero, " +
"so the coefficients and intercepts of the model will be zeros, training is not needed.")
defaultLabelStd
}
else math.sqrt(discreteVariance)
case Some(_) => {
log.warn("Failing to perform weight descaling because distribution is unsupported.")
defaultLabelStd
}
case None => {
log.warn("Label does not exist, please check your data")
defaultLabelStd
}
}
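The `Discrete` branch above folds the label's domain and probabilities into a mean and variance. The same computation as a standalone sketch (a hypothetical helper mirroring the diff's `foldLeft` logic, not part of the PR):

```scala
// mean = sum(x_i * p_i); variance = sum((x_i - mean)^2 * p_i)
// Domain values are strings in the insights metadata, hence .toDouble.
def discreteMeanVariance(domain: Seq[String], prob: Seq[Double]): (Double, Double) = {
  val mean = (domain zip prob).foldLeft(0.0) {
    case (acc, (d, p)) => acc + d.toDouble * p
  }
  val variance = (domain zip prob).foldLeft(0.0) {
    case (acc, (d, p)) => acc + (d.toDouble - mean) * (d.toDouble - mean) * p
  }
  (mean, variance)
}
```

For a balanced binary label over {0, 1} this yields mean 0.5 and variance 0.25; a zero variance is exactly the degenerate case the diff guards against by falling back to `defaultLabelStd`.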

h.parentFeatureOrigins ->
Insights(
@@ -579,7 +618,8 @@
case _ => Map.empty[String, Double]
},
contribution =
-keptIndex.map(i => contributions.map(_.applyOrElse(i, (_: Int) => 0.0))).getOrElse(Seq.empty),
+descaleLRContrib(model, sparkFtrContrib, featureStd, labelStd).getOrElse(sparkFtrContrib),

min = getIfExists(h.index, s.featuresStatistics.min),
max = getIfExists(h.index, s.featuresStatistics.max),
mean = getIfExists(h.index, s.featuresStatistics.mean),
@@ -647,6 +687,36 @@
}
}

private[op] def descaleLRContrib(
model: Option[Model[_]],
sparkFtrContrib: Seq[Double],
featureStd: Double,
labelStd: Double): Option[Seq[Double]] = {
val stage = model.flatMap {
case m: SparkWrapperParams[_] => m.getSparkMlStage()
case _ => None
}
stage.collect {
case m: LogisticRegressionModel =>
if (m.getStandardization && sparkFtrContrib.nonEmpty) {
// scale entire feature contribution vector
// See https://think-lab.github.io/d/205/
// § 4.5.2 Standardized Interpretations, An Introduction to Categorical Data Analysis, Alan Agresti
sparkFtrContrib.map(_ * featureStd)
}
else sparkFtrContrib
case m: LinearRegressionModel =>
if (m.getStandardization && sparkFtrContrib.nonEmpty) {
// need to also divide by labelStd for linear regression
// See https://u.demog.berkeley.edu/~andrew/teaching/standard_coeff.pdf
// See https://en.wikipedia.org/wiki/Standardized_coefficient
sparkFtrContrib.map(_ * featureStd / labelStd)
}
else sparkFtrContrib
case _ => sparkFtrContrib
}
}
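The two branches of `descaleLRContrib` apply the textbook standardized-coefficient transforms: multiply by the feature's standard deviation, and additionally divide by the label's standard deviation for linear regression. A minimal numeric sketch without Spark (a simplification of the method above; the real code also checks `getStandardization` and that the contribution vector is non-empty):

```scala
// b_std = b * sX        (logistic regression: no label scaling)
// b_std = b * sX / sY   (linear regression)
def descale(contrib: Seq[Double], featureStd: Double, labelStd: Double,
            isLinearRegression: Boolean): Seq[Double] =
  if (isLinearRegression) contrib.map(_ * featureStd / labelStd)
  else contrib.map(_ * featureStd)
```

This makes feature contributions comparable across features measured on different scales, which is why the insights code prefers the descaled values when a linear or logistic model is present.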

private[op] def getModelContributions
(model: Option[Model[_]], featureVectorSize: Option[Int] = None): Seq[Seq[Double]] = {
val stage = model.flatMap {
@@ -35,6 +35,7 @@ import com.salesforce.op.filters.RawFeatureFilterResults
import com.salesforce.op.stages.{OPStage, OpPipelineStageWriter}
import enumeratum._
import org.apache.hadoop.fs.Path
+import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.ml.util.MLWriter
import org.json4s.JsonAST.{JArray, JObject, JString}
import org.json4s.JsonDSL._
@@ -54,7 +55,8 @@ class OpWorkflowModelWriter(val model: OpWorkflowModel) extends MLWriter {
implicit val jsonFormats: Formats = DefaultFormats

override protected def saveImpl(path: String): Unit = {
-sc.parallelize(Seq(toJsonString(path)), 1).saveAsTextFile(OpWorkflowModelReadWriteShared.jsonPath(path))
+sc.parallelize(Seq(toJsonString(path)), 1)
+  .saveAsTextFile(OpWorkflowModelReadWriteShared.jsonPath(path), classOf[GzipCodec])
}

/**
@@ -63,7 +65,7 @@ class OpWorkflowModelWriter(val model: OpWorkflowModel) extends MLWriter {
* @param path to save the model and its stages
* @return model json string
*/
-def toJsonString(path: String): String = pretty(render(toJson(path)))
+def toJsonString(path: String): String = compact(render(toJson(path)))

/**
* Json serialize model instance
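The writer changes above shrink saved models in two ways: `compact` instead of `pretty` JSON rendering, and gzip-compressed text output via `GzipCodec`. The compression side can be sketched without Spark using `java.util.zip` (file path illustrative; this mirrors what `saveAsTextFile(path, classOf[GzipCodec])` does to each partition's text):

```scala
import java.io._
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Write a string gzip-compressed, then read it back.
def writeGzip(path: String, text: String): Unit = {
  val out = new OutputStreamWriter(new GZIPOutputStream(new FileOutputStream(path)), "UTF-8")
  try out.write(text) finally out.close()
}

def readGzip(path: String): String = {
  val in = new BufferedReader(
    new InputStreamReader(new GZIPInputStream(new FileInputStream(path)), "UTF-8"))
  try Iterator.continually(in.readLine()).takeWhile(_ != null).mkString("\n")
  finally in.close()
}
```

Because model JSON is highly repetitive, gzip plus compact rendering typically cuts the on-disk footprint substantially; readers decompress transparently as long as they go through Hadoop's codec-aware text input.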