
Model combiner #385

Merged
merged 19 commits into from
Sep 3, 2019
Conversation

@leahmcguire (Collaborator) commented Aug 19, 2019

Related issues
Want to do different feature engineering for different model types, or do model combination. Basically, this allows two different feature engineering flows to be passed into different model selectors (e.g. you want one kind of feature engineering for random forest and another for regression, but want grid search for each). The accuracy of the models can be compared, and the predictions will then be combined by taking the best model, an equal combination, or a weighted combination of the predictions.

Describe the proposed solution
Made a stage which can combine the results of two model selectors to produce a new score. The inputs are the two prediction outputs and the label. The output is a new prediction which is the combined result of the input predictions. The metadata is updated to reflect the combined prediction.
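The three combination modes described above can be sketched as follows. This is an illustrative, self-contained sketch with hypothetical names, not the actual TransmogrifAI API:

```scala
// Hypothetical sketch of the combination strategies described in the PR:
// best model, equal combination, or metric-weighted combination.
sealed trait Strategy
case object Best extends Strategy
case object Equal extends Strategy
case object Weighted extends Strategy

// metric1/metric2 stand in for the validation metrics of the two model
// selectors (assumed larger-is-better here for simplicity).
def combine(strategy: Strategy, pred1: Double, pred2: Double,
            metric1: Double, metric2: Double): Double = strategy match {
  case Best     => if (metric1 >= metric2) pred1 else pred2
  case Equal    => 0.5 * pred1 + 0.5 * pred2
  case Weighted =>
    val total = metric1 + metric2
    (metric1 / total) * pred1 + (metric2 / total) * pred2
}
```

The real stage additionally updates the output metadata to reflect the combined prediction.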

@tovbinm tovbinm changed the title Lm/model combiner Model combiner Aug 19, 2019
@codecov

codecov bot commented Aug 19, 2019

Codecov Report

Merging #385 into master will decrease coverage by 37.07%.
The diff coverage is 0.66%.


@@             Coverage Diff             @@
##           master     #385       +/-   ##
===========================================
- Coverage   86.82%   49.75%   -37.08%     
===========================================
  Files         336      337        +1     
  Lines       10962    11076      +114     
  Branches      346      588      +242     
===========================================
- Hits         9518     5511     -4007     
- Misses       1444     5565     +4121
Impacted Files Coverage Δ
...salesforce/op/evaluators/OpForecastEvaluator.scala 0% <ø> (-96.43%) ⬇️
...orce/op/stages/impl/tuning/OpCrossValidation.scala 0% <ø> (-97.68%) ⬇️
...ssification/MultiClassificationModelSelector.scala 0% <ø> (-97.57%) ⬇️
...op/evaluators/OpMultiClassificationEvaluator.scala 0% <ø> (-94.74%) ⬇️
...ages/impl/regression/RegressionModelSelector.scala 0% <ø> (-98.15%) ⬇️
...salesforce/op/stages/impl/tuning/OpValidator.scala 0% <ø> (-94.6%) ⬇️
...p/evaluators/OpBinaryClassificationEvaluator.scala 0% <ø> (-82.5%) ⬇️
...sification/BinaryClassificationModelSelector.scala 0% <ø> (-98.25%) ⬇️
...rce/op/stages/impl/preparators/SanityChecker.scala 0% <ø> (-91.6%) ⬇️
...m/salesforce/op/utils/spark/OpVectorMetadata.scala 85.45% <ø> (ø) ⬆️
... and 163 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b91ffe3...41c6654. Read the comment docs.

override def hashCode(): Int = super.hashCode()
}

object CombinationStrategy extends Enum[CombinationStrategy] {
Collaborator:

I think this enum should be defined in feature module and registered with the json in OpPipelineStageReadWriteFormats

@@ -313,10 +316,22 @@ case class Correlations
corrMeta.putString(SanityCheckerNames.CorrelationType, corrType.sparkName)
corrMeta.build()
}

private[op] def +(corr: Correlations): Correlations =
new Correlations(featuresIn ++ corr.featuresIn, values ++ corr.values, nanCorrs ++ corr.nanCorrs, corrType)
Contributor:

Would the Correlations have to have the same correlation type for this to make sense?

Collaborator Author:

Yes, but I am not sure it makes sense to error out on that... should I maybe log a warning?

Collaborator Author:

Added a custom correlation type to reflect when there are multiple correlation types.

if (m2e1.nonEmpty) {
(getMetricValue(summary1.trainEvaluation, eval1), m2e1, eval1)
} else if (m1e2.nonEmpty) {
(m1e2, getMetricValue(summary2.trainEvaluation, eval2))
Contributor:

Don't you need an eval2 at the end of the tuple here? How did the Scala compiler miss that actually??

Can you add some tests that cover some of these weird edge cases? I feel like there's a lot of new stuff being added.

Collaborator Author:

done


lazy val labelColName: String = in1.name

@transient private var evaluatorList: Seq[OpEvaluatorBase[_ <: EvaluationMetrics]] = Seq.empty
Contributor:

Why does this need to be transient?

Collaborator Author:

so that we can save the model

Collaborator:

@leahmcguire I think you need to write a custom serializer for SelectedCombinerModel that would serialize and deserialize the evaluators to make this work.

I wonder why OpEstimatorSpec base class did not catch it?!?

Collaborator Author:

Matthew, this is exactly how it works in the model selector - we don't need these in the serialized model as they are only called during training.

Collaborator Author:

So no, we do not need a custom serializer - evaluators do not need to be serialized.

}
}

object CombinationStrategy extends Enum[CombinationStrategy] {
Contributor:

Docs here too

Collaborator Author:

done

override def entryName: String = name.toLowerCase
}
def withName(name: String, isLargerBetter: Boolean): OpEvaluatorNames =
Try(super.withName(name)).getOrElse(Custom(name, name, isLargerBetter))
Collaborator:

Avoid Try and exceptions in simple program flows, use super.withNameOption instead
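The reviewer's option-based flow can be illustrated with a self-contained toy lookup (illustrative names, not the actual enumeratum-backed OpEvaluatorNames):

```scala
// Toy stand-in for an enum name lookup. An Option-based fallback never
// constructs or catches an exception, unlike Try(super.withName(name)).
val known = Map("auroc" -> "AUROC", "auprc" -> "AUPRC")

def withNameOption(name: String): Option[String] = known.get(name)

// Fall back to a "custom" entry when the name is unknown.
def withNameOrCustom(name: String): String =
  withNameOption(name).getOrElse(s"Custom($name)")
```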


def withNameInsensitive(name: String, isLargerBetter: Boolean): OpEvaluatorNames =
super.withNameInsensitiveOption(name).getOrElse(Custom(name, name, isLargerBetter))
override def withNameInsensitive(name: String): OpEvaluatorNames = withNameInsensitive(name, true)
Collaborator:

why do we want to have the default of true for isLargerBetter value? perhaps let's avoid specifying any defaults for withName and withNameInsensitive methods.

Instead it's better to have the method:

def withNameOrDefault(name: String, default: String => OpEvaluatorNames = name => Custom(name, name, isLargerBetter)): OpEvaluatorNames = {
    super.withNameInsensitiveOption(name).getOrElse(default(name))
}

Collaborator Author:

These are base methods @tovbinm, hence the override. If they want the fallback behavior your method provides they can use my definitions - but what happens when people call the base methods and there is no default?

@tovbinm (Collaborator) left a comment:

See some of my comments.

Another question: would you think we should add a shortcut for it?

perhaps?

val (pred1, pred2) = ...
val combinedPred = label.combinePredictions(pred1, pred2)

* @param operationName name of operation
* @param uid stage uid
*/
class SelectedCombiner
Collaborator:

SelectedModelCombiner or ModelCombiner perhaps? it's difficult to understand what this stage does otherwise.

isValid = (in: String) => CombinationStrategy.values.map(_.entryName).contains(in)
)
def setCombinationStrategy(value: CombinationStrategy): this.type = set(combinationStrategy, value.entryName)
def getCombinationStrategy(): CombinationStrategy = CombinationStrategy.namesToValuesMap($(combinationStrategy))
Collaborator:

CombinationStrategy.withNameInsensitive($(combinationStrategy))


def getMetricValue(metrics: EvaluationMetrics, name: EvalMetric) =
metrics.toMap.collectFirst{
case (k, v) if k.contains(name.humanFriendlyName) || k.contains(name.entryName) => v.asInstanceOf[Double]}
Collaborator:

metrics.toMap.collectFirst {
        case (k, v: Double) if k.contains(name.humanFriendlyName) || k.contains(name.entryName) => v 
}
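The reviewer's typed-pattern version can be exercised directly. The sketch below uses a plain Map standing in for `metrics.toMap` (the map and metric names here are illustrative):

```scala
// Matching on `v: Double` skips non-numeric entries instead of throwing
// a ClassCastException on asInstanceOf[Double].
val metrics: Map[String, Any] =
  Map("area under ROC" -> 0.91, "labels" -> Seq(0.0, 1.0))

def getMetricValue(metrics: Map[String, Any], name: String): Option[Double] =
  metrics.collectFirst { case (k, v: Double) if k.contains(name) => v }
```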

} else (None, None, eval1)
}

def makeMeta(model: SelectedCombinerModel): Unit = {
Collaborator:

can you please make these private methods on the stage instead of them being nested inside fit?

dataPrepResults = summary1.dataPrepResults.orElse(summary2.dataPrepResults),
evaluationMetric = metricName,
problemType = summary1.problemType,
bestModelUID = summary1.bestModelUID + " " + summary2.bestModelUID,
Collaborator:

is space a good separator here? perhaps _ or - instead?

Collaborator Author:

Space will be parsable to get them separated, while "-" or "_" would not.
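The point above can be shown with a tiny sketch (the UID values are made up; Spark stage UIDs have the form `prefix_suffix` and so already contain underscores):

```scala
// Two hypothetical stage UIDs joined with a space, then recovered by
// splitting on the space. Splitting on "_" would break the UIDs apart.
val bestModelUID = "logreg_a1b2c3d4" + " " + "rfc_e5f6a7b8"
val Array(uid1, uid2) = bestModelUID.split(" ")
```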


val strategy = getCombinationStrategy()
val (weight1, weight2) = strategy match {
case CombinationStrategy.Best =>
Collaborator:

this can be a method on CombinationStrategy itself. e.g. val (weight1, weight2) = getCombinationStrategy().computeWeights(metricValue1, metricValue2)

Collaborator Author:

It actually can't, unless I move the EvalMetric class to the features module.



@leahmcguire (Collaborator Author):
@tovbinm @Jauntbox please take a look

@leahmcguire leahmcguire merged commit 51037a8 into master Sep 3, 2019
@leahmcguire leahmcguire deleted the lm/modelCombiner branch September 3, 2019 18:04
gerashegalov added a commit that referenced this pull request Sep 11, 2019
Bug fixes:
- Ensure correct metrics despite model failures on some CV folds [#404](#404)
- Fix flaky `ModelInsight` tests [#395](#395)
- Avoid creating `SparseVector`s for LOCO [#377](#377)

New features / updates:
- Model combiner [#385](#385)
- Added new sample for HousingPrices [#365](#365)
- Test to verify that custom metrics appear in model insight metrics [#387](#387)
- Add `FeatureDistribution` to `SerializationFormat`s [#383](#383)
- Add metadata to `OpStandardScaler` to allow for descaling [#378](#378)
- Improve json serde error in `evalMetFromJson` [#380](#380)
- Track mean & standard deviation as metrics for numeric features and for text length of text features [#354](#354)
- Making model selectors robust to failing models [#372](#372)
- Use compact and compressed model json by default [#375](#375)
- Descale feature contribution for Linear Regression & Logistic Regression [#345](#345)

Dependency updates:   
- Update tika version [#382](#382)
@salesforce-cla

salesforce-cla bot commented Apr 4, 2021

Thanks for the contribution! Unfortunately we can't verify the commit author(s): leahmcguire <l***@s***.com> Leah McGuire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.
