Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New model selector interface #55

Merged
merged 31 commits into from
Aug 20, 2018
Merged

New model selector interface #55

merged 31 commits into from
Aug 20, 2018

Conversation

leahmcguire
Copy link
Collaborator

Related issues
Changing model selector interface to use new more flexible model selector class.

Describe the proposed solution
Changing model selector interface to use new more flexible model selector class.

Describe alternatives you've considered
Leaving both interfaces. It would be confusing.

Additional context
Add any other context about the changes here.

@tovbinm
Copy link
Collaborator

tovbinm commented Aug 14, 2018

@leahmcguire this PR is quite large. is there anything we can do here? perhaps split it or provide recommendations on how to review it.

@leahmcguire
Copy link
Collaborator Author

Not really - this is just all the tests and files that touched the ModelSelector interface

@leahmcguire
Copy link
Collaborator Author

I can walk you through the actual changes - they are actually surprising small

@codecov
Copy link

codecov bot commented Aug 15, 2018

Codecov Report

Merging #55 into master will decrease coverage by 0.4%.
The diff coverage is 92.03%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #55      +/-   ##
=========================================
- Coverage    86.3%   85.9%   -0.41%     
=========================================
  Files         298     294       -4     
  Lines        9305    9521     +216     
  Branches      303     535     +232     
=========================================
+ Hits         8031    8179     +148     
- Misses       1274    1342      +68
Impacted Files Coverage Δ
...s/sparkwrappers/specific/SparkModelConverter.scala 93.33% <ø> (ø) ⬆️
...op/stages/impl/selector/ModelSelectorSummary.scala 92.13% <ø> (ø) ⬆️
...es/sparkwrappers/specific/OpPredictorWrapper.scala 100% <ø> (ø) ⬆️
...m/salesforce/op/stages/OpPipelineStageReader.scala 65.62% <ø> (ø) ⬆️
...p/evaluators/OpBinaryClassificationEvaluator.scala 81.57% <ø> (ø) ⬆️
...la/com/salesforce/op/stages/OpPipelineStages.scala 73% <ø> (ø) ⬆️
...com/salesforce/op/utils/stages/FitStagesUtil.scala 94.73% <ø> (ø) ⬆️
...lesforce/op/evaluators/OpRegressionEvaluator.scala 91.66% <ø> (ø) ⬆️
...alesforce/op/stages/impl/tuning/DataBalancer.scala 96.19% <ø> (ø) ⬆️
.../salesforce/op/stages/impl/tuning/DataCutter.scala 95.65% <ø> (ø) ⬆️
... and 46 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 77a6adc...f4c5f43. Read the comment docs.

@tovbinm tovbinm changed the title Lm/model selector interface New model selector interface Aug 15, 2018
modelsAndParameters: Seq[(EstimatorType, Array[ParamMap])]
): ModelSelector[ModelType, EstimatorType] = {
val modelStrings = modelTypesToUse.map(_.entryName)
val modelsToUse = if (modelsAndParameters == defaultModelsAndParams) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove check

val cv = new OpCrossValidation[ProbClassifierModel, ProbClassifier](
parallelism: Int = ValidatorParamDefaults.Parallelism,
modelTypesToUse: Seq[_ <: BinaryClassificationModelsToTry] = modelNames,
modelsAndParameters: Seq[(EstimatorType, Array[ParamMap])] = defaultModelsAndParams
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update docs

): BinaryClassificationModelSelector = {
val ts = new OpTrainValidationSplit[ProbClassifierModel, ProbClassifier](
parallelism: Int = ValidatorParamDefaults.Parallelism,
modelTypesToUse: Seq[_ <: BinaryClassificationModelsToTry] = modelNames,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

split methods

.setInput(label, checkedFeatures)
.getOutput()

val evaluator =
Evaluators.BinaryClassification()
.setLabelCol(label).setPredictionCol(pred).setRawPredictionCol(raw)
.setLabelCol(label).setFullPredictionCol(pred)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's rename setFullPredictionCol to setPredictionCol



/**
* A factory for Binary Classification Model Selector
*/
case object BinaryClassificationModelSelector {

private[op] val modelNames: Seq[_ <: BinaryClassificationModelsToTry] = Seq(MTT.OpLogisticRegression,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can simply do Seq[BinaryClassificationModelsToTry] here and everywhere for other selectors

case m => setDefault(sparkMlStage, Option(m))
}

lazy val recoveredStage: ModelType = getSparkMlStage() match {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

private transient lazy val recoveredStage

@michaelweilsalesforce
Copy link
Contributor

michaelweilsalesforce commented Aug 16, 2018

@leahmcguire This is not be related to this PR, but I'm trying to understand the type safety in the models to try.
Here is an example : What happens if a (witty) user decides to do something this :
BinaryClassificationModelSelector.withCrossValidation(modelsAndParameters = Seq(new OpLinearRegression() -> Array.empty)... ?
In a word, it is a ClassificationModelSelector trying a regression model. Will it break? Will it do a kind of regression Model selector?

Should we consider in the future enforcing a type 'Classification' for OpLogisticRegression, OpRandomForest,... and a type Regression for OpLinearRegression, ... ?

@leahmcguire
Copy link
Collaborator Author

It is an not part of this PR but you are correct @michaelweilsalesforce there is no compile time type check for this now. It will fail at runtime because the evaluator will not find a raw prediction / probability. In order to support users being able to define their own estimators I had to relax the type checks. Any estimator that takes a label and feature vector and returns a prediction will try to run. So the default models and models that can be turned on by name are all of the correct type. A user can mess it up if they try :-P Good eye :-)

final val predictionCol: Param[String] = new Param[String](this, "predictionCol", "prediction column name")
setDefault(predictionCol, "prediction")
trait OpHasPredictionValueCol[T <: FeatureType] extends Params {
final val predictionValueCol: Param[String] = new Param[String](this, "predictionCol", "prediction column name")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"predictionCol" -> "predictionValueCol"?

Copy link
Collaborator Author

@leahmcguire leahmcguire Aug 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes and fullPredictionCol -> predictionCol

trait OpHasFullPredictionCol extends Params {
final val fullPredictionCol: Param[String] = new Param[String](this, "fullPredictionCol", "prediction column name")
trait OpHasPredictionCol extends Params {
final val predictionCol: Param[String] = new Param[String](this, "fullPredictionCol", "prediction column name")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update name and doc of the param?
Add setDefault?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no default that can be set for this

!(isSet(predictionCol) && data.schema.fieldNames.contains(getPredictionCol))) {
val fullPredictionColName = getFullPredictionCol
if (isSet(predictionCol) &&
!(isSet(predictionValueCol) && data.schema.fieldNames.contains(getPredictionValueCol))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data.columns.contains(getPredictionValueCol)

!(isSet(predictionCol) && data.schema.fieldNames.contains(getPredictionCol))) {
val fullPredictionColName = getFullPredictionCol
if (isSet(predictionCol) &&
!(isSet(predictionValueCol) && data.schema.fieldNames.contains(getPredictionValueCol))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data.columns.contains(getPredictionValueCol)

* @param modelTypesToUse list of model types to run grid search on must from supported types in
* BinaryClassificationModelsToTry (OpLogisticRegression, OpRandomForestClassifier,
* OpGBTClassifier, OpLinearSVC, OpDecisionTreeClassifier, OpNaiveBayes)
* @param modelsAndParameters pass in an explicit list pairs of estimators and the accompanying hyper parameters to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hyper parameters -> hyperparameters

val modelStrings = modelTypesToUse.map(_.entryName)
val modelsToUse =
if (modelsAndParameters == defaultModelsAndParams || modelTypesToUse != modelNames) modelsAndParameters
.filter{ case (e, p) => modelStrings.contains(e.getClass.getSimpleName) }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To use a proper subset of the default models, one has to specify modelsAndParameters explicitly?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can specify a subset of the model types using the modelTypesToUse parameter. To change the hyperparameters as well as the model types you have to specify the modelsAndParameters

) extends Stage1ClassificationModelSelector(validator, splitter, evaluators, uid, stage2uid, stage3uid)
object BinaryClassificationModelsToTry extends Enum[BinaryClassificationModelsToTry] {
val values = findValues
case object OpLogisticRegression extends BinaryClassificationModelsToTry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we associate each BinaryClassificationModelsToTry with the corresponding estimator and default params?

case object OpLogisticRegression extends BinaryClassificationModelsToTry {
  val estimator = new OpLogisticRegression()
  val params = new ParamGridBuilder()
      .addGrid(estimator.fitIntercept, DefaultSelectorParams.FitIntercept)
      .addGrid(estimator.elasticNetParam, DefaultSelectorParams.ElasticNet)
      .addGrid(estimator.maxIter, DefaultSelectorParams.MaxIterLin)
      .addGrid(estimator.regParam, DefaultSelectorParams.Regularization)
      .addGrid(estimator.standardization, DefaultSelectorParams.Standardized)
      .addGrid(estimator.tol, DefaultSelectorParams.Tol)
      .build()
}

Also, we might let users define custom BinaryClassificationModelsToTry?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make the base class public for BinaryClassificationModelsToTry. And I could add a sub value to have the class they are associated with rather than relying on the name...


type ModelType = Model[_ <: Model[_]] with OpTransformer2[RealNN, OPVector, Prediction]
type EstimatorType = Estimator[_ <: Model[_]] with OpPipelineStage2[RealNN, OPVector, Prediction]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leahmcguire totally minor - wanna add type ModelSelector = ModelSelector[ModelType, EstimatorType]?

Copy link
Collaborator

@tovbinm tovbinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!!

@tovbinm tovbinm merged commit c0d6ecb into master Aug 20, 2018
@tovbinm tovbinm deleted the lm/modelSelectorInterface branch August 20, 2018 22:31
ericwayman pushed a commit that referenced this pull request Feb 8, 2019
@salesforce-cla
Copy link

Thanks for the contribution! Unfortunately we can't verify the commit author(s): leahmcguire <l***@s***.com> Leah McGuire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants