New model selector interface #55
@leahmcguire this PR is quite large. Is there anything we can do here? Perhaps split it, or provide recommendations on how to review it.
Not really. This is just all the tests and files that touched the ModelSelector interface.
I can walk you through the actual changes. They are actually surprisingly small.
Codecov Report
@@ Coverage Diff @@
## master #55 +/- ##
=========================================
- Coverage 86.3% 85.9% -0.41%
=========================================
Files 298 294 -4
Lines 9305 9521 +216
Branches 303 535 +232
=========================================
+ Hits 8031 8179 +148
- Misses 1274 1342 +68
Continue to review full report at Codecov.
modelsAndParameters: Seq[(EstimatorType, Array[ParamMap])]
): ModelSelector[ModelType, EstimatorType] = {
val modelStrings = modelTypesToUse.map(_.entryName)
val modelsToUse = if (modelsAndParameters == defaultModelsAndParams) {
remove check
val cv = new OpCrossValidation[ProbClassifierModel, ProbClassifier](
parallelism: Int = ValidatorParamDefaults.Parallelism,
modelTypesToUse: Seq[_ <: BinaryClassificationModelsToTry] = modelNames,
modelsAndParameters: Seq[(EstimatorType, Array[ParamMap])] = defaultModelsAndParams |
update docs
): BinaryClassificationModelSelector = {
val ts = new OpTrainValidationSplit[ProbClassifierModel, ProbClassifier](
parallelism: Int = ValidatorParamDefaults.Parallelism,
modelTypesToUse: Seq[_ <: BinaryClassificationModelsToTry] = modelNames, |
split methods
.setInput(label, checkedFeatures)
.getOutput()

val evaluator =
Evaluators.BinaryClassification()
.setLabelCol(label).setPredictionCol(pred).setRawPredictionCol(raw)
.setLabelCol(label).setFullPredictionCol(pred) |
Let's rename setFullPredictionCol to setPredictionCol.
/**
 * A factory for Binary Classification Model Selector
 */
case object BinaryClassificationModelSelector {

private[op] val modelNames: Seq[_ <: BinaryClassificationModelsToTry] = Seq(MTT.OpLogisticRegression, |
I think you can simply do Seq[BinaryClassificationModelsToTry] here and everywhere for other selectors.
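A minimal sketch of why the plain form is probably enough: Scala's immutable Seq is covariant, so a Seq of any subtype already conforms to Seq of the base type without the existential `_ <:` bound. The trait and case objects below are hypothetical stand-ins, not the project's enumeratum-based definitions.

```scala
object CovarianceSketch {
  // Hypothetical stand-in for BinaryClassificationModelsToTry (the real one is an enumeratum entry).
  sealed trait ModelsToTry { def entryName: String }
  case object OpLogisticRegression extends ModelsToTry { val entryName = "OpLogisticRegression" }
  case object OpRandomForestClassifier extends ModelsToTry { val entryName = "OpRandomForestClassifier" }

  // Existential form, as written in the diff.
  val existential: Seq[_ <: ModelsToTry] = Seq(OpLogisticRegression, OpRandomForestClassifier)

  // Plain form suggested in the review.
  val plain: Seq[ModelsToTry] = Seq(OpLogisticRegression, OpRandomForestClassifier)

  // A consumer that only asks for Seq[ModelsToTry].
  def names(models: Seq[ModelsToTry]): Seq[String] = models.map(_.entryName)

  // Both calls compile because Seq[+A] is covariant in its element type.
  val fromExistential: Seq[String] = names(existential)
  val narrower: Seq[OpLogisticRegression.type] = Seq(OpLogisticRegression)
  val fromNarrower: Seq[String] = names(narrower)
}
```

Since both forms are accepted wherever Seq[ModelsToTry] is expected, the shorter signature should not restrict callers.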
case m => setDefault(sparkMlStage, Option(m))
}

lazy val recoveredStage: ModelType = getSparkMlStage() match { |
private transient lazy val recoveredStage
…nsmogrifAI into lm/modelSelectorInterface
@leahmcguire This is not related to this PR, but I'm trying to understand the type safety in the models to try. Should we consider in the future enforcing a type 'Classification' for
It is not part of this PR, but you are correct, @michaelweilsalesforce: there is no compile-time type check for this now. It will fail at runtime because the evaluator will not find a raw prediction / probability. In order to support users being able to define their own estimators, I had to relax the type checks. Any estimator that takes a label and feature vector and returns a prediction will try to run. So the default models and the models that can be turned on by name are all of the correct type. A user can mess it up if they try :-P Good eye :-)
final val predictionCol: Param[String] = new Param[String](this, "predictionCol", "prediction column name")
setDefault(predictionCol, "prediction")
trait OpHasPredictionValueCol[T <: FeatureType] extends Params {
final val predictionValueCol: Param[String] = new Param[String](this, "predictionCol", "prediction column name") |
"predictionCol" -> "predictionValueCol"?
Yes, and fullPredictionCol -> predictionCol.
trait OpHasFullPredictionCol extends Params {
final val fullPredictionCol: Param[String] = new Param[String](this, "fullPredictionCol", "prediction column name")
trait OpHasPredictionCol extends Params {
final val predictionCol: Param[String] = new Param[String](this, "fullPredictionCol", "prediction column name") |
Update name and doc of the param? Add setDefault?
there is no default that can be set for this
!(isSet(predictionCol) && data.schema.fieldNames.contains(getPredictionCol))) {
val fullPredictionColName = getFullPredictionCol
if (isSet(predictionCol) &&
!(isSet(predictionValueCol) && data.schema.fieldNames.contains(getPredictionValueCol))) { |
data.columns.contains(getPredictionValueCol)
* @param modelTypesToUse list of model types to run grid search on must from supported types in
* BinaryClassificationModelsToTry (OpLogisticRegression, OpRandomForestClassifier,
* OpGBTClassifier, OpLinearSVC, OpDecisionTreeClassifier, OpNaiveBayes)
* @param modelsAndParameters pass in an explicit list pairs of estimators and the accompanying hyper parameters to |
hyper parameters -> hyperparameters
val modelStrings = modelTypesToUse.map(_.entryName)
val modelsToUse =
if (modelsAndParameters == defaultModelsAndParams || modelTypesToUse != modelNames) modelsAndParameters
.filter{ case (e, p) => modelStrings.contains(e.getClass.getSimpleName) } |
To use a proper subset of the default models, one has to specify modelsAndParameters explicitly?
You can specify a subset of the model types using the modelTypesToUse parameter. To change the hyperparameters as well as the model types, you have to specify modelsAndParameters.
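The selection rule described here can be sketched roughly as follows. This is a simplified illustration only: `Estimator`, `ParamGrid`, the grid values, and the assumption that the non-filtering branch returns modelsAndParameters unchanged are all hypothetical stand-ins, and model types are represented by plain name strings rather than enum entries.

```scala
object ModelSelectionSketch {
  // Hypothetical stand-ins: an estimator is identified by name, a grid is a list of settings.
  final case class Estimator(name: String)
  type ParamGrid = Seq[Map[String, Double]]

  val defaultModelsAndParams: Seq[(Estimator, ParamGrid)] = Seq(
    Estimator("OpLogisticRegression") -> Seq(Map("regParam" -> 0.01), Map("regParam" -> 0.1)),
    Estimator("OpRandomForestClassifier") -> Seq(Map("maxDepth" -> 3.0), Map("maxDepth" -> 6.0)),
    Estimator("OpGBTClassifier") -> Seq(Map("maxDepth" -> 3.0))
  )
  val modelNames: Seq[String] = defaultModelsAndParams.map(_._1.name)

  def modelsToUse(
    modelTypesToUse: Seq[String] = modelNames,
    modelsAndParameters: Seq[(Estimator, ParamGrid)] = defaultModelsAndParams
  ): Seq[(Estimator, ParamGrid)] =
    if (modelsAndParameters == defaultModelsAndParams || modelTypesToUse != modelNames) {
      // Defaults in play, or an explicit list of types: keep only the named models.
      modelsAndParameters.filter { case (e, _) => modelTypesToUse.contains(e.name) }
    } else {
      // Custom estimators and grids with the default type list: use them as given (assumed).
      modelsAndParameters
    }
}
```

Under these assumptions, `modelsToUse(modelTypesToUse = Seq("OpGBTClassifier"))` keeps only the default GBT entry, while passing a custom `modelsAndParameters` alone returns that list untouched.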
) extends Stage1ClassificationModelSelector(validator, splitter, evaluators, uid, stage2uid, stage3uid)
object BinaryClassificationModelsToTry extends Enum[BinaryClassificationModelsToTry] { | ||
val values = findValues | ||
case object OpLogisticRegression extends BinaryClassificationModelsToTry |
Should we associate each BinaryClassificationModelsToTry with the corresponding estimator and default params?
case object OpLogisticRegression extends BinaryClassificationModelsToTry {
val estimator = new OpLogisticRegression()
val params = new ParamGridBuilder()
.addGrid(estimator.fitIntercept, DefaultSelectorParams.FitIntercept)
.addGrid(estimator.elasticNetParam, DefaultSelectorParams.ElasticNet)
.addGrid(estimator.maxIter, DefaultSelectorParams.MaxIterLin)
.addGrid(estimator.regParam, DefaultSelectorParams.Regularization)
.addGrid(estimator.standardization, DefaultSelectorParams.Standardized)
.addGrid(estimator.tol, DefaultSelectorParams.Tol)
.build()
}
Also, we might let users define custom BinaryClassificationModelsToTry?
I can make the base class public for BinaryClassificationModelsToTry. And I could add a sub value to have the class they are associated with rather than relying on the name...
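One possible shape for that idea, sketched without Spark or enumeratum: each entry carries its estimator and default grid as values, so nothing depends on matching class names, and users can add their own entries. All names here (`ModelToTry`, `CustomModelToTry`, `Estimator`, `ParamGrid`) and the grid values are hypothetical, not part of the PR.

```scala
object ModelsToTrySketch {
  // Hypothetical stand-ins for a Spark estimator and its hyperparameter grid.
  final case class Estimator(name: String)
  type ParamGrid = Seq[Map[String, Double]]

  // Each entry owns its estimator and default grid instead of being matched by name.
  trait ModelToTry {
    def estimator: Estimator
    def params: ParamGrid
  }

  case object LogisticRegressionToTry extends ModelToTry {
    val estimator: Estimator = Estimator("OpLogisticRegression")
    val params: ParamGrid = Seq(Map("regParam" -> 0.01), Map("regParam" -> 0.1))
  }

  case object RandomForestToTry extends ModelToTry {
    val estimator: Estimator = Estimator("OpRandomForestClassifier")
    val params: ParamGrid = Seq(Map("maxDepth" -> 3.0), Map("maxDepth" -> 6.0))
  }

  // A public, non-sealed base trait lets users supply their own entries.
  final case class CustomModelToTry(estimator: Estimator, params: ParamGrid) extends ModelToTry

  // Derive the (estimator, grid) pairs directly from the chosen entries.
  def modelsAndParameters(models: Seq[ModelToTry]): Seq[(Estimator, ParamGrid)] =
    models.map(m => m.estimator -> m.params)
}
```

The trade-off is that a non-sealed trait gives up exhaustive matching over the built-in entries in exchange for user extensibility.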
type ModelType = Model[_ <: Model[_]] with OpTransformer2[RealNN, OPVector, Prediction]
type EstimatorType = Estimator[_ <: Model[_]] with OpPipelineStage2[RealNN, OPVector, Prediction]
@leahmcguire totally minor - wanna add type ModelSelector = ModelSelector[ModelType, EstimatorType]?
lgtm!!
Thanks for the contribution! Unfortunately we can't verify the commit author(s): leahmcguire <l***@s***.com> Leah McGuire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request. |
Related issues
Changing model selector interface to use new more flexible model selector class.
Describe the proposed solution
Changing model selector interface to use new more flexible model selector class.
Describe alternatives you've considered
Leaving both interfaces. It would be confusing.