Added new sample for HousingPrices #365
Conversation
Codecov Report
@@ Coverage Diff @@
## master #365 +/- ##
===========================================
- Coverage 86.05% 28.77% -57.29%
===========================================
Files 336 336
Lines 10950 8849 -2101
Branches 351 433 +82
===========================================
- Hits 9423 2546 -6877
- Misses 1527 6303 +4776
Continue to review full report at Codecov.
@Jauntbox can you take a look? You have done more with notebooks than I have.
This is awesome thank you so much for the contribution!
I made a couple of minor comments, mostly expansions to the descriptions. Once those are done, LGTM :-)
"metadata": {},
"outputs": [],
"source": [
"%classpath add mvn com.salesforce.transmogrifai transmogrifai-core_2.11 0.5.1"
Please update to the latest version, 0.6.0.
updated
"metadata": {},
"outputs": [],
"source": [
"%classpath add mvn org.apache.spark spark-mllib_2.11 2.3.0"
Please update to the current Spark dependency, 2.4.3.
I thought version 0.6.0 was still on Spark 2.3.x, and the upgrade to Spark 2.4.3 is in master but hasn't been released yet.
@leahmcguire @Jauntbox so which Spark version should I go with?
@Jauntbox is correct, please go with Spark 2.3.3.
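Following this resolution, the two dependency cells would presumably be pinned together. This is only a sketch of the agreed combination (TransmogrifAI 0.6.0 on Spark 2.3.3), not the exact notebook contents:

```
%classpath add mvn com.salesforce.transmogrifai transmogrifai-core_2.11 0.6.0
%classpath add mvn org.apache.spark spark-mllib_2.11 2.3.3
```

`%classpath add mvn` is the BeakerX magic already used in the notebook for pulling Maven artifacts into the kernel.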
@leahmcguire ok..
"\n",
"When defining raw features, specify the extract logic to be applied to the raw data, and also annotate the features as either predictor or response variables via the FeatureBuilders.\n",
"\n",
"SalesType encoder from text to numeric."
"SalesType encoder from text to numeric." What?
perhaps expand a bit as in the Iris notebook:
"#### Feature Engineering\n",
"\n",
"We then define the set of raw features that we would like to extract from the data. The raw features are defined using [FeatureBuilders](https://docs.transmogrif.ai/Developer-Guide#featurebuilders), and are strongly typed. TransmogrifAI supports the following basic feature types: `Text`, `Numeric`, `Vector`, `List` , `Set`, `Map`. \n",
"In addition it supports many specific feature types which extend these base types: Email extends Text; Integral, Real and Binary extend Numeric; Currency and Percentage extend Real. For a complete view of the types supported see the Type Hierarchy and Automatic Feature Engineering section in the Documentation.\n",
"\n",
"Basic `FeatureBuilders` will be created for you if you use the TransmogrifAI CLI to bootstrap your project as described here. However, it is often useful to edit this code to customize feature generation and take full advantage of the Feature types available (selecting the appropriate type will improve automatic feature engineering steps).\n",
"\n",
"When defining raw features, specify the extract logic to be applied to the raw data, and also annotate the features as either predictor or response variables via the FeatureBuilders:"
changed
"source": [
"**Create a feature sequence and transmogrify it**\n",
"\n",
"The `.transmogrify()` shortcut is a special AutoML Estimator that applies a default set of transformations to all the specified inputs and combines them into a single vector. This is in essence the automatic feature engineering Stage of TransmogrifAI. This stage can be discarded in favor of hand-tuned feature engineering and manual vector creation followed by combination using the VectorsCombiner Transformer (short-hand Seq(....).combine()) if the user desires to have complete control over feature engineering.\n",
It is a TransmogrifAI shortcut to many estimators.
changed
"\n",
"The `.transmogrify()` shortcut is a special AutoML Estimator that applies a default set of transformations to all the specified inputs and combines them into a single vector. This is in essence the automatic feature engineering Stage of TransmogrifAI. This stage can be discarded in favor of hand-tuned feature engineering and manual vector creation followed by combination using the VectorsCombiner Transformer (short-hand Seq(....).combine()) if the user desires to have complete control over feature engineering.\n",
"\n",
"The next stage applies another powerful AutoML Estimator — the SanityChecker. The SanityChecker applies a variety of statistical tests to the data based on Feature types and discards predictors that are indicative of label leakage or that show little to no predictive power. This is in essence the automatic feature selection Stage of TransmogrifAI:"
TransmogrifAI rather than AutoML.
changed
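To make these two stages concrete, here is a minimal Scala sketch of what the notebook presumably does at this point. The feature names follow the notebook's diff; treating `salePrice` as the response feature is an assumption, and the snippet requires the TransmogrifAI dependency, so it is illustrative rather than standalone-runnable:

```scala
// Automatic feature engineering: transform all predictors with sensible
// defaults and combine them into a single feature vector
val features = Seq(lotFrontage, area, lotShape, yrSold, saleType, saleCondition).transmogrify()

// Automatic feature selection: the SanityChecker drops predictors that are
// indicative of label leakage or show little predictive power
// (salePrice as the response is an assumption from context)
val checkedFeatures = salePrice.sanityCheck(features, removeBadFeatures = true)
```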
"cell_type": "markdown",
"metadata": {},
"source": [
"Create an OpWorkflow and call train() on it to create a model."
"Workflow for TransmogrifAI. Takes the final features that the user wants to generate as inputs and constructs the full DAG needed to generate them from those features' lineage. Then fits any estimators in the pipeline DAG to create a sequence of transformations that are saved in a workflow model."
"When we now call ‘train’ on this workflow, it automatically computes and executes the entire DAG of Stages needed to compute the features, fitting all the estimators on the training data in the process. Calling score on the fitted workflow then transforms the underlying training data to produce a DataFrame with all the features manifested. The score method can optionally be passed an evaluator that produces metrics.\n",
"The workflow.train() method fits all of the estimators in the pipeline and returns a pipeline model of only transformers. It uses data loaded as specified by the data reader to generate the initial data set."
changed
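As a sketch of the wiring this description refers to (the `prediction` result feature and the training dataset name are assumptions from context, and the TransmogrifAI dependency is required):

```scala
// Construct the workflow from the final feature(s) the user wants to generate;
// TransmogrifAI builds the full DAG of stages from the feature lineage
val workflow = new OpWorkflow()
  .setResultFeatures(prediction)   // final feature to generate
  .setInputDataset(trainDataset)   // or setReader(...) for a data reader

val model = workflow.train()       // fits all estimators; returns transformers only
val scored = model.score()         // manifests all features as a DataFrame
```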
Please clean up the extract functions, then it looks good!
"metadata": {},
"outputs": [],
"source": [
"val lotShape = FeatureBuilder.Integral[HousingPrices].extract(x =>\n",
I think something like
FeatureBuilder.Integral[HousingPrices].extract(x => (x.lotShape match {
  case "IR1" => 1
  case _ => 0
}).toIntegral).asPredictor
would be more readable
changed
"metadata": {},
"outputs": [],
"source": [
"val yrSold = FeatureBuilder.Integral[HousingPrices].extract(x =>\n",
Please don't use vars in here. Something like this is better:
FeatureBuilder.Integral[HousingPrices].extract(x => (2019 - x.yrSold).toIntegral).asPredictor
fixed
"metadata": {},
"outputs": [],
"source": [
" val saleType = FeatureBuilder.Integral[HousingPrices].extract(x =>\n",
No need for intermediate vals here:
FeatureBuilder.Integral[HousingPrices].extract(x => saleTypeEncoder.get(x.saleType).toIntegral).asPredictor
fixed
"metadata": {},
"outputs": [],
"source": [
"val saleCondition = FeatureBuilder.Integral[HousingPrices].extract(x =>\n",
Same here, cleaner as
FeatureBuilder.Integral[HousingPrices].extract(x =>
saleConditionEncoder.get(x.saleCondition).toIntegral).asPredictor
changed
"source": [
"import org.apache.spark.sql.{Encoders}\n",
"implicit val srEncoder = Encoders.product[HousingPrices]\n",
"val saleTypeEncoder = Map(\"COD\" -> 1, \"CWD\" -> 2, \"Con\" -> 3, \"ConLD\" -> 4,\n",
Why convert these to integers rather than treat them as picklists?
"metadata": {},
"outputs": [],
"source": [
"val yrSold = FeatureBuilder.Integral[HousingPrices].extract(x => (2019 - x.yrSold).toIntegral).asPredictor"
Why not just treat this as a date?
"metadata": {},
"outputs": [],
"source": [
"val saleConditionEncoder = Map(\"Abnorml\" -> 1, \"AdjLand\" -> 2, \"Alloca\" -> 3, \"Family\" -> 4,\n",
Same here, why are you converting to int? This is a picklist.
If they work better in the model as encoded variables, can you use the string indexer before calling .transmogrify()?
"val saleTypeInd = saleType.indexed()"
"val saleCondInd = saleCondition.indexed()"
"val features = Seq(lotFrontage, area, lotShape, yrSold, saleTypeInd, saleCondInd).transmogrify()\n",
This is preferred because model insights will have the correct names if you use our internal indexer
With encoded variables I am not able to use indexed():
<console>:149: error: value indexed is not a member of com.salesforce.op.features.Feature[com.salesforce.op.features.types.Integral]
val saleCondition = FeatureBuilder.Integral[HousingPrices].extract(x => saleConditionEncoder.get(x.saleCondition).toIntegral).asPredictor.indexed()
The indexer is for strings: put in the original string, not the mapped integer value. The indexer will then do the mapping for you. But unlike with the map you are creating, the name-to-index mapping will be saved in the feature metadata, so any feature insights will have the name associated with them. It will also not be treated as an ordered integer in modeling, which would be inappropriate.
changed
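A sketch of the final indexer-based approach described above, assuming `saleType` is a string field on `HousingPrices` (illustrative; requires the TransmogrifAI dependency):

```scala
// Build the feature from the raw string; no hand-written Map is needed
val saleType = FeatureBuilder.Text[HousingPrices]
  .extract(_.saleType.toText).asPredictor

// The internal indexer does the string-to-index mapping and records it in the
// feature metadata, so model insights keep the original category names
val saleTypeInd = saleType.indexed()
```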
LGTM! Thank you for the contribution!
@Jauntbox can you take another look? Your review is blocking.
@Jauntbox cleaned extract functions, please check
Sorry for the delay - LGTM!
@tovbinm please approve
Matthew is on vacation.
@leahmcguire thanks!
Bug fixes:
- Ensure correct metrics despite model failures on some CV folds [#404](#404)
- Fix flaky `ModelInsight` tests [#395](#395)
- Avoid creating `SparseVector`s for LOCO [#377](#377)

New features / updates:
- Model combiner [#385](#399)
- Added new sample for HousingPrices [#365](#365)
- Test to verify that custom metrics appear in model insight metrics [#387](#387)
- Add `FeatureDistribution` to `SerializationFormat`s [#383](#383)
- Add metadata to `OpStandadrdScaler` to allow for descaling [#378](#378)
- Improve json serde error in `evalMetFromJson` [#380](#380)
- Track mean & standard deviation as metrics for numeric features and for text length of text features [#354](#354)
- Making model selectors robust to failing models [#372](#372)
- Use compact and compressed model json by default [#375](#375)
- Descale feature contribution for Linear Regression & Logistic Regression [#345](#345)

Dependency updates:
- Update tika version [#382](#382)
Thanks for the contribution! Before we can merge this, we need @rajdeepd to sign the Salesforce.com Contributor License Agreement.
Thanks for the contribution! Unfortunately we can't verify the commit author(s): Leah McGuire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.
Related issues
#211
Describe the proposed solution
A new, more complex regression example on Jupyter.
Describe alternatives you've considered
none
Additional context
Adding more examples to TransmogrifAI stack