Fix so not preparing data twice when calling model selector fit method #251

Merged
leahmcguire merged 7 commits into master from lm/leakfix on Mar 28, 2019

Conversation

leahmcguire
Collaborator

Related issues
The prepare method on the splitters had an unclear purpose: it was used both to measure the properties of the data and to resample the data for cross validation / the training split. As a result, the fit method of the model selectors had a bug where it would rebalance the data before cross validation as well as during cross validation, leading to potential label leakage. This was not generally a problem, because within a workflow we call the findBestEstimator method rather than fit, but it could produce bad results for anyone using the model selectors as a standalone Spark stage.

Describe the proposed solution
The base splitter class now has two methods that are called independently: one to examine the data and decide how to clean or rebalance it, and one to actually perform the cleaning or rebalancing (see the sketch below).

Describe alternatives you've considered
Make prepare estimate only once; however, this could lead to unexpected results.
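
For illustration, here is a minimal sketch of the two-method design, with simplified signatures; the names preValidationPrepare and validationPrepare follow the review discussion below, and SplitterSummary stands in for the real summary type:

```scala
import org.apache.spark.sql.{Dataset, Row}

// Simplified stand-in for the real summary of what was measured on the data.
trait SplitterSummary

// Minimal sketch of the two-step splitter contract described in this PR.
abstract class Splitter {
  // Set by preValidationPrepare; holds the parameters decided from the data.
  protected var summary: Option[SplitterSummary] = None

  // Step 1: measure the data and decide how to clean or rebalance it.
  def preValidationPrepare(data: Dataset[Row]): Option[SplitterSummary]

  // Step 2: apply the cleaning / rebalancing decided in step 1. Called inside
  // cross validation, so each fold is rebalanced exactly once.
  def validationPrepare(data: Dataset[Row]): Dataset[Row]
}
```

With this split, the model selector's fit can call preValidationPrepare once up front and leave validationPrepare to the cross-validation folds, instead of rebalancing the full data set and then rebalancing again per fold.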

@codecov

codecov bot commented Mar 26, 2019

Codecov Report

Merging #251 into master will decrease coverage by 3.94%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #251      +/-   ##
==========================================
- Coverage   86.53%   82.59%   -3.95%     
==========================================
  Files         314      314              
  Lines       10297    10301       +4     
  Branches      331      537     +206     
==========================================
- Hits         8911     8508     -403     
- Misses       1386     1793     +407
| Impacted Files | Coverage Δ |
|---|---|
| ...om/salesforce/op/stages/impl/tuning/Splitter.scala | 96.87% <ø> (ø) ⬆️ |
| ...salesforce/op/stages/impl/tuning/OpValidator.scala | 94.11% <ø> (ø) ⬆️ |
| ...alesforce/op/stages/impl/tuning/DataBalancer.scala | 96.22% <100%> (+0.03%) ⬆️ |
| ...sforce/op/stages/impl/selector/ModelSelector.scala | 98.14% <100%> (+0.07%) ⬆️ |
| ...alesforce/op/stages/impl/tuning/DataSplitter.scala | 60% <100%> (ø) ⬆️ |
| .../salesforce/op/stages/impl/tuning/DataCutter.scala | 95.74% <100%> (+0.09%) ⬆️ |
| ...alesforce/op/cli/gen/templates/SimpleProject.scala | 0% <0%> (-100%) ⬇️ |
| .../scala/com/salesforce/op/cli/gen/ProblemKind.scala | 0% <0%> (-100%) ⬇️ |
| ...cala/com/salesforce/op/cli/gen/FileInProject.scala | 0% <0%> (-100%) ⬇️ |
| ...in/scala/com/salesforce/op/cli/CommandParser.scala | 0% <0%> (-98.12%) ⬇️ |
| ... and 8 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2566a08...1cedbbe

@codecov

codecov bot commented Mar 26, 2019

Codecov Report

Merging #251 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #251      +/-   ##
==========================================
- Coverage   86.56%   86.55%   -0.02%     
==========================================
  Files         314      314              
  Lines       10294    10298       +4     
  Branches      339      342       +3     
==========================================
+ Hits         8911     8913       +2     
- Misses       1383     1385       +2
| Impacted Files | Coverage Δ |
|---|---|
| ...om/salesforce/op/stages/impl/tuning/Splitter.scala | 96.87% <ø> (ø) ⬆️ |
| ...salesforce/op/stages/impl/tuning/OpValidator.scala | 94.11% <ø> (ø) ⬆️ |
| ...alesforce/op/stages/impl/tuning/DataBalancer.scala | 96.22% <100%> (+0.03%) ⬆️ |
| ...sforce/op/stages/impl/selector/ModelSelector.scala | 98.14% <100%> (+0.07%) ⬆️ |
| ...alesforce/op/stages/impl/tuning/DataSplitter.scala | 60% <100%> (ø) ⬆️ |
| .../salesforce/op/stages/impl/tuning/DataCutter.scala | 95.74% <100%> (+0.09%) ⬆️ |
| ...es/src/main/scala/com/salesforce/op/OpParams.scala | 85.71% <0%> (-4.09%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3e89a43...bc0ea7f

* @param data data set to examine
* @return Parameters set in examining data
*/
def examine(data: Dataset[Row]): Option[SplitterSummary]
Contributor

This function is designed to have side effects for prepare, whereas examine sounds read-only. Can we call it something like setupPrepare or prePrepare?

Collaborator Author

@leahmcguire leahmcguire Mar 27, 2019

what about assessForPrepare?

Contributor

@gerashegalov gerashegalov Mar 28, 2019

Same read-only connotation. I think your current name, preValidationPrepare, is good.

@@ -113,6 +113,7 @@ E <: Estimator[_] with OpPipelineStage2[RealNN, OPVector, Prediction]]
protected[op] def findBestEstimator(data: Dataset[_], dag: StagesDAG, persistEveryKStages: Int = 0)
(implicit spark: SparkSession): Unit = {

splitter.map(_.examine(data.select(labelColName).toDF()))
Contributor

Since we don't use the result of the transformation, let's do splitter.foreach(_.examine(...)) instead.
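
For illustration, a minimal, self-contained sketch of the map vs. foreach distinction on an Option when only the side effect matters (examineData and maybeLabel are hypothetical stand-ins, not code from this PR):

```scala
object ForeachVsMap {
  // Hypothetical stand-in for the real examine call; only its side effect matters.
  def examineData(label: String): Unit = println(s"examining $label")

  val maybeLabel: Option[String] = Some("label")

  maybeLabel.map(examineData)      // produces an Option[Unit] that is thrown away
  maybeLabel.foreach(examineData)  // same effect, and the intent is explicit
}
```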

@tovbinm
Collaborator

tovbinm commented Mar 27, 2019

@leahmcguire please update the PR title

@leahmcguire leahmcguire changed the title from "Lm/leakfix" to "Fix so not preparing data twice when calling model selector fit method" on Mar 27, 2019
Contributor

@gerashegalov gerashegalov left a comment

LGTM, modulo the rename follow-up work.

/**
* Split into a training set and a test set and balance the training set
* Function used to examine the data set to set parameters for preparation
Contributor

Search/replace examine, sorry for the post-rename work.

@michaelweilsalesforce
Contributor

I don't recall the bug, but it's been a while. I remember something like the first call of the Splitter would "examine", but that doesn't seem to be the case anymore.

@leahmcguire
Collaborator Author

@mweilsalesforce yes, that is what I remember as well, but it is not the case now. It is hard to unravel the history, since this got moved to the public repo around the time this code went in.

Contributor

@michaelweilsalesforce michaelweilsalesforce left a comment

LGTM

@leahmcguire leahmcguire merged commit 3aa144a into master Mar 28, 2019
@leahmcguire leahmcguire deleted the lm/leakfix branch March 28, 2019 18:27
Collaborator

@tovbinm tovbinm left a comment

LGTM, one comment.

*/
def validationPrepare(data: Dataset[Row]): Dataset[Row] = {

if (summary.isEmpty) throw new RuntimeException("Cannot call prepare until examine has been called")
Collaborator

examine was renamed to preValidationPrepare, so please update the error message.
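
For reference, a hedged sketch of what the updated guard might look like after the rename; the exact message wording is an assumption, the fragment reuses summary and Dataset[Row] from the snippet quoted above, and it returns data unchanged only to keep the fragment minimal:

```scala
def validationPrepare(data: Dataset[Row]): Dataset[Row] = {
  // Error message updated to point at the renamed method (wording assumed).
  if (summary.isEmpty) throw new RuntimeException(
    "Cannot call validationPrepare until preValidationPrepare has been called")
  data
}
```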

@salesforce-cla

salesforce-cla bot commented Dec 1, 2020

Thanks for the contribution! Unfortunately we can't verify the commit author(s): leahmcguire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.
