
Ensure data is prepared even if there is no DAG #246

Closed

Conversation

gerashegalov (Contributor)

Related issues
#245

Describe the proposed solution
WIP investigating


@codecov (codecov bot) commented Mar 15, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@a5df82e).
The diff coverage is 100%.


@@            Coverage Diff            @@
##             master     #246   +/-   ##
=========================================
  Coverage          ?   79.15%           
=========================================
  Files             ?      312           
  Lines             ?    10190           
  Branches          ?      541           
=========================================
  Hits              ?     8066           
  Misses            ?     2124           
  Partials          ?        0
Impacted Files Coverage Δ
...op/stages/impl/tuning/OpTrainValidationSplit.scala 100% <ø> (ø)
...orce/op/stages/impl/tuning/OpCrossValidation.scala 97.67% <ø> (ø)
.../salesforce/op/stages/impl/tuning/DataCutter.scala 95.65% <100%> (ø)
...salesforce/op/stages/impl/tuning/OpValidator.scala 94.36% <100%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a5df82e...0055894. Read the comment docs.

training: DataFrame,
validation: DataFrame
): (DataFrame, DataFrame) = {
val trainingValidation = Vector(training, validation).flatMap(df =>
Collaborator:

val Array(prepTraining, prepValidation) = Array(training, validation)...
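The suggestion above can be sketched in plain Scala. This is a minimal, dependency-free sketch: `String` stands in for Spark's `DataFrame`, and `prepare` is a hypothetical stand-in for the splitter's preparation step, not the project's actual method.

```scala
// Sketch of the suggested destructuring pattern: pattern-match on the
// mapped Array instead of building a Vector and indexing into it.
object DestructureSketch {
  // Hypothetical preparation step (stand-in for the real splitter logic).
  def prepare(df: String): String = s"$df (prepared)"

  def prepareBoth(training: String, validation: String): (String, String) = {
    // Binds both prepared frames in one line, as the reviewer suggests.
    val Array(prepTraining, prepValidation) = Array(training, validation).map(prepare)
    (prepTraining, prepValidation)
  }
}
```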

Contributor:
@tovbinm Are there any coding-style guidelines for contributing to Transmogrify that cover details like this one?

Collaborator:
Nope. We currently rely on scalastyle. I was considering adding scalafmt to auto-format the code, but I struggled with creating a good style. We might use this tool to auto-generate a config for us: https://github.com/tanishiking/scalaunfmt
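For reference, adopting scalafmt only requires a `.scalafmt.conf` file (HOCON) at the project root. A minimal sketch follows; the version and settings are purely illustrative, not a recommendation for this project:

```
version = "2.0.0"  # scalafmt release to pin (illustrative)
maxColumn = 120    # line-length limit (assumed to match the scalastyle setting)
align = none       # minimal token alignment, reduces diff churn on refactors
```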

s.prepare(selectTrain).train,
s.prepare(selectTest).train)
).getOrElse((selectTrain, selectTest))
val balancedTrainTest = Vector(newTrain, newTest)
Collaborator:
Same here.

.setInput(label, features)

testEstimator.fit(labelsBeyondMaxIndexed)
assert(dataCutter.getLabelsToKeep.size === dataCutter.getMaxLabelCategories)
Collaborator:
withClue(<some useful message in case of error>) {
   dataCutter.getLabelsToKeep.size shouldBe dataCutter.getMaxLabelCategories
}
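The `withClue` pattern suggested above comes from ScalaTest, where it prepends a descriptive message to an assertion failure. A dependency-free sketch of the same idea, mimicking the helper with plain `assert` so the snippet runs without ScalaTest:

```scala
// Sketch of ScalaTest's withClue idea: wrap an assertion so that a
// failure message carries a diagnostic clue, e.g. which sizes differed.
object ClueSketch {
  def withClue[A](clue: String)(block: => A): A =
    try block
    catch {
      case e: AssertionError =>
        // Rethrow with the clue prepended to the original message.
        throw new AssertionError(s"$clue ${Option(e.getMessage).getOrElse("")}".trim)
    }
}
```

Usage in the test above would look like `ClueSketch.withClue("labels kept should hit the max-categories cap:") { assert(...) }`, so a bare "assertion failed" becomes actionable.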

@@ -98,7 +98,7 @@ class DataCutter(uid: String = UID[DataCutter]) extends Splitter(uid = uid) with
val dataUse = data.filter(r => keep.contains(r.getDouble(0)))
val summary = DataCutterSummary(labelsKept = getLabelsToKeep, labelsDropped = getLabelsToDrop)

-    ModelData(dataUse, Some(summary))
+    ModelData(dataUse.persist(), Some(summary))
@tovbinm (Collaborator) commented Mar 15, 2019:

Are you sure we don't persist somewhere upstream?

My 5c: my problem with persisting inside methods is that users of those methods won't know that they have to unpersist. So I prefer methods where the input is persisted and unpersisted within the same method. @leahmcguire what are your thoughts on this?

Collaborator:

It is not quite enough for some cases; however, it is probably better to control this from the workflow. In theory Spark may have gotten better at knowing when to forget cached data, but I haven't tested it in a while.
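The caching discipline proposed in this exchange — persist the input at the start of a method and unpersist it before returning, so callers never inherit hidden cached state — can be sketched without Spark. This is not TransmogrifAI code; a tiny `Dataset` stub with persist/unpersist bookkeeping stands in for a Spark DataFrame:

```scala
// Sketch of scoped persistence: the same method that persists also
// unpersists, even if the body throws.
object PersistScopeSketch {
  // Stand-in for a Spark DataFrame, tracking only the cached flag.
  final class Dataset {
    var persisted: Boolean = false
    def persist(): this.type = { persisted = true; this }
    def unpersist(): this.type = { persisted = false; this }
  }

  // Run `body` with `data` persisted, guaranteeing unpersist on exit.
  def withPersisted[A](data: Dataset)(body: Dataset => A): A = {
    data.persist()
    try body(data)
    finally data.unpersist()
  }
}
```

With real Spark, the same shape keeps `persist()`/`unpersist()` paired inside one method, which is the property tovbinm argues for above.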

@leahmcguire (Collaborator)

@gerashegalov should we close this PR since we changed the way the Splitters work?

@gerashegalov (Contributor, Author)

@leahmcguire yes, and it was not quite addressing the prod issue yet. I will open another PR once we have the fix confirmed.
