
Ensure data is prepared even if there is no DAG #246

Closed

Conversation

gerashegalov (Contributor)

Related issues
#245

Describe the proposed solution
WIP investigating


@codecov (codecov bot) commented Mar 15, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@a5df82e).
The diff coverage is 100%.


@@            Coverage Diff            @@
##             master     #246   +/-   ##
=========================================
  Coverage          ?   79.15%           
=========================================
  Files             ?      312           
  Lines             ?    10190           
  Branches          ?      541           
=========================================
  Hits              ?     8066           
  Misses            ?     2124           
  Partials          ?        0
Impacted Files Coverage Δ
...op/stages/impl/tuning/OpTrainValidationSplit.scala 100% <ø> (ø)
...orce/op/stages/impl/tuning/OpCrossValidation.scala 97.67% <ø> (ø)
.../salesforce/op/stages/impl/tuning/DataCutter.scala 95.65% <100%> (ø)
...salesforce/op/stages/impl/tuning/OpValidator.scala 94.36% <100%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a5df82e...0055894. Read the comment docs.

training: DataFrame,
validation: DataFrame
): (DataFrame, DataFrame) = {
val trainingValidation = Vector(training, validation).flatMap(df =>
Collaborator:

val Array(prepTraining, prepValidation) = Array(training, validation)...
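The suggestion above can be sketched in plain Scala. This is a minimal, dependency-free sketch: `String` stands in for Spark's `DataFrame`, and `prepare` is a hypothetical stand-in for the splitter's preparation step, not the project's actual method.

```scala
// Sketch of the suggested destructuring pattern: pattern-match on the
// mapped Array instead of building a Vector and indexing into it.
object DestructureSketch {
  // Hypothetical preparation step (stand-in for the real splitter logic).
  def prepare(df: String): String = s"$df (prepared)"

  def prepareBoth(training: String, validation: String): (String, String) = {
    // Binds both prepared frames in one line, as the reviewer suggests.
    val Array(prepTraining, prepValidation) = Array(training, validation).map(prepare)
    (prepTraining, prepValidation)
  }
}
```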

Contributor:
@tovbinm Are there any coding-style guidelines for contributing to Transmogrify that cover details like this one?

Collaborator:
Nope. We currently rely on scalastyle. I was considering adding scalafmt to auto-format the code, but I struggled with creating a good style. We might use this tool to auto-generate a config for us: https://github.com/tanishiking/scalaunfmt
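For reference, adopting scalafmt only requires a `.scalafmt.conf` file (HOCON) at the project root. A minimal sketch follows; the version and settings are purely illustrative, not a recommendation for this project:

```
version = "2.0.0"  # scalafmt release to pin (illustrative)
maxColumn = 120    # line-length limit (assumed to match the scalastyle setting)
align = none       # minimal token alignment, reduces diff churn on refactors
```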

s.prepare(selectTrain).train,
s.prepare(selectTest).train)
).getOrElse((selectTrain, selectTest))
val balancedTrainTest = Vector(newTrain, newTest)
Collaborator:
Same here.

.setInput(label, features)

testEstimator.fit(labelsBeyondMaxIndexed)
assert(dataCutter.getLabelsToKeep.size === dataCutter.getMaxLabelCategories)
Collaborator:
withClue(<some useful message in case of error>) {
   dataCutter.getLabelsToKeep.size shouldBe dataCutter.getMaxLabelCategories
}
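The `withClue` pattern suggested above comes from ScalaTest, where it prepends a descriptive message to an assertion failure. A dependency-free sketch of the same idea, mimicking the helper with plain `assert` so the snippet runs without ScalaTest:

```scala
// Sketch of ScalaTest's withClue idea: wrap an assertion so that a
// failure message carries a diagnostic clue, e.g. which sizes differed.
object ClueSketch {
  def withClue[A](clue: String)(block: => A): A =
    try block
    catch {
      case e: AssertionError =>
        // Rethrow with the clue prepended to the original message.
        throw new AssertionError(s"$clue ${Option(e.getMessage).getOrElse("")}".trim)
    }
}
```

Usage in the test above would look like `ClueSketch.withClue("labels kept should hit the max-categories cap:") { assert(...) }`, so a bare "assertion failed" becomes actionable.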

@@ -98,7 +98,7 @@ class DataCutter(uid: String = UID[DataCutter]) extends Splitter(uid = uid) with
val dataUse = data.filter(r => keep.contains(r.getDouble(0)))
val summary = DataCutterSummary(labelsKept = getLabelsToKeep, labelsDropped = getLabelsToDrop)

-    ModelData(dataUse, Some(summary))
+    ModelData(dataUse.persist(), Some(summary))
@tovbinm (Collaborator) commented Mar 15, 2019:

Are you sure we don't persist somewhere upstream?

My 5c: my problem with persisting inside methods is that users of those methods won't know that they have to unpersist. So I prefer methods where the input is persisted and unpersisted within the same method. @leahmcguire what are your thoughts on this?

Collaborator:

It is not quite enough for some cases; however, it is probably better to control this from the workflow. In theory Spark may have gotten better at knowing when to forget cached data, but I haven't tested it in a while.
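The caching discipline proposed in this exchange — persist the input at the start of a method and unpersist it before returning, so callers never inherit hidden cached state — can be sketched without Spark. This is not TransmogrifAI code; a tiny `Dataset` stub with persist/unpersist bookkeeping stands in for a Spark DataFrame:

```scala
// Sketch of scoped persistence: the same method that persists also
// unpersists, even if the body throws.
object PersistScopeSketch {
  // Stand-in for a Spark DataFrame, tracking only the cached flag.
  final class Dataset {
    var persisted: Boolean = false
    def persist(): this.type = { persisted = true; this }
    def unpersist(): this.type = { persisted = false; this }
  }

  // Run `body` with `data` persisted, guaranteeing unpersist on exit.
  def withPersisted[A](data: Dataset)(body: Dataset => A): A = {
    data.persist()
    try body(data)
    finally data.unpersist()
  }
}
```

With real Spark, the same shape keeps `persist()`/`unpersist()` paired inside one method, which is the property tovbinm argues for above.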

@leahmcguire (Collaborator)

@gerashegalov should we close this PR since we changed the way the Splitters work?

@gerashegalov (Contributor, Author)

@leahmcguire yes, and it was not quite addressing the prod issue yet. I will open another PR once we have the fix confirmed.
