-
Notifications
You must be signed in to change notification settings - Fork 393
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure data is prepare
d even if there is no DAG
#246
Ensure data is prepare
d even if there is no DAG
#246
Conversation
Codecov Report
@@ Coverage Diff @@
## master #246 +/- ##
=========================================
Coverage ? 79.15%
=========================================
Files ? 312
Lines ? 10190
Branches ? 541
=========================================
Hits ? 8066
Misses ? 2124
Partials ? 0
Continue to review full report at Codecov.
|
training: DataFrame, | ||
validation: DataFrame | ||
): (DataFrame, DataFrame) = { | ||
val trainingValidation = Vector(training, validation).flatMap(df => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
val Array(prepTraining, prepValidation) = Array(training, validation)...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tovbinm Are there any guidelines for coding style that should be used when contributing to Transmogrify describing details like this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope. We currently rely on scalastyle. I was considering to add a scalafmt to auto format the code, but I struggled with creating a good style. We might use this tool to autogen a config for us? https://github.com/tanishiking/scalaunfmt
s.prepare(selectTrain).train, | ||
s.prepare(selectTest).train) | ||
).getOrElse((selectTrain, selectTest)) | ||
val balancedTrainTest = Vector(newTrain, newTest) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
.setInput(label, features) | ||
|
||
testEstimator.fit(labelsBeyondMaxIndexed) | ||
assert(dataCutter.getLabelsToKeep.size === dataCutter.getMaxLabelCategories) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
withClue(<some useful message in case of error>) {
dataCutter.getLabelsToKeep.size shouldBe dataCutter.getMaxLabelCategories
}
@@ -98,7 +98,7 @@ class DataCutter(uid: String = UID[DataCutter]) extends Splitter(uid = uid) with | |||
val dataUse = data.filter(r => keep.contains(r.getDouble(0))) | |||
val summary = DataCutterSummary(labelsKept = getLabelsToKeep, labelsDropped = getLabelsToDrop) | |||
|
|||
ModelData(dataUse, Some(summary)) | |||
ModelData(dataUse.persist(), Some(summary)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure we dont persist somewhere upstream?
my 5c: my problem with persisting inside methods is that users of the methods wont know that they have to unpersist. So I prefer having methods where the input is persisted and unpersisted at the end of the same method. @leahmcguire what are your thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not quite enough for some cases, however it is probably better to control from the workflow. In theory spark may have gotten better at knowing when to forget but I haven't tested it in a while.
@gerashegalov should we close this PR since we changed the way the Splitters work? |
@leahmcguire yes, and it was not quite addressing the prod issue yet. I will open another PR once we have the fix confirmed |
Related issues
#245
Describe the proposed solution
WIP investigating
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Add any other context about the changes here.