-
Notifications
You must be signed in to change notification settings - Fork 393
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate Streaming Histogram into RawFeatureFilter. #179
Conversation
…histogram-integration
…histogram-integration
…histogram-integration
Codecov Report
@@ Coverage Diff @@
## master #179 +/- ##
==========================================
+ Coverage 86.36% 86.38% +0.01%
==========================================
Files 310 311 +1
Lines 10136 10304 +168
Branches 351 553 +202
==========================================
+ Hits 8754 8901 +147
- Misses 1382 1403 +21
Continue to review full report at Codecov.
|
…smogrifAI into mv/histogram-integration
def plus(l: TextSummary, r: TextSummary): TextSummary = l.merge(r) | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you add some comments for the newly added classes, is this for all numerics? (compared to textSummary)?
is there a easy way for us to fall back to original simple binning after this change? |
@@ -64,3 +75,162 @@ case object Summary { | |||
} | |||
} | |||
} | |||
|
|||
class TextSummary(textFormula: TextSummary => Int) extends Serializable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm quite confused here, is old Summary class still in use? or it's completely replaced by the new ones (text summary and numeric summary)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary class seems no longer needed.
if (points.nonEmpty) key -> summary.update(points) else key -> summary | ||
}.toMap | ||
|
||
def updateTextSummaries( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how would text bins formula be passed in to this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add textBinsFormula as input argument to both getAllSummaries
and updateTextSummaries
.
|
||
val scoringUnfilled = | ||
if (scoringDistribs.nonEmpty) { | ||
require(scoringDistribs.length == featureSize, "scoring and training features must match") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what will happen if scoring Distribution is empty or feature size does not match now, is it checked somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to separate scoringUnfilled check out from distribMismatches check in getAllReasons
.
@@ -38,15 +38,15 @@ object DefaultSelectorParams { | |||
val MaxBin = Array(32) // bins for cont variables in trees - 32 is spark default | |||
val MinInstancesPerNode = Array(10, 100) // spark default 1 | |||
val MinInfoGain = Array(0.001, 0.01, 0.1) // spark default 0 | |||
val Regularization = Array(0.001, 0.01, 0.1, 0.2) // spark default 0 | |||
val Regularization = Array(0.0001, 0.0015, 0.001, 0.01, 0.1) // spark default 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did we make those hyper-parameter change, are there any related experiments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only related doc I can find: https://salesforce.quip.com/JhIiA1PLIYUY (streaming histogram results)
val (totalCount, responseSummaries, numericSummaries, textSummaries) = sum | ||
val (responseFeatures, numericFeatures, textFeatures) = feat | ||
|
||
def updateNumericSummaries( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might be just my personal feelings, is this nested too much? e.g. three levels nested method
} | ||
|
||
// This will initialize existing text summary hashing TFs | ||
textSummaries.foreach { case (_, textSum) => textSum.setHashingTF() } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps call setHashingTF when creating a new TextSummary (i.e., in updateTextSummaries)?
.getOrElse(key -> summary) | ||
}.toMap | ||
|
||
val newResponseSummaries = updateNumericSummaries(responseSummaries, responseFeatures) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation
…ansmogrifAI into mv/histogram-integration
…histogram-integration
@marcovivero are you not going to work on this or you are making a clean / smaller one? |
Related issues
No related issue, this is an enhancement.
Describe the proposed solution
Integrating StreamingHistogram with histogram generation which happens in RawFeatureFilter. Text distribution computation has also been separated from numeric distribution computations.