Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate Streaming Histogram into RawFeatureFilter. #179

Closed
wants to merge 23 commits into from

Conversation

marcovivero
Copy link
Contributor

Related issues
No related issue, this is an enhancement.

Describe the proposed solution
Integrating StreamingHistogram with histogram generation which happens in RawFeatureFilter. Text distribution computation has also been separated from numeric distribution computations.

@marcovivero marcovivero changed the title Mv/histogram integration Integrate Streaming Histogram into RawFeatureFilter. Nov 8, 2018
@codecov
Copy link

codecov bot commented Nov 9, 2018

Codecov Report

Merging #179 into master will increase coverage by 0.01%.
The diff coverage is 89.75%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #179      +/-   ##
==========================================
+ Coverage   86.36%   86.38%   +0.01%     
==========================================
  Files         310      311       +1     
  Lines       10136    10304     +168     
  Branches      351      553     +202     
==========================================
+ Hits         8754     8901     +147     
- Misses       1382     1403      +21
Impacted Files Coverage Δ
...scala/com/salesforce/op/features/FeatureLike.scala 42.39% <ø> (ø) ⬆️
.../src/main/scala/com/salesforce/op/OpWorkflow.scala 87.5% <100%> (ø) ⬆️
...a/com/salesforce/op/filters/PreparedFeatures.scala 78.94% <100%> (-1.55%) ⬇️
...sforce/op/utils/stats/RichStreamingHistogram.scala 87.5% <100%> (ø) ⬆️
...m/salesforce/op/utils/kryo/OpKryoRegistrator.scala 97.72% <100%> (+0.29%) ⬆️
.../main/scala/com/salesforce/op/OpWorkflowCore.scala 93.65% <100%> (+0.1%) ⬆️
...om/salesforce/op/filters/FeatureDistribution.scala 93.93% <76.92%> (-4.4%) ⬇️
...main/scala/com/salesforce/op/filters/Summary.scala 81.39% <81.01%> (-18.61%) ⬇️
...a/com/salesforce/op/filters/RawFeatureFilter.scala 91.15% <93.47%> (+1.87%) ⬆️
...main/scala/com/salesforce/op/filters/package.scala 97.77% <97.77%> (ø)
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6ee7cc7...8fa42ac. Read the comment docs.

def plus(l: TextSummary, r: TextSummary): TextSummary = l.merge(r)
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add some comments for the newly added classes, is this for all numerics? (compared to textSummary)?

@sxd929
Copy link
Contributor

sxd929 commented Nov 28, 2018

is there a easy way for us to fall back to original simple binning after this change?

@@ -64,3 +75,162 @@ case object Summary {
}
}
}

class TextSummary(textFormula: TextSummary => Int) extends Serializable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm quite confused here, is old Summary class still in use? or it's completely replaced by the new ones (text summary and numeric summary)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary class seems no longer needed.

if (points.nonEmpty) key -> summary.update(points) else key -> summary
}.toMap

def updateTextSummaries(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how would text bins formula be passed in to this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add textBinsFormula as input argument to both getAllSummaries and updateTextSummaries.


val scoringUnfilled =
if (scoringDistribs.nonEmpty) {
require(scoringDistribs.length == featureSize, "scoring and training features must match")
Copy link
Contributor

@sxd929 sxd929 Nov 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what will happen if scoring Distribution is empty or feature size does not match now, is it checked somewhere?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to separate scoringUnfilled check out from distribMismatches check in getAllReasons.

@@ -38,15 +38,15 @@ object DefaultSelectorParams {
val MaxBin = Array(32) // bins for cont variables in trees - 32 is spark default
val MinInstancesPerNode = Array(10, 100) // spark default 1
val MinInfoGain = Array(0.001, 0.01, 0.1) // spark default 0
val Regularization = Array(0.001, 0.01, 0.1, 0.2) // spark default 0
val Regularization = Array(0.0001, 0.0015, 0.001, 0.01, 0.1) // spark default 0
Copy link
Contributor

@sxd929 sxd929 Nov 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did we make those hyper-parameter change, are there any related experiments?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only related doc I can find: https://salesforce.quip.com/JhIiA1PLIYUY (streaming histogram results)

val (totalCount, responseSummaries, numericSummaries, textSummaries) = sum
val (responseFeatures, numericFeatures, textFeatures) = feat

def updateNumericSummaries(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might be just my personal feelings, is this nested too much? e.g. three levels nested method

}

// This will initialize existing text summary hashing TFs
textSummaries.foreach { case (_, textSum) => textSum.setHashingTF() }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps call setHashingTF when creating a new TextSummary (i.e., in updateTextSummaries)?

.getOrElse(key -> summary)
}.toMap

val newResponseSummaries = updateNumericSummaries(responseSummaries, responseFeatures)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation

@marcovivero marcovivero closed this Mar 4, 2019
@tovbinm
Copy link
Collaborator

tovbinm commented Mar 4, 2019

@marcovivero are you not going to work on this or you are making a clean / smaller one?

@tovbinm tovbinm mentioned this pull request Jul 11, 2019
@tovbinm tovbinm deleted the mv/histogram-integration branch November 26, 2019 05:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants