Integrate Streaming Histogram into RawFeatureFilter. #179

marcovivero · 2018-11-08T00:25:31Z

Related issues
No related issue, this is an enhancement.

Describe the proposed solution
Integrates StreamingHistogram into the histogram generation that happens in RawFeatureFilter. Text distribution computation is now also separated from numeric distribution computation.
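
For readers unfamiliar with the idea, here is a minimal sketch of a streaming histogram, assuming the common approach of keeping a bounded number of (centroid, count) bins and merging the two closest bins whenever the bound is exceeded. `Bin` and `SketchHistogram` are hypothetical names for illustration; this is not the StreamingHistogram class used in this PR.

```scala
// Hypothetical sketch: NOT the StreamingHistogram class from this PR.
final case class Bin(centroid: Double, count: Long)

final class SketchHistogram(maxBins: Int, val bins: Vector[Bin] = Vector.empty) {
  require(maxBins >= 1, "maxBins must be positive")

  /** Add one observation, then shrink back to at most maxBins bins. */
  def update(x: Double): SketchHistogram =
    new SketchHistogram(maxBins, SketchHistogram.compress(bins :+ Bin(x, 1L), maxBins))

  /** Combine two histograms, e.g. partial results from two partitions. */
  def merge(other: SketchHistogram): SketchHistogram =
    new SketchHistogram(maxBins, SketchHistogram.compress(bins ++ other.bins, maxBins))

  def totalCount: Long = bins.map(_.count).sum
}

object SketchHistogram {
  /** Sort by centroid and repeatedly merge the two closest neighbouring bins. */
  def compress(raw: Vector[Bin], maxBins: Int): Vector[Bin] = {
    var sorted = raw.sortBy(_.centroid)
    while (sorted.length > maxBins) {
      // Index of the left bin of the closest adjacent pair
      val i = sorted.indices.init.minBy(j => sorted(j + 1).centroid - sorted(j).centroid)
      val a = sorted(i)
      val b = sorted(i + 1)
      val n = a.count + b.count
      // The merged centroid is the count-weighted mean of the two centroids
      val merged = Bin((a.centroid * a.count + b.centroid * b.count) / n, n)
      sorted = sorted.patch(i, Seq(merged), 2)
    }
    sorted
  }
}
```

Because `merge` only needs the bins of both sides, partial histograms can be built per partition and combined afterwards, which is what makes this kind of structure attractive for distribution computation over a distributed dataset.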

codecov · 2018-11-09T18:57:56Z

Codecov Report

Merging #179 into master will increase coverage by 0.01%.
The diff coverage is 89.75%.

@@            Coverage Diff             @@
##           master     #179      +/-   ##
==========================================
+ Coverage   86.36%   86.38%   +0.01%     
==========================================
  Files         310      311       +1     
  Lines       10136    10304     +168     
  Branches      351      553     +202     
==========================================
+ Hits         8754     8901     +147     
- Misses       1382     1403      +21

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...scala/com/salesforce/op/features/FeatureLike.scala | `42.39% <ø> (ø)` | ⬆️ |
| .../src/main/scala/com/salesforce/op/OpWorkflow.scala | `87.5% <100%> (ø)` | ⬆️ |
| ...a/com/salesforce/op/filters/PreparedFeatures.scala | `78.94% <100%> (-1.55%)` | ⬇️ |
| ...sforce/op/utils/stats/RichStreamingHistogram.scala | `87.5% <100%> (ø)` | ⬆️ |
| ...m/salesforce/op/utils/kryo/OpKryoRegistrator.scala | `97.72% <100%> (+0.29%)` | ⬆️ |
| .../main/scala/com/salesforce/op/OpWorkflowCore.scala | `93.65% <100%> (+0.1%)` | ⬆️ |
| ...om/salesforce/op/filters/FeatureDistribution.scala | `93.93% <76.92%> (-4.4%)` | ⬇️ |
| ...main/scala/com/salesforce/op/filters/Summary.scala | `81.39% <81.01%> (-18.61%)` | ⬇️ |
| ...a/com/salesforce/op/filters/RawFeatureFilter.scala | `91.15% <93.47%> (+1.87%)` | ⬆️ |
| ...main/scala/com/salesforce/op/filters/package.scala | `97.77% <97.77%> (ø)` | |
... and 2 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6ee7cc7...8fa42ac.

sxd929 · 2018-11-27T23:53:27Z

core/src/main/scala/com/salesforce/op/filters/Summary.scala

+ def plus(l: TextSummary, r: TextSummary): TextSummary = l.merge(r)
+ }
+}
+


Could you add some comments for the newly added classes? Is this for all numerics (as compared to TextSummary)?
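
To illustrate the kind of documentation being asked for, here is a commented sketch of a mergeable numeric summary whose `plus` delegates to `merge`, mirroring the `plus(l, r) = l.merge(r)` shape in the diff above. `SimpleMonoid`, `MinMaxCount` and `SummarySketch` are hypothetical stand-ins and make no claim about the actual classes in Summary.scala.

```scala
// Hypothetical stand-ins for illustration; not the classes in Summary.scala.

/** Minimal monoid: an identity element plus an associative combine. */
trait SimpleMonoid[A] {
  def zero: A
  def plus(l: A, r: A): A
}

/** Running min, max and count of the numeric values seen for one feature. */
final case class MinMaxCount(min: Double, max: Double, count: Long) {
  /** Combine two partial summaries, e.g. from two partitions. */
  def merge(that: MinMaxCount): MinMaxCount =
    MinMaxCount(math.min(min, that.min), math.max(max, that.max), count + that.count)
}

object MinMaxCount {
  /** Summary of a single observation. */
  def of(x: Double): MinMaxCount = MinMaxCount(x, x, 1L)

  implicit val monoid: SimpleMonoid[MinMaxCount] = new SimpleMonoid[MinMaxCount] {
    // Merging with zero leaves any summary unchanged
    val zero: MinMaxCount = MinMaxCount(Double.PositiveInfinity, Double.NegativeInfinity, 0L)
    def plus(l: MinMaxCount, r: MinMaxCount): MinMaxCount = l.merge(r)
  }
}

object SummarySketch {
  /** Fold any sequence of summaries using the monoid in implicit scope. */
  def combineAll[A](xs: Seq[A])(implicit m: SimpleMonoid[A]): A =
    xs.foldLeft(m.zero)(m.plus)
}
```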

sxd929 · 2018-11-28T18:07:46Z

Is there an easy way for us to fall back to the original simple binning after this change?
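
As a point of reference for what such a fallback could look like, here is a sketch of plain equal-width binning, assuming "simple binning" means fixed-width bins between a known min and max. `SimpleBinning` and its signature are hypothetical and not taken from the codebase.

```scala
// Hypothetical fallback sketch; assumes "simple binning" means equal-width bins.
object SimpleBinning {
  /** Count values into `bins` equal-width buckets spanning [min, max]. */
  def histogram(values: Seq[Double], min: Double, max: Double, bins: Int): Array[Long] = {
    require(bins >= 1, "bins must be positive")
    require(max >= min, "max must not be smaller than min")
    val counts = new Array[Long](bins)
    val width = (max - min) / bins
    values.foreach { v =>
      // Degenerate range: everything goes to the first bin.
      // Otherwise clamp so that out-of-range values and v == max land in an edge bin.
      val idx =
        if (width == 0.0) 0
        else math.min(bins - 1, math.max(0, ((v - min) / width).toInt))
      counts(idx) += 1L
    }
    counts
  }
}
```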

sxd929 · 2018-11-28T18:17:52Z

core/src/main/scala/com/salesforce/op/filters/Summary.scala

@@ -64,3 +75,162 @@ case object Summary {
 }
 }
 }
+
+class TextSummary(textFormula: TextSummary => Int) extends Serializable {


I'm quite confused here: is the old Summary class still in use, or is it completely replaced by the new ones (text summary and numeric summary)?

The Summary class seems to be no longer needed.

sxd929 · 2018-11-28T18:38:19Z

core/src/main/scala/com/salesforce/op/filters/RawFeatureFilter.scala

+ if (points.nonEmpty) key -> summary.update(points) else key -> summary
+ }.toMap
+
+ def updateTextSummaries(


How would the text bins formula be passed in to this?

I think we should add textBinsFormula as an input argument to both getAllSummaries and updateTextSummaries.
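
A rough sketch of that suggestion: the formula is threaded through as an explicit function argument instead of being captured elsewhere. `TextStats`, `TextBinning`, `cappedFormula` and the signatures below are hypothetical simplifications; the real getAllSummaries and updateTextSummaries take different inputs.

```scala
// Hypothetical names and signatures, for illustration only.
final case class TextStats(distinctTokens: Int, totalTokens: Long)

object TextBinning {
  /** A formula that decides how many bins a text feature gets. */
  type TextBinsFormula = TextStats => Int

  /** Example formula: one bin per distinct token, capped at maxBins, at least one. */
  def cappedFormula(maxBins: Int): TextBinsFormula =
    stats => math.max(1, math.min(maxBins, stats.distinctTokens))

  /** The formula is an explicit argument, so callers control it directly. */
  def updateTextSummaries(
    stats: Map[String, TextStats],
    textBinsFormula: TextBinsFormula
  ): Map[String, Int] =
    stats.map { case (key, s) => key -> textBinsFormula(s) }

  /** The outer entry point only forwards the formula it was given. */
  def getAllSummaries(
    stats: Map[String, TextStats],
    textBinsFormula: TextBinsFormula
  ): Map[String, Int] =
    updateTextSummaries(stats, textBinsFormula)
}
```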

sxd929 · 2018-11-28T18:47:03Z

core/src/main/scala/com/salesforce/op/filters/RawFeatureFilter.scala

-
- val scoringUnfilled =
- if (scoringDistribs.nonEmpty) {
- require(scoringDistribs.length == featureSize, "scoring and training features must match")


What happens now if the scoring distribution is empty or the feature size does not match? Is it checked somewhere?

I think it's better to separate the scoringUnfilled check from the distribMismatches check in getAllReasons.
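
One way the separated check could look, as a sketch: a small standalone function that handles the empty case and keeps the size requirement from the removed code. `ScoringChecks` and `validatedScoring` are hypothetical names; only the `require` message comes from the diff above.

```scala
// Hypothetical helper; only the require message is taken from the removed code.
object ScoringChecks {
  /**
   * Returns None when there is no scoring data, otherwise the scoring
   * distributions after checking that they line up with training.
   */
  def validatedScoring[D](training: Seq[D], scoring: Seq[D]): Option[Seq[D]] =
    if (scoring.isEmpty) None
    else {
      require(scoring.length == training.length, "scoring and training features must match")
      Some(scoring)
    }
}
```

Downstream logic such as the distribution mismatch check can then pattern match on the `Option` instead of repeating the emptiness and size tests.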

sxd929 · 2018-11-28T18:49:07Z

core/src/main/scala/com/salesforce/op/stages/impl/selector/DefaultSelectorParams.scala

@@ -38,15 +38,15 @@ object DefaultSelectorParams {
 val MaxBin = Array(32) // bins for cont variables in trees - 32 is spark default
 val MinInstancesPerNode = Array(10, 100) // spark default 1
 val MinInfoGain = Array(0.001, 0.01, 0.1) // spark default 0
- val Regularization = Array(0.001, 0.01, 0.1, 0.2) // spark default 0
+ val Regularization = Array(0.0001, 0.0015, 0.001, 0.01, 0.1) // spark default 0


Why did we make these hyperparameter changes? Are there any related experiments?

The only related doc I can find: https://salesforce.quip.com/JhIiA1PLIYUY (streaming histogram results)

sxd929 · 2018-11-28T18:50:28Z

core/src/main/scala/com/salesforce/op/filters/RawFeatureFilter.scala

+ val (totalCount, responseSummaries, numericSummaries, textSummaries) = sum
+ val (responseFeatures, numericFeatures, textFeatures) = feat
+
+ def updateNumericSummaries(


This might just be my personal feeling, but is this nested too much? E.g., methods nested three levels deep.

kinfaikan · 2018-12-04T10:47:13Z

core/src/main/scala/com/salesforce/op/filters/RawFeatureFilter.scala

+ }
+
+ // This will initialize existing text summary hashing TFs
+ textSummaries.foreach { case (_, textSum) => textSum.setHashingTF() }


Perhaps call setHashingTF when creating a new TextSummary (i.e., in updateTextSummaries)?
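
A sketch of that idea, assuming the setter exists because the hasher is a transient field that is not initialized by the constructor. `TokenHasher` is a hypothetical stand-in for the real hashing transformer, and `TextSummarySketch` is not the PR's TextSummary.

```scala
// Hypothetical stand-ins; TokenHasher replaces the real hashing transformer.
final class TokenHasher(numFeatures: Int) {
  def indexOf(token: String): Int = java.lang.Math.floorMod(token.hashCode, numFeatures)
}

final class TextSummarySketch(val numFeatures: Int) extends Serializable {
  // Assumed to be transient, hence not set by the constructor
  @transient private var hasher: TokenHasher = null

  def setHashingTF(): this.type = {
    hasher = new TokenHasher(numFeatures)
    this
  }

  def isInitialized: Boolean = hasher != null

  def bucket(token: String): Int = {
    require(isInitialized, "setHashingTF() must be called first")
    hasher.indexOf(token)
  }
}

object TextSummarySketch {
  /** Creation and initialization happen together, so no later pass is needed. */
  def create(numFeatures: Int): TextSummarySketch =
    new TextSummarySketch(numFeatures).setHashingTF()
}
```

Summaries that arrive through deserialization would still need re-initialization under this assumption, so the factory only covers the newly created ones.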

kinfaikan · 2018-12-04T10:59:19Z

core/src/main/scala/com/salesforce/op/filters/RawFeatureFilter.scala

+ .getOrElse(key -> summary)
+ }.toMap
+
+ val newResponseSummaries = updateNumericSummaries(responseSummaries, responseFeatures)


Indentation

tovbinm · 2019-03-04T21:37:51Z

@marcovivero are you no longer going to work on this, or are you making a cleaner / smaller one?

marcovivero and others added 11 commits October 16, 2018 16:37

Initial integration steps

369b47e

Next step

a67ed6a

Scalastyle

586d8d9

Fix feature dropping logic

fd4b3ab

Merge branch 'master' of github.com:salesforce/TransmogrifAI into mv/histogram-integration

7db364f

Small clean up

9b9c3b7

Debug setters

9ca9438

Merge branch 'master' of github.com:salesforce/TransmogrifAI into mv/histogram-integration

7cc5a94

Use density instead of mass for text JS divergence

b67d52c

Update default selector params

b30630e

Merge branch 'master' of github.com:salesforce/TransmogrifAI into mv/histogram-integration

56306ef

marcovivero requested review from tovbinm, sxd929 and Jauntbox November 8, 2018 00:25

marcovivero requested a review from leahmcguire as a code owner November 8, 2018 00:25

salesforce-cla bot added the cla:signed label Nov 8, 2018

marcovivero changed the title ~~Mv/histogram integration~~ Integrate Streaming Histogram into RawFeatureFilter. Nov 8, 2018

Merge branch 'master' into mv/histogram-integration

54debed

tovbinm and others added 7 commits November 9, 2018 12:37

Merge branch 'master' into mv/histogram-integration

f40528d

Use Map monoid for reasons

c40b23d

Merge branch 'mv/histogram-integration' of github.com:salesforce/TransmogrifAI into mv/histogram-integration

cf8f550

Merge branch 'master' into mv/histogram-integration

65b8700

Merge branch 'master' into mv/histogram-integration

76d2ac4

Merge branch 'master' into mv/histogram-integration

01253d4

Merge branch 'master' into mv/histogram-integration

1d6389a

sxd929 reviewed Nov 27, 2018

sxd929 reviewed Nov 28, 2018

Some more features

7d5b649

kinfaikan reviewed Dec 4, 2018

marcovivero added 3 commits January 28, 2019 22:09

Merge branch 'mv/histogram-integration-2' of github.com:salesforce/TransmogrifAI into mv/histogram-integration

22a2ad7

Merge branch 'master' of github.com:salesforce/TransmogrifAI into mv/histogram-integration

1eff149

Revert b30630e

8fa42ac

marcovivero closed this Mar 4, 2019

tovbinm mentioned this pull request Jul 11, 2019

Release 3.3.3 #26

Merged

tovbinm deleted the mv/histogram-integration branch November 26, 2019 05:10
