Track mean & standard deviation of text length as a metric for text feature #354

TuanNguyen27 · 2019-07-03T22:43:45Z

Problem context
Text features that are not treated as categoricals are tokenized and hashed during feature engineering. However, fields such as IDs, dates, and geographical information should be treated differently. Even when hashing is the right approach, the current default hash space in TransmogrifAI is too small to capture all the information in the text. To better detect what a text field contains and to dynamically determine an appropriate hash space, we want to track the mean and standard deviation of the string length of each text feature.

Describe the proposed solution
The mean and standard deviation of text length are computed inside RawFeatureFilter and become part of FeatureDistribution.
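For readers unfamiliar with the idea, here is a minimal sketch of how the mean and standard deviation of text length can be tracked in a way that merges across partitions. It is not the code in this PR: `LengthMoments`, `of`, and `fromValues` are hypothetical names introduced for illustration, the merge step is the standard pairwise (count, mean, M2) update, and population variance is an assumption.

```scala
// Hypothetical sketch, not TransmogrifAI API: a mergeable summary of text lengths.
final case class LengthMoments(count: Long, mean: Double, m2: Double) {
  // Population variance (an assumption); zero when empty to avoid dividing by zero.
  def variance: Double = if (count == 0L) 0.0 else m2 / count
  def stdDev: Double = math.sqrt(variance)

  // Combine two partial summaries without revisiting the raw values.
  def merge(that: LengthMoments): LengthMoments =
    if (count == 0L) that
    else if (that.count == 0L) this
    else {
      val n = count + that.count
      val delta = that.mean - mean
      LengthMoments(
        count = n,
        mean = mean + delta * that.count / n,
        m2 = m2 + that.m2 + delta * delta * count * that.count / n
      )
    }
}

object LengthMoments {
  // Identity element for merge, so empty partitions are harmless.
  val empty: LengthMoments = LengthMoments(0L, 0.0, 0.0)

  // Summary of a single text value.
  def of(text: String): LengthMoments =
    LengthMoments(1L, text.length.toDouble, 0.0)

  // Missing values (None) contribute nothing to the summary.
  def fromValues(values: Seq[Option[String]]): LengthMoments =
    values.flatten.map(of).foldLeft(empty)(_ merge _)
}
```

Because `merge` has an identity element (`empty`), per-partition summaries can be combined in any grouping, which is the property an aggregation inside a filter such as RawFeatureFilter would need. For example, `LengthMoments.fromValues(Seq(Some("ab"), None, Some("abcd"), Some("abcdef")))` summarizes the lengths 2, 4, and 6.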

Describe alternatives you've considered
N/A. RawFeatureFilter is the appropriate place to track this information because similar calculations (e.g., the distribution of tokens) also happen there, and the additional information about text length could help inform those calculations so that raw features are removed more intelligently.

codecov · 2019-07-03T23:29:09Z

Codecov Report

Merging #354 into master will decrease coverage by 0.1%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #354      +/-   ##
==========================================
- Coverage    86.8%   86.69%   -0.11%     
==========================================
  Files         336      336              
  Lines       10928    10943      +15     
  Branches      354      343      -11     
==========================================
+ Hits         9486     9487       +1     
- Misses       1442     1456      +14

Impacted Files	Coverage Δ
...a/com/salesforce/op/filters/RawFeatureFilter.scala	`92.97% <100%> (+0.03%)`	⬆️
...e/op/stages/impl/feature/SmartTextVectorizer.scala	`98.85% <100%> (+1.17%)`	⬆️
...om/salesforce/op/filters/FeatureDistribution.scala	`98.63% <100%> (+0.29%)`	⬆️
...p/stages/impl/feature/SmartTextMapVectorizer.scala	`100% <100%> (ø)`	⬆️
...src/main/scala/com/salesforce/op/cli/CliExec.scala	`55% <0%> (-25%)`	⬇️
...src/main/scala/com/salesforce/op/cli/gen/Ops.scala	`86% <0%> (-8%)`	⬇️
...ain/scala/com/salesforce/op/cli/SchemaSource.scala	`82.75% <0%> (-5.18%)`	⬇️
...es/src/main/scala/com/salesforce/op/OpParams.scala	`85.71% <0%> (-4.09%)`	⬇️
...cala/com/salesforce/op/cli/gen/FileGenerator.scala	`74.24% <0%> (-3.04%)`	⬇️
.../src/main/scala/com/salesforce/op/OpWorkflow.scala	`88.19% <0%> (+0.69%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 496174c...4736a9c.

tovbinm

please add a test where avgTextLen result is not 0

core/src/main/scala/com/salesforce/op/OpWorkflow.scala

core/src/main/scala/com/salesforce/op/filters/FeatureDistribution.scala

core/src/main/scala/com/salesforce/op/filters/RawFeatureFilter.scala

core/src/main/scala/com/salesforce/op/filters/FeatureDistribution.scala

core/src/main/scala/com/salesforce/op/filters/AllFeatureInformation.scala

core/src/test/scala/com/salesforce/op/ModelInsightsTest.scala

Jauntbox

LGTM

tovbinm · 2019-08-03T05:36:02Z

@TuanNguyen27 please remember to clean the commit message prior to merging next time.

koertkuipers · 2019-08-05T20:05:43Z

FYI, we are running into some issues with this on Spark 3, which ships json4s 3.6.6 instead of 3.5.3.
I first thought it might be a Scala 2.12 issue, but our Scala 2.12 branch (not Spark 3) works fine.

It seems json4s dislikes the companion object for Moments, whose own apply methods have context bounds.

    org.json4s.package$MappingException: Can't find constructor for Moments
        at org.json4s.reflect.package$.fail(package.scala:95)
        at org.json4s.reflect.ScalaSigReader$.$anonfun$readConstructor$3(ScalaSigReader.scala:26)
        at scala.Option.getOrElse(Option.scala:138)
        at org.json4s.reflect.ScalaSigReader$.readConstructor(ScalaSigReader.scala:26)
        at org.json4s.reflect.Reflector$ClassDescriptorBuilder.ctorParamType(Reflector.scala:93)
        at org.json4s.reflect.Reflector$ClassDescriptorBuilder.$anonfun$createConstructorDescriptors$7(Reflector.scala:177)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at scala.collection.TraversableLike.map(TraversableLike.scala:237)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
        at scala.collection.AbstractTraversable.map(Traversable.scala:108)
        at org.json4s.reflect.Reflector$ClassDescriptorBuilder.$anonfun$createConstructorDescriptors$3(Reflector.scala:159)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:39)
        at scala.collection.TraversableLike.map(TraversableLike.scala:237)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
        at scala.collection.AbstractTraversable.map(Traversable.scala:108)
        at org.json4s.reflect.Reflector$ClassDescriptorBuilder.createConstructorDescriptors(Reflector.scala:139)
        at org.json4s.reflect.Reflector$ClassDescriptorBuilder.constructorsAndCompanion(Reflector.scala:135)
        at org.json4s.reflect.Reflector$ClassDescriptorBuilder.result(Reflector.scala:204)
        at org.json4s.reflect.Reflector$.createDescriptor(Reflector.scala:53)
        at org.json4s.reflect.Reflector$.$anonfun$describe$1(Reflector.scala:48)
        at org.json4s.reflect.package$Memo.apply(package.scala:36)
        at org.json4s.reflect.Reflector$.describe(Reflector.scala:48)
        at org.json4s.Extraction$.decomposeObject$1(Extraction.scala:119)
        at org.json4s.Extraction$.internalDecomposeWithBuilder(Extraction.scala:231)
        at org.json4s.Extraction$.addField$1(Extraction.scala:111)
        at org.json4s.Extraction$.decomposeObject$1(Extraction.scala:142)
        at org.json4s.Extraction$.internalDecomposeWithBuilder(Extraction.scala:231)

leahmcguire · 2019-08-07T16:52:46Z

thanks for the heads up!

@TuanNguyen27 can you please add the FeatureDistribution class and subclasses to the json serialization formats for record insights https://github.com/salesforce/TransmogrifAI/blob/master/core/src/main/scala/com/salesforce/op/ModelInsights.scala#L394

Bug fixes:
- Ensure correct metrics despite model failures on some CV folds [#404](#404)
- Fix flaky `ModelInsight` tests [#395](#395)
- Avoid creating `SparseVector`s for LOCO [#377](#377)

New features / updates:
- Model combiner [#385](#399)
- Added new sample for HousingPrices [#365](#365)
- Test to verify that custom metrics appear in model insight metrics [#387](#387)
- Add `FeatureDistribution` to `SerializationFormat`s [#383](#383)
- Add metadata to `OpStandadrdScaler` to allow for descaling [#378](#378)
- Improve json serde error in `evalMetFromJson` [#380](#380)
- Track mean & standard deviation as metrics for numeric features and for text length of text features [#354](#354)
- Making model selectors robust to failing models [#372](#372)
- Use compact and compressed model json by default [#375](#375)
- Descale feature contribution for Linear Regression & Logistic Regression [#345](#345)

Dependency updates:
- Update tika version [#382](#382)

salesforce-cla · 2021-03-19T11:07:16Z

Thanks for the contribution! Unfortunately we can't verify the commit author(s): Leah McGuire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.

TuanNguyen27 added 5 commits July 2, 2019 14:34

starter code

d504ace

spaghetti code

9cd2790

better place to put avgTextLen

61c52c7

first fix of unit test

bf1c283

fix most tests

d64d087

TuanNguyen27 requested review from tovbinm and Jauntbox July 3, 2019 22:43

TuanNguyen27 requested a review from leahmcguire as a code owner July 3, 2019 22:43

salesforce-cla bot added the cla:signed label Jul 3, 2019

fix some styles

3f5da2c

TuanNguyen27 added the work in progress label Jul 3, 2019

TuanNguyen27 added 2 commits July 3, 2019 16:08

fix more style

6bb0256

Merge branch 'master' into tn/cardinality

0f91a40

handling division by zero

cebf02d

tovbinm reviewed Jul 5, 2019

TuanNguyen27 added 6 commits July 5, 2019 13:40

address comments

c1e50ca

adding some doc on how to use text len cardinality

75da25e

Merge branch 'master' into tn/cardinality

2d7c233

add default value for avg text len

9e0e2f9

add docs

47ad700

fix scala style

42a47ba

tovbinm reviewed Jul 8, 2019

core/src/main/scala/com/salesforce/op/filters/FeatureDistribution.scala (outdated)

TuanNguyen27 added 2 commits July 8, 2019 16:31

delete extra line

9082b86

Merge branch 'master' into tn/cardinality

1c2235e

tovbinm reviewed Jul 9, 2019

core/src/main/scala/com/salesforce/op/filters/AllFeatureInformation.scala (outdated)

TuanNguyen27 added 4 commits July 9, 2019 10:19

remove avgtextLength from doc

387f0ea

Merge branch 'tn/cardinality' of https://github.com/salesforce/TransmogrifAI into tn/cardinality

fd64fd8

starter code on moments & textstat

0b27fed

fix moments aggregation?

3d1eea9

TuanNguyen27 added 3 commits July 30, 2019 15:55

wip

d1d7dc0

update test

29d1925

fix scala style

d72004f

TuanNguyen27 added ready for review and removed work in progress labels Jul 31, 2019

removing verbose lines

6802558

Jauntbox reviewed Jul 31, 2019

core/src/test/scala/com/salesforce/op/ModelInsightsTest.scala

Jauntbox reviewed Jul 31, 2019

core/src/test/scala/com/salesforce/op/ModelInsightsTest.scala (outdated)

Jauntbox reviewed Aug 1, 2019

core/src/test/scala/com/salesforce/op/ModelInsightsTest.scala (outdated)

Jauntbox reviewed Aug 1, 2019

core/src/test/scala/com/salesforce/op/ModelInsightsTest.scala (outdated)

Jauntbox reviewed Aug 1, 2019

core/src/test/scala/com/salesforce/op/ModelInsightsTest.scala (outdated)

TuanNguyen27 and others added 5 commits August 1, 2019 10:04

clean up test for cardinality and moments

88c082f

fix scala style

5b71fca

clean up summation of Option[Moments]

85d207f

Changed the TextStats SemiGroup to a Monoid so that we can make an OptionMonoid for less boilerplate in FeatureDistribution aggregation

c9bdc27

Fix merge conflict

d3c36f2

leahmcguire approved these changes Aug 2, 2019

Merge branch 'master' into tn/cardinality

4736a9c

Jauntbox approved these changes Aug 2, 2019

TuanNguyen27 merged commit e1bab3b into master Aug 2, 2019

TuanNguyen27 deleted the tn/cardinality branch August 2, 2019 17:36

TuanNguyen27 mentioned this pull request Aug 12, 2019

add FeatureDistribution to SerializationFormats #383

Merged

gerashegalov mentioned this pull request Sep 8, 2019

0.6.1 release #403

Merged

salesforce-cla bot added cla:missing and removed cla:signed labels Mar 19, 2021
