Track mean & standard deviation of text length as a metric for text feature #354

Merged
TuanNguyen27 merged 78 commits into master from tn/cardinality on Aug 2, 2019

TuanNguyen27
Collaborator

@TuanNguyen27 TuanNguyen27 commented Jul 3, 2019

Problem context
If not treated as categorical, text features are tokenized and hashed during feature engineering. However, fields such as IDs, dates, and geographical information should be treated differently. Even when hashing is the right approach, TransmogrifAI's current default hash space is too small to capture all the information in the text. To better detect what a text field contains and to dynamically determine an appropriate hash space, we want to track the mean and standard deviation of the string length of each text feature.

Describe the proposed solution
The mean and standard deviation of text length are computed inside RawFeatureFilter and become part of FeatureDistribution.
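
As a rough illustration of the statistic proposed above, here is a minimal Scala sketch that summarizes the string lengths of a text column as mergeable moments. All names (`LengthMoments`, `ofText`, `ofColumn`) are hypothetical and are not taken from this PR; the sketch assumes missing values are skipped and that the population standard deviation is wanted. The merge step is the standard pairwise mean/variance update, which is what allows partial summaries from different partitions to be combined.

```scala
// Hypothetical sketch; not the TransmogrifAI implementation.
// count = number of non-empty values, mean = mean string length,
// m2 = sum of squared deviations from the mean.
final case class LengthMoments(count: Long, mean: Double, m2: Double) {

  /** Combine two partial summaries using the pairwise mean/variance update. */
  def merge(that: LengthMoments): LengthMoments =
    if (count == 0L) that
    else if (that.count == 0L) this
    else {
      val n = count + that.count
      val delta = that.mean - mean
      new LengthMoments(
        n,
        mean + delta * that.count / n,
        m2 + that.m2 + delta * delta * count * that.count / n
      )
    }

  /** Population standard deviation; 0.0 when nothing has been observed. */
  def stdDev: Double = if (count == 0L) 0.0 else math.sqrt(m2 / count)
}

object LengthMoments {
  val empty: LengthMoments = new LengthMoments(0L, 0.0, 0.0)

  /** Summary of a single text value. */
  def ofText(text: String): LengthMoments =
    new LengthMoments(1L, text.length.toDouble, 0.0)

  /** Summary of a column; missing values (None) are skipped by assumption. */
  def ofColumn(values: Seq[Option[String]]): LengthMoments =
    values.flatten.map(ofText).foldLeft(empty)(_ merge _)
}
```

For example, `LengthMoments.ofColumn(Seq(Some("hi"), None, Some("hello world")))` sees lengths 2 and 11, giving a mean of 6.5 and a standard deviation of 4.5. A column of fixed-width IDs would instead show a standard deviation of 0, which is the kind of signal the description has in mind.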

Describe alternatives you've considered
N/A. RawFeatureFilter is the appropriate place to track this information because similar calculations (e.g., the distribution of tokens) already happen there, and the text-length information could help those calculations remove raw features more intelligently.

@codecov

codecov bot commented Jul 3, 2019

Codecov Report

Merging #354 into master will decrease coverage by 0.1%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #354      +/-   ##
==========================================
- Coverage    86.8%   86.69%   -0.11%     
==========================================
  Files         336      336              
  Lines       10928    10943      +15     
  Branches      354      343      -11     
==========================================
+ Hits         9486     9487       +1     
- Misses       1442     1456      +14

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...a/com/salesforce/op/filters/RawFeatureFilter.scala | `92.97% <100%> (+0.03%)` | ⬆️ |
| ...e/op/stages/impl/feature/SmartTextVectorizer.scala | `98.85% <100%> (+1.17%)` | ⬆️ |
| ...om/salesforce/op/filters/FeatureDistribution.scala | `98.63% <100%> (+0.29%)` | ⬆️ |
| ...p/stages/impl/feature/SmartTextMapVectorizer.scala | `100% <100%> (ø)` | ⬆️ |
| ...src/main/scala/com/salesforce/op/cli/CliExec.scala | `55% <0%> (-25%)` | ⬇️ |
| ...src/main/scala/com/salesforce/op/cli/gen/Ops.scala | `86% <0%> (-8%)` | ⬇️ |
| ...ain/scala/com/salesforce/op/cli/SchemaSource.scala | `82.75% <0%> (-5.18%)` | ⬇️ |
| ...es/src/main/scala/com/salesforce/op/OpParams.scala | `85.71% <0%> (-4.09%)` | ⬇️ |
| ...cala/com/salesforce/op/cli/gen/FileGenerator.scala | `74.24% <0%> (-3.04%)` | ⬇️ |
| .../src/main/scala/com/salesforce/op/OpWorkflow.scala | `88.19% <0%> (+0.69%)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 496174c...4736a9c.

Collaborator

@tovbinm tovbinm left a comment

please add a test where avgTextLen result is not 0

Contributor

@Jauntbox Jauntbox left a comment

LGTM

@TuanNguyen27 TuanNguyen27 merged commit e1bab3b into master Aug 2, 2019
@TuanNguyen27 TuanNguyen27 deleted the tn/cardinality branch August 2, 2019 17:36
@tovbinm
Collaborator

tovbinm commented Aug 3, 2019

@TuanNguyen27 please remember to clean the commit message prior to merging next time.

@koertkuipers

FYI, we are running into some issues with this in Spark 3, which ships json4s 3.6.6 instead of 3.5.3.
I first thought it might be a Scala 2.12 issue, but our Scala 2.12 branch (not Spark 3) works fine.

It seems json4s dislikes the companion object for Moments, whose own apply methods have context bounds.

    org.json4s.package$MappingException: Can't find constructor for Moments
        at org.json4s.reflect.package$.fail(package.scala:95)
        at org.json4s.reflect.ScalaSigReader$.$anonfun$readConstructor$3(ScalaSigReader.scala:26)
        at scala.Option.getOrElse(Option.scala:138)
        at org.json4s.reflect.ScalaSigReader$.readConstructor(ScalaSigReader.scala:26)
        at org.json4s.reflect.Reflector$ClassDescriptorBuilder.ctorParamType(Reflector.scala:93)
        at org.json4s.reflect.Reflector$ClassDescriptorBuilder.$anonfun$createConstructorDescriptors$7(Reflector.scala:177)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at scala.collection.TraversableLike.map(TraversableLike.scala:237)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
        at scala.collection.AbstractTraversable.map(Traversable.scala:108)
        at org.json4s.reflect.Reflector$ClassDescriptorBuilder.$anonfun$createConstructorDescriptors$3(Reflector.scala:159)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:39)
        at scala.collection.TraversableLike.map(TraversableLike.scala:237)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
        at scala.collection.AbstractTraversable.map(Traversable.scala:108)
        at org.json4s.reflect.Reflector$ClassDescriptorBuilder.createConstructorDescriptors(Reflector.scala:139)
        at org.json4s.reflect.Reflector$ClassDescriptorBuilder.constructorsAndCompanion(Reflector.scala:135)
        at org.json4s.reflect.Reflector$ClassDescriptorBuilder.result(Reflector.scala:204)
        at org.json4s.reflect.Reflector$.createDescriptor(Reflector.scala:53)
        at org.json4s.reflect.Reflector$.$anonfun$describe$1(Reflector.scala:48)
        at org.json4s.reflect.package$Memo.apply(package.scala:36)
        at org.json4s.reflect.Reflector$.describe(Reflector.scala:48)
        at org.json4s.Extraction$.decomposeObject$1(Extraction.scala:119)
        at org.json4s.Extraction$.internalDecomposeWithBuilder(Extraction.scala:231)
        at org.json4s.Extraction$.addField$1(Extraction.scala:111)
        at org.json4s.Extraction$.decomposeObject$1(Extraction.scala:142)
        at org.json4s.Extraction$.internalDecomposeWithBuilder(Extraction.scala:231)
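
To make the suspected shape concrete, here is a small Scala sketch of a case class whose companion object adds its own `apply` with a context bound. `Summary` is a hypothetical stand-in and not the real Moments class, and the sketch only illustrates the pattern named in the comment; it does not reproduce the json4s failure or confirm its cause.

```scala
// Hypothetical stand-in for the pattern described above; not the real Moments class.
final case class Summary(count: Long, total: Double)

object Summary {
  // A context bound is sugar for an extra implicit parameter list, i.e.
  //   def apply[V](value: V)(implicit num: Numeric[V]): Summary
  // so the companion now exposes this apply next to the compiler-generated
  // apply(count: Long, total: Double).
  def apply[V: Numeric](value: V): Summary =
    new Summary(1L, implicitly[Numeric[V]].toDouble(value))
}
```

With this definition, `Summary(3)` resolves to the context-bound overload while `Summary(1L, 3.0)` still uses the generated one. Any tool that inspects the companion reflectively to find a constructor therefore sees more than one `apply`, one of which carries an implicit parameter, which may be what the reflection code in the trace stumbles over.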

@leahmcguire
Collaborator

thanks for the heads up!

@TuanNguyen27 can you please add the FeatureDistribution class and subclasses to the json serialization formats for record insights https://github.com/salesforce/TransmogrifAI/blob/master/core/src/main/scala/com/salesforce/op/ModelInsights.scala#L394

@gerashegalov gerashegalov mentioned this pull request Sep 8, 2019
gerashegalov added a commit that referenced this pull request Sep 11, 2019
Bug fixes:
- Ensure correct metrics despite model failures on some CV folds [#404](#404)
- Fix flaky `ModelInsight` tests [#395](#395)
- Avoid creating `SparseVector`s for LOCO [#377](#377)

New features / updates:
- Model combiner [#385](#399)
- Added new sample for HousingPrices [#365](#365)
- Test to verify that custom metrics appear in model insight metrics [#387](#387)
- Add `FeatureDistribution` to `SerializationFormat`s [#383](#383)
- Add metadata to `OpStandardScaler` to allow for descaling [#378](#378)
- Improve json serde error in `evalMetFromJson` [#380](#380)
- Track mean & standard deviation as metrics for numeric features and for text length of text features [#354](#354)
- Making model selectors robust to failing models [#372](#372)
- Use compact and compressed model json by default [#375](#375)
- Descale feature contribution for Linear Regression & Logistic Regression [#345](#345)

Dependency updates:   
- Update tika version [#382](#382)
@salesforce-cla

Thanks for the contribution! Unfortunately we can't verify the commit author(s): Leah McGuire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.
