Add support for ignoring text that looks like IDs in SmartTextVectorizer #448
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #448      +/-   ##
==========================================
- Coverage   86.95%   86.95%   -0.01%
==========================================
  Files         337      337
  Lines       11102    11131      +29
  Branches      364      593     +229
==========================================
+ Hits         9654     9679      +25
- Misses       1448     1452       +4
Continue to review full report at Codecov.
  val shouldCleanText = $(cleanText)

  implicit val testStatsMonoid: Semigroup[TextStats] = TextStats.monoid(maxCard)
  val valueStats: Dataset[Array[TextStats]] = dataset.map(_.map(computeTextStats(_, shouldCleanText)).toArray)
  val aggregatedStats: Array[TextStats] = valueStats.reduce(_ + _)

- val (isCategorical, topValues) = aggregatedStats.map { stats =>
+ val (isCategorical, isIgnorable, topValues) = aggregatedStats.map { stats =>
Can something be both ignorable and categorical? If not, perhaps we should replace them with an enum rather than multiplying our booleans...
Not currently. Right now "ignorable" refers only to fields that would otherwise have been hashed. If a field is already low cardinality, I don't think we'd have a reason to ignore it in general, even if it were an ID or something.
We could make this into an enum with three choices (Pivot, Hash, Ignore) if you think that's clearer.
Replaced! Mostly - I still need to propagate the changes to SmartTextMapVectorizer, but this did make things more readable. Good suggestion!
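For reference, a minimal sketch of what such an enum could look like (the names follow the Pivot/Hash/Ignore suggestion above, and the TextStats stub carries only the fields used here, not the PR's full class):

sealed trait TextVectorizationMethod
object TextVectorizationMethod {
  case object Pivot extends TextVectorizationMethod   // low cardinality: pivot into categorical indicators
  case object Hash extends TextVectorizationMethod    // genuine free-form text: hash the tokens
  case object Ignore extends TextVectorizationMethod  // ID-like text: drop from the feature vector
}

// Stub of the fields used below (an assumption, not the PR's full TextStats).
case class TextStats(valueCounts: Map[String, Int], lengthStdDev: Double)

// Selection logic mirroring this PR's isCategorical / isIgnorable checks.
def chooseMethod(stats: TextStats, maxCard: Int, minLenStdDev: Double): TextVectorizationMethod =
  if (stats.valueCounts.size <= maxCard) TextVectorizationMethod.Pivot
  else if (stats.lengthStdDev <= minLenStdDev) TextVectorizationMethod.Ignore
  else TextVectorizationMethod.Hash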
val isCategorical = stats.valueCounts.size <= maxCard
val isIgnorable = stats.lengthStdDev <= minLenStdDev
is this really all you need to identify IDs?
This is the simplest check that worked on most of the synthetic data. The top-K counts from a count-min sketch worked about as well, but it was harder to construct a threshold that worked across different dataset sizes, and it required more changes to the TextStats class. We could add that in the future, but I wanted to try the trivial check on real data first.
The best-performing criterion was a goodness-of-fit threshold on how well the text lengths follow an MLE Poisson fit. I was going to add that one in another PR, but could put it in this PR too. It's simple enough that it shouldn't need any curve-fitting libraries.
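To make that concrete, here is a rough sketch of the goodness-of-fit idea (an illustration of the approach described above, not this PR's code; the threshold choice is left open). The MLE for a Poisson rate is just the sample mean, so no external curve-fitting libraries are needed:

def poissonGofStatistic(lengthCounts: Map[Int, Long]): Double = {
  val n = lengthCounts.values.sum.toDouble
  val lambda = lengthCounts.map { case (len, c) => len * c }.sum / n  // MLE rate = mean text length

  // Poisson log-PMF, computed in log space to avoid overflowing the factorial.
  def logPmf(k: Int): Double = k * math.log(lambda) - lambda - (2 to k).map(math.log(_)).sum

  // Pearson chi-squared statistic over the observed lengths: large values mean
  // the lengths deviate from the Poisson fit (e.g., the near-constant lengths
  // of machine-generated IDs), small values look like natural text.
  lengthCounts.map { case (len, observed) =>
    val expected = n * math.exp(logPmf(len))
    val diff = observed - expected
    diff * diff / expected
  }.sum
}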
[Resolved review thread (outdated) on core/src/main/scala/com/salesforce/op/stages/impl/feature/SmartTextVectorizer.scala]
val newLengthCounts =
  if (l.lengthCounts.size > maxCardinality) l.lengthCounts
  else if (r.lengthCounts.size > maxCardinality) r.lengthCounts
  else l.lengthCounts + r.lengthCounts
can we make this a function?
Probably. There's a chunk for Map[String, Int] and a chunk for Map[Int, Int], so we should be able to combine them since each has a monoid.
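Something like the following could work as the shared helper (a sketch; mergeWithCap is a made-up name, not this PR's code). Both Map[String, Int] and Map[Int, Int] are monoids under element-wise addition, so one generic function covers both chunks:

// Keep whichever side already exceeds the cap so the map stops growing;
// otherwise merge by summing counts key-wise (the Map[K, Int] monoid).
def mergeWithCap[K](l: Map[K, Int], r: Map[K, Int], maxCardinality: Int): Map[K, Int] =
  if (l.size > maxCardinality) l
  else if (r.size > maxCardinality) r
  else r.foldLeft(l) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0) + v) }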
@Jauntbox Regarding creating an enum to replace the booleans in SmartTextVectorizer, I have done this already on my personal branch for incorporating name detection in STV (https://github.com/MWYang/TransmogrifAI/pull/1/files). (Look for …)
- case Some(v) => Map(cleanTextFn(v, shouldCleanText) -> 1)
- case None => Map.empty[String, Int]
+ val (valueCounts, lengthCounts) = text match {
+   case Some(v) => (Map(cleanTextFn(v, shouldCleanText) -> 1), Map(cleanTextFn(v, shouldCleanText).length -> 1))
Not specific to this line, but will we have Int overflows in larger texts with more than 2G uniques?
Hmmm, good point. That would require a huge dataset, but it's an easy change to switch these to Longs.
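For illustration, the Long-counts variant might look like the following (TextStats field names follow the diffs in this thread; cleanTextFn here is a stand-in stub, not the real implementation):

case class TextStats(valueCounts: Map[String, Long], lengthCounts: Map[Int, Long])

// Stand-in for the real cleanTextFn (an assumption for this sketch).
def cleanTextFn(s: String, shouldClean: Boolean): String = if (shouldClean) s.trim.toLowerCase else s

def computeTextStats(text: Option[String], shouldCleanText: Boolean): TextStats = text match {
  case Some(v) =>
    val cleaned = cleanTextFn(v, shouldCleanText)
    TextStats(Map(cleaned -> 1L), Map(cleaned.length -> 1L))  // 1L: counts accumulate as Long
  case None => TextStats(Map.empty, Map.empty)
}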
@@ -72,7 +72,8 @@ class SmartTextMapVectorizer[T <: OPMap[String]]
    textMap: T#Value, shouldCleanKeys: Boolean, shouldCleanValues: Boolean
  ): TextMapStats = {
    val keyValueCounts = textMap.map{ case (k, v) =>
-     cleanTextFn(k, shouldCleanKeys) -> TextStats(Map(cleanTextFn(v, shouldCleanValues) -> 1))
+     cleanTextFn(k, shouldCleanKeys) ->
+       TextStats(Map(cleanTextFn(v, shouldCleanValues) -> 1), Map(cleanTextFn(v, shouldCleanValues).length -> 1))
It looks like you are going to have some merge conflicts - #449
[Resolved review thread (outdated) on core/src/main/scala/com/salesforce/op/stages/impl/feature/SmartTextVectorizer.scala]
[Resolved review thread (outdated) on core/src/test/scala/com/salesforce/op/stages/impl/feature/SmartTextVectorizerTest.scala]
lgtm, +1 on enum comment that @leahmcguire had
undeletable comment ;)
oof!
"oof!" someone has been hanging out with @snabar :-P |
Ooooooofffff! |
@Jauntbox lgtm. Compilation failed though. I presume the merge conflict is to blame? - https://travis-ci.com/salesforce/TransmogrifAI/jobs/269487360#L695
Also https://travis-ci.com/salesforce/TransmogrifAI/jobs/272953824#L508
Related issues
N/A
Describe the proposed solution
Adds a few parameters to SmartTextVectorizer that allow it to ignore text fields which would otherwise be hashed (i.e., not treated as categorical) when their token length standard deviation falls below a specified threshold (e.g., to catch machine-generated IDs).
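A hedged usage sketch (the setters setMaxCardinality and setMinLengthStdDev are inferred from the maxCard / minLenStdDev parameters in the diffs above and may not match the final API; the Record type and features are purely illustrative):

import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._
import com.salesforce.op.stages.impl.feature.SmartTextVectorizer

case class Record(description: String, userId: String)

// Illustrative text features extracted from the record.
val description = FeatureBuilder.Text[Record].extract(_.description.toText).asPredictor
val userId = FeatureBuilder.Text[Record].extract(_.userId.toText).asPredictor

val vectorized = new SmartTextVectorizer[Text]()
  .setMaxCardinality(100)    // more uniques than this => not categorical
  .setMinLengthStdDev(1.0)   // length std dev below this => treat as ID-like and ignore
  .setInput(description, userId)
  .getOutput()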
Describe alternatives you've considered
Another alternative is a form of top-K token counting (e.g., with a count-min sketch). This works, but it is difficult to scale robustly with dataset size; it may be implemented later via Algebird's TopKCMS data structure. Filtering by the standard deviation of raw text lengths, or by how well the text length distribution fits a Poisson distribution, performed better on synthetic data and requires fewer modifications to SmartTextVectorizer.
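For completeness, the count-min-sketch alternative might look like the following with Algebird's TopNCMS (a sketch only: the eps/delta/topN values are illustrative, and we hash values to Long so the built-in CMSHasher[Long] applies):

import com.twitter.algebird.{TopCMS, TopNCMS}

val eps = 0.001     // relative error bound on the counts
val delta = 1e-8    // probability of exceeding the error bound
val seed = 1
val topN = 20       // number of heavy hitters to track

val cmsMonoid = TopNCMS.monoid[Long](eps, delta, seed, topN)

// Build one sketch per text field; ID-like fields have (almost) all-unique
// values, so even their top-N heavy hitters carry a tiny share of the total.
def sketchOf(values: Seq[String]): TopCMS[Long] =
  cmsMonoid.sum(values.map(v => cmsMonoid.create(v.hashCode.toLong)))

val heavyHitters: Set[Long] = sketchOf(Seq("id-001", "id-002", "id-003")).heavyHitters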
Additional context
One extra thing to be careful of: we still use the CJK tokenizer for Chinese and Korean text (Japanese already uses a proper language-specific tokenizer), and that tokenizer always splits the text into character bigrams, which would cause those fields to fail any length-distribution check. We will need to update the Korean and Chinese tokenizers to language-specific ones that pick out words rather than bigrams.
We also plan to add a way to filter based on the goodness of fit of the text length distribution to a Poisson distribution in a future PR. All the information needed is already available, so the modifications should be straightforward.