Add support for ignoring text that looks like IDs in SmartTextMapVectorizer #455
Conversation
Codecov Report
@@ Coverage Diff @@
## master #455 +/- ##
===========================================
+ Coverage 67.23% 86.95% +19.71%
===========================================
Files 337 340 +3
Lines 11161 11418 +257
Branches 350 371 +21
===========================================
+ Hits 7504 9928 +2424
+ Misses 3657 1490 -2167
Continue to review full report at Codecov.
val hashColumns =
  if (isSharedHashSpace(params, Some(numFeatures))) {
    (0 until numHashes).map { i =>
      OpVectorColumnMetadata(
        parentFeatureName = features.map(_.name),
        parentFeatureType = features.map(_.typeName),
is features still used elsewhere?
Nope, removing it did not cause any problems anywhere else.
      key <- keys
      i <- 0 until numHashes
    } yield f.toColumnMetaData().copy(grouping = Option(key))
  }

// All columns get null tracking or text length tracking, whether their contents are hashed or ignored
val allKeys = hashKeys.zip(ignoreKeys).map { case (h, i) => h ++ i }
what about pivoted keys?
The pivot function should already do the necessary null tracking
I can rename it to allTextKeys, if your point is that it's a poor name.
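Since `hashKeys` and `ignoreKeys` are both per-feature sequences of map keys, the `zip`/`++` pattern above merges them feature by feature. A minimal standalone sketch of that behavior (key names are made up for illustration):

```scala
// Hypothetical per-feature key sequences: one inner Seq per map feature.
val hashKeys: Seq[Seq[String]] = Seq(Seq("f1k1", "f1k2"), Seq("f2k1"))
val ignoreKeys: Seq[Seq[String]] = Seq(Seq("f1k3"), Seq.empty)

// zip aligns the two outer sequences feature-by-feature; ++ then merges
// the keys within each feature, so null/length tracking covers hashed
// and ignored keys alike.
val allKeys = hashKeys.zip(ignoreKeys).map { case (h, i) => h ++ i }
// allKeys == Seq(Seq("f1k1", "f1k2", "f1k3"), Seq("f2k1"))
```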
@@ -104,9 +104,24 @@ class SmartTextMapVectorizer[T <: OPMap[String]]
  )
} else Array.empty[OpVectorColumnMetadata]

val textColumns = if (args.textFeatureInfo.flatten.nonEmpty) {

/*
why is this all commented out?
Oops, it was removed in the next commit
    shouldTrackNulls = args.shouldTrackNulls,
    shouldTrackLen = $(trackTextLen)
  )
} else Array.empty[OpVectorColumnMetadata]
do you not need to track categorical?
Categorical tracking is done above in categoricalColumns
@leahmcguire Sorry, you looked at it right before I got a fix and new test in. It should be ready to go now.
val hashKeys = hashFeatureInfo.map(_.map(_.key))
val ignoreKeys = ignoreFeatureInfo.map(_.map(_.key))

val textKeys = hashKeys.zip(ignoreKeys).map { case (hk, ik) => hk ++ ik }
I am not sure what's going on between lines 252-265. Perhaps add some docs?
I added some for the next commit
@@ -248,32 +285,47 @@ final class SmartTextMapVectorizerModel[T <: OPMap[String]] private[op]
)

private def partitionRow(row: Seq[OPMap[String]]):
(Seq[OPMap[String]], Seq[Seq[String]], Seq[OPMap[String]], Seq[Seq[String]]) = {
(Seq[OPMap[String]], Seq[Seq[String]], Seq[OPMap[String]], Seq[Seq[String]], Seq[OPMap[String]], Seq[Seq[String]]) = {
this return type is bizarre! let's add a private case class PartitionRes(rowCategorical, keysCategorical, rowHash, keysHash, rowIgnore, keysIgnore)
Yeah, it's a bit silly - I can make it a case class for this
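A sketch of the suggested refactor, with `OPMap[String]` simplified to `Map[String, String]` so the snippet stands alone (field names follow the review comment; the real code would keep the vectorizer's scoping and types):

```scala
// Named result type replacing the six-element tuple returned by
// partitionRow. Field names taken from the review suggestion above.
case class PartitionRes(
  rowCategorical: Seq[Map[String, String]],
  keysCategorical: Seq[Seq[String]],
  rowHash: Seq[Map[String, String]],
  keysHash: Seq[Seq[String]],
  rowIgnore: Seq[Map[String, String]],
  keysIgnore: Seq[Seq[String]]
)
```

Call sites can then match on named fields instead of tuple positions, which is the readability win the reviewer is after.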
some comments...
…and SmartTextMapVectorizer the same as previous behavior - don't ignore any features
Related issues
This is the map version of #448
Describe the proposed solution
Adds a few parameters to SmartTextMapVectorizer to allow ignoring text fields that would otherwise be hashed (i.e. not categorical) if they have a token length variance below a specified threshold (e.g. to catch machine-generated IDs).
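The variance check described above can be sketched roughly as follows; `tokenLengthVariance`, `looksLikeId`, and `minVariance` are illustrative names, not the actual parameter names introduced in this PR:

```scala
// Machine-generated IDs tend to have near-constant token lengths, so a
// token-length variance below a small threshold is a signal to ignore
// the field rather than hash it. Illustrative sketch only.
def tokenLengthVariance(tokens: Seq[String]): Double = {
  val lens = tokens.map(_.length.toDouble)
  val mean = lens.sum / lens.length
  lens.map(l => (l - mean) * (l - mean)).sum / lens.length
}

// Hypothetical helper; the real vectorizer exposes this as params.
def looksLikeId(tokens: Seq[String], minVariance: Double): Boolean =
  tokens.nonEmpty && tokenLengthVariance(tokens) < minVariance

val idTokens = Seq("a1b2c3d4", "e5f6a7b8", "c9d0e1f2") // all length 8
val textTokens = Seq("the", "quarterly", "report", "is", "ready")
// looksLikeId(idTokens, 0.5) is true; looksLikeId(textTokens, 0.5) is false
```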
Describe alternatives you've considered
Other alternatives are a sort of topK token counting (e.g. with a countMinSketch). This works, but is difficult to robustly scale with dataset size, and may be implemented later via Algebird's TopKCMS data structure. Filtering data by raw text length std dev, or by how well the text length distribution fits a Poisson distribution, performed better on synthetic data and requires fewer modifications to SmartTextVectorizer.
Additional context
One extra thing we need to be careful of is that we still use the CJK tokenizer for Chinese and Korean text (Japanese already uses a proper language-specific tokenizer), and this tokenizer always splits the text into character bigrams, which would cause it to fail any length distribution checks. We will need to update the Korean and Chinese tokenizers to language-specific ones that pick out words rather than bigrams.
We plan to also add a way to filter based on goodness of fit of the text length distribution to a Poisson distribution in a future PR. All the information is already available to do this, so the modifications should be straightforward.
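As a rough illustration of that future direction (assumed details, not from this PR): for a Poisson distribution the variance equals the mean, so a simple proxy for goodness of fit is the dispersion index variance/mean, which is near 1 for Poisson-like token lengths and near 0 for the near-constant lengths of machine-generated IDs. The planned implementation may well use a fuller goodness-of-fit test; this is only a sketch of the idea:

```scala
// Dispersion index of observed token lengths: variance / mean.
// ~1 suggests Poisson-like natural text; ~0 suggests ID-like text.
// Assumes lengths is non-empty with a positive mean.
def dispersionIndex(lengths: Seq[Int]): Double = {
  val xs = lengths.map(_.toDouble)
  val mean = xs.sum / xs.length
  val variance = xs.map(x => (x - mean) * (x - mean)).sum / xs.length
  variance / mean
}

val naturalLengths = Seq(3, 9, 6, 2, 5) // varied, roughly Poisson-like
val idLengths = Seq(8, 8, 8, 8)         // constant, Poisson-unlike
// dispersionIndex(naturalLengths) == 1.2; dispersionIndex(idLengths) == 0.0
```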