Add support for ignoring text that looks like IDs in SmartTextMapVectorizer #455

Merged: 9 commits from km/token-lens-map merged into master on Jan 21, 2020

Conversation

Jauntbox (Contributor)

Related issues
This is the map version of #448

Describe the proposed solution
Adds a few parameters to SmartTextMapVectorizer that allow it to ignore text fields which would otherwise be hashed (i.e., not treated as categorical) when their token length variance is below a specified threshold (e.g., to catch machine-generated IDs).
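
For illustration only, switching the new behavior on might look roughly like the sketch below. The setter setMinTokenLengthVariance is a placeholder name, not necessarily the exact parameter added in this PR; trackTextLen is the existing text-length tracking param referenced elsewhere in the diff.

import com.salesforce.op.features.FeatureLike
import com.salesforce.op.features.types.{OPVector, TextMap}
import com.salesforce.op.stages.impl.feature.SmartTextMapVectorizer

// Hypothetical sketch: configure the vectorizer so that text map entries whose token
// length variance falls below a threshold are treated as ID-like and not hashed.
// setMinTokenLengthVariance is a placeholder name for the new ignore threshold.
def vectorizeTextMaps(textMap: FeatureLike[TextMap]): FeatureLike[OPVector] =
  new SmartTextMapVectorizer[TextMap]()
    .setInput(textMap)
    .setTrackTextLen(true)
    .setMinTokenLengthVariance(0.05)
    .getOutput()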

Describe alternatives you've considered
Other alternatives include a sort of top-K token counting (e.g., with a count-min sketch). This works, but is difficult to scale robustly with dataset size, and may be implemented later via Algebird's TopKCMS data structure. Filtering data by the standard deviation of raw text lengths, or by how well the text length distribution fits a Poisson distribution, performed better on synthetic data and requires fewer modifications to SmartTextVectorizer.
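
For intuition, the length-variance criterion amounts to something like the following standalone sketch (not the PR's actual code):

// Standalone sketch of the idea: compute the standard deviation of token lengths and
// treat a field as ID-like when that deviation is very small, since machine-generated
// IDs tend to have near-constant token lengths while natural text does not.
def tokenLengthStdDev(tokens: Seq[String]): Double = {
  require(tokens.nonEmpty, "need at least one token")
  val lengths = tokens.map(_.length.toDouble)
  val mean = lengths.sum / lengths.size
  math.sqrt(lengths.map(l => (l - mean) * (l - mean)).sum / lengths.size)
}

val idLike = Seq("a1b2c3d4", "e5f6a7b8", "c9d0e1f2")        // std dev = 0, so ignore
val natural = Seq("the", "quarterly", "report", "arrived")  // std dev > 0, so hash as text
val threshold = 0.1                                         // hypothetical cutoff
val ignoreIdLike = tokenLengthStdDev(idLike) < threshold    // true
val ignoreNatural = tokenLengthStdDev(natural) < threshold  // false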

Additional context
One extra thing to be careful of is that we still use the CJK tokenizer for Chinese and Korean text (Japanese already uses a proper language-specific tokenizer), and this tokenizer always splits the text into character bigrams, which would cause those fields to fail any length distribution check (see the short illustration below). We will need to update the Korean and Chinese tokenizers to language-specific ones that pick out words rather than bigrams.
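
The problem is easy to see: every character bigram has length 2, so the token-length standard deviation is exactly zero and the field looks like an ID no matter what it contains. A tiny illustration:

// Illustration only: a naive character-bigram split (roughly what a bigram tokenizer
// produces for Chinese text) yields tokens that are all length 2, so any check based on
// the variance of token lengths would always flag the field as ID-like.
def characterBigrams(text: String): Seq[String] = text.sliding(2).toSeq

val bigrams = characterBigrams("自然语言处理")    // Seq("自然", "然语", "语言", "言处", "处理")
val allLengthTwo = bigrams.forall(_.length == 2) // true, so the length std dev is 0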

We also plan to add a way to filter based on the goodness of fit of the text length distribution to a Poisson distribution in a future PR. All the information needed to do this is already available, so the modifications should be straightforward.
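
One simple way to frame such a check (shown only as an illustration of the idea, not necessarily how it will be implemented) uses the fact that a Poisson distribution has variance equal to its mean, so a dispersion index (variance divided by mean) far below 1 points at fixed-length, ID-like values rather than natural text lengths:

// Sketch of a dispersion-index check on token or text lengths. For Poisson-distributed
// lengths the ratio variance / mean is close to 1; for near-constant ID lengths it is
// close to 0. The cutoff value would be purely a tuning choice.
def dispersionIndex(lengths: Seq[Int]): Double = {
  require(lengths.nonEmpty, "need at least one length")
  val xs = lengths.map(_.toDouble)
  val mean = xs.sum / xs.size
  val variance = xs.map(x => (x - mean) * (x - mean)).sum / xs.size
  variance / mean
}

val poissonLike = dispersionIndex(Seq(3, 9, 6, 2, 11, 4)) // roughly 1.8, keep as text
val idLike = dispersionIndex(Seq(8, 8, 8, 8, 8, 8))       // 0.0, ignore as ID-like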

codecov bot commented on Jan 11, 2020

Codecov Report

Merging #455 into master will increase coverage by 19.71%.
The diff coverage is 86.53%.


@@             Coverage Diff             @@
##           master     #455       +/-   ##
===========================================
+ Coverage   67.23%   86.95%   +19.71%     
===========================================
  Files         337      340        +3     
  Lines       11161    11418      +257     
  Branches      350      371       +21     
===========================================
+ Hits         7504     9928     +2424     
+ Misses       3657     1490     -2167
Impacted Files Coverage Δ
...ain/scala/com/salesforce/op/aggregators/Maps.scala 96.55% <ø> (+3.44%) ⬆️
...alesforce/op/aggregators/TimeBasedAggregator.scala 100% <ø> (+100%) ⬆️
...la/com/salesforce/op/test/TestFeatureBuilder.scala 100% <ø> (+100%) ⬆️
...e/op/stages/impl/feature/SmartTextVectorizer.scala 95.61% <ø> (+4.38%) ⬆️
...la/com/salesforce/op/features/FeatureBuilder.scala 35.17% <0%> (+6.5%) ⬆️
.../scala/com/salesforce/op/dsl/RichTextFeature.scala 81.94% <0%> (+38.28%) ⬆️
...sforce/op/stages/impl/feature/Transmogrifier.scala 98.05% <100%> (+29.59%) ⬆️
...sforce/op/stages/OpPipelineStageReaderWriter.scala 87.09% <100%> (+0.43%) ⬆️
...p/stages/impl/feature/SmartTextMapVectorizer.scala 100% <100%> (+2.5%) ⬆️
...orce/op/aggregators/MonoidAggregatorDefaults.scala 100% <100%> (+1.81%) ⬆️
... and 147 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

val hashColumns =
if (isSharedHashSpace(params, Some(numFeatures))) {
(0 until numHashes).map { i =>
OpVectorColumnMetadata(
parentFeatureName = features.map(_.name),
parentFeatureType = features.map(_.typeName),

Collaborator: is features still used elsewhere?

Jauntbox (Contributor, Author): Nope, removing it did not cause any problems anywhere else.

key <- keys
i <- 0 until numHashes
} yield f.toColumnMetaData().copy(grouping = Option(key))
}

// All columns get null tracking or text length tracking, whether their contents are hashed or ignored
val allKeys = hashKeys.zip(ignoreKeys).map{ case(h, i) => h ++ i }

Collaborator: what about pivoted keys?

Jauntbox (Contributor, Author): The pivot function should already do the necessary null tracking.

Jauntbox (Contributor, Author): I can rename it to allTextKeys if your point is that it's a poor name.

@@ -104,9 +104,24 @@ class SmartTextMapVectorizer[T <: OPMap[String]]
)
} else Array.empty[OpVectorColumnMetadata]

val textColumns = if (args.textFeatureInfo.flatten.nonEmpty) {

/*

Collaborator: why is this all commented out?

Jauntbox (Contributor, Author): Oops, it was removed in the next commit.

shouldTrackNulls = args.shouldTrackNulls,
shouldTrackLen = $(trackTextLen)
)
} else Array.empty[OpVectorColumnMetadata]

Collaborator: do you not need to track categorical?

Jauntbox (Contributor, Author): Categorical tracking is done above in categoricalColumns.

Jauntbox (Contributor, Author): @leahmcguire Sorry, you looked at it right before I got a fix and new test in. It should be ready to go now.

val hashKeys = hashFeatureInfo.map(_.map(_.key))
val ignoreKeys = ignoreFeatureInfo.map(_.map(_.key))

val textKeys = hashKeys.zip(ignoreKeys).map{ case (hk, ik) => hk ++ ik }

Collaborator: I am not sure what's going on between lines 252-265. Perhaps add some docs?

Jauntbox (Contributor, Author): I added some for the next commit.

@@ -248,32 +285,47 @@ final class SmartTextMapVectorizerModel[T <: OPMap[String]] private[op]
)

private def partitionRow(row: Seq[OPMap[String]]):
-  (Seq[OPMap[String]], Seq[Seq[String]], Seq[OPMap[String]], Seq[Seq[String]]) = {
+  (Seq[OPMap[String]], Seq[Seq[String]], Seq[OPMap[String]], Seq[Seq[String]], Seq[OPMap[String]], Seq[Seq[String]]) = {

Collaborator: this return type is bizarre! Let's add a private case class PartitionRes(rowCategorical, keysCategorical, rowHash, keysHash, rowIgnore, keysIgnore)

Jauntbox (Contributor, Author): Yeah, it's a bit silly - I can make it a case class for this.
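
For reference, a minimal sketch of what the suggested case class could look like (field names simply echo the reviewer's suggestion; the final code may differ):

import com.salesforce.op.features.types.OPMap

// Hypothetical case class grouping the partitioned row values and keys instead of
// returning a six-element tuple from partitionRow.
private case class PartitionResult(
  rowCategorical: Seq[OPMap[String]],
  keysCategorical: Seq[Seq[String]],
  rowHash: Seq[OPMap[String]],
  keysHash: Seq[Seq[String]],
  rowIgnore: Seq[OPMap[String]],
  keysIgnore: Seq[Seq[String]]
)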

tovbinm (Collaborator) left a review: some comments...

Commit: …and SmartTextMapVectorizer the same as previous behavior - don't ignore any features
Jauntbox merged commit b7e07e3 into master on Jan 21, 2020
Jauntbox deleted the km/token-lens-map branch on January 21, 2020 at 22:30
nicodv mentioned this pull request on Jun 11, 2020