Add support for ignoring text that looks like IDs in SmartTextMapVectorizer #455
Conversation
Codecov Report
@@ Coverage Diff @@
## master #455 +/- ##
===========================================
+ Coverage 67.23% 86.95% +19.71%
===========================================
Files 337 340 +3
Lines 11161 11418 +257
Branches 350 371 +21
===========================================
+ Hits 7504 9928 +2424
+ Misses 3657 1490 -2167
Continue to review full report at Codecov.
val hashColumns =
  if (isSharedHashSpace(params, Some(numFeatures))) {
    (0 until numHashes).map { i =>
      OpVectorColumnMetadata(
        parentFeatureName = features.map(_.name),
        parentFeatureType = features.map(_.typeName),
is features still used elsewhere?
Nope, removing it did not cause any problems anywhere else.
      key <- keys
      i <- 0 until numHashes
    } yield f.toColumnMetaData().copy(grouping = Option(key))
  }

// All columns get null tracking or text length tracking, whether their contents are hashed or ignored
val allKeys = hashKeys.zip(ignoreKeys).map { case (h, i) => h ++ i }
what about pivoted keys?
The pivot function should already do the necessary null tracking
I can rename it to allTextKeys, if your point is that it's a poor name.
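Since `hashKeys` and `ignoreKeys` are both per-feature sequences of map keys, the `zip`/`++` pattern above merges them feature by feature. A minimal standalone sketch of that behavior (key names are made up for illustration):

```scala
// Hypothetical per-feature key sequences: one inner Seq per map feature.
val hashKeys: Seq[Seq[String]] = Seq(Seq("f1k1", "f1k2"), Seq("f2k1"))
val ignoreKeys: Seq[Seq[String]] = Seq(Seq("f1k3"), Seq.empty)

// zip aligns the two outer sequences feature-by-feature; ++ then merges
// the keys within each feature, so null/length tracking covers hashed
// and ignored keys alike.
val allKeys = hashKeys.zip(ignoreKeys).map { case (h, i) => h ++ i }
// allKeys == Seq(Seq("f1k1", "f1k2", "f1k3"), Seq("f2k1"))
```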
@@ -104,9 +104,24 @@ class SmartTextMapVectorizer[T <: OPMap[String]]
  )
} else Array.empty[OpVectorColumnMetadata]

val textColumns = if (args.textFeatureInfo.flatten.nonEmpty) {

/*
why is this all commented out?
Oops, it was removed in the next commit
    shouldTrackNulls = args.shouldTrackNulls,
    shouldTrackLen = $(trackTextLen)
  )
} else Array.empty[OpVectorColumnMetadata]
do you not need to track categorical?
Categorical tracking is done above in categoricalColumns
@leahmcguire Sorry, you looked at it right before I got a fix and new test in. It should be ready to go now.
val hashKeys = hashFeatureInfo.map(_.map(_.key))
val ignoreKeys = ignoreFeatureInfo.map(_.map(_.key))

val textKeys = hashKeys.zip(ignoreKeys).map { case (hk, ik) => hk ++ ik }
I am not sure what's going on between lines 252-265. Perhaps add some docs?
I added some for the next commit
@@ -248,32 +285,47 @@ final class SmartTextMapVectorizerModel[T <: OPMap[String]] private[op]
)

private def partitionRow(row: Seq[OPMap[String]]):
(Seq[OPMap[String]], Seq[Seq[String]], Seq[OPMap[String]], Seq[Seq[String]]) = {
(Seq[OPMap[String]], Seq[Seq[String]], Seq[OPMap[String]], Seq[Seq[String]], Seq[OPMap[String]], Seq[Seq[String]]) = {
this return type is bizarre! let's add a private case class PartitionRes(rowCategorical, keysCategorical, rowHash, keysHash, rowIgnore, keysIgnore)
Yeah, it's a bit silly - I can make it a case class for this
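A sketch of the suggested refactor, with `OPMap[String]` simplified to `Map[String, String]` so the snippet stands alone (field names follow the review comment; the real code would keep the vectorizer's scoping and types):

```scala
// Named result type replacing the six-element tuple returned by
// partitionRow. Field names taken from the review suggestion above.
case class PartitionRes(
  rowCategorical: Seq[Map[String, String]],
  keysCategorical: Seq[Seq[String]],
  rowHash: Seq[Map[String, String]],
  keysHash: Seq[Seq[String]],
  rowIgnore: Seq[Map[String, String]],
  keysIgnore: Seq[Seq[String]]
)
```

Call sites can then match on named fields instead of tuple positions, which is the readability win the reviewer is after.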
some comments...
…and SmartTextMapVectorizer the same as previous behavior - don't ignore any features
Related issues
This is the map version of #448
Describe the proposed solution
Adds a few parameters to SmartTextMapVectorizer to allow ignoring text fields that would otherwise be hashed (i.e. not categorical) if they have a token length variance below a specified threshold (e.g. to catch machine-generated IDs).
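The variance check described above can be sketched roughly as follows; `tokenLengthVariance`, `looksLikeId`, and `minVariance` are illustrative names, not the actual parameter names introduced in this PR:

```scala
// Machine-generated IDs tend to have near-constant token lengths, so a
// token-length variance below a small threshold is a signal to ignore
// the field rather than hash it. Illustrative sketch only.
def tokenLengthVariance(tokens: Seq[String]): Double = {
  val lens = tokens.map(_.length.toDouble)
  val mean = lens.sum / lens.length
  lens.map(l => (l - mean) * (l - mean)).sum / lens.length
}

// Hypothetical helper; the real vectorizer exposes this as params.
def looksLikeId(tokens: Seq[String], minVariance: Double): Boolean =
  tokens.nonEmpty && tokenLengthVariance(tokens) < minVariance

val idTokens = Seq("a1b2c3d4", "e5f6a7b8", "c9d0e1f2") // all length 8
val textTokens = Seq("the", "quarterly", "report", "is", "ready")
// looksLikeId(idTokens, 0.5) is true; looksLikeId(textTokens, 0.5) is false
```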
Describe alternatives you've considered
Other alternatives are a sort of topK token counting (e.g. with a countMinSketch). This works, but is difficult to robustly scale with dataset size, and may be implemented later via Algebird's TopKCMS data structure. Filtering data by raw text length std dev, or by how well the text length distribution fits a Poisson distribution, performed better on synthetic data and requires fewer modifications to SmartTextVectorizer.
Additional context
One extra thing we need to be careful of is that we still use the CJK tokenizer for Chinese and Korean text (Japanese already uses a proper language-specific tokenizer), and this tokenizer always splits the text into character bigrams, which would cause it to fail any length distribution checks. We will need to update the Korean and Chinese tokenizers to language-specific ones that pick out words rather than bigrams.
We plan to also add a way to filter based on goodness of fit of the text length distribution to a Poisson distribution in a future PR. All the information is already available to do this, so the modifications should be straightforward.
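As a rough illustration of that future direction (assumed details, not from this PR): for a Poisson distribution the variance equals the mean, so a simple proxy for goodness of fit is the dispersion index variance/mean, which is near 1 for Poisson-like token lengths and near 0 for the near-constant lengths of machine-generated IDs. The planned implementation may well use a fuller goodness-of-fit test; this is only a sketch of the idea:

```scala
// Dispersion index of observed token lengths: variance / mean.
// ~1 suggests Poisson-like natural text; ~0 suggests ID-like text.
// Assumes lengths is non-empty with a positive mean.
def dispersionIndex(lengths: Seq[Int]): Double = {
  val xs = lengths.map(_.toDouble)
  val mean = xs.sum / xs.length
  val variance = xs.map(x => (x - mean) * (x - mean)).sum / xs.length
  variance / mean
}

val naturalLengths = Seq(3, 9, 6, 2, 5) // varied, roughly Poisson-like
val idLengths = Seq(8, 8, 8, 8)         // constant, Poisson-unlike
// dispersionIndex(naturalLengths) == 1.2; dispersionIndex(idLengths) == 0.0
```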