Add support for ignoring text that looks like IDs in SmartTextVectorizer #448
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #448      +/-   ##
==========================================
- Coverage   86.95%   86.95%   -0.01%
==========================================
  Files         337      337
  Lines       11102    11131      +29
  Branches      364      593     +229
==========================================
+ Hits         9654     9679      +25
- Misses       1448     1452       +4
Continue to review full report at Codecov.
  val shouldCleanText = $(cleanText)

  implicit val testStatsMonoid: Semigroup[TextStats] = TextStats.monoid(maxCard)
  val valueStats: Dataset[Array[TextStats]] = dataset.map(_.map(computeTextStats(_, shouldCleanText)).toArray)
  val aggregatedStats: Array[TextStats] = valueStats.reduce(_ + _)

- val (isCategorical, topValues) = aggregatedStats.map { stats =>
+ val (isCategorical, isIgnorable, topValues) = aggregatedStats.map { stats =>
Can something be both ignorable and categorical? If not, perhaps we should replace them with an enum rather than multiplying our booleans...
Not currently. Right now "ignorable" refers only to fields that would otherwise have been hashed. If a field is already low cardinality, I don't think we'd have a reason to ignore it in general, even if it were an ID or something.
We could make this into an enum with three choices (Pivot, Hash, Ignore) if you think that's clearer.
Replaced! Mostly - I still need to propagate the changes to SmartTextMapVectorizer, but this did make things more readable. Good suggestion!
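For reference, a minimal sketch of what such an enum could look like (the names follow the Pivot/Hash/Ignore suggestion above, and the TextStats stub carries only the fields used here, not the PR's full class):

sealed trait TextVectorizationMethod
object TextVectorizationMethod {
  case object Pivot extends TextVectorizationMethod   // low cardinality: pivot into categorical indicators
  case object Hash extends TextVectorizationMethod    // genuine free-form text: hash the tokens
  case object Ignore extends TextVectorizationMethod  // ID-like text: drop from the feature vector
}

// Stub of the fields used below (an assumption, not the PR's full TextStats).
case class TextStats(valueCounts: Map[String, Int], lengthStdDev: Double)

// Selection logic mirroring this PR's isCategorical / isIgnorable checks.
def chooseMethod(stats: TextStats, maxCard: Int, minLenStdDev: Double): TextVectorizationMethod =
  if (stats.valueCounts.size <= maxCard) TextVectorizationMethod.Pivot
  else if (stats.lengthStdDev <= minLenStdDev) TextVectorizationMethod.Ignore
  else TextVectorizationMethod.Hash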
val isCategorical = stats.valueCounts.size <= maxCard
val isIgnorable = stats.lengthStdDev <= minLenStdDev
is this really all you need to identify IDs?
This is the simplest check that worked on most of the synthetic data. The top-K counts from a count-min sketch worked about as well, but it was harder to construct a threshold that worked across different dataset sizes, and it required more changes to the TextStats class. We could add that in the future, but I wanted to try the trivial check on real data first.
The best-performing criterion was a goodness-of-fit threshold on how well the text lengths follow an MLE Poisson fit. I was going to add that one in another PR, but could put it in this PR too. It's simple enough that it shouldn't need any curve-fitting libraries.
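To make that concrete, here is a rough sketch of the goodness-of-fit idea (an illustration of the approach described above, not this PR's code; the threshold choice is left open). The MLE for a Poisson rate is just the sample mean, so no external curve-fitting libraries are needed:

def poissonGofStatistic(lengthCounts: Map[Int, Long]): Double = {
  val n = lengthCounts.values.sum.toDouble
  val lambda = lengthCounts.map { case (len, c) => len * c }.sum / n  // MLE rate = mean text length

  // Poisson log-PMF, computed in log space to avoid overflowing the factorial.
  def logPmf(k: Int): Double = k * math.log(lambda) - lambda - (2 to k).map(math.log(_)).sum

  // Pearson chi-squared statistic over the observed lengths: large values mean
  // the lengths deviate from the Poisson fit (e.g., the near-constant lengths
  // of machine-generated IDs), small values look like natural text.
  lengthCounts.map { case (len, observed) =>
    val expected = n * math.exp(logPmf(len))
    val diff = observed - expected
    diff * diff / expected
  }.sum
}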
[Resolved review thread (outdated) on core/src/main/scala/com/salesforce/op/stages/impl/feature/SmartTextVectorizer.scala]
val newLengthCounts =
  if (l.lengthCounts.size > maxCardinality) l.lengthCounts
  else if (r.lengthCounts.size > maxCardinality) r.lengthCounts
  else l.lengthCounts + r.lengthCounts
can we make this a function?
Probably. There's a chunk for Map[String, Int] and a chunk for Map[Int, Int], so we should be able to combine them since each has a monoid.
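Something like the following could work as the shared helper (a sketch; mergeWithCap is a made-up name, not this PR's code). Both Map[String, Int] and Map[Int, Int] are monoids under element-wise addition, so one generic function covers both chunks:

// Keep whichever side already exceeds the cap so the map stops growing;
// otherwise merge by summing counts key-wise (the Map[K, Int] monoid).
def mergeWithCap[K](l: Map[K, Int], r: Map[K, Int], maxCardinality: Int): Map[K, Int] =
  if (l.size > maxCardinality) l
  else if (r.size > maxCardinality) r
  else r.foldLeft(l) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0) + v) }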
@Jauntbox Regarding creating an enum to replace the booleans in SmartTextVectorizer, I have done this already on my personal branch for incorporating name detection in STV (https://github.com/MWYang/TransmogrifAI/pull/1/files). (Look for …)
- case Some(v) => Map(cleanTextFn(v, shouldCleanText) -> 1)
- case None => Map.empty[String, Int]
+ val (valueCounts, lengthCounts) = text match {
+   case Some(v) => (Map(cleanTextFn(v, shouldCleanText) -> 1), Map(cleanTextFn(v, shouldCleanText).length -> 1))
Not specific to this line, but will we have Int overflows in larger texts with more than 2G uniques?
Hmmm, good point. That would require a huge dataset, but it's an easy change to switch these to Longs.
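For illustration, the Long-counts variant might look like the following (TextStats field names follow the diffs in this thread; cleanTextFn here is a stand-in stub, not the real implementation):

case class TextStats(valueCounts: Map[String, Long], lengthCounts: Map[Int, Long])

// Stand-in for the real cleanTextFn (an assumption for this sketch).
def cleanTextFn(s: String, shouldClean: Boolean): String = if (shouldClean) s.trim.toLowerCase else s

def computeTextStats(text: Option[String], shouldCleanText: Boolean): TextStats = text match {
  case Some(v) =>
    val cleaned = cleanTextFn(v, shouldCleanText)
    TextStats(Map(cleaned -> 1L), Map(cleaned.length -> 1L))  // 1L: counts accumulate as Long
  case None => TextStats(Map.empty, Map.empty)
}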
@@ -72,7 +72,8 @@ class SmartTextMapVectorizer[T <: OPMap[String]]
    textMap: T#Value, shouldCleanKeys: Boolean, shouldCleanValues: Boolean
  ): TextMapStats = {
    val keyValueCounts = textMap.map{ case (k, v) =>
-     cleanTextFn(k, shouldCleanKeys) -> TextStats(Map(cleanTextFn(v, shouldCleanValues) -> 1))
+     cleanTextFn(k, shouldCleanKeys) ->
+       TextStats(Map(cleanTextFn(v, shouldCleanValues) -> 1), Map(cleanTextFn(v, shouldCleanValues).length -> 1))
It looks like you are going to have some merge conflicts - #449
[Resolved review thread (outdated) on core/src/main/scala/com/salesforce/op/stages/impl/feature/SmartTextVectorizer.scala]
[Resolved review thread (outdated) on core/src/test/scala/com/salesforce/op/stages/impl/feature/SmartTextVectorizerTest.scala]
lgtm, +1 on enum comment that @leahmcguire had
undeletable comment ;)
oof!
"oof!" someone has been hanging out with @snabar :-P |
Ooooooofffff! |
@Jauntbox lgtm. Compilation failed though. I presume the merge conflict is to blame? - https://travis-ci.com/salesforce/TransmogrifAI/jobs/269487360#L695
Also https://travis-ci.com/salesforce/TransmogrifAI/jobs/272953824#L508
Related issues
N/A
Describe the proposed solution
Adds a few parameters to SmartTextVectorizer that allow it to ignore text fields which would otherwise be hashed (i.e., not treated as categorical) when their token length standard deviation falls below a specified threshold (e.g., to catch machine-generated IDs).
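A hedged usage sketch (the setters setMaxCardinality and setMinLengthStdDev are inferred from the maxCard / minLenStdDev parameters in the diffs above and may not match the final API; the Record type and features are purely illustrative):

import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._
import com.salesforce.op.stages.impl.feature.SmartTextVectorizer

case class Record(description: String, userId: String)

// Illustrative text features extracted from the record.
val description = FeatureBuilder.Text[Record].extract(_.description.toText).asPredictor
val userId = FeatureBuilder.Text[Record].extract(_.userId.toText).asPredictor

val vectorized = new SmartTextVectorizer[Text]()
  .setMaxCardinality(100)    // more uniques than this => not categorical
  .setMinLengthStdDev(1.0)   // length std dev below this => treat as ID-like and ignore
  .setInput(description, userId)
  .getOutput()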
Describe alternatives you've considered
Another alternative is a form of top-K token counting (e.g., with a count-min sketch). This works, but it is difficult to scale robustly with dataset size; it may be implemented later via Algebird's TopKCMS data structure. Filtering by the standard deviation of raw text lengths, or by how well the text length distribution fits a Poisson distribution, performed better on synthetic data and requires fewer modifications to SmartTextVectorizer.
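For completeness, the count-min-sketch alternative might look like the following with Algebird's TopNCMS (a sketch only: the eps/delta/topN values are illustrative, and we hash values to Long so the built-in CMSHasher[Long] applies):

import com.twitter.algebird.{TopCMS, TopNCMS}

val eps = 0.001     // relative error bound on the counts
val delta = 1e-8    // probability of exceeding the error bound
val seed = 1
val topN = 20       // number of heavy hitters to track

val cmsMonoid = TopNCMS.monoid[Long](eps, delta, seed, topN)

// Build one sketch per text field; ID-like fields have (almost) all-unique
// values, so even their top-N heavy hitters carry a tiny share of the total.
def sketchOf(values: Seq[String]): TopCMS[Long] =
  cmsMonoid.sum(values.map(v => cmsMonoid.create(v.hashCode.toLong)))

val heavyHitters: Set[Long] = sketchOf(Seq("id-001", "id-002", "id-003")).heavyHitters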
Additional context
One extra thing to be careful of: we still use the CJK tokenizer for Chinese and Korean text (Japanese already uses a proper language-specific tokenizer), and that tokenizer always splits the text into character bigrams, which would cause those fields to fail any length-distribution check. We will need to update the Korean and Chinese tokenizers to language-specific ones that pick out words rather than bigrams.
We also plan to add a way to filter based on the goodness of fit of the text length distribution to a Poisson distribution in a future PR. All the information needed is already available, so the modifications should be straightforward.