Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for ignoring text that looks like IDs in SmartTextVectorizer #448

Merged
merged 15 commits into from
Jan 8, 2020

Conversation

Jauntbox
Copy link
Contributor

Related issues
N/A

Describe the proposed solution
Adds a few parameters to SmartTextVectorizer to allow for ignoring text fields that would be hashed (eg. not categorical) if they have a token length variance below a specified threshold (eg. to catch machine-generated IDs).

Describe alternatives you've considered
Other alternatives are a sort of topK token counting (eg. with a countMinSketch). This works, but is difficult to robustly scale with dataset size, and may be implemented later via Algebird's TopKCMS data structure. Filtering data by raw text length std dev, or by how well the text length distribution fits a poisson distribution performed better on synthetic data and requires less modifications to SmartTextVectorizer.

Additional context
Extra thing we need to be careful of is that we still use the CJK tokenizer for Chinese and Korean text (Japanese uses a proper language-specific tokenizer already), and this tokenizer always splits the text into character bigrams which would cause it to fail any length distribution checks. We will need to update the Korean & Chinese tokenizers to language-specific ones that will pick out words rather than bigrams.

We plan to also add a way to filter based on goodness of fit of the text length distribution to a Poisson distribution in a future PR. All the information is already available to do this, so the modifications should be straightforward.

@codecov
Copy link

codecov bot commented Dec 16, 2019

Codecov Report

Merging #448 into master will decrease coverage by <.01%.
The diff coverage is 94.91%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #448      +/-   ##
==========================================
- Coverage   86.95%   86.95%   -0.01%     
==========================================
  Files         337      337              
  Lines       11102    11131      +29     
  Branches      364      593     +229     
==========================================
+ Hits         9654     9679      +25     
- Misses       1448     1452       +4
Impacted Files Coverage Δ
...om/salesforce/op/filters/FeatureDistribution.scala 98.66% <100%> (ø) ⬆️
...sforce/op/stages/OpPipelineStageReaderWriter.scala 86.66% <100%> (+0.45%) ⬆️
...p/stages/impl/feature/SmartTextMapVectorizer.scala 100% <100%> (ø) ⬆️
...e/op/stages/impl/feature/SmartTextVectorizer.scala 95.61% <94.44%> (-3.24%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bad14d5...8051f75. Read the comment docs.

val shouldCleanText = $(cleanText)

implicit val testStatsMonoid: Semigroup[TextStats] = TextStats.monoid(maxCard)
val valueStats: Dataset[Array[TextStats]] = dataset.map(_.map(computeTextStats(_, shouldCleanText)).toArray)
val aggregatedStats: Array[TextStats] = valueStats.reduce(_ + _)

val (isCategorical, topValues) = aggregatedStats.map { stats =>
val (isCategorical, isIgnorable, topValues) = aggregatedStats.map { stats =>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can something be both ignorable and categorical? if not perhaps we should replace with an enum rather than multiplying our booleans...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not currently. Right now "ignorable" refers just to fields that would have been hashed otherwise. If a field is already low cardinality, I don't think we'd have a reason to ignore it in general, even if it was an ID or something.

We could make this into an enum with three choices (Pivot, Hash, Ignore) if you think that's clearer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced! Mostly - I still need to propagate the changes to SmartTextMapVectorizer, but this did make things more readable. Good suggestion!

val isCategorical = stats.valueCounts.size <= maxCard
val isIgnorable = stats.lengthStdDev <= minLenStdDev
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this really all you need to identify IDs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the simplest check that worked on most of the synthetic data. The topK counts from a count-min sketch worked about as well, but it was harder to construct a threshold that worked across different dataset sizes and required more changes to the TextStats class. We could add that in the future, but I wanted to try out the trivial one on real data first.

The best performing criteria was a goodness of fit threshold on how well the text lengths follow an MLE Poisson fit. I was going to add that one in another PR, but could put it in this PR too. It's simple enough that it shouldn't need any curve fitting libraries.


val newLengthCounts = if (l.lengthCounts.size > maxCardinality) l.lengthCounts
else if (r.lengthCounts.size > maxCardinality) r.lengthCounts
else l.lengthCounts + r.lengthCounts
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we make this a function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, there's a chunk for Map[String, Int] and a chunk for Map[Int, Int] so we should be able to combine since each has a monoid.

@MWYang
Copy link
Contributor

MWYang commented Dec 17, 2019

@Jauntbox Regarding creating an enum to replace the booleans in SmartTextVectorizer, I have done this already on my personal branch for incorporating name detection in STV (https://github.com/MWYang/TransmogrifAI/pull/1/files) (Look for SmartTextVectorizerAction.) Hopefully that's helpful, even though my changes are a lot to look through right now. 😅

case Some(v) => Map(cleanTextFn(v, shouldCleanText) -> 1)
case None => Map.empty[String, Int]
val (valueCounts, lengthCounts) = text match {
case Some(v) => (Map(cleanTextFn(v, shouldCleanText) -> 1), Map(cleanTextFn(v, shouldCleanText).length -> 1))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not specific here, but will we have Int overflows in larger texts with more than 2G uniques?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm good point. That would require a huge dataset but it's an easy change to switch these to Longs

@@ -72,7 +72,8 @@ class SmartTextMapVectorizer[T <: OPMap[String]]
textMap: T#Value, shouldCleanKeys: Boolean, shouldCleanValues: Boolean
): TextMapStats = {
val keyValueCounts = textMap.map{ case (k, v) =>
cleanTextFn(k, shouldCleanKeys) -> TextStats(Map(cleanTextFn(v, shouldCleanValues) -> 1))
cleanTextFn(k, shouldCleanKeys) ->
TextStats(Map(cleanTextFn(v, shouldCleanValues) -> 1), Map(cleanTextFn(v, shouldCleanValues).length -> 1))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if looks like you are going to have some merge conflicts - #449

Copy link
Collaborator

@tovbinm tovbinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, +1 on enum comment that @leahmcguire had

Copy link
Collaborator

@tovbinm tovbinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

undeletable comment ;)

Copy link
Collaborator

@tovbinm tovbinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oof!

@leahmcguire
Copy link
Collaborator

"oof!" someone has been hanging out with @snabar :-P

@snabar
Copy link
Contributor

snabar commented Dec 20, 2019

"oof!" someone has been hanging out with @snabar :-P

Ooooooofffff!

@tovbinm
Copy link
Collaborator

tovbinm commented Jan 7, 2020

@Jauntbox lgtm.

Compilation failed though. I presume merge conflict to blame? - https://travis-ci.com/salesforce/TransmogrifAI/jobs/269487360#L695

@tovbinm
Copy link
Collaborator

tovbinm commented Jan 7, 2020

Also

warning file=/home/travis/build/salesforce/TransmogrifAI/features/src/main/scala/com/salesforce/op/stages/impl/feature/TextVectorizationMethod.scala message=Header does not match expected text line=2

https://travis-ci.com/salesforce/TransmogrifAI/jobs/272953824#L508

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants