Add categorical detection to be coverage based in addition to unique count based #473
Related issues
Currently SmartTextVectorizer and SmartTextMapVectorizer count the number of unique entries in a text field (up to a threshold, currently 50) and treat the feature as categorical if it has fewer than 50 unique entries.
However, you can still run into features that are effectively categorical but have a long tail of low-frequency entries. We would extract better signal from these features if we treated them as categorical instead of hashing them.
Describe the proposed solution
Add an extra check that allows Text(Map) features to become categorical. This only applies to features whose cardinality is higher than the threshold and that would therefore be hashed.
A better approach to detecting text features that are really categorical is a coverage criterion: for example, if the top K entries with minimum support cover at least 90% of the entries, then the feature is a good candidate to pivot by entry instead of hashing by token. The 90% value can be tuned by the user via a parameter.
One extra check needs to be passed:
If there are m < topK elements with the required minimum support, then we look at the coverage of those m elements.
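The coverage criterion above can be sketched roughly as follows. This is an illustrative Python sketch, not TransmogrifAI's actual (Scala) implementation; the function name and parameters are hypothetical:

```python
from collections import Counter

def is_covered_categorical(value_counts, top_k, min_support, coverage_threshold):
    """Return True when the top-K entries with minimum support cover at
    least `coverage_threshold` of all observed entries (hypothetical check)."""
    total = sum(value_counts.values())
    if total == 0:
        return False
    # Keep at most top_k entries, each seen at least min_support times;
    # if only m < top_k entries qualify, coverage is computed over those m.
    kept = [c for _, c in Counter(value_counts).most_common(top_k) if c >= min_support]
    return sum(kept) / total >= coverage_threshold
```

For example, a field with three dominant values and a long tail of rare typos would pass the check and be pivoted rather than hashed:

```python
counts = {"visa": 50, "mastercard": 30, "amex": 15,
          "vsa": 1, "viza": 1, "mcard": 1, "amx": 1, "amexx": 1}
is_covered_categorical(counts, top_k=3, min_support=2, coverage_threshold=0.9)
# True: the top 3 entries cover 95 of 100 observations
```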
Describe alternatives you've considered
I've considered using Algebird's Count-Min Sketch to compute the current TextStats. However, I ran into multiple issues: TopNCMS only returns the "heavy hitters", but you need much more than that (e.g. the cardinality) in order to apply the coverage method. A branch still exists (mw/coverage), but it is in shambles.
Additional context
Some criticism regarding TextStats: it does not seem to be a semigroup, as its combine operation is not associative. Was that intended?
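To illustrate the associativity concern, here is a toy Python model of a value-count aggregation with a cardinality cap. It is hypothetical, not the actual TextStats code; the point is only that short-circuiting the merge once one side exceeds the cap makes the result depend on how elements are grouped:

```python
MAX_CARDINALITY = 2  # toy cap on distinct values

def combine(l, r):
    # Hypothetical rule: once one side's distinct-value count exceeds the
    # cap, keep that side unchanged instead of merging further.
    if len(l) > MAX_CARDINALITY:
        return l
    if len(r) > MAX_CARDINALITY:
        return r
    merged = dict(l)
    for k, v in r.items():
        merged[k] = merged.get(k, 0) + v
    return merged

a = {"x": 1, "y": 1}
b = {"z": 1, "w": 1}
c = {"x": 1}
left = combine(combine(a, b), c)   # a+b exceeds the cap, so c is dropped
right = combine(a, combine(b, c))  # b+c exceeds the cap, so a is dropped
# left != right: the capped combine is not associative
```

Because distributed aggregation (e.g. in Spark) is free to regroup partial sums, a non-associative combine like this can produce different statistics for the same data depending on partitioning.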