-
Notifications
You must be signed in to change notification settings - Fork 394
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make EmailVectorizer not clean the email domains by default. #426
Conversation
Codecov Report
@@ Coverage Diff @@
## master #426 +/- ##
==========================================
- Coverage 77.93% 77.91% -0.03%
==========================================
Files 337 337
Lines 11082 11101 +19
Branches 355 370 +15
==========================================
+ Hits 8637 8649 +12
- Misses 2445 2452 +7
Continue to review full report at Codecov.
|
} else { | ||
cleanTextFn(k, shouldCleanKey) -> v | ||
cleanTextFn(k, shouldCleanKey, TransmogrifierDefaults.CleanParams) -> v |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replace TransmogrifierDefaults.CleanParams
with shouldCleanValue
} | ||
case (k: String, v) => cleanTextFn(k, shouldCleanKey) -> v | ||
case (k: String, v) => cleanTextFn(k, shouldCleanKey, TransmogrifierDefaults.CleanParams) -> v |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same - replace TransmogrifierDefaults.CleanParams
with shouldCleanValue
@@ -608,8 +621,8 @@ trait MapPivotParams extends Params { | |||
protected def filterKeys[V](m: Map[String, V], shouldCleanKey: Boolean, shouldCleanValue: Boolean): Map[String, V] = { | |||
val map = cleanMap[V](m, shouldCleanKey, shouldCleanValue) | |||
val (whiteList, blackList) = ( | |||
$(whiteListKeys).map(cleanTextFn(_, shouldCleanKey)), | |||
$(blackListKeys).map(cleanTextFn(_, shouldCleanKey)) | |||
$(whiteListKeys).map(cleanTextFn(_, shouldCleanKey, TransmogrifierDefaults.CleanParams)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and same here - replace TransmogrifierDefaults.CleanParams
with shouldCleanValue
@@ -600,11 +602,12 @@ trait RichTextFeature { | |||
minSupport: Int, | |||
trackNulls: Boolean = TransmogrifierDefaults.TrackNulls, | |||
others: Array[FeatureLike[Email]] = Array.empty, | |||
maxPctCardinality: Double = OpOneHotVectorizer.MaxPctCardinality | |||
maxPctCardinality: Double = OpOneHotVectorizer.MaxPctCardinality, | |||
cleanTextParams: CleanTextParams = CleanTextParams(true, false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name boolean arguments
@@ -1026,15 +1028,16 @@ trait RichMapFeature { | |||
blackListKeys: Array[String] = Array.empty, | |||
trackNulls: Boolean = TransmogrifierDefaults.TrackNulls, | |||
others: Array[FeatureLike[EmailMap]] = Array.empty, | |||
maxPctCardinality: Double = OpOneHotVectorizer.MaxPctCardinality | |||
maxPctCardinality: Double = OpOneHotVectorizer.MaxPctCardinality, | |||
cleanTextParams: CleanTextParams = CleanTextParams(true, false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are we going to evolve these params, keep adding booleans? should this be a list of enum values with corresponding text transformers?
raw | ||
.toLowerCase | ||
def cleanString(raw: String, splitOn: String = " ", cleanTextParams: CleanTextParams = defaultCleanParams): String = { | ||
val l = if (cleanTextParams.ignoreCase) raw.toLowerCase else raw |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like JDK has an interesting twist on ignoreCase and actually prefers to normalize to the upper case unless for the Georgian alphabet http:https://hg.openjdk.java.net/jdk7u/jdk7u6/jdk/file/8c2c5d63a17e/src/share/classes/java/lang/String.java#l1356
Consider more explicit flags: lowerCase, upperCase instead of more general ignoreCase
@gerashegalov For now I am closing this PR since I am going to just turn off the email cleaning directly in AutoML and leave TMOG as it is. In future we can make changes to TMOG to provide more granular control on how the text is cleaned if there are many use-cases which require it. |
Related issues
Users want to see the text of the email domain as an indicator variable, the way a true email address does (including punctuation)
EG: Today, email field [email protected] will have a column with indicator value salesforcecom, because the text of the domain name has been cleaned. Instead we would like it say "salesforce.com"
Describe the proposed solution
Added a
case class CleanTextParams(ignoreCase: Boolean, cleanPunctuations: Boolean)
which will give us more control on how the text is going to cleaned in general. In future more parameters can be added.By default across all features, this would be
CleanTextParams(true, true)
except for email/emailMap features, in which it would beCleanTextParams(true, false)