Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make EmailVectorizer not clean the email domains by default. #426

Closed
wants to merge 6 commits into from

Conversation

sanmitra
Copy link
Contributor

Related issues
Users want to see the text of the email domain as an indicator variable, the way a true email address does (including punctuation)
EG: Today, email field [email protected] will have a column with indicator value salesforcecom, because the text of the domain name has been cleaned. Instead we would like it say "salesforce.com"

Describe the proposed solution
Added a case class CleanTextParams(ignoreCase: Boolean, cleanPunctuations: Boolean) which will give us more control on how the text is going to cleaned in general. In future more parameters can be added.
By default across all features, this would be CleanTextParams(true, true) except for email/emailMap features, in which it would be CleanTextParams(true, false)

@codecov
Copy link

codecov bot commented Oct 17, 2019

Codecov Report

Merging #426 into master will decrease coverage by 0.02%.
The diff coverage is 80.76%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #426      +/-   ##
==========================================
- Coverage   77.93%   77.91%   -0.03%     
==========================================
  Files         337      337              
  Lines       11082    11101      +19     
  Branches      355      370      +15     
==========================================
+ Hits         8637     8649      +12     
- Misses       2445     2452       +7
Impacted Files Coverage Δ
.../scala/com/salesforce/op/dsl/RichTextFeature.scala 61.97% <ø> (ø) ⬆️
...n/scala/com/salesforce/op/dsl/RichMapFeature.scala 41.17% <ø> (ø) ⬆️
...ce/op/stages/impl/feature/OpOneHotVectorizer.scala 96.84% <100%> (+0.06%) ⬆️
...p/stages/impl/feature/TextMapPivotVectorizer.scala 100% <100%> (ø) ⬆️
...scala/com/salesforce/op/utils/text/TextUtils.scala 63.63% <60%> (-36.37%) ⬇️
...sforce/op/stages/impl/feature/Transmogrifier.scala 73.27% <92.3%> (+0.6%) ⬆️
...es/src/main/scala/com/salesforce/op/OpParams.scala 85.71% <0%> (-4.09%) ⬇️
.../op/features/types/FeatureTypeSparkConverter.scala 98.23% <0%> (-0.89%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b8bae1c...9136753. Read the comment docs.

} else {
cleanTextFn(k, shouldCleanKey) -> v
cleanTextFn(k, shouldCleanKey, TransmogrifierDefaults.CleanParams) -> v
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace TransmogrifierDefaults.CleanParams with shouldCleanValue

}
case (k: String, v) => cleanTextFn(k, shouldCleanKey) -> v
case (k: String, v) => cleanTextFn(k, shouldCleanKey, TransmogrifierDefaults.CleanParams) -> v
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same - replace TransmogrifierDefaults.CleanParams with shouldCleanValue

@@ -608,8 +621,8 @@ trait MapPivotParams extends Params {
protected def filterKeys[V](m: Map[String, V], shouldCleanKey: Boolean, shouldCleanValue: Boolean): Map[String, V] = {
val map = cleanMap[V](m, shouldCleanKey, shouldCleanValue)
val (whiteList, blackList) = (
$(whiteListKeys).map(cleanTextFn(_, shouldCleanKey)),
$(blackListKeys).map(cleanTextFn(_, shouldCleanKey))
$(whiteListKeys).map(cleanTextFn(_, shouldCleanKey, TransmogrifierDefaults.CleanParams)),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and same here - replace TransmogrifierDefaults.CleanParams with shouldCleanValue

@@ -600,11 +602,12 @@ trait RichTextFeature {
minSupport: Int,
trackNulls: Boolean = TransmogrifierDefaults.TrackNulls,
others: Array[FeatureLike[Email]] = Array.empty,
maxPctCardinality: Double = OpOneHotVectorizer.MaxPctCardinality
maxPctCardinality: Double = OpOneHotVectorizer.MaxPctCardinality,
cleanTextParams: CleanTextParams = CleanTextParams(true, false)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name boolean arguments

@@ -1026,15 +1028,16 @@ trait RichMapFeature {
blackListKeys: Array[String] = Array.empty,
trackNulls: Boolean = TransmogrifierDefaults.TrackNulls,
others: Array[FeatureLike[EmailMap]] = Array.empty,
maxPctCardinality: Double = OpOneHotVectorizer.MaxPctCardinality
maxPctCardinality: Double = OpOneHotVectorizer.MaxPctCardinality,
cleanTextParams: CleanTextParams = CleanTextParams(true, false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we going to evolve these params, keep adding booleans? should this be a list of enum values with corresponding text transformers?

raw
.toLowerCase
def cleanString(raw: String, splitOn: String = " ", cleanTextParams: CleanTextParams = defaultCleanParams): String = {
val l = if (cleanTextParams.ignoreCase) raw.toLowerCase else raw
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like JDK has an interesting twist on ignoreCase and actually prefers to normalize to the upper case unless for the Georgian alphabet http:https://hg.openjdk.java.net/jdk7u/jdk7u6/jdk/file/8c2c5d63a17e/src/share/classes/java/lang/String.java#l1356
Consider more explicit flags: lowerCase, upperCase instead of more general ignoreCase

@sanmitra
Copy link
Contributor Author

sanmitra commented Dec 3, 2019

@gerashegalov For now I am closing this PR since I am going to just turn off the email cleaning directly in AutoML and leave TMOG as it is. In future we can make changes to TMOG to provide more granular control on how the text is cleaned if there are many use-cases which require it.

@sanmitra sanmitra closed this Dec 3, 2019
@tovbinm tovbinm deleted the san/email-clean-2 branch June 12, 2020 01:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants