Make EmailVectorizer not clean the email domains by default. #426

sanmitra · 2019-10-17T18:55:05Z

Related issues
Users want the email domain to appear in the indicator variable exactly as it does in a real email address, including punctuation.
For example, today the email field [email protected] produces a column with the indicator value salesforcecom, because the text of the domain name has been cleaned. Instead, we would like it to say "salesforce.com".

Describe the proposed solution
Added a case class CleanTextParams(ignoreCase: Boolean, cleanPunctuations: Boolean), which gives us more control over how text is cleaned in general. More parameters can be added in the future.
By default, all features would use CleanTextParams(true, true), except for email/emailMap features, which would use CleanTextParams(true, false).
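
A minimal sketch of how the two defaults described above could behave. Only the `CleanTextParams` fields come from the description; the `CleanTextSketch` object, its `DefaultParams` and `EmailParams` values, and the simplified `clean` function are assumptions made for illustration and are not the actual TextUtils code.

```scala
// Fields taken from the PR description.
final case class CleanTextParams(ignoreCase: Boolean, cleanPunctuations: Boolean)

// Hypothetical holder object; the names below are illustrative only.
object CleanTextSketch {
  // Assumed default for most features: lower-case and strip punctuation.
  val DefaultParams: CleanTextParams =
    CleanTextParams(ignoreCase = true, cleanPunctuations = true)

  // Assumed default for email/emailMap features: keep punctuation.
  val EmailParams: CleanTextParams =
    CleanTextParams(ignoreCase = true, cleanPunctuations = false)

  // Simplified stand-in for the real cleaning routine: optionally lower-case,
  // then optionally drop every character that is not a letter or digit.
  def clean(raw: String, params: CleanTextParams = DefaultParams): String = {
    val cased = if (params.ignoreCase) raw.toLowerCase else raw
    if (params.cleanPunctuations) cased.filter(_.isLetterOrDigit) else cased
  }
}
```

Under these assumptions, `CleanTextSketch.clean("Salesforce.com")` drops the dot, while passing `CleanTextSketch.EmailParams` keeps the domain readable.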

codecov · 2019-10-17T19:17:40Z

Codecov Report

Merging #426 into master will decrease coverage by 0.02%.
The diff coverage is 80.76%.

@@            Coverage Diff             @@
##           master     #426      +/-   ##
==========================================
- Coverage   77.93%   77.91%   -0.03%     
==========================================
  Files         337      337              
  Lines       11082    11101      +19     
  Branches      355      370      +15     
==========================================
+ Hits         8637     8649      +12     
- Misses       2445     2452       +7

Impacted Files	Coverage Δ
.../scala/com/salesforce/op/dsl/RichTextFeature.scala	`61.97% <ø> (ø)`	⬆️
...n/scala/com/salesforce/op/dsl/RichMapFeature.scala	`41.17% <ø> (ø)`	⬆️
...ce/op/stages/impl/feature/OpOneHotVectorizer.scala	`96.84% <100%> (+0.06%)`	⬆️
...p/stages/impl/feature/TextMapPivotVectorizer.scala	`100% <100%> (ø)`	⬆️
...scala/com/salesforce/op/utils/text/TextUtils.scala	`63.63% <60%> (-36.37%)`	⬇️
...sforce/op/stages/impl/feature/Transmogrifier.scala	`73.27% <92.3%> (+0.6%)`	⬆️
...es/src/main/scala/com/salesforce/op/OpParams.scala	`85.71% <0%> (-4.09%)`	⬇️
.../op/features/types/FeatureTypeSparkConverter.scala	`98.23% <0%> (-0.89%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b8bae1c...9136753.

tovbinm · 2019-10-17T19:58:55Z

core/src/main/scala/com/salesforce/op/stages/impl/feature/Transmogrifier.scala

 } else {
- cleanTextFn(k, shouldCleanKey) -> v
+ cleanTextFn(k, shouldCleanKey, TransmogrifierDefaults.CleanParams) -> v


replace TransmogrifierDefaults.CleanParams with shouldCleanValue

tovbinm · 2019-10-17T19:59:03Z

core/src/main/scala/com/salesforce/op/stages/impl/feature/Transmogrifier.scala

 }
- case (k: String, v) => cleanTextFn(k, shouldCleanKey) -> v
+ case (k: String, v) => cleanTextFn(k, shouldCleanKey, TransmogrifierDefaults.CleanParams) -> v


same - replace TransmogrifierDefaults.CleanParams with shouldCleanValue

tovbinm · 2019-10-17T19:59:15Z

core/src/main/scala/com/salesforce/op/stages/impl/feature/Transmogrifier.scala

@@ -608,8 +621,8 @@ trait MapPivotParams extends Params {
 protected def filterKeys[V](m: Map[String, V], shouldCleanKey: Boolean, shouldCleanValue: Boolean): Map[String, V] = {
 val map = cleanMap[V](m, shouldCleanKey, shouldCleanValue)
 val (whiteList, blackList) = (
- $(whiteListKeys).map(cleanTextFn(_, shouldCleanKey)),
- $(blackListKeys).map(cleanTextFn(_, shouldCleanKey))
+ $(whiteListKeys).map(cleanTextFn(_, shouldCleanKey, TransmogrifierDefaults.CleanParams)),


and same here - replace TransmogrifierDefaults.CleanParams with shouldCleanValue

tovbinm · 2019-10-17T20:00:48Z

core/src/main/scala/com/salesforce/op/dsl/RichTextFeature.scala

@@ -600,11 +602,12 @@ trait RichTextFeature {
 minSupport: Int,
 trackNulls: Boolean = TransmogrifierDefaults.TrackNulls,
 others: Array[FeatureLike[Email]] = Array.empty,
- maxPctCardinality: Double = OpOneHotVectorizer.MaxPctCardinality
+ maxPctCardinality: Double = OpOneHotVectorizer.MaxPctCardinality,
+ cleanTextParams: CleanTextParams = CleanTextParams(true, false)


name boolean arguments
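
A short sketch of what naming the boolean arguments could look like. The case class mirrors the one in the PR description; the `NamedArgsSketch` object and its value names are hypothetical.

```scala
// Same shape as the case class proposed in this PR.
final case class CleanTextParams(ignoreCase: Boolean, cleanPunctuations: Boolean)

// Hypothetical holder object used only to contrast the two call styles.
object NamedArgsSketch {
  // Positional booleans: the reader has to look up the field order.
  val positional: CleanTextParams = CleanTextParams(true, false)

  // Named booleans: the intent is visible at the call site.
  val named: CleanTextParams =
    CleanTextParams(ignoreCase = true, cleanPunctuations = false)
}
```

Both values are equal; only the readability of the call site differs.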

gerashegalov · 2019-10-24T19:46:38Z

core/src/main/scala/com/salesforce/op/dsl/RichMapFeature.scala

@@ -1026,15 +1028,16 @@ trait RichMapFeature {
 blackListKeys: Array[String] = Array.empty,
 trackNulls: Boolean = TransmogrifierDefaults.TrackNulls,
 others: Array[FeatureLike[EmailMap]] = Array.empty,
- maxPctCardinality: Double = OpOneHotVectorizer.MaxPctCardinality
+ maxPctCardinality: Double = OpOneHotVectorizer.MaxPctCardinality,
+ cleanTextParams: CleanTextParams = CleanTextParams(true, false)


How are we going to evolve these params? Keep adding booleans? Should this be a list of enum values with corresponding text transformers?
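
One possible shape for the enum-style alternative raised in this comment, sketched under assumptions: the `TextCleanStep` trait, its case objects, and `run` are all hypothetical names that do not appear in the PR.

```scala
// Hypothetical alternative to boolean flags: each step is a named transformer.
sealed trait TextCleanStep {
  def apply(s: String): String
}

object TextCleanStep {
  case object LowerCase extends TextCleanStep {
    def apply(s: String): String = s.toLowerCase
  }

  case object StripPunctuation extends TextCleanStep {
    // \p{Punct} matches ASCII punctuation characters such as '.' and '@'.
    def apply(s: String): String = s.replaceAll("\\p{Punct}", "")
  }

  case object Trim extends TextCleanStep {
    def apply(s: String): String = s.trim
  }

  // Apply the configured steps from left to right.
  def run(steps: Seq[TextCleanStep], raw: String): String =
    steps.foldLeft(raw)((acc, step) => step(acc))
}
```

With this design, adding a new cleaning behavior means adding a case object rather than another boolean parameter, and an email default could simply omit `StripPunctuation` from its list.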

gerashegalov · 2019-10-24T19:59:19Z

utils/src/main/scala/com/salesforce/op/utils/text/TextUtils.scala

- raw
- .toLowerCase
+ def cleanString(raw: String, splitOn: String = " ", cleanTextParams: CleanTextParams = defaultCleanParams): String = {
+ val l = if (cleanTextParams.ignoreCase) raw.toLowerCase else raw


Looks like the JDK has an interesting twist on ignoreCase: it actually prefers to normalize to upper case, except for the Georgian alphabet. See https://hg.openjdk.java.net/jdk7u/jdk7u6/jdk/file/8c2c5d63a17e/src/share/classes/java/lang/String.java#l1356
Consider more explicit flags, lowerCase and upperCase, instead of the more general ignoreCase.
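
A sketch of how explicit case handling might look. Instead of two separate booleans, which could both be set at once, it assumes a single three-valued mode; `CaseMode` and `applyCase` are hypothetical names, and the use of `Locale.ROOT` is an added assumption to keep results independent of the JVM default locale.

```scala
import java.util.Locale

// Hypothetical replacement for the ignoreCase flag.
sealed trait CaseMode

object CaseMode {
  case object Keep extends CaseMode
  case object Lower extends CaseMode
  case object Upper extends CaseMode

  // Normalize the case of the raw text according to the chosen mode.
  def applyCase(raw: String, mode: CaseMode): String = mode match {
    case Keep  => raw
    case Lower => raw.toLowerCase(Locale.ROOT)
    case Upper => raw.toUpperCase(Locale.ROOT)
  }
}
```

A caller then states the direction of normalization directly, for example `CaseMode.applyCase(domain, CaseMode.Lower)`.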

sanmitra · 2019-12-03T00:06:37Z

@gerashegalov For now I am closing this PR, since I am going to turn off the email cleaning directly in AutoML and leave TMOG as it is. In the future we can change TMOG to provide more granular control over how text is cleaned, if enough use cases require it.

sanmitra added 3 commits October 2, 2019 14:51

[TDD] Adding red tests

dd22290

Setting trackNulls=false in a red test

03a0bf8

Do not clean email domains

9da210b

sanmitra requested review from tovbinm, gerashegalov and leahmcguire October 17, 2019 18:55

sanmitra requested review from Jauntbox and wsuchy as code owners October 17, 2019 18:55

Merge branch 'master' into san/email-clean-2

33c3587

salesforce-cla bot added the cla:signed label Oct 17, 2019

tovbinm reviewed Oct 17, 2019

tovbinm and others added 2 commits October 19, 2019 10:19

Merge branch 'master' into san/email-clean-2

b8965f6

Merge branch 'master' into san/email-clean-2

9136753

gerashegalov reviewed Oct 24, 2019

sanmitra closed this Dec 3, 2019

sanmitra added the DO NOT MERGE label Dec 3, 2019

tovbinm deleted the san/email-clean-2 branch June 12, 2020 01:34
