Detect and remove IDs disguised in text features #415

TuanNguyen27 · 2019-10-08T20:55:02Z

Related issues

We currently hash all text features whose cardinality is too high for pivoting. However, a text feature could contain all unique values (ID-like text, e.g. SSN, zip code), in which case hashing gives the model no useful signal. We would like Raw Feature Filter to exclude these features from feature engineering and model training.

Describe the proposed solution

We keep track of all the distinct lengths of the tokenized text. If a text feature has too few unique lengths, it is unlikely to contain natural-language text.
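A minimal sketch of this heuristic, assuming "length" means the number of tokens in each value (the character length of individual tokens would be an equally plausible reading). The object name, the regex tokenizer, and the strict `<` comparison are illustrative assumptions, not the code in this PR; only the `minUniqueTokLen` name is taken from the diff.

```scala
object TokenLengthHeuristic {
  // Hypothetical tokenizer: lowercase, then split on runs of non-alphanumeric characters.
  def tokenize(s: String): Seq[String] =
    s.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toSeq

  // Number of distinct token counts observed across all values of one feature.
  def uniqueTokenCounts(values: Seq[String]): Int =
    values.map(v => tokenize(v).length).distinct.size

  // Assumption: a feature is ID-like when it shows fewer than
  // minUniqueTokLen distinct token counts.
  def looksLikeId(values: Seq[String], minUniqueTokLen: Int): Boolean =
    uniqueTokenCounts(values) < minUniqueTokLen
}
```

With this tokenizer every SSN-shaped value yields exactly three tokens, so the feature has a single distinct length, while free text usually varies from row to row.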

Describe alternatives you've considered

Cramér's V correlation between the distribution of hashed tokens and a uniform distribution.
Count of the k-th most frequent token in the text feature.
Proportion of tokens covered by the top-k most frequent tokens.
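The last alternative can be sketched as follows. This is an illustration under assumptions, not code from the PR: the object and method names are hypothetical, and the input is assumed to be the already-tokenized contents of one feature.

```scala
object TopKCoverage {
  // Fraction of all token occurrences accounted for by the k most frequent tokens.
  // Returns 0.0 for an empty feature.
  def coverage(tokens: Seq[String], k: Int): Double =
    if (tokens.isEmpty) 0.0
    else {
      // Occurrence count per distinct token.
      val counts = tokens.groupBy(identity).values.map(_.size).toSeq
      counts.sorted(Ordering[Int].reverse).take(k).sum.toDouble / tokens.size
    }
}
```

A feature made of unique IDs gives a coverage close to `k / tokens.size`, whereas natural language tends to concentrate mass in a few frequent tokens.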

Additional context


tovbinm · 2019-10-11T03:48:00Z

core/src/main/scala/com/salesforce/op/filters/FeatureDistribution.scala

@@ -249,26 +257,11 @@ object FeatureDistribution {
 nulls = nullCount,
 summaryInfo = summaryInfo,
 distribution = distribution,
- moments = moments,


What about backwards compatibility?

tovbinm · 2019-10-11T03:48:45Z

core/src/main/scala/com/salesforce/op/filters/RawFeatureFilterResults.scala

@@ -142,7 +142,8 @@ case class RawFeatureFilterMetrics
 scoringFillRate: Option[Double],
 jsDivergence: Option[Double],
 fillRateDiff: Option[Double],
- fillRatioDiff: Option[Double]
+ fillRatioDiff: Option[Double],
+ trainingCardSize: Option[Int]


let's name it fully, i.e. trainingCardinalitySize

tovbinm · 2019-10-11T03:48:57Z

core/src/main/scala/com/salesforce/op/stages/impl/feature/IdMapRemover.scala

@@ -0,0 +1,32 @@
+package com.salesforce.op.stages.impl.feature


License header is missing

tovbinm · 2019-10-11T03:49:05Z

core/src/main/scala/com/salesforce/op/stages/impl/feature/IdMapRemover.scala

+import com.salesforce.op.features.types.{TextMap}
+import com.salesforce.op.stages.base.unary.UnaryTransformer
+
+class IdMapRemover(


docs please

tovbinm · 2019-10-11T03:50:18Z

core/src/main/scala/com/salesforce/op/stages/impl/feature/IdRemover.scala

+import com.salesforce.op.features.types.Text
+import com.salesforce.op.stages.base.unary.UnaryTransformer
+
+class IdRemover(


docs please

tovbinm · 2019-10-11T03:51:09Z

core/src/main/scala/com/salesforce/op/stages/impl/feature/IdRemover.scala

+ minUniqueTokLen: Int,
+ uid: String = UID[IdRemover],
+ operationName: String = "IDremover"
+) extends UnaryTransformer[Text, Text] (operationName = operationName, uid = uid) {


no need to expose operationName on the IdRemover ctor args. instead do:
extends UnaryTransformer[Text, Text] (operationName = "IdRemover", uid = uid)
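
A self-contained sketch of the suggested shape. `UnaryTransformerSketch` is a hypothetical stand-in for the library's `UnaryTransformer`, `String` stands in for `Text`, the UUID default stands in for `UID[IdRemover]`, and the `identity` transform is a placeholder; none of this is the real API.

```scala
// Hypothetical stand-in for the library base class; not the real UnaryTransformer.
abstract class UnaryTransformerSketch[I, O](val operationName: String, val uid: String) {
  def transformFn: I => O
}

// operationName is fixed in the extends clause instead of being a constructor argument.
class IdRemoverSketch(
  val minUniqueTokLen: Int,
  uid: String = java.util.UUID.randomUUID().toString
) extends UnaryTransformerSketch[String, String](operationName = "IdRemover", uid = uid) {
  // Placeholder transform; the real stage would apply the ID-detection logic here.
  def transformFn: String => String = identity
}
```

Callers then construct the stage with `new IdRemoverSketch(minUniqueTokLen = 2)` and cannot override the operation name.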

tovbinm · 2019-10-11T03:52:37Z

core/src/test/scala/com/salesforce/op/filters/FeatureDistributionTest.scala

@@ -99,7 +95,7 @@ class FeatureDistributionTest extends FlatSpec with PassengerSparkFixtureTest wi
 distribs(0).distribution.length shouldBe 100
 distribs(0).distribution.sum shouldBe 10000
 distribs.foreach(d => d.featureKey shouldBe d.name -> d.key)
- distribs(0).moments.get.count shouldBe 10000
+// distribs(0).moments.get.count shouldBe 10000


remove the line

tovbinm · 2019-10-11T03:53:12Z

core/src/test/scala/com/salesforce/op/stages/impl/feature/IdRemoverTest.scala

+
+@RunWith(classOf[JUnitRunner])
+class IdRemoverTest extends OpTransformerSpec[Text, IdRemover] {
+ val sample = Seq(Text("ball"), Text("stray"), Text("happy"),


please add newlines between the vals for readability

tovbinm · 2019-10-11T03:53:20Z

core/src/test/scala/com/salesforce/op/stages/impl/feature/IdRemoverTest.scala

@@ -0,0 +1,17 @@
+package com.salesforce.op.stages.impl.feature


header is missing

tovbinm · 2019-10-11T03:54:41Z

helloworld/src/main/scala/com/salesforce/hw/OpTitanicSimple.scala

@@ -56,30 +56,32 @@ import org.apache.spark.sql.SparkSession
 * @param cabin cabin id string
 * @param embarked location where passenger embarked
 */
-case class Passenger
+case class IDTextClassification


did this get here by mistake?

tovbinm

see comments

leahmcguire

No

TuanNguyen27 added 27 commits September 25, 2019 14:12

starter code

681d98a

fix weird compilation error

ef7cdfb

fix some tests

6a57693

fix more errors resulting from removing moments calculation

bb858c7

Update ModelInsightsTest.scala

cae7fe4

Update FeatureDistributionTest.scala

aa55539

add new rules to remove raw feature based on topK & starter code on testing

7950924

fix scala style

8f3befe

more code

9ed2a02

fix more style error

90350d0

adding isID as an exclusion criteria

8640d6d

fix scala style

91285b6

bunch of broken tests

61d26b1

move IdDetect app to hw

a7a0781

try modify titanic instead

6c887f3

add app

7829dfd

switch to a different metric

b611289

remove extra calculations

a30eca6

remove more stuff

c0ceaa6

fix naming issue

b3930dc

Update IdDetectTest.scala

33afe00

Update FeatureDistributionTest.scala

b7f050b

finishing up RFF

d79456e

update default so that tests will pass

b99b395

Update OpWorkflow.scala

ac8757e

Update OpWorkflow.scala

2a4ccbc

Update OpTitanicSimple.scala

febdc13

TuanNguyen27 requested review from gerashegalov, Jauntbox and leahmcguire as code owners October 8, 2019 20:55

TuanNguyen27 added 16 commits October 8, 2019 14:24

Merge branch 'master' into ID_detect

93267ab

new transformer wip

d73f3c5

Merge branch 'ID_detect' of https://github.com/salesforce/TransmogrifAI into ID_detect

7c1f262

Update FeatureDistributionTest.scala

88b5867

added transformer for map

a133392

Update FeatureDistribution.scala

095a180

Merge branch 'master' into ID_detect

7031d16

more updates

1e767c1

more

72ce224

fix unecessary changes

df5562e

more updates

c3cc3b0

Delete IdDetectTest.scala

b022921

more fix

9bad875

Update FeatureDistribution.scala

9f7bc99

fix unit tests

126ddcd

Update SmartTextVectorizerTest.scala

7753280

tovbinm reviewed Oct 11, 2019


tovbinm requested changes Oct 11, 2019


leahmcguire requested changes Oct 11, 2019


TuanNguyen27 closed this Oct 19, 2019

tovbinm deleted the ID_detect branch October 19, 2019 18:04
