Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect and remove IDs disguised in text features #415

Closed
wants to merge 44 commits into from

Conversation

TuanNguyen27
Copy link
Collaborator

Related issues

We currently hash all text features whose cardinality is too high for pivoting. However, a text feature could contain all unique values (ID-like text, e.g SSN, zip code), which won't provide useful signal to the modeling via hashing. We would like Raw Feature Filter to exclude these features from feature engineering and model training.

Describe the proposed solution

We keep track of all the different lengths of the tokenized text. If a text feature has too few unique lengths, it's likely to not contain natural language text.

Describe alternatives you've considered

Cramer's V correlation between the distribution of hashed token, and a uniform distribution.
Count of k-th most frequent token in the text feature.
Portion of tokens that's covered by the top-k most frequent tokens.

Additional context

@@ -249,26 +257,11 @@ object FeatureDistribution {
nulls = nullCount,
summaryInfo = summaryInfo,
distribution = distribution,
moments = moments,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about backwards compatibility?

@@ -142,7 +142,8 @@ case class RawFeatureFilterMetrics
scoringFillRate: Option[Double],
jsDivergence: Option[Double],
fillRateDiff: Option[Double],
fillRatioDiff: Option[Double]
fillRatioDiff: Option[Double],
trainingCardSize: Option[Int]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's name it fully, i.e trainingCardinalitySize

@@ -0,0 +1,32 @@
package com.salesforce.op.stages.impl.feature
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

License header is missing

import com.salesforce.op.features.types.{TextMap}
import com.salesforce.op.stages.base.unary.UnaryTransformer

class IdMapRemover(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs please

import com.salesforce.op.features.types.Text
import com.salesforce.op.stages.base.unary.UnaryTransformer

class IdRemover(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs please

minUniqueTokLen: Int,
uid: String = UID[IdRemover],
operationName: String = "IDremover"
) extends UnaryTransformer[Text, Text] (operationName = operationName, uid = uid) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to expose operationName on the IdRemover ctor args. instead do:
extends UnaryTransformer[Text, Text] (operationName = "IdRemover", uid = uid)

@@ -99,7 +95,7 @@ class FeatureDistributionTest extends FlatSpec with PassengerSparkFixtureTest wi
distribs(0).distribution.length shouldBe 100
distribs(0).distribution.sum shouldBe 10000
distribs.foreach(d => d.featureKey shouldBe d.name -> d.key)
distribs(0).moments.get.count shouldBe 10000
// distribs(0).moments.get.count shouldBe 10000
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the line


@RunWith(classOf[JUnitRunner])
class IdRemoverTest extends OpTransformerSpec[Text, IdRemover] {
val sample = Seq(Text("ball"), Text("stray"), Text("happy"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add newlines between the vals for readability

@@ -0,0 +1,17 @@
package com.salesforce.op.stages.impl.feature
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

header is missing

@@ -56,30 +56,32 @@ import org.apache.spark.sql.SparkSession
* @param cabin cabin id string
* @param embarked location where passenger embarked
*/
case class Passenger
case class IDTextClassification
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did this get here by mistake?

Copy link
Collaborator

@tovbinm tovbinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comments

Copy link
Collaborator

@leahmcguire leahmcguire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No

@tovbinm tovbinm deleted the ID_detect branch October 19, 2019 18:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants