-
Notifications
You must be signed in to change notification settings - Fork 394
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Detect and remove IDs disguised in text features #415
Conversation
@@ -249,26 +257,11 @@ object FeatureDistribution { | |||
nulls = nullCount, | |||
summaryInfo = summaryInfo, | |||
distribution = distribution, | |||
moments = moments, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about backwards compatibility?
@@ -142,7 +142,8 @@ case class RawFeatureFilterMetrics | |||
scoringFillRate: Option[Double], | |||
jsDivergence: Option[Double], | |||
fillRateDiff: Option[Double], | |||
fillRatioDiff: Option[Double] | |||
fillRatioDiff: Option[Double], | |||
trainingCardSize: Option[Int] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's name it fully, i.e trainingCardinalitySize
@@ -0,0 +1,32 @@ | |||
package com.salesforce.op.stages.impl.feature |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
License header is missing
import com.salesforce.op.features.types.{TextMap} | ||
import com.salesforce.op.stages.base.unary.UnaryTransformer | ||
|
||
class IdMapRemover( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs please
import com.salesforce.op.features.types.Text | ||
import com.salesforce.op.stages.base.unary.UnaryTransformer | ||
|
||
class IdRemover( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs please
minUniqueTokLen: Int, | ||
uid: String = UID[IdRemover], | ||
operationName: String = "IDremover" | ||
) extends UnaryTransformer[Text, Text] (operationName = operationName, uid = uid) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to expose operationName on the IdRemover
ctor args. instead do:
extends UnaryTransformer[Text, Text] (operationName = "IdRemover", uid = uid)
@@ -99,7 +95,7 @@ class FeatureDistributionTest extends FlatSpec with PassengerSparkFixtureTest wi | |||
distribs(0).distribution.length shouldBe 100 | |||
distribs(0).distribution.sum shouldBe 10000 | |||
distribs.foreach(d => d.featureKey shouldBe d.name -> d.key) | |||
distribs(0).moments.get.count shouldBe 10000 | |||
// distribs(0).moments.get.count shouldBe 10000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove the line
|
||
@RunWith(classOf[JUnitRunner]) | ||
class IdRemoverTest extends OpTransformerSpec[Text, IdRemover] { | ||
val sample = Seq(Text("ball"), Text("stray"), Text("happy"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add newlines between the vals for readability
@@ -0,0 +1,17 @@ | |||
package com.salesforce.op.stages.impl.feature |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
header is missing
@@ -56,30 +56,32 @@ import org.apache.spark.sql.SparkSession | |||
* @param cabin cabin id string | |||
* @param embarked location where passenger embarked | |||
*/ | |||
case class Passenger | |||
case class IDTextClassification |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did this get here by mistake?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No
Related issues
We currently hash all text features whose cardinality is too high for pivoting. However, a text feature could contain all unique values (ID-like text, e.g SSN, zip code), which won't provide useful signal to the modeling via hashing. We would like Raw Feature Filter to exclude these features from feature engineering and model training.
Describe the proposed solution
We keep track of all the different lengths of the tokenized text. If a text feature has too few unique lengths, it's likely to not contain natural language text.
Describe alternatives you've considered
Cramer's V correlation between the distribution of hashed token, and a uniform distribution.
Count of k-th most frequent token in the text feature.
Portion of tokens that's covered by the top-k most frequent tokens.
Additional context