Detect and remove IDs disguised in text features #415

Closed
wants to merge 44 commits into from
681d98a
starter code
TuanNguyen27 Sep 25, 2019
ef7cdfb
fix weird compilation error
TuanNguyen27 Sep 25, 2019
6a57693
fix some tests
TuanNguyen27 Sep 25, 2019
bb858c7
fix more errors resulting from removing moments calculation
TuanNguyen27 Sep 25, 2019
cae7fe4
Update ModelInsightsTest.scala
TuanNguyen27 Sep 25, 2019
aa55539
Update FeatureDistributionTest.scala
TuanNguyen27 Sep 25, 2019
7950924
add new rules to remove raw feature based on topK & starter code on t…
TuanNguyen27 Sep 27, 2019
8f3befe
fix scala style
TuanNguyen27 Sep 27, 2019
9ed2a02
more code
TuanNguyen27 Sep 27, 2019
90350d0
fix more style error
TuanNguyen27 Sep 28, 2019
8640d6d
adding isID as an exclusion criteria
TuanNguyen27 Sep 30, 2019
91285b6
fix scala style
TuanNguyen27 Oct 1, 2019
61d26b1
bunch of broken tests
TuanNguyen27 Oct 1, 2019
a7a0781
move IdDetect app to hw
TuanNguyen27 Oct 1, 2019
6c887f3
try modify titanic instead
TuanNguyen27 Oct 1, 2019
7829dfd
add app
TuanNguyen27 Oct 1, 2019
b611289
switch to a different metric
TuanNguyen27 Oct 7, 2019
a30eca6
remove extra calculations
TuanNguyen27 Oct 7, 2019
c0ceaa6
remove more stuff
TuanNguyen27 Oct 7, 2019
b3930dc
fix naming issue
TuanNguyen27 Oct 8, 2019
33afe00
Update IdDetectTest.scala
TuanNguyen27 Oct 8, 2019
b7f050b
Update FeatureDistributionTest.scala
TuanNguyen27 Oct 8, 2019
d79456e
finishing up RFF
TuanNguyen27 Oct 8, 2019
b99b395
update default so that tests will pass
TuanNguyen27 Oct 8, 2019
ac8757e
Update OpWorkflow.scala
TuanNguyen27 Oct 8, 2019
2a4ccbc
Update OpWorkflow.scala
TuanNguyen27 Oct 8, 2019
febdc13
Update OpTitanicSimple.scala
TuanNguyen27 Oct 8, 2019
0c016b8
Update RawFeatureFilter.scala
TuanNguyen27 Oct 8, 2019
93267ab
Merge branch 'master' into ID_detect
TuanNguyen27 Oct 8, 2019
d73f3c5
new transformer wip
TuanNguyen27 Oct 10, 2019
7c1f262
Merge branch 'ID_detect' of https://github.com/salesforce/Transmogrif…
TuanNguyen27 Oct 10, 2019
88b5867
Update FeatureDistributionTest.scala
TuanNguyen27 Oct 10, 2019
a133392
added transformer for map
TuanNguyen27 Oct 10, 2019
095a180
Update FeatureDistribution.scala
TuanNguyen27 Oct 10, 2019
7031d16
Merge branch 'master' into ID_detect
TuanNguyen27 Oct 10, 2019
1e767c1
more updates
TuanNguyen27 Oct 10, 2019
72ce224
more
TuanNguyen27 Oct 10, 2019
df5562e
fix unecessary changes
TuanNguyen27 Oct 10, 2019
c3cc3b0
more updates
TuanNguyen27 Oct 10, 2019
b022921
Delete IdDetectTest.scala
TuanNguyen27 Oct 10, 2019
9bad875
more fix
TuanNguyen27 Oct 10, 2019
9f7bc99
Update FeatureDistribution.scala
TuanNguyen27 Oct 10, 2019
126ddcd
fix unit tests
TuanNguyen27 Oct 10, 2019
7753280
Update SmartTextVectorizerTest.scala
TuanNguyen27 Oct 11, 2019
fix more errors resulting from removing moments calculation
TuanNguyen27 committed Sep 25, 2019
commit bb858c7b1ee7b2585b6cbe05664663ca2b3397e9
17 changes: 4 additions & 13 deletions core/src/test/scala/com/salesforce/op/ModelInsightsTest.scala
@@ -167,15 +167,14 @@ class ModelInsightsTest extends FlatSpec with PassengerSparkFixtureTest with Dou
   }

   def getFeatureMomentsAndCard(inputModel: FeatureLike[Prediction],
-    DF: DataFrame): (Map[String, Moments], Map[String, TextStats]) = {
+    DF: DataFrame): Map[String, Moments] = {
     lazy val workFlow = new OpWorkflow().setResultFeatures(inputModel).setInputDataset(DF)
     lazy val dummyReader = workFlow.getReader()
     lazy val workFlowRFF = workFlow.withRawFeatureFilter(Some(dummyReader), None)
     lazy val model = workFlowRFF.train()
     val insights = model.modelInsights(inputModel)
-    val featureMoments = insights.features.map(f => f.featureName -> f.distributions.head.moments.get).toMap
     val featureCardinality = insights.features.map(f => f.featureName -> f.distributions.head.cardEstimate.get).toMap
-    return (featureMoments, featureCardinality)
+    return featureCardinality
   }

   val params = new OpParams()
@@ -777,23 +776,15 @@ class ModelInsightsTest extends FlatSpec with PassengerSparkFixtureTest with Dou
     absError2 should be < tol * smallCoeffSum / 2
   }

-  it should "correctly return moments calculation and cardinality calculation for numeric features" in {
+  it should "correctly return cardinality calculation for numeric features" in {

     import spark.implicits._
     val df = linRegDF._3
     val meanTol = 0.01
     val varTol = 0.01
-    val (moments, cardinality) = getFeatureMomentsAndCard(standardizedLinpred, linRegDF._3)
+    val cardinality = getFeatureMomentsAndCard(standardizedLinpred, linRegDF._3)

     // Go through each feature and check that the mean, variance, and unique counts match the data
-    moments.foreach { case (featureName, value) => {
-      value.count shouldBe 1000
-      val (expectedMean, expectedVariance) =
-        df.select(avg(featureName), variance(featureName)).as[(Double, Double)].collect().head
-      math.abs((value.mean - expectedMean) / expectedMean) < meanTol shouldBe true
-      math.abs((value.variance - expectedVariance) / expectedVariance) < varTol shouldBe true
-    }
-    }

     cardinality.foreach { case (featureName, value) => {
       val actualUniques = df.select(featureName).as[Double].collect().toSet
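The test above compares each feature's cardinality estimate with the unique values actually present in the data; the rest of its body is collapsed on this page. As a rough, library-free sketch of that kind of check, the following assumes the estimate is a map from each distinct value to its count. The names `CardinalityCheckSketch`, `countValues`, and `cardinalityMatches` are hypothetical and do not come from the PR, and no Spark or TransmogrifAI types are used.

```scala
object CardinalityCheckSketch {
  // Hypothetical stand-in for a per-feature cardinality estimate:
  // each distinct value mapped to the number of rows that carry it.
  def countValues(values: Seq[Double]): Map[Double, Int] =
    values.groupBy(identity).map { case (v, vs) => v -> vs.size }

  // The estimate agrees with the data when its keys are exactly the
  // distinct values in the column and its counts add up to the row count.
  def cardinalityMatches(estimate: Map[Double, Int], values: Seq[Double]): Boolean =
    estimate.keySet == values.toSet && estimate.values.sum == values.size

  def main(args: Array[String]): Unit = {
    val column = Seq(1.0, 2.0, 2.0, 3.0, 3.0, 3.0)
    // An estimate built from the column itself should always match it.
    println(cardinalityMatches(countValues(column), column))
    // An estimate that misses a value should not.
    println(cardinalityMatches(Map(1.0 -> 1, 2.0 -> 2), column))
  }
}
```

Comparing key sets rather than only sizes catches an estimate that has the right number of entries but the wrong values.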
@@ -202,7 +202,7 @@ class FeatureDistributionTest extends FlatSpec with PassengerSparkFixtureTest wi
   it should "marshall to/from json" in {
     val fd1 = FeatureDistribution("A", None, 10, 1, Array(1, 4, 0, 0, 6), Array.empty)
     val fd2 = FeatureDistribution("A", None, 10, 1, Array(1, 4, 0, 0, 6),
-      Array.empty, Some(TextStats(Map("foo" -> 1, "bar" ->2))),
+      Array.empty, Some("String"), Some(TextStats(Map("foo" -> 1, "bar" ->2))),
       FeatureDistributionType.Scoring)
     val json = FeatureDistribution.toJson(Array(fd1, fd2))
     FeatureDistribution.fromJson(json) match {
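The `TextStats(Map("foo" -> 1, "bar" ->2))` argument in this hunk appears to hold value counts for a text feature. As a rough illustration of how such counts could be used to flag an ID disguised as text (the goal named in the PR title), here is a minimal sketch that treats a feature as ID-like when nearly every row carries a distinct value. The `looksLikeId` helper, its `minUniqueRatio` parameter, and the 0.95 default are assumptions made for illustration; the PR's actual rule is not shown in this commit.

```scala
object IdLikeSketch {
  // Hypothetical heuristic: a text feature is ID-like when the ratio of
  // distinct values to total rows is at least minUniqueRatio.
  def looksLikeId(valueCounts: Map[String, Int], minUniqueRatio: Double = 0.95): Boolean = {
    val total = valueCounts.values.map(_.toLong).sum
    // An empty feature gives no evidence either way, so it is not flagged.
    total > 0 && valueCounts.size.toDouble / total >= minUniqueRatio
  }

  def main(args: Array[String]): Unit = {
    // 100 rows, each with its own value: behaves like an identifier.
    val ids = (1 to 100).map(i => s"user-$i" -> 1).toMap
    // 100 rows spread over three values: an ordinary categorical feature.
    val categories = Map("red" -> 40, "green" -> 35, "blue" -> 25)
    println(looksLikeId(ids))
    println(looksLikeId(categories))
  }
}
```

A ratio threshold is only one possible criterion; the commit messages in this PR also mention a rule based on topK, which this sketch does not attempt to reproduce.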