Incorporate name detection into SmartTextVectorizer #508

Merged
merged 148 commits into from
Oct 6, 2020
2acf3fc
Re-added unary estimator code and started porting logic to Algebird m…
MWYang Dec 5, 2019
b55c31e
Re-added JRC name dictionary and cleaned up names of methods
MWYang Dec 6, 2019
f443952
Fixed bug with AveragedValue computation; Trying to debug current Alg…
MWYang Dec 6, 2019
557ef39
Fixed wrong inequality direction in guard checks
MWYang Dec 6, 2019
b6728ec
Added HLL back to monoid accumulator and code now compiles correctly;…
MWYang Dec 6, 2019
ff1b2ef
Fixed HLL in NameDetectStats not serializing correctly; Now need to f…
MWYang Dec 6, 2019
33b77c9
Fixed NameDetectStats printing
MWYang Dec 6, 2019
4caec75
Fixed guard stat calculation computing moments of number of tokens in…
MWYang Dec 6, 2019
b60dc4a
Fixed moments calculation and fixed divide by zero error when list of…
MWYang Dec 6, 2019
469111b
Added gender identification code transforming; All previous tests now…
MWYang Dec 7, 2019
e5f169e
Undid SparkUtils changes, which are no longer necessary
MWYang Dec 7, 2019
b701612
Renamed class names to be more consistent + small fixes
MWYang Dec 9, 2019
8342dae
Added honorific detection
MWYang Dec 9, 2019
2e0e85a
Implemented RegEx checking for gender
MWYang Dec 9, 2019
b079a27
Implemented mixed gender identification strategies
MWYang Dec 9, 2019
a1197a7
Removed TODOs and extraneous functions in preparation for PR
MWYang Dec 9, 2019
19bad0b
Updated documentation
MWYang Dec 9, 2019
3172bde
Ignore null values in detecting names
MWYang Dec 9, 2019
0d82eef
Added flag for ignoring nulls
MWYang Dec 9, 2019
7812d12
Added sir/madam to list of honorifics
MWYang Dec 9, 2019
345508f
Merge branch 'master' into my/unary-detect-names
MWYang Dec 9, 2019
a80e382
Fixed typo when adding sir/madam to list of honorifics that caused fa…
MWYang Dec 10, 2019
f7817d9
Fixed failing test due to divide by zero NA on some inputs
MWYang Dec 10, 2019
cf3eff0
Cleaned up redundant import in tests
MWYang Dec 10, 2019
eaaa23a
Added failing tests for STV
MWYang Dec 10, 2019
b928a49
Made small changes based on PR comments (updated inline comment and i…
MWYang Dec 10, 2019
048c084
Created metadata case class per PR review; Added tests for metadata; …
MWYang Dec 10, 2019
5d30e79
Added test for name threshold
MWYang Dec 10, 2019
78c321c
Updated comment about NameDetectStats.toJson
MWYang Dec 10, 2019
a597d21
Added tests for new NameStats feature type
MWYang Dec 10, 2019
12b3eae
Merge branch 'my/unary-detect-names' into my/stv-detect-names
MWYang Dec 10, 2019
e299677
Started porting over name detection code before wanting to try and si…
MWYang Dec 11, 2019
997a132
Added private declaration to methods in NameDetectFun trait
MWYang Dec 11, 2019
6b8a039
Abstracted out even more name detection logic into NameDetectUtils
MWYang Dec 11, 2019
1250559
Added default dictionaries to NameDetectUtils object (for lazy and pe…
MWYang Dec 11, 2019
2d804ee
Fixed tests sometimes failing because they were not using the same na…
MWYang Dec 11, 2019
f4ecea1
Merge branch 'my/unary-detect-names' into my/stv-detect-names
MWYang Dec 11, 2019
6c34935
Using new util abstractions for name detection and fixed encoder issue
MWYang Dec 11, 2019
9427cd3
Refactored STV changes to be cleaner
MWYang Dec 11, 2019
11f5bf3
Figured out Algebird-fu so that we can perform both reduce operations…
MWYang Dec 11, 2019
38bff0b
Using Algebird shortcuts again to reduce verbosity
MWYang Dec 11, 2019
e823ecb
Added custom enum for how to handle each column in SmartTextVectorizer
MWYang Dec 11, 2019
5f4f592
Updated NameDetectStats.toJson to be less verbose and use custom seri…
MWYang Dec 11, 2019
5336e71
Updated NameDetectStats.toJson to be less verbose and use custom seri…
MWYang Dec 11, 2019
9e9c149
Updated STV.partition name to be more meaningful
MWYang Dec 11, 2019
141f1af
Added back SensitiveFeatureInformation metadata files/changes
MWYang Dec 12, 2019
33f6809
Delete accidentally committed temporary test file
MWYang Dec 12, 2019
464fe52
Added shortcut for unary name detector
MWYang Dec 12, 2019
6cc90de
Merge branch 'my/unary-detect-names' into my/stv-detect-names
MWYang Dec 12, 2019
87c568d
Started to make Changes to SmartTextMapVectorizer but ran into proble…
MWYang Dec 12, 2019
bff4d1b
Delete accidentally committed temporary test file
MWYang Dec 12, 2019
639af3d
Removed type parameter from NameDetectFun because of later conflict w…
MWYang Dec 12, 2019
a8b4423
Merge branch 'my/unary-detect-names' into my/stv-detect-names
MWYang Dec 12, 2019
e7795dc
Added first failing test for SmartTextMapVectorizer
MWYang Dec 12, 2019
440068c
Removed Pythonic i.e. not Scala-ic index thing and added separate cas…
MWYang Dec 13, 2019
e6136f9
Removed extraneous case classes for dictionaries, per PR comment
MWYang Dec 13, 2019
283d76e
Small fixes (updated comments, re-ordered things) per PR comments
MWYang Dec 13, 2019
80d81ed
Removed usage of broadcast variables in transformer b/c it does not s…
MWYang Dec 13, 2019
e003eb2
Fixed serialization of GenderDetectStrategy, per PR recommendation to…
MWYang Dec 13, 2019
ee6c24b
Started merging my/unary-detect-names into my/stv-detect-names
MWYang Dec 13, 2019
2fc0c3d
Restored GenderDetectStrategy after merge
MWYang Dec 13, 2019
b51e14b
Fixed missing plus sign in OpPipelineStageReaderWriter causing double…
MWYang Dec 13, 2019
634b664
Merge branch 'my/unary-detect-names' into my/stv-detect-names
MWYang Dec 13, 2019
03431e9
Cleaned up utils file by moving all implicit definitions to NameDetec…
MWYang Dec 13, 2019
cde7551
Passed first test for SmartTextMapVectorizer
MWYang Dec 14, 2019
393275c
Synced changes from upstream feature branch for STV changes
MWYang Dec 14, 2019
ad9574c
Tidied up monoid definition for NameDetectStats after figuring out ho…
MWYang Dec 14, 2019
72a1d48
Merge branch 'my/unary-detect-names' into my/stv-detect-names
MWYang Dec 14, 2019
a8d91de
Added more passing tests for SmartTextMapVectorizer
MWYang Dec 16, 2019
0639af5
Started to add next test for excluding names from vector output
MWYang Dec 16, 2019
4f316e5
Updated tests based on my new correct understanding that Text.empty =…
MWYang Dec 17, 2019
714dcdc
Fixed failing test due to constructing to-be-compared estimators diff…
MWYang Dec 17, 2019
27a6ce6
Added first failing metadata test
MWYang Dec 18, 2019
9e7bfd2
Changed SensitiveFeatureInformation.Name to log gender detection stra…
MWYang Dec 18, 2019
5d7716b
Abstracted out ordering of gender detection strategies into utils file
MWYang Dec 18, 2019
bc9bc63
Merge branch 'my/unary-detect-names' into my/stv-detect-names
MWYang Dec 18, 2019
7c218e9
Added warning logging into SmartTextVectorizer
MWYang Dec 18, 2019
500d16e
Passed first metadata test for SmartTextVectorizer; Started to re-wor…
MWYang Dec 19, 2019
08af16b
Added first passing metadata test for STMapV
MWYang Dec 20, 2019
573820c
Fixed OPVectorMetadataTest
MWYang Dec 20, 2019
eb5ff9b
Small fixes to tests
MWYang Dec 20, 2019
eaad6f8
Merge branch 'master' into my/unary-detect-names
MWYang Dec 20, 2019
2bea311
Merge branch 'my/unary-detect-names' of https://github.com/MWYang/Tra…
MWYang Dec 20, 2019
77f91a2
Small fixes (better Scala code, more safe, better patterns) from Matt…
MWYang Dec 20, 2019
b6b385a
Improved gender detection strategy tests to check that the correct st…
MWYang Dec 20, 2019
8ed02bd
Broke out guard check numbers into their own params
MWYang Dec 21, 2019
e92101a
Merge branch 'my/unary-detect-names' into my/stv-detect-names
MWYang Dec 21, 2019
799eb58
Added operationName as an argument to HumanNameDetectorModel for easi…
MWYang Jan 6, 2020
ca69ed8
Merge branch 'master' into my/unary-detect-names
MWYang Jan 6, 2020
b82440a
Merge branch 'my/unary-detect-names' of https://github.com/MWYang/Tra…
MWYang Jan 6, 2020
d747c92
Revert to using container Text class for NameDetectUtils per PR comments
MWYang Jan 6, 2020
7d01d70
Added NameStats to FeatureBuilder
MWYang Jan 6, 2020
b4209c6
Added NameStats to a few more places
MWYang Jan 7, 2020
23d7a57
Added NameStats to TestFeatureBuilder and RandomMap
MWYang Jan 7, 2020
d86f2d6
Merge branch 'my/unary-detect-names' into my/stv-detect-names
MWYang Jan 7, 2020
eea3a3c
Reordered tests to avoid flooding output with test logs
MWYang Jan 8, 2020
b9522de
Merge branch 'master' into my/unary-detect-names
tovbinm Jan 8, 2020
e4e3ddd
Incorporated PR comments (using enumeratum for NameStats map keys/val…
MWYang Jan 8, 2020
f551543
Merge branch 'my/unary-detect-names' into my/stv-detect-names
MWYang Jan 8, 2020
a9d95a1
Passed most SmartTextVectorizer tests after merging changes
MWYang Jan 8, 2020
30476c8
Made all tests pass - Debuging wasn't being enabled due to non-intuit…
MWYang Jan 8, 2020
aa680b8
Fixed test to show that the output for SmartTextVectorizer is the sam…
MWYang Jan 8, 2020
a57aa29
Added all other metadata tests for SmartTextVectorizer
MWYang Jan 9, 2020
b9120a5
Removed some print statements
MWYang Jan 9, 2020
be2047d
Updated documentation
MWYang Jan 9, 2020
8b02dff
Incorporated PR comments (renamed GenderStrings to GenderValues and r…
MWYang Jan 9, 2020
8844ef5
Removed plural names from NameStats enums and factored out method in …
MWYang Jan 9, 2020
0cc10e0
More small fixes from PR comments
MWYang Jan 10, 2020
a0d97c7
Merge branch 'my/unary-detect-names' into my/stv-detect-names
MWYang Jan 10, 2020
3ad8ce1
Got previous tests working
Jauntbox Jan 10, 2020
b00775b
New test also working
Jauntbox Jan 10, 2020
03ae4d0
Remove debug output
Jauntbox Jan 10, 2020
9759747
Added test for TextList monoid
Jauntbox Jan 10, 2020
63ad17a
Added all tests for SmartTextMapVectorizer
MWYang Jan 13, 2020
89fa865
Removed print statements
MWYang Jan 13, 2020
10644d8
Added another test and fixed sneaky metadata issue
Jauntbox Jan 13, 2020
4fea007
Removed emptiness check
MWYang Jan 14, 2020
8f7b125
Merge branch 'my/unary-detect-names' into my/stv-detect-names
MWYang Jan 14, 2020
9317250
Merge branch 'master' of github.com:salesforce/TransmogrifAI into km/…
Jauntbox Jan 14, 2020
6424463
Merge branch 'master' into my/stv-detect-names
MWYang Jan 14, 2020
a107db1
Pulled out SensitiveFeatureInformation metadata changes into its own …
MWYang Jan 14, 2020
32f29ce
Merge branch 'my/sensitive-metadata' into my/stv-detect-names
MWYang Jan 14, 2020
acb873e
Fixed SensitiveFeatureInformation tests failing due to not changing t…
MWYang Jan 14, 2020
973aabb
Merge branch 'my/sensitive-metadata' into my/stv-detect-names
MWYang Jan 14, 2020
3ae735c
Addressing comments
Jauntbox Jan 14, 2020
bef2b22
Spelling
Jauntbox Jan 15, 2020
069031c
Fixed failing test by making default behavior of SmartTextVectorizer …
Jauntbox Jan 15, 2020
d8f7f21
Merge branch 'km/token-lens-map' into my/sensitive-metadata
MWYang Jan 21, 2020
e13c4d1
Merge branch 'my/sensitive-metadata' into my/stv-detect-names
MWYang Jan 21, 2020
fe57c89
Merge branch 'km/token-lens-map' into my/sensitive-metadata
MWYang Jan 21, 2020
458af4a
Merge branch 'my/sensitive-metadata' into my/stv-detect-names
MWYang Jan 21, 2020
40cb64b
Made all tests pass after merge (Ignore in STMapV didn't handle empty…
MWYang Jan 22, 2020
fde5c8d
Merge branch 'master' into my/sensitive-metadata
MWYang Jan 22, 2020
2572332
Merge branch 'my/sensitive-metadata' into my/stv-detect-names
MWYang Jan 22, 2020
a8504da
Removed enum from SensitiveFeatureInformation per PR comments
MWYang Jan 24, 2020
eac0e05
Using case class for GenderDetectionStrategy information
MWYang Jan 24, 2020
26f30fb
Cleaning up tests per PR comments
MWYang Jan 24, 2020
f66d896
Merge branch 'my/sensitive-metadata' into my/stv-detect-names
MWYang Jan 24, 2020
9a3f5af
Made fixes for metadata changes
MWYang Jan 24, 2020
bd7b90d
Merge branch 'master' into my/sensitive-metadata
MWYang Jan 24, 2020
5c37c26
Merge branch 'my/sensitive-metadata' into my/stv-detect-names
MWYang Jan 24, 2020
9344e5f
Merge branch 'master' into my/stv-detect-names
MWYang Jan 29, 2020
e2566b2
Got Michael's branch up to date with master
Jauntbox Sep 3, 2020
89e7e70
Cleanup
Jauntbox Sep 11, 2020
3f23e3e
Fixed merge conflicts
Jauntbox Sep 11, 2020
27a17a2
Remove comments
Jauntbox Sep 11, 2020
d358812
Merge branch 'master' into my/stv-detect-names
leahmcguire Sep 28, 2020
d1ec7f1
Merge branch 'master' into my/stv-detect-names
leahmcguire Oct 6, 2020
Started to make Changes to SmartTextMapVectorizer but ran into problem with utils and types
MWYang committed Dec 12, 2019
commit 87c568d3ed8800cf246f0ebc082bbe45b6675bbd
@@ -37,6 +37,7 @@ import com.salesforce.op.stages.impl.feature.VectorizerUtils._
import com.salesforce.op.utils.json.JsonLike
import com.salesforce.op.utils.spark.RichDataset._
import com.salesforce.op.utils.spark.{OpVectorColumnMetadata, OpVectorMetadata}
import com.salesforce.op.utils.stages.NameDetectFun
import com.twitter.algebird.Monoid._
import com.twitter.algebird.Operators._
import com.twitter.algebird.Monoid
@@ -63,7 +64,8 @@ class SmartTextMapVectorizer[T <: OPMap[String]]
with PivotParams with CleanTextFun with SaveOthersParams
with TrackNullsParam with MinSupportParam with TextTokenizerParams with TrackTextLenParam
with HashingVectorizerParams with MapHashingFun with OneHotFun with MapStringPivotHelper
- with MapVectorizerFuns[String, OPMap[String]] with MaxCardinalityParams {
+ with MapVectorizerFuns[String, OPMap[String]] with MaxCardinalityParams
+ with NameDetectFun[T] {

private implicit val textMapStatsSeqEnc: Encoder[Array[TextMapStats]] = ExpressionEncoder[Array[TextMapStats]]()

@@ -31,6 +31,7 @@
package com.salesforce.op.stages.impl.feature

import com.salesforce.op._
import com.salesforce.op.features.{Feature, FeatureLike}
import com.salesforce.op.stages.base.sequence.SequenceModel
import com.salesforce.op.test.{OpEstimatorSpec, TestFeatureBuilder, TestSparkContext}
import com.salesforce.op.utils.spark.{OpVectorColumnMetadata, OpVectorMetadata}
@@ -39,6 +40,8 @@ import org.apache.spark.ml.linalg.Vectors
import org.junit.runner.RunWith
import org.scalatest.junit.JUnitRunner
import com.salesforce.op.features.types._
import com.salesforce.op.testkit.RandomText
import com.salesforce.op.utils.stages.{NameDetectUtils, SensitiveFeatureMode}

@RunWith(classOf[JUnitRunner])
class SmartTextMapVectorizerTest
@@ -398,4 +401,112 @@ class SmartTextMapVectorizerTest
result.foreach { case (vec1, vec2) => vec1 shouldBe vec2 }
}

/* TESTS FOR DETECTING SENSITIVE FEATURES BEGIN */
lazy val (newInputData, features) = {
  val N = 5

  val baseText1 = Seq("hello world", "hello world", "good evening").toText ++ Seq(Text.empty, Text.empty)
  val baseText2 = Seq(
    "Hello world!", "What's up", "How are you doing, my friend?", "Not bad, my friend").toText :+ Text.empty
  val baseNames = Seq("Michael", "Michelle", "Roxanne", "Ross").toText :+ Text.empty

  val textMap1: Seq[TextMap] = (baseText1, baseText2, baseNames).zipped.map { case (a, b, c) =>
    TextMap(Map("text1" -> a.toString, "text2" -> b.toString, "names" -> c.toString))
  }
  val textMap2: Seq[TextMap] = Seq.fill[TextMap](N)(TextMap.empty)

  val textAreaMap1: Seq[TextAreaMap] = (baseText1, baseText2, baseNames).zipped.map { case (a, b, c) =>
    TextAreaMap(Map("text1" -> a.toString, "text2" -> b.toString, "names" -> c.toString))
  }
  val textAreaMap2: Seq[TextAreaMap] = Seq.fill[TextAreaMap](N)(TextAreaMap.empty)

  val allFeatures = Seq(
    baseText1, // f0
    baseText2, // f1
    baseNames, // f2
    textMap1, // f3
    textMap2, // f4
    textAreaMap1, // f5
    textAreaMap2 // f6
  )
  assert(allFeatures.forall(_.length == N))
  TestFeatureBuilder(allFeatures: _*)
}

val biasEstimator: SmartTextVectorizer[Text] = new SmartTextVectorizer()
  .setMaxCardinality(2).setNumFeatures(4).setMinSupport(1)
  .setTopK(2).setPrependFeatureName(false)
  .setHashSpaceStrategy(HashSpaceStrategy.Shared)
  .setSensitiveFeatureMode(SensitiveFeatureMode.DetectAndRemove)
  .setInput(features(0).asInstanceOf[Feature[Text]], features(1).asInstanceOf[Feature[Text]])

// The type parameter must be a map type (T <: OPMap[String]); the inputs below are TextMap features
val biasMapEstimator: SmartTextMapVectorizer[TextMap] = new SmartTextMapVectorizer()
  .setMaxCardinality(2).setNumFeatures(4).setMinSupport(1)
  .setTopK(2).setPrependFeatureName(false)
  .setHashSpaceStrategy(HashSpaceStrategy.Shared)
  .setSensitiveFeatureMode(SensitiveFeatureMode.DetectAndRemove)
  .setInput(features(3).asInstanceOf[Feature[TextMap]], features(4).asInstanceOf[Feature[TextMap]])

private lazy val NameDictionaryGroundTruth: RandomText[Text] = RandomText.textFromDomain(
  NameDetectUtils.DefaultNameDictionary.value.toList
)

// it should "detect a single name feature" in {
// val newEstimator: SmartTextVectorizer[Text] = biasEstimator.setInput(newF3)
// val model: SmartTextVectorizerModel[Text] = newEstimator
// .fit(newInputData)
// .asInstanceOf[SmartTextVectorizerModel[Text]]
// newInputData.show()
// model.args.whichAction shouldBe Array(Sensitive)
// }
//
// it should "detect a single name feature and return empty vectors" in {
// val newEstimator: SmartTextVectorizer[Text] = biasEstimator.setInput(newF3)
// newInputData.show()
//
// val smartVectorized = newEstimator.getOutput()
// val transformed = new OpWorkflow()
// .setResultFeatures(smartVectorized).transform(newInputData)
// val result = transformed.collect(smartVectorized)
// val (smart, expected) = result.map(smartVector => smartVector -> OPVector.empty).unzip
//
// smart shouldBe expected
// OpVectorMetadata("OutputVector", newEstimator.getMetadata()).size shouldBe 0
// }
//
// it should "detect a single name column among other non-name Text columns" in {
// val newEstimator: SmartTextVectorizer[Text] = biasEstimator.setInput(newF1, newF2, newF3)
// val model: SmartTextVectorizerModel[Text] = newEstimator
// .fit(newInputData)
// .asInstanceOf[SmartTextVectorizerModel[Text]]
// newInputData.show()
// model.args.whichAction shouldBe Array(Categorical, NonCategorical, Sensitive)
// }
//
// it should "not create information in the vector for a single name column among other non-name Text columns" in {
// newInputData.show()
//
// val newEstimator: SmartTextVectorizer[Text] = biasEstimator.setInput(newF1, newF2, newF3)
// val withNamesVectorized = newEstimator.getOutput()
//
// val oldEstimator: SmartTextVectorizer[Text] = new SmartTextVectorizer(uid = UID("newEstimator"))
// .setMaxCardinality(2).setNumFeatures(4).setMinSupport(1)
// .setTopK(2).setPrependFeatureName(false)
// .setHashSpaceStrategy(HashSpaceStrategy.Shared)
// .setSensitiveFeatureMode(SensitiveFeatureMode.DetectAndRemove)
// .setInput(newF1, newF2)
// val withoutNamesVectorized = oldEstimator.getOutput()
//
// val transformed = new OpWorkflow()
// .setResultFeatures(withNamesVectorized, withoutNamesVectorized).transform(newInputData)
// val result = transformed.collect(withNamesVectorized, withoutNamesVectorized)
//
// val (withNames, withoutNames) = result.unzip
//
// withNames shouldBe withoutNames
//
// OpVectorMetadata("OutputVector", newEstimator.getMetadata()).size shouldBe
// OpVectorMetadata("OutputVector", oldEstimator.getMetadata()).size
// }
/* TESTS FOR DETECTING SENSITIVE FEATURES END */
}