Detect and remove IDs disguised in text features #415

Closed
wants to merge 44 commits into from
44 commits
681d98a
starter code
TuanNguyen27 Sep 25, 2019
ef7cdfb
fix weird compilation error
TuanNguyen27 Sep 25, 2019
6a57693
fix some tests
TuanNguyen27 Sep 25, 2019
bb858c7
fix more errors resulting from removing moments calculation
TuanNguyen27 Sep 25, 2019
cae7fe4
Update ModelInsightsTest.scala
TuanNguyen27 Sep 25, 2019
aa55539
Update FeatureDistributionTest.scala
TuanNguyen27 Sep 25, 2019
7950924
add new rules to remove raw feature based on topK & starter code on t…
TuanNguyen27 Sep 27, 2019
8f3befe
fix scala style
TuanNguyen27 Sep 27, 2019
9ed2a02
more code
TuanNguyen27 Sep 27, 2019
90350d0
fix more style error
TuanNguyen27 Sep 28, 2019
8640d6d
adding isID as an exclusion criteria
TuanNguyen27 Sep 30, 2019
91285b6
fix scala style
TuanNguyen27 Oct 1, 2019
61d26b1
bunch of broken tests
TuanNguyen27 Oct 1, 2019
a7a0781
move IdDetect app to hw
TuanNguyen27 Oct 1, 2019
6c887f3
try modify titanic instead
TuanNguyen27 Oct 1, 2019
7829dfd
add app
TuanNguyen27 Oct 1, 2019
b611289
switch to a different metric
TuanNguyen27 Oct 7, 2019
a30eca6
remove extra calculations
TuanNguyen27 Oct 7, 2019
c0ceaa6
remove more stuff
TuanNguyen27 Oct 7, 2019
b3930dc
fix naming issue
TuanNguyen27 Oct 8, 2019
33afe00
Update IdDetectTest.scala
TuanNguyen27 Oct 8, 2019
b7f050b
Update FeatureDistributionTest.scala
TuanNguyen27 Oct 8, 2019
d79456e
finishing up RFF
TuanNguyen27 Oct 8, 2019
b99b395
update default so that tests will pass
TuanNguyen27 Oct 8, 2019
ac8757e
Update OpWorkflow.scala
TuanNguyen27 Oct 8, 2019
2a4ccbc
Update OpWorkflow.scala
TuanNguyen27 Oct 8, 2019
febdc13
Update OpTitanicSimple.scala
TuanNguyen27 Oct 8, 2019
0c016b8
Update RawFeatureFilter.scala
TuanNguyen27 Oct 8, 2019
93267ab
Merge branch 'master' into ID_detect
TuanNguyen27 Oct 8, 2019
d73f3c5
new transformer wip
TuanNguyen27 Oct 10, 2019
7c1f262
Merge branch 'ID_detect' of https://github.com/salesforce/Transmogrif…
TuanNguyen27 Oct 10, 2019
88b5867
Update FeatureDistributionTest.scala
TuanNguyen27 Oct 10, 2019
a133392
added transformer for map
TuanNguyen27 Oct 10, 2019
095a180
Update FeatureDistribution.scala
TuanNguyen27 Oct 10, 2019
7031d16
Merge branch 'master' into ID_detect
TuanNguyen27 Oct 10, 2019
1e767c1
more updates
TuanNguyen27 Oct 10, 2019
72ce224
more
TuanNguyen27 Oct 10, 2019
df5562e
fix unecessary changes
TuanNguyen27 Oct 10, 2019
c3cc3b0
more updates
TuanNguyen27 Oct 10, 2019
b022921
Delete IdDetectTest.scala
TuanNguyen27 Oct 10, 2019
9bad875
more fix
TuanNguyen27 Oct 10, 2019
9f7bc99
Update FeatureDistribution.scala
TuanNguyen27 Oct 10, 2019
126ddcd
fix unit tests
TuanNguyen27 Oct 10, 2019
7753280
Update SmartTextVectorizerTest.scala
TuanNguyen27 Oct 11, 2019
add app
TuanNguyen27 committed Oct 1, 2019
commit 7829dfda6934e2b698ec56ee913e82a2801fdd61
@@ -58,7 +58,7 @@ import org.apache.spark.sql.SparkSession
*/
case class IDTextClassification
Collaborator
did this get here by mistake?

(
-id: Int,
+id: Option[Int],
eng_sentences: Option[String],
eng_paragraphs: Option[String],
news_data: Option[String],
@@ -79,13 +79,19 @@ case class IDTextClassification
faker_ipv6: Option[String]
)

-object IdDetectTest {
+object OpTitanicSimple {

/**
* Run this from the command line with
* ./gradlew sparkSubmit -Dmain=com.salesforce.op.filters.IdDetectTest -Dargs=/full/path/to/csv/file
*/
-def main(): Unit = {
+def main(args: Array[String]): Unit = {
+if (args.isEmpty) {
+println("You need to pass in the CSV file path as an argument")
+sys.exit(1)
+}
+val csvFilePath = args(0)
+println(s"Using user-supplied CSV file path: $csvFilePath")

// Set up a SparkSession as normal
implicit val spark = SparkSession.builder.config(new SparkConf()).getOrCreate()
@@ -115,6 +121,7 @@ object IdDetectTest {
.extract(_.variable_nb_sentences.toText).asPredictor
val faker_ipv4 = FeatureBuilder.Text[IDTextClassification].extract(_.faker_ipv4.toText).asPredictor
val faker_ipv6 = FeatureBuilder.Text[IDTextClassification].extract(_.faker_ipv6.toText).asPredictor
+val id = FeatureBuilder.Integral[IDTextClassification].extract(_.id.toIntegral).asResponse

val IDFeatures = Seq(
eng_sentences, eng_paragraphs, news_data, tox_data,
@@ -126,11 +133,11 @@ object IdDetectTest {

def thresHoldRFF(mTK: Int): Seq[String] = {
val dataReader = DataReaders.Simple.csvCase[IDTextClassification](
-path = Option("~/Downloads/3kData.csv"),
+path = Option(csvFilePath),
key = _.id.toString)
val workflow = new OpWorkflow()
.withRawFeatureFilter(Some(dataReader), None, minTopk = mTK)
-.setResultFeatures(IDFeatures)
+.setResultFeatures(id, IDFeatures)
.setReader(dataReader)

// Fit the workflow to the data