
Detect and remove IDs disguised in text features #415

Closed
wants to merge 44 commits
Changes from 1 commit
44 commits
681d98a
starter code
TuanNguyen27 Sep 25, 2019
ef7cdfb
fix weird compilation error
TuanNguyen27 Sep 25, 2019
6a57693
fix some tests
TuanNguyen27 Sep 25, 2019
bb858c7
fix more errors resulting from removing moments calculation
TuanNguyen27 Sep 25, 2019
cae7fe4
Update ModelInsightsTest.scala
TuanNguyen27 Sep 25, 2019
aa55539
Update FeatureDistributionTest.scala
TuanNguyen27 Sep 25, 2019
7950924
add new rules to remove raw feature based on topK & starter code on t…
TuanNguyen27 Sep 27, 2019
8f3befe
fix scala style
TuanNguyen27 Sep 27, 2019
9ed2a02
more code
TuanNguyen27 Sep 27, 2019
90350d0
fix more style error
TuanNguyen27 Sep 28, 2019
8640d6d
adding isID as an exclusion criteria
TuanNguyen27 Sep 30, 2019
91285b6
fix scala style
TuanNguyen27 Oct 1, 2019
61d26b1
bunch of broken tests
TuanNguyen27 Oct 1, 2019
a7a0781
move IdDetect app to hw
TuanNguyen27 Oct 1, 2019
6c887f3
try modify titanic instead
TuanNguyen27 Oct 1, 2019
7829dfd
add app
TuanNguyen27 Oct 1, 2019
b611289
switch to a different metric
TuanNguyen27 Oct 7, 2019
a30eca6
remove extra calculations
TuanNguyen27 Oct 7, 2019
c0ceaa6
remove more stuff
TuanNguyen27 Oct 7, 2019
b3930dc
fix naming issue
TuanNguyen27 Oct 8, 2019
33afe00
Update IdDetectTest.scala
TuanNguyen27 Oct 8, 2019
b7f050b
Update FeatureDistributionTest.scala
TuanNguyen27 Oct 8, 2019
d79456e
finishing up RFF
TuanNguyen27 Oct 8, 2019
b99b395
update default so that tests will pass
TuanNguyen27 Oct 8, 2019
ac8757e
Update OpWorkflow.scala
TuanNguyen27 Oct 8, 2019
2a4ccbc
Update OpWorkflow.scala
TuanNguyen27 Oct 8, 2019
febdc13
Update OpTitanicSimple.scala
TuanNguyen27 Oct 8, 2019
0c016b8
Update RawFeatureFilter.scala
TuanNguyen27 Oct 8, 2019
93267ab
Merge branch 'master' into ID_detect
TuanNguyen27 Oct 8, 2019
d73f3c5
new transformer wip
TuanNguyen27 Oct 10, 2019
7c1f262
Merge branch 'ID_detect' of https://github.com/salesforce/Transmogrif…
TuanNguyen27 Oct 10, 2019
88b5867
Update FeatureDistributionTest.scala
TuanNguyen27 Oct 10, 2019
a133392
added transformer for map
TuanNguyen27 Oct 10, 2019
095a180
Update FeatureDistribution.scala
TuanNguyen27 Oct 10, 2019
7031d16
Merge branch 'master' into ID_detect
TuanNguyen27 Oct 10, 2019
1e767c1
more updates
TuanNguyen27 Oct 10, 2019
72ce224
more
TuanNguyen27 Oct 10, 2019
df5562e
fix unecessary changes
TuanNguyen27 Oct 10, 2019
c3cc3b0
more updates
TuanNguyen27 Oct 10, 2019
b022921
Delete IdDetectTest.scala
TuanNguyen27 Oct 10, 2019
9bad875
more fix
TuanNguyen27 Oct 10, 2019
9f7bc99
Update FeatureDistribution.scala
TuanNguyen27 Oct 10, 2019
126ddcd
fix unit tests
TuanNguyen27 Oct 10, 2019
7753280
Update SmartTextVectorizerTest.scala
TuanNguyen27 Oct 11, 2019
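Several commits above ("add new rules to remove raw feature based on topK", "adding isID as an exclusion criteria") point at the PR's core idea: a raw text feature whose values behave like unique identifiers carries no signal and should be excluded. As a hedged, self-contained sketch of that idea only (this is not the PR's RawFeatureFilter code; `TopKRule`, `topKCoverage`, and the sample thresholds are illustrative names), one topK-style test is: if even the K most frequent values cover only a small fraction of rows, the column looks like an ID.

```scala
// Illustrative only: a topK-coverage heuristic for spotting ID-like text
// columns. NOT the PR's implementation, just the general idea it builds on.
object TopKRule {
  /** Fraction of rows covered by the K most frequent values. */
  def topKCoverage(values: Seq[String], k: Int): Double = {
    // Count occurrences per distinct value, sort counts descending.
    val countsDesc = values.groupBy(identity).values.map(_.size).toSeq.sortBy(-_)
    countsDesc.take(k).sum.toDouble / values.size
  }

  def main(args: Array[String]): Unit = {
    val idCol = (1 to 1000).map(_.toString)            // every value unique
    val catCol = (1 to 1000).map(i => s"cat${i % 5}")  // only 5 distinct values
    println(topKCoverage(idCol, 100))  // 0.1 -> ID-like, candidate for removal
    println(topKCoverage(catCol, 100)) // 1.0 -> genuine categorical, keep
  }
}
```

A real filter would compare this coverage against a configurable threshold, which is what the `minTopk` parameter in the diff below appears to tune.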
try modify titanic instead
TuanNguyen27 committed Oct 1, 2019
commit 6c887f357a1340abe9f3ae3c8fc4ae3f1e9530cf
@@ -1,4 +1,4 @@
-package com.salesforce.hw
+package com.salesforce.op.filters

import com.salesforce.op._
import com.salesforce.op.features.FeatureBuilder
108 changes: 0 additions & 108 deletions helloworld/src/main/scala/com/salesforce/hw/IdDetectTest.scala

This file was deleted.

164 changes: 69 additions & 95 deletions helloworld/src/main/scala/com/salesforce/hw/OpTitanicSimple.scala
@@ -56,38 +56,36 @@ import org.apache.spark.sql.SparkSession
* @param cabin cabin id string
* @param embarked location where passenger embarked
*/
-case class Passenger
+case class IDTextClassification
Collaborator
did this get here by mistake?

(
id: Int,
-survived: Int,
-pClass: Option[Int],
-name: Option[String],
-sex: Option[String],
-age: Option[Double],
-sibSp: Option[Int],
-parCh: Option[Int],
-ticket: Option[String],
-fare: Option[Double],
-cabin: Option[String],
-embarked: Option[String]
+eng_sentences: Option[String],
+eng_paragraphs: Option[String],
+news_data: Option[String],
+tox_data: Option[String],
+movie_data: Option[String],
+movie_plot: Option[String],
+reddit_science: Option[String],
+fake_id_prefix: Option[String],
+alpha_numeric: Option[String],
+fake_id_sfdc: Option[String],
+fake_uuid: Option[String],
+fake_number_id: Option[String],
+faker_sentence: Option[String],
+variable_nb_words: Option[String],
+faker_paragraph: Option[String],
+variable_nb_sentences: Option[String],
+faker_ipv4: Option[String],
+faker_ipv6: Option[String]
)

/**
* A simplified TransmogrifAI example classification app using the Titanic dataset
*/
-object OpTitanicSimple {
+object IdDetectTest {

/**
* Run this from the command line with
-* ./gradlew sparkSubmit -Dmain=com.salesforce.hw.OpTitanicSimple -Dargs=/full/path/to/csv/file
+* ./gradlew sparkSubmit -Dmain=com.salesforce.op.filters.IdDetectTest -Dargs=/full/path/to/csv/file
*/
-def main(args: Array[String]): Unit = {
-if (args.isEmpty) {
-println("You need to pass in the CSV file path as an argument")
-sys.exit(1)
-}
-val csvFilePath = args(0)
-println(s"Using user-supplied CSV file path: $csvFilePath")
+def main(): Unit = {

// Set up a SparkSession as normal
implicit val spark = SparkSession.builder.config(new SparkConf()).getOrCreate()
@@ -98,80 +98,56 @@ object OpTitanicSimple {
/////////////////////////////////////////////////////////////////////////////////

// Define features using the OP types based on the data
-val survived = FeatureBuilder.RealNN[Passenger].extract(_.survived.toRealNN).asResponse
-val pClass = FeatureBuilder.PickList[Passenger].extract(_.pClass.map(_.toString).toPickList).asPredictor
-val name = FeatureBuilder.Text[Passenger].extract(_.name.toText).asPredictor
-val sex = FeatureBuilder.PickList[Passenger].extract(_.sex.map(_.toString).toPickList).asPredictor
-val age = FeatureBuilder.Real[Passenger].extract(_.age.toReal).asPredictor
-val sibSp = FeatureBuilder.Integral[Passenger].extract(_.sibSp.toIntegral).asPredictor
-val parCh = FeatureBuilder.Integral[Passenger].extract(_.parCh.toIntegral).asPredictor
-val ticket = FeatureBuilder.PickList[Passenger].extract(_.ticket.map(_.toString).toPickList).asPredictor
-val fare = FeatureBuilder.Real[Passenger].extract(_.fare.toReal).asPredictor
-val cabin = FeatureBuilder.PickList[Passenger].extract(_.cabin.map(_.toString).toPickList).asPredictor
-val embarked = FeatureBuilder.PickList[Passenger].extract(_.embarked.map(_.toString).toPickList).asPredictor
-
-////////////////////////////////////////////////////////////////////////////////
-// TRANSFORMED FEATURES
-/////////////////////////////////////////////////////////////////////////////////
-
-// Do some basic feature engineering using knowledge of the underlying dataset
-val familySize = sibSp + parCh + 1
-val estimatedCostOfTickets = familySize * fare
-val pivotedSex = sex.pivot()
-val normedAge = age.fillMissingWithMean().zNormalize()
-val ageGroup = age.map[PickList](_.value.map(v => if (v > 18) "adult" else "child").toPickList)
-
-// Define a feature of type vector containing all the predictors you'd like to use
-val passengerFeatures = Seq(
-pClass, name, age, sibSp, parCh, ticket,
-cabin, embarked, familySize, estimatedCostOfTickets,
-pivotedSex, ageGroup, normedAge
+val eng_sentences = FeatureBuilder.Text[IDTextClassification].extract(_.eng_sentences.toText).asPredictor
+val eng_paragraphs = FeatureBuilder.Text[IDTextClassification].extract(_.eng_paragraphs.toText).asPredictor
+val news_data = FeatureBuilder.Text[IDTextClassification].extract(_.news_data.toText).asPredictor
+val tox_data = FeatureBuilder.Text[IDTextClassification].extract(_.tox_data.toText).asPredictor
+val movie_data = FeatureBuilder.Text[IDTextClassification].extract(_.movie_data.toText).asPredictor
+val movie_plot = FeatureBuilder.Text[IDTextClassification].extract(_.movie_plot.toText).asPredictor
+val reddit_science = FeatureBuilder.Text[IDTextClassification].extract(_.reddit_science.toText).asPredictor
+val fake_id_prefix = FeatureBuilder.Text[IDTextClassification].extract(_.fake_id_prefix.toText).asPredictor
+val alpha_numeric = FeatureBuilder.Text[IDTextClassification].extract(_.alpha_numeric.toText).asPredictor
+val fake_id_sfdc = FeatureBuilder.Text[IDTextClassification].extract(_.fake_id_sfdc.toText).asPredictor
+val fake_uuid = FeatureBuilder.Text[IDTextClassification].extract(_.fake_uuid.toText).asPredictor
+val fake_number_id = FeatureBuilder.Text[IDTextClassification].extract(_.fake_number_id.toText).asPredictor
+val faker_sentence = FeatureBuilder.Text[IDTextClassification].extract(_.faker_sentence.toText).asPredictor
+val variable_nb_words = FeatureBuilder.Text[IDTextClassification].extract(_.variable_nb_words.toText).asPredictor
+val faker_paragraph = FeatureBuilder.Text[IDTextClassification].extract(_.faker_paragraph.toText).asPredictor
+val variable_nb_sentences = FeatureBuilder.Text[IDTextClassification]
+.extract(_.variable_nb_sentences.toText).asPredictor
+val faker_ipv4 = FeatureBuilder.Text[IDTextClassification].extract(_.faker_ipv4.toText).asPredictor
+val faker_ipv6 = FeatureBuilder.Text[IDTextClassification].extract(_.faker_ipv6.toText).asPredictor
+
+val IDFeatures = Seq(
+eng_sentences, eng_paragraphs, news_data, tox_data,
+movie_data, movie_plot, reddit_science, fake_id_prefix,
+alpha_numeric, fake_id_sfdc, fake_uuid, fake_number_id,
+faker_sentence, variable_nb_words, faker_paragraph,
+variable_nb_sentences, faker_ipv4, faker_ipv6
).transmogrify()

-// Optionally check the features with a sanity checker
-val checkedFeatures = survived.sanityCheck(passengerFeatures, removeBadFeatures = true)
-
-// Define the model we want to use (here a simple logistic regression) and get the resulting output
-val prediction = BinaryClassificationModelSelector.withTrainValidationSplit(
-modelTypesToUse = Seq(OpLogisticRegression)
-).setInput(survived, checkedFeatures).getOutput()
-
-val evaluator = Evaluators.BinaryClassification().setLabelCol(survived).setPredictionCol(prediction)
-
-////////////////////////////////////////////////////////////////////////////////
-// WORKFLOW
-/////////////////////////////////////////////////////////////////////////////////
-
-// Define a way to read data into our Passenger class from our CSV file
-val dataReader = DataReaders.Simple.csvCase[Passenger](path = Option(csvFilePath), key = _.id.toString)
-
-// Define a new workflow and attach our data reader
-val workflow = new OpWorkflow().setResultFeatures(survived, prediction).setReader(dataReader)
-
-// Fit the workflow to the data
-val model = workflow.train()
-println(s"Model summary:\n${model.summaryPretty()}")
-
-// Extract information (i.e. feature importance) via model insights
-val modelInsights = model.modelInsights(prediction)
-val modelFeatures = modelInsights.features.flatMap( feature => feature.derivedFeatures)
-val featureContributions = modelFeatures.map( feature => (feature.derivedFeatureName,
-feature.contribution.map( contribution => math.abs(contribution))
-.foldLeft(0.0) { (max, contribution) => math.max(max, contribution)}))
-val sortedContributions = featureContributions.sortBy( contribution => -contribution._2)
-
-val topNum = math.min(20, sortedContributions.size)
-println(s"Top $topNum feature contributions:")
-sortedContributions.take(topNum).foreach( featureInfo => println(s"${featureInfo._1}: ${featureInfo._2}"))
-
-// Manifest the result features of the workflow
-println("Scoring the model")
-val (scores, metrics) = model.scoreAndEvaluate(evaluator = evaluator)
-
-println("Metrics:\n" + metrics)
-
+def thresHoldRFF(mTK: Int): Seq[String] = {
+val dataReader = DataReaders.Simple.csvCase[IDTextClassification](
+path = Option("~/Downloads/3kData.csv"),
+key = _.id.toString)
+val workflow = new OpWorkflow()
+.withRawFeatureFilter(Some(dataReader), None, minTopk = mTK)
+.setResultFeatures(IDFeatures)
+.setReader(dataReader)
+
+// Fit the workflow to the data
+val model = workflow.train()
+println(s"Model summary:\n${model.summaryPretty()}")
+
+// Extract information (i.e. feature importance) via model insights
+val modelInsights = model.modelInsights(IDFeatures)
+val exclusionReasons = modelInsights.features.flatMap( feature => feature.exclusionReasons)
+exclusionReasons.map(_.name)
+}
// Stop Spark gracefully
+println(thresHoldRFF(300))
+println(thresHoldRFF(500))
+println(thresHoldRFF(1000))
spark.stop()
}
}
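The `thresHoldRFF` helper above sweeps `minTopk` and prints which raw features RawFeatureFilter excluded at each threshold. A related, deliberately simplified heuristic (hypothetical, not part of this PR; `IdLikeSketch` and the 0.95 cutoff are assumptions for illustration) flags a column as ID-like when nearly all of its values are distinct:

```scala
// Hypothetical uniqueness-ratio check, shown only to illustrate the notion of
// "IDs disguised in text features"; the PR's actual logic lives elsewhere.
object IdLikeSketch {
  /** True when the share of distinct values meets or exceeds the cutoff. */
  def isIdLike(values: Seq[String], uniqueRatio: Double = 0.95): Boolean =
    values.nonEmpty && values.distinct.size.toDouble / values.size >= uniqueRatio

  def main(args: Array[String]): Unit = {
    val fakeIds = (1 to 100).map(i => f"005x$i%07d")  // ID-style strings
    val prose = Seq.fill(100)("the quick brown fox")  // repeated natural text
    println(isIdLike(fakeIds)) // true
    println(isIdLike(prose))   // false
  }
}
```

The tradeoff versus the topK approach is that a plain uniqueness ratio is cheaper to compute but more easily fooled by long free-text fields, where every row is also unique.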