Adjust bin values for text features in RFF #99

Merged
merged 27 commits into from
Sep 4, 2018
Changes from 1 commit
da5ba89
adjust bin value based on number of token
sxd929 Aug 28, 2018
1d90e8f
cleanup
sxd929 Aug 28, 2018
3b6c45f
address comments
sxd929 Aug 28, 2018
eea1cf2
merge
sxd929 Aug 28, 2018
b958887
add sum and count to summary
sxd929 Aug 29, 2018
a54131b
change formula
sxd929 Aug 29, 2018
dae829d
add todo
sxd929 Aug 29, 2018
d6ad74f
Merge branch 'master' into xs/adjustBinValueForText
tovbinm Aug 30, 2018
d9dee94
use bins as default
sxd929 Aug 30, 2018
1bb28c6
Merge branch 'xs/adjustBinValueForText' of https://github.com/salesfo…
sxd929 Aug 30, 2018
441ab06
Merge branch 'master' into xs/adjustBinValueForText
sxd929 Aug 30, 2018
546fcfc
Merge branch 'master' into xs/adjustBinValueForText
tovbinm Aug 30, 2018
13cb288
Merge branch 'master' into xs/adjustBinValueForText
tovbinm Aug 30, 2018
c47e9bc
address comments
sxd929 Aug 31, 2018
e8d616b
cleanup
sxd929 Aug 31, 2018
e133bb4
Merge branch 'xs/adjustBinValueForText' of https://github.com/salesfo…
sxd929 Aug 31, 2018
20a9c40
cleanup
sxd929 Aug 31, 2018
e39dad6
Merge branch 'master' into xs/adjustBinValueForText
sxd929 Aug 31, 2018
1c6d359
Merge branch 'master' into xs/adjustBinValueForText
tovbinm Aug 31, 2018
5e2857c
Added textBinsFormula to RFF
tovbinm Aug 31, 2018
a2ffd3a
make scalastyle happy
tovbinm Aug 31, 2018
3c35a13
Merge branch 'master' into xs/adjustBinValueForText
tovbinm Aug 31, 2018
2f4f77a
Merge branch 'master' into xs/adjustBinValueForText
tovbinm Sep 1, 2018
6187e55
Merge branch 'master' into xs/adjustBinValueForText
tovbinm Sep 2, 2018
7ff99fe
Merge branch 'master' into xs/adjustBinValueForText
tovbinm Sep 3, 2018
bdc0608
update docs
tovbinm Sep 4, 2018
dfc1a40
Added docs
tovbinm Sep 4, 2018
add sum and count to summary
sxd929 committed Aug 29, 2018
commit b958887e0159f13e43f6e34bec45da1cc125ed9c
@@ -166,7 +166,7 @@ private[op] object FeatureDistribution {
): FeatureDistribution = {
val (nullCount, (summaryInfo, distribution)): (Int, (Array[Double], Array[Double])) =
value.map(seq => 0 -> histValues(seq, summary, bins))
- .getOrElse(1 -> (Array(summary.min, summary.max) -> Array.fill(bins)(0.0)))
+ .getOrElse(1 -> (Array(summary.min, summary.max, summary.sum, summary.count) -> Array.fill(bins)(0.0)))

FeatureDistribution(
name = featureKey._1,
@@ -194,12 +194,12 @@ private[op] object FeatureDistribution {
case Left(seq) => {
val minBins = bins
val maxBins = MaxBins
- val numBins = math.min(math.max(bins, sum.max / AvgBinValue), maxBins).floor
+ val numBins = math.min(math.max(bins, sum.max / AvgBinValue), maxBins).intValue()

val hasher: HashingTF = new HashingTF(numFeatures = numBins)
Collaborator

Do we have to create the hasher every time, or can we create it once?

Contributor Author

The hashing dimension can differ between features; the minimum is bins.
We could share a single hasher with numFeatures = bins and use it whenever a feature does not have too many tokens, or we could create a few shared hashers at different scales and pick one based on the number of tokens.
What do you think? I was assuming that creating a hasher for every feature would not be very resource-intensive.

Collaborator

Let's see if we can reuse the hashing function without creating a HashingTF every time.

Collaborator

And I think we can. See HashingTF.transform and the HashingTF object.
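
One way to act on the suggestion above, sketched under assumptions: keep a single stateless hash function and pass the per-feature bin count as an argument, so nothing is constructed per feature. The object TokenHistogram and its methods are hypothetical names, and the hash used here is Scala's standard-library MurmurHash3 rather than Spark's HashingTF, so bucket indices are not expected to match what HashingTF produces.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical helper, not part of the library under review.
object TokenHistogram {

  // Non-negative modulo, so negative hash codes still map into [0, mod).
  private def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Counts tokens per hash bucket. numBins may differ for every call,
  // while the hash function itself is shared and stateless.
  def histogram(tokens: Seq[String], numBins: Int): Array[Double] = {
    require(numBins > 0, "numBins must be positive")
    val counts = Array.fill(numBins)(0.0)
    tokens.foreach { token =>
      counts(nonNegativeMod(MurmurHash3.stringHash(token), numBins)) += 1.0
    }
    counts
  }
}
```

Because the bin count is only an argument, features with few tokens can call histogram(tokens, bins) and features with many tokens can pass a larger value, without any shared mutable state.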

.setBinary(false)
.setHashAlgorithm(HashAlgorithm.MurMur3.toString.toLowerCase)
- Array(sum.min, sum.max) -> hasher.transform(seq).toArray
+ Array(sum.min, sum.max, sum.sum, sum.count) -> hasher.transform(seq).toArray
}
case Right(seq) => // TODO use kernel fit instead of histogram
if (sum == Summary.empty) {
@@ -218,7 +218,7 @@ private[op] object FeatureDistribution {
} else {
val same = seq.map(v => if (v == sum.max) 1.0 else 0.0).sum
val other = seq.map(v => if (v != sum.max) 1.0 else 0.0).sum
- Array(sum.min, sum.max) -> Array(same, other)
+ Array(sum.min, sum.max, sum.sum, sum.count) -> Array(same, other)
}
}
}
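
For illustration, the bin-count clamp used for text features in this file can be exercised in isolation. This is a minimal sketch: the diff references AvgBinValue and MaxBins without showing their values, so both constants below are assumptions. 5000 is chosen only because it is consistent with the test assertions removed elsewhere in this commit (a maximum of 1,000,000 tokens with bins = 100 giving 200 bins); 100000 is arbitrary. BinCount is a hypothetical name.

```scala
// Hypothetical stand-in for the clamp in FeatureDistribution.
object BinCount {
  // Assumed values; the real constants are not visible in this diff.
  val AvgBinValue: Double = 5000.0
  val MaxBins: Int = 100000

  // At least `bins`, at most MaxBins, otherwise maxTokens / AvgBinValue,
  // truncated to an Int because the hashing dimension must be integral.
  def numBins(bins: Int, maxTokens: Double): Int =
    math.min(math.max(bins.toDouble, maxTokens / AvgBinValue), MaxBins.toDouble).toInt
}
```

With these assumed constants, BinCount.numBins(100, 10000.0) stays at the floor of 100, while BinCount.numBins(100, 1000000.0) grows to 200.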
17 changes: 10 additions & 7 deletions core/src/main/scala/com/salesforce/op/filters/Summary.scala
@@ -35,18 +35,21 @@ import com.twitter.algebird.Monoid
/**
* Class used to get summaries of prepared features to determine distribution binning strategy
*
- * @param min minimum value seen
- * @param max maximum value seen
+ * @param min minimum value seen for double, minimum number of tokens in one text for text
+ * @param max maximum value seen for double, maximum number of tokens in one text for text
+ * @param sum sum of values for double, total number of tokens for text
+ * @param count number of doubles for double, number of texts for text
*/
- private[op] case class Summary(min: Double, max: Double)
+ private[op] case class Summary(min: Double, max: Double, sum: Double, count: Double)

private[op] case object Summary {

- val empty: Summary = Summary(Double.PositiveInfinity, Double.NegativeInfinity)
+ val empty: Summary = Summary(Double.PositiveInfinity, Double.NegativeInfinity, 0.0, 0.0)

implicit val monoid: Monoid[Summary] = new Monoid[Summary] {
override def zero = empty
- override def plus(l: Summary, r: Summary) = Summary(math.min(l.min, r.min), math.max(l.max, r.max))
+ override def plus(l: Summary, r: Summary) = Summary(math.min(l.min, r.min), math.max(l.max, r.max),
+ l.sum + r.sum, l.count + r.count)
}

/**
@@ -55,8 +58,8 @@ private[op] case object Summary {
*/
def apply(preppedFeature: ProcessedSeq): Summary = {
preppedFeature match {
- case Left(v) => Summary(v.size, v.size)
- case Right(v) => monoid.sum(v.map(d => Summary(d, d)))
+ case Left(v) => Summary(v.size, v.size, v.size, 1.0)
+ case Right(v) => monoid.sum(v.map(d => Summary(d, d, d, 1.0)))
}
}
}
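
The extended Summary can be tried without Algebird or the rest of the library. The sketch below is a stand-in under stated assumptions, not the project's code: it replaces the Monoid instance with a plain plus function and a foldLeft, and it assumes ProcessedSeq is Either[Seq[String], Seq[Double]], which is suggested by the Left and Right values used in the tests. SummarySketch and fromFeature are hypothetical names.

```scala
// Hypothetical, dependency-free stand-in for the Summary in this diff.
object SummarySketch {

  final case class Summary(min: Double, max: Double, sum: Double, count: Double)

  // Identity element: combining it with any summary returns that summary.
  val empty: Summary = Summary(Double.PositiveInfinity, Double.NegativeInfinity, 0.0, 0.0)

  // Same combination rule as the monoid's plus in the diff.
  def plus(l: Summary, r: Summary): Summary =
    Summary(math.min(l.min, r.min), math.max(l.max, r.max), l.sum + r.sum, l.count + r.count)

  // Assumption: text features arrive as Left(tokens), numeric ones as Right(values).
  def fromFeature(feature: Either[Seq[String], Seq[Double]]): Summary = feature match {
    case Left(tokens) =>
      // One text: min, max and sum are all its token count; count is one text.
      val n = tokens.size.toDouble
      Summary(n, n, n, 1.0)
    case Right(values) =>
      // Each double contributes one single-value summary.
      values.map(d => Summary(d, d, d, 1.0)).foldLeft(empty)(plus)
  }
}
```

For a text feature, sum / count over the merged summaries then gives the average number of tokens per text, which min and max alone could not provide.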
@@ -47,7 +47,8 @@ class FeatureDistributionTest extends FlatSpec with PassengerSparkFixtureTest wi
(true, Left(Seq.empty[String])), (false, Right(Seq(1.0, 3.0, 5.0)))
)
val summary =
- Array(Summary(0.0, 1.0), Summary(-1.6, 10.6), Summary(0.0, 3.0), Summary(0.0, 0.0), Summary(1.0, 5.0))
+ Array(Summary(0.0, 1.0, 6.0, 10), Summary(-1.6, 10.6, 3.0, 10),
+ Summary(0.0, 3.0, 7.0, 10), Summary(0.0, 0.0, 5.0, 10), Summary(1.0, 5.0, 10.0, 10))
val bins = 10

val featureKeys: Array[FeatureKey] = features.map(f => (f.name, None))
@@ -66,7 +67,7 @@ class FeatureDistributionTest extends FlatSpec with PassengerSparkFixtureTest wi
distribs(1).nulls shouldBe 1
distribs(1).distribution.sum shouldBe 0
distribs(2).distribution.sum shouldBe 2
- distribs(2).summaryInfo should contain theSameElementsAs Array(0.0, 3.0)
+ distribs(2).summaryInfo should contain theSameElementsAs Array(0.0, 3.0, 7.0, 10.0)
distribs(3).distribution.sum shouldBe 0
distribs(4).distribution.sum shouldBe 3
distribs(4).summaryInfo.length shouldBe bins
@@ -75,10 +76,9 @@ class FeatureDistributionTest extends FlatSpec with PassengerSparkFixtureTest wi
it should "be correctly created for text features" in {
val features = Array(description, gender)
val values: Array[(Boolean, ProcessedSeq)] = Array(
- (false, Left(RandomText.strings(1, 10).take(10000).toSeq.map(_.value.get))),
- (false, Left(RandomText.strings(1, 10).take(1000000).toSeq.map(_.value.get)))
+ (false, Left(RandomText.strings(1, 10).take(10000).toSeq.map(_.value.get)))
)
- val summary = Array(Summary(10000.0, 10000.0), Summary(1000000, 1000000))
+ val summary = Array(Summary(1000.0, 50000.0, 70000.0, 10))
val bins = 100
val featureKeys: Array[FeatureKey] = features.map(f => (f.name, None))
val processedSeqs: Array[Option[ProcessedSeq]] = values.map { case (isEmpty, processed) =>
@@ -91,8 +91,6 @@ class FeatureDistributionTest extends FlatSpec with PassengerSparkFixtureTest wi
distribs(0).distribution.length shouldBe 100
distribs(0).distribution.sum shouldBe 10000

- distribs(1).distribution.length shouldBe 200
- distribs(1).distribution.sum shouldBe 1000000
}

it should "be correctly created for map features" in {
@@ -102,9 +100,9 @@ class FeatureDistributionTest extends FlatSpec with PassengerSparkFixtureTest wi
Map("A" -> Right(Seq(1.0)), "B" -> Right(Seq(1.0))),
Map("B" -> Right(Seq(0.0))))
val summary = Array(
- Map("A" -> Summary(0.0, 1.0), "B" -> Summary(0.0, 5.0)),
- Map("A" -> Summary(-1.6, 10.6), "B" -> Summary(0.0, 3.0)),
- Map("B" -> Summary(0.0, 0.0)))
+ Map("A" -> Summary(0.0, 2.0, 100.0, 10), "B" -> Summary(0.0, 5.0, 10.0, 10)),
Contributor

Why did the max change for just the A key here?

Contributor Author
sxd929 Aug 31, 2018

This value is hardcoded. I changed it because one sample of "A" is Seq("male", "female"): the number of hashes in that text is already 2, so the max should not be smaller than 2.

+ Map("A" -> Summary(-1.6, 10.6, 30.0, 10), "B" -> Summary(0.0, 3.0, 11.0, 10)),
+ Map("B" -> Summary(0.0, 0.0, 0.0, 10)))
val bins = 10
val distribs = features.map(_.name).zip(summary).zip(values).flatMap { case ((name, summaryMaps), valueMaps) =>
summaryMaps.map { case (key, summary) =>
@@ -121,15 +119,15 @@ class FeatureDistributionTest extends FlatSpec with PassengerSparkFixtureTest wi
else d.distribution.length shouldBe 2
}
distribs(0).nulls shouldBe 0
- distribs(0).summaryInfo should contain theSameElementsAs Array(0.0, 1.0)
+ distribs(0).summaryInfo should contain theSameElementsAs Array(0.0, 2.0, 100.0, 10.0)
Contributor

Same here, why did the max change?

Contributor Author

This one reads its values from the hardcoded summaries above.

distribs(1).nulls shouldBe 1
distribs(0).distribution.sum shouldBe 2
distribs(1).distribution.sum shouldBe 0
distribs(2).summaryInfo.length shouldBe bins
distribs(2).distribution.sum shouldBe 1
distribs(4).distribution(0) shouldBe 1
distribs(4).distribution(1) shouldBe 0
- distribs(4).summaryInfo.length shouldBe 2
+ distribs(4).summaryInfo.length shouldBe 4
}

it should "correctly compare fill rates" in {
@@ -78,21 +78,23 @@ class PreparedFeaturesTest extends FlatSpec with TestSparkContext {
val (responseSummaries3, predictorSummaries3) = preparedFeatures3.summaries

responseSummaries1 should contain theSameElementsAs
- Seq(responseKey1 -> Summary(1.0, 1.0), responseKey2 -> Summary(0.5, 0.5))
+ Seq(responseKey1 -> Summary(1.0, 1.0, 1.0, 1), responseKey2 -> Summary(0.5, 0.5, 0.5, 1))
predictorSummaries1 should contain theSameElementsAs
- Seq(predictorKey1 -> Summary(0.0, 0.0), predictorKey2A -> Summary(2.0, 2.0), predictorKey2B -> Summary(1.0, 1.0))
+ Seq(predictorKey1 -> Summary(0.0, 0.0, 0.0, 2), predictorKey2A -> Summary(2.0, 2.0, 2.0, 1),
+ predictorKey2B -> Summary(1.0, 1.0, 1.0, 1))
responseSummaries2 should contain theSameElementsAs
- Seq(responseKey1 -> Summary(0.0, 0.0))
+ Seq(responseKey1 -> Summary(0.0, 0.0, 0.0, 1))
predictorSummaries2 should contain theSameElementsAs
- Seq(predictorKey1 -> Summary(0.4, 0.5))
+ Seq(predictorKey1 -> Summary(0.4, 0.5, 0.9, 2))
responseSummaries3 should contain theSameElementsAs
- Seq(responseKey2 -> Summary(-0.5, -0.5))
+ Seq(responseKey2 -> Summary(-0.5, -0.5, -0.5, 1))
predictorSummaries3 should contain theSameElementsAs
- Seq(predictorKey2A -> Summary(1.0, 1.0))
+ Seq(predictorKey2A -> Summary(1.0, 1.0, 1.0, 1))
allResponseSummaries should contain theSameElementsAs
- Seq(responseKey1 -> Summary(0.0, 1.0), responseKey2 -> Summary(-0.5, 0.5))
+ Seq(responseKey1 -> Summary(0.0, 1.0, 1.0, 2), responseKey2 -> Summary(-0.5, 0.5, 0.0, 2))
allPredictorSummaries should contain theSameElementsAs
- Seq(predictorKey1 -> Summary(0.0, 0.5), predictorKey2A -> Summary(1.0, 2.0), predictorKey2B -> Summary(1.0, 1.0))
+ Seq(predictorKey1 -> Summary(0.0, 0.5, 0.9, 4), predictorKey2A -> Summary(1.0, 2.0, 3.0, 2),
+ predictorKey2B -> Summary(1.0, 1.0, 1.0, 1))
}

it should "produce correct null-label leakage vector with single response" in {
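
The combined expectations in this test (allResponseSummaries, allPredictorSummaries) are what you get by adding the per-dataset summaries key by key. A small sketch of that arithmetic follows, with hypothetical names (MergeSketch, mergeAll) and plain strings standing in for the feature keys; how the library itself performs the merge is not shown in this diff.

```scala
// Hypothetical illustration of key-wise merging; not the library's code.
object MergeSketch {

  final case class Summary(min: Double, max: Double, sum: Double, count: Double)

  // Same combination rule as the Summary monoid in this PR.
  def plus(l: Summary, r: Summary): Summary =
    Summary(math.min(l.min, r.min), math.max(l.max, r.max), l.sum + r.sum, l.count + r.count)

  // Merges any number of per-dataset maps; keys missing from a map are simply skipped.
  def mergeAll(maps: Seq[Map[String, Summary]]): Map[String, Summary] =
    maps.flatten
      .groupBy { case (key, _) => key }
      .map { case (key, pairs) => key -> pairs.map(_._2).reduce(plus) }
}
```

Feeding it the three predictor maps from the test reproduces the expected combined values, for example Summary(0.0, 0.5, 0.9, 4) for predictorKey1.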
@@ -52,7 +52,7 @@ class RawFeatureFilterTest extends FlatSpec with PassengerSparkFixtureTest with
val allFeatureInfo = filter.computeFeatureStats(passengersDataSet, features)

allFeatureInfo.responseSummaries.size shouldBe 1
- allFeatureInfo.responseSummaries.headOption.map(_._2) shouldEqual Option(Summary(0, 1))
+ allFeatureInfo.responseSummaries.headOption.map(_._2) shouldEqual Option(Summary(0, 1, 1, 2))
allFeatureInfo.responseDistributions.size shouldBe 1
allFeatureInfo.predictorSummaries.size shouldBe 12
allFeatureInfo.predictorDistributions.size shouldBe 12
@@ -44,7 +44,11 @@ class SummaryTest extends FlatSpec with TestCommon {
val f2s = Summary(f2)
f1s.min shouldBe 3
f1s.max shouldBe 3
+ f1s.sum shouldBe 3
+ f1s.count shouldBe 1
f2s.min shouldBe 0.5
f2s.max shouldBe 1.0
+ f2s.sum shouldBe 1.5
+ f2s.count shouldBe 2
}
}