Avoid creating SparseVectors for LOCO #377

Merged: 19 commits, merged Aug 21, 2019
Changes from 1 commit
code check fixes
gerashegalov committed Aug 8, 2019
commit 1c383e5390074d7d3069423f76aef9ea79b692b0
@@ -128,7 +128,7 @@ class RecordInsightsLOCO[T <: Model[T]]
.map(_.index)
.distinct.sorted

-  private def computeDiffs
+  private def computeDiff
(
i: Int,
oldInd: Int,
@@ -168,13 +168,12 @@ class RecordInsightsLOCO[T <: Model[T]]
k: Int,
indexToExamine: Int
): Seq[LOCOValue] = {
-    val zdif = Array.fill(baseScore.length)(0.0)
val minMaxHeap = new MinMaxHeap(k)
val aggregationMap = mutable.Map.empty[String, (Array[Int], Array[Double])]

-    agggregateDiffs(0, Left(featureSparse), indexToExamine, zdif, minMaxHeap, aggregationMap,
+    agggregateDiffs(0, Left(featureSparse), indexToExamine, minMaxHeap, aggregationMap,
baseScore)
Collaborator:

So the sparse features you just put in a value of 0? Can't we just skip adding them to the heap?

gerashegalov (Contributor, Author) commented Aug 6, 2019:

I had the same idea, but in one of the iterations I ran into test failures and deferred it to later. I'll recheck now that I have everything green. @michaelweilsalesforce any thoughts?

Contributor:

What kind of failures have you encountered?

Collaborator:

It may be that we were doing an unnecessary calculation and that just happened to be captured in the test...

gerashegalov (Contributor, Author):

@michaelweilsalesforce you can reproduce it by commenting out lines 171-172. The failing assertions are:

Aggregate all the derived hashing tf features of rawFeature - text. 0.08025355373244505 was not less than 1.0E-10 expected aggregated LOCO value (0.006978569889777832) should be the same as actual (0.08723212362222289)

Aggregate x_HourOfDay and y_HourOfDay of rawFeature - dateFeature. 0.016493734169231777 was not less than 1.0E-10 expected aggregated LOCO value (0.016493734169231777) should be the same as actual (0.032987468338463555)

Contributor:

@leahmcguire @gerashegalov The reason for tracking zero values is that whenever we average the LOCOs of a same raw text feature, we also include the zero values.
E.g. if text feature TextA has, on a given row, 6 non-zero values loco1, ..., loco6 and 4 zeros, we divide by 10:
(loco1 + loco2 + ... + loco6) / 10
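
For illustration, a minimal sketch of this averaging rule (hypothetical helper, not the project's code). Note that the actual value in the second failure above is exactly twice the expected one, consistent with the divisor shrinking when the zeros are skipped:

```scala
// Sketch of the aggregation described above: the average over a raw
// feature's derived LOCO values counts zero-valued entries in the divisor.
def aggregatedLoco(nonZeroLocos: Seq[Double], zeroCount: Int): Double =
  nonZeroLocos.sum / (nonZeroLocos.size + zeroCount)

// 6 non-zero LOCO values plus 4 zeros: divide by 10, not by 6.
val locos = Seq(0.10, 0.20, 0.30, 0.15, 0.25, 0.05)
aggregatedLoco(locos, zeroCount = 4) // 1.05 / 10 = 0.105, not 1.05 / 6 = 0.175
```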

Contributor:

Let me write a fix that will not go over the zeros.

-    agggregateDiffs(featureSparse.size, Right(zeroValIndices), indexToExamine, zdif, minMaxHeap,
+    agggregateDiffs(featureSparse.size, Right(zeroValIndices), indexToExamine, minMaxHeap,
aggregationMap, baseScore)

// Adding LOCO results from aggregation map into heaps
@@ -192,19 +191,11 @@ class RecordInsightsLOCO[T <: Model[T]]
offset: Int,
featureVec: Either[SparseVector, Array[Int]],
indexToExamine: Int,
-    zdif: Array[Double],
minMaxHeap: MinMaxHeap,
aggregationMap: mutable.Map[String, (Array[Int], Array[Double])],
baseScore: Array[Double]
): Unit = {
-    (
-      featureVec match {
-        case Left(sparse) => (0 until sparse.size, sparse.indices).zipped
-          .map { case (i, oldInd) => (i, oldInd, computeDiffs(i, oldInd, sparse, baseScore)) }
-        case Right(zeroeIndices) => (0 until zeroeIndices.length, zeroeIndices).zipped
-          .map { case (i, oldInd) => (i + offset, oldInd, zdif) }
-      }
-    ).foreach { case (i, oldInd, diffToExamine) =>
+    computeDiffs(featureVec, offset, baseScore).foreach { case (i, oldInd, diffToExamine) =>
val history = histories(oldInd)
history match {
// If indicator value and descriptor value of a derived text feature are empty, then it is likely
@@ -227,6 +218,19 @@ class RecordInsightsLOCO[T <: Model[T]]
}
}

+  private def computeDiffs(
+    featureVec: Either[SparseVector, Array[Int]],
+    offset: Int, baseScore: Array[Double]
+  ) = {
+    val zdif = Array.fill(baseScore.length)(0.0)
+    featureVec match {
+      case Left(sparse) => (0 until sparse.size, sparse.indices).zipped
+        .map { case (i, oldInd) => (i, oldInd, computeDiff(i, oldInd, sparse, baseScore)) }
+      case Right(zeroeIndices) => (0 until zeroeIndices.length, zeroeIndices).zipped
+        .map { case (i, oldInd) => (i + offset, oldInd, zdif) }
+    }
+  }

override def transformFn: OPVector => TextMap = features => {
val baseResult = modelApply(labelDummy, features)
val baseScore = baseResult.score
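
For context, here is a self-contained sketch of the pattern the new computeDiffs helper follows (hypothetical names; plain (index, value) pairs stand in for Spark's SparseVector, and scoreWithout stands in for re-scoring with one feature removed). Explicit sparse entries each get a real per-class score diff, while all zero-valued indices share one all-zeros diff array, so nothing is allocated per zero-valued feature:

```scala
// Standalone sketch of the Either-based dispatch (assumed names/types):
// Left  = explicit (index, value) entries of the sparse feature vector,
// Right = indices of the zero-valued features.
def computeDiffsSketch(
  featureVec: Either[Array[(Int, Double)], Array[Int]],
  offset: Int,
  baseScore: Array[Double],
  scoreWithout: Int => Array[Double] // hypothetical: re-score with one feature zeroed
): Seq[(Int, Int, Array[Double])] = {
  // Shared all-zeros diff: removing an already-zero feature changes no score.
  val zdif = Array.fill(baseScore.length)(0.0)
  featureVec match {
    case Left(entries) =>
      entries.toSeq.zipWithIndex.map { case ((oldInd, _), i) =>
        val diff = baseScore.zip(scoreWithout(oldInd)).map { case (b, s) => b - s }
        (i, oldInd, diff)
      }
    case Right(zeroIndices) =>
      zeroIndices.toSeq.zipWithIndex.map { case (oldInd, i) => (i + offset, oldInd, zdif) }
  }
}
```

Returning the same zdif reference for every zero-valued index preserves the behavior discussed in the thread (zeros still participate in the aggregation average) without building SparseVector copies for them.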