Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid creating SparseVectors for LOCO #377

Merged
merged 19 commits into from
Aug 21, 2019
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
minor fixes
  • Loading branch information
gerashegalov committed Aug 8, 2019
commit ee760534cec31c712d221048d4757800a3bfa805
Original file line number Diff line number Diff line change
Expand Up @@ -114,7 +114,7 @@ class RecordInsightsLOCO[T <: Model[T]]
Set(FeatureType.typeName[DateMap], FeatureType.typeName[DateTimeMap])

// Indices of features derived from Text(Map)Vectorizer
private lazy val textFeatureIndices = getIndicesOfFeatureType(textTypes ++ textMapTypes,
private lazy val textFeatureIndices: Seq[Int] = getIndicesOfFeatureType(textTypes ++ textMapTypes,
h => h.indicatorValue.isEmpty && h.descriptorValue.isEmpty)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe update comment to indicate only getting hashed text values


// Indices of features derived from Date(Map)Vectorizer
Expand All @@ -125,9 +125,9 @@ class RecordInsightsLOCO[T <: Model[T]]
* @return Seq[Int]
*/
private def getIndicesOfFeatureType(types: Set[String], predicate: OpVectorColumnHistory => Boolean): Seq[Int] =
histories.filter(h => h.parentFeatureType.exists(types.contains) && predicate(h))
.map(_.index)
.distinct.sorted
histories.collect {
case h if h.parentFeatureType.exists(types.contains) && predicate(h) => h.index
}.distinct.sorted

private def computeDiff
(
Expand All @@ -151,13 +151,15 @@ class RecordInsightsLOCO[T <: Model[T]]
descriptorValue.split("_").lastOption.flatMap(TimePeriod.withNameInsensitiveOption)

private def getRawFeatureName(history: OpVectorColumnHistory): Option[String] = {
val name = history.grouping match {
case Some(grouping) => history.parentFeatureOrigins.headOption.map(_ + "_" + grouping)
case None => history.parentFeatureOrigins.headOption
}
val groupSuffix = history.grouping.map("_" + _).getOrElse("")
val name = history.parentFeatureOrigins.headOption.map(_ + groupSuffix)

// If the descriptor value of a derived date feature exists, then it is likely to be
// from unit circle transformer. We aggregate such features for each (rawFeatureName, timePeriod).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is true now - but may not always be true. If you want this to apply only for date unit circles should also check that one of the parentFeatureStages is a DateToUnitCircleTransformer or DateToUnitCircleVectorizer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is not consistent : Unit Circle Transformation in DateMapVectorizer is not reflected in the parentStages (Seq[DateMapVectorizer] instead).
I think the check on descriptor value is coherent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or I can check the parentType instead

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this change is explicitly to deal with date features that are transformed to unit circle then the check needs to be explicitly for that. Otherwise this is also applied to lat lon values (and anything else that we add later) and if we just check the type of the parent it assumes that we will always have unit circle transformation of dates - which could change at some point...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but as I said above checking the parentFeatureStages won't work : for instance DateMapVectorizer may apply Unit circle transformation

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DateMapVectorizer does days between reference date and the date. The only two that do unit vector are DateToUnitCircleTransformer and DateToUnitCircleVectorizer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then there must be a bug in the shortcut : when println(s"name ${history.columnName} stage ${history.parentFeatureStages} descriptor value ${history.descriptorValue}") I get

name dateMapFeature_k0_y_DayOfYear_33 stage ArrayBuffer(vecDateMap_DateMapVectorizer_00000000004c) descriptor value Some(y_DayOfYear)
name dateMapFeature_k1_x_DayOfYear_34 stage ArrayBuffer(vecDateMap_DateMapVectorizer_00000000004c) descriptor value Some(x_DayOfYear)
name dateMapFeature_k1_y_DayOfYear_35 stage ArrayBuffer(vecDateMap_DateMapVectorizer_00000000004c) descriptor value Some(y_DayOfYear)
name dateFeature_x_HourOfDay_0 stage ArrayBuffer() descriptor value Some(x_HourOfDay)
name dateFeature_y_HourOfDay_1 stage ArrayBuffer() descriptor value Some(y_HourOfDay)

Those features both use the .vetcorize shortcut.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blarg! you are right there is a bug in the feature history that means we loose info if the same feature undergoes multiple transformations :-( https://github.com/salesforce/TransmogrifAI/blob/master/features/src/main/scala/com/salesforce/op/utils/spark/OpVectorMetadata.scala#L53

Can you put a todo to update once the bug is fixed

name.map(_ + history.descriptorValue.flatMap(convertToTimePeriod).map(p => "_" + p.entryName).getOrElse(""))
name.map(_ +
history.descriptorValue
.flatMap(convertToTimePeriod)
.map(p => "_" + p.entryName).getOrElse(""))
}

private def returnTopPosNeg
Expand Down Expand Up @@ -229,14 +231,13 @@ class RecordInsightsLOCO[T <: Model[T]]

// Besides non 0 values, we want to check the text/date features as well
val zeroValIndices = (textFeatureIndices ++ dateFeatureIndices)
.filterNot {
featureIndexSet.contains
}
.filterNot(featureIndexSet.contains)
.toArray

// Count zeros by feature name
val zeroCountByFeature = zeroValIndices.map { case i =>
getRawFeatureName(histories(i)).get -> i
}.groupBy(_._1).mapValues(_.length)
val zeroCountByFeature = zeroValIndices
.groupBy(i => getRawFeatureName(histories(i)).get)
.mapValues(_.length).view.toMap
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What’s the point of .view here?

Copy link
Contributor Author

@gerashegalov gerashegalov Aug 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to force map materialization after toMap in 2.11


val k = $(topK)
// Index where to examine the difference in the prediction vector
Expand Down