-
Notifications
You must be signed in to change notification settings - Fork 393
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid creating SparseVectors for LOCO #377
Changes from 1 commit
66a5e99
77b97ee
35b8b5c
cb2dc05
a494129
1c383e5
a1d7d81
f95f4bf
a8ff84f
709594f
96b2941
ee76053
2248d49
c3f743d
dc39cb3
25a6ac5
3b7b63a
b271b74
087b1d2
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
- Loading branch information
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -114,7 +114,7 @@ class RecordInsightsLOCO[T <: Model[T]] | |
Set(FeatureType.typeName[DateMap], FeatureType.typeName[DateTimeMap]) | ||
|
||
// Indices of features derived from Text(Map)Vectorizer | ||
private lazy val textFeatureIndices = getIndicesOfFeatureType(textTypes ++ textMapTypes, | ||
private lazy val textFeatureIndices: Seq[Int] = getIndicesOfFeatureType(textTypes ++ textMapTypes, | ||
h => h.indicatorValue.isEmpty && h.descriptorValue.isEmpty) | ||
|
||
// Indices of features derived from Date(Map)Vectorizer | ||
|
@@ -125,9 +125,9 @@ class RecordInsightsLOCO[T <: Model[T]] | |
* @return Seq[Int] | ||
*/ | ||
private def getIndicesOfFeatureType(types: Set[String], predicate: OpVectorColumnHistory => Boolean): Seq[Int] = | ||
histories.filter(h => h.parentFeatureType.exists(types.contains) && predicate(h)) | ||
.map(_.index) | ||
.distinct.sorted | ||
histories.collect { | ||
case h if h.parentFeatureType.exists(types.contains) && predicate(h) => h.index | ||
}.distinct.sorted | ||
|
||
private def computeDiff | ||
( | ||
|
@@ -151,13 +151,15 @@ class RecordInsightsLOCO[T <: Model[T]] | |
descriptorValue.split("_").lastOption.flatMap(TimePeriod.withNameInsensitiveOption) | ||
|
||
private def getRawFeatureName(history: OpVectorColumnHistory): Option[String] = { | ||
val name = history.grouping match { | ||
case Some(grouping) => history.parentFeatureOrigins.headOption.map(_ + "_" + grouping) | ||
case None => history.parentFeatureOrigins.headOption | ||
} | ||
val groupSuffix = history.grouping.map("_" + _).getOrElse("") | ||
val name = history.parentFeatureOrigins.headOption.map(_ + groupSuffix) | ||
|
||
// If the descriptor value of a derived date feature exists, then it is likely to be | ||
// from unit circle transformer. We aggregate such features for each (rawFeatureName, timePeriod). | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this is true now - but may not always be true. If you want this to apply only for date unit circles should also check that one of the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This check is not consistent : Unit Circle Transformation in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Or I can check the parentType instead There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If this change is explicitly to deal with date features that are transformed to unit circle then the check needs to be explicitly for that. Otherwise this is also applied to lat lon values (and anything else that we add later) and if we just check the type of the parent it assumes that we will always have unit circle transformation of dates - which could change at some point... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree, but as I said above checking the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. DateMapVectorizer does days between reference date and the date. The only two that do unit vector are DateToUnitCircleTransformer and DateToUnitCircleVectorizer There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Then there must be a bug in the shortcut : when
Those features both use the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. blarg! you are right there is a bug in the feature history that means we loose info if the same feature undergoes multiple transformations :-( https://github.com/salesforce/TransmogrifAI/blob/master/features/src/main/scala/com/salesforce/op/utils/spark/OpVectorMetadata.scala#L53 Can you put a todo to update once the bug is fixed |
||
name.map(_ + history.descriptorValue.flatMap(convertToTimePeriod).map(p => "_" + p.entryName).getOrElse("")) | ||
name.map(_ + | ||
history.descriptorValue | ||
.flatMap(convertToTimePeriod) | ||
.map(p => "_" + p.entryName).getOrElse("")) | ||
} | ||
|
||
private def returnTopPosNeg | ||
|
@@ -229,14 +231,13 @@ class RecordInsightsLOCO[T <: Model[T]] | |
|
||
// Besides non 0 values, we want to check the text/date features as well | ||
val zeroValIndices = (textFeatureIndices ++ dateFeatureIndices) | ||
.filterNot { | ||
featureIndexSet.contains | ||
} | ||
.filterNot(featureIndexSet.contains) | ||
.toArray | ||
|
||
// Count zeros by feature name | ||
val zeroCountByFeature = zeroValIndices.map { case i => | ||
getRawFeatureName(histories(i)).get -> i | ||
}.groupBy(_._1).mapValues(_.length) | ||
val zeroCountByFeature = zeroValIndices | ||
.groupBy(i => getRawFeatureName(histories(i)).get) | ||
.mapValues(_.length).view.toMap | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What’s the point of .view here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. to force map materialization after toMap in 2.11 |
||
|
||
val k = $(topK) | ||
// Index where to examine the difference in the prediction vector | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe update comment to indicate only getting hashed text values