Added value to OpVectorColumnMetadata for numeric column descriptors #89

leahmcguire · 2018-08-25T03:16:14Z

Related issues
We use the indicatorGroup and indicatorValue in OpVectorColumnMetadata not being empty to decide when we need to run sanity checker categorical tests in metadata. This is a problem for cases where we have information about a derived field that we want to represent but the field is not a boolean indicator (we have no place to put them). We have hacked around this by just putting this information in the indicatorValue anyway and not having anything in the indicator group. However this doesn't work if we have the same transformation for maps (which we need to add for feature complete).

Describe the proposed solution
Right now our column metadata is structured like this

case class OpVectorColumnMetadata
(
  parentFeatureName: Seq[String],
  parentFeatureType: Seq[String],
  indicatorGroup: Option[String],
  indicatorValue: Option[String],
  index: Int = 0
)

we will add a value to hold numeric column name information and rename indicatorGroup to grouping to make it's usage more clear.

case class OpVectorColumnMetadata
(
  parentFeatureName: Seq[String],
  parentFeatureType: Seq[String],
  grouping: Option[String],
  indicatorValue: Option[String] = None, // for categoricals
  descriptorValue: Option[String] = None, // for continuous things that need description 
  // eg geolocation lat, lon, acc or datetime unit circle timeperiod_x timeperiod_y
  index: Int = 0
) {
  assert(indicatorValue.isEmpty || descriptorValue.isEmpty)
}

Describe alternatives you've considered
put in an check for parent feature type to exclude specific types from the sanity checker categorical tests

tovbinm · 2018-08-25T03:38:18Z

I propose to keep the name indicatorGroup (instead of grouping) to allow backward compatibility for existing models. otherwise lgtm.

leahmcguire · 2018-08-25T04:00:26Z

Existing models will not load with this release anyway because of the model selector change

tovbinm · 2018-08-25T04:09:40Z

features/src/main/scala/com/salesforce/op/utils/spark/OpVectorColumnMetadata.scala

@@ -75,6 +79,9 @@ case class OpVectorColumnMetadata
 assert(parentFeatureName.length == parentFeatureType.length,
 s"must provide both type and name for every parent feature," +
 s" names: $parentFeatureName and types: $parentFeatureType do not have the same length")
+ assert(indicatorValue.isEmpty || descriptorValue.isEmpty, "cannot have both an indicatorValue and a," +


cannot have both an indicatorValue and descriptorValue.

Since we are making breaking changes anyways, I keep thinking that a base trait + two specific classes makes even more sense. This way we would avoid asserts like this ^

trait OpVectorColumnMetadata ( def parentFeatureName: Seq[String] def parentFeatureType: Seq[String] def grouping: Option[String] def index: Int ) case class CategoricalVectorColumnMetadata( parentFeatureName: Seq[String], parentFeatureType: Seq[String], grouping: Option[String], indicatorValue: Option[String], index: Int = 0 ) extends OpVectorColumnMetadata case class ContinuousVectorColumnMetadata( parentFeatureName: Seq[String], parentFeatureType: Seq[String], grouping: Option[String], descriptorValue: Option[String] index: Int = 0 ) extends OpVectorColumnMetadata

this would break our internal usage - look at sanity checker @tovbinm

I will make a separate ticket do do this as part of metadata restructuring

leahmcguire · 2018-08-25T04:13:24Z

@Jauntbox thoughts on indicatorGroup versus grouping?

codecov · 2018-08-25T04:34:53Z

Codecov Report

Merging #89 into master will increase coverage by 0.51%.
The diff coverage is 85.36%.

@@            Coverage Diff             @@
##           master      #89      +/-   ##
==========================================
+ Coverage   85.47%   85.98%   +0.51%     
==========================================
  Files         294      294              
  Lines        8792     9555     +763     
  Branches      309      533     +224     
==========================================
+ Hits         7515     8216     +701     
- Misses       1277     1339      +62

Impacted Files	Coverage Δ
...esforce/op/utils/spark/OpVectorColumnHistory.scala	`0% <ø> (ø)`	⬆️
...lesforce/op/test/TestOpVectorMetadataBuilder.scala	`0% <ø> (ø)`	⬆️
.../com/salesforce/op/features/TransientFeature.scala	`70.58% <0%> (ø)`	⬆️
...rce/op/stages/impl/feature/OpCountVectorizer.scala	`100% <100%> (ø)`	⬆️
...c/main/scala/com/salesforce/op/ModelInsights.scala	`93.8% <100%> (ø)`	⬆️
...rce/op/stages/impl/preparators/SanityChecker.scala	`91.86% <100%> (ø)`	⬆️
.../op/stages/impl/feature/TextMapNullEstimator.scala	`100% <100%> (ø)`	⬆️
...force/op/stages/impl/feature/OPMapVectorizer.scala	`97.82% <100%> (ø)`	⬆️
...sforce/op/stages/impl/feature/Transmogrifier.scala	`96.68% <100%> (ø)`	⬆️
...stages/impl/feature/GeolocationMapVectorizer.scala	`100% <100%> (ø)`	⬆️
... and 33 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ed98173...c3a2c96. Read the comment docs.

codecov · 2018-08-27T17:07:00Z

Codecov Report

Merging #89 into master will increase coverage by 0.49%.
The diff coverage is 86.95%.

@@            Coverage Diff             @@
##           master      #89      +/-   ##
==========================================
+ Coverage   85.62%   86.11%   +0.49%     
==========================================
  Files         292      292              
  Lines        8763     9490     +727     
  Branches      306      316      +10     
==========================================
+ Hits         7503     8172     +669     
- Misses       1260     1318      +58

Impacted Files	Coverage Δ
...lesforce/op/test/TestOpVectorMetadataBuilder.scala	`0% <ø> (ø)`	⬆️
...esforce/op/utils/spark/OpVectorColumnHistory.scala	`0% <ø> (ø)`	⬆️
.../com/salesforce/op/features/TransientFeature.scala	`70.58% <0%> (ø)`	⬆️
...tages/impl/preparators/SanityCheckerMetadata.scala	`91.59% <0%> (-0.62%)`	⬇️
...s/impl/feature/OPCollectionHashingVectorizer.scala	`96.37% <100%> (ø)`	⬆️
.../op/stages/impl/feature/TextMapNullEstimator.scala	`100% <100%> (ø)`	⬆️
...rce/op/stages/impl/feature/NumericBucketizer.scala	`100% <100%> (ø)`	⬆️
...sforce/op/stages/impl/feature/Transmogrifier.scala	`96.68% <100%> (ø)`	⬆️
...rce/op/stages/impl/feature/OpCountVectorizer.scala	`100% <100%> (ø)`	⬆️
...ce/op/stages/impl/feature/DateListVectorizer.scala	`99% <100%> (ø)`	⬆️
... and 35 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a4afce4...9c73997. Read the comment docs.

…vectorMeta

Jauntbox · 2018-08-27T17:30:19Z

I don't think we should continue using indicatorGroup since we want a more general grouping term that's appropriate for both indicator groups (those filled with binary indicator columns) and other groups (eg. x/y coords for datetimes, stuff we may add in the future like distances to pre-determined cluster centroids). I think grouping is a good general replacement.

The only case I have questions about now is how we would deal with cases where we may split a single feature into multiple groupings. I can't think of any examples where we'd want multiple indicator groups or multiple numeric groups, but we might have some in the future. One way is to generate multiple different separate grouping values from the original feature based on the transformation applied. I'm assuming that's the way we'd go?

Jauntbox

LGTM

Jauntbox · 2018-08-27T17:31:50Z

core/src/main/scala/com/salesforce/op/stages/impl/preparators/SanityChecker.scala

@@ -465,10 +465,10 @@ class SanityChecker(uid: String = UID[SanityChecker])

 // Only perform Cramer's V calculation on columns that have an indicatorGroup and indicatorValue defined (right


Update comment to match new name

@leahmcguire I think this is the only one left. try grepping the codebase indicatorGroup

Jauntbox · 2018-08-27T17:33:04Z

core/src/main/scala/com/salesforce/op/stages/impl/preparators/SanityCheckerMetadata.scala

@@ -319,23 +319,6 @@ case object SanityCheckerSummary {
 )
 }

- @deprecated("CategoricalStats replaced by Array[CategoricalGroupStats]", "3.3.0")


tovbinm

one minor comment left, lgtm!! let's create an issue to refactor for specific vector metadata classes later then.

…89)

salesforce-cla · 2021-04-18T14:44:03Z

Thanks for the contribution! Unfortunately we can't verify the commit author(s): leahmcguire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.

added value to OpVectorColumnMetadata for numeric column descriptors

c3a2c96

leahmcguire requested a review from tovbinm as a code owner August 25, 2018 03:16

leahmcguire requested a review from Jauntbox August 25, 2018 03:16

tovbinm reviewed Aug 25, 2018

View reviewed changes

updated assert comment

086225e

leahmcguire added 2 commits August 27, 2018 10:07

Merge branch 'master' of github.com:salesforce/TransmogrifAI into lm/…

265b7c7

…vectorMeta

updated comment, added todo, deleted sanity meta backwards compatibility

6b55712

Jauntbox approved these changes Aug 27, 2018

View reviewed changes

tovbinm approved these changes Aug 27, 2018

View reviewed changes

leahmcguire and others added 3 commits August 27, 2018 11:32

changed indicatorGroup to grouping everywhere

2f4ac34

Merge branch 'master' into lm/vectorMeta

90d6be1

Merge branch 'master' into lm/vectorMeta

9c73997

tovbinm merged commit f37113b into master Aug 27, 2018

tovbinm deleted the lm/vectorMeta branch August 27, 2018 20:18

ericwayman pushed a commit that referenced this pull request Feb 8, 2019

Added value to OpVectorColumnMetadata for numeric column descriptors (#…

34b9df2

…89)

salesforce-cla bot added the cla:missing label Apr 18, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added value to OpVectorColumnMetadata for numeric column descriptors #89

Added value to OpVectorColumnMetadata for numeric column descriptors #89

leahmcguire commented Aug 25, 2018 •

edited

Loading

tovbinm commented Aug 25, 2018 •

edited

Loading

leahmcguire commented Aug 25, 2018

tovbinm Aug 25, 2018

tovbinm Aug 25, 2018 •

edited

Loading

leahmcguire Aug 27, 2018 •

edited

Loading

leahmcguire Aug 27, 2018

leahmcguire commented Aug 25, 2018

codecov bot commented Aug 25, 2018 •

edited

Loading

codecov bot commented Aug 27, 2018 •

edited

Loading

Jauntbox commented Aug 27, 2018

Jauntbox left a comment

Jauntbox Aug 27, 2018

tovbinm Aug 27, 2018

Jauntbox Aug 27, 2018

tovbinm left a comment

salesforce-cla bot commented Apr 18, 2021

		@@ -465,10 +465,10 @@ class SanityChecker(uid: String = UID[SanityChecker])

		// Only perform Cramer's V calculation on columns that have an indicatorGroup and indicatorValue defined (right

Added value to OpVectorColumnMetadata for numeric column descriptors #89

Added value to OpVectorColumnMetadata for numeric column descriptors #89

Conversation

leahmcguire commented Aug 25, 2018 • edited Loading

tovbinm commented Aug 25, 2018 • edited Loading

leahmcguire commented Aug 25, 2018

tovbinm Aug 25, 2018

Choose a reason for hiding this comment

tovbinm Aug 25, 2018 • edited Loading

Choose a reason for hiding this comment

leahmcguire Aug 27, 2018 • edited Loading

Choose a reason for hiding this comment

leahmcguire Aug 27, 2018

Choose a reason for hiding this comment

leahmcguire commented Aug 25, 2018

codecov bot commented Aug 25, 2018 • edited Loading

Codecov Report

codecov bot commented Aug 27, 2018 • edited Loading

Codecov Report

Jauntbox commented Aug 27, 2018

Jauntbox left a comment

Choose a reason for hiding this comment

Jauntbox Aug 27, 2018

Choose a reason for hiding this comment

tovbinm Aug 27, 2018

Choose a reason for hiding this comment

Jauntbox Aug 27, 2018

Choose a reason for hiding this comment

tovbinm left a comment

Choose a reason for hiding this comment

salesforce-cla bot commented Apr 18, 2021

leahmcguire commented Aug 25, 2018 •

edited

Loading

tovbinm commented Aug 25, 2018 •

edited

Loading

tovbinm Aug 25, 2018 •

edited

Loading

leahmcguire Aug 27, 2018 •

edited

Loading

codecov bot commented Aug 25, 2018 •

edited

Loading

codecov bot commented Aug 27, 2018 •

edited

Loading