Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added value to OpVectorColumnMetadata for numeric column descriptors #89

Merged
merged 7 commits into from
Aug 27, 2018

Conversation

leahmcguire
Copy link
Collaborator

@leahmcguire leahmcguire commented Aug 25, 2018

Related issues
We use the indicatorGroup and indicatorValue in OpVectorColumnMetadata not being empty to decide when we need to run sanity checker categorical tests in metadata. This is a problem for cases where we have information about a derived field that we want to represent but the field is not a boolean indicator (we have no place to put them). We have hacked around this by just putting this information in the indicatorValue anyway and not having anything in the indicator group. However this doesn't work if we have the same transformation for maps (which we need to add for feature complete).

Describe the proposed solution
Right now our column metadata is structured like this

case class OpVectorColumnMetadata
(
  parentFeatureName: Seq[String],
  parentFeatureType: Seq[String],
  indicatorGroup: Option[String],
  indicatorValue: Option[String],
  index: Int = 0
)

we will add a value to hold numeric column name information and rename indicatorGroup to grouping to make it's usage more clear.

case class OpVectorColumnMetadata
(
  parentFeatureName: Seq[String],
  parentFeatureType: Seq[String],
  grouping: Option[String],
  indicatorValue: Option[String] = None, // for categoricals
  descriptorValue: Option[String] = None, // for continuous things that need description 
  // eg geolocation lat, lon, acc or datetime unit circle timeperiod_x timeperiod_y
  index: Int = 0
) {
  assert(indicatorValue.isEmpty || descriptorValue.isEmpty)
}

Describe alternatives you've considered
put in an check for parent feature type to exclude specific types from the sanity checker categorical tests

@tovbinm
Copy link
Collaborator

tovbinm commented Aug 25, 2018

I propose to keep the name indicatorGroup (instead of grouping) to allow backward compatibility for existing models. otherwise lgtm.

@leahmcguire
Copy link
Collaborator Author

Existing models will not load with this release anyway because of the model selector change

@@ -75,6 +79,9 @@ case class OpVectorColumnMetadata
assert(parentFeatureName.length == parentFeatureType.length,
s"must provide both type and name for every parent feature," +
s" names: $parentFeatureName and types: $parentFeatureType do not have the same length")
assert(indicatorValue.isEmpty || descriptorValue.isEmpty, "cannot have both an indicatorValue and a," +
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cannot have both an indicatorValue and descriptorValue.

Copy link
Collaborator

@tovbinm tovbinm Aug 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are making breaking changes anyways, I keep thinking that a base trait + two specific classes makes even more sense. This way we would avoid asserts like this ^

trait OpVectorColumnMetadata
(
  def parentFeatureName: Seq[String]
  def parentFeatureType: Seq[String]
  def grouping: Option[String]
  def index: Int
)

case class CategoricalVectorColumnMetadata(
  parentFeatureName: Seq[String],
  parentFeatureType: Seq[String],
  grouping: Option[String],
  indicatorValue: Option[String],
  index: Int = 0
) extends OpVectorColumnMetadata

case class ContinuousVectorColumnMetadata(
  parentFeatureName: Seq[String],
  parentFeatureType: Seq[String],
  grouping: Option[String],
  descriptorValue: Option[String]
  index: Int = 0
) extends OpVectorColumnMetadata

Copy link
Collaborator Author

@leahmcguire leahmcguire Aug 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would break our internal usage - look at sanity checker @tovbinm

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will make a separate ticket do do this as part of metadata restructuring

@leahmcguire
Copy link
Collaborator Author

@Jauntbox thoughts on indicatorGroup versus grouping?

@codecov
Copy link

codecov bot commented Aug 25, 2018

Codecov Report

Merging #89 into master will increase coverage by 0.51%.
The diff coverage is 85.36%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #89      +/-   ##
==========================================
+ Coverage   85.47%   85.98%   +0.51%     
==========================================
  Files         294      294              
  Lines        8792     9555     +763     
  Branches      309      533     +224     
==========================================
+ Hits         7515     8216     +701     
- Misses       1277     1339      +62
Impacted Files Coverage Δ
...esforce/op/utils/spark/OpVectorColumnHistory.scala 0% <ø> (ø) ⬆️
...lesforce/op/test/TestOpVectorMetadataBuilder.scala 0% <ø> (ø) ⬆️
.../com/salesforce/op/features/TransientFeature.scala 70.58% <0%> (ø) ⬆️
...rce/op/stages/impl/feature/OpCountVectorizer.scala 100% <100%> (ø) ⬆️
...c/main/scala/com/salesforce/op/ModelInsights.scala 93.8% <100%> (ø) ⬆️
...rce/op/stages/impl/preparators/SanityChecker.scala 91.86% <100%> (ø) ⬆️
.../op/stages/impl/feature/TextMapNullEstimator.scala 100% <100%> (ø) ⬆️
...force/op/stages/impl/feature/OPMapVectorizer.scala 97.82% <100%> (ø) ⬆️
...sforce/op/stages/impl/feature/Transmogrifier.scala 96.68% <100%> (ø) ⬆️
...stages/impl/feature/GeolocationMapVectorizer.scala 100% <100%> (ø) ⬆️
... and 33 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ed98173...c3a2c96. Read the comment docs.

@codecov
Copy link

codecov bot commented Aug 27, 2018

Codecov Report

Merging #89 into master will increase coverage by 0.49%.
The diff coverage is 86.95%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #89      +/-   ##
==========================================
+ Coverage   85.62%   86.11%   +0.49%     
==========================================
  Files         292      292              
  Lines        8763     9490     +727     
  Branches      306      316      +10     
==========================================
+ Hits         7503     8172     +669     
- Misses       1260     1318      +58
Impacted Files Coverage Δ
...lesforce/op/test/TestOpVectorMetadataBuilder.scala 0% <ø> (ø) ⬆️
...esforce/op/utils/spark/OpVectorColumnHistory.scala 0% <ø> (ø) ⬆️
.../com/salesforce/op/features/TransientFeature.scala 70.58% <0%> (ø) ⬆️
...tages/impl/preparators/SanityCheckerMetadata.scala 91.59% <0%> (-0.62%) ⬇️
...s/impl/feature/OPCollectionHashingVectorizer.scala 96.37% <100%> (ø) ⬆️
.../op/stages/impl/feature/TextMapNullEstimator.scala 100% <100%> (ø) ⬆️
...rce/op/stages/impl/feature/NumericBucketizer.scala 100% <100%> (ø) ⬆️
...sforce/op/stages/impl/feature/Transmogrifier.scala 96.68% <100%> (ø) ⬆️
...rce/op/stages/impl/feature/OpCountVectorizer.scala 100% <100%> (ø) ⬆️
...ce/op/stages/impl/feature/DateListVectorizer.scala 99% <100%> (ø) ⬆️
... and 35 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a4afce4...9c73997. Read the comment docs.

@Jauntbox
Copy link
Contributor

I don't think we should continue using indicatorGroup since we want a more general grouping term that's appropriate for both indicator groups (those filled with binary indicator columns) and other groups (eg. x/y coords for datetimes, stuff we may add in the future like distances to pre-determined cluster centroids). I think grouping is a good general replacement.

The only case I have questions about now is how we would deal with cases where we may split a single feature into multiple groupings. I can't think of any examples where we'd want multiple indicator groups or multiple numeric groups, but we might have some in the future. One way is to generate multiple different separate grouping values from the original feature based on the transformation applied. I'm assuming that's the way we'd go?

Copy link
Contributor

@Jauntbox Jauntbox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -465,10 +465,10 @@ class SanityChecker(uid: String = UID[SanityChecker])

// Only perform Cramer's V calculation on columns that have an indicatorGroup and indicatorValue defined (right
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update comment to match new name

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leahmcguire I think this is the only one left. try grepping the codebase indicatorGroup

@@ -319,23 +319,6 @@ case object SanityCheckerSummary {
)
}

@deprecated("CategoricalStats replaced by Array[CategoricalGroupStats]", "3.3.0")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yay!

Copy link
Collaborator

@tovbinm tovbinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one minor comment left, lgtm!! let's create an issue to refactor for specific vector metadata classes later then.

@tovbinm tovbinm merged commit f37113b into master Aug 27, 2018
@tovbinm tovbinm deleted the lm/vectorMeta branch August 27, 2018 20:18
@salesforce-cla
Copy link

Thanks for the contribution! Unfortunately we can't verify the commit author(s): leahmcguire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants