Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove duplicate features using sanity checker feature to feature correlations #476

Merged
merged 12 commits into from
May 19, 2020

Conversation

leahmcguire
Copy link
Collaborator

Related issues
Refer to issue(s) addressed in this pull request from Issues page.

Describe the proposed solution
When features are exact duplicates of each other we would like to remove one before modeling

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context about the changes here.

@codecov
Copy link

codecov bot commented May 11, 2020

Codecov Report

Merging #476 into master will increase coverage by 0.03%.
The diff coverage is 95.16%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #476      +/-   ##
==========================================
+ Coverage   87.00%   87.03%   +0.03%     
==========================================
  Files         345      345              
  Lines       11643    11671      +28     
  Branches      376      614     +238     
==========================================
+ Hits        10130    10158      +28     
  Misses       1513     1513              
Impacted Files Coverage Δ
...ala/com/salesforce/op/dsl/RichNumericFeature.scala 100.00% <ø> (ø)
...rce/op/stages/impl/preparators/SanityChecker.scala 91.25% <90.90%> (-0.24%) ⬇️
...tages/impl/preparators/SanityCheckerMetadata.scala 89.72% <95.12%> (+0.74%) ⬆️
...c/main/scala/com/salesforce/op/ModelInsights.scala 93.04% <100.00%> (-0.07%) ⬇️
...s/impl/preparators/DerivedFeatureFilterUtils.scala 93.08% <100.00%> (+0.31%) ⬆️
...es/src/main/scala/com/salesforce/op/OpParams.scala 89.79% <0.00%> (+4.08%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 24cdbc4...ac1c3a7. Read the comment docs.

Copy link
Contributor

@nicodv nicodv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice addition. Left several minor comments.

@@ -640,11 +671,13 @@ class SanityCheckerTest extends OpEstimatorSpec[OPVector, BinaryModel[RealNN, OP

val metadata: Metadata = getMetadata(outputColName, transformedData)
val summary = SanityCheckerSummary.fromMetadata(metadata.getSummaryMetadata())

println(summary.correlations)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

@@ -98,6 +99,7 @@ class BadFeatureZooTest extends FlatSpec with TestSparkContext with Logging {

// Check that all the PickList features are dropped and they are the only ones dropped
// (9 choices + "other" + empty = 11 total dropped columns)
println(retrieved.dropped)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

@@ -361,6 +373,10 @@ private[op] case class ColumnStatistics
corrLabel.filter(Math.abs(_) > maxCorrelation).map(corr =>
s"correlation $corr higher than max correlation $maxCorrelation"
),
column.flatMap{ case cl => featureCorrs.take(cl.index).find(Math.abs(_) > maxFeatureCorr).map(corr =>
s"this feature has correlations $corr with other features higher than max feature-feature" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Semantics but would: "with another feature higher than" be better?

)
def setMaxCorrelation(value: Double): this.type = set(maxCorrelation, value)
def getMaxCorrelation: Double = $(maxCorrelation)

final val maxFeatureCorr = new DoubleParam(
parent = this, name = "maxFeatureCorr",
doc = "Maximum correlation (absolute value) allowed between a feature two feature vectors which will" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: "a feature two feature vectors"

corrType: CorrelationType
) extends MetadataLike {
require(featuresIn.length == values.length, "Feature names and correlation values arrays must have the same length")
require(featuresIn.length == valuesWithLabel.length,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be defensive, maybe add a require for a square correlation matrix? (or later in the code, as it may not be calculated)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correlation matrix may be empty so not sure it makes sense

corrType = CorrelationType.withNameInsensitive(wrapped.get[String](SanityCheckerNames.CorrelationType))
)
val features = wrapped.getArray[String](SanityCheckerNames.FeaturesIn).toSeq
if (wrapped.underlyingMap.keySet.contains("values")) { // old sanity checker meta
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice backwards compatibility hack! maybe use "correlationsWithLabel" instead of "values", as its speaks more to the type of change that occurred?

@nicodv
Copy link
Contributor

nicodv commented May 14, 2020

LGTM

@@ -94,15 +94,24 @@ trait SanityCheckerParams extends DerivedFeatureFilterParams {
final val maxCorrelation = new DoubleParam(
parent = this, name = "maxCorrelation",
doc = "Maximum correlation (absolute value) allowed between a feature in the feature vector and the label",
isValid = ParamValidators.inRange(lowerBound = 0.0, upperBound = 1.0, lowerInclusive = true, upperInclusive = true)
isValid = ParamValidators.inRange(lowerBound = -0.1, upperBound = 1.1, lowerInclusive = true, upperInclusive = true)
Copy link
Contributor

@Jauntbox Jauntbox May 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did these ranges need to be changed? We shouldn't get correlations outside of [0, 1], right?

final val minCorrelation = new DoubleParam(
parent = this, name = "minCorrelation",
doc = "Minimum correlation (absolute value) allowed between a feature in the feature vector and the label",
isValid = ParamValidators.inRange(lowerBound = 0.0, upperBound = 1.0, lowerInclusive = true, upperInclusive = true)
isValid = ParamValidators.inRange(lowerBound = -0.1, upperBound = 1.1, lowerInclusive = true, upperInclusive = true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here - why are these changed?

stats.filter(s => s.corrLabel.isDefined && !s.corrLabel.get.isNaN).map(s => s.name -> s.corrLabel.get),
stats.filter(s => s.corrLabel.isDefined && s.corrLabel.get.isNaN).map(_.name),
correlations = new Correlations(
stats.filter(s => s.corrLabel.isDefined).map(s => (s.name, s.corrLabel.get, s.featureCorrs)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we not have to check for NaNs anymore? What if the feature is constant and we turn off the min variance threshold?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I fixed it - in that when it goes to metadata it turns it into strings and then when it gets out back into the class it is turned into doubles. Now nans work! The json serializer would not have helped because it was a problem with the spark metadata.

* @param values correlation of feature with label
* @param nanCorrs nan correlation features
* @param valuesWithLabel correlation of feature with label
* @param valuesWithFeatures correlations between features features
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"correlations between features features"

@@ -88,6 +88,7 @@ class BadFeatureZooTest extends FlatSpec with TestSparkContext with Logging {
val genFeatureVector = Seq(rawCity, rawCountry, rawPickList, rawCurrency).transmogrify()
val checkedFeatures = new SanityChecker()
.setCheckSample(1.0)
.setMaxFeatureCorr(1.1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see - you're using values > 1.0 to turn it off. I think it may be clearer if we keep the bounds in [0, 1.0], but make it a < check instead of a <= check (so that setting it to 1.0 would turn it off).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I did that! Turns out that with doubles you can get a correlation > 1.0 with perfectly correlated things :-P

Eg in a couple of test 1 had it set to 1.0 with corr > and it failed because the correlation was 1.00000000004

)
def setMaxCorrelation(value: Double): this.type = set(maxCorrelation, value)
def getMaxCorrelation: Double = $(maxCorrelation)

final val maxFeatureCorr = new DoubleParam(
Copy link
Contributor

@gerashegalov gerashegalov May 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better spell out Correlation in a public parameter

@leahmcguire leahmcguire merged commit d222438 into master May 19, 2020
@nicodv nicodv mentioned this pull request Jun 11, 2020
@tovbinm tovbinm deleted the lm/removeDups branch June 12, 2020 01:33
@salesforce-cla
Copy link

salesforce-cla bot commented Dec 9, 2020

Thanks for the contribution! Unfortunately we can't verify the commit author(s): leahmcguire <l***@s***.com> Leah McGuire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants