Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added check for multipicklist other class as not being binary in metadata #127

Merged
merged 6 commits into from
Sep 13, 2018

Conversation

leahmcguire
Copy link
Collaborator

Related issues
creation of spark attribute groups was mislabeling multipicklist other column as binary. this made tree models fail

Describe the proposed solution
multipicklist other is now a numeric not binary

@leahmcguire leahmcguire changed the title added check for multipicklist other class as not bing binary in metadata added check for multipicklist other class as not being binary in metadata Sep 12, 2018
Copy link
Contributor

@Jauntbox Jauntbox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - just two very minor things

@@ -75,7 +75,8 @@ class OpVectorMetadata private
newColumns: Array[OpVectorColumnMetadata]
): OpVectorMetadata = OpVectorMetadata(name, newColumns, history)

val categoricalTypes = Seq(FeatureType.typeName[Binary], FeatureType.typeName[BinaryMap])
private val categoricalTypes = Seq(FeatureType.typeName[Binary], FeatureType.typeName[BinaryMap])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we calling this categoricalTypes, when many of the actual categorical types (eg. PickList) are not included? Maybe binaryTypes instead?

@@ -155,6 +161,7 @@ object OpVectorColumnMetadata {
val DescriptorValueKey = "descriptor_value"
val IndicesKey = "indices"
val NullString = "NullIndicatorValue"
val OtherString = "OTHER"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This string is also defined TransmogrifierDefaults, can we make it so it's only defined in one place? I think it has to live here since it's in utils.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the TransmogrifierDefaults to point to this

@codecov
Copy link

codecov bot commented Sep 12, 2018

Codecov Report

Merging #127 into master will increase coverage by 0.02%.
The diff coverage is 85.71%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #127      +/-   ##
==========================================
+ Coverage   86.15%   86.18%   +0.02%     
==========================================
  Files         298      298              
  Lines        9672     9676       +4     
  Branches      340      542     +202     
==========================================
+ Hits         8333     8339       +6     
+ Misses       1339     1337       -2
Impacted Files Coverage Δ
...sforce/op/stages/impl/feature/Transmogrifier.scala 96.74% <100%> (ø) ⬆️
...m/salesforce/op/utils/spark/OpVectorMetadata.scala 85.45% <100%> (+0.54%) ⬆️
...sforce/op/utils/spark/OpVectorColumnMetadata.scala 77.27% <50%> (-1.3%) ⬇️
...om/salesforce/op/utils/spark/OpSparkListener.scala 98.7% <0%> (+1.29%) ⬆️
...es/src/main/scala/com/salesforce/op/OpParams.scala 89.79% <0%> (+4.08%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fdd780e...17250af. Read the comment docs.

@michaelweilsalesforce
Copy link
Contributor

@leahmcguire should we add in OpRandomForestTest a unit test on a data that contains almost all possible feature transformation we have?
I can help contribute to it...

@@ -42,8 +43,11 @@ trait AttributeAsserts {
* @param schema
* @param expectedNominal Expected array of booleans. True if the field is nominal, false if not.
*/
final def assertNominal(schema: StructField, expectedNominal: Array[Boolean]): Assertion = {
final def assertNominal(schema: StructField, expectedNominal: Array[Boolean], output: Array[OPVector]): Assertion = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please update docs for the assertNominal function

val attributes = AttributeGroup.fromStructField(schema).attributes
output.foreach{
x => x.value.toArray.zip(expectedNominal).foreach{ case (v , n) => if (n) (v == 0.0 || v == 1.0) shouldBe true }
Copy link
Collaborator

@tovbinm tovbinm Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also add a check when n == false? is so, then better:

for {
    x <- output
    (value, nominal) <- x.value.toArray.zip(expectedNominal)
} if (nominal) value should (be (0.0) or be (1.0)) else ???

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are no constraints on if n is false

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so it can be NaN, -/+Inf as well?

Copy link
Collaborator

@tovbinm tovbinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, some minor comments

@leahmcguire leahmcguire merged commit 57c9f71 into master Sep 13, 2018
@leahmcguire leahmcguire deleted the lm/binaryBugFix branch September 13, 2018 03:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants