Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify categorical variables in metadata #120

Merged
merged 16 commits into from
Sep 11, 2018
Merged

Conversation

michaelweilsalesforce
Copy link
Contributor

@michaelweilsalesforce michaelweilsalesforce commented Sep 7, 2018

Related issues
The one-hot encoding provided by Transmogrify doesn't specify that the created columns are categorical. As a consequence, Decision Tree and Random Forest will treat these columns as numerics.

Describe the proposed solution
For each feature engineering transformation that creates categorical columns, we add the Binary attribute to their metadata.

@michaelweilsalesforce michaelweilsalesforce changed the title Mw/categorical metadata Specify categorical variables in metadata Sep 7, 2018
import org.scalatest.Matchers
import org.scalatest.junit.JUnitRunner

object AttributeTestUtils extends Matchers{
Copy link
Collaborator

@tovbinm tovbinm Sep 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. better use a trait - it will be cleaner and simpler to use.
  2. make return type the Assertion
    here it is:
trait AttributeAsserts {
   self: Matchers =>
   final def assertNominal(schema: StructField, expectedNominal: Array[Boolean]): Assertion = ???
}

then mixin it like this:

@RunWith(classOf[JUnitRunner])
class BinaryVectorizerTest extends OpTransformerSpec[OPVector, BinaryVectorizer] with AttributeAsserts { ... }

@@ -72,6 +75,8 @@ class OpVectorMetadata private
newColumns: Array[OpVectorColumnMetadata]
): OpVectorMetadata = OpVectorMetadata(name, newColumns, history)

val textTypes = Seq(MultiPickList, MultiPickListMap, Text, TextArea, TextAreaMap, TextMap, Binary, BinaryMap,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. how is Binary and BinaryMap are testTypes?
  2. also instead please use FeatureType.shortTypeName[Text] etc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want the package name as well. E.g com.salesforce.features.types. MultiPickList

As it is specified in OpVectorColumnMetadata.parentFeatureType

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then FeatureType.typeName

@@ -31,15 +31,18 @@
package com.salesforce.op.utils.spark

import com.salesforce.op.FeatureHistory
import com.salesforce.op.features.types.{Binary, BinaryMap, MultiPickList, MultiPickListMap, Text, TextArea, TextAreaMap, TextList, TextMap}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import com.salesforce.op.features.types._

@codecov
Copy link

codecov bot commented Sep 9, 2018

Codecov Report

Merging #120 into master will increase coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #120      +/-   ##
==========================================
+ Coverage   86.18%   86.21%   +0.02%     
==========================================
  Files         297      297              
  Lines        9670     9676       +6     
  Branches      334      539     +205     
==========================================
+ Hits         8334     8342       +8     
+ Misses       1336     1334       -2
Impacted Files Coverage Δ
...m/salesforce/op/utils/spark/OpVectorMetadata.scala 84.9% <100%> (+1.92%) ⬆️
...om/salesforce/op/utils/spark/OpSparkListener.scala 97.4% <0%> (-1.3%) ⬇️
.../salesforce/op/features/FeatureBuilderMacros.scala 100% <0%> (+100%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e200397...23e805d. Read the comment docs.

* @param expectedNominal Expected array of booleans. True if the field is nominal, false if not.
*/
final def assertNominal(schema: StructField, expectedNominal: Array[Boolean]): Assertion = {
val attributes = AttributeGroup.fromStructField(schema).attributes.get
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avoid .get, rather do attributes.map(_.map(_.isNominal)) shouldBe Some(expectedNominal)

Copy link
Collaborator

@leahmcguire leahmcguire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, that is a lot of updated tests! My only real question is if those are the only types we want to make sure are flagged. Also text is not actually a binary attribute by default it is a count of occurrences of the hash. Is there another attribute group type for counts? or should they be numeric?

val categoricalTypes = Seq(FeatureType.typeName[MultiPickList], FeatureType.typeName[MultiPickListMap],
FeatureType.typeName[Text], FeatureType.typeName[TextArea], FeatureType.typeName[TextAreaMap],
FeatureType.typeName[TextMap], FeatureType.typeName[Binary], FeatureType.typeName[BinaryMap],
FeatureType.typeName[TextList])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

picklist? Combo box? country, state, city, id

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah If it is only hashing + count, let's remove all these Text types. Do we only do hashing to Combo box, country, state, city, id?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do pivot - so they should be picked up automatically. I think we also do pivot on multiPickList. So you may want to remove the categorical types check completely and only rely on the indicatorValue

.map { case (_, g) => g.head -> g.map(_.index) }
val colMeta = colData.map { case (c, i) =>
c.toMetadata(i)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit unnecessary new line hanging => is bad

.putMetadataArray(OpVectorMetadata.ColumnsKey, colMeta.toArray)
.putMetadata(OpVectorMetadata.HistoryKey, FeatureHistory.toMetadata(history))
.build()
val attributes = columns.map { c =>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

val attributes = columns.map {
    case c if c.indicatorValue.isDefined || categoricalTypes.exists(c.parentFeatureType.contains) =>
        BinaryAttribute.defaultAttr.withName(c.makeColName()).withIndex(c.index)
    case c =>
        NumericAttribute.defaultAttr.withName(c.makeColName()).withIndex(c.index)
}

Copy link
Collaborator

@tovbinm tovbinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@tovbinm tovbinm merged commit 6d69992 into master Sep 11, 2018
@tovbinm tovbinm deleted the mw/categorical-metadata branch September 11, 2018 01:47
@tovbinm
Copy link
Collaborator

tovbinm commented Sep 11, 2018

@michaelweilsalesforce please update the description of the PR to clarify the reasons and the solution.

@salesforce-cla
Copy link

salesforce-cla bot commented Apr 3, 2021

Thanks for the contribution! It looks like @mweilsalesforce is an internal user so signing the CLA is not required. However, we need to confirm this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants