Specify categorical variables in metadata #120

Merged 16 commits on Sep 11, 2018
FeatureType.typeName
mweilsalesforce committed Sep 9, 2018
commit 78edecaed67ddabe964aab041270ac77c71208a6
@@ -31,7 +31,7 @@
package com.salesforce.op.utils.spark

import com.salesforce.op.FeatureHistory
-import com.salesforce.op.features.types._
+import com.salesforce.op.features.types.{FeatureType, _}
import org.apache.spark.ml.attribute.{AttributeGroup, BinaryAttribute, NumericAttribute}
import org.apache.spark.ml.linalg.SQLDataTypes._
import org.apache.spark.sql.types.{Metadata, MetadataBuilder, StructField}
@@ -75,8 +75,10 @@ class OpVectorMetadata private
newColumns: Array[OpVectorColumnMetadata]
): OpVectorMetadata = OpVectorMetadata(name, newColumns, history)

-val textTypes = Seq(MultiPickList, MultiPickListMap, Text, TextArea, TextAreaMap, TextMap, Binary, BinaryMap,
-  TextList).map(_.getClass.getName.dropRight(1))
+val categoricalTypes = Seq(FeatureType.typeName[MultiPickList], FeatureType.typeName[MultiPickListMap],
+  FeatureType.typeName[Text], FeatureType.typeName[TextArea], FeatureType.typeName[TextAreaMap],
+  FeatureType.typeName[TextMap], FeatureType.typeName[Binary], FeatureType.typeName[BinaryMap],
+  FeatureType.typeName[TextList])
Collaborator

What about PickList? ComboBox? Country, State, City, ID?

Contributor Author

Oh yeah. If it is only hashing + count, let's remove all these Text types. Do we only apply hashing to ComboBox, Country, State, City, and ID?

Collaborator

We pivot those, so they should be picked up automatically. I think we also pivot MultiPickList, so you may want to remove the categorical types check completely and rely only on indicatorValue.
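
For illustration, relying only on indicatorValue could look roughly like the sketch below. It is a minimal sketch, not the code that was merged: `ColumnMeta`, `Attr`, `BinaryAttr`, `NumericAttr`, and `AttributeSketch` are hypothetical stand-ins for `OpVectorColumnMetadata` and Spark's `BinaryAttribute` / `NumericAttribute`, so that the example runs without Spark or this project on the classpath.

```scala
// Hypothetical stand-in for OpVectorColumnMetadata: only the fields the check needs.
final case class ColumnMeta(name: String, index: Int, indicatorValue: Option[String])

// Hypothetical stand-ins for Spark's BinaryAttribute and NumericAttribute.
sealed trait Attr { def name: String; def index: Int }
final case class BinaryAttr(name: String, index: Int) extends Attr
final case class NumericAttr(name: String, index: Int) extends Attr

object AttributeSketch {
  // Assumption taken from the review thread: pivoted columns carry an
  // indicatorValue, so that field alone decides binary vs. numeric and
  // no list of categorical parent types is consulted.
  def toAttribute(c: ColumnMeta): Attr =
    if (c.indicatorValue.isDefined) BinaryAttr(c.name, c.index)
    else NumericAttr(c.name, c.index)

  def toAttributes(columns: Seq[ColumnMeta]): Seq[Attr] =
    columns.map(toAttribute)
}
```

With this shape, a hashed text column (no indicatorValue) is tagged numeric even when its parent type is Text, which seems to be the behavior the comment above is asking for.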


/**
* Serialize to spark metadata
@@ -96,7 +98,7 @@ class OpVectorMetadata private
.putMetadata(OpVectorMetadata.HistoryKey, FeatureHistory.toMetadata(history))
.build()
val attributes = columns.map { c =>
Collaborator

val attributes = columns.map {
    case c if c.indicatorValue.isDefined || categoricalTypes.exists(c.parentFeatureType.contains) =>
        BinaryAttribute.defaultAttr.withName(c.makeColName()).withIndex(c.index)
    case c =>
        NumericAttribute.defaultAttr.withName(c.makeColName()).withIndex(c.index)
}

-if (c.indicatorValue.isDefined || textTypes.exists(c.parentFeatureType.contains)) {
+if (c.indicatorValue.isDefined || categoricalTypes.exists(c.parentFeatureType.contains)) {
BinaryAttribute.defaultAttr.withName(c.makeColName()).withIndex(c.index)
} else {
NumericAttribute.defaultAttr.withName(c.makeColName()).withIndex(c.index)