-
Notifications
You must be signed in to change notification settings - Fork 394
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adjust bin values for text features in RFF #99
Conversation
Codecov Report
@@ Coverage Diff @@
## master #99 +/- ##
==========================================
- Coverage 86.08% 86.07% -0.01%
==========================================
Files 297 297
Lines 9659 9660 +1
Branches 336 429 +93
==========================================
Hits 8315 8315
- Misses 1344 1345 +1
Continue to review full report at Codecov.
|
val maxBins = MaxBins | ||
val numBins = math.min(math.max(bins, sum.max/AvgBinValue), maxBins).toInt | ||
|
||
val hasher: HashingTF = new HashingTF(numFeatures = numBins) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have to create hasher every time or perhaps we can create it once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the hashing dimension can be different for different features, the minimum number would be bins
but we can have a shared one with numFeatures = bins, and use that for every case if there is not too many tokens; OR we can create a couple shared hashers with different scales, and choose one based on the scale of token numbers
What do you think? I was assuming creating a hasher for every feature will not be very resource consuming
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let see if we can reuse the hashing function without creating HashingTF
everytime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and I think we can. See HashingTF.transform
and HashingTF
object
case Left(seq) => { | ||
val minBins = bins | ||
val maxBins = MaxBins | ||
val numBins = math.min(math.max(bins, sum.max/AvgBinValue), maxBins).toInt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sum.max / AvgBinValue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of .toInt
perhaps specify the explicit rounding instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed! thanks!!
case Left(seq) => { | ||
val minBins = bins | ||
val maxBins = MaxBins | ||
val numBins = math.min(math.max(bins, sum.max / AvgBinValue), maxBins).floor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may be we should make this more flexible
instead of defining this allow users to pass in a function for text bins eg:
val textBins: (min: Int, max: Int, sum: Long, count: Long) => Int
to the contructor
and add sum to the Summary info collected
@@ -142,6 +143,13 @@ case class FeatureDistribution | |||
private[op] object FeatureDistribution { | |||
|
|||
val MaxBins = 100000 | |||
val AvgBinValue = 5000 | |||
val MaxTokenLowerLimit = 10 | |||
val getBins = (sum: Summary, bins: Int) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add move this method to Summary class., ie.
private[op] case class Summary(min: Double, max: Double) {
// TODO: docs
def bins(bins: Int): Int = {
if (sum.max < MaxTokenLowerLimit) bins
else math.min(math.max(bins, sum.sum / AvgBinValue), MaxBins).toInt
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed!
…rce/TransmogrifAI into xs/adjustBinValueForText
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm, please replace Todo with TODO @sxd929
Shall we merge? @leahmcguire
* @param bins default bins of RFF | ||
* @return | ||
*/ | ||
def getBinsText(bins: Int): Int = bins |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The user should be able to pass the formula for how to compute the text feature bin size into RawFeatureFilter as a parameter. We should provide a default which for now can be no formula but later can be updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Map("A" -> Summary(-1.6, 10.6), "B" -> Summary(0.0, 3.0)), | ||
Map("B" -> Summary(0.0, 0.0))) | ||
val summaries = Array( | ||
Map("A" -> Summary(0.0, 2.0, 100.0, 10), "B" -> Summary(0.0, 5.0, 10.0, 10)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did the max change for just the A
key here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is hardcoded, I changed it since a sample of "A" is Seq("male", "female"), number of hashes in the text is already 2, so the max should not be smaller than 2
@@ -107,15 +119,15 @@ class FeatureDistributionTest extends FlatSpec with PassengerSparkFixtureTest wi | |||
else d.distribution.length shouldBe 2 | |||
} | |||
distribs(0).nulls shouldBe 0 | |||
distribs(0).summaryInfo should contain theSameElementsAs Array(0.0, 1.0) | |||
distribs(0).summaryInfo should contain theSameElementsAs Array(0.0, 2.0, 100.0, 10.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, why did the max change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this one reads the info from hardcoded 'summaries' above
textBinsFormula: (Summary, Int) => Int | ||
): (Array[Double], Array[Double]) = values match { | ||
case Left(seq) => | ||
val numBins = textBinsFormula(summary, bins) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why would the formula need the bins?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now we set is as def textBinsFormula(s: Summary, bins: Int) = bins
.
I can image default bins
can be returned in some cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what would you recommend then? @leahmcguire
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bins is the default number (e.g. 100) there might be cases that we will just want to fall back to the defaults
@@ -514,7 +517,8 @@ class OpWorkflow(val uid: String = UID[OpWorkflow]) extends OpWorkflowCore { | |||
maxJSDivergence: Double = 0.90, | |||
maxCorrelation: Double = 0.95, | |||
correlationType: CorrelationType = CorrelationType.Pearson, | |||
protectedFeatures: Array[OPFeature] = Array.empty | |||
protectedFeatures: Array[OPFeature] = Array.empty, | |||
textBinsFormula: (Summary, Int) => Int = RawFeatureFilter.textBinsFormula |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the int?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's the bins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah but that is not what we want the formula to look like long term
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so what is your suggestion on the inputs / outputs then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please just add to comments what the bins are
Thanks for the contribution! It looks like @sxd929 is an internal user so signing the CLA is not required. However, we need to confirm this. |
Related issues
Not enough hashing space for RFF of text may reduce the js distance (weaken the signal)
Describe the proposed solution
Describe alternatives you've considered
Alternative method like Jaccard similarity, word and sentence embedding might be considered
Additional context
N/A