
Multi-class classification training limit #414

Merged
46 commits merged into salesforce:master on Nov 1, 2019

Conversation

@AdamChit (Collaborator) commented Oct 8, 2019

Related issues
The DataBalancer for binary classification has a parameter that controls the maximum amount of data passed into modeling; multiclass classification should allow a similar limit.

Describe the proposed solution
Downsample the training data once it reaches the training set limit.
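A minimal sketch of that behavior, assuming a hypothetical maxTrainingSample limit and plain Spark sampling (the helper name and seed are illustrative, not the PR's API):

import org.apache.spark.sql.DataFrame

// Hypothetical helper: if the training set exceeds the limit, sample it back
// down to roughly maxTrainingSample rows; otherwise pass it through unchanged.
def limitTrainingData(data: DataFrame, maxTrainingSample: Long, seed: Long = 42L): DataFrame = {
  val total = data.count()
  if (total <= maxTrainingSample) data
  else data.sample(withReplacement = false, fraction = maxTrainingSample.toDouble / total, seed = seed)
}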

Describe alternatives you've considered
N/A

Additional context
Similar to #413.

AdamChit and others added 30 commits September 20, 2019 15:46
codecov bot commented Oct 8, 2019

Codecov Report

Merging #414 into master will decrease coverage by 0.03%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #414      +/-   ##
==========================================
- Coverage   86.96%   86.92%   -0.04%     
==========================================
  Files         337      337              
  Lines       11083    11099      +16     
  Branches      356      593     +237     
==========================================
+ Hits         9638     9648      +10     
- Misses       1445     1451       +6
Impacted Files Coverage Δ
...alesforce/op/stages/impl/tuning/DataBalancer.scala 95.95% <ø> (-0.16%) ⬇️
...alesforce/op/stages/impl/tuning/DataSplitter.scala 65% <ø> (-25%) ⬇️
...om/salesforce/op/stages/impl/tuning/Splitter.scala 98.3% <100%> (+0.22%) ⬆️
.../salesforce/op/stages/impl/tuning/DataCutter.scala 97.22% <100%> (+0.38%) ⬆️
.../op/features/types/FeatureTypeSparkConverter.scala 98.23% <0%> (-0.89%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0f03c43...7b23861.

summary = Option(DataCutterSummary(
  preSplitterDataCount = dataSetSize,
  downSamplingFraction = getDownSampleFraction,
  labelsKept = getLabelsToKeep,
  labelsDropped = getLabelsToDrop,
  labelsDroppedTotal = getLabelsDroppedTotal
))
PrevalidationVal(summary, Option(dataPrep))
Collaborator:

I think it makes more sense to do the downsampling in preValidationPrepare than in validationPrepare; the difference is that validationPrepare is called within the CV folds. For binary classification the upsampling needs to stay there to prevent label leakage, but since this change only downsamples, it can happen earlier.

Collaborator (Author):

The two main reasons for doing it this way:

  1. It keeps the behavior consistent with DataSplitter/DataBalancer (see the sketch below).
  2. It makes it easier to implement stratified sampling (or other data balancing techniques) in the future, which would upsample the minority classes.
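A minimal sketch of that placement, assuming the method and param names quoted later in this thread (validationPrepare, getDownSampleFraction); illustrative, not the PR's exact code:

import org.apache.spark.sql.{Dataset, Row}

// Sketch: the down-sample fraction is computed once up front and stored as a
// param, so the per-fold call only applies a plain sample; nothing is learned
// from the fold's labels here.
override def validationPrepare(data: Dataset[Row]): Dataset[Row] = {
  val dataPrep = super.validationPrepare(data)
  if (getDownSampleFraction < 1.0) {
    dataPrep.sample(withReplacement = false, getDownSampleFraction)
  } else dataPrep
}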

@@ -203,7 +233,11 @@ class DataCutter(uid: String = UID[DataCutter]) extends Splitter(uid = uid) with
         s" minLabelFraction = $minLabelFract, maxLabelCategories = $maxLabels. \n" +
         s"Label counts were: ${labelCounts.collect().toSeq}")
     }
-    DataCutterSummary(labelsKept.toSeq, labelsDropped.toSeq, labelsDroppedTotal.toLong)
+    DataCutterSummary(
+      labelsKept = labelsKept.toSeq,
Collaborator:

do you want to add the downsample fraction?

Collaborator (Author):

I didn't want to have to pass the dataset (or the dataset count) to the estimate function, so I added the dataset count and downsample fraction to the summary here.

@tovbinm (Collaborator) left a review: see minor comments

val dataPrep = super.validationPrepare(data)

// check if down sampling is needed
val balanced: DataFrame = if (getDownSampleFraction < 1) {
Collaborator:

use 1.0

@@ -273,6 +307,8 @@ private[impl] trait DataCutterParams extends SplitterParams {
   */
 case class DataCutterSummary
 (
+  preSplitterDataCount: Long = 0,
Collaborator:

use 0L

@@ -129,6 +129,23 @@ trait SplitterParams extends Params {
   def setReserveTestFraction(value: Double): this.type = set(reserveTestFraction, value)
   def getReserveTestFraction: Double = $(reserveTestFraction)

+  /**
+   * Fraction to sample majority data
+   * Value should be in ]0.0, 1.0]
Collaborator:

(0.0, 1.0]
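For context, a hedged sketch of how the documented range could be enforced with a Spark ML param validator (the trait name and wiring are illustrative, not the PR's code):

import org.apache.spark.ml.param.{DoubleParam, Params}

// Illustrative only: a DoubleParam whose validator enforces (0.0, 1.0],
// i.e. the fraction must be strictly positive and at most 1.0.
trait HasDownSampleFraction extends Params {
  final val downSampleFraction = new DoubleParam(this, "downSampleFraction",
    "fraction used to sample the data, in (0.0, 1.0]",
    (f: Double) => f > 0.0 && f <= 1.0)
  setDefault(downSampleFraction, 1.0)
  def getDownSampleFraction: Double = $(downSampleFraction)
}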

@@ -65,6 +68,8 @@ class DataCutterTest extends FlatSpec with TestSparkContext with SplitterSummary
     s.labelsKept.length shouldBe 1000
     s.labelsDropped.length shouldBe 0
     s shouldBe DataCutterSummary(
+      preSplitterDataCount = dataSize,
+      downSamplingFraction = 1,
Collaborator:

Use 1.0 instead of 1 here and everywhere with Double values
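The underlying point: Scala silently widens Int literals to Double, so 1 compiles wherever a Double is expected, but the explicit literal states the intent. A standalone illustration (not from the PR):

// Both lines compile, because Int literals widen to Double automatically,
// but the explicit 1.0 makes the intended type obvious at a glance.
val viaWidening: Double = 1     // works via Int -> Double widening
val explicit: Double = 1.0      // preferred for Double-valued params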

@AdamChit AdamChit merged commit c7f363f into salesforce:master Nov 1, 2019
@nicodv nicodv mentioned this pull request Jun 11, 2020