
Regression training limit #413

Merged: 19 commits merged into salesforce:master on Oct 8, 2019

Conversation

@AdamChit (Collaborator) commented Oct 3, 2019

Related issues

The DataBalancer for binary classification has a parameter that caps the amount of data passed into modeling; regression should allow a similar limit.

Describe the proposed solution

Steps:

  1. Investigate where the check should occur (somewhere in DataSplitter)

  2. Add logic to down-sample when the maximum number of records is exceeded

  3. Add downsampling information to the summary object and log

  4. Add tests to DataSplitterTest and RegressionModelSelectorTest to cover the downsampling logic
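The down-sampling in steps 2 and 3 could be sketched roughly as follows (the parameter name maxTrainingSample is taken from the diff discussed below; the helper name and its exact placement inside DataSplitter are assumptions, not the actual implementation):

```scala
import org.apache.spark.sql.Dataset

// Rough sketch: cap training data at maxTrainingSample records by
// down-sampling with Spark's approximate sample().
def downSampleIfNeeded[T](data: Dataset[T], maxTrainingSample: Long, seed: Long): Dataset[T] = {
  val dataSetSize = data.count().toDouble
  val downSampleFraction = math.min(maxTrainingSample / dataSetSize, 1.0)
  if (downSampleFraction < 1.0) {
    // sample() keeps each record independently with probability downSampleFraction,
    // so the resulting size is approximately maxTrainingSample
    data.sample(withReplacement = false, downSampleFraction, seed)
  } else data
}
```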

Describe alternatives you've considered

Not having a limit for any of the model types. This was not optimal because some Spark models may have very long runtimes or behave badly with too much data. So the default will be to down-sample once we pass 1M records, and to give the user the option to set their own maxTrainingSample if they are OK with working with a large dataset.
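As a usage sketch (the setter name setMaxTrainingSample appears in the diff below; the factory signature and the override value shown here are assumptions):

```scala
// Hypothetical usage: rely on the 1M-record default, or raise the cap
// explicitly when training on a larger dataset is acceptable.
val splitter = DataSplitter(seed = 42L, reserveTestFraction = 0.1)
  .setMaxTrainingSample(5000000) // override the assumed 1M default
```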

@salesforce-cla bot commented Oct 3, 2019

Thanks for the contribution! Before we can merge this, we need @AdamChit to sign the Salesforce.com Contributor License Agreement.

@AdamChit (Collaborator, Author) commented Oct 3, 2019

@TuanNguyen27 Could you review?

*
* @group param
*/
private[op] final val downSampleFraction = new DoubleParam(this, "downSampleFraction",
Collaborator:

these can be protected instead of private to OP

Collaborator:

also maybe set a default value of 1

Collaborator (Author):

Added the default value here
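A minimal sketch of the param with the suggested default of 1 (only the param name and the DoubleParam type appear in the diff; the doc string, range validator, trait wrapper, and getter are assumptions):

```scala
import org.apache.spark.ml.param.{DoubleParam, ParamValidators, Params}

trait DownSampleFractionParam extends Params {
  // downSampleFraction = 1.0 means "keep all records" (no down-sampling)
  final val downSampleFraction = new DoubleParam(this, "downSampleFraction",
    "fraction used to down-sample the training data",
    ParamValidators.inRange(0.0, 1.0, lowerInclusive = false, upperInclusive = true))
  setDefault(downSampleFraction, 1.0)

  def getDownSampleFraction: Double = $(downSampleFraction)
}
```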

): DataSplitter = {
new DataSplitter()
.setSeed(seed)
.setReserveTestFraction(reserveTestFraction)
.setMaxTrainingSample(maxTrainingSample)
Collaborator:

is this also exposed for the DataCutter class?

Collaborator (Author):

Yes, I added it to SplitterParams, which DataCutter has access to (e4b8a92), so that I can use the same set/get functions across DataBalancer, DataCutter and DataSplitter.
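The sharing described here could look roughly like this (a sketch only; the param type and the 1M default are assumptions inferred from the tests in this PR):

```scala
import org.apache.spark.ml.param.{LongParam, Params}

// Sketch: one trait holds the shared param, so DataBalancer, DataCutter and
// DataSplitter (which all mix in SplitterParams) get identical set/get methods.
trait SplitterParams extends Params {
  final val maxTrainingSample = new LongParam(this, "maxTrainingSample",
    "maximum number of records to use for training")
  setDefault(maxTrainingSample, 1000000L)

  def setMaxTrainingSample(value: Long): this.type = set(maxTrainingSample, value)
  def getMaxTrainingSample: Long = $(maxTrainingSample)
}
```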

modelSelector.splitter.get.getMaxTrainingSample shouldBe 1000
}

it should "set maxTrainingSample and down-sample" in {
Contributor:

Did you also mean to check maxTrainingSample was set in this test as well?

Collaborator (Author):

I moved testing of the maxTrainingSample set/get functions here and forgot to rename this test. Great catch!

summary = Option(DataSplitterSummary())
val dataSetSize = data.count().toDouble
val sampleF = getMaxTrainingSample / dataSetSize
val DownSampleFraction = if (getMaxTrainingSample < dataSetSize) sampleF else 1
Contributor:

could be a min here too

also: lowerCamelCase

Suggested change:
- val DownSampleFraction = if (getMaxTrainingSample < dataSetSize) sampleF else 1
+ val downSampleFraction = math.min(sampleF, 1.0)

@@ -43,6 +43,7 @@ class DataSplitterTest extends FlatSpec with TestSparkContext with SplitterSumma

val seed = 1234L
val dataCount = 1000
val MaxTrainingSampleDefault = 1E6.toLong
Contributor:

I would give this a different name so we don't confuse it with the actual default in SplitterParams


dataSplitter.getMaxTrainingSample shouldBe maxRows
}

Contributor:

It's probably worth checking that the downSampleFraction param was set here as well, for completeness, so that everything in DataSplitterParams is covered.

Collaborator (Author):

Made changes here: 433d483

@codecov bot commented Oct 4, 2019

Codecov Report

Merging #413 into master will increase coverage by 0.02%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #413      +/-   ##
==========================================
+ Coverage   86.97%   86.99%   +0.02%     
==========================================
  Files         337      337              
  Lines       11060    11078      +18     
  Branches      357      597     +240     
==========================================
+ Hits         9619     9637      +18     
  Misses       1441     1441
Impacted Files Coverage Δ
...alesforce/op/stages/impl/tuning/DataBalancer.scala 96.11% <ø> (-0.18%) ⬇️
...om/salesforce/op/stages/impl/tuning/Splitter.scala 98.07% <100%> (+0.34%) ⬆️
...e/op/stages/impl/selector/ModelSelectorNames.scala 100% <100%> (ø) ⬆️
...alesforce/op/stages/impl/tuning/DataSplitter.scala 90% <100%> (+23.33%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1bf6fdf...cfbe22f.

@tovbinm tovbinm changed the title Achit/regression training limit Regression training limit Oct 4, 2019
case s if s == classOf[DataSplitterSummary].getName => DataSplitterSummary(
preSplitterDataCount = metadata.getLong(ModelSelectorNames.PreSplitterDataCount),
downSamplingFraction = metadata.getDouble(ModelSelectorNames.DownSample)
)
Collaborator:

please add the downsample fraction to the DataCutter params as well

Collaborator (Author):

I added the downsample fraction into the DataCutter params as part of the multi-class classification training limit changes. I'll create the PR for it today.

@AdamChit (Collaborator, Author) commented Oct 7, 2019

An unrelated test failed during the Travis CI check (https://travis-ci.com/salesforce/TransmogrifAI/jobs/243091644). I'll create a ticket to increase the tolerance on that test.

@AdamChit AdamChit merged commit 53dd954 into salesforce:master Oct 8, 2019
@sanmitra sanmitra mentioned this pull request Oct 11, 2019
@nicodv nicodv mentioned this pull request Jun 11, 2020