Regression training limit #413

Merged: 19 commits, Oct 8, 2019
Commits
e4b8a92
refactored maxTrainingSample get and set function so that all classes…
AdamChit Sep 20, 2019
2170254
added downsampling logic if MaxTrainingSample reached
AdamChit Sep 20, 2019
dff09b9
added unit tests for downsampling in regression data splitter
AdamChit Sep 20, 2019
14c6b42
added integration tests to test downsampling from the model selector …
AdamChit Sep 20, 2019
722341b
style changes
AdamChit Oct 3, 2019
34d5bf1
changed the test to reduce run time
AdamChit Oct 3, 2019
8e2778d
Merge branch 'master' into achit/regression-training-limit
AdamChit Oct 3, 2019
433d483
test now checks all data splitter params
AdamChit Oct 4, 2019
0932810
Merge branch 'achit/regression-training-limit' of https://github.com/…
AdamChit Oct 4, 2019
2b02f8a
Update core/src/test/scala/com/salesforce/op/stages/impl/tuning/DataS…
AdamChit Oct 4, 2019
0521a37
Update core/src/test/scala/com/salesforce/op/stages/impl/regression/R…
AdamChit Oct 4, 2019
962e06f
added downSampleFraction default value and made style changes
AdamChit Oct 4, 2019
80a80d5
Merge branch 'achit/regression-training-limit' of https://github.com/…
AdamChit Oct 4, 2019
0ab4d9a
renamed test
AdamChit Oct 4, 2019
ef4327c
changed getDownSampleFraction to protected
AdamChit Oct 5, 2019
8ca0e78
name change based on RP comments
AdamChit Oct 5, 2019
8e67f27
added datacount to summary
AdamChit Oct 5, 2019
009706d
Trigger re-build
AdamChit Oct 7, 2019
cfbe22f
Trigger travis re-build
AdamChit Oct 7, 2019
name change based on RP comments
AdamChit committed Oct 5, 2019
commit 8ca0e78fb2f5e1f7e0ba869f9fa86a8387f4f969
@@ -43,7 +43,7 @@ class DataSplitterTest extends FlatSpec with TestSparkContext with SplitterSumma

   val seed = 1234L
   val dataCount = 1000
-  val MaxTrainingSampleDefault = 1E6.toLong
+  val trainingLimitDefault = 1E6.toLong

   val data =
     RandomRDDs.normalVectorRDD(sc, 1000, 3, seed = seed)
@@ -57,8 +57,8 @@ class DataSplitterTest extends FlatSpec with TestSparkContext with SplitterSumma
     train.count() shouldBe dataCount
   }

-  it should "down-sample when the data count is above the max allowed" in {
-    val numRows = MaxTrainingSampleDefault * 2
+  it should "down-sample when the data count is above the default training limit" in {
+    val numRows = trainingLimitDefault * 2
     val data =
       RandomRDDs.normalVectorRDD(sc, numRows, 3, seed = seed)
         .map(v => (1.0, Vectors.dense(v.toArray), "A")).toDF()
@@ -67,9 +67,9 @@ class DataSplitterTest extends FlatSpec with TestSparkContext with SplitterSumma
     val dataBalanced = dataSplitter.validationPrepare(data)
     // validationPrepare calls the data sample method that samples the data to a target ratio but there is an epsilon
     // to how precise this function is which is why we need to check around that epsilon
-    val samplingErrorEpsilon = (0.1 * MaxTrainingSampleDefault).toLong
+    val samplingErrorEpsilon = (0.1 * trainingLimitDefault).toLong

-    dataBalanced.count() shouldBe MaxTrainingSampleDefault +- samplingErrorEpsilon
+    dataBalanced.count() shouldBe trainingLimitDefault +- samplingErrorEpsilon
   }

   it should "set and get all data splitter params" in {
@@ -103,7 +103,7 @@ class DataSplitterTest extends FlatSpec with TestSparkContext with SplitterSumma
   it should "keep the data unchanged when prepare is called" in {
     val summary = dataSplitter.preValidationPrepare(data)
     val train = dataSplitter.validationPrepare(data)
-    val sampleF = MaxTrainingSampleDefault / dataCount.toDouble
+    val sampleF = trainingLimitDefault / dataCount.toDouble
     val downSampleFraction = math.min(sampleF, 1.0)
     train.collect().zip(data.collect()).foreach { case (a, b) => a shouldBe b }
     assertDataSplitterSummary(summary.summaryOpt) { s => s shouldBe DataSplitterSummary(downSampleFraction) }
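
The arithmetic these tests rely on can be sketched in plain Scala without Spark. This is only an illustration of the expectations visible in the diff: the fraction is `min(trainingLimit / dataCount, 1.0)`, and the down-sampled count is checked within 10% of the limit. The object and method names below (`DownSampleSketch`, `downSampleFraction`, `withinSamplingTolerance`) are hypothetical and are not taken from the `DataSplitter` implementation, which this commit view does not show.

```scala
// Hypothetical helper, not the library's API: it mirrors the arithmetic
// used by the test expectations in the diff above.
object DownSampleSketch {

  /** Fraction of rows to keep so that roughly `trainingLimit` rows remain.
    * Capped at 1.0, so data already under the limit is left unchanged. */
  def downSampleFraction(dataCount: Long, trainingLimit: Long): Double = {
    require(dataCount > 0, "dataCount must be positive")
    math.min(trainingLimit.toDouble / dataCount, 1.0)
  }

  /** True when `observed` lies within `relTol * trainingLimit` rows of the limit.
    * The 0.1 default matches the epsilon used in the test; it is an assumption
    * about acceptable sampling error, not a documented guarantee. */
  def withinSamplingTolerance(observed: Long, trainingLimit: Long, relTol: Double = 0.1): Boolean = {
    val epsilon = (relTol * trainingLimit).toLong
    math.abs(observed - trainingLimit) <= epsilon
  }
}
```

With the values from the test (`dataCount = 1000`, a limit of `1E6.toLong`), the fraction is capped at 1.0, which is why the "keep the data unchanged" case expects the rows to come back as they went in. With twice the limit in rows, the fraction is 0.5 and the resulting count is only expected to land near the limit, since random sampling does not return an exact number of rows.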