Write and read Spark stages to/from MLeap instead of Spark classes #475

leahmcguire · 2020-05-05T22:19:38Z

Related issues
Currently, Spark save method is used to serialize and deserialize Spark wrapped stages. This PR changes the underlying serialization to write and read from MLeap bundles.

Describe the proposed solution
Writes to MLeap and reads from MLeap with fallback to trying to read from Spark save.

Describe alternatives you've considered
N/A

Additional context
Next steps will be PR's to read the stages directly with the MLeap context rather than the Spark context for local scoring (and possibly all scoring - to better optimize the DAG)

codecov · 2020-05-05T22:38:18Z

Codecov Report

Merging #475 into master will decrease coverage by 0.30%.
The diff coverage is 70.41%.

@@            Coverage Diff             @@
##           master     #475      +/-   ##
==========================================
- Coverage   87.04%   86.74%   -0.31%     
==========================================
  Files         346      346              
  Lines       11782    11848      +66     
  Branches      385      374      -11     
==========================================
+ Hits        10256    10277      +21     
- Misses       1526     1571      +45

Impacted Files	Coverage Δ
...impl/classification/OpDecisionTreeClassifier.scala	`63.63% <ø> (-7.80%)`	⬇️
...p/stages/impl/classification/OpGBTClassifier.scala	`46.66% <ø> (-8.89%)`	⬇️
...ges/impl/classification/OpLogisticRegression.scala	`56.00% <ø> (-4.72%)`	⬇️
...ssification/OpMultilayerPerceptronClassifier.scala	`60.00% <ø> (-9.24%)`	⬇️
...e/op/stages/impl/classification/OpNaiveBayes.scala	`71.42% <ø> (-8.58%)`	⬇️
...impl/classification/OpRandomForestClassifier.scala	`66.66% <ø> (-5.56%)`	⬇️
...ages/impl/regression/OpDecisionTreeRegressor.scala	`50.00% <ø> (ø)`
...rce/op/stages/impl/regression/OpGBTRegressor.scala	`53.33% <ø> (ø)`
...op/stages/impl/regression/OpLinearRegression.scala	`76.00% <ø> (ø)`
...ages/impl/regression/OpRandomForestRegressor.scala	`50.00% <ø> (ø)`
... and 23 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a350711...38960e9. Read the comment docs.

…mlleapSave

leahmcguire · 2020-08-03T18:32:00Z

@TuanNguyen27 the test that you put in that should have failed on the local XGboost is (correctly) failing in this PR.

TuanNguyen27 · 2020-08-28T17:19:57Z

core/src/main/scala/com/salesforce/op/stages/sparkwrappers/specific/OpPredictorWrapper.scala

 .setParent(this)
 .setInput(in1.asFeatureLike[RealNN], in2.asFeatureLike[OPVector])
 .setMetadata(getMetadata())
 .setOutputFeatureName(getOutputFeatureName)
+
+ if (model.isInstanceOf[XGBoostClassificationModel] || model.isInstanceOf[XGBoostRegressionModel]) {
+ wrappedModel.setOutputDF(model.transform(dataset.limit(1)))


just curious why do we have .limit(1) here ?

we only need one example for the xgboost mleap save (it has a step that calls .first() to get the vector size)

Let's add a comment for it. Looks like this is the only such exception so far.

…mlleapSave

…#502) Making predictive models work with local load of wrapped spark models (#503)

core/src/main/scala/com/salesforce/op/stages/sparkwrappers/specific/OpPredictorWrapper.scala

tovbinm · 2020-09-01T00:15:55Z

core/src/main/scala/com/salesforce/op/stages/impl/regression/OpRandomForestRegressor.scala

@@ -125,4 +127,9 @@ class OpRandomForestRegressionModel
 ttov: TypeTag[Prediction#Value]
 ) extends OpPredictionModel[RandomForestRegressionModel](
 sparkModel = sparkModel, uid = uid, operationName = operationName
-)
+){
+ @transient lazy protected val predict: Vector => Double = getSparkMlStage().map(s => s.predict(_))


This also seems to be a very repetitive pattern. We can add a helper method for it as well.

so the problem with a helper function in this one is there is no shared class for the mleap regressors that contains the predict function. they all implement it but you have to cast them to their specific type to get the predict. Thus a helper function only saves one map and makes it hard to read. I suppose I could use reflection in a shared helper...do you think that is better?

core/src/main/scala/com/salesforce/op/stages/sparkwrappers/specific/OpPredictorWrapper.scala

...n/scala/com/salesforce/op/stages/sparkwrappers/specific/OpProbabilisticClassifierModel.scala

core/src/main/scala/com/salesforce/op/utils/stages/FitStagesUtil.scala

features/build.gradle

features/src/main/scala/com/salesforce/op/stages/SparkStageParam.scala

tovbinm · 2020-09-01T00:31:57Z

local/src/main/scala/com/salesforce/op/local/OpWorkflowModelLocal.scala

- (opStage, sparkStage, i)
+ val mleapStages = stagesWithIndex.filterNot(_._1.isInstanceOf[OpTransformer]).collect {
+ case (opStage: OPStage with SparkWrapperParams[_], i) if opStage.getLocalMlStage().isDefined =>
+ val model = opStage.getLocalMlStage().get


Better pattern match and error gracefully when local model is missing. For example:

opStage.getLocalMlStage() match { case None => throw new RuntimeException("Local model not found for stage ${opStage.uid} of type ${opStage.getClass}") case Some(model) => // Apply model }

tovbinm

lgtm!!

tovbinm · 2020-09-02T18:11:35Z

🥳 🥳 🥳

…asses (#475)" This reverts commit 1040361

koertkuipers · 2020-09-21T22:38:37Z

this seems to have broken some of our inhouse unit tests. in some cases it was because we wrote to relative paths i think. those were easily fixed by making paths absolute. in other situations the paths were absolute and i am unsure why it broke at this point...

stacktraces all have to do with mleap BundleFile on reading and writing. always the same NPE in UnixPath.normalizeAndCheck. for example:

[info]   Cause: java.lang.NullPointerException:
[info]   at sun.nio.fs.UnixPath.normalizeAndCheck(UnixPath.java:77)
[info]   at sun.nio.fs.UnixPath.<init>(UnixPath.java:71)
[info]   at sun.nio.fs.UnixFileSystem.getPath(UnixFileSystem.java:281)
[info]   at ml.combust.bundle.BundleFile$.apply(BundleFile.scala:59)
[info]   at ml.combust.bundle.BundleFile$.apply(BundleFile.scala:40)
[info]   at com.salesforce.op.stages.SparkStageParam.$anonfun$jsonDecodeMleap$1(SparkStageParam.scala:164)
[info]   at resource.DefaultManagedResource.open(AbstractManagedResource.scala:110)
[info]   at resource.AbstractManagedResource.acquireFor(AbstractManagedResource.scala:87)
[info]   at resource.DeferredExtractableManagedResource.either(AbstractManagedResource.scala:29)
[info]   at resource.DeferredExtractableManagedResource.opt(AbstractManagedResource.scala:31)
[info]   at com.salesforce.op.stages.SparkStageParam.jsonDecodeMleap(SparkStageParam.scala:173)
[info]   at com.salesforce.op.stages.SparkStageParam.jsonDecode(SparkStageParam.scala:123)
[info]   at com.salesforce.op.stages.SparkStageParam.jsonDecode(SparkStageParam.scala:55)
[info]   at org.apache.spark.ml.util.DefaultParamsReader$Metadata.$anonfun$setParams$1(ReadWrite.scala:564)
[info]   at scala.collection.immutable.List.foreach(List.scala:392)
[info]   at org.apache.spark.ml.util.DefaultParamsReader$Metadata.setParams(ReadWrite.scala:561)
[info]   at org.apache.spark.ml.util.DefaultParamsReader$Metadata.getAndSetParams(ReadWrite.scala:549)
[info]   at org.apache.spark.ml.SparkDefaultParamsReadWrite$.getAndSetParams(SparkDefaultParamsReadWrite.scala:126)

koertkuipers · 2020-09-21T22:52:17Z

is protobuf 3 going to be an issue on spark/hadoop?

tovbinm · 2020-09-21T23:51:58Z

@koertkuipers Can you please open an issue to track this? Can you also share which transformer / estimator are you using in your workflow?

koertkuipers · 2020-09-23T04:07:11Z

#514

…

On Mon, Sep 21, 2020 at 7:52 PM Matthew Tovbin ***@***.***> wrote: @koertkuipers <https://github.com/koertkuipers> Can you please open an issue to track this? Can you also share which transformer / estimator are you using in your workflow? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#475 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAGIQJE22J6KKLMXXQWMUODSG7RKZANCNFSM4MZ6P4JQ> .

salesforce-cla · 2020-11-03T07:39:18Z

Thanks for the contribution! It looks like @leahmcguire is an internal user so signing the CLA is not required. However, we need to confirm this.

salesforce-cla · 2020-12-18T02:40:13Z

Thanks for the contribution! Unfortunately we can't verify the commit author(s): leahmcguire <l***@s***.com> Leah McGuire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.

got read and write working

1a7c58d

leahmcguire added work in progress DO NOT MERGE labels May 5, 2020

leahmcguire requested review from gerashegalov, Jauntbox, tovbinm and wsuchy as code owners May 5, 2020 22:19

leahmcguire added 8 commits July 9, 2020 08:35

Merge branch 'master' of github.com:salesforce/TransmogrifAI into lm/…

5232aad

…mlleapSave

fixed some of the failing tests

cdca8d7

removed prints

eed3ecd

fixed XGboost mleap save

c832966

fixed some more tests

0c7f515

got all tests running

822c4a1

cleanup

bbae2b7

Merge branch 'master' of github.com:salesforce/TransmogrifAI into lm/…

c27cd61

…mlleapSave

leahmcguire added ready for review and removed DO NOT MERGE work in progress labels Aug 3, 2020

leahmcguire changed the title ~~[WIP] write and read spark stages to/from mleap instead of spark~~ write and read spark stages to/from mleap instead of spark Aug 3, 2020

leahmcguire requested a review from alexandrnikitin August 3, 2020 17:44

removed test that correctly shows xgboost not suported for local scoring

8d9458e

TuanNguyen27 reviewed Aug 28, 2020

View reviewed changes

leahmcguire added 4 commits August 31, 2020 12:01

Merge branch 'master' of github.com:salesforce/TransmogrifAI into lm/…

2c4b4a8

…mlleapSave

load workflows without spark context directly into MLeap transformers (…

9a8da6f

…#502) Making predictive models work with local load of wrapped spark models (#503)

cleanup

deea402

style and part name

3e95e3d

tovbinm changed the title ~~write and read spark stages to/from mleap instead of spark~~ Write and read Spark stages to/from MLeap instead of Spark classes Sep 1, 2020