
Local scoring (aka Sparkless) using Aardpfark #41

Merged
tovbinm merged 41 commits into master from mt/pfa-local on Aug 30, 2018
Conversation

@tovbinm (Collaborator) commented Aug 8, 2018

Describe the proposed solution
Added a subproject that enables loading and scoring models locally, without a Spark context, using the Aardpfark (PFA for Spark ML) and Hadrian libraries instead. This allows orders-of-magnitude faster scoring times compared to Spark.

Describe alternatives you've considered
dbml-local, ml-local and a custom runtime.

Additional Context
Sample usage:

import com.salesforce.op.local._
val model = workflow.loadModel("/path/to/model")
val scoreFn = model.scoreFunction
val score: Map[String, Any] = scoreFn(Map("age" -> 18, "name" -> "Peter"))

Test results (single thread running on a MacBook Pro, i7 3.5 GHz):

Scored 6000000 records in 239s
Average time per record: 0.0399215ms
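As a sanity check, the reported average follows directly from the totals (the small gap to the quoted 0.0399215 ms is presumably per-record timer overhead):

```scala
// Sanity check on the reported throughput: 6,000,000 records in 239 s.
val records = 6000000L
val totalSeconds = 239.0

val msPerRecord = totalSeconds * 1000.0 / records // ~0.0398 ms per record
val recordsPerSec = records / totalSeconds        // ~25,105 records per second

println(f"$msPerRecord%.4f ms/record, $recordsPerSec%.0f records/s")
```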

TODO:

  • add a type to ScoreFunction
  • consider using MLeap.
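For the first TODO item, one hypothetical shape for a typed wrapper, assuming (as in the sample usage above) that ScoreFunction is Map[String, Any] => Map[String, Any]; none of these names are the actual PR API:

```scala
// Hypothetical sketch of a typed score function; not the actual API in this PR.
type ScoreFunction = Map[String, Any] => Map[String, Any]

// Lift an untyped score function to typed input/output via caller-supplied codecs.
def typedScoreFn[I, O](fn: ScoreFunction)(encode: I => Map[String, Any])(decode: Map[String, Any] => O): I => O =
  in => decode(fn(encode(in)))

// Toy demonstration with a dummy "model" that predicts age as a double.
final case class Person(age: Int, name: String)
final case class Prediction(value: Double)

val dummy: ScoreFunction = m => Map("prediction" -> m("age").asInstanceOf[Int].toDouble)
val score: Person => Prediction =
  typedScoreFn(dummy)(p => Map("age" -> p.age, "name" -> p.name))(m => Prediction(m("prediction").asInstanceOf[Double]))
// score(Person(18, "Peter")) returns Prediction(18.0)
```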

@tovbinm tovbinm changed the title Initial draft of PFA based scoring (aka Sparkless) PFA based scoring (aka Sparkless) Aug 8, 2018
@jamesward jamesward closed this Aug 8, 2018
@jamesward jamesward reopened this Aug 8, 2018
@tovbinm tovbinm changed the title PFA based scoring (aka Sparkless) PFA based local scoring (aka Sparkless) Aug 8, 2018
@tovbinm tovbinm changed the title PFA based local scoring (aka Sparkless) Local scoring (aka Sparkless) using Aardpfark Aug 15, 2018
}.head
val vector = r(inputName).asInstanceOf[Vector].toArray
val input = s"""{"$inputName":${vector.mkString("[", ",", "]")}}"""
val res = e.action(e.jsonInput(input)).toString
Collaborator Author
@MLnick is using JSON the most efficient way to call the engine action?
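For context, the engine call in question serializes the feature vector into a JSON string before handing it to Hadrian. A minimal illustration of that wrapping, with "features" as a hypothetical input name (the real name comes from the PFA document):

```scala
// Builds the same kind of JSON payload the snippet above passes to e.jsonInput.
// "features" is a hypothetical input name used only for this illustration.
def toJsonInput(inputName: String, vector: Array[Double]): String =
  s"""{"$inputName":${vector.mkString("[", ",", "]")}}"""

println(toJsonInput("features", Array(0.5, 1.0))) // {"features":[0.5,1.0]}
```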

@codecov

codecov bot commented Aug 25, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@a8eaf4b). Click here to learn what that means.
The diff coverage is 71.73%.


@@            Coverage Diff            @@
##             master      #41   +/-   ##
=========================================
  Coverage          ?   86.14%           
=========================================
  Files             ?      296           
  Lines             ?     9594           
  Branches          ?      319           
=========================================
  Hits              ?     8265           
  Misses            ?     1329           
  Partials          ?        0
Impacted Files Coverage Δ
.../scala/com/salesforce/op/utils/spark/RichRow.scala 16.66% <0%> (ø)
...om/salesforce/op/local/OpWorkflowRunnerLocal.scala 100% <100%> (ø)
...com/salesforce/op/local/OpWorkflowModelLocal.scala 81.08% <81.08%> (ø)

Continue to review full report at Codecov.

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a8eaf4b...903ad95. Read the comment docs.

@manuzhang

Have you looked at https://github.com/combust/mleap ?

@tovbinm
Collaborator Author

tovbinm commented Aug 27, 2018

@manuzhang last time I checked its Spark support wasn't that great, but it seems better now. What's your experience with it?

@tovbinm
Collaborator Author

tovbinm commented Aug 27, 2018

@manuzhang let's chat over https://gitter.im ?

*
* @return a [[collection.mutable.Map]] with row contents
*/
def toMutableMap: collection.mutable.Map[String, Any] = {
Collaborator Author
So you are saying that row.getValuesMap[Any] should work as well? Let me try.

Collaborator Author

OK, so my function is faster because getValuesMap calls def getAs[T](fieldName: String): T = getAs[T](fieldIndex(fieldName)) for each value, while my function operates on indices directly.
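The difference described above can be sketched without Spark: name-based access resolves the field name to an index for every value, while an index-based walk avoids that per-field search. A simplified stand-in for Spark's Row (not the actual class):

```scala
// Simplified model of a Spark Row: a schema (field names) plus a values array.
// Illustrative only; Spark's Row and getValuesMap are the real implementations.
final case class SimpleRow(fieldNames: Array[String], values: Array[Any]) {
  // Name-based access, like getValuesMap: resolves an index for every field.
  def getAs(fieldName: String): Any = values(fieldNames.indexOf(fieldName))
  def valuesMapByName: Map[String, Any] = fieldNames.map(n => n -> getAs(n)).toMap

  // Index-based walk, like the PR's toMutableMap: one pass, no name lookups.
  def valuesMapByIndex: Map[String, Any] =
    fieldNames.iterator.zipWithIndex.map { case (n, i) => n -> values(i) }.toMap
}

val row = SimpleRow(Array("age", "name"), Array(18, "Peter"))
// Both produce Map("age" -> 18, "name" -> "Peter"); only the lookup cost differs.
```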

*/
def score(params: OpParams): ScoreFunction = {
require(params.modelLocation.isDefined, "Model location must be set in params")
val model = workflow.loadModel(params.modelLocation.get)
Collaborator

Will the standard load method work on Spark models that use Parquet storage without a Spark context?

Collaborator Author

None of the Spark ML readers require the context explicitly, but I will need to verify, because they might get/create a Spark context inside. Do you have a model in mind that I can check against?

Collaborator

maybe try PCA



// TODO: remove .map[Text] once Aardpfark supports null inputs for StringIndexer
val indexed = description.map[Text](v => if (v.isEmpty) Text("") else v)
.indexed(handleInvalid = StringIndexerHandleInvalid.Skip)
Collaborator

We should add a Spark model that uses complicated serialization - maybe PCA, since that uses Parquet.

@@ -88,6 +88,8 @@ configure(allProjs) {
commonsValidatorVersion = '1.6'
commonsIOVersion = '2.6'
scoveragePluginVersion = '1.3.1'
hadrianVersion = '0.8.5'
aardpfarkVersion = '0.1.0-SNAPSHOT'
Collaborator
Why are we pulling in a snapshot?

@tovbinm tovbinm merged commit 47e7e37 into master Aug 30, 2018
@tovbinm tovbinm deleted the mt/pfa-local branch August 30, 2018 21:55
4 participants