Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changing blocklist policy for Sequence stages #524

Closed
wants to merge 6 commits into from

Conversation

michaelweilsalesforce
Copy link
Contributor

@michaelweilsalesforce michaelweilsalesforce commented Oct 26, 2020

Related issues
Example : If a SequenceEstimator/Transformer with input features Seq(f1, f2, f3) has a f1 as a blocklist, then, because Seq(f2, f3) and Seq(f1, f2, f3) don't have the same size, the SequenceEstimator/Transformer will be removed when updating the DAG.
However those sequence stages should ignore if one or more original inputs are missing.

Describe the proposed solution
When updating the DAG, sequence stages with updated inputs with length different than 0 will be kept.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
This problem was addressed when we witnessed a result feature being part of the blocklist after updating the DAG. We acknowledge the possibility to change the policy in RawFeatureFilter.

@salesforce-cla
Copy link

Thanks for the contribution! It looks like @mweilsalesforce is an internal user so signing the CLA is not required. However, we need to confirm this.

@codecov
Copy link

codecov bot commented Oct 26, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@13ad9cd). Click here to learn what that means.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master     #524   +/-   ##
=========================================
  Coverage          ?   86.73%           
=========================================
  Files             ?      347           
  Lines             ?    11961           
  Branches          ?      630           
=========================================
  Hits              ?    10374           
  Misses            ?     1587           
  Partials          ?        0           
Impacted Files Coverage Δ
.../src/main/scala/com/salesforce/op/OpWorkflow.scala 88.59% <100.00%> (ø)
...sforce/op/stages/base/unary/UnaryTransformer.scala 100.00% <0.00%> (ø)
...op/stages/impl/tuning/OpTrainValidationSplit.scala 100.00% <0.00%> (ø)
...ala/com/salesforce/op/test/TempDirectoryTest.scala 82.00% <0.00%> (ø)
...esforce/op/stages/impl/CheckIsResponseValues.scala 75.00% <0.00%> (ø)
...ala/com/salesforce/op/utils/io/csv/CSVToAvro.scala 87.87% <0.00%> (ø)
...ala/com/salesforce/op/testkit/RandomIntegral.scala 100.00% <0.00%> (ø)
...com/salesforce/op/test/TestOpWorkflowBuilder.scala 100.00% <0.00%> (ø)
.../salesforce/op/stages/impl/tuning/DataCutter.scala 97.22% <0.00%> (ø)
...com/salesforce/op/testkit/ProbabilityOfEmpty.scala 100.00% <0.00%> (ø)
... and 338 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 13ad9cd...8c2ab5e. Read the comment docs.

@@ -149,7 +150,21 @@ class OpWorkflow(val uid: String = UID[OpWorkflow]) extends OpWorkflowCore {
}
val inputsChanged = blocklistRemoved.map{ f => allUpdated.find(u => u.sameOrigin(f)).getOrElse(f) }
val oldOutput = stg.getOutput()
Try(stg.setInputFeatureArray(inputsChanged).setOutputFeatureName(oldOutput.name).getOutput()) match {
Try(stg match {
case s: SequenceEstimator[_, _] if (inputsChanged.size > 0) => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please create a helper function

@@ -149,7 +150,21 @@ class OpWorkflow(val uid: String = UID[OpWorkflow]) extends OpWorkflowCore {
}
val inputsChanged = blocklistRemoved.map{ f => allUpdated.find(u => u.sameOrigin(f)).getOrElse(f) }
val oldOutput = stg.getOutput()
Try(stg.setInputFeatureArray(inputsChanged).setOutputFeatureName(oldOutput.name).getOutput()) match {
Try(stg match {
case s: SequenceEstimator[_, _] if (inputsChanged.size > 0) => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inputsChanged.size > 0 = inputsChanged.nonEmpty

@michaelweilsalesforce
Copy link
Contributor Author

My code still need some improvement.
However do you folks agree with this change?

@michaelweilsalesforce
Copy link
Contributor Author

@nicodv @Jauntbox Please review. Thanks.

case s @ (_ : SequenceEstimator[_, _] | _ : SequenceTransformer[_, _]) if !inputsChanged.isEmpty => {
s.set(s.inputFeatures, inputsChanged.map(TransientFeature(_))).setOutputFeatureName(oldOutput.name)
// Resetting Metadata
s.setMetadata(new MetadataBuilder().build())
Copy link
Collaborator

@leahmcguire leahmcguire Oct 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you cannot destroy the metadata on a sequence estimator like this - better call the set input and then call

val newMeta = s.getMetadata()
s.setMetadata(newMeta)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't fix my issue. When setting the input a second time, it does not remove the metadata related to features that were blocklisted. This is a problem for stages that rely on column metadata E.g SanityChecker

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the problem is with a specific stage then I would suggest that you fix the getMetadata call to that stage so that it will recreate the metadata with the new input features taken into account - doing this will break everything that relies on metadata to work for example all the model info and feature importance caluculations

Copy link
Contributor

@nicodv nicodv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable, but I don't have good knowledge of the consequences of changing the metadata.

Try(stg.setInputFeatureArray(inputsChanged).setOutputFeatureName(oldOutput.name).getOutput()) match {
Try(stg match {
// For Sequence stages, we still want to keep them and remove the blocklisted inputs
case s @ (_ : SequenceEstimator[_, _] | _ : SequenceTransformer[_, _]) if !inputsChanged.isEmpty => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

!inputsChanged.isEmpty => inputsChanged.nonEmpty

Copy link
Collaborator

@leahmcguire leahmcguire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting the metadata to empty is not the answer - you need to fix the metadata creation on the stage that is not working correctly

@michaelweilsalesforce
Copy link
Contributor Author

Weird. Not able to reproduce the bug anymore. It might have been fixed. Closing PR

@tovbinm tovbinm deleted the mw/SeRFF branch January 18, 2021 05:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants