Avoid creating SparseVectors for LOCO #377

Merged
19 commits merged into salesforce:master on Aug 21, 2019

Conversation


@gerashegalov (Contributor) commented Jul 30, 2019

Related issues
#376

Describe the proposed solution
Reuse the original SparseVector as a mutable template.
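A minimal sketch of the idea, assuming a hypothetical `score` function (the actual implementation lives in RecordInsightsLOCO): instead of allocating a fresh SparseVector for each leave-one-covariate-out pass, mutate one slot of the original vector's values array in place, score, then restore it.

```scala
import org.apache.spark.ml.linalg.SparseVector

// Sketch only: reuse the original SparseVector as a mutable template
// instead of creating a new vector for every feature left out.
def locoDiffs(v: SparseVector, score: SparseVector => Double): Array[Double] = {
  val base = score(v)
  v.values.indices.map { k =>
    val oldVal = v.values(k)
    v.values(k) = 0.0          // leave this covariate out, in place
    val diff = base - score(v)
    v.values(k) = oldVal       // restore the template for the next pass
    diff
  }.toArray
}
```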

Additional context
In a scoring job:

| before | LOCO with record insights | with record insights after PR |
|--------|---------------------------|--------------------------------|
| < 10s  | 168 sec                   | 11 sec                         |

@leahmcguire
Collaborator

Wow!

@codecov

codecov bot commented Jul 30, 2019

Codecov Report

Merging #377 into master will decrease coverage by <.01%.
The diff coverage is 90.24%.

Impacted file tree graph

```diff
@@            Coverage Diff             @@
##           master     #377      +/-   ##
==========================================
- Coverage   86.83%   86.83%   -0.01%
==========================================
  Files         336      336
  Lines       10955    10957       +2
  Branches      347      577     +230
==========================================
+ Hits         9513     9514       +1
- Misses       1442     1443       +1
```
| Impacted Files | Coverage Δ |
|---|---|
| ...ala/com/salesforce/op/utils/spark/RichVector.scala | 84.61% <0%> (-15.39%) ⬇️ |
| ...e/op/stages/impl/insights/RecordInsightsLOCO.scala | 96.77% <100%> (+0.1%) ⬆️ |
| ...m/salesforce/op/evaluators/EvaluationMetrics.scala | 86.66% <0%> (-0.84%) ⬇️ |
| ...op/stages/impl/selector/ModelSelectorSummary.scala | 92.55% <0%> (+0.71%) ⬆️ |
| ...es/src/main/scala/com/salesforce/op/OpParams.scala | 89.79% <0%> (+4.08%) ⬆️ |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dc64b4f...087b1d2. Read the comment docs.

@michaelweilsalesforce
Contributor

Wait, if we apply foreachActive on a dense vector, wouldn't this look at ALL the elements of the vector?

@gerashegalov
Contributor Author

It's still WIP, but I have this guard, `if oldVal != 0.0`, in the pattern match, @michaelweilsalesforce.
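A hedged sketch of that guard (names taken from the comment, not necessarily the merged code): `foreachActive` on a DenseVector visits every slot, including explicit zeros, so the guard filters them out; on a SparseVector it only visits stored entries.

```scala
import org.apache.spark.ml.linalg.Vector

// Sketch: skip explicit zeros when iterating a (possibly dense) vector.
def visitNonZeros(v: Vector)(f: (Int, Double) => Unit): Unit =
  v.foreachActive {
    case (i, oldVal) if oldVal != 0.0 => f(i, oldVal)
    case _ => // dense vectors report explicit zeros; skip them
  }
```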

@tovbinm
Collaborator

tovbinm commented Jul 30, 2019

Neat. What about memory complexity?

@gerashegalov changed the title from "Use DenseVector for o(1) LOCO vector creation" to "Avoid creating SparseVectors for LOCO" on Aug 5, 2019
@gerashegalov
Contributor Author

Reworked the solution to avoid the memory overhead of the dense vector.


```scala
aggregateDiffs(0, Left(featureSparse), indexToExamine, minMaxHeap, aggregationMap,
  baseScore)
```
Collaborator

So for the sparse features you just put in a value of 0? Can't we just skip adding them to the heap?

Contributor Author

I had the same idea, but in one of the iterations I ran into test failures and deferred it for later. I'll recheck now that I have everything green. @michaelweilsalesforce any thoughts?

Contributor

What kind of failures have you encountered?

Collaborator

It may be that we were doing an unnecessary calculation and that just happened to be captured in the test...

Contributor Author

@michaelweilsalesforce you can reproduce it by commenting out lines 171-172.

```
Aggregate all the derived hashing tf features of rawFeature - text. 0.08025355373244505 was not less than 1.0E-10 expected aggregated LOCO value (0.006978569889777832) should be the same as actual (0.08723212362222289)

Aggregate x_HourOfDay and y_HourOfDay of rawFeature - dateFeature. 0.016493734169231777 was not less than 1.0E-10 expected aggregated LOCO value (0.016493734169231777) should be the same as actual (0.032987468338463555)
```

Contributor

@leahmcguire @gerashegalov The reason for tracking zero values is that whenever we average the LOCOs of the same raw text feature, we also include the zero values.
E.g. if text feature TextA has, on a given row, 6 non-zero values loco1, ..., loco6 and 4 zeros, we divide by 10:
(loco1 + loco2 + ... + loco6) / 10
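A tiny worked sketch of that denominator (hypothetical numbers):

```scala
// Hypothetical row for TextA: 6 non-zero LOCO diffs and 4 zero-valued columns.
val locos = Seq(0.2, -0.1, 0.05, 0.3, -0.25, 0.15)
val zeroCount = 4
// Average over all 10 derived columns, not just the 6 non-zero ones.
val aggregated = locos.sum / (locos.size + zeroCount)
```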

Contributor

Let me write a fix that will not go over the zeros.

@gerashegalov
Contributor Author

gerashegalov commented Aug 8, 2019 via email

@michaelweilsalesforce
Contributor

@gerashegalov Here is a proposal that skips the diffs for zero values. The code could be nicer, though.

@gerashegalov
Contributor Author

Thank you, looks good, just a few polishes

```diff
@@ -116,34 +114,28 @@ class RecordInsightsLOCO[T <: Model[T]]
     Set(FeatureType.typeName[DateMap], FeatureType.typeName[DateTimeMap])

   // Indices of features derived from Text(Map)Vectorizer
-  private lazy val textFeatureIndices = getIndicesOfFeatureType(textTypes ++ textMapTypes)
+  private lazy val textFeatureIndices: Seq[Int] = getIndicesOfFeatureType(textTypes ++ textMapTypes,
+    h => h.indicatorValue.isEmpty && h.descriptorValue.isEmpty)
```
Collaborator

Maybe update the comment to indicate we are only getting hashed text values.

```scala
val name = history.parentFeatureOrigins.headOption.map(_ + groupSuffix)

// If the descriptor value of a derived date feature exists, then it is likely to be
// from unit circle transformer. We aggregate such features for each (rawFeatureName, timePeriod).
```
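For context, a minimal sketch of the unit circle encoding discussed below (period and names chosen for illustration): a cyclical time value is mapped to (x, y) coordinates on a circle, which is where derived column names like x_HourOfDay and y_HourOfDay come from.

```scala
import math.{Pi, cos, sin}

// Unit-circle encoding of a cyclical value, e.g. hour of day with period 24,
// so hours 23 and 0 land next to each other instead of far apart.
def toUnitCircle(value: Double, period: Double): (Double, Double) = {
  val angle = 2 * Pi * value / period
  (cos(angle), sin(angle))
}

val (xHourOfDay, yHourOfDay) = toUnitCircle(23.0, 24.0)
```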
Collaborator

This is true now, but may not always be true. If you want this to apply only to date unit circles, you should also check that one of the parentFeatureStages is a DateToUnitCircleTransformer or DateToUnitCircleVectorizer.

Contributor

This check is not consistent: the unit circle transformation in DateMapVectorizer is not reflected in the parentStages (it shows Seq[DateMapVectorizer] instead).
I think the check on the descriptor value is coherent.

Contributor

Or I can check the parentType instead

Collaborator

If this change is explicitly to deal with date features that are transformed to unit circle, then the check needs to be explicitly for that. Otherwise this also applies to lat/lon values (and anything else that we add later), and if we just check the type of the parent, it assumes we will always have a unit circle transformation of dates, which could change at some point...

Contributor

I agree, but as I said above, checking the parentFeatureStages won't work: for instance, DateMapVectorizer may apply the unit circle transformation.

Collaborator

DateMapVectorizer computes days between the reference date and the date. The only two that do unit circle are DateToUnitCircleTransformer and DateToUnitCircleVectorizer.

Contributor

Then there must be a bug in the shortcut: when I add `println(s"name ${history.columnName} stage ${history.parentFeatureStages} descriptor value ${history.descriptorValue}")` I get

```
name dateMapFeature_k0_y_DayOfYear_33 stage ArrayBuffer(vecDateMap_DateMapVectorizer_00000000004c) descriptor value Some(y_DayOfYear)
name dateMapFeature_k1_x_DayOfYear_34 stage ArrayBuffer(vecDateMap_DateMapVectorizer_00000000004c) descriptor value Some(x_DayOfYear)
name dateMapFeature_k1_y_DayOfYear_35 stage ArrayBuffer(vecDateMap_DateMapVectorizer_00000000004c) descriptor value Some(y_DayOfYear)
name dateFeature_x_HourOfDay_0 stage ArrayBuffer() descriptor value Some(x_HourOfDay)
name dateFeature_y_HourOfDay_1 stage ArrayBuffer() descriptor value Some(y_HourOfDay)
```

Those features both use the .vectorize shortcut.

Collaborator

Blarg! You are right, there is a bug in the feature history that means we lose info if the same feature undergoes multiple transformations :-( https://github.com/salesforce/TransmogrifAI/blob/master/features/src/main/scala/com/salesforce/op/utils/spark/OpVectorMetadata.scala#L53

Can you put a TODO to update once the bug is fixed?

```scala
val (i, n) = (indices.head, indices.length)
val zeroCounts = zeroCountByFeature.get(name).getOrElse(0)
val diffToExamine = ar.map(_ / (n + zeroCounts))
minMaxHeap enqueue LOCOValue(i, diffToExamine(indexToExamine), diffToExamine)
```
Collaborator

Wait, so we are aggregating everything into a map, then putting it into a heap, and then just taking it out of the heap? Doesn't that defeat the whole purpose of the heap? Shouldn't we be putting each value into the heap as we calculate it, rather than aggregating the whole thing?

Contributor

We are only aggregating TF and Date features

Collaborator

Ah, OK. Can you add a comment to that effect?

```scala
// Count zeros by feature name
val zeroCountByFeature = zeroValIndices
  .groupBy(i => getRawFeatureName(histories(i)).get)
  .mapValues(_.length).view.toMap
```
Collaborator

What’s the point of .view here?

Contributor Author

To force map materialization in 2.11: `mapValues` is lazy there, and going through `.view` makes the subsequent `toMap` build a strict map.
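A small illustration of that Scala 2.11/2.12 pitfall (hypothetical data): `mapValues` re-runs its function on every lookup, and since its result is already a `Map`, a bare `toMap` just returns it; converting through a view builds a strict map once.

```scala
var evals = 0
val groups = Map("a" -> Seq(1, 2), "b" -> Seq(3))

// Lazy in 2.11/2.12: the function body runs again on every access.
val lazyMap = groups.mapValues { v => evals += 1; v.length }
lazyMap("a")
lazyMap("a")
println(evals) // 2: recomputed on each lookup

// Forcing through .view materializes the values exactly once.
val strictMap = groups.mapValues(_.length).view.toMap
```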

@tovbinm
Collaborator

tovbinm commented Aug 20, 2019

Shall we merge this one?

@leahmcguire merged commit 16ea717 into salesforce:master on Aug 21, 2019
@gerashegalov mentioned this pull request on Sep 8, 2019
@gerashegalov added a commit that referenced this pull request on Sep 11, 2019:
Bug fixes:
- Ensure correct metrics despite model failures on some CV folds [#404](#404)
- Fix flaky `ModelInsight` tests [#395](#395)
- Avoid creating `SparseVector`s for LOCO [#377](#377)

New features / updates:
- Model combiner [#385](#399)
- Added new sample for HousingPrices [#365](#365)
- Test to verify that custom metrics appear in model insight metrics [#387](#387)
- Add `FeatureDistribution` to `SerializationFormat`s [#383](#383)
- Add metadata to `OpStandardScaler` to allow for descaling [#378](#378)
- Improve json serde error in `evalMetFromJson` [#380](#380)
- Track mean & standard deviation as metrics for numeric features and for text length of text features [#354](#354)
- Making model selectors robust to failing models [#372](#372)
- Use compact and compressed model json by default [#375](#375)
- Descale feature contribution for Linear Regression & Logistic Regression [#345](#345)

Dependency updates:   
- Update tika version [#382](#382)
@salesforce-cla

Thanks for the contribution! It looks like @mweilsalesforce is an internal user so signing the CLA is not required. However, we need to confirm this.

@salesforce-cla

Thanks for the contribution! Unfortunately we can't verify the commit author(s): Leah McGuire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.
