Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorporate name detection into SmartTextVectorizer #508

Merged
merged 148 commits into from
Oct 6, 2020

Conversation

Jauntbox
Copy link
Contributor

@Jauntbox Jauntbox commented Sep 3, 2020

Related issues
Re-opening of #456 on branch directly on TransmogrifAI, please see that PR for historical discussion (I'll close once this one gets merged in).

Describe the proposed solution
Adds parameters for ignoring fields identified as personal names to SmartTextVectorizer

Describe alternatives you've considered
N/A

Additional context
N/A

… Still need to fix HLL serialization in Spark issue
…stead of moments of text length; Still need to fix no moments higher than the 1st being calculated
@Jauntbox
Copy link
Contributor Author

Jauntbox commented Sep 3, 2020

Right now, this is just compiling and passing tests so far. I still have to give it a pass to clean things up before it's ready for a real review

@codecov
Copy link

codecov bot commented Sep 3, 2020

Codecov Report

Merging #508 into master will increase coverage by 7.57%.
The diff coverage is 94.11%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #508      +/-   ##
==========================================
+ Coverage   79.15%   86.73%   +7.57%     
==========================================
  Files         347      347              
  Lines       11851    11897      +46     
  Branches      384      602     +218     
==========================================
+ Hits         9381    10319     +938     
+ Misses       2470     1578     -892     
Impacted Files Coverage Δ
.../scala/com/salesforce/op/dsl/RichTextFeature.scala 82.19% <ø> (+12.32%) ⬆️
...main/scala/com/salesforce/op/test/TestCommon.scala 34.61% <0.00%> (-6.30%) ⬇️
...s/impl/feature/OPCollectionHashingVectorizer.scala 96.55% <100.00%> (+15.43%) ⬆️
...p/stages/impl/feature/SmartTextMapVectorizer.scala 100.00% <100.00%> (+2.12%) ⬆️
...e/op/stages/impl/feature/SmartTextVectorizer.scala 95.33% <100.00%> (+0.12%) ⬆️
...m/salesforce/op/utils/stages/NameDetectUtils.scala 89.44% <100.00%> (+89.44%) ⬆️
.../src/main/scala/com/salesforce/op/OpWorkflow.scala 88.19% <0.00%> (+0.69%) ⬆️
...rce/op/stages/impl/preparators/SanityChecker.scala 90.57% <0.00%> (+1.22%) ⬆️
...ala/com/salesforce/op/features/types/package.scala 58.21% <0.00%> (+1.36%) ⬆️
...orce/op/aggregators/MonoidAggregatorDefaults.scala 100.00% <0.00%> (+1.78%) ⬆️
... and 59 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 87eac31...d1ec7f1. Read the comment docs.

// val valueStats: Dataset[Array[TextStats]] = dataset.map(
// _.map(TextStats.computeTextStats(_, shouldCleanText, shouldTokenizeForLengths, maxCard)).toArray
// )
// val aggregatedStats: Array[TextStats] = valueStats.reduce(_ + _)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove commented lines

Copy link
Collaborator

@leahmcguire leahmcguire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just minor stuff - otherwise looks good

import com.salesforce.op.utils.spark.{OpVectorColumnMetadata, OpVectorMetadata}
>>>>>>> 4d461814d1791ef7e6dd14b262cf6a8a303ffcde
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please fix merge conflict

@leahmcguire leahmcguire merged commit 09200c8 into master Oct 6, 2020
@tovbinm tovbinm deleted the my/stv-detect-names branch October 6, 2020 17:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants