Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added string transformers for substring search and valid email #265

Merged
merged 6 commits into from
Apr 5, 2019

Conversation

leahmcguire
Copy link
Collaborator

No description provided.

@codecov
Copy link

codecov bot commented Apr 3, 2019

Codecov Report

Merging #265 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #265      +/-   ##
==========================================
+ Coverage   86.61%   86.61%   +<.01%     
==========================================
  Files         315      317       +2     
  Lines       10345    10375      +30     
  Branches      346      560     +214     
==========================================
+ Hits         8960     8986      +26     
- Misses       1385     1389       +4
Impacted Files Coverage Δ
...esforce/op/stages/impl/feature/TextTokenizer.scala 100% <ø> (ø) ⬆️
...op/stages/impl/feature/ValidEmailTransformer.scala 100% <100%> (ø)
.../scala/com/salesforce/op/dsl/RichTextFeature.scala 85.07% <100%> (+0.45%) ⬆️
...force/op/stages/impl/feature/NGramSimilarity.scala 100% <100%> (ø) ⬆️
.../op/stages/impl/feature/SubstringTransformer.scala 100% <100%> (ø)
...salesforce/op/evaluators/OpBinScoreEvaluator.scala 91.3% <0%> (-5.47%) ⬇️
...p/evaluators/OpBinaryClassificationEvaluator.scala 78.72% <0%> (-3.78%) ⬇️
...es/src/main/scala/com/salesforce/op/OpParams.scala 89.79% <0%> (+4.08%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6bb63ba...95be24d. Read the comment docs.

@codecov
Copy link

codecov bot commented Apr 3, 2019

Codecov Report

Merging #265 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #265      +/-   ##
==========================================
+ Coverage   86.66%   86.67%   +<.01%     
==========================================
  Files         315      317       +2     
  Lines       10396    10403       +7     
  Branches      344      557     +213     
==========================================
+ Hits         9010     9017       +7     
  Misses       1386     1386
Impacted Files Coverage Δ
...esforce/op/stages/impl/feature/TextTokenizer.scala 100% <ø> (ø) ⬆️
...op/stages/impl/feature/ValidEmailTransformer.scala 100% <100%> (ø)
.../scala/com/salesforce/op/dsl/RichTextFeature.scala 85.07% <100%> (+0.45%) ⬆️
...force/op/stages/impl/feature/NGramSimilarity.scala 100% <100%> (ø) ⬆️
.../op/stages/impl/feature/SubstringTransformer.scala 100% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 40b51c7...323e6e0. Read the comment docs.

@@ -16,6 +16,7 @@ Use TransmogrifAI if you need a machine learning library to:
To understand the motivation behind TransmogrifAI check out these:
- [Open Sourcing TransmogrifAI: Automated Machine Learning for Structured Data](https://engineering.salesforce.com/open-sourcing-transmogrifai-4e5d0e098da2), a blog post by [@snabar](https://github.com/snabar)
- [Meet TransmogrifAI, Open Source AutoML That Powers Einstein Predictions](https://www.youtube.com/watch?v=93vsqjfGPCw&feature=youtu.be&t=2800), a talk by [@tovbinm](https://github.com/tovbinm)
- [Low Touch Machine Learning](https://www.youtube.com/watch?v=PKTvo9X9Sjg), a talk by [@leahmcguire](https://github.com/leahmcguire)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

def toNGramSimilarity(
that: FeatureLike[T],
nGramSize: Int = NGramSimilarity.nGramSize,
toLowerCase: Boolean = TextTokenizer.ToLowercase
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please update the doc on the shortcut

class ValidEmailTransformer(uid: String = UID[ValidEmailTransformer]) extends
UnaryTransformer[Email, Binary](operationName = "isValidEmail", uid = uid) {
override def transformFn: Email => Binary = (in: Email) => {
if (in.isEmpty) None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use Binary.empty - this would avoid necessary allocation, ie. if (in.isEmpty) Binary.empty else Some(in.prefix.nonEmpty && in.domain.nonEmpty).toBinary

@RunWith(classOf[JUnitRunner])
class ValidEmailTransformerTest extends OpTransformerSpec[Binary, ValidEmailTransformer] {

val sample = Seq(Email("abc"), Email("a@b"), Email("a@"), Email("@blah"), Email.empty, Email("real@stuff"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

real@stuff does not seem to be a valid domain to me. how robust do we what this validation code to be? perhaps it's worth adding a proper email domain? (long awaited btw ;) - https://github.com/salesforce/TransmogrifAI/blob/master/features/src/main/scala/com/salesforce/op/features/types/Text.scala#L85

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that you can actually have an email domain like that. The only way we could really validate is if we used a service that checked if the inbox exists....

@tovbinm tovbinm merged commit 394b4bd into master Apr 5, 2019
@tovbinm tovbinm deleted the lm/newTransformers branch April 5, 2019 22:39
@tovbinm tovbinm mentioned this pull request Apr 10, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants