Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable Html stripping #478

Merged
merged 30 commits into from
Jun 10, 2020
Merged

Enable Html stripping #478

merged 30 commits into from
Jun 10, 2020

Conversation

michaelweilsalesforce
Copy link
Contributor

@michaelweilsalesforce michaelweilsalesforce commented May 26, 2020

Related issues

When engineering features from a Text (and Text-like) raw features, we should strip the text of any html tags, which doesn't add signal to existing tokens (and even pollutes them).

Describe the proposed solution

Enable html stripping via TextTokenizer.AnalyzerHtmlStrip

@michaelweilsalesforce
Copy link
Contributor Author

This PR doesn't introduce options yet

@codecov
Copy link

codecov bot commented May 26, 2020

Codecov Report

Merging #478 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff            @@
##           master     #478    +/-   ##
========================================
  Coverage   87.00%   87.01%            
========================================
  Files         345      345            
  Lines       11673    11680     +7     
  Branches      388      613   +225     
========================================
+ Hits        10156    10163     +7     
  Misses       1517     1517            
Impacted Files Coverage Δ
...n/scala/com/salesforce/op/dsl/RichMapFeature.scala 67.64% <ø> (ø)
.../scala/com/salesforce/op/dsl/RichTextFeature.scala 82.19% <100.00%> (+0.24%) ⬆️
...p/stages/impl/feature/SmartTextMapVectorizer.scala 100.00% <100.00%> (ø)
...e/op/stages/impl/feature/SmartTextVectorizer.scala 95.20% <100.00%> (+0.03%) ⬆️
...esforce/op/stages/impl/feature/TextTokenizer.scala 97.36% <100.00%> (+0.14%) ⬆️
...sforce/op/stages/impl/feature/Transmogrifier.scala 98.05% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eba38a0...97b9ce8. Read the comment docs.

val rowHashTokenized = rowHash.map(_.value.map { case (k, v) => k -> tokenize(text = v.toText,
analyzer = TextTokenizer.AnalyzerHtmlStrip).tokens })
val rowIgnoreTokenized = rowIgnore.map(_.value.map { case (k, v) => k -> tokenize(text = v.toText,
analyzer = TextTokenizer.AnalyzerHtmlStrip).tokens })
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than just changing the default please make this a settable parameter

val textTokens: Seq[TextList] = textToHash.map(t => tokenize(text = t,
analyzer = TextTokenizer.AnalyzerHtmlStrip).tokens)
val ignorableTextTokens: Seq[TextList] = textToIgnore.map(t => tokenize(text = t,
analyzer = TextTokenizer.AnalyzerHtmlStrip).tokens)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, make this settable

@@ -53,7 +53,7 @@ private[op] trait TransmogrifierDefaults {
val NullString: String = OpVectorColumnMetadata.NullString
val OtherString: String = OpVectorColumnMetadata.OtherString
val DefaultNumOfFeatures: Int = 512
val MaxNumOfFeatures: Int = 16384
val MaxNumOfFeatures: Int = 1638400000
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the above number is 2^14 please choose a number in bytes

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we don't need to change this number, because i lifted this limit in #477

val maxCard = 100
val vectorizer = new SmartTextMapVectorizer().setCoveragePct(1.0 - 1e-10).setMaxCardinality(maxCard)
.setMinSupport(1)
val vectorizer = new SmartTextMapVectorizer().setCoveragePct(1.0).setMaxCardinality(maxCard).setMinSupport(1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these seems to defeat the purpose of the test

@TuanNguyen27
Copy link
Collaborator

@leahmcguire @Jauntbox could you take a look at this PR ? I'm not sure how to test my changes :(

@@ -53,7 +53,7 @@ private[op] trait TransmogrifierDefaults {
val NullString: String = OpVectorColumnMetadata.NullString
val OtherString: String = OpVectorColumnMetadata.OtherString
val DefaultNumOfFeatures: Int = 512
val MaxNumOfFeatures: Int = 16384
val MaxNumOfFeatures: Int = 131072 // 2^17
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if 2^17 is important we can set it explicitly 1 << 17 instead of a comment

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not really important...i used 2^17 because @leahmcguire wanted it to be a power of 2.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well 1 << n is a way to demonstrate that you are definitely using a power of 2

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, i'll change it to 1 << n then, @leahmcguire @Jauntbox any objection ?

@@ -137,6 +147,39 @@ class SmartTextVectorizerTest
smart shouldBe expected
}

it should "detect one categorical and one non-categorical text feature with html data" in {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the test is lagerly copy&paste. Can we make it a function or a behavior https://www.scalatest.org/user_guide/sharing_tests

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure this is the right way to test this function either... @mweilsalesforce @Jauntbox @leahmcguire any advice ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

smartVector -> combined
}.unzip

println(smart)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let us remove println statements

@gerashegalov
Copy link
Contributor

Let us actually fill out the form for the PR description to set the context :)

): FeatureLike[TextList] = {

// html stripping won't work here due since LuceneRegexTextAnalyzer
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if HTML stripping will not work then please dont put it as an input

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done !

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove the set call too

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the set call actually belongs to SmartTextVectorizer so it's separate from tokenizeRegex (which i've removed htmlStripping flag)

Copy link
Collaborator

@leahmcguire leahmcguire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor param change and then LGTM

Copy link
Collaborator

@leahmcguire leahmcguire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just remove the set call on the shortcut that doesnt actually support HTML stripping an then LGTM

@TuanNguyen27 TuanNguyen27 merged commit e48831a into master Jun 10, 2020
TuanNguyen27 added a commit that referenced this pull request Jun 10, 2020
@nicodv nicodv mentioned this pull request Jun 11, 2020
@tovbinm tovbinm deleted the HTMLStrip branch June 12, 2020 01:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants