Added Chinese and Korean examples to TextTokenizerTest #442

Jauntbox · 2019-11-25T23:49:46Z

Related issues
n/a

Describe the proposed solution
n/a

Describe alternatives you've considered
n/a

Additional context
This is a small change that makes it easier to test alternatives to the CJK tokenizer (which we have already replaced for Japanese). The CJK tokenizer produces bigrams rather than trying to extract words, so most tokens from a text sample have length 2 (not all, since other languages can be mixed in). Some of the simpler ID detection calculations look at the distribution of token lengths, so they may incorrectly classify text in languages that use the CJK tokenizer as IDs.
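
To make the concern concrete, here is a minimal Python sketch of the effect. It is an illustration only, not the tokenizer or the ID detection code used in this project: `char_bigrams`, `token_length_counts`, `looks_like_ids`, and the 0.9 threshold are all hypothetical names and values chosen for this example, and the bigram function is a simplification that treats the whole input as one run of CJK characters.

```python
from collections import Counter


def char_bigrams(text):
    """Overlapping character bigrams over non-whitespace characters.

    Simplified stand-in for bigram-style CJK tokenization (assumption:
    the whole input is a single run of CJK characters).
    """
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def token_length_counts(tokens):
    """Count how many tokens there are of each length."""
    return Counter(len(t) for t in tokens)


def looks_like_ids(tokens, threshold=0.9):
    """Hypothetical heuristic: flag text when a single token length
    accounts for at least `threshold` of all tokens."""
    if not tokens:
        return False
    counts = token_length_counts(tokens)
    return max(counts.values()) / len(tokens) >= threshold


chinese_tokens = char_bigrams("我喜欢机器学习")
english_tokens = "the quick brown fox jumps".split()

print(token_length_counts(chinese_tokens))  # every bigram has length 2
print(looks_like_ids(chinese_tokens))       # flagged by the length heuristic
print(looks_like_ids(english_tokens))       # varied lengths, not flagged
```

Under these assumptions, ordinary Chinese prose is flagged because every token has the same length, while whitespace-tokenized English is not. A word-level tokenizer would produce more varied token lengths and avoid the false positive.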

tovbinm

Lgtm

codecov · 2019-11-26T00:14:26Z

Codecov Report

Merging #442 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #442   +/-   ##
=======================================
  Coverage   86.93%   86.93%           
=======================================
  Files         337      337           
  Lines       11096    11096           
  Branches      362      362           
=======================================
  Hits         9646     9646           
  Misses       1450     1450

Powered by Codecov. Last update e45073d...9f67367.

gerashegalov

LGTM

leahmcguire

LGTM

tovbinm · 2019-12-03T19:36:46Z

LOCO test is failing @sanmitra @Jauntbox

sanmitra · 2019-12-03T19:56:38Z

@tovbinm The LOCO test, com.salesforce.op.stages.impl.insights.RecordInsightsLOCOTest, is succeeding. Where exactly are you seeing the LOCO test failure?

tovbinm · 2019-12-03T20:15:13Z

It’s a flaky one. See previous runs.

Added Chinese and Korean examples to TextTokenizerTest

9f67367

Jauntbox requested review from gerashegalov, leahmcguire, tovbinm and wsuchy as code owners November 25, 2019 23:49

tovbinm approved these changes Nov 26, 2019

gerashegalov approved these changes Nov 26, 2019

leahmcguire approved these changes Dec 3, 2019

tovbinm merged commit 9778481 into master Dec 3, 2019

tovbinm deleted the km/token-lens branch December 3, 2019 20:15

nicodv mentioned this pull request Jun 11, 2020

0.7.0 release #481

Merged
