
Allows adding customized rules to the ICU tokenizer #2165

Merged

Conversation

odelmarcelle
Collaborator

In #896, using stringi's RBBI was considered as a way to improve the tokenization of URLs and tags (following gagolews/stringi#263). I believe that RBBI rules are also useful for users, as they provide an elegant way to slightly adjust the tokenizer without having to reach for other packages.

In this branch, I implement a new function customized_tokenizer() that essentially constructs a brand-new rule-based break iterator for stringi::stri_split_boundaries(). The result of this function is another function, designed to be used as the what argument of tokens().

For example, it allows dealing with elisions (#1610):

library(quanteda)
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- c(doc = "I've been sick today, I may go to the hospital.",
         doc_fr = "J'ai été malade aujourd'hui, je vais aller à l'hôpital.")
tokens(txt, what = customized_tokenizer())
#> Tokens consisting of 2 documents.
#> doc :
#>  [1] "I've"     "been"     "sick"     "today"    ","        "I"       
#>  [7] "may"      "go"       "to"       "the"      "hospital" "."       
#> 
#> doc_fr :
#>  [1] "J'ai"        "été"         "malade"      "aujourd'hui" ","          
#>  [6] "je"          "vais"        "aller"       "à"           "l'hôpital"  
#> [11] "."

## Implement a custom elision rule for French
Elision_french <- "
# An elided article, pronoun, or conjunction (l', d', qu', jusqu', puisqu', ...)
# followed by a straight or curly apostrophe (U+0027 or U+2019).
$Elision = ([lLmMtTnNsSjJdDcC]|([jJ][u][s]|[qQ][u][o][i]|[lL][o][r][s]|[pP][u][i][s])?[qQ][u])[\u0027\u2019];
# Disable chaining so that the rule only matches at the beginning of a word.
^$Elision / $ALetterPlus;
"

tokens(txt, what = customized_tokenizer(custom_rules = Elision_french))
#> Tokens consisting of 2 documents.
#> doc :
#>  [1] "I've"     "been"     "sick"     "today"    ","        "I"       
#>  [7] "may"      "go"       "to"       "the"      "hospital" "."       
#> 
#> doc_fr :
#>  [1] "J'"          "ai"          "été"         "malade"      "aujourd'hui"
#>  [6] ","           "je"          "vais"        "aller"       "à"          
#> [11] "l'"          "hôpital"    
#> [ ... and 1 more ]

The implementation is rather naive and can certainly be enhanced. customized_tokenizer() has three basic settings (a usage sketch follows this list):

  • "ICU_word", fully based on the word-RBBI (and skipping preserve_special()),
  • "word", a hybrid solution between the custom word-RBBI rules and the what = "word" tokenization (does not skip preserve_special())
  • "sentence", fully based on the sentence-RBBI

I've re-implemented the custom rules for hyphens, URLs, and tags in the RBBI to stay as closely aligned as possible with the default what = "word" tokenization. There are a few differences, however (highlighted in the test file):

  • I implemented a URL pattern that does not break addresses starting with "www."
  • A hyphen-space combination is broken into two components (due to a stricter hyphenation rule)

See the following illustration:

library(quanteda)
txt <- c("www.r-project.org/about.html", "sci- fi sci-fi")
tokens(txt)
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "www.r-project.org" "/"                 "about.html"       
#> 
#> text2 :
#> [1] "sci-"   "fi"     "sci-fi"
tokens(txt, what = customized_tokenizer())
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "www.r-project.org/about.html"
#> 
#> text2 :
#> [1] "sci"    "-"      "fi"     "sci-fi"

If needed, it is also possible to align these two behaviours with the baseline tokenization.

A current limitation:

  • The RBBI hashtag rule currently does not rely on quanteda_options("pattern_hashtag") (see the note below).
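
For reference, the option the rule would ideally follow can be inspected as below; a future revision could interpolate it into the RBBI rule set (just a sketch of the idea, not something implemented in this branch):

library(quanteda)
quanteda_options("pattern_hashtag")  # the hashtag pattern that the non-RBBI rules rely on (per the limitation above)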

Performance-wise, skipping preserve_special() does improve the speed of tokenization a bit. See this benchmark:

library(quanteda)
data(data_corpus_sotu, package = "quanteda.corpora")
data_corpus_sotu <- as.corpus(data_corpus_sotu)

microbenchmark::microbenchmark(
  vanilla = tokens(data_corpus_sotu, what = "word"),
  customized = tokens(data_corpus_sotu, what = customized_tokenizer()),
  times = 10
)
#> Unit: milliseconds
#>        expr       min        lq      mean    median        uq      max neval
#>     vanilla 2949.3061 3046.1964 3047.8255 3068.1595 3071.3867 3087.762    10
#>  customized  776.3383  781.8884  885.1196  895.2472  903.4933 1111.633    10

Let me know what you think of this feature, I thought it could be a nice addition to quanteda!

@codecov

codecov bot commented Mar 14, 2022

Codecov Report

Base: 96.32% // Head: 96.29% // Decreases project coverage by -0.02% ⚠️

Coverage data is based on head (4d6dbd1) compared to base (e7682e0).
Patch coverage: 93.33% of modified lines in pull request are covered.

❗ Current head 4d6dbd1 differs from pull request most recent head c78f453. Consider uploading reports for the commit c78f453 to get more accurate results

Additional details and impacted files
@@             Coverage Diff              @@
##           dev-rbbi    #2165      +/-   ##
============================================
- Coverage     96.32%   96.29%   -0.03%     
============================================
  Files            87       87              
  Lines          5064     5105      +41     
============================================
+ Hits           4878     4916      +38     
- Misses          186      189       +3     
Impacted Files    Coverage Δ
R/tokenizers.R    96.80% <89.28%> (-2.17%) ⬇️
R/tokens.R        100.00% <100.00%> (ø)


@koheiw
Collaborator

koheiw commented Mar 20, 2022

Thank you for the very interesting PR. This is great.

We need to spend enough time thinking about the best approach before merging to master. Can you issue a PR to merge yours into dev-rbbi?

@odelmarcelle odelmarcelle changed the base branch from master to dev-rbbi March 20, 2022 10:08
@odelmarcelle
Collaborator Author

odelmarcelle commented Mar 20, 2022

I updated the target branch of this pull request; you should be able to merge now.

@kbenoit
Collaborator

kbenoit commented Mar 20, 2022

Fully agreed with @koheiw. Thanks @odelmarcelle, this is great. I have been slow in replying because I'm just getting over COVID, but we will review this thoroughly soon.

@koheiw koheiw requested a review from kbenoit March 21, 2022 08:39
Collaborator

@kbenoit kbenoit left a comment


@odelmarcelle @koheiw I am so sorry for letting this get stale! COVID, then the death of a family member, and then the summer (summer school) made me completely forget about this PR. It's a great contribution and I want to tidy it up and merge it asap. I'm on it now.

@koheiw what do you think about changing the default tokeniser? We could use (the new) word or ICU_word as the new default, and make the existing word into word2.

@odelmarcelle On the PR in your fork, can you select "Allow edits from maintainers" so that I can tweak a few things? A few issues:

  • considering changes to the default tokeniser

  • reimplementing the tokeniser as a character label, not a function, although... @koheiw we could consider changing this to a function. For backward compatibility we could still allow character labels. The existing documentation refers to a label, which calls the corresponding function in tokenizers.R, yet here it's a function. An argument in favour of making it a function is that it could then be standalone, providing an alternative tokeniser that could be used as, e.g.,

customised_tokenizer(txt) |>
    as.tokens()

the same way that spacyr::spacy_tokenize() can already be used as an input (see the sketch after this list).

  • I'll add a few more tests
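
As a point of comparison, here is a minimal sketch of the existing spacyr workflow mentioned above (it assumes spaCy is installed and available to spacyr):

library(quanteda)
library(spacyr)
spacy_initialize()  # requires a working spaCy installation

txt <- c(doc1 = "I've been sick today, I may go to the hospital.")
spacy_tokenize(txt) |>
    as.tokens()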

@odelmarcelle odelmarcelle changed the base branch from dev-rbbi to master December 8, 2022 14:50
@odelmarcelle
Collaborator Author

@kbenoit No worries about the delay. I think I've enabled "Allow edits from maintainers". Are you able to change anything?

I quickly merged the latest version of master into the fork.

Regarding your comments:

  • My original intent was to leave the behavior of the current tokenizer untouched. I agree that the performance increase makes it a contender for the default tokenizer. However, more testing is needed to ensure that changes for users remain minimal. I think the two differences I highlighted above (regarding URLs and hyphens) would be an upgrade, but there might be side effects. If you'd like to replace the current default tokenizer, the objective of this PR becomes quite different.
  • My reasoning for implementing customized_tokenizer as a function was that current tokenizers are already implemented as functions (for example, tokenize_word). But in the case of customized_tokenizer, the tokenizing function doesn't exist (yet). Calling customized_tokenizer() acts as a factory and creates the tokenizing function on the fly. The correct parallel with existing tokenizers would be:
tokenize_word(txt) |> as.tokens()

tokenize_custom <- customized_tokenizer()
tokenize_custom(txt) |> as.tokens()

I thought the factory approach would be cleaner than adding extra arguments such as custom_rules to tokens(). I agree that renaming the function to something like create_tokenizer() would make it more explicit.

  • For the tests, it depends on how you'd like to implement this (see the first remark). If the goal is to replace the default tokenizer, it makes sense to map what = "word" to an instance of customized_tokenizer() and run the existing test suite; a sketch of such an equivalence check is below. I agree that some additional tests on custom_rules inputs might be useful.
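
A minimal sketch of that equivalence check, assuming the default customized_tokenizer() is expected to reproduce what = "word" on plain text without URLs or hyphens:

library(quanteda)
library(testthat)

txt <- c(d1 = "Testing one, two, three.", d2 = "Another short document.")

test_that("default customized_tokenizer() matches what = 'word' on plain text", {
  expect_identical(
    as.list(tokens(txt, what = customized_tokenizer())),
    as.list(tokens(txt, what = "word"))
  )
})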

koheiw added a commit that referenced this pull request Mar 18, 2023
@koheiw koheiw changed the base branch from master to customized_tokenizer March 19, 2023 22:14
@koheiw koheiw merged commit 6c4a7ee into quanteda:customized_tokenizer Mar 19, 2023
@koheiw
Collaborator

koheiw commented Mar 19, 2023

I merged this PR to keep your branch in this repository. Let's develop further in #2216.
