
Allows adding customized rules to the ICU tokenizer #2165

Merged

Conversation

odelmarcelle
Collaborator

In #896, using stringi's RBBI was considered as a way to improve the tokenization of URLs and tags (following gagolews/stringi#263). I believe that RBBI rules are also useful for users, as they provide an elegant way to slightly adjust the tokenizer without having to reach for other packages.

In this branch, I implement a new function customized_tokenizer() that essentially constructs a brand-new rule-based break iterator for stringi::stri_split_boundaries(). The result of this function is another function, designed to be used as the what argument of tokens().

For example, it allows dealing with elisions (#1610):

library(quanteda)
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- c(doc = "I've been sick today, I may go to the hospital.",
         doc_fr = "J'ai été malade aujourd'hui, je vais aller à l'hôpital.")
tokens(txt, what = customized_tokenizer())
#> Tokens consisting of 2 documents.
#> doc :
#>  [1] "I've"     "been"     "sick"     "today"    ","        "I"       
#>  [7] "may"      "go"       "to"       "the"      "hospital" "."       
#> 
#> doc_fr :
#>  [1] "J'ai"        "été"         "malade"      "aujourd'hui" ","          
#>  [6] "je"          "vais"        "aller"       "à"           "l'hôpital"  
#> [11] "."

## Implement a custom elision rule for French
Elision_french <- "
# An elided article, pronoun, or conjunction (l', d', qu', jusqu', puisqu', ...)
# followed by a straight or curly apostrophe (U+0027 or U+2019).
$Elision = ([lLmMtTnNsSjJdDcC]|([jJ][u][s]|[qQ][u][o][i]|[lL][o][r][s]|[pP][u][i][s])?[qQ][u])[\u0027\u2019];
# Disable chaining so that the rule only matches at the beginning of a word.
^$Elision / $ALetterPlus;
"

tokens(txt, what = customized_tokenizer(custom_rules = Elision_french))
#> Tokens consisting of 2 documents.
#> doc :
#>  [1] "I've"     "been"     "sick"     "today"    ","        "I"       
#>  [7] "may"      "go"       "to"       "the"      "hospital" "."       
#> 
#> doc_fr :
#>  [1] "J'"          "ai"          "été"         "malade"      "aujourd'hui"
#>  [6] ","           "je"          "vais"        "aller"       "à"          
#> [11] "l'"          "hôpital"    
#> [ ... and 1 more ]

The implementation is rather naive and can certainly be enhanced. customized_tokenizer() has three basic settings (a usage sketch follows this list):

  • "ICU_word", fully based on the word-RBBI (and skipping preserve_special()),
  • "word", a hybrid solution between the custom word-RBBI rules and the what = "word" tokenization (does not skip preserve_special())
  • "sentence", fully based on the sentence-RBBI

I've re-implemented the custom rules for hyphens, URLs, and tags in the RBBI to stay as closely aligned as possible with the default what = "word" tokenization. There are a few differences, however (highlighted in the test file):

  • I implemented a URL pattern that does not break addresses starting with "www."
  • A hyphen-space combination is broken into two components (due to a stricter hyphenation rule)

See the following illustration:

library(quanteda)
txt <- c("www.r-project.org/about.html", "sci- fi sci-fi")
tokens(txt)
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "www.r-project.org" "/"                 "about.html"       
#> 
#> text2 :
#> [1] "sci-"   "fi"     "sci-fi"
tokens(txt, what = customized_tokenizer())
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "www.r-project.org/about.html"
#> 
#> text2 :
#> [1] "sci"    "-"      "fi"     "sci-fi"

If needed, it is also possible to align these two behaviours with the baseline tokenization.

A current limitation:

  • The RBBI hashtag rule currently does not rely on quanteda_options("pattern_hashtag") (see the note below).
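
For reference, the option the rule would ideally follow can be inspected as below; a future revision could interpolate it into the RBBI rule set (just a sketch of the idea, not something implemented in this branch):

library(quanteda)
quanteda_options("pattern_hashtag")  # the hashtag pattern that the non-RBBI rules rely on (per the limitation above)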

Performance-wise, skipping preserve_special() does improve the speed of tokenization a bit. See this benchmark:

library(quanteda)
data(data_corpus_sotu, package = "quanteda.corpora")
data_corpus_sotu <- as.corpus(data_corpus_sotu)

microbenchmark::microbenchmark(
  vanilla = tokens(data_corpus_sotu, what = "word"),
  customized = tokens(data_corpus_sotu, what = customized_tokenizer()),
  times = 10
)
#> Unit: milliseconds
#>        expr       min        lq      mean    median        uq      max neval
#>     vanilla 2949.3061 3046.1964 3047.8255 3068.1595 3071.3867 3087.762    10
#>  customized  776.3383  781.8884  885.1196  895.2472  903.4933 1111.633    10

Let me know what you think of this feature, I thought it could be a nice addition to quanteda!

@codecov

codecov bot commented Mar 14, 2022

Codecov Report

Base: 96.32% // Head: 96.29% // Decreases project coverage by -0.02% ⚠️

Coverage data is based on head (4d6dbd1) compared to base (e7682e0).
Patch coverage: 93.33% of modified lines in pull request are covered.

❗ Current head 4d6dbd1 differs from pull request most recent head c78f453. Consider uploading reports for the commit c78f453 to get more accurate results

Additional details and impacted files
@@             Coverage Diff              @@
##           dev-rbbi    #2165      +/-   ##
============================================
- Coverage     96.32%   96.29%   -0.03%     
============================================
  Files            87       87              
  Lines          5064     5105      +41     
============================================
+ Hits           4878     4916      +38     
- Misses          186      189       +3     
Impacted Files    Coverage Δ
R/tokenizers.R    96.80% <89.28%> (-2.17%) ⬇️
R/tokens.R        100.00% <100.00%> (ø)


@koheiw
Collaborator

koheiw commented Mar 20, 2022

Thank you for the very interesting PR. This is great.

We need to spend enough time thinking about the best approach before merging to master. Can you issue a PR to merge yours into dev-rbbi?

@odelmarcelle odelmarcelle changed the base branch from master to dev-rbbi March 20, 2022 10:08
@odelmarcelle
Collaborator Author

odelmarcelle commented Mar 20, 2022

I updated the target branch of this pull request; you should be able to merge now.

@kbenoit
Collaborator

kbenoit commented Mar 20, 2022

Fully agreed with @koheiw. Thanks @odelmarcelle, this is great. I have been slow in replying because I'm just getting over COVID, but we will review this thoroughly soon.

@koheiw koheiw requested a review from kbenoit March 21, 2022 08:39
Collaborator

@kbenoit kbenoit left a comment


@odelmarcelle @koheiw I am so sorry for letting this get stale! COVID, then the death of a family member, and then the summer (summer school) made me completely forget about this PR. It's a great contribution and I want to tidy it up and merge it asap. I'm on it now.

@koheiw what do you think about changing the default tokeniser? We could use (the new) word or ICU_word as the new default, and make the existing word into word2.

@odelmarcelle On the PR in your fork, can you select "Allow edits from maintainers" so that I can tweak a few things? A few issues:

  • considering changes to the default tokeniser

  • reimplementing the tokeniser as a character label, not a function, although... @koheiw we could consider changing this to a function. For backward compatibility we could still allow character labels. The existing documentation refers to a label, which calls the corresponding function in tokenizers.R, yet here it's a function. An argument in favour of making it a function is that it could then be standalone, providing an alternative tokeniser that could be used as, e.g.,

customised_tokenizer(txt) |>
    as.tokens()

the same way that spacyr::spacy_tokenize() can already be used as an input (see the sketch after this list).

  • I'll add a few more tests
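
As a point of comparison, here is a minimal sketch of the existing spacyr workflow mentioned above (it assumes spaCy is installed and available to spacyr):

library(quanteda)
library(spacyr)
spacy_initialize()  # requires a working spaCy installation

txt <- c(doc1 = "I've been sick today, I may go to the hospital.")
spacy_tokenize(txt) |>
    as.tokens()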

@odelmarcelle odelmarcelle changed the base branch from dev-rbbi to master December 8, 2022 14:50
@odelmarcelle
Collaborator Author

@kbenoit No worries about the delay. I think I've enabled "Allow edits from maintainers". Are you able to change anything?

I quickly merged the latest version of master into the fork.

Regarding your comments:

  • My original intent was to leave the behavior of the current tokenizer untouched. I agree that the performance increase makes it a contender for the default tokenizer. However, more testing is needed to ensure that changes for users remain minimal. I think the two differences I highlighted above (regarding URLs and hyphens) would be an upgrade, but there might be side effects. If you'd like to replace the current default tokenizer, the objective of this PR becomes quite different.
  • My reasoning for implementing customized_tokenizer as a function was that current tokenizers are already implemented as functions (for example, tokenize_word). But in the case of customized_tokenizer, the tokenizing function doesn't exist (yet). Calling customized_tokenizer() acts as a factory and creates the tokenizing function on the fly. The correct parallel with existing tokenizers would be:
tokenize_word(txt) |> as.tokens()

tokenize_custom <- customized_tokenizer()
tokenize_custom(txt) |> as.tokens()

I thought the factory approach would be cleaner than adding extra arguments such as custom_rules to tokens(). I agree that renaming the function to something like create_tokenizer() would make it more explicit.

  • For the tests, it depends on how you'd like to implement this (see the first remark). If the goal is to replace the default tokenizer, it makes sense to map what = "word" to an instance of customized_tokenizer() and run the existing test suite; a sketch of such an equivalence check is below. I agree that some additional tests on custom_rules inputs might be useful.
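
A minimal sketch of that equivalence check, assuming the default customized_tokenizer() is expected to reproduce what = "word" on plain text without URLs or hyphens:

library(quanteda)
library(testthat)

txt <- c(d1 = "Testing one, two, three.", d2 = "Another short document.")

test_that("default customized_tokenizer() matches what = 'word' on plain text", {
  expect_identical(
    as.list(tokens(txt, what = customized_tokenizer())),
    as.list(tokens(txt, what = "word"))
  )
})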

koheiw added a commit that referenced this pull request Mar 18, 2023
@koheiw koheiw changed the base branch from master to customized_tokenizer March 19, 2023 22:14
@koheiw koheiw merged commit 6c4a7ee into quanteda:customized_tokenizer Mar 19, 2023
@koheiw
Collaborator

koheiw commented Mar 19, 2023

I merged this PR to keep your branch in this repository. Let's develop further in #2216.
