Allows adding customized rules to the ICU tokenizer #2165
Conversation
Codecov Report
Base: 96.32% // Head: 96.29% // Decreases project coverage by -0.03%.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           dev-rbbi    #2165    +/- ##
============================================
- Coverage     96.32%   96.29%   -0.03%
============================================
  Files            87       87
  Lines          5064     5105      +41
============================================
+ Hits           4878     4916      +38
- Misses          186      189       +3
```
Thank you for the very interesting PR. This is great. We need to spend enough time thinking about the best approach before merging to master. Can you issue a PR to merge yours into `dev-rbbi`?
I updated the target branch of this pull request; you should be able to merge.
Fully agreed with @koheiw; thanks @odelmarcelle, this is great. I have been slow in replying because I'm just getting over COVID, but we will review this thoroughly soon.
@odelmarcelle @koheiw I am so sorry for letting this get stale! COVID, then the death of a family member, then the summer (summer school) made me completely forget about this PR. It's a great contribution and I want to tidy it up and merge it ASAP. I'm on it now.
@koheiw what do you think about changing the default tokeniser? We could use (the new) `word` or `ICU_word` as the new default, and make the existing `word` into `word2`.
@odelmarcelle On the PR in your fork, can you select "Allow edits from maintainers" so that I can tweak a few things? A few issues:

- considering changes to the default tokeniser
- reimplementing the tokeniser as a character label, not a function, although... @koheiw we could consider changing this to a function. For backward compatibility we could still allow character labels. The existing documentation refers to a label, which calls the corresponding function in tokenizers.R, yet here it's a function. An argument in favour of making it a function is that it could then be standalone, providing an alternative tokeniser that could work as e.g.

  ```r
  customised_tokenizer(txt) |>
    as.tokens()
  ```

  the same way that `spacyr::spacy_tokenize()` can be used as an input already.
- I'll add a few more tests
@kbenoit No worries about the delay. I think I've enabled "Allow edits from maintainers"; are you able to change anything? I quickly merged the latest version of master into the fork. Regarding your comments:

```r
tokenize_word(txt) |> as.tokens()

tokenize_custom <- customized_tokenizer()
tokenize_custom(txt) |> as.tokens()
```

I thought the factory approach would be cleaner than adding extra arguments.
I merged this PR to keep your branch in this repository. Let's develop further in #2216.
In #896, the usage of stringi's RBBI was considered to improve the tokenization of URLs and tags (following gagolews/stringi#263). I believe that RBBI rules are also useful for users, since they provide an elegant way to adjust the tokenizer slightly without having to turn to other packages.
In this branch, I implement a new function `customized_tokenizer()` that essentially constructs a brand-new Rule-based Break Iterator for `stringi::stri_split_boundaries()`. The result of this function is another function, designed to be used as the `what` argument of `tokens()`. For example, it allows dealing with elisions (#1610):
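To illustrate the underlying mechanism (not the actual rules shipped in this branch), stringi accepts a custom ICU RBBI rule string as the `type` of a break iterator. The hypothetical rule set below makes a run of letters ending in an apostrophe, such as "l'" or "qu'", its own token, separated from the word that follows it:

```r
library(stringi)

# Minimal, hypothetical RBBI rules (a sketch, not the PR's rule set):
# {200} tags matches with ICU's "word containing letters" status, so that
# skip_word_none = TRUE drops the unmatched (e.g. whitespace) segments.
rules <- "
$Apos = [\\u0027\\u2019];
$L    = [\\p{L}];
$L+ $Apos {200};
$L+ {200};
"

stri_split_boundaries("l'objectif de l'analyse",
                      type = rules, skip_word_none = TRUE)
```

With longest-match rule application, `l'` matches the elision rule and `objectif` the plain-word rule, so the elision is split off as its own token.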
The implementation is rather naive and can certainly be enhanced. `customized_tokenizer()` has 3 basic settings:

- `"ICU_word"`, fully based on the word RBBI (and skipping `preserve_special()`)
- `"word"`, a hybrid solution between the custom word-RBBI rules and the `what = "word"` tokenization (does not skip `preserve_special()`)
- `"sentence"`, fully based on the sentence RBBI

I've re-implemented the custom rules related to hyphens, URLs, and tags in the RBBI to remain as aligned as possible with the default `what = "word"` tokenization. There are a few differences, however (highlighted in the test file). See the following illustration:
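As a usage sketch of the three settings (the interface is assumed from the description above; the exact argument name in this branch may differ):

```r
library(quanteda)

# The factory returns a tokenizer function for each base setting:
tok_icu  <- customized_tokenizer("ICU_word")  # pure word RBBI; skips preserve_special()
tok_word <- customized_tokenizer("word")      # hybrid with the what = "word" rules
tok_sent <- customized_tokenizer("sentence")  # pure sentence RBBI

# Each returned function can be passed as the `what` argument of tokens():
tokens("A #hashtag and a URL https://example.com", what = tok_word)
```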
If needed, it is also possible to align these two behaviours to the baseline tokenization.
A current limitation concerns `quanteda_options("pattern_hashtag")`.
Performance-wise, skipping `preserve_special()` does improve the speed of tokenization a bit. See this benchmark:

Let me know what you think of this feature; I thought it could be a nice addition to quanteda!