Upgrade tokens object to process larger data more efficiently #2208
koheiw added a commit that referenced this issue on Feb 6, 2023:
If the DFM is constructed in C++, the tokens object is not copied to R at all. We can expect the pipeline to be 2 to 3 times faster between tokens selection and DFM construction. The speed of tokenization is a different issue, though (#1965). (See lines 83 to 141 in 82cf04f.)
@chainsawriot, I welcome your input too. Please share your views as a prolific package developer.
This is a more accurate performance comparison.
> microbenchmark::microbenchmark(
+ old = quanteda3::tokens(corp) %>%
+ quanteda3::tokens_remove(stopwords("en"), padding = TRUE) %>%
+ quanteda3::dfm(remove_padding = TRUE),
+ new = tokens(corp, xptr = TRUE) %>%
+ tokens_remove(stopwords("en"), padding = TRUE) %>%
+ dfm(remove_padding = TRUE),
+ times = 10
+ )
Unit: seconds
expr min lq mean median uq max neval
old 13.691072 13.998020 14.470557 14.353116 14.961678 15.366565 10
new 8.205233 8.270918 8.489905 8.352947 8.633411 9.081053 10
> ndoc(corp)
[1] 391395
Hi all,
I have finished feasibility tests of upgrading the tokens object to process a larger corpus with a smaller memory footprint and shorter execution time. The discussion on this subject has a long history (#681), but, in short, the new tokens object is based on Rcpp's external pointer, the XPtr object, which keeps the large vectors in the C++ memory space. It is hugely beneficial to keep tokens on the C++ side, because every time we copy C++ vectors to R vectors, it takes a few seconds and consumes a huge amount of memory. Repeated creation and deletion of large objects also triggers R's garbage collector, which takes more than a few seconds to do its job.
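To illustrate the mechanism, here is a minimal sketch of the XPtr idea using Rcpp::cppFunction. This is not quanteda's actual internals; the function names are made up for the example.

library(Rcpp)

cppFunction('
// Allocate a large vector in C++ memory and return only a handle (XPtr) to R.
XPtr<std::vector<int>> make_big_vector(int n) {
    auto* p = new std::vector<int>(n, 1);
    return XPtr<std::vector<int>>(p, true);  // true: free when the R handle is garbage-collected
}')

cppFunction('
// Work on the data in place on the C++ side; nothing is copied into an R vector.
double sum_big_vector(XPtr<std::vector<int>> ptr) {
    double total = 0;
    for (int v : *ptr) total += v;
    return total;
}')

h <- make_big_vector(1e7L)  # R holds only an external pointer
sum_big_vector(h)           # the computation happens entirely in C++ memory

Only the pointer crosses the R/C++ boundary; copying the ten million integers into an R vector on every call is exactly the cost the new tokens object avoids.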
The example below compares the old tokens object toks and the new (prototype) tokens object xtoks through a long pipeline, which (I think) is common. Since tokens_remove() is called four times, toks travels between C++ and R eight times, while xtoks does not at all. As a result, operations on the new object are nearly three times faster. as.externalptr(toks) converts the old tokens object to the new tokens object, xtoks (I probably should give the function a better name); as.externalptr(xtoks) deep-copies the new tokens object, so xtoks is not modified in the test.
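A sketch of such a pipeline (the corpus and the removal patterns here are illustrative stand-ins, not the original example):

library(quanteda)

toks <- tokens(data_corpus_inaugural)  # old tokens object, stored on the R side
xtoks <- as.externalptr(toks)          # new (prototype) tokens object, stored in C++

# Each tokens_remove() on toks copies the tokens C++ -> R -> C++,
# while the same calls on a copy of xtoks stay on the C++ side.
dfmat <- as.externalptr(xtoks) %>%     # deep copy, so xtoks is unchanged
    tokens_remove(stopwords("en"), padding = TRUE) %>%
    tokens_remove("said", padding = TRUE) %>%
    tokens_remove("mr", padding = TRUE) %>%
    tokens_remove("also", padding = TRUE) %>%
    dfm(remove_padding = TRUE)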
Please install the branch with this command and test the new object:
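(A typical invocation with remotes, where "<branch-name>" is a placeholder for the actual development branch:)

# "<branch-name>" is a placeholder; substitute the branch referenced in this issue.
remotes::install_github("quanteda/quanteda", ref = "<branch-name>")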
The Guardian corpus is available, but please use a larger corpus if you have one. You should see a bigger difference between the old and new objects if you use tokens objects with many short texts, such as sentences or social media posts.
I would appreciate anyone's feedback or comments, but I am especially interested in hearing from @conjugateprior, @stefan-mueller, @amatsuo, and @pablobarbera.
Created on 2023-02-05 with reprex v2.0.2