
Upgrade tokens object to process larger data more efficiently #2208

Closed
koheiw opened this issue Feb 5, 2023 · 3 comments

koheiw (Collaborator) commented Feb 5, 2023

Hi all,

I have finished feasibility tests of upgrading the tokens object to process a larger corpus with a smaller memory footprint and shorter execution time. The discussion on this subject has a long history (#681), but, in short, the new tokens object is based on Rcpp's external pointer, the XPtr object, which keeps the large vectors in C++ memory space.

It is hugely beneficial to keep tokens on the C++ side because every time we copy C++ vectors to R vectors, it takes a few seconds and consumes a large amount of memory. Repeated creation and deletion of large objects also triggers R's garbage collector, which takes more than a few seconds to do its job.
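A rough way to see this cost (only a sketch, assuming a large tokens object toks already exists):

system.time(lst <- as.list(toks))  # copies every C++ vector into an R list
system.time(gc())                  # collecting the discarded copies takes time too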

The example below compares the old tokens object toks and the new (prototype) tokens object xtoks through a long pipeline, which (I think) is common. Since tokens_remove() is called four times, toks travels between C++ and R eight times, while xtoks does not travel at all. As a result, operations on the new object are nearly three times faster.

as.externalptr(toks) converts the old tokens object to the new tokens object, xtoks (I should probably give the function a better name). as.externalptr(xtoks) deep-copies the new tokens object, so xtoks itself is not modified in the test.
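Because the XPtr has reference semantics, operations like tokens_remove() modify it in place; the deep copy is what keeps the original intact. A minimal sketch based on the behaviour described above:

tokens_remove(xtoks, stopwords("en"))                  # modifies xtoks in place
tokens_remove(as.externalptr(xtoks), stopwords("en"))  # deep-copies first; xtoks is untouched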

Please install the branch with this command and test the new object:

devtools::install_github("quanteda/quanteda", ref = "test-xtokens")

The Guardian corpus is available, but please use a larger corpus if you have one. You should see a bigger difference between the old and the new object if you use tokens objects with many short texts, such as sentences or social media posts.
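If you do not have a large corpus at hand, reshaping a built-in corpus to sentences is a quick way to create the many-short-documents case (only a sketch; it is much smaller than the Guardian corpus):

corp_sent <- corpus_reshape(data_corpus_inaugural, to = "sentences")
toks_sent <- tokens(corp_sent)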

I would appreciate anyone's feedback or comments, but I am especially interested in hearing from @conjugateprior @stefan-mueller @amatsuo @pablobarbera.

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.2.5
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 4 of 4 threads used.
#> See https://quanteda.io for tutorials and examples.

corp <- readRDS('/home/kohei/Dropbox/Public/data_corpus_guardian2016-10k.rds') %>% 
    corpus_reshape()
ndoc(corp)
#> [1] 391395

toks <- tokens(corp, remove_punct = FALSE, remove_numbers = FALSE, 
               remove_symbols = FALSE)
xtoks <- as.externalptr(toks) # convert to a new tokens object
class(xtoks)
#> [1] "externalptr"
class(toks)
#> [1] "tokens"

toks2 <- tokens_remove(toks, stopwords("en"), padding = TRUE) %>% 
  tokens_remove("[\\p{N}]", valuetype = "regex", padding = TRUE) %>% 
  tokens_remove("[\\p{P}]", valuetype = "regex", padding = TRUE) %>% 
  tokens_remove("[\\p{S}]", valuetype = "regex", padding = TRUE)
xtoks2 <- tokens_remove(as.externalptr(xtoks), stopwords("en"), padding = TRUE) %>% 
  tokens_remove("[\\p{N}]", valuetype = "regex", padding = TRUE) %>% 
  tokens_remove("[\\p{P}]", valuetype = "regex", padding = TRUE) %>% 
  tokens_remove("[\\p{S}]", valuetype = "regex", padding = TRUE)
identical(toks2, as.tokens(xtoks2))
#> [1] TRUE

microbenchmark::microbenchmark(
    old = tokens_remove(toks, stopwords("en"), padding = TRUE) %>% 
        tokens_remove("[\\p{N}]", valuetype = "regex", padding = TRUE) %>% 
        tokens_remove("[\\p{P}]", valuetype = "regex", padding = TRUE) %>% 
        tokens_remove("[\\p{S}]", valuetype = "regex", padding = TRUE),
    new = tokens_remove(as.externalptr(xtoks), stopwords("en"), padding = TRUE) %>% 
        tokens_remove("[\\p{N}]", valuetype = "regex", padding = TRUE) %>% 
        tokens_remove("[\\p{P}]", valuetype = "regex", padding = TRUE) %>% 
        tokens_remove("[\\p{S}]", valuetype = "regex", padding = TRUE),
    times = 10
)
#> Unit: seconds
#>  expr      min       lq     mean   median       uq       max neval
#>   old 8.369785 8.911533 9.385009 9.464731 9.915285 10.152171    10
#>   new 2.747844 3.110164 3.254950 3.319481 3.384929  3.561483    10

Created on 2023-02-05 with reprex v2.0.2

koheiw changed the title from "Upgrade tokens to process larger tokens object more efficiently" to "Upgrade tokens object to process larger data more efficiently" on Feb 5, 2023
koheiw added a commit that referenced this issue Feb 6, 2023

koheiw (Collaborator, Author) commented Feb 9, 2023

If the DFM is constructed in C++, the tokens object is not copied to R at all. We can expect the pipeline from token selection to DFM construction to be 2 to 3 times faster. The speed of tokenization is a different issue, though (#1965).

// Construct a sparse document-feature matrix (dgCMatrix) directly in C++,
// so the tokens never have to be copied to the R side.
// Text and TokensPtr are quanteda's internal typedefs.
S4 cpp_dfm(TokensPtr xptr) {
    xptr->recompile();
    std::size_t H = xptr->texts.size();  // number of documents (columns)
    int N = 0;
    for (std::size_t h = 0; h < H; h++)
        N += xptr->texts[h].size();
    std::vector<double> slot_x;          // non-zero counts
    std::vector<int> slot_i, slot_p;     // row indices and column pointers
    slot_i.reserve(N);
    slot_x.reserve(N);
    slot_p.reserve(H + 1);
    int p = 0;
    slot_p.push_back(p);
    for (std::size_t h = 0; h < H; h++) {
        Text text = xptr->texts[h];
        std::sort(text.begin(), text.end()); // rows must be sorted in dgCMatrix
        int n = 1;
        for (std::size_t i = 0; i < text.size(); i++) {
            // count runs of identical token IDs to get per-document frequencies
            if (i + 1 == text.size() || text[i] != text[i + 1]) {
                slot_i.push_back(text[i]);
                slot_x.push_back(n);
                p++;
                n = 1;
            } else {
                n++;
            }
        }
        slot_p.push_back(p);
    }
    IntegerVector slot_p_ = Rcpp::wrap(slot_p);
    DoubleVector slot_x_ = Rcpp::wrap(slot_x);
    IntegerVector slot_i_ = Rcpp::wrap(slot_i);
    size_t G = xptr->types.size();
    CharacterVector types_ = encode(xptr->types); // quanteda's internal helper
    if (xptr->padding) {
        G++;
        types_.push_front("");  // the pad occupies row 0
    } else {
        slot_i_ = slot_i_ - 1;  // shift to zero-based rows for the other tokens
    }
    IntegerVector dim_ = IntegerVector::create(G, H);
    List dimnames_ = List::create(types_, R_NilValue);
    S4 dfm_("dgCMatrix");
    dfm_.slot("p") = slot_p_;
    dfm_.slot("i") = slot_i_;
    dfm_.slot("x") = slot_x_;
    dfm_.slot("Dim") = dim_;
    dfm_.slot("Dimnames") = dimnames_;
    return dfm_;
}
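The returned S4 object is an ordinary Matrix::dgCMatrix with types as rows and documents as columns, so only the three slot vectors ever cross into R. For reference, a tiny sketch of the same structure built on the R side (one document containing type "a" twice and type "b" once):

library(Matrix)
m <- new("dgCMatrix",
         p = c(0L, 2L), i = c(0L, 1L), x = c(2, 1),
         Dim = c(2L, 1L),
         Dimnames = list(c("a", "b"), NULL))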

> microbenchmark::microbenchmark(
+     old = tokens_remove(toks, stopwords("en"), padding = TRUE) %>% 
+         tokens_remove("[\\p{N}]", valuetype = "regex", padding = TRUE) %>% 
+         tokens_remove("[\\p{P}]", valuetype = "regex", padding = TRUE) %>% 
+         tokens_remove("[\\p{S}]", valuetype = "regex", padding = TRUE) %>% 
+         dfm(),
+     new = tokens_remove(as.tokens_xptr(xtoks), stopwords("en"), padding = TRUE) %>% 
+         tokens_remove("[\\p{N}]", valuetype = "regex", padding = TRUE) %>% 
+         tokens_remove("[\\p{P}]", valuetype = "regex", padding = TRUE) %>% 
+         tokens_remove("[\\p{S}]", valuetype = "regex", padding = TRUE) %>% 
+         dfm(),
+     times = 10
+ )
Unit: seconds
 expr      min        lq      mean    median        uq       max neval
  old 9.702666 10.933117 12.287242 12.100233 12.874663 15.576066    10
  new 3.865356  3.885691  4.262062  4.018364  4.441775  5.838174    10
> 

koheiw (Collaborator, Author) commented Feb 15, 2023

@chainsawriot, I welcome your input too. Please share your views as a prolific package developer.

koheiw (Collaborator, Author) commented Apr 4, 2023

This is a more accurate performance comparison.

> microbenchmark::microbenchmark(
+     old = quanteda3::tokens(corp) %>% 
+         quanteda3::tokens_remove(stopwords("en"), padding = TRUE) %>% 
+         quanteda3::dfm(remove_padding = TRUE),
+     new = tokens(corp, xptr = TRUE) %>% 
+         tokens_remove(stopwords("en"), padding = TRUE) %>% 
+         dfm(remove_padding = TRUE),
+     times = 10
+ )
Unit: seconds
 expr       min        lq      mean    median        uq       max neval
  old 13.691072 13.998020 14.470557 14.353116 14.961678 15.366565    10
  new  8.205233  8.270918  8.489905  8.352947  8.633411  9.081053    10
> ndoc(corp)
[1] 391395

kbenoit added this to the v4 release milestone on Apr 12, 2023
koheiw closed this as completed on Apr 12, 2023