
Upgrade tokens object to process larger data more efficiently #2208

Closed
koheiw opened this issue Feb 5, 2023 · 3 comments

koheiw (Collaborator) commented Feb 5, 2023

Hi all,

I have finished feasibility tests of upgrading the tokens object to process a larger corpus with a smaller memory footprint and shorter execution time. The discussion on this subject has a long history (#681), but, in short, the new tokens object is based on Rcpp's external pointer, the XPtr object, which keeps the large vectors in C++ memory space.

It is hugely beneficial to keep tokens on the C++ side because every time we copy C++ vectors to R vectors, it takes a few seconds and consumes a large amount of memory. Repeated creation and deletion of large objects also triggers R's garbage collector, which takes more than a few seconds to do its job.
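A rough way to see this cost (only a sketch, assuming a large tokens object toks already exists):

system.time(lst <- as.list(toks))  # copies every C++ vector into an R list
system.time(gc())                  # collecting the discarded copies takes time too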

The example below compares the old tokens object toks and the new (prototype) tokens object xtoks through a long pipeline, which (I think) is common. Since tokens_remove() is called four times, toks travels between C++ and R eight times, while xtoks does not travel at all. As a result, operations on the new object are nearly three times faster.

as.externalptr(toks) converts the old tokens object to the new tokens object, xtoks (I should probably give the function a better name). as.externalptr(xtoks) deep-copies the new tokens object, so xtoks itself is not modified in the test.
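Because the XPtr has reference semantics, operations like tokens_remove() modify it in place; the deep copy is what keeps the original intact. A minimal sketch based on the behaviour described above:

tokens_remove(xtoks, stopwords("en"))                  # modifies xtoks in place
tokens_remove(as.externalptr(xtoks), stopwords("en"))  # deep-copies first; xtoks is untouched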

Please install the branch with this command and test the new object:

devtools::install_github("quanteda/quanteda", ref = "test-xtokens")

The Guardian corpus is available, but please use a larger corpus if you have one. You should see a bigger difference between the old and the new object if you use tokens objects with many short texts, such as sentences or social media posts.
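If you do not have a large corpus at hand, reshaping a built-in corpus to sentences is a quick way to create the many-short-documents case (only a sketch; it is much smaller than the Guardian corpus):

corp_sent <- corpus_reshape(data_corpus_inaugural, to = "sentences")
toks_sent <- tokens(corp_sent)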

I would appreciate anyone's feedback or comments, but I am especially interested in hearing from @conjugateprior @stefan-mueller @amatsuo @pablobarbera.

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.2.5
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 4 of 4 threads used.
#> See https://quanteda.io for tutorials and examples.

corp <- readRDS('/home/kohei/Dropbox/Public/data_corpus_guardian2016-10k.rds') %>% 
    corpus_reshape()
ndoc(corp)
#> [1] 391395

toks <- tokens(corp, remove_punct = FALSE, remove_numbers = FALSE, 
               remove_symbols = FALSE)
xtoks <- as.externalptr(toks) # convert to a new tokens object
class(xtoks)
#> [1] "externalptr"
class(toks)
#> [1] "tokens"

toks2 <- tokens_remove(toks, stopwords("en"), padding = TRUE) %>% 
  tokens_remove("[\\p{N}]", valuetype = "regex", padding = TRUE) %>% 
  tokens_remove("[\\p{P}]", valuetype = "regex", padding = TRUE) %>% 
  tokens_remove("[\\p{S}]", valuetype = "regex", padding = TRUE)
xtoks2 <- tokens_remove(as.externalptr(xtoks), stopwords("en"), padding = TRUE) %>% 
  tokens_remove("[\\p{N}]", valuetype = "regex", padding = TRUE) %>% 
  tokens_remove("[\\p{P}]", valuetype = "regex", padding = TRUE) %>% 
  tokens_remove("[\\p{S}]", valuetype = "regex", padding = TRUE)
identical(toks2, as.tokens(xtoks2))
#> [1] TRUE

microbenchmark::microbenchmark(
    old = tokens_remove(toks, stopwords("en"), padding = TRUE) %>% 
        tokens_remove("[\\p{N}]", valuetype = "regex", padding = TRUE) %>% 
        tokens_remove("[\\p{P}]", valuetype = "regex", padding = TRUE) %>% 
        tokens_remove("[\\p{S}]", valuetype = "regex", padding = TRUE),
    new = tokens_remove(as.externalptr(xtoks), stopwords("en"), padding = TRUE) %>% 
        tokens_remove("[\\p{N}]", valuetype = "regex", padding = TRUE) %>% 
        tokens_remove("[\\p{P}]", valuetype = "regex", padding = TRUE) %>% 
        tokens_remove("[\\p{S}]", valuetype = "regex", padding = TRUE),
    times = 10
)
#> Unit: seconds
#>  expr      min       lq     mean   median       uq       max neval
#>   old 8.369785 8.911533 9.385009 9.464731 9.915285 10.152171    10
#>   new 2.747844 3.110164 3.254950 3.319481 3.384929  3.561483    10

Created on 2023-02-05 with reprex v2.0.2

koheiw changed the title from "Upgrade tokens to process larger tokens object more efficiently" to "Upgrade tokens object to process larger data more efficiently" on Feb 5, 2023
koheiw added a commit that referenced this issue Feb 6, 2023

koheiw (Collaborator, Author) commented Feb 9, 2023

If the DFM is constructed in C++, the tokens object is not copied to R at all. We can expect the pipeline from token selection to DFM construction to be 2 to 3 times faster. The speed of tokenization is a different issue, though (#1965).

// Construct a sparse document-feature matrix (dgCMatrix) directly in C++,
// so the tokens never have to be copied to the R side.
// Text and TokensPtr are quanteda's internal typedefs.
S4 cpp_dfm(TokensPtr xptr) {
    xptr->recompile();
    std::size_t H = xptr->texts.size();  // number of documents (columns)
    int N = 0;
    for (std::size_t h = 0; h < H; h++)
        N += xptr->texts[h].size();
    std::vector<double> slot_x;          // non-zero counts
    std::vector<int> slot_i, slot_p;     // row indices and column pointers
    slot_i.reserve(N);
    slot_x.reserve(N);
    slot_p.reserve(H + 1);
    int p = 0;
    slot_p.push_back(p);
    for (std::size_t h = 0; h < H; h++) {
        Text text = xptr->texts[h];
        std::sort(text.begin(), text.end()); // rows must be sorted in dgCMatrix
        int n = 1;
        for (std::size_t i = 0; i < text.size(); i++) {
            // count runs of identical token IDs to get per-document frequencies
            if (i + 1 == text.size() || text[i] != text[i + 1]) {
                slot_i.push_back(text[i]);
                slot_x.push_back(n);
                p++;
                n = 1;
            } else {
                n++;
            }
        }
        slot_p.push_back(p);
    }
    IntegerVector slot_p_ = Rcpp::wrap(slot_p);
    DoubleVector slot_x_ = Rcpp::wrap(slot_x);
    IntegerVector slot_i_ = Rcpp::wrap(slot_i);
    size_t G = xptr->types.size();
    CharacterVector types_ = encode(xptr->types); // quanteda's internal helper
    if (xptr->padding) {
        G++;
        types_.push_front("");  // the pad occupies row 0
    } else {
        slot_i_ = slot_i_ - 1;  // shift to zero-based rows for the other tokens
    }
    IntegerVector dim_ = IntegerVector::create(G, H);
    List dimnames_ = List::create(types_, R_NilValue);
    S4 dfm_("dgCMatrix");
    dfm_.slot("p") = slot_p_;
    dfm_.slot("i") = slot_i_;
    dfm_.slot("x") = slot_x_;
    dfm_.slot("Dim") = dim_;
    dfm_.slot("Dimnames") = dimnames_;
    return dfm_;
}
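The returned S4 object is an ordinary Matrix::dgCMatrix with types as rows and documents as columns, so only the three slot vectors ever cross into R. For reference, a tiny sketch of the same structure built on the R side (one document containing type "a" twice and type "b" once):

library(Matrix)
m <- new("dgCMatrix",
         p = c(0L, 2L), i = c(0L, 1L), x = c(2, 1),
         Dim = c(2L, 1L),
         Dimnames = list(c("a", "b"), NULL))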

> microbenchmark::microbenchmark(
+     old = tokens_remove(toks, stopwords("en"), padding = TRUE) %>% 
+         tokens_remove("[\\p{N}]", valuetype = "regex", padding = TRUE) %>% 
+         tokens_remove("[\\p{P}]", valuetype = "regex", padding = TRUE) %>% 
+         tokens_remove("[\\p{S}]", valuetype = "regex", padding = TRUE) %>% 
+         dfm(),
+     new = tokens_remove(as.tokens_xptr(xtoks), stopwords("en"), padding = TRUE) %>% 
+         tokens_remove("[\\p{N}]", valuetype = "regex", padding = TRUE) %>% 
+         tokens_remove("[\\p{P}]", valuetype = "regex", padding = TRUE) %>% 
+         tokens_remove("[\\p{S}]", valuetype = "regex", padding = TRUE) %>% 
+         dfm(),
+     times = 10
+ )
Unit: seconds
 expr      min        lq      mean    median        uq       max neval
  old 9.702666 10.933117 12.287242 12.100233 12.874663 15.576066    10
  new 3.865356  3.885691  4.262062  4.018364  4.441775  5.838174    10
> 

koheiw (Collaborator, Author) commented Feb 15, 2023

@chainsawriot, I welcome your input too. Please share your views as a prolific package developer.

koheiw (Collaborator, Author) commented Apr 4, 2023

This is a more accurate performance comparison.

> microbenchmark::microbenchmark(
+     old = quanteda3::tokens(corp) %>% 
+         quanteda3::tokens_remove(stopwords("en"), padding = TRUE) %>% 
+         quanteda3::dfm(remove_padding = TRUE),
+     new = tokens(corp, xptr = TRUE) %>% 
+         tokens_remove(stopwords("en"), padding = TRUE) %>% 
+         dfm(remove_padding = TRUE),
+     times = 10
+ )
Unit: seconds
 expr       min        lq      mean    median        uq       max neval
  old 13.691072 13.998020 14.470557 14.353116 14.961678 15.366565    10
  new  8.205233  8.270918  8.489905  8.352947  8.633411  9.081053    10
> ndoc(corp)
[1] 391395

kbenoit added this to the v4 release milestone on Apr 12, 2023
koheiw closed this as completed on Apr 12, 2023