Replies: 1 comment
-
Sorry, found the issue myself: valuetype = "regex" doesn't do the job here, but valuetype = "fixed" is the right one.
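A minimal sketch of why this happens (the token text and the stopword subset below are hypothetical, picked to mirror the original list): with valuetype = "regex", quanteda treats each pattern as an unanchored regular expression, so short entries like "u." or "st" match any token that merely *contains* them, and nearly everything gets removed. With valuetype = "fixed", only tokens that equal a pattern exactly are removed.

```r
library(quanteda)

toks <- tokens("the quarter must be established precisely")
stops <- c("u.", "st", "the")  # hypothetical subset of the custom list

# As regexes, "u." also matches "quarter" and "must" (a "u" followed by
# any character), and "st" also matches "established":
as.character(tokens_remove(toks, stops, valuetype = "regex"))

# As fixed patterns, only the exact token "the" is removed:
as.character(tokens_remove(toks, stops, valuetype = "fixed"))
```

If some entries really are meant as patterns, another option is quanteda's glob matching (the default valuetype), where "u." only matches tokens spelled exactly "u." unless you add `*` wildcards.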
-
Dear all,
I am encountering difficulties while preprocessing my textual data with the quanteda package. When removing stop words, several documents suddenly end up with no features, even though the original text contains words other than those included in the stop-word list. I added some customized stopwords as well, but these do not cover all of the words in the document. Can anyone help me with this?
Here's an example of my R code:
```r
library(quanteda)

# Document text for ro2007-36
text <- "romania has a quarter of its workforce abroad the italian ambassador in bucharest, daniele mancini, believes that romania cannot develop with a quarter of the workforce going abroad, reports newsin. romania has fish the italian ambassador in bucharest, daniele mancini, believes that romania cannot develop with a quarter of the workforce going abroad, reports newsin. romania has over 1,300,000 romanians who are currently legally living abroad. but the real number cannot be established precisely because many work illegally. romania has a fourth of its workforce going abroad. no country that wants to grow can allow something like that, said daniele mancini."

# Custom list of stopwords
mystopwords <- stopwords("english", source = "snowball")
mystopwords <- c("u.", "dv", "e.g.", "new", "de", "even", "per", "cent", "mr", "us", "need", "must",
                 "press", "update", "video", "foto", "hyperlink", "adevarul",
                 "documentary", "www", "re", "see", "much", "good", "get",
                 "look", "eve", "st", "open", "due", "except", "next", "ð", "tjob",
                 "ever", "le", mystopwords)

# Display the number of stopwords
glue::glue("Now {length(mystopwords)} stopwords")
#> Now 211 stopwords

# Create a corpus
corp <- corpus(text, docnames = "ro2007-36")

# Tokenize the text
dat.tokens <- tokens(corp)

# Remove stopwords
dat.tokens.stopwords <- tokens_remove(dat.tokens,
                                      pattern = mystopwords,
                                      case_insensitive = TRUE,
                                      valuetype = "regex",
                                      padding = FALSE,
                                      verbose = TRUE)

# Convert tokens to a document-feature matrix (dfm)
dfm <- dfm(dat.tokens.stopwords)

# Display the dfm to check the number of features left
print(dfm)
#> Document-feature matrix of: 1 document, 4 features (0.00% sparse) and 0 docvars.
#> features
```