Replies: 1 comment
-
Sorry, found the issue myself: valuetype = "regex" doesn't do the job here, but valuetype = "fixed" is the right one.
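A minimal sketch of why this happens (the token text and the stopword subset below are hypothetical, picked to mirror the original list): with valuetype = "regex", quanteda treats each pattern as an unanchored regular expression, so short entries like "u." or "st" match any token that merely *contains* them, and nearly everything gets removed. With valuetype = "fixed", only tokens that equal a pattern exactly are removed.

```r
library(quanteda)

toks <- tokens("the quarter must be established precisely")
stops <- c("u.", "st", "the")  # hypothetical subset of the custom list

# As regexes, "u." also matches "quarter" and "must" (a "u" followed by
# any character), and "st" also matches "established":
as.character(tokens_remove(toks, stops, valuetype = "regex"))

# As fixed patterns, only the exact token "the" is removed:
as.character(tokens_remove(toks, stops, valuetype = "fixed"))
```

If some entries really are meant as patterns, another option is quanteda's glob matching (the default valuetype), where "u." only matches tokens spelled exactly "u." unless you add `*` wildcards.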
-
Dear all,
I am encountering difficulties while preprocessing my textual data with the quanteda package. When removing stop words, several documents suddenly end up with no features, even though the original text contains words other than those included in the stop-word list. I added some customized stopwords as well, but these do not cover all of the words in the document. Can anyone help me with this?
Here's an example of my R code:
```r
library(quanteda)

# Document text for ro2007-36
text <- "romania has a quarter of its workforce abroad the italian ambassador in bucharest, daniele mancini, believes that romania cannot develop with a quarter of the workforce going abroad, reports newsin. romania has fish the italian ambassador in bucharest, daniele mancini, believes that romania cannot develop with a quarter of the workforce going abroad, reports newsin. romania has over 1,300,000 romanians who are currently legally living abroad. but the real number cannot be established precisely because many work illegally. romania has a fourth of its workforce going abroad. no country that wants to grow can allow something like that, said daniele mancini."

# Custom list of stopwords
mystopwords <- stopwords("english", source = "snowball")
mystopwords <- c("u.", "dv", "e.g.", "new", "de", "even", "per", "cent", "mr", "us", "need", "must",
                 "press", "update", "video", "foto", "hyperlink", "adevarul",
                 "documentary", "www", "re", "see", "much", "good", "get",
                 "look", "eve", "st", "open", "due", "except", "next", "ð", "tjob",
                 "ever", "le", mystopwords)

# Display the number of stopwords
glue::glue("Now {length(mystopwords)} stopwords")
#> Now 211 stopwords

# Create a corpus
corp <- corpus(text, docnames = "ro2007-36")

# Tokenize the text
dat.tokens <- tokens(corp)

# Remove stopwords
dat.tokens.stopwords <- tokens_remove(dat.tokens,
                                      pattern = mystopwords,
                                      case_insensitive = TRUE,
                                      valuetype = "regex",
                                      padding = FALSE,
                                      verbose = TRUE)

# Convert tokens to a document-feature matrix (dfm)
dfm <- dfm(dat.tokens.stopwords)

# Display the dfm to check the number of features left
print(dfm)
#> Document-feature matrix of: 1 document, 4 features (0.00% sparse) and 0 docvars.
#> features
```