Replies: 3 comments
-
You can perform such operations using existing functions.
require(quanteda)
#> Loading required package: quanteda
#> Warning: package 'quanteda' was built under R version 4.3.3
#> Warning in .recacheSubclasses(def@className, def, env): undefined subclass
#> "ndiMatrix" of class "replValueSp"; definition not updated
#> Package version: 4.0.2
#> Unicode version: 15.1
#> ICU version: 74.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- "if you drive over 300000 miles a year you will pay a fine of 10000 usd."
pat <- "\\d{4,6}"
# Show only pre-context in kwic ------------
tokens(txt) |>
  tokens_select(pat, valuetype = "regex", window = c(3, 0), padding = TRUE) |>
  kwic(pat, valuetype = "regex", window = 3)
#> Keyword-in-context with 2 matches.
#> [text1, 5] you drive over | 300000 |
#> [text1, 15] a fine of | 10000 |
# Exclude some tokens in kwic ------------
toks <- tokens(txt)
idx <- tokens_remove(toks, stopwords(), padding = TRUE) %>%
  index("\\w", valuetype = "regex")
kwic(toks, index = idx, window = 3)
#> Keyword-in-context with 8 matches.
#> [text1, 3] if you | drive | over 300000 miles
#> [text1, 5] you drive over | 300000 | miles a year
#> [text1, 6] drive over 300000 | miles | a year you
#> [text1, 8] 300000 miles a | year | you will pay
#> [text1, 11] year you will | pay | a fine of
#> [text1, 13] will pay a | fine | of 10000 usd
#> [text1, 15] a fine of | 10000 | usd.
#> [text1, 16] fine of 10000 | usd | .
Created on 2024-05-20 with reprex v2.1.0
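For the asymmetric case raised in the question (one token of context before the keyword, three after), a minimal sketch along the same lines is to pass a two-element `window` to `tokens_select()`. It reuses the `txt` and `pat` values from the example above; the object name `toks_win` is only illustrative.

```r
library(quanteda)

txt <- "if you drive over 300000 miles a year you will pay a fine of 10000 usd."
pat <- "\\d{4,6}"

# window = c(1, 3): keep 1 token before and 3 tokens after each match,
# plus the match itself; all other tokens are dropped (no padding here).
toks_win <- tokens(txt) |>
  tokens_select(pat, valuetype = "regex", window = c(1, 3))

print(toks_win)
```

Adding `padding = TRUE` and piping the result into `kwic()`, as in the first example, should keep the original token positions in the output.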
-
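It is easy to make the window in kwic() asymmetric: it only requires changing it to a two-element vector of integers.
https://github.com/quanteda/quanteda/blob/3163bac37232753c1531f3bbf28a0d8095113a38/src/kwic.cpp#L40
We kept kwic() simple because users should instead use tokens_select() with the window argument for statistical analysis. We could add the index argument to tokens_select() to allow users to select tokens based on a different tokens object (e.g. POS annotation, sentiment).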
-
Oh dang, that is some functionality I was not aware of! Thank you! Will
definitely apply this.
On Mon, 20 May 2024, 04:12 Kohei Watanabe wrote:
It is easy to make the window in kwic() asymmetric: it only requires changing it to a
two-element vector of integers.
https://github.com/quanteda/quanteda/blob/3163bac37232753c1531f3bbf28a0d8095113a38/src/kwic.cpp#L40
We kept kwic() simple because users should instead use tokens_select()
with window argument for statistical analysis. We could add the index
argument to tokens_select() to allow users to select tokens based on a
different tokens object (e.g. POS annotation, sentiment).
-
Hey. I posted a quite detailed post in SO, here
My main issue is that with kwic(), I can't modify the context pattern or set different window sizes before and after the keyword. I may want only one context token before the keyword but three after, etc. I guess this would be easy to implement.
The more important one concerns the context pattern itself. Instead of limiting the search to letters or symbols, I would like to fully customize the type of pattern I'm after.
For example, I may not want to set remove_punct in tokens() but still want to ignore punctuation in my context search. I would be able to achieve this by providing window_pattern = \\w+\\s+, for example (or [A-Za-z], etc.).
I started writing my own function, but quanteda has some really nice strengths, as it handles those cases where the pattern exceeds the text (example in my SO post).
I hope I have laid out my idea clearly. I would love to get involved in developing these ideas if needed (though I am not a software engineer, so only R code here).
Thank you
Yann