modifying window & pattern options #2391

iamYannC · 2024-05-19T21:55:17Z


Hey. I posted a fairly detailed question on SO, here

My main issue is that with kwic() I can't modify the context pattern or set different window sizes before and after the keyword. For example, I may want only one token of context before the keyword but three after. I guess this would be easy to implement.

The more important point concerns the context pattern itself. Instead of limiting the search to letters or symbols, I would like to fully customize the type of pattern I'm after.
For example, I may not want to set remove_punct in tokens() but still want to ignore punctuation when searching for context. I could achieve this by providing window_pattern = \\w+\\s+, for example (or [A-Za-z], etc.).

I started writing my own function, but quanteda has some really nice strengths, as it handles the cases where the window extends beyond the text (example in my SO post).

I hope I laid out my idea clearly. I would love to get involved in developing these ideas if needed (though I am not a software engineer, so only R code here).

Thank you
Yann

koheiw (Maintainer) · 2024-05-20T00:44:55Z

You can perform such operations using existing functions.

require(quanteda)
#> Loading required package: quanteda
#> Warning: package 'quanteda' was built under R version 4.3.3
#> Warning in .recacheSubclasses(def@className, def, env): undefined subclass
#> "ndiMatrix" of class "replValueSp"; definition not updated
#> Package version: 4.0.2
#> Unicode version: 15.1
#> ICU version: 74.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- "if you drive over 300000 miles a year you will pay a  fine of 10000 usd."
pat <- "\\d{4,6}"

# Show only pre-context in kwic ------------

tokens(txt) |>
  tokens_select(pat, valuetype = "regex", window = c(3, 0), padding = TRUE) |>
  kwic(pat, valuetype = "regex", window = 3)
#> Keyword-in-context with 2 matches.                                       
#>   [text1, 5] you drive over | 300000 | 
#>  [text1, 15]      a fine of | 10000  |


# Exclude some tokens in kwic ------------

toks <- tokens(txt)
# Blank out stopwords (keeping positions via padding), then index the remaining word tokens
idx <- tokens_remove(toks, stopwords(), padding = TRUE) |>
  index("\\w", valuetype = "regex")
kwic(toks, index = idx, window = 3)
#> Keyword-in-context with 8 matches.                                                           
#>   [text1, 3]            if you | drive  | over 300000 miles
#>   [text1, 5]    you drive over | 300000 | miles a year     
#>   [text1, 6] drive over 300000 | miles  | a year you       
#>   [text1, 8]    300000 miles a |  year  | you will pay     
#>  [text1, 11]     year you will |  pay   | a fine of        
#>  [text1, 13]        will pay a |  fine  | of 10000 usd     
#>  [text1, 15]         a fine of | 10000  | usd.             
#>  [text1, 16]     fine of 10000 |  usd   | .

Created on 2024-05-20 with reprex v2.1.0
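For the punctuation case raised in the original post, one possible sketch is to keep punctuation in the main tokens object and derive a punctuation-free copy only for the concordance. The sample text and the name toks_nopunct are made up for illustration, and the regex ^\\p{P}+$ is an assumption about what should count as punctuation. Because the copy is built without padding, the positions that kwic() reports refer to the reduced copy, not to the original tokens.

```r
library(quanteda)

# Hypothetical example text containing punctuation inside the context window
txt <- "you will pay, sadly, a fine of 10000 usd."
pat <- "\\d{4,6}"

# Punctuation is kept in the main tokens object
toks <- tokens(txt)

# Punctuation-free copy used only for the concordance (no padding,
# so context windows skip over the removed punctuation tokens)
toks_nopunct <- tokens_remove(toks, "^\\p{P}+$", valuetype = "regex")

kwic(toks_nopunct, pat, valuetype = "regex", window = 3)
```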

koheiw (Maintainer) · 2024-05-20T01:12:03Z

It would be easy to make the window in kwic() asymmetric: we would only need to change it to a two-element vector of integers.

quanteda/src/kwic.cpp

Line 40 in 3163bac

const int window,

We kept kwic() simple because users should instead use tokens_select() with the window argument for statistical analysis. We could add the index argument to tokens_select() to allow users to select tokens based on a different tokens object (e.g. POS annotation, sentiment).
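As a minimal sketch of the "one token before, three after" case from the original post, the window argument of tokens_select() can be given a two-element vector (tokens before, tokens after), and the result passed to kwic(). The name sel is made up for illustration. With padding = TRUE, tokens outside the window become empty pads, so positions stay aligned with the original text.

```r
library(quanteda)

txt <- "if you drive over 300000 miles a year you will pay a fine of 10000 usd."
pat <- "\\d{4,6}"

toks <- tokens(txt)

# Keep 1 token before and 3 tokens after each match; pad everything else
sel <- tokens_select(toks, pat, valuetype = "regex",
                     window = c(1, 3), padding = TRUE)

# The kwic() window only needs to be as wide as the larger side
kwic(sel, pat, valuetype = "regex", window = 3)
```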

iamYannC (Author) · 2024-05-22T23:20:16Z

Oh dang, that is some functionality I was not aware of! Thank you! Will definitely apply this.

