Enable stemming and choosing tokenizer, when doing full text search in tantivy #1315

josca42 · 2024-05-19T08:36:00Z

SDK

Python

Description

Enabling stemming and using a language-specific tokenizer tends to improve recall quite a bit when doing full text search.
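To illustrate the recall claim, here is a small self-contained sketch. The `naive_stem` function is a toy suffix stripper made up for this example; it is not the stemmer tantivy uses, and the documents and query are invented.

```python
from typing import List


def naive_stem(token: str) -> str:
    """Toy suffix stripper for illustration only; NOT tantivy's stemmer."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text: str, stem: bool = False) -> List[str]:
    """Lowercase, split on whitespace, and optionally stem each token."""
    tokens = text.lower().split()
    return [naive_stem(t) for t in tokens] if stem else tokens


def search(query: str, docs: List[str], stem: bool = False) -> List[int]:
    """Return indices of documents sharing at least one token with the query."""
    query_tokens = set(tokenize(query, stem))
    return [i for i, doc in enumerate(docs) if query_tokens & set(tokenize(doc, stem))]


docs = ["indexing documents quickly", "the index was rebuilt", "unrelated text"]

exact_hits = search("indexes", docs)               # exact token match only
stemmed_hits = search("indexes", docs, stem=True)  # match on stemmed tokens

print(exact_hits, stemmed_hits)
```

Without stemming, the query token has to appear verbatim in a document; with stemming, "indexes", "indexing" and "index" reduce to the same term, so more relevant documents are found.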

Tantivy has support for this through the tokenizer_name argument in add_text_field.

As far as I can tell, the change needed is to add a tokenizer_name argument to the following line

And then add the tokenizer_name argument to the create_fts_index method.

I would personally really prefer it if the argument were exposed, instead of just enabling the English stemmer. Tantivy supports a few different language tokenizers, which I think a lot of people would like to use instead of English.
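A rough sketch of what forwarding the argument could look like, assuming the tantivy Python bindings (`tantivy.SchemaBuilder` and its `add_text_field(name, stored=..., tokenizer_name=...)`). The helper names `text_field_options` and `build_schema` and the `doc_id` field are hypothetical, chosen for illustration, and are not taken from the LanceDB code base.

```python
from typing import Dict, List


def text_field_options(
    text_fields: List[str], tokenizer_name: str = "default"
) -> Dict[str, dict]:
    """Keyword arguments to forward to add_text_field for each text column.

    Hypothetical helper: it only shows tokenizer_name being passed through
    instead of being hard-coded.
    """
    return {
        name: {"stored": True, "tokenizer_name": tokenizer_name}
        for name in text_fields
    }


def build_schema(text_fields: List[str], tokenizer_name: str = "default"):
    """Build a tantivy schema whose text fields use the requested tokenizer."""
    # Imported lazily so the helper above works without tantivy installed.
    import tantivy

    builder = tantivy.SchemaBuilder()
    # Assumed integer field for mapping hits back to rows (illustrative name).
    builder.add_integer_field("doc_id", stored=True)
    for name, options in text_field_options(text_fields, tokenizer_name).items():
        builder.add_text_field(name, **options)
    return builder.build()
```

A caller would then pick the tokenizer by name, for example `build_schema(["title", "body"], tokenizer_name="en_stem")` for the English stemmer. Which other names are accepted depends on the tokenizers registered with the tantivy index.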

I can create a pull request with the suggested changes if you think it is a good idea :-).

wjones127 · 2024-05-20T16:35:02Z

This all sounds good to me. Feel free to make a PR :)

josca42 added the enhancement New feature or request label May 19, 2024

josca42 linked a pull request Jun 5, 2024 that will close this issue

feat: enable stemming #1356

Open
