
Add split by token to PreProcessor #4392

Open
danielbichuetti opened this issue Mar 13, 2023 · 7 comments
Labels
2.x Related to Haystack v2.0 type:feature New feature or request

Comments

@danielbichuetti
Contributor

danielbichuetti commented Mar 13, 2023

Is your feature request related to a problem? Please describe.

  • As a user, I want to split my documents based on a token limit. I would like to use Hugging Face tokenizers, Sentence Transformers tokenizers, and OpenAI tokenizers, and I wish to avoid setting this value through trial-and-error with word or sentence counts.
  • I would also like to use regexes to replace strings and to replace newlines with spaces.

Describe the solution you'd like
Allow the use of any Hugging Face tokenizer, Sentence Transformers tokenizer, or OpenAI tokenizer. The user would choose the tokens split mode, set the tokenizer type (hf, sbert, or openai), and set the model name. The rest of the PreProcessor functionality would stay the same.

Describe alternatives you've considered
Keep it as it is, and let each user run their own experiments to find a suitable word/sentence split length.

Additional context
This is just pseudocode (exported from our internal codebase), but it would be something like:

# Imports and type aliases added for completeness; the Literal values for
# SplitMode/TokenizerMode below are assumptions based on the description above.
from typing import List, Literal, Optional, Pattern, Union

SplitMode = Literal["words", "sentences", "tokens"]
TokenizerMode = Literal["hf", "sbert", "openai"]

def __init__(
    self,
    mode: SplitMode = "words",
    tokenizer: Optional[TokenizerMode] = None,
    tokenizer_model: Optional[str] = None,
    language: Optional[str] = None,
    clean_whitespace: Optional[bool] = None,
    clean_header_footer: Optional[bool] = None,
    clean_empty_lines: Optional[bool] = None,
    remove_substrings: Optional[List[str]] = None,
    remove_substrings_regex: Optional[Union[List[Pattern], List[str]]] = None,
    replace_newlines: Optional[bool] = None,
    split_length: int = 180,
    split_overlap: int = 25,
    split_respect_sentence_boundary: bool = False,
    id_hash_keys: Optional[List[str]] = None,
) -> None:
    ...
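
As an illustration of what the tokens mode could use under the hood, here is a minimal sketch of resolving a token counter from the proposed tokenizer and tokenizer_model parameters. This is not the issue's actual implementation; the name _load_token_counter is hypothetical, and Sentence Transformers models are simply loaded through their Hugging Face tokenizer here:

from typing import Callable

def _load_token_counter(tokenizer: str, tokenizer_model: str) -> Callable[[str], int]:
    # Hypothetical helper: returns a function that counts tokens for the chosen backend.
    if tokenizer == "openai":
        import tiktoken
        enc = tiktoken.encoding_for_model(tokenizer_model)
        return lambda text: len(enc.encode(text))
    if tokenizer in ("hf", "sbert"):
        from transformers import AutoTokenizer
        tok = AutoTokenizer.from_pretrained(tokenizer_model)
        return lambda text: len(tok.tokenize(text))
    raise ValueError(f"Unknown tokenizer type: {tokenizer}")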
@danielbichuetti
Contributor Author

danielbichuetti commented Mar 13, 2023

This implementation will work with the current PreProcessor and pipelines. The code is already tested.

@ZanSara Tagging you because of #4256.

@ZanSara
Contributor

ZanSara commented Mar 13, 2023

Hey @danielbichuetti, if you already have a working implementation I'll be glad to have a look.

However, let me warn you: PreProcessor has almost no tests for the corner cases. My last attempt at introducing them revealed so many bugs hidden in the implementation that I had to drop the idea of ever touching it again. So... do you plan to add any tests? 😅 It might be way harder than it looks.

If you do open a PR please tag me and I'll do my best to follow it up.

@danielbichuetti
Contributor Author

danielbichuetti commented Mar 13, 2023

@ZanSara We can add tests for all 3 token-splitting modes. Or do you need me to add tests for the whole implementation? 😅

I'll be honest. This node's code looks like a house where we kept adding features and doing extra construction, adapting everything as we went. It's built, but without much planning hahahah. Sorry for being too honest.

I have a working implementation. We exported the Haystack PreProcessor and built a DocumentProcessor, moving to a more strongly typed structure and adding a token splitter. I can do the reverse and adapt the DocumentProcessor back to the current PreProcessor.

The way it was adapted works with the current implementation. Of course, if we switch to a more optimized way of doing the token calculation (like using front and tail calculation), we would need to refactor many methods, and your concerns would probably become a problem.

This implementation uses a linear approach.
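
To illustrate the idea, a linear pass over the document could look roughly like this. This is my own sketch, not the DocumentProcessor code; split_by_tokens is a hypothetical name, and count_tokens is assumed to be a counter like the one sketched above:

from typing import Callable, List

def split_by_tokens(
    units: List[str],
    count_tokens: Callable[[str], int],
    split_length: int = 180,
    split_overlap: int = 25,
) -> List[str]:
    # Walk the units (e.g. sentences) once, accumulating them until the
    # token budget is reached, then start a new chunk with a token overlap.
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for unit in units:
        n = count_tokens(unit)
        if current and current_tokens + n > split_length:
            chunks.append(" ".join(current))
            # Keep trailing units (up to split_overlap tokens) as the overlap.
            overlap: List[str] = []
            overlap_tokens = 0
            for prev in reversed(current):
                prev_tokens = count_tokens(prev)
                if overlap_tokens + prev_tokens > split_overlap:
                    break
                overlap.insert(0, prev)
                overlap_tokens += prev_tokens
            current, current_tokens = overlap, overlap_tokens
        current.append(unit)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks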

@ZanSara
Contributor

ZanSara commented Mar 13, 2023

Ok this sounds interesting! I'd be happy to see the DocumentProcessor. I can't guarantee I will have the time to review it quickly, but I can sure look at it and consider if it can be used in place of PreProcessor, at least in some situations. Unless I find some huge issues I believe we could have it in the core.

For the tests: well technically if you add a new parameter you should test it against at least a few others... which might already surface problems if the tests are done properly. I didn't expect you to write tests for the rest of the codebase, don't worry 😄

I'll be honest. This node's code looks like a house where we kept adding features and doing extra construction, adapting everything as we went. It's built, but without much planning hahahah. Sorry for being too honest.

I couldn't agree more with you on this point 😂 Let me share an anecdote. PreProcessor has one amazing feature that makes it amazingly fast: it loses track of whitespace when splitting 🙈 And it's basically impossible to fix this bug of losing whitespace without slowing it down about 2x, because clearly not counting something will always be faster than counting something, and PreProcessor currently loses that whitespace very efficiently. You can imagine my joy when I found that out 😅
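
As a tiny illustration of the kind of whitespace loss described above (my own example, not PreProcessor code):

text = "line one\n\n    indented line two"
rejoined = " ".join(text.split())  # split on any whitespace, re-join with single spaces
print(rejoined)  # "line one indented line two" -- the newlines and indentation are gone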

@danielbichuetti
Contributor Author

@ZanSara Yes. The first time we ported it, we noticed that. Touching the original PreProcessor code felt like: hey, better to start from scratch than to modify it. We tried a non-linear approach to the calculation, but touching the old code was a nightmare. I can see what you went through in the past. So we settled on a linear calculation.

I'll keep the name PreProcessor, since it's already a derivative implementation. It is, indeed, a modded PreProcessor that just adds features.

I'll open a PR, so we can explore it.

@liorshk

liorshk commented May 29, 2023

+1

@masci
Contributor

masci commented Mar 13, 2024

I need this myself :) moving to a feature request for 2.x

@masci masci added type:feature New feature or request 2.x Related to Haystack v2.0 labels Mar 13, 2024