
Add split by token to PreProcessor #4392

Open
danielbichuetti opened this issue Mar 13, 2023 · 7 comments
Labels
2.x Related to Haystack v2.0 type:feature New feature or request

Comments

@danielbichuetti
Contributor

danielbichuetti commented Mar 13, 2023

Is your feature request related to a problem? Please describe.

  • As a user, I want to split my documents based on a token limit. I would like to use Hugging Face tokenizers, Sentence Transformers tokenizers, and OpenAI tokenizers, and I wish to avoid setting this value through trial-and-error with word or sentence counts.
  • I would also like to use regexes to replace strings and to replace newlines with spaces.

Describe the solution you'd like
Allow the use of any Hugging Face tokenizer, Sentence Transformers tokenizer, or OpenAI tokenizer. The user would choose the tokens split mode, set the tokenizer type (hf, sbert, or openai), and set the model name. The rest of the PreProcessor functionality would stay the same.

Describe alternatives you've considered
Keep it as it is, and let each user run their own experiments to find a suitable word/sentence split length.

Additional context
This is just pseudocode (exported from our internal codebase), but it would be something like:

# Imports and type aliases added for completeness; the Literal values for
# SplitMode/TokenizerMode below are assumptions based on the description above.
from typing import List, Literal, Optional, Pattern, Union

SplitMode = Literal["words", "sentences", "tokens"]
TokenizerMode = Literal["hf", "sbert", "openai"]

def __init__(
    self,
    mode: SplitMode = "words",
    tokenizer: Optional[TokenizerMode] = None,
    tokenizer_model: Optional[str] = None,
    language: Optional[str] = None,
    clean_whitespace: Optional[bool] = None,
    clean_header_footer: Optional[bool] = None,
    clean_empty_lines: Optional[bool] = None,
    remove_substrings: Optional[List[str]] = None,
    remove_substrings_regex: Optional[Union[List[Pattern], List[str]]] = None,
    replace_newlines: Optional[bool] = None,
    split_length: int = 180,
    split_overlap: int = 25,
    split_respect_sentence_boundary: bool = False,
    id_hash_keys: Optional[List[str]] = None,
) -> None:
    ...
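
As an illustration of what the tokens mode could use under the hood, here is a minimal sketch of resolving a token counter from the proposed tokenizer and tokenizer_model parameters. This is not the issue's actual implementation; the name _load_token_counter is hypothetical, and Sentence Transformers models are simply loaded through their Hugging Face tokenizer here:

from typing import Callable

def _load_token_counter(tokenizer: str, tokenizer_model: str) -> Callable[[str], int]:
    # Hypothetical helper: returns a function that counts tokens for the chosen backend.
    if tokenizer == "openai":
        import tiktoken
        enc = tiktoken.encoding_for_model(tokenizer_model)
        return lambda text: len(enc.encode(text))
    if tokenizer in ("hf", "sbert"):
        from transformers import AutoTokenizer
        tok = AutoTokenizer.from_pretrained(tokenizer_model)
        return lambda text: len(tok.tokenize(text))
    raise ValueError(f"Unknown tokenizer type: {tokenizer}")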
@danielbichuetti
Contributor Author

danielbichuetti commented Mar 13, 2023

This implementation will work with the current PreProcessor and pipelines. The code is already tested.

@ZanSara Tagging you because of #4256.

@ZanSara
Contributor

ZanSara commented Mar 13, 2023

Hey @danielbichuetti, if you already have a working implementation I'll be glad to have a look.

However, let me warn you: PreProcessor has almost no tests for the corner cases. My last attempt at introducing them revealed so many bugs hidden in the implementation that I had to drop the idea of ever touching it again. So... do you plan to add any tests? 😅 It might be way harder than it looks.

If you do open a PR please tag me and I'll do my best to follow it up.

@danielbichuetti
Contributor Author

danielbichuetti commented Mar 13, 2023

@ZanSara We can add tests for all 3 token-splitting modes. Or do you need me to add tests for the whole implementation? 😅

I'll be honest. This node's code looks like a house where we kept adding features and doing extra construction, adapting everything as we went. It's built, but without much planning hahahah. Sorry for being too honest.

I have a working implementation. We exported the Haystack PreProcessor and built a DocumentProcessor, moving to a more strongly typed structure and adding a token splitter. I can do the reverse and adapt the DocumentProcessor back to the current PreProcessor.

The way it was adapted works with the current implementation. Of course, if we switch to a more optimized way of doing the token calculation (like using front and tail calculation), we would need to refactor many methods, and your concerns would probably become a problem.

This implementation uses a linear approach.
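
To illustrate the idea, a linear pass over the document could look roughly like this. This is my own sketch, not the DocumentProcessor code; split_by_tokens is a hypothetical name, and count_tokens is assumed to be a counter like the one sketched above:

from typing import Callable, List

def split_by_tokens(
    units: List[str],
    count_tokens: Callable[[str], int],
    split_length: int = 180,
    split_overlap: int = 25,
) -> List[str]:
    # Walk the units (e.g. sentences) once, accumulating them until the
    # token budget is reached, then start a new chunk with a token overlap.
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for unit in units:
        n = count_tokens(unit)
        if current and current_tokens + n > split_length:
            chunks.append(" ".join(current))
            # Keep trailing units (up to split_overlap tokens) as the overlap.
            overlap: List[str] = []
            overlap_tokens = 0
            for prev in reversed(current):
                prev_tokens = count_tokens(prev)
                if overlap_tokens + prev_tokens > split_overlap:
                    break
                overlap.insert(0, prev)
                overlap_tokens += prev_tokens
            current, current_tokens = overlap, overlap_tokens
        current.append(unit)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks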

@ZanSara
Contributor

ZanSara commented Mar 13, 2023

Ok this sounds interesting! I'd be happy to see the DocumentProcessor. I can't guarantee I will have the time to review it quickly, but I can sure look at it and consider if it can be used in place of PreProcessor, at least in some situations. Unless I find some huge issues I believe we could have it in the core.

For the tests: well technically if you add a new parameter you should test it against at least a few others... which might already surface problems if the tests are done properly. I didn't expect you to write tests for the rest of the codebase, don't worry 😄

I'll be honest. This node's code looks like a house where we kept adding features and doing extra construction, adapting everything as we went. It's built, but without much planning hahahah. Sorry for being too honest.

I couldn't agree more with you on this point 😂 Let me share an anecdote. PreProcessor has one amazing feature that makes it amazingly fast: it loses track of whitespace when splitting 🙈 And it's basically impossible to fix this bug of losing whitespace without slowing it down about 2x, because clearly not counting something will always be faster than counting something, and PreProcessor currently loses that whitespace very efficiently. You can imagine my joy when I found that out 😅
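
As a tiny illustration of the kind of whitespace loss described above (my own example, not PreProcessor code):

text = "line one\n\n    indented line two"
rejoined = " ".join(text.split())  # split on any whitespace, re-join with single spaces
print(rejoined)  # "line one indented line two" -- the newlines and indentation are gone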

@danielbichuetti
Contributor Author

@ZanSara Yes. The first time we ported it, we noticed that. Touching the original PreProcessor code felt like: hey, better to start from scratch than to modify it. We tried a non-linear approach to the calculation, but touching the old code was a nightmare. I can see what you went through in the past. So we settled on a linear calculation.

I'll keep the name PreProcessor, since it's already a derivative implementation. It is, indeed, a modded PreProcessor that just adds features.

I'll open a PR, so we can explore it.

@liorshk

liorshk commented May 29, 2023

+1

@masci
Contributor

masci commented Mar 13, 2024

I need this myself :) moving to a feature request for 2.x

@masci masci added type:feature New feature or request 2.x Related to Haystack v2.0 labels Mar 13, 2024