-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature Request]: Handling Empty Substrings in split_text_keep_separator
Function
#14528
Comments
I think this would be undesirable. Splitting is only one step, there is also the step that merges back up to your chunk size, which is the part that makes this work Furthermore, this is removing spatial information (like how the text was originally laid out), which LLMs are quite good at understanding |
Got it, then would it be beneficial to modify the function to check and handle only the first element of the result? I think only the first element can be empty. def split_text_keep_separator(text: str, separator: str) -> List[str]:
"""Split text with separator and keep the separator."""
parts = text.split(separator)
result = [separator + s if i > 0 else s for i, s in enumerate(parts)]
return result[1:] if len(result) > 0 and not result[0] else result |
Please CC @logan-markewich |
Feature Description
I've noticed that the current implementation includes many empty substring separators when splitting the text. Here is the existing implementation:
When using the function as shown below:
This includes multiple empty substring separators, which might not be the desired behavior.
Would it be better to adjust the function to filter out these empty strings by modifying the implementation to:
With this change, the output would be:
['\n\nhello', '\n\nworld', '\n\nhaha']
Is the inclusion of empty substring separators an intentional design choice? If not, would it be more efficient to use the revised function to prevent empty strings from being included in the result?
Reason
No response
Value of Feature
No response
The text was updated successfully, but these errors were encountered: