Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request]: Handling Empty Substrings in split_text_keep_separator Function #14528

Open
yewentao256 opened this issue Jul 3, 2024 · 3 comments
Labels
enhancement New feature or request triage Issue needs to be triaged/prioritized

Comments

@yewentao256
Copy link

Feature Description

I've noticed that the current implementation includes many empty substring separators when splitting the text. Here is the existing implementation:

def split_text_keep_separator(text: str, separator: str) -> List[str]:
    """Split text with separator and keep the separator at the end of each split."""
    parts = text.split(separator)
    result = [separator + s if i > 0 else s for i, s in enumerate(parts)]
    return [s for s in result if s]

When using the function as shown below:

print(split_text_keep_separator('\n\nhello\n\n\n\n\n\nworld\n\nhaha', '\n\n'))
['\n\nhello', '\n\n', '\n\n', '\n\nworld', '\n\nhaha']

This includes multiple empty substring separators, which might not be the desired behavior.

Would it be better to adjust the function to filter out these empty strings by modifying the implementation to:

def split_text_keep_separator(text: str, separator: str) -> List[str]:
    """Split text with separator and keep the separator."""
    parts = text.split(separator)
    result = [separator + s if i > 0 else s for i, s in enumerate(parts) if s]
    return result

With this change, the output would be:

['\n\nhello', '\n\nworld', '\n\nhaha']

Is the inclusion of empty substring separators an intentional design choice? If not, would it be more efficient to use the revised function to prevent empty strings from being included in the result?

Reason

No response

Value of Feature

No response

@yewentao256 yewentao256 added enhancement New feature or request triage Issue needs to be triaged/prioritized labels Jul 3, 2024
@logan-markewich
Copy link
Collaborator

logan-markewich commented Jul 3, 2024

I think this would be undesirable. Splitting is only one step, there is also the step that merges back up to your chunk size, which is the part that makes this work

Furthermore, this is removing spatial information (like how the text was originally laid out), which LLMs are quite good at understanding

@yewentao256
Copy link
Author

Got it, then would it be beneficial to modify the function to check and handle only the first element of the result? I think only the first element can be empty.

def split_text_keep_separator(text: str, separator: str) -> List[str]:
    """Split text with separator and keep the separator."""
    parts = text.split(separator)
    result = [separator + s if i > 0 else s for i, s in enumerate(parts)]
    return result[1:] if len(result) > 0 and not result[0] else result

@yewentao256
Copy link
Author

Please CC @logan-markewich

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request triage Issue needs to be triaged/prioritized
Projects
None yet
Development

No branches or pull requests

2 participants