[Feature Request]: Handling Empty Substrings in `split_text_keep_separator` Function #14528

yewentao256 · 2024-07-03T02:05:39Z

Feature Description

I've noticed that the current implementation includes many empty substring separators when splitting the text. Here is the existing implementation:

def split_text_keep_separator(text: str, separator: str) -> List[str]:
    """Split text with separator and keep the separator at the end of each split."""
    parts = text.split(separator)
    result = [separator + s if i > 0 else s for i, s in enumerate(parts)]
    return [s for s in result if s]

When using the function as shown below:

print(split_text_keep_separator('\n\nhello\n\n\n\n\n\nworld\n\nhaha', '\n\n'))
['\n\nhello', '\n\n', '\n\n', '\n\nworld', '\n\nhaha']

This includes multiple empty substring separators, which might not be the desired behavior.

Would it be better to adjust the function to filter out these empty strings by modifying the implementation to:

def split_text_keep_separator(text: str, separator: str) -> List[str]:
    """Split text with separator and keep the separator."""
    parts = text.split(separator)
    result = [separator + s if i > 0 else s for i, s in enumerate(parts) if s]
    return result

With this change, the output would be:

['\n\nhello', '\n\nworld', '\n\nhaha']

Is the inclusion of empty substring separators an intentional design choice? If not, would it be more efficient to use the revised function to prevent empty strings from being included in the result?

Reason

No response

Value of Feature

No response

The text was updated successfully, but these errors were encountered:

logan-markewich · 2024-07-03T03:50:35Z

I think this would be undesirable. Splitting is only one step, there is also the step that merges back up to your chunk size, which is the part that makes this work

Furthermore, this is removing spatial information (like how the text was originally laid out), which LLMs are quite good at understanding

yewentao256 · 2024-07-03T06:23:02Z

Got it, then would it be beneficial to modify the function to check and handle only the first element of the result? I think only the first element can be empty.

def split_text_keep_separator(text: str, separator: str) -> List[str]:
    """Split text with separator and keep the separator."""
    parts = text.split(separator)
    result = [separator + s if i > 0 else s for i, s in enumerate(parts)]
    return result[1:] if len(result) > 0 and not result[0] else result

yewentao256 · 2024-07-05T01:22:58Z

Please CC @logan-markewich

yewentao256 added enhancement New feature or request triage Issue needs to be triaged/prioritized labels Jul 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature Request]: Handling Empty Substrings in `split_text_keep_separator` Function #14528

[Feature Request]: Handling Empty Substrings in `split_text_keep_separator` Function #14528

yewentao256 commented Jul 3, 2024

logan-markewich commented Jul 3, 2024 •

edited

Loading

yewentao256 commented Jul 3, 2024

yewentao256 commented Jul 5, 2024

[Feature Request]: Handling Empty Substrings in split_text_keep_separator Function #14528

[Feature Request]: Handling Empty Substrings in split_text_keep_separator Function #14528

Comments

yewentao256 commented Jul 3, 2024

Feature Description

Reason

Value of Feature

logan-markewich commented Jul 3, 2024 • edited Loading

yewentao256 commented Jul 3, 2024

yewentao256 commented Jul 5, 2024

[Feature Request]: Handling Empty Substrings in `split_text_keep_separator` Function #14528

[Feature Request]: Handling Empty Substrings in `split_text_keep_separator` Function #14528

logan-markewich commented Jul 3, 2024 •

edited

Loading