Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(rust): Treat splitting by empty string as iterating over chars #15922

Merged
merged 7 commits into from
May 1, 2024

Conversation

haocheng6
Copy link
Contributor

@haocheng6 haocheng6 commented Apr 26, 2024

Changes

This PR fixes the behavior of str.split, str.split_exact and str.splitn by treating splitting by an empty string as iterating over chars.

Closes #14604

@ritchie46
Copy link
Member

Can we see if we can support this without allocating a boxed iterator on every row?

@haocheng6
Copy link
Contributor Author

@ritchie46 Sure. I restored the signatures of split_to_struct and split_helper in b380968. Now there are no boxed trait objects.

@orlp
Copy link
Collaborator

orlp commented Apr 28, 2024

I think it's easier/faster if we modify split_to_struct instead of every single caller of it. That way we also don't have to do a filter over every single character in the string just to modify the edge case behavior.

@haocheng6
Copy link
Contributor Author

haocheng6 commented Apr 28, 2024

@orlp I refactored the code 9e91234. Now there is no call to filter, but for splitn, it still needs to use chars().count() to determine the right number of substrings when the pattern is empty. Is this the kind of thing you had in mind?

@orlp
Copy link
Collaborator

orlp commented Apr 29, 2024

@haocheng6 What you should do is when by == "" you should use s.splitn(n + 1, "").skip(1) instead of op which should automatically handle all cases.

Copy link

codecov bot commented Apr 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.95%. Comparing base (dee176c) to head (5bf814f).
Report is 31 commits behind head on main.

❗ Current head 5bf814f differs from pull request most recent head b6b24c7. Consider uploading reports for the commit b6b24c7 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15922      +/-   ##
==========================================
- Coverage   81.24%   80.95%   -0.29%     
==========================================
  Files        1382     1384       +2     
  Lines      176628   178145    +1517     
  Branches     3032     3043      +11     
==========================================
+ Hits       143494   144216     +722     
- Misses      32649    33446     +797     
+ Partials      485      483       -2     
Flag Coverage Δ
python 74.43% <100.00%> (-0.25%) ⬇️
rust 78.09% <100.00%> (-0.28%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link

codspeed-hq bot commented May 1, 2024

CodSpeed Performance Report

Merging #15922 will improve performances by 26.19%

Comparing haocheng6:14604-fix-str-split-by-empty-string (b6b24c7) with main (be09246)

Summary

⚡ 2 improvements
✅ 33 untouched benchmarks

Benchmarks breakdown

Benchmark main haocheng6:14604-fix-str-split-by-empty-string Change
test_filter2 2.8 ms 2.2 ms +24.41%
test_tpch_q15 7.8 ms 6.2 ms +26.19%

crates/polars-ops/src/chunked_array/strings/split.rs Outdated Show resolved Hide resolved
crates/polars-ops/src/chunked_array/strings/split.rs Outdated Show resolved Hide resolved
@orlp orlp merged commit 8929395 into pola-rs:main May 1, 2024
24 of 25 checks passed
@orlp
Copy link
Collaborator

orlp commented May 1, 2024

@haocheng6 Thanks for the PR!

@haocheng6 haocheng6 deleted the 14604-fix-str-split-by-empty-string branch May 1, 2024 13:44
Wouittone pushed a commit to Wouittone/polars that referenced this pull request Jun 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Bug fix rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

str.split by an empty string produces incorrect results
3 participants