-
Notifications
You must be signed in to change notification settings - Fork 733
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement like/ilike etc for StringViewArray #5931
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does not leverage the inlined 4 bytes to shortcut some comparison. I plan to leave this as a future work as it can significantly complicate the code. (I also doubt that we can gain any benefit)
I agree we should only start adding specialized paths if we determine it is worth it. starts_with
might be, especially if the parameter is short enough to always be able to use the prefix. We can decide based on benchmarks which are appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @XiangpengHao -- this is a very clever PR (adds new features + test coverage while removing net lines of code)
(3) Since we are not smart at handling view types, I didn't make special test cases for StringViewArray, i.e., I did not deliberately include test cases with long/short strings. We rely on the correctness of the StringViewArray iterator, which has already been tested elsewhere.
Even though you are right that the implementation is the same, I still think it is important to add test coverage so that if the implementations change in the future the code will be better
For example I can easily imagine someone wanting to add the 'starts_with' optimization suggested by @wjones127
I think like
also has several prefix optimizations (for example, special casing matches like prefix%
).
So I think in order to complete this PR we need:
- Test coverage for short/long strings
- Benchmarks on this branch (I am running these now) to ensure no regressions
|
||
/// Returns true if all data within this array is ASCII | ||
pub fn is_ascii(&self) -> bool { | ||
// Alternative (but incorrect): directly check the underlying buffers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
I made a PR with some benchmarks for like here: #5936 What I suggest we do is get this PR in with good test coverage, and ensure no performance regressions, and then file at ticket to continue to improve the performance |
Benchmarks show no change:
|
Looks like two starts with get much slower... I'll take closer look 👀 |
I can't reproduce the results (I get many fluctuations on my machine). Looking at the code changes, there's no reason the new code could be slower. Can you benchmark them again? @alamb |
Added more tests to cover longer string cases |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reran the benchmarks and I still see a slow down with ilike/nilike and a scalar 🤔
ilike_utf8 scalar contains 1.00 1466.0±4.48µs ? ?/sec 1.00 1469.5±5.21µs ? ?/sec 1.11 1622.9±5.84µs ? ?/sec
alamb@aal-dev:~/arrow-rs$ critcmp master master-2 string-view-like
group master master-2 string-view-like
----- ------ -------- ----------------
ilike_utf8 scalar complex 1.02 310.8±1.01µs ? ?/sec 1.02 311.3±0.93µs ? ?/sec 1.00 303.9±1.51µs ? ?/sec
ilike_utf8 scalar contains 1.00 1466.0±4.48µs ? ?/sec 1.00 1469.5±5.21µs ? ?/sec 1.11 1622.9±5.84µs ? ?/sec
ilike_utf8 scalar ends with 1.00 246.7±0.52µs ? ?/sec 1.00 246.7±0.43µs ? ?/sec 1.00 246.9±0.70µs ? ?/sec
ilike_utf8 scalar equals 1.00 196.9±0.29µs ? ?/sec 1.00 197.2±0.41µs ? ?/sec 1.26 248.8±1.06µs ? ?/sec
ilike_utf8 scalar starts with 1.02 286.9±2.29µs ? ?/sec 1.00 282.4±0.71µs ? ?/sec 1.00 282.3±0.45µs ? ?/sec
ilike_utf8_scalar_dyn dictionary[10] string[4]) 1.00 49.0±0.15µs ? ?/sec 1.00 49.0±0.12µs ? ?/sec 1.00 49.0±0.14µs ? ?/sec
like_utf8 scalar complex 1.03 293.6±1.22µs ? ?/sec 1.04 295.1±1.53µs ? ?/sec 1.00 283.9±1.50µs ? ?/sec
like_utf8 scalar contains 1.00 347.2±1.41µs ? ?/sec 1.00 347.9±0.72µs ? ?/sec 1.01 349.6±0.85µs ? ?/sec
like_utf8 scalar ends with 1.00 209.1±0.56µs ? ?/sec 1.00 209.2±1.13µs ? ?/sec 1.00 209.1±0.26µs ? ?/sec
like_utf8 scalar equals 1.01 220.7±0.95µs ? ?/sec 1.01 220.7±0.73µs ? ?/sec 1.00 217.6±0.40µs ? ?/sec
like_utf8 scalar starts with 1.00 231.8±1.11µs ? ?/sec 1.00 231.4±1.47µs ? ?/sec 1.01 232.8±0.32µs ? ?/sec
like_utf8_scalar_dyn dictionary[10] string[4]) 1.00 49.1±0.15µs ? ?/sec 1.00 49.0±0.12µs ? ?/sec 1.00 49.1±0.16µs ? ?/sec
nilike_utf8 scalar complex 1.03 313.1±2.25µs ? ?/sec 1.03 312.3±6.04µs ? ?/sec 1.00 304.4±2.76µs ? ?/sec
nilike_utf8 scalar contains 1.00 1466.8±7.42µs ? ?/sec 1.00 1470.9±7.23µs ? ?/sec 1.11 1622.7±7.95µs ? ?/sec
nilike_utf8 scalar ends with 1.00 246.8±0.58µs ? ?/sec 1.00 246.6±0.51µs ? ?/sec 1.00 246.9±0.53µs ? ?/sec
nilike_utf8 scalar equals 1.00 197.0±0.45µs ? ?/sec 1.00 197.1±0.44µs ? ?/sec 1.26 248.7±0.65µs ? ?/sec
nilike_utf8 scalar starts with 1.01 286.2±0.54µs ? ?/sec 1.00 282.4±0.95µs ? ?/sec 1.00 282.2±0.29µs ? ?/sec
nlike_utf8 scalar complex 1.04 293.4±1.22µs ? ?/sec 1.04 293.9±3.01µs ? ?/sec 1.00 283.2±1.52µs ? ?/sec
nlike_utf8 scalar contains 1.00 346.9±0.55µs ? ?/sec 1.00 347.9±0.76µs ? ?/sec 1.01 349.7±0.67µs ? ?/sec
nlike_utf8 scalar ends with 1.00 209.5±2.23µs ? ?/sec 1.00 208.9±0.22µs ? ?/sec 1.00 209.2±0.82µs ? ?/sec
nlike_utf8 scalar equals 1.01 220.4±0.80µs ? ?/sec 1.02 221.3±0.94µs ? ?/sec 1.00 217.8±0.67µs ? ?/sec
nlike_utf8 scalar starts with 1.00 231.5±0.35µs ? ?/sec 1.00 231.7±4.16µs ? ?/sec 1.01 233.0±0.42µs ? ?/sec
array: &GenericStringArray<O>, | ||
negate: bool, | ||
) -> BooleanArray { | ||
pub fn evaluate_array<'i, T>(&self, array: T, negate: bool) -> BooleanArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually I think we should consider making some special variant of this function for StringViews (that can take advantage of the inlined views)
Maybe we could add some method to ArrayAccessor
for quickly checking prefix-equality 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That could work.
I think this pr is to give a reasonable baseline for string view implementation and unblock the string view for DF. And if we care enough later, we do need to copy-paste a lot of the code and make specialized comparsion for StringViewArray.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#5951 tracks potential optimizations
vec!["arrow", "arrow", "arrow", "arrow", "arrow", "arrows", "arrow", "arrow"], | ||
vec![ | ||
"arrow", | ||
"arrow_long_string_more than 12 bytes", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
I think the regression is related to the inline art discussed (and fixed) in #5922 (comment) We can wait that pr to merge and see if anything changes. |
I've updated the pr to include the changes from #5922. I benchmarked again and find some performance increase some decrease, but they all within 10%. |
I agree -- I reran the benchmarks and see some report got faster / slower
|
I filed #5951 to track potentially improving the performance |
Thanks again everyone for all the help |
Which issue does this PR close?
Part of apache/datafusion#11024 and #5374
Rationale for this change
Currently, the arrow-string crate does not support ops (like, ilike, etc.) against string view types. This pr implements them in a generic way.
What changes are included in this PR?
(1) Added a
StringArrayType
(could have a better name, or placed at a better location) traithttps://github.com/apache/arrow-rs/compare/master...XiangpengHao:string-view-like?expand=1#diff-14974bf6baecb88fa03010ef05c5927627a26bdeb74c3fe186ac2f9527bdca45R151
Arrays (currently,
StringArray
,LargeStringArray
,StringViewArray
) implement this trait will automatically gain the ability to all kinds of string operations (like
,ilike
,nilike
,contains
,starts_with
,ends_with
)This also means that we did not try to be smart when handling
StringViewArray
, i.e., does not leverage the inlined 4 bytes to shortcut some comparison. I plan to leave this as a future work as it can significantly complicate the code. (I also doubt that we can gain any benefit)(2) Overhauled the related test so that they no longer use the depreciated kernels and I believe now they look easier to reason.
(3) Since we are not smart at handling view types, I didn't make special test cases for
StringViewArray
, i.e., I did not deliberately include test cases with long/short strings. We rely on the correctness of the StringViewArray iterator, which has already been tested elsewhere.Are there any user-facing changes?