-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow sorting to improve FixedSizeBinary
filtering
#11170
Comments
Well that's embarrassing 🤦, I forgot In release mode:
Kind of even more surprising - Surely FWB should be faster as you don't need to calculate offsets? |
Well it keeps getting weirder. In release mode:
(Trying This is all very confusing. TL;DR; - @alamb if you were storing
In parquet to query with datafusion, and wanted it to be fast long term, what would you use? |
Okay last comment here (for now), I'll stop talking to myself. It seems that Times
|
I would have recommended using However I got broadly similar numbers to you in with the different types (and I agree Decimal128 looks quite good) I checked out https://github.com/samuelcolvin/datafusion-id-experiment and got an explain plan with metrics (ran FixedSizeBinaryselect * from simple_fixed_sorted where id=arrow_cast(decode('57f16cbaf865bcd9adcc71c03200fd60', 'hex'),
'FixedSizeBinary(16)')
Decimalselect * from decimal where id=arrow_cast('5714204269946304998258834512.6198419457', 'Decimal128(38, 10)')
I don't really have great insight as to why Decimal was better -- it may be because it is stored inline as |
What's weird is the behaviour with a decimal 128 is better than a uint64 when sorted. Is that that a fundamental side affect of the type, or some missing logic/optimisation? |
I suspect it is some missing optimization -- I don't know of any reason that fixed size binary would be less efficient than decimal. I double checked that FixedSizeBinary is also stored inline |
I think the problem is that here |
Thank you @samuelcolvin -- filed apache/arrow-rs#6153 to track |
Hi @samuelcolvin! I'm playing around your example code samuelcolvin/datafusion-id-experiment on my Mac M1 and I got empty results for all the queries. Click to see what I did and what I got
loading data from simple_fixed.parquet
loading data from simple_string.parquet
loading data from int.parquet
loading data from decimal.parquet
running:
select * from simple_fixed where id=arrow_cast(decode('57f16cbaf865bcd9adcc71c03200fd60', 'hex'), 'FixedSizeBinary(16)')
++
++
query time: 102.351167ms
running:
select * from simple_fixed where id=arrow_cast(decode('57f16cbaf865bcd9adcc71c03200fd60', 'hex'), 'FixedSizeBinary(16)')
++
++
query time: 81.322958ms
running:
select * from simple_fixed_sorted where id=arrow_cast(decode('57f16cbaf865bcd9adcc71c03200fd60', 'hex'), 'FixedSizeBinary(16)')
++
++
query time: 89.593416ms
running:
select * from simple_string where id='7wIBWI3Ol4njEVD8'
++
++
query time: 138.759125ms
running:
select * from simple_string where id='7wIBWI3Ol4njEVD8'
++
++
query time: 128.254459ms
running:
select * from simple_string_sorted where id='7wIBWI3Ol4njEVD8'
++
++
query time: 20.043959ms
running:
select * from int where id=1485542105725837362
++
++
query time: 77.541083ms
running:
select * from int where id=1485542105725837362
++
++
query time: 77.021583ms
running:
select * from int_sorted where id=1485542105725837362
++
++
query time: 59.955709ms
running:
select * from decimal where id=arrow_cast('5714204269946304998258834512.6198419457', 'Decimal128(38, 10)')
++
++
query time: 96.231125ms
running:
select * from decimal where id=arrow_cast('5714204269946304998258834512.6198419457', 'Decimal128(38, 10)')
++
++
query time: 105.458583ms
running:
select * from decimal_sorted where id=arrow_cast('5714204269946304998258834512.6198419457', 'Decimal128(38, 10)')
++
++
query time: 13.88675ms Is this what is expected? Anything did I miss? |
I was trying to see where the time was spent for This is what I observed: 98% of the time was spent on reading data from parquet files and 1.2% of the time was spent on querying (see screenshot below). So it is hard to tell whether the unsorted |
I was trying to see where the time was spent for This is what I observed: 98% of the time was spent on reading data from parquet files and 1.2% of the time was spent on querying (see screenshot below). So it is hard to tell whether the unsorted |
Thanks @appletreeisyellow -- I filed apache/arrow-rs#6219 upstream to track the idea of making this faster |
I might be misreading the profile but that has 90% of the time in the byte_array_reader compared to only 0.6% of the time in the fixed_len_byte_array_reader? |
Describe the bug
Is it a bug? Is it a feature? No one has every really known 🤷 .
We have an ID field which is really a u128, because arrow doesn't support u128, I experimented with making it a
DataType::FixedSizeBinary(16)
instead of a string.The performance differences when querying the column (just looking for a single row that matched a value) were quite surprising to me (updated, see below):
("FWB" ==
FixedSizeBinary(16)
)I assume the point is that
FixedSizeBinary
doesn't support using the known sort order to improve filtering?I assume (probably naively) that this shouldn't be too hard to add. Where would I start looking to add it?
To Reproduce
Example code: https://github.com/samuelcolvin/datafusion-id-experiment
Expected behavior
Ideally
FixedSizeBinary
look ups could benefit from known sort order in the same way thatLargeUtf8
does.Additional context
No response
The text was updated successfully, but these errors were encountered: