Allow sorting to improve `FixedSizeBinary` filtering #11170

samuelcolvin · 2024-06-28T18:56:38Z

Describe the bug

Is it a bug? Is it a feature? No one has every really known 🤷 .

We have an ID field which is really a u128, because arrow doesn't support u128, I experimented with making it a DataType::FixedSizeBinary(16) instead of a string.

The performance differences when querying the column (just looking for a single row that matched a value) were quite surprising to me (updated, see below):

unsorted string: 600ms
unsorted FWB: 350ms
sorted string: 56ms
sorted FWB: 350ms

("FWB" == FixedSizeBinary(16))

I assume the point is that FixedSizeBinary doesn't support using the known sort order to improve filtering?

I assume (probably naively) that this shouldn't be too hard to add. Where would I start looking to add it?

To Reproduce

Example code: https://github.com/samuelcolvin/datafusion-id-experiment

Expected behavior

Ideally FixedSizeBinary look ups could benefit from known sort order in the same way that LargeUtf8 does.

Additional context

No response

The text was updated successfully, but these errors were encountered:

samuelcolvin · 2024-06-29T09:34:26Z

Well that's embarrassing 🤦, I forgot --release.

In release mode:

unsorted string: 82ms
unsorted FWB: 82ms
sorted string: 11ms
sorted FWB: 82ms

Kind of even more surprising - Surely FWB should be faster as you don't need to calculate offsets?

samuelcolvin · 2024-06-29T11:21:40Z

Well it keeps getting weirder.

In release mode:

unsorted string: 82ms
unsorted FWB: 82ms
unsorted UInt64: 51ms
sorted string: 11ms
sorted FWB: 82ms
sorted UInt64: 39ms

(Trying UInt64 was the first step towards using a struct of two UInt64, but it seems unlike that would be as fast as a string right now)

This is all very confusing.

TL;DR; - @alamb if you were storing

A 16-byte array with at least one non-zero byte. ref

In parquet to query with datafusion, and wanted it to be fast long term, what would you use?

samuelcolvin · 2024-06-29T12:33:46Z

Okay last comment here (for now), I'll stop talking to myself.

It seems that Decimal128 is the best option for our case (we can rewrite it to look like hex and be queried with hex):

Times (unsorted, sorted):

DataType::FixedSizeBinary(16) - (51, 57)
DataType::LargeUtf8 - (81, 10)
DataType::UInt64 - (52, 36)
DataType::Decimal128(38, 10) - (57, 7)

alamb · 2024-06-30T10:59:19Z

In parquet to query with datafusion, and wanted it to be fast long term, what would you use?

I would have recommended using FixedSizeBinary as you have done (and in fact I believe @hiltontj is doing something like this internall at InfluxData at the moment).

However I got broadly similar numbers to you in with the different types (and I agree Decimal128 looks quite good)

I checked out https://github.com/samuelcolvin/datafusion-id-experiment and got an explain plan with metrics (ran EXPLAIN ANALYZE {sql}):

FixedSizeBinary

select * from simple_fixed_sorted where id=arrow_cast(decode('57f16cbaf865bcd9adcc71c03200fd60', 'hex'),
 'FixedSizeBinary(16)')

+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=0, elapsed_compute=3.172µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
|                   |   FilterExec: id@0 = 87,241,108,186,248,101,188,217,173,204,113,192,50,0,253,96, metrics=[output_rows=0, elapsed_compute=2.207689ms]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|                   |     ParquetExec: file_groups={16 groups: [[Users/andrewlamb/Downloads/datafusion-id-experiment/simple_fixed.parquet:0..2375388], [Users/andrewlamb/Downloads/datafusion-id-experiment/simple_fixed.parquet:2375388..4750776], [Users/andrewlamb/Downloads/datafusion-id-experiment/simple_fixed.parquet:4750776..7126164], [Users/andrewlamb/Downloads/datafusion-id-experiment/simple_fixed.parquet:7126164..9501552], [Users/andrewlamb/Downloads/datafusion-id-experiment/simple_fixed.parquet:9501552..11876940], ...]}, projection=[id, name], predicate=id@0 = 87,241,108,186,248,101,188,217,173,204,113,192,50,0,253,96, pruning_predicate=CASE WHEN id_null_count@2 = id_row_count@3 THEN false ELSE id_min@0 <= 87,241,108,186,248,101,188,217,173,204,113,192,50,0,253,96 AND 87,241,108,186,248,101,188,217,173,204,113,192,50,0,253,96 <= id_max@1 END, required_guarantees=[id in (87,241,108,186,248,101,188,217,173,204,113,192,50,0,253,96)], metrics=[output_rows=1000000, elapsed_compute=16ns, bytes_scanned=38037954, file_open_errors=0, row_groups_matched_statistics=1, row_groups_pruned_statistics=0, num_predicate_creation_errors=0, file_scan_errors=0, predicate_evaluation_errors=0, row_groups_pruned_bloom_filter=0, page_index_rows_filtered=0, row_groups_matched_bloom_filter=0, pushdown_rows_filtered=0, time_elapsed_opening=24.008333ms, time_elapsed_scanning_total=68.525251ms, time_elapsed_processing=75.557418ms, pushdown_eval_time=32ns, time_elapsed_scanning_until_data=10.007876ms, page_index_eval_time=218.687µs] |

Decimal

select * from decimal where id=arrow_cast('5714204269946304998258834512.6198419457', 'Decimal128(38, 10)')

+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=0, elapsed_compute=2.492µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
|                   |   FilterExec: id@0 = Some(57142042699463049982588345126198419457),38,10, metrics=[output_rows=0, elapsed_compute=422.896µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|                   |     ParquetExec: file_groups={16 groups: [[Users/andrewlamb/Downloads/datafusion-id-experiment/decimal.parquet:0..2376546], [Users/andrewlamb/Downloads/datafusion-id-experiment/decimal.parquet:2376546..4753092], [Users/andrewlamb/Downloads/datafusion-id-experiment/decimal.parquet:4753092..7129638], [Users/andrewlamb/Downloads/datafusion-id-experiment/decimal.parquet:7129638..9506184], [Users/andrewlamb/Downloads/datafusion-id-experiment/decimal.parquet:9506184..11882730], ...]}, projection=[id, name], predicate=id@0 = Some(57142042699463049982588345126198419457),38,10, pruning_predicate=CASE WHEN id_null_count@2 = id_row_count@3 THEN false ELSE id_min@0 <= Some(57142042699463049982588345126198419457),38,10 AND Some(57142042699463049982588345126198419457),38,10 <= id_max@1 END, required_guarantees=[id in (Some(57142042699463049982588345126198419457),38,10)], metrics=[output_rows=1000000, elapsed_compute=16ns, bytes_scanned=38054970, file_open_errors=0, row_groups_matched_statistics=1, row_groups_pruned_statistics=0, num_predicate_creation_errors=0, file_scan_errors=0, predicate_evaluation_errors=0, row_groups_pruned_bloom_filter=0, page_index_rows_filtered=0, row_groups_matched_bloom_filter=0, pushdown_rows_filtered=0, time_elapsed_opening=4.448333ms, time_elapsed_scanning_total=60.640037ms, time_elapsed_processing=56.689831ms, pushdown_eval_time=32ns, time_elapsed_scanning_until_data=7.254625ms, page_index_eval_time=17.975µs] |
|                   |

I don't really have great insight as to why Decimal was better -- it may be because it is stored inline as i128 values (rather than out of line).

samuelcolvin · 2024-06-30T12:13:13Z

What's weird is the behaviour with a decimal 128 is better than a uint64 when sorted. Is that that a fundamental side affect of the type, or some missing logic/optimisation?

alamb · 2024-07-01T15:34:16Z

What's weird is the behaviour with a decimal 128 is better than a uint64 when sorted. Is that that a fundamental side affect of the type, or some missing logic/optimisation?

I suspect it is some missing optimization -- I don't know of any reason that fixed size binary would be less efficient than decimal.

I double checked that FixedSizeBinary is also stored inline

samuelcolvin · 2024-07-26T21:24:36Z

I think the problem is that here arrow-rs has special cases for primative types and "byte types" e.g. strings, but no special case for FixedSizeBinary. Not sure how much of the slow down is due to that, but could be significant.

alamb · 2024-07-29T16:50:15Z

I think the problem is that here arrow-rs has special cases for primative types and "byte types" e.g. strings, but no special case for FixedSizeBinary. Not sure how much of the slow down is due to that, but could be significant.

Thank you @samuelcolvin -- filed apache/arrow-rs#6153 to track

appletreeisyellow · 2024-08-07T00:06:41Z

Hi @samuelcolvin! I'm playing around your example code samuelcolvin/datafusion-id-experiment on my Mac M1 and I got empty results for all the queries.

Click to see what I did and what I got

cargo build --release
./target/release/union-scratch
Got the following:

loading data from simple_fixed.parquet
loading data from simple_string.parquet
loading data from int.parquet
loading data from decimal.parquet
running:
select * from simple_fixed where id=arrow_cast(decode('57f16cbaf865bcd9adcc71c03200fd60', 'hex'), 'FixedSizeBinary(16)')
++
++
query time: 102.351167ms


running:
select * from simple_fixed where id=arrow_cast(decode('57f16cbaf865bcd9adcc71c03200fd60', 'hex'), 'FixedSizeBinary(16)')
++
++
query time: 81.322958ms


running:
select * from simple_fixed_sorted where id=arrow_cast(decode('57f16cbaf865bcd9adcc71c03200fd60', 'hex'), 'FixedSizeBinary(16)')
++
++
query time: 89.593416ms


running:
select * from simple_string where id='7wIBWI3Ol4njEVD8'
++
++
query time: 138.759125ms


running:
select * from simple_string where id='7wIBWI3Ol4njEVD8'
++
++
query time: 128.254459ms


running:
select * from simple_string_sorted where id='7wIBWI3Ol4njEVD8'
++
++
query time: 20.043959ms


running:
select * from int where id=1485542105725837362
++
++
query time: 77.541083ms


running:
select * from int where id=1485542105725837362
++
++
query time: 77.021583ms


running:
select * from int_sorted where id=1485542105725837362
++
++
query time: 59.955709ms


running:
select * from decimal where id=arrow_cast('5714204269946304998258834512.6198419457', 'Decimal128(38, 10)')
++
++
query time: 96.231125ms


running:
select * from decimal where id=arrow_cast('5714204269946304998258834512.6198419457', 'Decimal128(38, 10)')
++
++
query time: 105.458583ms


running:
select * from decimal_sorted where id=arrow_cast('5714204269946304998258834512.6198419457', 'Decimal128(38, 10)')
++
++
query time: 13.88675ms

Is this what is expected? Anything did I miss?

appletreeisyellow · 2024-08-08T21:30:50Z

I was trying to see where the time was spent for FixedSizeBinary, so I used Mac's Instrument CPU Profiler to run a CPU profiling on the benchmark / example code: samuelcolvin/datafusion-id-experiment

This is what I observed: 98% of the time was spent on reading data from parquet files and 1.2% of the time was spent on querying (see screenshot below). So it is hard to tell whether the unsorted FixSizeBinary is slower or not with the current example code.

appletreeisyellow · 2024-08-08T21:35:42Z

I was trying to see where the time was spent for FixedSizeBinary, so I used Mac's Instrument CPU Profiler to run a CPU profiling on the benchmark / example code: samuelcolvin/datafusion-id-experiment

This is what I observed: 98% of the time was spent on reading data from parquet files and 1.2% of the time was spent on querying (see screenshot below). So it is hard to tell whether the unsorted FixSizeBinary is slower or not with the current example code. Maybe other people can try a different approach to measure where the time was spent

alamb · 2024-08-09T16:23:24Z

Thanks @appletreeisyellow -- I filed apache/arrow-rs#6219 upstream to track the idea of making this faster

tustvold · 2024-08-09T17:14:58Z

I might be misreading the profile but that has 90% of the time in the byte_array_reader compared to only 0.6% of the time in the fixed_len_byte_array_reader?

samuelcolvin added the bug Something isn't working label Jun 28, 2024

alamb mentioned this issue Jul 29, 2024

Implement filter kernel specially for FixedSizeByteArray apache/arrow-rs#6153

Closed

alamb mentioned this issue Jul 29, 2024

Enable split_file_groups_by_statistics by default #10336

Open

alamb mentioned this issue Aug 9, 2024

Look into optimizing reading FixedSizeBinary arrays from parquet apache/arrow-rs#6219

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow sorting to improve `FixedSizeBinary` filtering #11170

Allow sorting to improve `FixedSizeBinary` filtering #11170

samuelcolvin commented Jun 28, 2024 •

edited

Loading

samuelcolvin commented Jun 29, 2024

samuelcolvin commented Jun 29, 2024

samuelcolvin commented Jun 29, 2024

alamb commented Jun 30, 2024

samuelcolvin commented Jun 30, 2024

alamb commented Jul 1, 2024 •

edited

Loading

samuelcolvin commented Jul 26, 2024

alamb commented Jul 29, 2024

appletreeisyellow commented Aug 7, 2024 •

edited

Loading

appletreeisyellow commented Aug 8, 2024

appletreeisyellow commented Aug 8, 2024 •

edited

Loading

alamb commented Aug 9, 2024

tustvold commented Aug 9, 2024 •

edited

Loading

Allow sorting to improve FixedSizeBinary filtering #11170

Allow sorting to improve FixedSizeBinary filtering #11170

Comments

samuelcolvin commented Jun 28, 2024 • edited Loading

Describe the bug

To Reproduce

Expected behavior

Additional context

samuelcolvin commented Jun 29, 2024

samuelcolvin commented Jun 29, 2024

samuelcolvin commented Jun 29, 2024

alamb commented Jun 30, 2024

FixedSizeBinary

Decimal

samuelcolvin commented Jun 30, 2024

alamb commented Jul 1, 2024 • edited Loading

samuelcolvin commented Jul 26, 2024

alamb commented Jul 29, 2024

appletreeisyellow commented Aug 7, 2024 • edited Loading

appletreeisyellow commented Aug 8, 2024

appletreeisyellow commented Aug 8, 2024 • edited Loading

alamb commented Aug 9, 2024

tustvold commented Aug 9, 2024 • edited Loading

Allow sorting to improve `FixedSizeBinary` filtering #11170

Allow sorting to improve `FixedSizeBinary` filtering #11170

samuelcolvin commented Jun 28, 2024 •

edited

Loading

alamb commented Jul 1, 2024 •

edited

Loading

appletreeisyellow commented Aug 7, 2024 •

edited

Loading

appletreeisyellow commented Aug 8, 2024 •

edited

Loading

tustvold commented Aug 9, 2024 •

edited

Loading