
perf(parquet): use optimized bloom filter #4441

Closed
wants to merge 9 commits into from

Conversation

ozgrakkurt

Closes #4213

Uses the bloom filter implementation from a different crate. This crate uses CPU-specific optimized implementations, so it is faster than the old implementation. Benchmarks are in this repo.

I am the author of the library that this PR uses.

There are no user-facing changes.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 21, 2023
@ozgrakkurt ozgrakkurt changed the title use optimized bloom filter perf(parquet): use optimized bloom filter Jun 21, 2023
Contributor

@alamb alamb left a comment


cc @jimexist who contributed the existing implementation -- perhaps you can take a look?

paste = { version = "1.0" }
sbbf-rs-safe = "0.2"
Contributor


Rather than using a new crate (that would add a new dependency and need to be maintained, etc) what do you think about inlining your implementation into this repository?

Author

@ozgrakkurt ozgrakkurt Jun 22, 2023


I would like to keep it separate since it can be used in a lot of other applications. There are a lot of bloom filter libraries for Rust, but none of them implement the split-block variant, as far as I know.

@alamb
Contributor

alamb commented Jun 22, 2023

cc @JayjeetAtGithub and @crepererum (as I think you may be interested in this area of parquet)

@ozgrakkurt
Author

ozgrakkurt commented Jun 22, 2023

I fixed the compilation issue on wasm and fixed fmt, but I'm not sure about the illegal instruction error on the mac CI; maybe it reports aarch64 but doesn't have Neon instructions? I found the illegal instruction was caused by using _mm_sllv_epi32 in the SSE implementation, and I'm working on fixing it.

@crepererum
Contributor

The results here look good and I think having this in a separate crate is totally OK. If this is really perf-critical for anyone, I could imagine the implementation getting reasonably complex (incl. assembly), which we probably don't want to have in core parquet-rs.

@ozgrakkurt
Author

The illegal instruction issue should be fixed; could you run the test again?

@Dandandan
Contributor

Rerunning

@ozgrakkurt
Author

I added CI for aarch64 and wasm, so anyone can check it to verify it indeed works on all target platforms: repo.
I also added an optimized wasm version that requires the nightly compiler and requires the user to manually enable the simd128 feature.

@jhorstmann
Contributor

I would actually expect the existing code to auto-vectorize and generate nearly the same assembly as with using intrinsics.

Maybe check needs a tiny refactoring to not do that early exit, though: https://rust.godbolt.org/z/GG6YqTena

Are there more optimizations in the crate that I'm missing?
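
For reference, a minimal sketch of a check written without the early exit (using the salt constants from the Parquet split-block bloom filter spec; this is illustrative, not the code from either implementation). Evaluating all eight lanes and folding the results lets the compiler map the loop onto SIMD compares:

```rust
/// Salt constants defined by the Parquet split-block bloom filter spec.
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

/// Check one 256-bit block. No early exit: every lane is evaluated and the
/// results are ANDed together, so the loop can vectorize to 8-lane operations.
fn check_block(block: &[u32; 8], hash: u32) -> bool {
    let mut all_set = true;
    for i in 0..8 {
        let mask = 1u32 << (hash.wrapping_mul(SALT[i]) >> 27);
        all_set &= (block[i] & mask) != 0;
    }
    all_set
}

fn main() {
    // A block with every bit set must report any probe as "maybe present".
    assert!(check_block(&[u32::MAX; 8], 0xdead_beef));
}
```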

@ozgrakkurt
Author

ozgrakkurt commented Jun 27, 2023 via email

@ozgrakkurt
Author

ozgrakkurt commented Jun 27, 2023

Also, the compiler can't enable AVX unless the code is compiled with target-cpu=native, which is very unlikely in cloud situations. But it can do pretty well with aarch64 and SSE, I think.

benchmarks on 4 AMD EPYC cores

parquet2 insert time: [21.260 ns 21.743 ns 22.242 ns]
change: [-6.7189% -4.0742% -1.4153%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
3 (3.00%) high mild
2 (2.00%) high severe

parquet insert time: [14.298 ns 14.501 ns 14.721 ns]
change: [-0.4325% +1.8520% +4.1458%] (p = 0.12 > 0.05)
No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) high mild
2 (2.00%) high severe

sbbf-rs insert time: [3.5763 ns 3.6755 ns 3.8263 ns]
change: [-0.4673% +3.0805% +7.1155%] (p = 0.11 > 0.05)
No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) high mild
2 (2.00%) high severe

parquet2 contains time: [6.1688 ns 6.2463 ns 6.3320 ns]
change: [-0.4251% +1.4917% +3.4157%] (p = 0.12 > 0.05)
No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
4 (4.00%) high mild
1 (1.00%) high severe

parquet contains time: [6.4125 ns 6.5060 ns 6.6049 ns]
change: [-1.6971% +0.1722% +1.9752%] (p = 0.86 > 0.05)
No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild

sbbf-rs contains time: [2.5523 ns 2.6273 ns 2.7069 ns]
change: [-3.0901% +1.4871% +6.1788%] (p = 0.53 > 0.05)
No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high mild

benchmarks on 4 AMD EPYC cores (with `target-cpu=native`)

parquet insert time: [8.9271 ns 9.1874 ns 9.4780 ns]
change: [-37.005% -34.474% -32.117%] (p = 0.00 < 0.05)
Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
11 (11.00%) high mild
3 (3.00%) high severe

sbbf-rs insert time: [3.5264 ns 3.5634 ns 3.6087 ns]
change: [-5.1266% -1.3047% +1.9364%] (p = 0.54 > 0.05)
No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
5 (5.00%) high mild
4 (4.00%) high severe

parquet2 contains time: [3.2520 ns 3.2906 ns 3.3314 ns]
change: [-48.555% -47.728% -46.877%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
6 (6.00%) high mild

parquet contains time: [3.1553 ns 3.1971 ns 3.2399 ns]
change: [-50.737% -49.948% -49.089%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe

sbbf-rs contains time: [2.2696 ns 2.3210 ns 2.3780 ns]
change: [-7.5938% -4.4491% -1.1179%] (p = 0.01 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
5 (5.00%) high mild
1 (1.00%) high severe

The sbbf-rs code also has the disadvantage of dynamic dispatch, so the code can't get inlined by the compiler. This actually has a big effect here, since it is a microbenchmark that only calls one function.
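
As a hypothetical sketch of why that matters (illustrative names, not sbbf-rs's actual API): runtime CPU-feature dispatch typically resolves a function pointer once and then calls through it, and that indirect call is an inlining barrier, which is exactly what a one-function microbenchmark magnifies:

```rust
// Hypothetical sketch of runtime CPU-feature dispatch through a function
// pointer; this is not the sbbf-rs implementation.
use std::sync::OnceLock;

type CheckFn = fn(&[u32; 8], u32) -> bool;

fn check_portable(block: &[u32; 8], hash: u32) -> bool {
    // Stand-in scalar check; the real code tests one bit per 32-bit lane.
    block.iter().all(|word| word & (1u32 << (hash & 31)) != 0)
}

#[cfg(target_arch = "x86_64")]
fn pick_impl() -> CheckFn {
    if std::is_x86_feature_detected!("avx2") {
        // A real library would return an AVX2-specialized function here;
        // the portable one stands in for it in this sketch.
        return check_portable;
    }
    check_portable
}

#[cfg(not(target_arch = "x86_64"))]
fn pick_impl() -> CheckFn {
    check_portable
}

fn check(block: &[u32; 8], hash: u32) -> bool {
    // Resolved once, then every call is indirect through the stored pointer,
    // so the compiler cannot inline the chosen implementation at call sites.
    static IMPL: OnceLock<CheckFn> = OnceLock::new();
    (*IMPL.get_or_init(pick_impl))(block, hash)
}

fn main() {
    assert!(check(&[u32::MAX; 8], 0x1234_5678));
}
```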

@Dandandan
Contributor

If autovectorization / AVX is a concern, we could also bring back feature detection using the multiversion crate.

I also know of a few (cloud) users who enable a specific CPU target, e.g. by setting target-cpu=skylake, as the instance type / CPU is often known in advance.

@ozgrakkurt
Author

ozgrakkurt commented Jun 27, 2023

If autovectorization / AVX is a concern, we could also bring back feature detection using the multiversion crate.

I also know of a few (cloud) users who enable a specific CPU target, e.g. by setting target-cpu=skylake, as the instance type / CPU is often known in advance.

With target-cpu=native on an AVX2-enabled CPU, the library is 3x as fast on insert and 1.4x as fast on contains. Also, the library code is very easy to use, without any requirement on CPU targets or anything like that.

Also, checking the old code, it seems like it loads the entire filter into memory, whereas both parquet2 and sbbf-rs keep the filter in memory as it is in the parquet file and work off that in the benchmark loop. So if we are doing something like load filter -> do 5 contains queries, it will have a big impact compared to the other implementations. Overall, it seems like it might be possible to make it perform very well with a bunch of optimizations, but at that point it is probably going to get more complicated than the sbbf-rs code.

@jhorstmann
Contributor

I also know of a few (cloud) users who enable a specific CPU target

We certainly do; especially in a cloud or server-side environment we want to make the best use of the available hardware. So our use case does not benefit from runtime dispatching, but I understand this might not be the main use case. I'm also a huge proponent of using the available instruction set, if necessary via intrinsics. In many cases, though, the compiler generates just as good assembly without specialized instructions. Sometimes this requires a bit of restructuring or careful use of unsafe, but simple loops like 0..8 usually work fine, and the Rust code is then more maintainable than the intrinsics.

I don't have a strong opinion on the bloom filter code; using a crate would also make sense instead of implementing it in the arrow project itself. On the other hand, minimizing dependencies to a small, well-known set is also a good goal and something that some commercial users might care about.

Regarding benchmarks, I think the existing arrow_writer benchmark includes code to enable bloom filters; it would be interesting to see the performance benefit in such a slightly bigger context.

@ozgrakkurt
Author

On aarch64

Running `cargo bench` under the parquet folder, first with the master branch and then with the sbbf branch:
write_batch primitive/4096 values primitive
                        time:   [861.79 µs 873.72 µs 885.48 µs]
                        thrpt:  [198.69 MiB/s 201.36 MiB/s 204.15 MiB/s]
                 change:
                        time:   [+2.1325% +3.2924% +4.6073%] (p = 0.00 < 0.05)
                        thrpt:  [-4.4044% -3.1875% -2.0880%]
                        Performance has regressed.
write_batch primitive/4096 values primitive with bloom filter
                        time:   [3.8160 ms 3.8739 ms 3.9339 ms]
                        thrpt:  [44.722 MiB/s 45.415 MiB/s 46.105 MiB/s]
                 change:
                        time:   [-32.210% -31.013% -29.875%] (p = 0.00 < 0.05)
                        thrpt:  [+42.603% +44.955% +47.515%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
write_batch primitive/4096 values primitive non-null
                        time:   [734.01 µs 739.56 µs 746.42 µs]
                        thrpt:  [231.12 MiB/s 233.27 MiB/s 235.03 MiB/s]
                 change:
                        time:   [+2.7312% +3.4277% +4.2323%] (p = 0.00 < 0.05)
                        thrpt:  [-4.0604% -3.3141% -2.6586%]
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
write_batch primitive/4096 values primitive non-null with bloom filter
                        time:   [3.3841 ms 3.4255 ms 3.4691 ms]
                        thrpt:  [49.729 MiB/s 50.362 MiB/s 50.978 MiB/s]
                 change:
                        time:   [-39.074% -38.127% -37.236%] (p = 0.00 < 0.05)
                        thrpt:  [+59.327% +61.621% +64.133%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild
write_batch primitive/4096 values bool
                        time:   [98.046 µs 98.965 µs 100.33 µs]
                        thrpt:  [10.570 MiB/s 10.716 MiB/s 10.816 MiB/s]
                 change:
                        time:   [+2.9694% +3.9493% +5.1195%] (p = 0.00 < 0.05)
                        thrpt:  [-4.8702% -3.7992% -2.8837%]
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
write_batch primitive/4096 values bool non-null
                        time:   [78.075 µs 79.162 µs 80.149 µs]
                        thrpt:  [7.1393 MiB/s 7.2282 MiB/s 7.3289 MiB/s]
                 change:
                        time:   [+0.0433% +2.4260% +5.4192%] (p = 0.07 > 0.05)
                        thrpt:  [-5.1406% -2.3685% -0.0432%]
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
write_batch primitive/4096 values string
                        time:   [403.61 µs 405.73 µs 408.72 µs]
                        thrpt:  [194.39 MiB/s 195.83 MiB/s 196.85 MiB/s]
                 change:
                        time:   [+1.4492% +2.3498% +3.5335%] (p = 0.00 < 0.05)
                        thrpt:  [-3.4129% -2.2959% -1.4285%]
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  8 (8.00%) high mild
  4 (4.00%) high severe
write_batch primitive/4096 values string with bloom filter
                        time:   [1.1055 ms 1.1382 ms 1.1729 ms]
                        thrpt:  [67.741 MiB/s 69.804 MiB/s 71.873 MiB/s]
                 change:
                        time:   [-37.247% -35.553% -33.810%] (p = 0.00 < 0.05)
                        thrpt:  [+51.079% +55.167% +59.354%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
write_batch primitive/4096 values string dictionary
                        time:   [229.68 µs 230.35 µs 231.20 µs]
                        thrpt:  [207.00 MiB/s 207.77 MiB/s 208.37 MiB/s]
                 change:
                        time:   [+0.2089% +1.0188% +2.0040%] (p = 0.02 < 0.05)
                        thrpt:  [-1.9646% -1.0085% -0.2085%]
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe
write_batch primitive/4096 values string dictionary with bloom filter
                        time:   [535.90 µs 545.12 µs 555.73 µs]
                        thrpt:  [86.120 MiB/s 87.796 MiB/s 89.305 MiB/s]
                 change:
                        time:   [-42.792% -40.776% -38.804%] (p = 0.00 < 0.05)
                        thrpt:  [+63.410% +68.850% +74.801%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
write_batch primitive/4096 values string non-null
                        time:   [483.04 µs 485.05 µs 487.48 µs]
                        thrpt:  [160.98 MiB/s 161.79 MiB/s 162.46 MiB/s]
                 change:
                        time:   [-2.2902% -1.5725% -0.8595%] (p = 0.00 < 0.05)
                        thrpt:  [+0.8669% +1.5976% +2.3439%]
                        Change within noise threshold.
Found 16 outliers among 100 measurements (16.00%)
  6 (6.00%) high mild
  10 (10.00%) high severe
write_batch primitive/4096 values string non-null with bloom filter
                        time:   [1.1462 ms 1.1802 ms 1.2178 ms]
                        thrpt:  [64.442 MiB/s 66.495 MiB/s 68.466 MiB/s]
                 change:
                        time:   [-39.081% -37.298% -35.342%] (p = 0.00 < 0.05)
                        thrpt:  [+54.661% +59.483% +64.152%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe

write_batch nested/4096 values primitive list
                        time:   [1.1292 ms 1.1307 ms 1.1322 ms]
                        thrpt:  [144.23 MiB/s 144.42 MiB/s 144.60 MiB/s]
                 change:
                        time:   [-1.3286% -0.8860% -0.4412%] (p = 0.00 < 0.05)
                        thrpt:  [+0.4431% +0.8939% +1.3465%]
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe
write_batch nested/4096 values primitive list non-null
                        time:   [1.4193 ms 1.4229 ms 1.4272 ms]
                        thrpt:  [133.14 MiB/s 133.54 MiB/s 133.88 MiB/s]
                 change:
                        time:   [+4.4098% +4.7487% +5.1370%] (p = 0.00 < 0.05)
                        thrpt:  [-4.8860% -4.5335% -4.2235%]
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe


So something like a 35% gain across the board.
I'll also test on an x86 machine in a little bit.

@ozgrakkurt
Author

ozgrakkurt commented Jun 27, 2023

About AVX in distributed binaries: ClickHouse/ClickHouse#40459.

One problem with compiler-generated assembly is that the compiler doesn't seem to generate a _mm256_testc_si256-style instruction for the check operation.

Not sure why insert is slower; it seems like it generates exactly the same thing, but it has a third of the performance for some reason.

Also probably, there is a huge loss because of this part: https://github.com/ozgrakkurt/sbbf-rs/blob/c63e9bbb4b4257d954c3c2a4333258f998ad10a0/benches/parquet_impl.rs#L139

It indexes into the slice, which might be generating branching (it actually does generate branching if I paste it into godbolt). I wouldn't trust the Rust compiler to optimize this kind of thing, even though it is supposed to be very good at this. In this case it doesn't understand the fast modulo reduction, so it thinks the index can go outside the slice.

So overall, if the version on master were optimized a bit more and the pattern with Sbbf::new were fixed, it would be as good as the intrinsic code, but even then the problem with AVX2 remains.
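
For reference, a minimal sketch of the bounds-check issue, assuming the usual multiply-shift block selection; this is illustrative, not the code on master or in sbbf-rs:

```rust
/// Multiply-shift ("fast modulo") block selection used by split-block bloom
/// filters: maps the top 32 bits of the hash into 0..num_blocks with no division.
fn block_index(hash: u64, num_blocks: usize) -> usize {
    (((hash >> 32) * num_blocks as u64) >> 32) as usize
}

fn check(blocks: &[[u32; 8]], hash: u64) -> bool {
    let i = block_index(hash, blocks.len());
    // The result is always < blocks.len() (it is a 32-bit fraction of it),
    // but the compiler cannot prove that, so `blocks[i]` keeps a bounds check
    // and the branch it implies.
    let block = &blocks[i];
    // ... per-block check elided ...
    block[0] != 0
}

fn check_unchecked(blocks: &[[u32; 8]], hash: u64) -> bool {
    let i = block_index(hash, blocks.len());
    // SAFETY: block_index always returns a value < blocks.len() as long as
    // blocks.len() fits in u32, so skipping the bounds check is sound.
    let block = unsafe { blocks.get_unchecked(i) };
    block[0] != 0
}

fn main() {
    let blocks = vec![[1u32; 8]; 16];
    let h = 0xdead_beef_0000_0000u64;
    assert_eq!(check(&blocks, h), check_unchecked(&blocks, h));
}
```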

self.0.index_mut(index)
}
}
use xxhash_rust::xxh64::xxh64;
Contributor


Looks like (on OSX) this is also part of the speed up.

Author


yeah, forgot about that

Author


Also, in my insert bench I also call contains on the parquet impl. But there is a big difference in the integration tests of writing parquet with bloom filters.

Author


I also realised I don't use the sbbf filter properly in this PR branch either. It allocates a new buffer when Sbbf::new is called, which ideally shouldn't be the case. It could just take a borrow of the bitset, assuming it is aligned to 64 bits.
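
A rough sketch of what a borrowing construction could look like (hypothetical types, not the actual Sbbf API), reinterpreting the already-read bitset bytes as blocks when alignment allows instead of copying them:

```rust
/// Hypothetical zero-copy view over an SBBF bitset already read from the
/// parquet file; not the Sbbf type in this crate.
struct SbbfView<'a> {
    blocks: &'a [[u32; 8]],
}

impl<'a> SbbfView<'a> {
    /// Borrows the bitset instead of copying it into a new buffer. Returns
    /// None if the bytes are not a whole number of 32-byte blocks or are not
    /// suitably aligned, in which case a copy would still be needed.
    fn new(bitset: &'a [u8]) -> Option<Self> {
        if bitset.len() % 32 != 0 {
            return None;
        }
        // SAFETY: [u32; 8] has no invalid bit patterns; align_to only places
        // correctly aligned data in the middle slice.
        let (head, blocks, tail) = unsafe { bitset.align_to::<[u32; 8]>() };
        if head.is_empty() && tail.is_empty() {
            Some(SbbfView { blocks })
        } else {
            None
        }
    }
}

fn main() {
    // 64 zero bytes = two 32-byte blocks (a Vec<u8>'s alignment is only
    // guaranteed to be 1, so the borrow may or may not succeed at runtime).
    let bytes = vec![0u8; 64];
    match SbbfView::new(&bytes) {
        Some(view) => println!("borrowed {} blocks without copying", view.blocks.len()),
        None => println!("alignment not satisfied, would need to copy"),
    }
}
```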

@Dandandan
Contributor

Can we also compare the performance of the current version to the fallback version in sbbf-rs-safe?
It looks like in https://github.com/ozgrakkurt/sbbf-rs/blob/master/src/arch/fallback/parquet2_impl.rs there are a couple of possible regressions, e.g. for insert.

@ozgrakkurt
Author

ozgrakkurt commented Jun 28, 2023

Can we also compare the performance of the current version to the fallback version in sbbf-rs-safe? It looks like in https://github.com/ozgrakkurt/sbbf-rs/blob/master/src/arch/fallback/parquet2_impl.rs there are a couple of possible regressions, e.g. for insert.

The fallback version is basically parquet2 in the benchmarks I posted. Let me change the fallback to be the implementation from parquet/master so there is little/no regression in the fallback case.

@ozgrakkurt
Author

ozgrakkurt commented Jun 28, 2023

@Dandandan I updated the fallback version and made a release; the code is here: https://github.com/ozgrakkurt/sbbf-rs/blob/master/src/arch/fallback/parquet_impl.rs

Not sure how it will perform, since I modified it to fit the API of the library.

It seems like it doesn't generate very good assembly for check_hash (https://rust.godbolt.org/z/5TK91ere5); this can be fixed with more unsafe code, but it is an edge case so it should be OK. The only fallback targets would be 32-bit x86 CPUs and embedded cases, I think. It even has an optimized version for wasm.

@Dandandan
Contributor

Thank you very much @ozgrakkurt. I think it makes sense to use the crate, as it seems we can't get similar performance relying on auto-vectorization alone in this case.

@ozgrakkurt
Author

Thank you very much @ozgrakkurt. I think it makes sense to use the crate, as it seems we can't get similar performance relying on auto-vectorization alone in this case.

no problem, thanks as well!

@tustvold
Contributor

tustvold commented Jun 29, 2023

I'm somewhat apprehensive about making our default impl be based off a crate with an unclear long-term maintenance story. Perhaps we could make sbbf-rs-safe optional, with it disabled by default and using the existing impl. This would allow people to explicitly opt in to a faster bloom filter if this is important to them and they're comfortable with the implications of this.
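
As a rough sketch of the opt-in shape being proposed (the feature name and module layout below are hypothetical, not this PR's actual changes), the crate would keep the existing implementation as the default and only compile the sbbf-rs-safe-backed one when the feature is enabled:

```rust
// Hypothetical sketch of an opt-in backend switch; the "sbbf" feature name
// and module layout are illustrative only.

#[cfg(feature = "sbbf")]
mod bloom {
    // Would wrap the sbbf-rs-safe crate here.
    pub fn contains(_bitset: &[u8], _hash: u64) -> bool {
        unimplemented!("sbbf-rs-safe-backed implementation")
    }
}

#[cfg(not(feature = "sbbf"))]
mod bloom {
    // The existing in-crate implementation stays the default.
    pub fn contains(_bitset: &[u8], _hash: u64) -> bool {
        unimplemented!("existing parquet implementation")
    }
}

fn main() {
    // Callers are unaffected: the same path resolves to whichever backend
    // was selected at compile time.
    let _ = bloom::contains;
}
```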

@alamb
Contributor

alamb commented Jun 29, 2023

I'm somewhat apprehensive about making our default impl be based off a crate with an unclear long-term maintenance story.

I am similarly apprehensive

Contributor

@tustvold tustvold left a comment


Having thought about this more, I think we need to make this an optional feature that people explicitly opt in to

@ozgrakkurt
Author

Having thought about this more, I think we need to make this an optional feature that people explicitly opt in to

done, can you check?

@ozgrakkurt ozgrakkurt closed this Jul 4, 2023
@ozgrakkurt ozgrakkurt deleted the sbbf branch July 4, 2023 22:09
Labels
parquet Changes to the parquet crate

Successfully merging this pull request may close these issues.

Use optimized implementation of bloom filter
6 participants