
Harley Seal AVX-512 implementations #138

Draft · wants to merge 1 commit into base branch `main-dev`
Conversation

ashvardanian
Owner

@ashvardanian ashvardanian commented Jun 11, 2024

Binary representations are becoming increasingly popular in Machine Learning, and I'd love to explore the opportunity for faster Hamming and Jaccard distance calculations. I've looked into several benchmarks, most importantly the WojciechMula/sse-popcount library, which compares several optimizations for population counts, the most expensive part of the Hamming/Jaccard kernel.
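To make the cost structure concrete: a plain scalar Hamming kernel is an XOR followed by a popcount per word, and the popcount is what all the variants below try to accelerate. A minimal sketch, not SimSIMD's actual kernel, using the GCC/Clang builtins:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar Hamming distance over packed bit-strings: XOR the 64-bit words,
 * then count the set bits. The XOR is trivial; the popcount dominates. */
static size_t hamming_scalar(uint8_t const *a, uint8_t const *b, size_t n_bytes) {
    size_t distance = 0, i = 0;
    for (; i + 8 <= n_bytes; i += 8) {
        uint64_t word_a, word_b;
        memcpy(&word_a, a + i, 8); /* unaligned-safe loads */
        memcpy(&word_b, b + i, 8);
        distance += (size_t)__builtin_popcountll(word_a ^ word_b); /* GCC/Clang builtin */
    }
    for (; i < n_bytes; ++i) /* byte-at-a-time tail */
        distance += (size_t)__builtin_popcount((unsigned)(a[i] ^ b[i]));
    return distance;
}
```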


Extensive benchmarks and the design itself suggest that the AVX-512 Harley-Seal variant should be the fastest on long inputs beyond 1 KB. Here is a sample of the most recent results obtained on an Intel Cannon Lake i3 CPU:

| procedure | 32 B | 64 B | 128 B | 256 B | 512 B | 1024 B | 2048 B | 4096 B |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| lookup-8 | 1.19464 | 1.09949 | 1.21245 | 1.11428 | 1.69827 | 1.65605 | 1.63299 | 1.62148 |
| lookup-64 | 1.16739 | 1.09284 | 1.19636 | 1.10018 | 1.69524 | 1.65319 | 1.63670 | 1.62359 |
| harley-seal | 1.00883 | 0.82805 | 0.51017 | 0.39659 | 0.54067 | 0.49312 | 0.46917 | 0.45787 |
| avx2-lookup | 0.45543 | 0.28456 | 0.20674 | 0.14150 | 0.18920 | 0.16951 | 0.15977 | 0.15527 |
| avx2-lookup-original | 1.53184 | 0.90269 | 0.61849 | 0.41858 | 0.34503 | 0.32416 | 0.23073 | 0.25976 |
| avx2-harley-seal | 1.03679 | 0.59198 | 0.37492 | 0.26418 | 0.20457 | 0.15556 | 0.13097 | 0.11904 |
| avx512-harley-seal | 3.36585 | 0.71542 | 0.40990 | 0.26028 | 0.29072 | 0.10719 | 0.07310 | 0.05560 |
| avx512bw-shuf | 2.56808 | 1.99008 | 1.04359 | 0.55736 | 0.48551 | 0.25119 | 0.20256 | 0.15851 |
| avx512vbmi-shuf | 2.51702 | 1.99085 | 1.09241 | 0.54717 | 0.49385 | 0.25181 | 0.20032 | 0.15249 |
| builtin-popcnt | 0.22182 | 0.28289 | 0.26755 | 0.31640 | 0.39424 | 0.38940 | 0.36062 | 0.33525 |
| builtin-popcnt32 | 0.46220 | 0.46701 | 0.51513 | 0.59160 | 0.89925 | 0.85613 | 0.84084 | 0.84065 |
| builtin-popcnt-unrolled | 0.25161 | 0.17290 | 0.14147 | 0.12966 | 0.20433 | 0.22086 | 0.20939 | 0.20628 |
| builtin-popcnt-movdq | 0.21983 | 0.18868 | 0.17849 | 0.18037 | 0.34305 | 0.31526 | 0.29713 | 0.29047 |

I've tried copying the best solution into the SimSIMD benchmarking suite, but sadly didn't achieve similar improvements on more recent CPUs. On Intel Sapphire Rapids:

```
-------------------------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
hamming_b8_haswell_4096b/min_time:10.000/threads:1       50.3 ns         50.3 ns    277340752 abs_delta=0 bytes=162.807G/s pairs=19.8739M/s relative_error=0
hamming_b8_ice_4096b/min_time:10.000/threads:1           34.8 ns         34.8 ns    402233197 abs_delta=0 bytes=235.632G/s pairs=28.7636M/s relative_error=0
hamming_b8_icehs_4096b/min_time:10.000/threads:1         42.4 ns         42.4 ns    330077077 abs_delta=0 bytes=193.07G/s pairs=23.5681M/s relative_error=0
```

On AMD Genoa:

```
-------------------------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
hamming_b8_haswell_4096b/min_time:10.000/threads:1       40.5 ns         40.5 ns    346163289 abs_delta=0 bytes=202.502G/s pairs=24.7195M/s relative_error=0
hamming_b8_ice_4096b/min_time:10.000/threads:1           40.6 ns         40.6 ns    344646420 abs_delta=0 bytes=201.733G/s pairs=24.6257M/s relative_error=0
hamming_b8_icehs_4096b/min_time:10.000/threads:1         59.8 ns         59.8 ns    234058579 abs_delta=0 bytes=136.96G/s pairs=16.7188M/s relative_error=0
```
  • The kernel designed for Haswell simply uses `_mm_popcnt_u64`.
  • The kernel designed for Ice Lake uses `_mm512_popcnt_epi64`.
  • The `icehs` kernel is an adaptation of the Harley-Seal transform that zips the two input streams with XOR.
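For readers unfamiliar with the transform: Harley-Seal is built around a carry-save adder (CSA), which compresses three input words into a "ones" word and a "twos" (carry) word so that a real popcount is only needed once per block of vectors. A minimal scalar sketch of the identity it exploits (illustrative, not the AVX-512 kernel itself):

```c
#include <assert.h>
#include <stdint.h>

/* Carry-save adder: compress three words into two while preserving
 *   pop(a) + pop(b) + pop(c) == pop(*ones) + 2 * pop(*twos).
 * The Harley-Seal main loop chains CSAs over blocks of 16 vectors,
 * amortizing the cost of the actual popcount instructions. */
static void csa(uint64_t a, uint64_t b, uint64_t c, uint64_t *ones, uint64_t *twos) {
    uint64_t partial = a ^ b;            /* bitwise sum of a and b */
    *ones = partial ^ c;                 /* bits set an odd number of times */
    *twos = (a & b) | (partial & c);     /* bits that carried into the 2s place */
}
```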

To reproduce the results:

```sh
cmake -DCMAKE_BUILD_TYPE=Release -DSIMSIMD_BUILD_TESTS=1 -DSIMSIMD_BUILD_BENCHMARKS=1 -DSIMSIMD_BUILD_BENCHMARKS_WITH_CBLAS=1 -B build_release
cmake --build build_release --config Release && build_release/simsimd_bench --benchmark_filter="hamming(.*)4096b"
```

Please let me know if there is a better way to accelerate this kernel 🤗

This commit adds the optimized Harley Seal kernel from
the `WojciechMula/sse-popcount` library to the benchmarking
suite to investigate optimization opportunities on Intel Sapphire
Rapids and AMD Genoa chips.
@ashvardanian ashvardanian added the help wanted Extra attention is needed label Jun 11, 2024
@@ -223,6 +221,182 @@ void vdot_f64c_blas(simsimd_f64_t const* a, simsimd_f64_t const* b, simsimd_size

#endif

namespace AVX512_harley_seal {

uint8_t lookup8bit[256] = {

@alexbowe commented Jun 11, 2024


Does it help to add const/constexpr? I wonder if it would encourage the table to be cached. It might also help to run a loop over it to pre-load it into cache (although I figure prefetching would most likely pull in the whole table on the first access).

In my own experiments in the past, I did find the built-in instructions to be faster than LUTs, however.
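For context, the constexpr variant the reviewer suggests could be sketched roughly as follows, assuming C++17; the `make_lookup8bit`/`lookup8bit_table` names are hypothetical and not part of the PR:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Compile-time 8-bit popcount table: initialized during compilation and
// placed in read-only storage, so there is no runtime setup and the
// compiler may fold lookups on constant indices.
constexpr std::array<std::uint8_t, 256> make_lookup8bit() {
    std::array<std::uint8_t, 256> table{};
    for (int i = 0; i < 256; ++i) {
        int bits = 0;
        for (int v = i; v != 0; v >>= 1) bits += v & 1; // count set bits of i
        table[i] = static_cast<std::uint8_t>(bits);
    }
    return table;
}

constexpr auto lookup8bit_table = make_lookup8bit();
static_assert(lookup8bit_table[0xFF] == 8, "all eight bits set");
```

Note that `constexpr` mainly guarantees the table lives in read-only storage with no dynamic initializer; whether it changes cache behavior at runtime is a separate question best answered by measurement.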

ashvardanian (Owner, Author)


Given the size of the inputs, the tail will never be evaluated separately. I've just copied that part of the code for completeness.

Comment on lines +292 to +298
```cpp
uint64_t lower_qword(const __m128i v) { return _mm_cvtsi128_si64(v); }

uint64_t higher_qword(const __m128i v) { return lower_qword(_mm_srli_si128(v, 8)); }

uint64_t simd_sum_epu64(const __m128i v) { return lower_qword(v) + higher_qword(v); }

uint64_t simd_sum_epu64(const __m256i v) {
```


I think modern compilers might inline these automatically in some cases, but marking them `inline` might encourage it (and could help with such small functions).


The changes I've suggested so far are just low-hanging fruit, though. Have you used profiling tools to find which lines of code each approach spends the most time in?

ashvardanian (Owner, Author)

Most time is spent in the main loop computing CSAs. Sadly, I can't access hardware performance counters on those machines.
