
Optimize Rust impls #108

Merged · 2 commits · Apr 8, 2024
Conversation

ChillFish8
Contributor

Related to #107

Optimizes the native implementations so that the compiler can actually vectorize them despite the IEEE floating-point rules.

Although it is not the simplest version, it is more representative of a 'native' implementation written for maximum speed, closer to what you would get by going down to intrinsics like AVX.
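A minimal sketch of the idea (not the exact code in this PR, which may differ in lane count and tail handling): summing into several independent accumulators breaks the serial dependency chain that strict IEEE evaluation order would otherwise impose, so the compiler is free to map each accumulator onto its own SIMD lane.

```rust
// Hypothetical multi-accumulator cosine similarity, illustrating the
// unrolling trick. Because the four accumulators never depend on one
// another, the compiler can vectorize the loop without being asked to
// relax floating-point semantics.
fn cosine_unrolled(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let (mut dot, mut na, mut nb) = ([0.0f32; 4], [0.0f32; 4], [0.0f32; 4]);
    let chunks = a.len() / 4;
    for i in 0..chunks {
        for l in 0..4 {
            let (x, y) = (a[i * 4 + l], b[i * 4 + l]);
            dot[l] += x * y;
            na[l] += x * x;
            nb[l] += y * y;
        }
    }
    // Scalar tail for lengths not divisible by 4.
    for i in chunks * 4..a.len() {
        dot[0] += a[i] * b[i];
        na[0] += a[i] * a[i];
        nb[0] += b[i] * b[i];
    }
    let (d, x, y) = (
        dot.iter().sum::<f32>(),
        na.iter().sum::<f32>(),
        nb.iter().sum::<f32>(),
    );
    d / (x.sqrt() * y.sqrt())
}
```

The key point is that a naive `iter().zip().map().sum()` produces one long chain of dependent additions, which the compiler must execute in order under default floating-point rules.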

@ashvardanian
Owner

Hi @ChillFish8! Thanks for your contribution!
Indeed, your loop-unrolled variant is much faster than the naive Rust approach, even the procedural code.

     Running rust/benches/cosine.rs (target/release/deps/cosine-e0cccefbe212a606)
Gnuplot not found, using plotters backend
SIMD Cosine/SimSIMD/0   time:   [91.178 ns 91.296 ns 91.444 ns]
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low severe
  4 (4.00%) high mild
  4 (4.00%) high severe
SIMD Cosine/Rust Procedural/0
                        time:   [793.02 ns 796.96 ns 802.25 ns]
Found 17 outliers among 100 measurements (17.00%)
  5 (5.00%) high mild
  12 (12.00%) high severe
SIMD Cosine/Rust Functional/0
                        time:   [794.70 ns 797.24 ns 801.14 ns]
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe
SIMD Cosine/Rust Unrolled/0
                        time:   [208.64 ns 209.64 ns 211.12 ns]
Found 14 outliers among 100 measurements (14.00%)
  5 (5.00%) high mild
  9 (9.00%) high severe

I am mostly working on recent CPUs, and on Intel Sapphire Rapids SimSIMD currently wins thanks to AVX-512 support. I wouldn't expect much difference for f32 on AVX2-only machines. For other types, it may be noticeable. Maybe it makes sense to add benchmarks for i8; the wins can be very noticeable there 🤗

@ashvardanian ashvardanian merged commit 508e7a0 into ashvardanian:main-dev Apr 8, 2024
36 checks passed
ashvardanian pushed a commit that referenced this pull request Apr 8, 2024
# [4.3.0](v4.2.2...v4.3.0) (2024-04-08)

### Add

* `toBinary` for JavaScript ([1f1fd3a](1f1fd3a))

### Improve

* Procedural Rust benchmarks ([e01ec6c](e01ec6c))
* Unrolled Rust benchmarks (#108) ([508e7a0](508e7a0)), closes [#108](#108)
@ashvardanian
Owner

🎉 This PR is included in version 4.3.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@ChillFish8
Contributor Author

@ashvardanian Do you have a rough idea of the performance difference between Intel Sapphire Rapids with AVX-512 and something like an AMD 7700 or EPYC chip? Just curious, since I develop mostly on AMD CPUs, which makes it a bit difficult to predict how performance goes on Intel chipsets.

@ashvardanian
Owner

@ChillFish8 on Zen4 most of AVX-512 is available, except for the FP16 extensions. Everything other than that should work great.

If you are on Zen3 or older, SimSIMD will use F16C extensions for FMA. They are quite slow, but still much better than serial code for half-precision, as modern compilers can't handle that type well. For single-precision you may not get any gains on older CPUs.

For int8, SimSIMD should work great on both old and new CPUs. That type is often used in heavily-quantized embedding models.
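An i8 kernel also illustrates why the gains can be large there: products and partial sums fit comfortably in i32, and integer addition is associative, so the compiler can reorder and vectorize the reduction freely without any of the floating-point ordering constraints. A hypothetical sketch (not SimSIMD's actual API or implementation):

```rust
// Hypothetical i8 cosine similarity. Accumulating in i32 avoids
// overflow (127 * 127 * len stays in range for realistic vector
// lengths), and since integer addition is associative the compiler
// may vectorize this reduction even without unrolling by hand.
fn cosine_i8(a: &[i8], b: &[i8]) -> f32 {
    assert_eq!(a.len(), b.len());
    let (mut dot, mut na, mut nb) = (0i32, 0i32, 0i32);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as i32, y as i32);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    dot as f32 / ((na as f32).sqrt() * (nb as f32).sqrt())
}
```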
