Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prime Number Functions #400

Draft
wants to merge 117 commits into
base: develop
Choose a base branch
from
Draft

Conversation

mborland
Copy link
Member

An initial addition of two prime number functions:

prime_sieve is a linear prime sieve algorithm. Currently benchmarks O(n).

prime_range returns all the prime numbers in the range [lower_bound, upper_bound]. For the first 1000 primes it can use the already built in lookup tables; outside of that it will call prime_sieve. This is the function intended for end users to call.

Future additions would include a Miller-Rabin primality test.

@jzmaddock
Copy link
Collaborator

FYI there's a Miller Rabin test in Multiprecision already.

@NAThompson
Copy link
Collaborator

@mborland : OMG thank you; we've needed a prime sieve for so long.

@NAThompson
Copy link
Collaborator

@mborland : I would definitely change the name from prime_functions.hpp to prime_sieve.hpp; the first is not very memorable.

@NAThompson
Copy link
Collaborator

@mborland : IIRC, the Joy of Factoring also creates bitsets which set prime at bit i and has it zeroed at composites. Is this a useful API?

@mborland
Copy link
Member Author

@NAThompson That sounds like wheel factorization. That method is added to the sieve of Eratosthenes, but would not be applicable here.

@NAThompson
Copy link
Collaborator

Would it be more ergonomic to have prime_range use a back_inserter, but prime_sieve behave more like std::fill?

std::vector<int64_t> primes(1000); // gimme 1000 primes
prime_sieve(primes.begin(), primes.end());

(I need to stay in my lane here; @jeremy-murphy is much better at ergonomics.)

@NAThompson
Copy link
Collaborator

NAThompson commented Jul 11, 2020

@mborland : I have begun a performance comparison with Kim Walish's prime sieve. Here is the code:

#include <vector>
#include <boost/math/special_functions/prime_sieve.hpp>
#include <benchmark/benchmark.h>
#include <primesieve.hpp>

template <class Z>
void prime_sieve(benchmark::State& state)
{
    Z upper = static_cast<Z>(state.range(0));
    for(auto _ : state)
    {
        std::vector<Z> primes;
        benchmark::DoNotOptimize(boost::math::prime_sieve(static_cast<Z>(2), upper, std::back_inserter(primes)));
    }
    state.SetComplexityN(state.range(0));
}
BENCHMARK_TEMPLATE(prime_sieve, int64_t)->RangeMultiplier(2)->Range(1 << 1, 1 << 22)->Complexity();

template <class Z>
void kimwalish_primes(benchmark::State& state)
{

    Z upper = static_cast<Z>(state.range(0));
    for (auto _ : state)
    {
        std::vector<Z> primes;
        primesieve::generate_primes(upper, &primes);
        benchmark::DoNotOptimize(primes.back());
    }
    state.SetComplexityN(state.range(0));
}
BENCHMARK_TEMPLATE(kimwalish_primes, int64_t)->RangeMultiplier(2)->Range(1 << 1, 1 << 22)->Complexity();

BENCHMARK_MAIN();

and the results:

Run on (16 X 2300 MHz CPU s)
CPU Caches:
  L1 Data 32K (x8)
  L1 Instruction 32K (x8)
  L2 Unified 262K (x8)
  L3 Unified 16777K (x1)
Load Average: 2.49, 1.81, 1.78
----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
prime_sieve<int64_t>/2                   348 ns          348 ns      1939182
prime_sieve<int64_t>/4                   463 ns          462 ns      1467524
prime_sieve<int64_t>/8                   603 ns          602 ns      1174654
prime_sieve<int64_t>/16                  717 ns          717 ns       956781
prime_sieve<int64_t>/32                  907 ns          907 ns       732693
prime_sieve<int64_t>/64                 1146 ns         1146 ns       604673
prime_sieve<int64_t>/128                1410 ns         1410 ns       494574
prime_sieve<int64_t>/256                2190 ns         2190 ns       319042
prime_sieve<int64_t>/512                3838 ns         3838 ns       186498
prime_sieve<int64_t>/1024               6769 ns         6769 ns       103041
prime_sieve<int64_t>/2048              12757 ns        12756 ns        54359
prime_sieve<int64_t>/4096              25295 ns        25291 ns        26912
prime_sieve<int64_t>/8192              50898 ns        50880 ns        13661
prime_sieve<int64_t>/16384            101938 ns       101936 ns         6707
prime_sieve<int64_t>/32768            201957 ns       201936 ns         3455
prime_sieve<int64_t>/65536            396735 ns       396709 ns         1754
prime_sieve<int64_t>/131072           779986 ns       779933 ns          864
prime_sieve<int64_t>/262144          1534550 ns      1534233 ns          446
prime_sieve<int64_t>/524288          3020453 ns      3019390 ns          228
prime_sieve<int64_t>/1048576         6089458 ns      6089187 ns          107
prime_sieve<int64_t>/2097152        14131817 ns     14130295 ns           44
prime_sieve<int64_t>/4194304        30953282 ns     30952476 ns           21
prime_sieve<int64_t>_BigO               0.33 NlgN       0.33 NlgN
prime_sieve<int64_t>_RMS                   9 %             9 %
kimwalish_primes<int64_t>/2              276 ns          276 ns      2263102
kimwalish_primes<int64_t>/4              280 ns          280 ns      2617321
kimwalish_primes<int64_t>/8              279 ns          279 ns      2461625
kimwalish_primes<int64_t>/16             289 ns          289 ns      2526273
kimwalish_primes<int64_t>/32             296 ns          296 ns      2348788
kimwalish_primes<int64_t>/64             287 ns          287 ns      2397589
kimwalish_primes<int64_t>/128            357 ns          357 ns      2088954
kimwalish_primes<int64_t>/256            368 ns          368 ns      1894662
kimwalish_primes<int64_t>/512            459 ns          459 ns      1601636
kimwalish_primes<int64_t>/1024          2177 ns         2176 ns       319871
kimwalish_primes<int64_t>/2048          2535 ns         2534 ns       274488
kimwalish_primes<int64_t>/4096          3377 ns         3377 ns       206241
kimwalish_primes<int64_t>/8192          4527 ns         4525 ns       155072
kimwalish_primes<int64_t>/16384         7015 ns         7014 ns        98151
kimwalish_primes<int64_t>/32768        10982 ns        10981 ns        63122
kimwalish_primes<int64_t>/65536        19631 ns        19627 ns        35826
kimwalish_primes<int64_t>/131072       35359 ns        35356 ns        19769
kimwalish_primes<int64_t>/262144       66530 ns        66523 ns        10449
kimwalish_primes<int64_t>/524288      126772 ns       126769 ns         4532
kimwalish_primes<int64_t>/1048576     246035 ns       245968 ns         2837
kimwalish_primes<int64_t>/2097152     477212 ns       477177 ns         1450
kimwalish_primes<int64_t>/4194304    1109480 ns      1109200 ns          621
kimwalish_primes<int64_t>_BigO          0.01 NlgN       0.01 NlgN
kimwalish_primes<int64_t>_RMS             11 %            11 %

So it looks like there is 1-2 orders of magnitude of performance improvement left in the boost implementation; presumably we need to find it. . .

@mborland
Copy link
Member Author

@NAThompson It would be fairly easy to create light wrappers to the current prime_sieve implementation like in the uni-variate statistics library.

I will have to do some digging to see where more performance can be squeezed out.

@NAThompson
Copy link
Collaborator

It would be fairly easy to create light wrappers to the current prime_sieve implementation like in the uni-variate statistics library.

Nah, if it's not obviously a better way to do it, I'm not interested in it.

@NAThompson
Copy link
Collaborator

@mborland : Also, make sure to tag your commit messages with [CI SKIP] until we are about to merge.

@NAThompson
Copy link
Collaborator

NAThompson commented Jul 11, 2020

I just went through The Joy of Factoring to refresh my memory on how these sieves work. I implemented algorithm 8.2 of that book, which the author claims runs in O(J*log(log(J)) time. He also references P.A. Pritchard who has given an algorithm which runs in O(J/log(log(J)) time; see A sublinear additive sieve for finding prime numbers, Comm. ACM 24, 1981.

In any case, this is how my naive implementation of algorithm 8.2 looks:

// Copyright 2020 Matt Borland
//
// Use, modification and distribution are subject to the
// Boost Software License, Version 1.0.
// (See accompanying file LICENSE_1_0.txt
// or copy at https://www.boost.org/LICENSE_1_0.txt)

#include <vector>
#include <boost/math/special_functions/prime_sieve.hpp>
#include <benchmark/benchmark.h>
#include <primesieve.hpp>

template <class Z>
void prime_sieve(benchmark::State& state)
{
    Z upper = static_cast<Z>(state.range(0));
    for(auto _ : state)
    {
        std::vector<Z> primes;
        benchmark::DoNotOptimize(boost::math::prime_sieve(static_cast<Z>(2), upper, std::back_inserter(primes)));
    }
    state.SetComplexityN(state.range(0));
}
BENCHMARK_TEMPLATE(prime_sieve, int64_t)->RangeMultiplier(2)->Range(1 << 1, 1 << 22)->Complexity();

template <class Z>
void kimwalish_primes(benchmark::State& state)
{

    Z upper = static_cast<Z>(state.range(0));
    for (auto _ : state)
    {
        std::vector<Z> primes;
        primesieve::generate_primes(upper, &primes);
        benchmark::DoNotOptimize(primes.back());
    }
    state.SetComplexityN(state.range(0));
}
BENCHMARK_TEMPLATE(kimwalish_primes, int64_t)->RangeMultiplier(2)->Range(1 << 1, 1 << 22)->Complexity();


template<typename Z>
std::vector<bool> joy_of_factoring_bitset(const Z J)
{
    using std::sqrt;
    using std::log;
    // A vector of bools behaves like a bitset, but with runtime size:
    std::vector<bool> P(J, true);
    P[0] = false;
    P[1] = false;
    Z p = 2;
    Z root_j = sqrt(J);
    while (p <= root_j) {
        Z i = p + p;
        while (i <= J) {
            P[i] = false;
            i = i + p;
        }
        i = p + 1;
        while (i <= root_j && P[i] == false) {
            i += 1;   
        }
        p = i;
    }
    return P;   
}

template<typename Z>
std::vector<Z> joy_of_factoring_primes(const Z J)
{
    auto P = joy_of_factoring_bitset(J);
    std::vector<Z> primes;
    if (J >= 355991) {
        double inv_log = 1/log(J);
        size_t N = J*inv_log*(1+inv_log + 2.51*inv_log*inv_log);
        primes.reserve(N);
    }
    else if (J>=55) {
        double n = static_cast<double>(J)/(log(J) - 4.0);
        primes.reserve(static_cast<size_t>(n));
    }
    primes.emplace_back(2);
    for (size_t i = 3; i < J; i += 2) {
        if (P[i]) {
            primes.emplace_back(i);   
        }
    }
    return primes;
}

template <class Z>
void joy_of_factoring_bm(benchmark::State& state)
{
    
    Z upper = static_cast<Z>(state.range(0));
    for (auto _ : state)
    {
        benchmark::DoNotOptimize(joy_of_factoring_primes(upper));
    }
    state.SetComplexityN(state.range(0));
}
BENCHMARK_TEMPLATE(joy_of_factoring_bm, size_t)->RangeMultiplier(2)->Range(1 << 1, 1 << 22)->Complexity();

BENCHMARK_MAIN();

And the results:

prime_sieve<int64_t>/2                     352 ns          351 ns      1951056
prime_sieve<int64_t>/4                     470 ns          470 ns      1481710
prime_sieve<int64_t>/8                     597 ns          597 ns      1129615
prime_sieve<int64_t>/16                    738 ns          738 ns       953198
prime_sieve<int64_t>/32                    936 ns          936 ns       762660
prime_sieve<int64_t>/64                   1145 ns         1145 ns       616909
prime_sieve<int64_t>/128                  1407 ns         1407 ns       495898
prime_sieve<int64_t>/256                  2247 ns         2246 ns       314915
prime_sieve<int64_t>/512                  3715 ns         3713 ns       189649
prime_sieve<int64_t>/1024                 6664 ns         6663 ns       103458
prime_sieve<int64_t>/2048                12745 ns        12741 ns        54189
prime_sieve<int64_t>/4096                25364 ns        25360 ns        27329
prime_sieve<int64_t>/8192                51157 ns        51151 ns        13472
prime_sieve<int64_t>/16384              102583 ns       102567 ns         6739
prime_sieve<int64_t>/32768              203596 ns       203550 ns         3428
prime_sieve<int64_t>/65536              399037 ns       399001 ns         1753
prime_sieve<int64_t>/131072             781683 ns       781580 ns          876
prime_sieve<int64_t>/262144            1538995 ns      1538545 ns          448
prime_sieve<int64_t>/524288            3095005 ns      3094665 ns          230
prime_sieve<int64_t>/1048576           6326961 ns      6326208 ns          101
prime_sieve<int64_t>/2097152          14024588 ns     14024205 ns           44
prime_sieve<int64_t>/4194304          31189328 ns     31180333 ns           21
prime_sieve<int64_t>_BigO                 0.33 NlgN       0.33 NlgN
prime_sieve<int64_t>_RMS                     9 %             9 %
kimwalish_primes<int64_t>/2                279 ns          279 ns      2509536
kimwalish_primes<int64_t>/4                289 ns          289 ns      2494788
kimwalish_primes<int64_t>/8                282 ns          282 ns      2466786
kimwalish_primes<int64_t>/16               313 ns          313 ns      2333162
kimwalish_primes<int64_t>/32               293 ns          293 ns      2271474
kimwalish_primes<int64_t>/64               304 ns          304 ns      2286663
kimwalish_primes<int64_t>/128              372 ns          372 ns      2002437
kimwalish_primes<int64_t>/256              383 ns          383 ns      1830046
kimwalish_primes<int64_t>/512              444 ns          443 ns      1595158
kimwalish_primes<int64_t>/1024            2178 ns         2177 ns       318499
kimwalish_primes<int64_t>/2048            2557 ns         2556 ns       271694
kimwalish_primes<int64_t>/4096            3363 ns         3363 ns       206716
kimwalish_primes<int64_t>/8192            4591 ns         4591 ns       152819
kimwalish_primes<int64_t>/16384           6976 ns         6974 ns        97478
kimwalish_primes<int64_t>/32768          11104 ns        11103 ns        62514
kimwalish_primes<int64_t>/65536          19537 ns        19534 ns        35619
kimwalish_primes<int64_t>/131072         35760 ns        35756 ns        19317
kimwalish_primes<int64_t>/262144         66588 ns        66586 ns        10404
kimwalish_primes<int64_t>/524288        128041 ns       128034 ns         5385
kimwalish_primes<int64_t>/1048576       245994 ns       245973 ns         2815
kimwalish_primes<int64_t>/2097152       481273 ns       481164 ns         1445
kimwalish_primes<int64_t>/4194304      1110850 ns      1110702 ns          618
kimwalish_primes<int64_t>_BigO            0.01 NlgN       0.01 NlgN
kimwalish_primes<int64_t>_RMS               11 %            11 %
joy_of_factoring_bm<size_t>/2             66.1 ns         66.1 ns     10413103
joy_of_factoring_bm<size_t>/4              252 ns          252 ns      2807457
joy_of_factoring_bm<size_t>/8              385 ns          385 ns      1901306
joy_of_factoring_bm<size_t>/16             508 ns          507 ns      1370265
joy_of_factoring_bm<size_t>/32             618 ns          618 ns      1111023
joy_of_factoring_bm<size_t>/64             238 ns          238 ns      2947071
joy_of_factoring_bm<size_t>/128            334 ns          334 ns      2148722
joy_of_factoring_bm<size_t>/256            532 ns          532 ns      1259151
joy_of_factoring_bm<size_t>/512           1027 ns         1027 ns       662553
joy_of_factoring_bm<size_t>/1024          2490 ns         2490 ns       281698
joy_of_factoring_bm<size_t>/2048          5211 ns         5211 ns       130022
joy_of_factoring_bm<size_t>/4096         10454 ns        10454 ns        65775
joy_of_factoring_bm<size_t>/8192         20807 ns        20805 ns        33289
joy_of_factoring_bm<size_t>/16384        41352 ns        41351 ns        16687
joy_of_factoring_bm<size_t>/32768        82184 ns        82180 ns         8354
joy_of_factoring_bm<size_t>/65536       161913 ns       161879 ns         4242
joy_of_factoring_bm<size_t>/131072      320842 ns       320755 ns         2170
joy_of_factoring_bm<size_t>/262144      640485 ns       640433 ns         1073
joy_of_factoring_bm<size_t>/524288     1289366 ns      1289235 ns          536
joy_of_factoring_bm<size_t>/1048576    2589004 ns      2588625 ns          269
joy_of_factoring_bm<size_t>/2097152    5304918 ns      5302623 ns          130
joy_of_factoring_bm<size_t>/4194304   11621459 ns     11620593 ns           59
joy_of_factoring_bm<size_t>_BigO          0.12 NlgN       0.12 NlgN
joy_of_factoring_bm<size_t>_RMS              5 %             5 %

Algorithm 8.2 is actually competitive with Kim Walish for smaller N but starts to lag for larger N.

@mborland
Copy link
Member Author

@NAThompson I am seeing a similar result using a segmented sieve. Currently faster than kimwalish, and joy_of_factoring_bm from your post until 524288. After that kimwalish pulls ahead.

@NAThompson
Copy link
Collaborator

NAThompson commented Jul 11, 2020

Just ran this benchmark under perf:

kimwalish

So Kim Walish's prime generator does indeed compute logarithms and fill up a table of primes; quite a bit of time in malloc; quite a bit of time in fill.

However if you look at the Joy of Factoring algorithm, it spends way more time in the fill:

joy_of_factoring

So there are some opportunities there; you'll see that I'm only checking odd numbers after 2, there's probably some way to extend that to making the fill faster.

@mborland
Copy link
Member Author

@NAThompson I have found some performance improvements using C-style arrays instead of std::vector::reserve() which makes sense given that data.

@jzmaddock
Copy link
Collaborator

For initializing with a specific pattern what do you think about using the gaps in the MOD 30 wheel as the specific pattern to set?

I think that for the other optimisations to work, and for the wheel to be in synch with the underlying bytes, we have to use a compacted data structure - one that contains only the 8 values in the wheel that form the prime candidates on each rotation (ie the spokes of the wheel).

Soronson in this paper: https://www.researchgate.net/publication/2274357_Trading_Time_for_Space_in_Prime_Number_Sieves has some very lucid descriptions of the various sieves and how to index a compacted sieve (see "reducing space with a wheel" in the above ref). Note that this is normally much slower than a straight SoE as the inner loop become much more complex. However, once the small and medium prime optimisations I listed above are applied they are all effectively the same complexity, and now just differ in how far they skip forward with each loop / wheel turn.

Another way to look at this: my odd's only sieve is actually a wheel sieve with just one prime (2) in the wheel. So you get 16 numbers and 8 candidate primes per byte. A 2,3,5 wheel has 8 candidates per wheel turn (30), so we would normally start with all bits 1, and each byte represents the 8 possible primes in each block of 30 values of a full wheel turn. Now lets suppose you take that data structure and sieve out 7, 11 and 13 as well. Now you have a repeating bit pattern 71113 bytes long you can use to initialize the sieve, you can then start sieving at 17 rather than at 7 as you normally would.

I hope that makes sense, but maybe I misunderstood you?

@mborland
Copy link
Member Author

Each turn of the wheel being represented in one 8 bit byte since there are only 8 possible primes using the 2,3,5 wheel was what I was getting at. Let me see if I can get this working with the 2,3,5 wheel and then expanding to a larger wheel should be easy enough. In prime_wheel.hpp the class Wheel is used to compute and print the patterns into the terminal so I can copy and paste them into the pre-computed wheels (MOD30Wheel and MOD210Wheel) that are actually used.

@mborland
Copy link
Member Author

@jzmaddock I am making some good headway on your list of additions and optimizations. In commit eafbefc all multiples of the wheel basis in the bitset have been removed so each turn of the wheel now takes only 8 bits to represent. The implementation passes all unit tests, but the sieve is many times slower than it was, especially when using boost::multiprecision types. When you have a chance can you take a look? Thanks.

@jzmaddock
Copy link
Collaborator

Apologies for being late back here - I'll take a look as soon as I can, but I would expect the code to be significantly slower until the sieving optimisations are in place. Sorenson notes this in his paper, the reason being that indexing into the sieve is now much more complex until the optimisations are applied after which all the sieves (compressed or not, wheel or not) are equal. Hope this makes sense!

@mborland mborland marked this pull request as draft December 12, 2020 08:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants