Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A suggestion of code improvement for BloomFilter. #7346

Open
3 tasks done
longlong354 opened this issue Aug 5, 2024 · 6 comments
Open
3 tasks done

A suggestion of code improvement for BloomFilter. #7346

longlong354 opened this issue Aug 5, 2024 · 6 comments
Assignees
Labels
P3 type=enhancement Make an existing feature better

Comments

@longlong354
Copy link

longlong354 commented Aug 5, 2024

API(s)

com.google.common.hash.BloomFilter::optimalNumOfBits(long n, double p)
&
com.google.common.hash.BloomFilter::optimalNumOfHashFunctions(long n, long m)

How do you want it to be improved?

1. use static caculated value of log(2) and squared log(2) :

        private static final double LOG_TWO = Math.log(2);
	private static final double SQUARED_LOG_TWO = Math.pow(LOG_TWO,2);

2. calculate optimalNumOfBits by static values:

	static long optimalNumOfBits(long n, double p) {
	    if (p == 0) {
			p = Double.MIN_VALUE;
	    }
	    return (long) (-n * Math.log(p) / SQUARED_LOG_TWO);
	}

3. caculate optimalNumOfHashFunction by false positive rate(p) directly and using LOG_TWO :

     	static int optimalNumOfHashFunctions(double p){
		return  Math.max(1, (int) Math.round( - Math.log(p) / LOG_TWO));
	}

Why do we need it to be improved?

  1. Slightly reduce the method's execution time.
  2. Refine the calculation of the optimalNumOfHashFunctions method to be more concise and efficient. The current code may lead people to mistakenly believe that the result is related to the number of samples(n) and the number of bits(m), when in fact it is only related to the false positive rate(p).

Example

as the given codes inHow do you want it to be improved

Current Behavior

as the source codes in “How do you want it to be improved”

Desired Behavior

as the given codes in
com.google.common.hash.BloomFilter::optimalNumOfBits(long n, double p)
&
com.google.common.hash.BloomFilter::optimalNumOfHashFunctions(long n, long m)

Concrete Use Cases

as "How do you want it to be improved"

Checklist

@longlong354 longlong354 added the type=enhancement Make an existing feature better label Aug 5, 2024
@eamonnmcmanus
Copy link
Member

This appears to have been a deliberate change back in 2012. Maybe @kluever remembers the rationale.

@longlong354
Copy link
Author

Thanks for reply. My fault didn't express the main point clearly enough.
The main recommendation is using

optimalNumOfHashFunctions(double p)

instead of

optimalNumOfHashFunctions(long n, long m)

@Ayush-Thakur-geek
Copy link

Hey, @longlong354

So, you just want the usage of (double p) and not (long m, long n) as argument for bloomfilter.

@eamonnmcmanus
Copy link
Member

The situation now is roughly that we start from $n$, the expected number of entries, and $p$, the desired false positive probability, and we derive $m$, the optimal number of bits, as

$$ m = \lfloor {-n\ \ln\ p } / { (\ln\ 2)^2} \rfloor $$

Then we further derive $k$, the optimal number of hash functions, as

$$ k = (m / n)\ \ln\ 2 \approx {-n\ \ln\ p } / { \ln\ 2} $$

rounded to an integer. You are proposing instead

$$ k = { -\ln\ p } / { \ln\ 2 } $$

removing the factor of $n$. That's a different number. Are you saying that it's more accurate? Could you explain why?

@longlong354
Copy link
Author

longlong354 commented Aug 29, 2024

The derivation of the formula

k = (m / n) ln 2 ≈ -n ln p / ln 2

is inaccurate; it should be

k = (m / n) ln 2 ≈ - ln p / ln 2

Please kindly note that the bolded 'n' in the numerator cancels out with the 'n' in m( m = ⌊ − n ln ⁡ p / ( ln ⁡ 2 ) 2 ⌋ )"

ps:
refering to https://en.wikipedia.org/wiki/Bloom_filter#Optimal_number_of_hash_functions
also can be verified in https://krisives.github.io/bloom-calculator/ that n does not affect k

@longlong354
Copy link
Author

Hey, @longlong354

So, you just want the usage of (double p) and not (long m, long n) as argument for bloomfilter.

yep~

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P3 type=enhancement Make an existing feature better
Projects
None yet
Development

No branches or pull requests

4 participants