-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
questions on bottom-sketch #30
Comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Dear Anndrii,
I am writing to ask about the actually implementation of MinHash, there are 3 different implementations, k-min sketch, bottom-k sketch, which is:
a ‘bottom sketch’ strategy, originally proposed by Broder (1997), is commonly used to implement the MinHash algorithm, Instead of using m hash functions, all elements from a given set A are passed through a single hash function and the smallest m hash values (instead of elements) are stored in a sorted sketch S_b (A) of size m. The probability that sketch S_b(A), S_b(B) share a hash value represents the probability of random sampling a shared element from the union of set A and B. So, the resemblance of set A, B can be quickly estimated by counting the matched values between S_b(A) and S_b(B). The idea is mentioned here: http:https://www.cohenwang.org/edith/Surveys/minhash.pdf
This idea will increase the variance of the estimator right because only one hash function is used. More hash functions used, it will be more accurate. What to you think.
Thanks,
Jianshu
The text was updated successfully, but these errors were encountered: