questions on bottom-sketch #30

jianshu93 · 2023-09-21T23:08:41Z

Dear Anndrii,

I am writing to ask about the actually implementation of MinHash, there are 3 different implementations, k-min sketch, bottom-k sketch, which is:

a ‘bottom sketch’ strategy, originally proposed by Broder (1997), is commonly used to implement the MinHash algorithm, Instead of using m hash functions, all elements from a given set A are passed through a single hash function and the smallest m hash values (instead of elements) are stored in a sorted sketch S_b (A) of size m. The probability that sketch S_b(A), S_b(B) share a hash value represents the probability of random sampling a shared element from the union of set A and B. So, the resemblance of set A, B can be quickly estimated by counting the matched values between S_b(A) and S_b(B). The idea is mentioned here: http:https://www.cohenwang.org/edith/Surveys/minhash.pdf

This idea will increase the variance of the estimator right because only one hash function is used. More hash functions used, it will be more accurate. What to you think.

Thanks,

Jianshu

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

questions on bottom-sketch #30

questions on bottom-sketch #30

jianshu93 commented Sep 21, 2023

questions on bottom-sketch #30

questions on bottom-sketch #30

Comments

jianshu93 commented Sep 21, 2023