Add _approx_count_distinct as a builtin reduce function #1346
Conversation
COUCHDB-2971
commit ee32cd5825aaf63448651c9521f0927083d2281e
Author: Adam Kocoloski <[email protected]>
Date: Wed Mar 1 09:28:45 2017 -0500

    Add a cardinality estimator builtin reduce

    This introduces a _distinct builtin reduce function, which uses a
    HyperLogLog algorithm to estimate the number of distinct keys in the
    view index. The precision is currently fixed to 2^11 observables and
    therefore uses approximately 1.5 KB of memory.

    COUCHDB-2971

commit 5d18415237e7a01e1ac401607f7fc36b671bf640
Author: Adam Kocoloski <[email protected]>
Date: Thu Apr 28 15:12:44 2016 -0400

    Add a finalize step after rereduce

    Currently this is a no-op for every reduce function except the HLL
    cardinality estimator implemented in _distinct.

    COUCHDB-2971

commit f7c3c24db17db5908894447e1822936212d61fcd
Author: Adam Kocoloski <[email protected]>
Date: Wed Mar 1 10:37:00 2017 -0500

    Add _distinct as a built-in reduce

    COUCHDB-2971
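For reference, the ~1.5 KB figure in the first commit follows directly from the fixed precision. A back-of-the-envelope check in Python, assuming the classic 6-bit-register HLL layout (the hyper library's exact on-disk representation may differ):

```python
P = 11                  # precision fixed by this PR
m = 2 ** P              # 2048 registers ("observables")
bits_per_register = 6   # enough to hold ranks from a 64-bit hash
size_bytes = m * bits_per_register / 8
print(size_bytes)       # 1536.0, i.e. roughly 1.5 KB
```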
rebar.config.script (outdated)

@@ -61,6 +61,8 @@ DepDescs = [
     {tag, "v1.1.15"}, [raw]},
 %% Third party deps
 {folsom, "folsom", {tag, "CouchDB-0.8.2"}},
+{hyper, {url, "https://github.com/GameAnalytics/hyper.git"},
+    "4b1abc4284fc784f6def4f4928f715b0d33136f9"},
Now that we are fully read/write on GitHub, is it possible to fork this repo into Apache? Or is the requirement still to clone and push a copy (eliminating the edge in the network graph in the process)?
We are still required to go through code clearance and clone and push a copy. We recently did this for bcrypt support; check the history.
I'll note the infra piece of this is now self-service, assuming we vet the licensing and ensure we don't break our own buildchain. See https://gitbox.apache.org/ for the link.
I don't see any formal code clearance entry for bcrypt at http://incubator.apache.org/ip-clearance/, just @janl's commit to amend our overall LICENSE and NOTICE in 817b2b6. Guessing I can do the same here. The hyper code is fully MIT-licensed; there was an LGPL file in the C implementation, but this was expunged in GameAnalytics/hyper#16.
Done.
Now change rebar.config.script to point at our copy :)

You also shouldn't point at a specific hash, but rather a tagged release. Standard practice btw is to create an upstream branch and tag any upstream releases on that as X.Y.Z, saving the master branch for any local changes we might have to make. If we need to make a release with our customisations, use tags of the form COUCHDB-X.Y.Z instead.
Noticed a transitive dependency on bisect, which brings in a dependency on proper master, which fails to compile on R16. However, the good news is we don't use the bisect backend, just the default (binary) one. So I think we can make a branch on our version of hyper to exclude bisect and it should be good. Another option might be to avoid building the carray C module (since we don't use it), but that's optional.
Good catch @nickva, I totally missed that dependency. I've removed it now, and left carray to minimize the divergence of the fork from upstream.
Added some tests for this feature: #1364

A small PR to add hyper to the top level

Jira: COUCHDB-2971
So I realized we have one issue we need to discuss here: collation. The reduce function is inserting each key into the filter using a regular old term_to_binary serialization of the key.
My initial preference is for Option 2, but I wanted to bring this up before merging to see if others have strong opinions.
I am thinking of Option 2 with the documentation note. For Option 3, Erlang 20.0 has a bunch of new normalization functions, like http://erlang.org/doc/man/unicode.html#characters_to_nfc_binary-1. Maybe recursively transform each key to a normalized format before calling term_to_binary on it? Would that be enough? Even if it is, we would need to backport the module to Erlang versions < 20.0.
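That recursive normalization idea can be sketched as follows. This is a hedged illustration in Python rather than Erlang, using `unicodedata.normalize` as a stand-in for the Erlang 20 unicode functions mentioned above; the exact shapes a view key can take are an assumption here:

```python
import unicodedata

def normalize_key(key):
    """Recursively NFC-normalize every string in a JSON-style view key so
    that canonically equivalent keys serialize (and hash) identically."""
    if isinstance(key, str):
        return unicodedata.normalize("NFC", key)
    if isinstance(key, list):
        return [normalize_key(k) for k in key]
    if isinstance(key, dict):
        return {normalize_key(k): normalize_key(v) for k, v in key.items()}
    return key  # numbers, booleans, null pass through unchanged

# "é" as a single codepoint vs. "e" plus a combining acute accent:
composed = "caf\u00e9"
decomposed = "cafe\u0301"
```

Without normalization the two spellings above hash to different HLL registers and are counted as two distinct keys; after NFC normalization they collapse to one.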
This would be enough of a reason for me to start requiring Erlang 20, in say 2.3.0 or 3.0. It could be transparent to the user: if the function is available, use it and improve the accuracy of the response.

Very cool, I had forgotten about all of that core unicode work. Definitely a nice future enhancement.
Releases and dialyzer checks need app dependencies to work properly. Issue: apache#1346
FYI this broke the Windows build:

We really need to get Windows into the CI build farm somehow.
Since I faced the same issue, I started looking into fixing it. Could it be fixed with a simple alias, such as export cc=g++?
No, it's worse than that, sadly: there is some invalid pointer arithmetic and some C99 reserved words the MS compiler doesn't support. Fixing rebar.config is pretty easy; there's no need to alias CC=cc, just leave it out. I have this worked out already; I'll get to it tomorrow.
Overview
We’ve seen a number of applications now where a user needs to count the number of unique keys in a view. Currently the recommended approach is to add a trivial reduce function and then count the number of rows in a _list function or client-side application code, but of course that doesn’t scale nicely.
It seems that in a majority of these cases all that’s required is an approximation of the number of distinct entries, which brings us into the space of hash sets, linear probabilistic counters, and the ever-popular “HyperLogLog” algorithm. Taking HLL specifically, this seems like quite a nice candidate for a builtin reduce. The size of the data structure is independent of the number of input elements and individual HLL filters can be unioned together. There’s already what seems to be a good MIT-licensed implementation on GitHub:
https://github.com/GameAnalytics/hyper
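The HLL properties described above (a fixed-size register array, and unions computed register-wise) can be sketched in a few lines. This is a hypothetical, simplified Python illustration of the algorithm, not the hyper Erlang code; the hash function and bias-correction details are assumptions:

```python
import hashlib
import math

P = 11       # precision bits, as fixed in this PR
M = 1 << P   # 2048 registers; size is independent of input cardinality

def _hash(key):
    # 64-bit hash of the key (sha1 here purely for illustration)
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

def add(registers, key):
    h = _hash(key)
    idx = h >> (64 - P)                # first P bits pick a register
    rest = h & ((1 << (64 - P)) - 1)   # remaining 53 bits
    rank = (64 - P) - rest.bit_length() + 1  # leading zeros + 1
    registers[idx] = max(registers[idx], rank)

def union(a, b):
    # HLL filters merge losslessly by taking the register-wise maximum
    return [max(x, y) for x, y in zip(a, b)]

def estimate(registers):
    alpha = 0.7213 / (1 + 1.079 / M)   # bias correction for large M
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:       # small-range (linear counting) fix
        return M * math.log(M / zeros)
    return raw
```

The union step is what makes this viable as a reduce: partial sketches built over different key ranges can be combined at rereduce time without losing accuracy, and re-adding a key already seen leaves the sketch unchanged.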
This PR adds a new _approx_count_distinct function which uses the binary HLL implementation above. The precision is fixed to 11, which yields a relative error of about 2%. A future improvement would make this precision configurable in the design document.

Testing recommendations

Use _approx_count_distinct as a reduce function.

Related Issues or Pull Requests
See https://issues.apache.org/jira/browse/COUCHDB-2971 for history.
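The ~2% figure quoted above follows from the standard HyperLogLog error bound of 1.04/√m for m registers; a quick check (the 1.04 constant is the textbook value and is assumed to apply to hyper's estimator):

```python
import math

p = 11
m = 2 ** p                      # number of registers at precision 11
relative_error = 1.04 / math.sqrt(m)
print(f"{relative_error:.1%}")  # about 2.3%, i.e. the "2%" quoted above
```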