
Add _approx_count_distinct as a builtin reduce function #1346

Merged

kocolosk merged 12 commits into master from 2971-count-distinct on Jun 6, 2018

Conversation

@kocolosk (Member) commented May 28, 2018

Overview

We’ve seen a number of applications now where a user needs to count the number of unique keys in a view. Currently the recommended approach is to add a trivial reduce function and then count the number of rows in a _list function or client-side application code, but of course that doesn’t scale nicely.

It seems that in a majority of these cases all that’s required is an approximation of the number of distinct entries, which brings us into the space of hash sets, linear probabilistic counters, and the ever-popular “HyperLogLog” algorithm. Taking HLL specifically, this seems like quite a nice candidate for a builtin reduce. The size of the data structure is independent of the number of input elements and individual HLL filters can be unioned together. There’s already what seems to be a good MIT-licensed implementation on GitHub:

https://github.com/GameAnalytics/hyper

This PR adds a new _approx_count_distinct function which uses the binary HLL implementation above. The precision is fixed at 11, which yields a relative error of ~2%. A future improvement would be to make this precision configurable in the design document.
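For intuition, here is a minimal Python sketch of the HyperLogLog idea — not the hyper library's actual code, and the hash function and register layout here are illustrative assumptions. Each hashed key is split into a register index and a "rank" (leading zeros in the remaining bits plus one); registers keep the maximum rank seen, and two filters union by taking the register-wise maximum, which is the property that makes HLL usable as a rereduce:

```python
import hashlib
import math

P = 11          # precision bits, matching the fixed setting in this PR
M = 1 << P      # 2048 registers

def hll_add(registers, key):
    """Insert a key into the register array."""
    h = int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")
    idx = h >> (64 - P)                      # top P bits choose a register
    rest = h & ((1 << (64 - P)) - 1)         # remaining 53 bits
    rank = (64 - P) - rest.bit_length() + 1  # leading zeros + 1
    registers[idx] = max(registers[idx], rank)

def hll_union(a, b):
    """Two filters merge by taking the register-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]

def hll_estimate(registers):
    """Standard HLL estimator with the small-range correction."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:             # fall back to linear counting
        return M * math.log(M / zeros)
    return raw
```

The union property mirrors what the view engine needs: partial reduce values from different branches of the view B-tree can be merged in any order without affecting the final estimate.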

Testing recommendations

  1. Generate a view with a known number of distinct keys.
  2. Add _approx_count_distinct as a reduce function.
  3. Compare the reduce output with the exact number of distinct keys. The result should be within 2%.
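Concretely, the steps above amount to a design document along these lines (the document and view names here are illustrative, not taken from the PR):

```json
{
  "_id": "_design/cardinality",
  "views": {
    "distinct_keys": {
      "map": "function (doc) { if (doc.tag) { emit(doc.tag, null); } }",
      "reduce": "_approx_count_distinct"
    }
  }
}
```

Querying the view with reduce enabled (the default) should then return the HLL estimate directly, rather than requiring a trivial reduce plus client-side row counting.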

Related Issues or Pull Requests

See https://issues.apache.org/jira/browse/COUCHDB-2971 for history.

Checklist

  • Code is written and works correctly;
  • Changes are covered by tests;
  • Documentation reflects the changes (separate PR forthcoming).

commit ee32cd5825aaf63448651c9521f0927083d2281e
Author: Adam Kocoloski <[email protected]>
Date:   Wed Mar 1 09:28:45 2017 -0500

    Add a cardinality estimator builtin reduce

    This introduces a _distinct builtin reduce function, which uses a
    HyperLogLog algorithm to estimate the number of distinct keys in the
    view index. The precision is currently fixed to 2^11 observables and
    therefore uses approximately 1.5 KB of memory.

    COUCHDB-2971
commit 5d18415237e7a01e1ac401607f7fc36b671bf640
Author: Adam Kocoloski <[email protected]>
Date:   Thu Apr 28 15:12:44 2016 -0400

    Add a finalize step after rereduce

    Currently this is a noop for every reduce function except the HLL
    cardinality estimator implemented in _distinct.

    COUCHDB-2971
commit f7c3c24db17db5908894447e1822936212d61fcd
Author: Adam Kocoloski <[email protected]>
Date:   Wed Mar 1 10:37:00 2017 -0500

    Add _distinct as a built-in reduce

    COUCHDB-2971
@@ -61,6 +61,8 @@ DepDescs = [
{tag, "v1.1.15"}, [raw]},
%% Third party deps
{folsom, "folsom", {tag, "CouchDB-0.8.2"}},
{hyper, {url, "https://github.com/GameAnalytics/hyper.git"},
"4b1abc4284fc784f6def4f4928f715b0d33136f9"},
@kocolosk (Member Author)

Now that we are fully read/write on GitHub, is it possible to fork this repo into Apache? Or is the requirement still to clone and push a copy (and eliminate the edge in the network graph in the process)?

Member

We are still required to go through code clearance and clone and push a copy. We recently did this for bcrypt support; check the history.

Member

I'll note the infra piece of this is now self-service assuming we vet the licensing and ensure we don't break our own buildchain.

See https://gitbox.apache.org/ for the link.

@kocolosk (Member Author)

I don't see any formal code clearance entry for bcrypt at http:https://incubator.apache.org/ip-clearance/, just @janl's commit to amend our overall LICENSE and NOTICE in 817b2b6. Guessing I can do the same here.

The hyper code is fully MIT-licensed; there was an LGPL file in the C implementation, but this was expunged in GameAnalytics/hyper#16.

@kocolosk (Member Author)

Done.

@wohali (Member)

Now change rebar.config.script to point at our copy :)

You also shouldn't point at a specific hash, but rather a tagged release.

Standard practice btw is to create an upstream branch and tag any upstream releases on that as X.Y.Z, saving the master branch for any local changes we might have to make. If we need to make a release with our customisations, use tags of the form COUCHDB-X.Y.Z instead.

@kocolosk (Member Author)

@wohali I already did all that in 22e8f97 ;) Your description of the standard practice sounds like a good one to me. In this case upstream doesn't actually have any tags (🤷‍♂️), so I picked X.Y.Z optimistically as 2.2.0 😄

@nickva (Contributor) commented May 30, 2018

Noticed a transitive dependency on bisect:

(HEAD detached at CouchDB-2.2.0) $ more rebar.config
{cover_enabled, true}.
{deps, [
        {bisect, "",
         {git, "https://github.com/knutin/bisect.git", {branch, "master"}}}
       ]}.

This pulls in proper's master branch as a dependency, which fails to compile on R16. The good news is that we don't use the bisect backend, just the default binary one. So I think we can make a branch on our version of hyper that excludes bisect and it should be good.

Another option might be to avoid building the carray C module (since we don't use it), but that's optional.
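A sketch of what the trimmed rebar.config on a CouchDB-maintained hyper branch might look like, assuming the bisect dep can simply be dropped because only the default binary backend is used:

```erlang
%% Hypothetical rebar.config for a CouchDB branch of hyper:
%% the bisect backend is unused, so its dep (and the transitive
%% proper master dependency) can be dropped entirely.
{cover_enabled, true}.
{deps, []}.
```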

@kocolosk (Member Author)

Good catch @nickva, I totally missed that dependency. I've removed it now, and left carray in place to minimize divergence from upstream.

@nickva (Contributor) commented May 31, 2018

Added some tests for this feature: #1364

@nickva (Contributor) commented May 31, 2018

A small PR to add hyper to top level .gitignore file: #1365

@kocolosk (Member Author) commented Jun 4, 2018

So I realized we have one issue we need to discuss here: collation. The reduce function inserts each key into the filter using a regular old term_to_binary(Key), so it will (incorrectly) treat two keys that are not identical but compare equal under ICU collation as distinct keys in its computation. I see a handful of paths we could take:

  1. Force the user to set "collation": "raw" in the design document options in order to use this reducer.
  2. Merge this feature as-is and hide behind the "approximate" nature of the computation, documenting that it may overestimate the number of distinct keys by more than the stated uncertainty in the presence of such keys.
  3. Compute a "canonical form" for each key and insert that into the filter instead. I don't even know if that's possible.

My initial preference is for Option 2 but I wanted to bring this up before merging to see if others have strong opinions.
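The over-counting behind Option 2, and the normalization fix behind Option 3, can be illustrated with the classic precomposed-vs-decomposed pair. This Python sketch (using Python's unicodedata as a stand-in for the Erlang normalization functions discussed below) shows two strings that collate as equal but differ as bytes, so a byte-based filter would count them twice:

```python
import unicodedata

composed = "\u00e9"      # é as a single precomposed code point
decomposed = "e\u0301"   # e followed by a combining acute accent

# ICU-style collation treats these as equal, but their byte
# representations differ, so a byte-keyed HLL sees two entries.
assert composed != decomposed
assert composed.encode("utf-8") != decomposed.encode("utf-8")

# Normalizing both to NFC before insertion (the Option 3 approach)
# makes the byte forms identical.
assert unicodedata.normalize("NFC", composed) == \
       unicodedata.normalize("NFC", decomposed)
```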

@nickva (Contributor) commented Jun 5, 2018

I am thinking of Option 2 with the documentation note.

For Option 3, Erlang 20.0 has a bunch of new normalization functions, e.g. http:https://erlang.org/doc/man/unicode.html#characters_to_nfc_binary-1. Maybe recursively transform each key to a normalized form before calling term_to_binary on it? Would that be enough? Even if it is, we'd need to backport the module to Erlang versions < 20.0.

@wohali (Member) commented Jun 5, 2018

For me this would be enough of a reason to start requiring Erlang 20, say in 2.3.0 or 3.0. It could be transparent to the user: if the function is available, use it and improve the accuracy of the response.

@kocolosk (Member Author) commented Jun 6, 2018

Very cool, I had forgotten about all of that core unicode work. Definitely a nice future enhancement.

@kocolosk merged commit 6d44e17 into master on Jun 6, 2018
@kocolosk deleted the 2971-count-distinct branch on Jun 6, 2018
nickva added a commit to cloudant/couchdb that referenced this pull request Jun 8, 2018
Releases and dialyzer checks need app dependencies to work properly

Issue: apache#1346
nickva added a commit that referenced this pull request Jun 8, 2018
Releases and dialyzer checks need app dependencies to work properly

Issue: #1346
@wohali (Member) commented Jul 16, 2018

FYI this broke the Windows build:

Compiling c:/relax/couchdb/src/hyper/c_src/hyper_carray.c
'cc' is not recognized as an internal or external command,
operable program or batch file.
ERROR: compile failed while processing c:/relax/couchdb/src/hyper: rebar_abort
make: *** [couch] Error 1

We really need to get Windows into the CI build farm somehow.

@wohali mentioned this pull request Jul 17, 2018
@popojargo (Member)

Since I faced the same issue, I started looking into a fix. Could it be fixed with a simple alias such as export cc=g++?

@wohali (Member) commented Jul 17, 2018

No, it's worse than that sadly; there is some invalid pointer arithmetic and some C99 reserved words the MS compiler doesn't support.

Fixing rebar.config is pretty easy; there's no need to alias CC=cc, just leave it out. I have this worked out already.

I'll get to it tomorrow.

4 participants