
Add _approx_count_distinct as a builtin reduce function #1346

Merged

kocolosk merged 12 commits into master from 2971-count-distinct on Jun 6, 2018

Conversation

@kocolosk (Member) commented May 28, 2018

Overview

We’ve seen a number of applications now where a user needs to count the number of unique keys in a view. Currently the recommended approach is to add a trivial reduce function and then count the number of rows in a _list function or client-side application code, but of course that doesn’t scale nicely.

It seems that in a majority of these cases all that’s required is an approximation of the number of distinct entries, which brings us into the space of hash sets, linear probabilistic counters, and the ever-popular “HyperLogLog” algorithm. Taking HLL specifically, this seems like quite a nice candidate for a builtin reduce. The size of the data structure is independent of the number of input elements and individual HLL filters can be unioned together. There’s already what seems to be a good MIT-licensed implementation on GitHub:

https://github.com/GameAnalytics/hyper

This PR adds a new _approx_count_distinct function which uses the binary HLL implementation above. The precision is fixed at 11, which yields a relative error of ~2%. A future improvement would be to make this precision configurable in the design document.
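For intuition, here is a minimal Python sketch of the HyperLogLog idea — not the hyper library's actual code, and the hash function and register layout here are illustrative assumptions. Each hashed key is split into a register index and a "rank" (leading zeros in the remaining bits plus one); registers keep the maximum rank seen, and two filters union by taking the register-wise maximum, which is the property that makes HLL usable as a rereduce:

```python
import hashlib
import math

P = 11          # precision bits, matching the fixed setting in this PR
M = 1 << P      # 2048 registers

def hll_add(registers, key):
    """Insert a key into the register array."""
    h = int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")
    idx = h >> (64 - P)                      # top P bits choose a register
    rest = h & ((1 << (64 - P)) - 1)         # remaining 53 bits
    rank = (64 - P) - rest.bit_length() + 1  # leading zeros + 1
    registers[idx] = max(registers[idx], rank)

def hll_union(a, b):
    """Two filters merge by taking the register-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]

def hll_estimate(registers):
    """Standard HLL estimator with the small-range correction."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:             # fall back to linear counting
        return M * math.log(M / zeros)
    return raw
```

The union property mirrors what the view engine needs: partial reduce values from different branches of the view B-tree can be merged in any order without affecting the final estimate.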

Testing recommendations

  1. Generate a view with a known number of distinct keys.
  2. Add _approx_count_distinct as a reduce function.
  3. Compare the reduce output with the exact number of distinct keys. The result should be within 2%.
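Concretely, the steps above amount to a design document along these lines (the document and view names here are illustrative, not taken from the PR):

```json
{
  "_id": "_design/cardinality",
  "views": {
    "distinct_keys": {
      "map": "function (doc) { if (doc.tag) { emit(doc.tag, null); } }",
      "reduce": "_approx_count_distinct"
    }
  }
}
```

Querying the view with reduce enabled (the default) should then return the HLL estimate directly, rather than requiring a trivial reduce plus client-side row counting.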

Related Issues or Pull Requests

See https://issues.apache.org/jira/browse/COUCHDB-2971 for history.

Checklist

  • Code is written and works correctly;
  • Changes are covered by tests;
  • Documentation reflects the changes (separate PR forthcoming).

commit ee32cd5825aaf63448651c9521f0927083d2281e
Author: Adam Kocoloski <[email protected]>
Date:   Wed Mar 1 09:28:45 2017 -0500

    Add a cardinality estimator builtin reduce

    This introduces a _distinct builtin reduce function, which uses a
    HyperLogLog algorithm to estimate the number of distinct keys in the
    view index. The precision is currently fixed to 2^11 observables and
    therefore uses approximately 1.5 KB of memory.

    COUCHDB-2971
commit 5d18415237e7a01e1ac401607f7fc36b671bf640
Author: Adam Kocoloski <[email protected]>
Date:   Thu Apr 28 15:12:44 2016 -0400

    Add a finalize step after rereduce

    Currently this is a noop for every reduce function except the HLL
    cardinality estimator implemented in _distinct.

    COUCHDB-2971
commit f7c3c24db17db5908894447e1822936212d61fcd
Author: Adam Kocoloski <[email protected]>
Date:   Wed Mar 1 10:37:00 2017 -0500

    Add _distinct as a built-in reduce

    COUCHDB-2971
@@ -61,6 +61,8 @@ DepDescs = [
{tag, "v1.1.15"}, [raw]},
%% Third party deps
{folsom, "folsom", {tag, "CouchDB-0.8.2"}},
{hyper, {url, "https://github.com/GameAnalytics/hyper.git"},
"4b1abc4284fc784f6def4f4928f715b0d33136f9"},
@kocolosk (Member Author)

Now that we are fully read/write on GitHub, is it possible to fork this repo into Apache? Or is the requirement still to clone and push a copy (and eliminate the edge in the network graph in the process)?

Member

We are still required to go through code clearance and clone and push a copy. We recently did this for bcrypt support; check the history.

Member

I'll note the infra piece of this is now self-service assuming we vet the licensing and ensure we don't break our own buildchain.

See https://gitbox.apache.org/ for the link.

@kocolosk (Member Author)

I don't see any formal code clearance entry for bcrypt at http:https://incubator.apache.org/ip-clearance/, just @janl's commit to amend our overall LICENSE and NOTICE in 817b2b6. Guessing I can do the same here.

The hyper code is fully MIT-licensed; there was an LGPL file in the C implementation, but this was expunged in GameAnalytics/hyper#16.

@kocolosk (Member Author)

Done.

@wohali (Member)

Now change rebar.config.script to point at our copy :)

You also shouldn't point at a specific hash, but rather a tagged release.

Standard practice btw is to create an upstream branch and tag any upstream releases on that as X.Y.Z, saving the master branch for any local changes we might have to make. If we need to make a release with our customisations, use tags of the form COUCHDB-X.Y.Z instead.

@kocolosk (Member Author)

@wohali I already did all that in 22e8f97 ;) Your description of the standard practice sounds like a good one to me. In this case upstream doesn't actually have any tags (🤷‍♂️), so I picked X.Y.Z optimistically as 2.2.0 😄

@nickva (Contributor) commented May 30, 2018

Noticed a transitive dependency on bisect:

(HEAD detached at CouchDB-2.2.0) $ more rebar.config
{cover_enabled, true}.
{deps, [
        {bisect, "",
         {git, "https://github.com/knutin/bisect.git", {branch, "master"}}}
       ]}.

This pulls in proper's master branch as a dependency, which fails to compile on R16. The good news is that we don't use the bisect backend, just the default binary one. So I think we can make a branch on our version of hyper that excludes bisect and it should be good.

Another option might be to avoid building the carray C module (since we don't use it), but that's optional.
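A sketch of what the trimmed rebar.config on a CouchDB-maintained hyper branch might look like, assuming the bisect dep can simply be dropped because only the default binary backend is used:

```erlang
%% Hypothetical rebar.config for a CouchDB branch of hyper:
%% the bisect backend is unused, so its dep (and the transitive
%% proper master dependency) can be dropped entirely.
{cover_enabled, true}.
{deps, []}.
```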

@kocolosk (Member Author)

Good catch @nickva, I totally missed that dependency. I've removed it now, and left carray in place to minimize divergence from upstream.

@nickva (Contributor) commented May 31, 2018

Added some tests for this feature: #1364

@nickva (Contributor) commented May 31, 2018

A small PR to add hyper to top level .gitignore file: #1365

@kocolosk (Member Author) commented Jun 4, 2018

So I realized we have one issue we need to discuss here: collation. The reduce function inserts each key into the filter using a regular old term_to_binary(Key), so it will (incorrectly) treat two keys that are not identical but compare equal under ICU collation as distinct keys in its computation. I see a handful of paths we could take:

  1. Force the user to set "collation": "raw" in the design document options in order to use this reducer.
  2. Merge this feature as-is and hide behind the "approximate" nature of the computation, documenting that it may overestimate the number of distinct keys by more than the stated uncertainty in the presence of such keys.
  3. Compute a "canonical form" for each key and insert that into the filter instead. I don't even know if that's possible.

My initial preference is for Option 2 but I wanted to bring this up before merging to see if others have strong opinions.
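The over-counting behind Option 2, and the normalization fix behind Option 3, can be illustrated with the classic precomposed-vs-decomposed pair. This Python sketch (using Python's unicodedata as a stand-in for the Erlang normalization functions discussed below) shows two strings that collate as equal but differ as bytes, so a byte-based filter would count them twice:

```python
import unicodedata

composed = "\u00e9"      # é as a single precomposed code point
decomposed = "e\u0301"   # e followed by a combining acute accent

# ICU-style collation treats these as equal, but their byte
# representations differ, so a byte-keyed HLL sees two entries.
assert composed != decomposed
assert composed.encode("utf-8") != decomposed.encode("utf-8")

# Normalizing both to NFC before insertion (the Option 3 approach)
# makes the byte forms identical.
assert unicodedata.normalize("NFC", composed) == \
       unicodedata.normalize("NFC", decomposed)
```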

@nickva (Contributor) commented Jun 5, 2018

I am thinking of Option 2 with the documentation note.

For Option 3, Erlang 20.0 has a bunch of new normalization functions, e.g. http:https://erlang.org/doc/man/unicode.html#characters_to_nfc_binary-1. Maybe recursively transform each key to a normalized form before calling term_to_binary on it? Would that be enough? Even if it is, we'd need to backport the module to Erlang versions < 20.0.

@wohali (Member) commented Jun 5, 2018

For me this would be enough of a reason to start requiring Erlang 20, say in 2.3.0 or 3.0. It could be transparent to the user: if the function is available, use it and improve the accuracy of the response.

@kocolosk (Member Author) commented Jun 6, 2018

Very cool, I had forgotten about all of that core unicode work. Definitely a nice future enhancement.

@kocolosk merged commit 6d44e17 into master on Jun 6, 2018
@kocolosk deleted the 2971-count-distinct branch on Jun 6, 2018
nickva added a commit to cloudant/couchdb that referenced this pull request Jun 8, 2018
Releases and dialyzer checks need app dependencies to work properly

Issue: apache#1346
nickva added a commit that referenced this pull request Jun 8, 2018
Releases and dialyzer checks need app dependencies to work properly

Issue: #1346
@wohali (Member) commented Jul 16, 2018

FYI this broke the Windows build:

Compiling c:/relax/couchdb/src/hyper/c_src/hyper_carray.c
'cc' is not recognized as an internal or external command,
operable program or batch file.
ERROR: compile failed while processing c:/relax/couchdb/src/hyper: rebar_abort
make: *** [couch] Error 1

We really need to get Windows into the CI build farm somehow.

@wohali mentioned this pull request Jul 17, 2018
@popojargo (Member)

Since I faced the same issue, I started looking into a fix. Could it be fixed with a simple alias such as export cc=g++?

@wohali (Member) commented Jul 17, 2018

No, it's worse than that sadly; there is some invalid pointer arithmetic and some C99 reserved words the MS compiler doesn't support.

Fixing rebar.config is pretty easy; there's no need to alias CC=cc, just leave it out. I have this worked out already.

I'll get to it tomorrow.

4 participants