Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: build Mango indexes from dynamic expressions #3912

Open
wants to merge 6 commits into
base: 3.x
Choose a base branch
from

Conversation

jcoglan
Copy link
Contributor

@jcoglan jcoglan commented Jan 25, 2022

Overview

This PR represents a prototype of a feature @janl and I have been working on. It lets us build indexes on dynamically computed values inside Mango, which would normally require writing a JS view. It does this by extending the syntax for indexes so that as well as the asc and desc sort directions, we allow an expression to define a virtual field, for example:

"fields": [
  { "foo_words": { "$jq": ".foo | split(\" \") | .[]" } }
]

This definition means that the virtual field foo_words is generated by splitting the foo property on spaces and emitting each result. So if we have a document like:

{
  "foo": "a b c"
}

then this index lets us find that doc using the _find query "foo_words": "b".

This is a prototype we're presenting to see if the functionality is of interest, before we commit any more work to making it more production-ready. Our reasoning for using jq for this is:

  • It's a ready-built expression language, we don't need to build a lot of the same functionality ourselves
  • It addresses design issues we faced trying to come up with our own function definition syntax, e.g.:
    • How do we indicate that a function input is taken from doc property vs a literal value
    • How do we indicate that we want to use an array result as an index key vs using each member of the array as a key
    • How do we support composition of different functions to produce a result
    • jq has nice answers to these questions already
  • CouchDB users are likely to be familiar with jq so it's one less thing to learn, and they can experiment with it in their shell while designing their indexes
  • It's very concise, compare .foo | split(" ") | .[] to our { "$explode": { "$field": "foo", "$separator": " " } }, which doesn't address the array vs elements problem

That said, there is risk with adopting a native dependency and we fully understand if that's not a path others think we should go down. We're opening this to gauge interest in the idea of indexing on dynamic functions inside Mango, rather than whether we use jq specifically.

Testing recommendations

The Python test script included in the PR indicates how to use the functionality. You may need to augment the rebar script to add build flags for your environment; this was developed on macOS with jq installed via Homebrew.

If we developed this further for production, we would want to add comprehensive unit tests for the couch_jq module to make sure it round-trips all JSON values correctly (I have verified this by hand but not written automated tests as such).

If we decide to stick with jq then we should also fuzz-test the native code, and we should decide whether to vendor the jq codebase or compile against the system copy.

There are also some warts in the implementation such as the addition of the virtual field into results based on the selector, which we'd need to come up with a cleaner solution for.

Checklist

janl and others added 6 commits January 25, 2022 16:18
This puts in just enough machinery to support indexing on a function of
a document's content inside Mango, without going through the JS engine.
Say we have a document like:

    {
      "bar": "a b c"
    }

If we want to index on the individual "words" of `bar`, i.e. find this
doc via the keys "a", "b" or "c", we'd normally need a JS map function:

    function (doc) {
      for (let word of doc.bar.split(" ")) {
        emit(word)
      }
    }

This patch lets us do this inside Mango by defining an index on a
"virtual" field whose value is a function of the doc's other fields. We
put a view containing this in a design doc:

    "map": {
      "fields": {
        "bar_words": {
          "$explode": { "$field": "bar", "$separator": " " }
        }
      }
    }

And this lets us perform `_find` queries for e.g. { "bar_words": "b" }
to get our original document.

As this is a proof of concept designed to get the index machinery
working, `$explode` is the only function defined.
Here we add a module named `couch_jq` which exports an interface to the
jq [1] library, letting us run jq programs against CouchDB docs held in
memory as Erlang terms. The exported functions are:

- `couch_jq:compile/1`: takes a binary containing a jq expression, and
  compiles it. Returns a resource object holding the resulting struct
  that stored the parsed expression.

- `couch_jq:eval/2`: takes a compiled jq program and a JSON document and
  returns the results of evaluating the program against the doc, as a
  list.

Both functions return either `{ok, Result}` or `{error, Message}`.

This assumes that jq programs will in general return multiple results.
Because the Erlang list is built from the tail, the results are returned
in the reverse order that they're emitted by jq. We're going to put
these values into an index where they'll get re-sorted so the order is
not important.

Most of the native code here is concerned with translating between
Erlang and jq representations of JSON values. Erlang terms are checked
to make sure they conform to the Couch document representation, and
anything else is rejected.

[1]: https://stedolan.github.io/jq/
Here we replace our prototype `$explode` operator with the more general
`$jq` operator. The index is built by executing the jq expression
against each doc and storing each result into the index.
If have a document like:

    {
      "foo": "a b",
      "bar": "x y"
    }

And an index definition like:

    [
      { "foo_words": { "$jq": '.foo | split(" ") | .[]' } },
      { "bar_words": { "$jq": '.bar | split(" ") | .[]' } }
    ]

Then this should produce four index keys for the document:

    ("a", "x")
    ("a", "y")
    ("b", "x")
    ("b", "y")

This lets us query on multiple virtual fields in a single query. The
implementation here allows jq expressions (that return multiple values)
to be mixed with normal field access that returns a single value;
`flatten_keys/1` returns the product of any multi-valued index fields.
For example, above `foo_words` produces values `["a", "b"]` and
`bar_words` produces `["x", "y"]`, and we multiply this out giving the
four keys above.
The first implementation of building Mango indexes from jq expressions
re-compiled the jq program for every document it processed. This is
wasteful and we can instead to this once, when the index definitions are
updated.

When the index definition is loaded in, field descriptions of the form
`{[{<<"$jq">>, Expr}]}` are converted to `{jq, CompiledExpr}`. This
non-JSON term is used because the compiled jq program can be held in
memory but not serialised back out into the design doc.
We currently have to create jq-based indexes by writing out the whole
design document containing the Mango index definition, because the
`/{db}/_index` endpoint does not except anything except "asc" or "desc"
next to field names in the sort syntax.

Here we expand this to allow { "$jq": "..." } as well, letting us create
jq-based indexes without knowing the internal document format.

Even though internally we support mixing jq expressions with normal
field lookups in multi-field indexes, for now I'm restricting this so
that multi-field indexes have to either all use the same sort direction,
or all be jq expressions.
@tonysun83
Copy link
Contributor

tonysun83 commented Jan 25, 2022

Pretty neat feature! IIUC, these new virtual fields will usually lead to a full key range scan because the answer derived by from the jq query will be what the user wants.

Ex: "Give me all docs where foo does not contain 1,2, or 3.":

{"foo" : {"$nin": [1, 2, 3]}

Currently, we don't use an index for this query, so we'll scan _all_docs and then run the filter.

Now with this new feature, you can have a precompiled jq query that filters all foo fields that don't have 1,2,3:

  { "foo_words": { "$jq": ".foo | <some expression to get all all values not in [1,2,3]> | .[]" } }
Doc1
{"foo": "1"}
Doc2
{"foo": "4"}
Doc3
{"foo": "5"}

The table with the new feature would look like:

foo_words, 4
foo_words, 5

From the original query, the user wants 4,5 back. So the query would {"foo_words" : {"$gt" : null}}

I'm thinking this would be more common than the user having a precompiled jq, filtering the results and then running another query on top of that. Perhaps we can think of just doing the full key range scan (from the emitted keys) for them to make things simpler? Basically have something like {"foo_words" : "$results"}.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants