Feature: build Mango indexes from dynamic expressions #3912

jcoglan · 2022-01-25T17:11:12Z

Overview

This PR represents a prototype of a feature @janl and I have been working on. It lets us build indexes on dynamically computed values inside Mango, which would normally require writing a JS view. It does this by extending the syntax for indexes so that as well as the asc and desc sort directions, we allow an expression to define a virtual field, for example:

"fields": [
  { "foo_words": { "$jq": ".foo | split(\" \") | .[]" } }
]

This definition means that the virtual field foo_words is generated by splitting the foo property on spaces and emitting each result. So if we have a document like:

{
  "foo": "a b c"
}

then this index lets us find that doc using the _find query "foo_words": "b".

This is a prototype we're presenting to see if the functionality is of interest, before we commit any more work to making it more production-ready. Our reasoning for using jq for this is:

It's a ready-built expression language, we don't need to build a lot of the same functionality ourselves
It addresses design issues we faced trying to come up with our own function definition syntax, e.g.:
- How do we indicate that a function input is taken from doc property vs a literal value
- How do we indicate that we want to use an array result as an index key vs using each member of the array as a key
- How do we support composition of different functions to produce a result
- jq has nice answers to these questions already
CouchDB users are likely to be familiar with jq so it's one less thing to learn, and they can experiment with it in their shell while designing their indexes
It's very concise, compare .foo | split(" ") | .[] to our { "$explode": { "$field": "foo", "$separator": " " } }, which doesn't address the array vs elements problem

That said, there is risk with adopting a native dependency and we fully understand if that's not a path others think we should go down. We're opening this to gauge interest in the idea of indexing on dynamic functions inside Mango, rather than whether we use jq specifically.

Testing recommendations

The Python test script included in the PR indicates how to use the functionality. You may need to augment the rebar script to add build flags for your environment; this was developed on macOS with jq installed via Homebrew.

If we developed this further for production, we would want to add comprehensive unit tests for the couch_jq module to make sure it round-trips all JSON values correctly (I have verified this by hand but not written automated tests as such).

If we decide to stick with jq then we should also fuzz-test the native code, and we should decide whether to vendor the jq codebase or compile against the system copy.

There are also some warts in the implementation such as the addition of the virtual field into results based on the selector, which we'd need to come up with a cleaner solution for.

Checklist

Code is written and works correctly
Changes are covered by tests
Any new configurable parameters are documented in rel/overlay/etc/default.ini
A PR for documentation changes has been made in https://github.com/apache/couchdb-documentation

This puts in just enough machinery to support indexing on a function of a document's content inside Mango, without going through the JS engine. Say we have a document like: { "bar": "a b c" } If we want to index on the individual "words" of `bar`, i.e. find this doc via the keys "a", "b" or "c", we'd normally need a JS map function: function (doc) { for (let word of doc.bar.split(" ")) { emit(word) } } This patch lets us do this inside Mango by defining an index on a "virtual" field whose value is a function of the doc's other fields. We put a view containing this in a design doc: "map": { "fields": { "bar_words": { "$explode": { "$field": "bar", "$separator": " " } } } } And this lets us perform `_find` queries for e.g. { "bar_words": "b" } to get our original document. As this is a proof of concept designed to get the index machinery working, `$explode` is the only function defined.

Here we add a module named `couch_jq` which exports an interface to the jq [1] library, letting us run jq programs against CouchDB docs held in memory as Erlang terms. The exported functions are: - `couch_jq:compile/1`: takes a binary containing a jq expression, and compiles it. Returns a resource object holding the resulting struct that stored the parsed expression. - `couch_jq:eval/2`: takes a compiled jq program and a JSON document and returns the results of evaluating the program against the doc, as a list. Both functions return either `{ok, Result}` or `{error, Message}`. This assumes that jq programs will in general return multiple results. Because the Erlang list is built from the tail, the results are returned in the reverse order that they're emitted by jq. We're going to put these values into an index where they'll get re-sorted so the order is not important. Most of the native code here is concerned with translating between Erlang and jq representations of JSON values. Erlang terms are checked to make sure they conform to the Couch document representation, and anything else is rejected. [1]: https://stedolan.github.io/jq/

Here we replace our prototype `$explode` operator with the more general `$jq` operator. The index is built by executing the jq expression against each doc and storing each result into the index.

If have a document like: { "foo": "a b", "bar": "x y" } And an index definition like: [ { "foo_words": { "$jq": '.foo | split(" ") | .[]' } }, { "bar_words": { "$jq": '.bar | split(" ") | .[]' } } ] Then this should produce four index keys for the document: ("a", "x") ("a", "y") ("b", "x") ("b", "y") This lets us query on multiple virtual fields in a single query. The implementation here allows jq expressions (that return multiple values) to be mixed with normal field access that returns a single value; `flatten_keys/1` returns the product of any multi-valued index fields. For example, above `foo_words` produces values `["a", "b"]` and `bar_words` produces `["x", "y"]`, and we multiply this out giving the four keys above.

The first implementation of building Mango indexes from jq expressions re-compiled the jq program for every document it processed. This is wasteful and we can instead to this once, when the index definitions are updated. When the index definition is loaded in, field descriptions of the form `{[{<<"$jq">>, Expr}]}` are converted to `{jq, CompiledExpr}`. This non-JSON term is used because the compiled jq program can be held in memory but not serialised back out into the design doc.

We currently have to create jq-based indexes by writing out the whole design document containing the Mango index definition, because the `/{db}/_index` endpoint does not except anything except "asc" or "desc" next to field names in the sort syntax. Here we expand this to allow { "$jq": "..." } as well, letting us create jq-based indexes without knowing the internal document format. Even though internally we support mixing jq expressions with normal field lookups in multi-field indexes, for now I'm restricting this so that multi-field indexes have to either all use the same sort direction, or all be jq expressions.

tonysun83 · 2022-01-25T22:23:11Z

Pretty neat feature! IIUC, these new virtual fields will usually lead to a full key range scan because the answer derived by from the jq query will be what the user wants.

Ex: "Give me all docs where foo does not contain 1,2, or 3.":

{"foo" : {"$nin": [1, 2, 3]}

Currently, we don't use an index for this query, so we'll scan _all_docs and then run the filter.

Now with this new feature, you can have a precompiled jq query that filters all foo fields that don't have 1,2,3:

  { "foo_words": { "$jq": ".foo | <some expression to get all all values not in [1,2,3]> | .[]" } }

Doc1
{"foo": "1"}
Doc2
{"foo": "4"}
Doc3
{"foo": "5"}

The table with the new feature would look like:

foo_words, 4
foo_words, 5

From the original query, the user wants 4,5 back. So the query would {"foo_words" : {"$gt" : null}}

I'm thinking this would be more common than the user having a precompiled jq, filtering the results and then running another query on top of that. Perhaps we can think of just doing the full key range scan (from the emitted keys) for them to make things simpler? Basically have something like {"foo_words" : "$results"}.

janl and others added 6 commits January 25, 2022 16:18

Replace the $explode operator with $jq

4c27463

Here we replace our prototype `$explode` operator with the more general `$jq` operator. The index is built by executing the jq expression against each doc and storing each result into the index.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature: build Mango indexes from dynamic expressions #3912

Feature: build Mango indexes from dynamic expressions #3912

jcoglan commented Jan 25, 2022

tonysun83 commented Jan 25, 2022 •

edited

Loading

Feature: build Mango indexes from dynamic expressions #3912

Are you sure you want to change the base?

Feature: build Mango indexes from dynamic expressions #3912

Conversation

jcoglan commented Jan 25, 2022

Overview

Testing recommendations

Checklist

tonysun83 commented Jan 25, 2022 • edited Loading

tonysun83 commented Jan 25, 2022 •

edited

Loading