-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature: build Mango indexes from dynamic expressions #3912
base: 3.x
Are you sure you want to change the base?
Conversation
This puts in just enough machinery to support indexing on a function of a document's content inside Mango, without going through the JS engine. Say we have a document like: { "bar": "a b c" } If we want to index on the individual "words" of `bar`, i.e. find this doc via the keys "a", "b" or "c", we'd normally need a JS map function: function (doc) { for (let word of doc.bar.split(" ")) { emit(word) } } This patch lets us do this inside Mango by defining an index on a "virtual" field whose value is a function of the doc's other fields. We put a view containing this in a design doc: "map": { "fields": { "bar_words": { "$explode": { "$field": "bar", "$separator": " " } } } } And this lets us perform `_find` queries for e.g. { "bar_words": "b" } to get our original document. As this is a proof of concept designed to get the index machinery working, `$explode` is the only function defined.
Here we add a module named `couch_jq` which exports an interface to the jq [1] library, letting us run jq programs against CouchDB docs held in memory as Erlang terms. The exported functions are: - `couch_jq:compile/1`: takes a binary containing a jq expression, and compiles it. Returns a resource object holding the resulting struct that stored the parsed expression. - `couch_jq:eval/2`: takes a compiled jq program and a JSON document and returns the results of evaluating the program against the doc, as a list. Both functions return either `{ok, Result}` or `{error, Message}`. This assumes that jq programs will in general return multiple results. Because the Erlang list is built from the tail, the results are returned in the reverse order that they're emitted by jq. We're going to put these values into an index where they'll get re-sorted so the order is not important. Most of the native code here is concerned with translating between Erlang and jq representations of JSON values. Erlang terms are checked to make sure they conform to the Couch document representation, and anything else is rejected. [1]: https://stedolan.github.io/jq/
Here we replace our prototype `$explode` operator with the more general `$jq` operator. The index is built by executing the jq expression against each doc and storing each result into the index.
If have a document like: { "foo": "a b", "bar": "x y" } And an index definition like: [ { "foo_words": { "$jq": '.foo | split(" ") | .[]' } }, { "bar_words": { "$jq": '.bar | split(" ") | .[]' } } ] Then this should produce four index keys for the document: ("a", "x") ("a", "y") ("b", "x") ("b", "y") This lets us query on multiple virtual fields in a single query. The implementation here allows jq expressions (that return multiple values) to be mixed with normal field access that returns a single value; `flatten_keys/1` returns the product of any multi-valued index fields. For example, above `foo_words` produces values `["a", "b"]` and `bar_words` produces `["x", "y"]`, and we multiply this out giving the four keys above.
The first implementation of building Mango indexes from jq expressions re-compiled the jq program for every document it processed. This is wasteful and we can instead to this once, when the index definitions are updated. When the index definition is loaded in, field descriptions of the form `{[{<<"$jq">>, Expr}]}` are converted to `{jq, CompiledExpr}`. This non-JSON term is used because the compiled jq program can be held in memory but not serialised back out into the design doc.
We currently have to create jq-based indexes by writing out the whole design document containing the Mango index definition, because the `/{db}/_index` endpoint does not except anything except "asc" or "desc" next to field names in the sort syntax. Here we expand this to allow { "$jq": "..." } as well, letting us create jq-based indexes without knowing the internal document format. Even though internally we support mixing jq expressions with normal field lookups in multi-field indexes, for now I'm restricting this so that multi-field indexes have to either all use the same sort direction, or all be jq expressions.
Pretty neat feature! IIUC, these new virtual fields will usually lead to a full key range scan because the answer derived by from the jq query will be what the user wants. Ex: "Give me all docs where foo does not contain 1,2, or 3.":
Currently, we don't use an index for this query, so we'll scan Now with this new feature, you can have a precompiled jq query that filters all
The table with the new feature would look like:
From the original query, the user wants 4,5 back. So the query would I'm thinking this would be more common than the user having a precompiled jq, filtering the results and then running another query on top of that. Perhaps we can think of just doing the full key range scan (from the emitted keys) for them to make things simpler? Basically have something like |
Overview
This PR represents a prototype of a feature @janl and I have been working on. It lets us build indexes on dynamically computed values inside Mango, which would normally require writing a JS view. It does this by extending the syntax for indexes so that as well as the
asc
anddesc
sort directions, we allow an expression to define a virtual field, for example:This definition means that the virtual field
foo_words
is generated by splitting thefoo
property on spaces and emitting each result. So if we have a document like:then this index lets us find that doc using the
_find
query"foo_words": "b"
.This is a prototype we're presenting to see if the functionality is of interest, before we commit any more work to making it more production-ready. Our reasoning for using
jq
for this is:.foo | split(" ") | .[]
to our{ "$explode": { "$field": "foo", "$separator": " " } }
, which doesn't address the array vs elements problemThat said, there is risk with adopting a native dependency and we fully understand if that's not a path others think we should go down. We're opening this to gauge interest in the idea of indexing on dynamic functions inside Mango, rather than whether we use jq specifically.
Testing recommendations
The Python test script included in the PR indicates how to use the functionality. You may need to augment the rebar script to add build flags for your environment; this was developed on macOS with
jq
installed via Homebrew.If we developed this further for production, we would want to add comprehensive unit tests for the
couch_jq
module to make sure it round-trips all JSON values correctly (I have verified this by hand but not written automated tests as such).If we decide to stick with jq then we should also fuzz-test the native code, and we should decide whether to vendor the
jq
codebase or compile against the system copy.There are also some warts in the implementation such as the addition of the virtual field into results based on the selector, which we'd need to come up with a cleaner solution for.
Checklist
rel/overlay/etc/default.ini