Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Additional Mango-based update handler / VDU functionality #1554

Open
wohali opened this issue Aug 12, 2018 · 27 comments
Open

Additional Mango-based update handler / VDU functionality #1554

wohali opened this issue Aug 12, 2018 · 27 comments
Assignees
Projects

Comments

@wohali
Copy link
Member

wohali commented Aug 12, 2018

This ticket is intended to gather ideas for what other sorts of update handler / VDU functionality we could add to Mango to provide a declarative replacement for JS-based update handlers and VDUs. This is intended as a "blue-sky" ticket to gather ideas that can be implemented simply in a declarative fashion.

From #1534, I wrote:

There are other proposals to provide similar/useful functionality (#1532, #1533, #1498) if the operator decides to run without a JS engine (#1513), plus I suspect there are thoughts about doing Mango-based update functionality to filter fields, as well as apply database schema enforcement (type and format checking) in a declarative fashion for a NoJS mode. Could your embedded RasPi update function be written declaratively? I assume you're adding info such as incoming IP address, central timestamp, etc. which I could see being done automatically quite easily. This is a use case we can accelerate without the use of JS readily; we just need to agree on an API and how to write it down in a JSON blob.

Extracting:

  • Add datetime stamp (Auto-record data on document create / update #1533)
  • Add IP address from incoming host, or other fields from the req object
  • Add additional content based on fields in the req object
    • Perhaps a lookup table that maps a req field value to a key:value pair (or JSON Patch) to insert?
  • Filter out (blacklist) or only allow (whitelist) certain fields (lists) - this is partial JSON schema support
  • Type-checking (field type is string, field date is ISO datetime|UNIX epoch, etc.)
  • Field validation (format-checking, i.e. string must match a regex like "$###.##" or an integer must be within a certain range)
  • Consider using output of another Mango index/query for validating a field? For instance, a Mango query of documents that are of type source and extracting the field ip_address, then only allowing documents to come through that have a matching req.ip_address (for example).
    • Same as above, but to implement the lookup table. Example: look up source's IP address, auto-add name of sensor which is in the first document that matches in the Mango index
@wohali
Copy link
Member Author

wohali commented Aug 12, 2018

Got to thinking more about this one in the shower. Here's a sample Mango-like document structure for a combined VDU/update handler that would capture all of the above:

{
  "queries": [
    {
      "name": "sensor_name",
      "selector": {
        "type": "sensor",
        "ip_address": "$doc.ip_address"
      },
      "fields": ["name"],
      ...
    }
  ],
  "fields": {
    "whitelist": ["type", "datetime", "ip_address", "station_name", "temperature", "RH%", "rain_gauge", "wind.speed", "wind.direction", "light", "battery.voltage", "battery.current", "line.voltage", "line.current", "solar.voltage", "solar.current", "register_total"],
    "blacklist": ["battery.wattage", "rain_gauge.*"],
    "formats": {
      "type": ["sensor_reading"],
      "datetime": {
        "$or": {
          "$format": "$iso8601",
          "$format": "$unixepoch",
          "$regex": "(\\d{2})-(\\d{2})-(\\d{2}) (\\d{2}):(\\d{2}):(\\d{2})"
        }
      },
      "battery.voltage": "$float",
      "battery.current": "$float",
      "register_total": { "$regex": "^\\$\\d{1,9}\\.\\d{2}$" },
      "wind.direction": {"$or": ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]}
      ...
    }
  },
  "patch": [
      { "op": "add", "path": "/", "value": { "name": "$query:sensor_name.name" } },
      { "op": "move", "from": "/ip-address", "path": "/ip_address" },
      { "op": "add", "path": "/", "value": { "server_datetime": "$server_time.iso8601" } },
      ...
  ]
}

Comments:

  • This runs a Mango query up front to get any documents with "type": "sensor" that match the IP address of the incoming request. (Assumes that the $req object has that value, not sure we pass that thru to VDUs right now...)
  • There's then a field presence check. In a real situation, you would probably only have one of fields.whitelist and fields.blacklist.
  • Field values/formats are then checked, some just for general structure, one by regex, one by list of acceptable values. The output of a Mango query could be used as input to this, too. (An aside: Mango aggregations could be really useful here, for instance to reduce a query to a single list of acceptable values. See Add aggregation functions to Mango #1254.)
  • Documents that didn't match "type": "sensor_reading" or any of the other format checks would fall through to the next whatever-we-call-this-thing. If all failed, the document would be rejected.
  • Maybe we add a "$format": "$datetime" that includes all the common datetime formats?

Just some food for discussion.

@ermouth
Copy link
Contributor

ermouth commented Aug 12, 2018

IMO the patch list better be named transform and look like

transform:[
  {
    selector:{  // Mango selector for list of actions
      type: { $neq: "whatever"}
    },
    actions: [  // List of actions
      ["stop"]
    ]
  }, 
  // Next item, used if we didn‘t stop processing
  {selector:{ stamp: { $format:'Int', $gt: 1234567890123}}},
  actions: [
    ["set",  ".ip",  ":peer"   ],
    ["swap", ".a.0", ["a", "1"]],
    ["del",  ".x.y.z"          ]
  ]}
  //...
]

Actions are pretty self-explanatory, so no need to use obj keys. Objects as .actions members may be used later for syntax extensions or nesting.

This also replaces .fields.format typecheck making it more flexible. Typechecks just move to actions list member .selector prop.

Also syntax "battery.voltage": "$float" in fields seems somewhat ambiguous. Using proposed syntax it should be "battery.voltage": {"$format": "Float"}. $ marker before formats/types seems to be excessive, omitting it in favor of uppercasing type tag reads better.

It all seems unexpectedly reasonable. I’ll try to write down some of our _updates in this way, to compare readability and to understand other json-induced benefits/penalties.

And I couldn‘t comprehend what .queries field supposed to do. Can you please give a hint?

@wohali
Copy link
Member Author

wohali commented Aug 12, 2018

@ermouth The patch stuff is coming from JSON Patch, which is an RFC and has widespread support. We're looking at it for partial update support, see #1498. If we build it for that, then we can reuse that support here rather than inventing our own crazy DSL. ;)

@wohali
Copy link
Member Author

wohali commented Aug 12, 2018

Sorry, I hit submit too quickly. The $ before operators is mimicing the Mango syntax, to distinguish operators from operands. You're right that in some places, where the value is expected to be a format (like the float check) we can probably drop the $. As for the format, it's already inside the list for the top-level formats key, so it wouldn't need an additional {"$format":"float"} in my approach, since my approach is intentionally keeping the patch section JSON Patch compatible.

As for queries, the idea is you could run a Mango query before applying the VDU/update handler, then use that information later on. See the first line in the patch entry for use of the lookup to convert an incoming sensor's IP address to a station name. The assumption is that the query will only return the first document, or a reduced set of values (see the reference to Mango aggregation functions being a potential prerequisite for this functionality). Optionally you could cram in there a .0. to reference just the first doc in the result set, I guess. The returned result could also be used in the validator section, for instance, to look up the current list of allowed wind direction values or something.

The pre-update-Mango-query stuff is speculative; there may well be pushback from the db core team on making update functions a little too clever and slow since they're now performing a read before a write. We'll see. I think the proposal is valuable even without that specific functionality.

@janl
Copy link
Member

janl commented Aug 14, 2018

I really like where this is going, great first draft!

Quick initial notes up top:

  • I like the unification of VDU and _update into one thing that is logically “stuff that happens before a doc hits the db.”
  • I’d leave out queries in this example, this would open up having to handle cluster network error on the doc write path. We can revisit this at a later point, the current design here allows for an easy extension later on. I think it’ll simplify the already complex discussions required for a baseline feature (which I’m going to extend shortly).

One further aspect that we should discuss the other use of VDUs. To recap, VDUs are used to enforce document schema (this is handled in this draft) and authorisation, where check doc contents against the context of ctx.

http:https://docs.couchdb.org/en/stable/ddocs/ddocs.html#vdufun

For Example:

function (newDoc, oldDoc, userCtx) {
  // in an authenticated request, userCtx.user is the authenticated username
  // so we can:

  if (oldDoc.author != userCtx.name) {
    throw({ forbidden: 'you can’t update other user’s docs.' })
  }
}

userCtx Definition

In addition, there is a fourth parameter, the secObj, which is the database’s _security object

function (newDoc, oldDoc, userCtx, secObj) {
  if (secObj.members.names.indexOf(userCtx.name) === -1) {
    throw({ forbidden: 'you can’t update if you’re not a member.' }) // contrived example, you wouldn’t get to this part in the first place, because the _security check comes first, but you should get what is meant
  }
}

We could argue that this could co-incide with the security overhaul, which would take care of any authentication, and I’m happy to discuss this separately if needed, I just wanted to bring this up.

Maybe the draft can be amended, so fields becomes schema, and we introduce authorisation as a new top level field:

{
"fields": {
    "whitelist": ["type", "datetime", "ip_address", "station_name", "temperature", "RH%", "rain_gauge", "wind.speed", "wind.direction", "light", "battery.voltage", "battery.current", "line.voltage", "line.current", "solar.voltage", "solar.current", "register_total"],
    "blacklist": ["battery.wattage", "rain_gauge.*"],
    "formats": {
      "type": ["sensor_reading"],
      "datetime": {
        "$or": {
          "$format": "$iso8601",
          "$format": "$unixepoch",
          "$regex": "(\\d{2})-(\\d{2})-(\\d{2}) (\\d{2}):(\\d{2}):(\\d{2})"
        }
      },
      "battery.voltage": "$float",
      "battery.current": "$float",
      "register_total": { "$regex": "^\\$\\d{1,9}\\.\\d{2}$" },
      "wind.direction": {"$or": ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]}
      ...
    }
  },
  "authorisation": [
       {
         "author": { "$eq" : "$userCtx.name" },
         "throw": "you can’t update other user’s docs." 
    ]}
  }

where the line "throw": "you can’t update other user’s docs." is equivalent to the throw({ forbidden: 'you can’t update other user’s docs.' })-line from the first example above. $secObj would work similarly.

@wohali
Copy link
Member Author

wohali commented Aug 14, 2018

I like this idea, and especially the ability to use multiple docs of this format to enforce rules such as "only users with role sensor are able to add/modify documents of type sensor. It's not fully-fledged per-doc auth since this only is traversed on the write path, but it might be a way forward for the read-path too (if people are willing to accept the massive performance penalty it'll invoke).

The queries idea was blue sky, happy to break it into a separate issue so it doesn't get lost when we implement this one.

I'm flexible on what lives where in the actual doc, would like to see some other opinions as well.

I would say that for authorisation, since you didn't break things out into separate selector and action sections, we use $throw to ensure we don't overlap any tests against a real document key named throw. We'll need to work out escaping just in case someone DOES have a key named $throw, same for the other sections as well.

The alternative is to prefix all document fields with $doc but I think this makes things needlessly verbose, don't you?

@janl
Copy link
Member

janl commented Aug 14, 2018

yeah not a fan of $doc prefix, but $throw works fine.

That might dovetail into my next proposal (that I’ll open a new ticket for), to add string manipulation funs to Mango, those then would follow the same lead, say $strlen(field.sub), but let’s keep that definitely separate from this proposal ;D

@mikerhodes
Copy link
Contributor

@janl should your JSON snippet be (i..e, with the schema and authorization top level fields):

{
    "schema": {
        "whitelist": [
            "type",
            "datetime",
            "others....."
        ],
        ...
    },
    "authorization": [
        {
            "author": { "$eq": "$userCtx.name" },
            "throw": "you can’t update other user’s docs."
        }
    ]
}

Also, I used American spelling for "authorization".

Suggestion:

  • I think for clarity, splitting out the selector and action in the authorization field as @wohali suggests means you can avoid the ambiguity of field name vs. processing instruction. Collapsing them together feels confusing to me.

Questions:

  • What would happen when replicating to older db versions that don't understand some part of the validation (e.g., didn't understand formats part of schema)? Be lenient (ignore bits not understood) or strict (fail validation when can't process the whole set of instructions)?
  • There's an information leak possible: write-only users can discern document shape by observing error messages for schema.
    • Perhaps an optional throw clause which would allow the user to override default error to prevent this leakage.

@janl
Copy link
Member

janl commented Aug 14, 2018

@mikerhodes good catch, that's a typo!

@wohali
Copy link
Member Author

wohali commented Aug 14, 2018

@mikerhodes asks:

  • What would happen when replicating to older db versions that don't understand some part of the validation (e.g., didn't understand formats part of schema)? Be lenient (ignore bits not understood) or strict (fail validation when can't process the whole set of instructions)?

Idea 1: Maybe we need a version field, regardless of what we do with this, if not specified we assume it's the version of whatever the current runtime is
Idea 2: Maybe we need a "fail": "closed|open" setting per-VDU
Idea 3: Maybe we need a server setting in local.ini to toggle the same thing

The latter 2 come from electronic door locks that generally come with a user-configurable setting for this.

  • There's an information leak possible: write-only users can discern document shape by observing error messages for schema. Perhaps an optional throw clause which would allow the user to override default error to prevent this leakage.

Well, in Apache CouchDB, there are no such things as write-only users to my knowledge. Are you talking about a Cloudant-specific thing, or something that might come in with #1504 ?

I was imagining that if any of the tests failed (pre-@janl's change) a generic response would be provided, along the lines of document failed validation, maybe including the name of the whatever-we-call-this to point a finger, but not explictly including the precise reason for failure. I like the idea of a $throw.default to override that message if you want, but I'm not sure it does much?

The other option would be to provide a "debug": true/false boolean toggle that would include precise info about the reason for a document's failure. We've done a poor job of helping users debug their JS functions in the past. Putting the control over debugging a Mango-VDU in the hands of the developer would be a big step up in my opinion.

Thanks for the discussion!

@mikerhodes
Copy link
Contributor

Cloudant has write-only, but I did think that it'd likely come in with any access rewrite as it's definitely a useful thing.

@ermouth
Copy link
Contributor

ermouth commented Aug 15, 2018

I‘ve tried to re-write several real simple _update fns with proposed syntax. My observations:

  • schema, whatever implementation is, better have conditionals/guards
  • update requires conditionals/guards even more: VDU results commute, so you can have many VDUs in different ddocs; this is not true for updates – without conditionals we need as many updates (and endpoints) as number of schemas, which is bit messy and unsafe from integrity POV
  • transform/patch lists of actions also require guards: schema match doesn‘t automatically mean plain list of actions is applicable
  • JSON patch is definitely hard to read, too verbose; for long lists repeated key names look as an excessive markup noise
  • constantly switching between dot notation of Mango paths and slash notation of JSON patch is a direct way to errors
  • it all needs sandboxed testing tools, entire syntax is not easy to read and, unlike Mango, errors are immediately destructive for incoming data.

making update functions a little too clever and slow

AFAIK more risk is to make them dumb and cryptic as old-style rewrites )

Here is a set of existing tools of the sort, hope it can be a source for further inspiration https://stackoverflow.com/questions/1618038/xslt-equivalent-for-json/49011455#49011455

@janl
Copy link
Member

janl commented Aug 17, 2018

schema, whatever implementation is, better have conditionals/guards

@ermouth do you have an example?

@ermouth
Copy link
Contributor

ermouth commented Aug 17, 2018

@janl latest json schema has if/then/else

http:https://json-schema.org/draft-07/json-schema-release-notes.html

@mikerhodes
Copy link
Contributor

mikerhodes commented Aug 17, 2018

I'll suggest the obvious: that the guards be declarative and mango selectors.

I think "condition" may be the right term for the authorization's actual business logic (might be a better term from the authorization literature, but it escapes me right now).

Combining the suggestion to split out the authz section into three fields:

{
    "authorization": [
        {
            "guard": {"type": { "$eq": "post" }},
            "condition": {"author": { "$eq": "$userCtx.name" }},
            "throw": "you can’t update other user’s docs."
        }
    ]
}

@ermouth
Copy link
Contributor

ermouth commented Aug 17, 2018

@mikerhodes I already suggested exactly same approach (using Mango selectors as guards/conditions), see above, however it was assumed too clever. Nice to see this obvious and practical way came not only in my head.

What is the difference between guard and condition in you example? And why special authorization field needed; I mean why special field if a list of guard/actionList pairs may be used without pinpointing each pair semantic meaning?

@wohali
Copy link
Member Author

wohali commented Aug 17, 2018

@ermouth I think I misunderstood what you were proposing...probably a language barrier. It's clearer to me now.

I think the guard is checked first, to make sure that the document would even have the field in question, and then the condition is checked. This would translate to JS in such a way:

if (doc.type === "post") {
  if (doc.author != userCtx.name) {
    throw("you can't update other users' docs.")
  }
}

Am I right @mikerhodes ? And if so, perhaps your condition has the wrong logic sense? Not sure...

@mikerhodes
Copy link
Contributor

mikerhodes commented Aug 22, 2018

@wohali Yes, that's the basic logic. I think my condition logic is the wrong way around, because I would say that the throw happens if condition is false rather than if condition is true.

if (doc.type === "post") {   // i.e., "if <guard condition>"
  if (! (doc.author == userCtx.name)) {  // i.e., "if not <condition>"
    throw("you can't update other users' docs.")
  }
}

@ermouth I can see that you could combine the guard and the condition into a single clause quite easily given they are both Mango selectors. I just find that the logical distinction of "the types of thing this authorisation decision applies to" (guard) and "the authorisation condition itself" (condition) makes things more clear.

As to whether schema and authorization are worthy of separate top-level concepts when I could see that you're right again that they could be combined in a kind of guard-condition-action type framework, I'm unsure. Again, it perhaps helps to separate them from a readability point of view, "here's my schema check, here's my authorisation check". Unsure.

@garbados
Copy link
Contributor

garbados commented Jan 18, 2019

I wrote up a gist describing a possible /{db}/_update endpoint for facilitating Mango-based VDUs, based on the discussion in this thread. I based the transform syntax on this secret update syntax in the Mango source. The transform syntax could also be dropped from the proposal entirely, as it adds a great deal of extra work over simply validating an update.

What do y'all think?

@garbados
Copy link
Contributor

After digging into Mango's source and discussing this feature with folks, I think only one new method+endpoint is necessary to facilitate Mango VDUs: POST /{db}/_update, which acts much like POST /{db} but it applies the new doc to the first matching VDU. The ?use_index=... query parameter can be used to force the use of a particular VDU, much like in POST /{db}/_find.

VDUs in this case are Mango indexes with the type field set to vdu. Their form differs from a "type": "json" index in the following ways:

  • No partial_filter_selector since applying something at "index time" doesn't make sense for this context.
  • No index.fields, since there is no indexing involved in a VDU.
  • The field index.guard contains a selector which checks if the VDU applies to a given document. Documents which do not pass this selector are not affected by this VDU.
  • The field index.condition contains a selector which validates the document against the VDU. The VDU will reject documents which pass its guard but not its condition.
  • If an update selector syntax is ever developed or implemented, an index.action or index.transform field could be used to apply them to documents which pass both the guard and the condition.

I suggest this approach because mango_idx.erl already contains facilities for supporting new types of indexes, making it much easier to develop a new type of index than to reimplement all the /{db}/_index handlers for a new endpoint.

VDU selectors in this formulation can use string substitution to access $userCtx and $secObj. Accessing $oldDoc requires a fabric call, which extends the window for conflicts to appear, so I would be inclined to permit only index.condition to access it so as to minimize which updates can create lengthy conflict windows.

For example:

{
  "index": {
    "guard": { "type": "article" },
    "condition": {
      "name": { "$in": "$secObj.members.names" },
      "$or": {
        "$not": { "$exists": "$oldDoc.author" },
        "author": { "$eq": "$oldDoc.author" }
      }
    }
  },
  "type": "vdu"
}

Does that make sense? Seem reasonable?

@mikerhodes
Copy link
Contributor

Good to see some movement here, and nice write up.

It feels a little odd to overload the term index in this way -- in particular because, as you say, "there is no indexing involved" 😄 . I'm not sure this UX is going to work out, it feels like putting a VDU as a sibling to a map function. In many ways, this is unrelated to Mango querying so I'd say that, like /_update is new, a new /_vdus or whatever endpoint that's a parallel to /_index is more appropriate.

Another reason to avoid overloading the index term is that the VDU index section has a very different schema to a normal Mango index section too, which makes writing client libraries pretty tough (especially those libraries that map JSON to concrete types, which is a primary utility of a client library over raw HTTP calls IMO).

@garbados
Copy link
Contributor

That makes sense, that overloading /_index with a process that doesn't involve indexing does indeed not make sense. Thanks for that reality check :)

I still like the guard-condition syntax for the VDU, where the guard and condition are selectors with different expectations and abilities. I'll go for another few dives into the source and see if I can't figure out a PR.

@garrensmith
Copy link
Member

garrensmith commented Jan 23, 2019 via email

@garbados
Copy link
Contributor

garbados commented Feb 3, 2019

@garrensmith That's very much appreciated. I'd love for some pointers on which files will contain pertinent code, functions I might have to edit or copy, etc. Otherwise I'm just stumbling through the source and hitting my local couch with requests, hoping the responses change in response to my actions.

@garrensmith
Copy link
Member

I'm going to try and leave some ideas on how to implement this. The one thing I've noticed is that this is quite a large amount of work, so I think try and break it up into smaller usable PR's. I also think it is worth creating a RFC for this with all the information on this and add in the gist so that its in 1 central place.

I would start on the transform section first. That section could possibly be used in the mango query section. We could take a document we want to index and then transform it before adding it to the index. This has been a requested feature.

For $transform, $schema and $authorization section
All of the mango syntax manipulation lives here https://github.com/apache/couchdb/blob/master/src/mango/src/mango_selector.erl

I think it might be best to create a new mango_vdu_selector or something file that uses parts of mango_selector and does all the validation and transformation of a document based on the syntax.

Saving/creating/getching of VDU's:
The CRUD for a mango index is here https://github.com/apache/couchdb/blob/master/src/mango/src/mango_idx_view.erl
Again I would probably create another file e.g mango_vdu and leverage mango_idx_view to do all the retrieving of the vdu's.

I hope that helps a little.

@janl
Copy link
Member

janl commented Feb 20, 2019

I swear I thought I had mentioned this already, but apparently not on this thread.

I’d leave the transform bits out of this proposal and focus on the VDU part: validate that a doc looks like what you want it to be and comes from someone you want it to be.

I’d leave mango-like syntax for doc transformation to #1559.

@garrensmith
Copy link
Member

garrensmith commented Feb 20, 2019 via email

@wohali wohali added this to Proposed (release independent) in Roadmap Jul 11, 2019
@wohali wohali self-assigned this Apr 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Roadmap
  
Proposed (backlog)
Development

No branches or pull requests

6 participants