[DISCUSS] Validate new document writes against max_http_request_size #1253

Open · janl wants to merge 9 commits into main

Conversation

janl (Member) commented Mar 29, 2018

This supersedes #1200.

New Behaviour

This variant introduces no new config variable and no formula. Instead, there is a set of three hurdles each doc write has to pass:

  1. doc body size
  2. individual attachment size
  3. length of the multipart representation of the full doc body including attachments.

The validation path is now the following:

  • If a new doc body is > max_document_size, we throw an error.
  • If a new attachment is > max_attachment_size, we throw an error.
  • If the new doc body plus new and/or existing attachments is > max_http_request_size, we throw an error.
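
As a minimal sketch only (the module and function names, the [couchdb]/[httpd] section and key names, the defaults, and the error terms are illustrative, and the third hurdle is approximated with a plain sum rather than the actual multipart length), the checks could look roughly like this:

-module(doc_size_sketch).
-export([validate_doc_write/1]).

-include_lib("couch/include/couch_db.hrl").

validate_doc_write(#doc{body = Body, atts = Atts}) ->
    MaxDocSize = config:get_integer("couchdb", "max_document_size", 2147483648),
    MaxAttSize = config:get_integer("couchdb", "max_attachment_size", 2147483648),
    MaxReqSize = config:get_integer("httpd", "max_http_request_size", 4294967296),
    %% hurdle 1: conservative estimate of the encoded doc body size
    BodySize = couch_ejson_size:encoded_size(Body),
    BodySize =< MaxDocSize orelse throw({request_entity_too_large, document}),
    %% hurdle 2: every individual attachment (stubs written without a length
    %% are the open issue discussed further down in this thread)
    AttSizes = [couch_att:fetch(att_len, Att) || Att <- Atts],
    lists:all(fun(Len) -> is_integer(Len) andalso Len =< MaxAttSize end, AttSizes)
        orelse throw({request_entity_too_large, attachment}),
    %% hurdle 3: doc body plus all attachments together
    BodySize + lists:sum(AttSizes) =< MaxReqSize
        orelse throw({request_entity_too_large, request}),
    ok.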

Notes

This is again just a sketch to show what something like this could look like. The patch is fairly minimal, but it does include a full additional ?JSON_ENCODE of the doc body, and some munging of the attachment stubs, which I’d like to get a performance review on. I’m sure we can make this fast if we need to, but that would require a larger patch, so it’s this sketch for now.

Compatibility

This also sets the max_document_size to 2 GB, to restore 1.x and 2.0.x compatibility as per #1200 (comment)

I’d suggest we make this a BC change in 3.0, down to the suggested 64 MB or whatever we feel is appropriate then.

Formalities

  • Includes tests
  • Documentation has been updated // waiting for consensus before doing this

janl added this to the 2.2.0 milestone Mar 29, 2018
janl requested review from rnewson, wohali and nickva March 29, 2018 11:24
src/couch/src/couch_doc.erl (outdated review thread)
@@ -708,6 +709,9 @@ upgrade(#att{} = Att) ->
upgrade(Att) ->
Att.

to_tuple(#att{name=Name, att_len=Len, type=Type, encoding=Encoding}) ->

rnewson (Member):

not keen on the name. a record already is a tuple, so I'm guessing the only purpose here is to ignore other fields (and any new fields).

janl (Member Author):

The only reason this exists is because couch_httpd_multipart:length_multipart_stream() makes assumptions about the format of the stub. This is usually called from chttpd. I didn’t find a way around this without exposing couch_att’s #att{} record.

But yeah, I’m very okay with any other name here. What would be better?

janl (Member Author):

[16:26:10]  <+jan____>	rnewson: do you have a better name for to_tuple in https://github.com/apache/couchdb/pull/1253/files#r178029434?
[16:27:14]  <+rnewson>	I don't, proceed with to_tuple.
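
For reference, a guess at the truncated function body shown in the diff above, assuming the stub shape that couch_httpd_multipart:length_multipart_stream() expects is simply the four bound fields in order (the actual patch may differ):

to_tuple(#att{name = Name, att_len = Len, type = Type, encoding = Encoding}) ->
    %% illustrative only: collapse the opaque #att{} record into a plain tuple
    %% so callers outside couch_att do not need the record definition
    {Name, Len, Type, Encoding}.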

janl (Member Author) commented Mar 29, 2018

The Travis failures here point to https://github.com/apache/couchdb/blob/master/test/javascript/tests/attachments.js#L300-L301 where we allow attachment stubs to be written verbatim and without length info. There is code to resolve this, but it requires reading the attachment info from disk.

I’m not implementing this yet because I want a review of the approach here first. It could be perf-prohibitive on the write path, though.
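
For illustration, a rough sketch of what that disk read could look like on the write path, reusing the existing couch_doc:has_stubs/1, couch_db:open_doc/3 and couch_doc:merge_stubs/2 helpers (resolve_stub_lengths is a made-up name, and whether this read is acceptable perf-wise is exactly the open question):

resolve_stub_lengths(Db, #doc{id = Id} = NewDoc) ->
    case couch_doc:has_stubs(NewDoc) of
        true ->
            %% read the stored revision so the verbatim stubs pick up their
            %% attachment lengths before the size checks run
            {ok, DiskDoc} = couch_db:open_doc(Db, Id, []),
            couch_doc:merge_stubs(NewDoc, DiskDoc);
        false ->
            NewDoc
    end.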

rel/overlay/etc/default.ini (outdated review thread)
wohali (Member) commented May 1, 2018

Note that whatever we do here - especially if this PR is not merged into 2.2.0 - needs to be documented for the 2.2.0 release, in light of the concerns raised in #1304.

Boundary = couch_uuids:random(), % mock boundary, is only used for the length
Atts = lists:map(fun couch_att:to_tuple/1, Atts0),
{_, DocSum} = couch_httpd_multipart:length_multipart_stream(Boundary,
?JSON_ENCODE(Body), Atts),

nickva (Contributor):

When we re-encode, we use jiffy, but unless the request was made by the replicator, the user probably didn't use Erlang to encode the data, so we could get a different value. There is also some performance loss in, say, re-encoding larger document bodies back to JSON just to check their size.

There is no canonical JSON encoding and so no canonical encoded JSON size. Above we are calculating the encoded size using a conservative size estimate (giving the user the benefit of the doubt) and with better performance (https://github.com/nickva/couch_ebench). Maybe make a version of the length calculation that takes sizes, and then we'd pass the already computed couch_ejson_size:encoded_size(Doc#doc.body) and 32 for the boundary size.
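
A hedged sketch of that suggestion (couch_ejson_size:encoded_size/1 and couch_att:fetch/2 exist today; length_multipart_stream_sizes/3 is a made-up name for the size-taking variant):

%% reuse the conservative body size estimate instead of re-encoding with
%% ?JSON_ENCODE, and pass 32 for the boundary length, since the boundary from
%% couch_uuids:random() is 32 bytes anyway
BodySize = couch_ejson_size:encoded_size(Doc#doc.body),
AttSizes = [couch_att:fetch(att_len, Att) || Att <- Atts0],
{_, DocSum} = couch_httpd_multipart:length_multipart_stream_sizes(
    32, BodySize, AttSizes),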

janl (Member Author):

Thanks @nickva, I’ll give this a ponder!

janl (Member Author) commented Jul 13, 2018

I propose the following:

  1. cherry-pick the change that restores the 4 GB request size limit to master/2.2.0.
  2. leave this branch/PR open until we have made the validate-request-size-on-write function a) complete and b) fast.
  3. when the function is ready, ship it in a post-2.2.0 release with a config option for folks to opt in, and a note that it will be enabled by default in the future.
  4. eventually make it opt-out.

I don’t see this being finished any time soon, and since the end result is functionally equivalent (sans the opt-in) for 2.2.0, this should not block 2.2.0.

janl removed this from the 2.2.0 milestone Jul 13, 2018
nickva (Contributor) commented Jul 13, 2018

@janl I'll try to make the function. It is close enough to the other one. Otherwise, I think we can keep it.

nickva added a commit to cloudant/couchdb that referenced this pull request Jul 13, 2018
Use the already computed (conservative) body size.

Switch multipart length calculation to accept body and boundary sizes.

Issue apache#1200
Issue apache#1253
Better total (body + attachments) size checking for documents
nkosi23 commented Jun 30, 2019

When we talk about 64 MB, 2 GB, etc., are we talking about default values that can be raised by the user, or hard limits that the end user cannot change?

For example, if 64 MB is selected for 3.0, would individual users still be able to set 4 GB and more?

As far as we are concerned, we enjoy CouchDB for the ability to use it as a file server / streaming server and keep all the data in one place to ensure database consistency (compared to keeping references to third-party storage services such as S3, then keeping everything in sync, ensuring links are not dead, etc.). CouchDB is a black box neatly integrated with the application layer, which greatly simplifies maintenance and system administration.

Filtered replication is particularly nice in this regard, since it becomes extremely easy to create clusters of multimedia attachments to balance load. For example, one cluster may contain only replicated videos, and the application layer uses command-query separation to route the user to the right cluster depending on the type of query being performed. All of this is transparent from the application's perspective: the various addresses of clusters handling specific query types simply need to be configured during deployment, and further load balancing can be done at the DNS level.

CouchDB hugely simplifies the infrastructure work. The management of large multimedia clusters becomes a breeze with continuous filtered replication, and consumption by application users is very efficient thanks to CouchDB's support for HTTP range requests.

It would be interesting to retain the ability to have large document / attachment / request sizes for those who know what they are doing.

wohali (Member) commented Oct 9, 2020

@janl This is very stale. Any plans to get this in? I am preparing to mass-close old PRs that never got merged.

bessbd (Member) commented Mar 29, 2021

> @janl This is very stale. Any plans to get this in? I am preparing to mass-close old PRs that never got merged.

@janl / @wohali : ping

wohali (Member) commented Mar 29, 2021

@bessbd I think what Jan proposed in #1253 (comment) is basically done. We've pushed the default limit in 3.x pretty low, and as you know 4.0 changes all the rules.

I think it is probably safe to close this out, but I'd like to see @janl +1 that.

adityajoshi12 commented:

is there any update on this issue?

saurabhprasadsah commented:

this is very stale.
