Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parse_revid - Badarg error in HTTP request #2015

Open
ronag opened this issue Apr 24, 2019 · 32 comments
Open

parse_revid - Badarg error in HTTP request #2015

ronag opened this issue Apr 24, 2019 · 32 comments

Comments

@ronag
Copy link

ronag commented Apr 24, 2019

Does revid have some undocumented limitation? couchdb 1.7.1

[error] [<0.11900.1269>] Badarg error in HTTP request
                                      ["MeGVONcBrkF5jQ-_ingest_scheduler",16],
                                      []},
                                     {couch_doc,parse_revid,1,
                                      [{file,"couch_doc.erl"},{line,163}]},
                                     {couch_doc,
                                      '-transfer_fields/2-lc$^1/1-1-',1,
                                      [{file,"couch_doc.erl"},{line,283}]},
                                     {couch_doc,
                                      '-transfer_fields/2-lc$^1/1-1-',1,
                                      [{file,"couch_doc.erl"},{line,283}]},
                                     {couch_doc,transfer_fields,2,
                                      [{file,"couch_doc.erl"},{line,283}]},
                                     {couch_httpd_db,'-db_req/2-fun-3-',1,
                                      [{file,"couch_httpd_db.erl"},
                                       {line,345}]},
                                     {lists,map,2,
                                      [{file,"lists.erl"},{line,1237}]},
                                     {couch_httpd_db,db_req,2,
                                      [{file,"couch_httpd_db.erl"},
                                       {line,344}]}]
@ronag
Copy link
Author

ronag commented Apr 24, 2019

it turns out... that when the revid is 32 characters long couchdb assumes it's a hex string...

@ronag
Copy link
Author

ronag commented Apr 24, 2019

I think if the hex parse fails it should fallback to just doing l2b?

@wohali
Copy link
Member

wohali commented Apr 24, 2019

Are you really making your own _rev tokens? That's pretty far outside the supported API.

@rnewson
Copy link
Member

rnewson commented Apr 24, 2019

parse_revid(RevId) when size(RevId) =:= 32 ->
    RevInt = erlang:list_to_integer(?b2l(RevId), 16),
     <<RevInt:128>>;
parse_revid(RevId) when length(RevId) =:= 32 ->
    RevInt = erlang:list_to_integer(RevId, 16),
     <<RevInt:128>>;
parse_revid(RevId) when is_binary(RevId) ->
    RevId;
parse_revid(RevId) when is_list(RevId) ->
    ?l2b(RevId).```

@rnewson
Copy link
Member

rnewson commented Apr 24, 2019

making your own _rev values is quite unusual. What are you doing?

@ronag
Copy link
Author

ronag commented Apr 24, 2019

That's pretty far outside the supported API.

It is? How so?

What are you doing?

Adding tracking such as modified time & modified by

@ondra-novak
Copy link

what is the consensus about this issue? Are custom _rev tokens allowed or not? if not, any invalid _rev token should be rejected with proper error message.

The current state is very unclear.

I am using custom suffix on _rev value to track merged and resolved conflicts
suffix "M" - merged (example: 15-4b7c15ab6fa3a8271935ea91eddf8eb1M)
suffix "R" - resolved - deleted conflicted revision (example: 11-45637d8602d84d82e395a6fa55d12f80R)

@rnewson
Copy link
Member

rnewson commented Apr 27, 2019

The two uses, "Adding tracking such as modified time & modified by", are best done with json key/value attributes rather than co-opting the _rev value.

Tracking merged and resolved conflicts is also not common (I've never heard of anyone do this before you mentioned it four hours ago). Conflicts are resolved by deleting the revisions that "lose", even if you have to make a new revision that is a combination of others. I'm curious as to what value you get out of this that isn't already present in our MVCC system.

Custom _rev values are supported but it is not common to do it, hence the presence of bugs or unexpected behaviour. The historical reason we assume that 32 char rev strings can be decoded to integers escapes me but would need rooting out before we could determine if it was safe to remove those clauses.

@rnewson
Copy link
Member

rnewson commented Apr 27, 2019

Also worth noting that replication will not adopt your custom _rev approaches and will make the normal _rev strings.

@rnewson
Copy link
Member

rnewson commented Apr 27, 2019

the == 32 clauses dated back to 2009 when deterministic _rev was first introduced. I'm inferring it was done as a space/time optimization (integers are smaller than the binary text expression of them), so perhaps a fallback to just using the rev if the decode fails would be acceptable after all.

Given the future foundationdb transition and its constraints on key/value sizes, this is probably a good time to make a project decision on whether users can generate their own rev values and, if so, what constraints we might apply to them (a maximum length seems very likely to be one of them).

@rnewson
Copy link
Member

rnewson commented Apr 27, 2019

@ronag
Copy link
Author

ronag commented Apr 27, 2019

For now... isn't it simply a case of trying to parse as MD5 and if that fails fallback to l2b?

I understand this is more of a fundamental issue and this might be removed/deprecated in future versions. But previous versions where it is supported shouldn't we at least avoid these kinds of "magical" fail cases?

@ondra-novak
Copy link

ondra-novak commented Apr 27, 2019

"I'm curious as to what value you get out of this that isn't already present in our MVCC system."

Easy: This allows you to distinguish between intentionally deleted document (without R) and deleted revision (with R).

And you are right, this still can be handled by a key in the document itself. But revision is always available without need to download the whole document (include_docs=false)

EDIT: ... and also in revision log (_revisions: {"ids":..., "start":...}}

@rnewson
Copy link
Member

rnewson commented Apr 27, 2019

I still think the decoding as integer is about memory usage in the erlang runtime system. We "know" we can decode revs of that length to a number because we created them. A rev value itself has no meaning, it's only for comparison at update time. The deterministic rev system confuses matters by adding a meaning to gain an advantage (the same update to two separate couchdb systems generate the same new _rev, so that replication between them later does not create a conflict).

I've invited others to comment (and obviously this thread is public) but my view at present is that we should not remove the '== 32' clauses or add a fallback if those don't decode correctly. At most, I suggest we add a line in the docs that user-generated rev values are discouraged and a stipulation that if those values are exactly 32 characters long they are required to be encoded integers.

The determining factor here will be whatever we are going to define for CouchDB 4.0. If we're supporting custom _rev values in that release, we will have to provide documentation on what rules apply (maximum/minimum length, allowed characters, etc) and then, as you've requested, we'll consider any rev value that doesn't follow those rules as an error and reject them, with a readable error message.

@ronag
Copy link
Author

ronag commented Apr 27, 2019

I would like to argue against the stance to not to add a fallback.

Custom revs are supported and no limitations have been specified. Especially considering this limitation is a bit "random" from an API user standpoint. Hence, there might be a lot of users out in the wild who are generating customs revs and their databases can randomly "fail". This is what happened to us. We've been running for 4 years like this at multiple clients and it worked fine until it didn't, in the middle of a live broadcast :/�

I understand that a fallback is a slow path, but it's a lot better than what seems like a "random" failure.

@wohali
Copy link
Member

wohali commented Apr 27, 2019

@rnewson My $0.02:

  • We never said whether we allow or disallow changing the revid, either in the replicator spec or the docs.
  • 2.x (and 3.x) should match 1.x's behaviour here. This behaviour comes from ~2009 so it's been this way for a very long time. We didn't allow it then, we don't allow it now. So: Patch the docs.
  • 4.x should feel free to deprecate user-settable _revs if it makes the implementation cleaner. I know there's some desire to update the replication spec for other reasons - this would be the ideal time to formally specify what a revid actually is.

@rnewson
Copy link
Member

rnewson commented Apr 27, 2019

@wohali agree that this has been the behaviour since forever (since 1.0 itself) and so this is definitely not a regression or should be a surprise to anyone making bespoke rev values.

for 4.0, I have a mild preference for saying that only couchdb gets to generate these values but the reality is that pouchdb generates rev values using a different algorithm and we (couchdb) should also be free to change (i.e, improve) this in future. I note, for full disclosure, that prior to the "deterministic rev" work that _rev used to be a random uuid. Replication works so long as the rev stays the same if the doc does, and changes if the doc changes. Anything beyond that is enhancement.

So... in 4.0 I'd be fine with us defining what the valid format for a _rev string is and explicitly supporting any string that matches that (and dropping the integer conversion when length is 32 chars). 3.0 could maybe drop the '== 32' clauses as a stepping stone?

Other voices should chime in, I'm not comfortable defining this here. (cc @davisp @kocolosk @chewbranca @daleharvey, a response from all of you would help, of course other voices are equally welcome).

@rnewson
Copy link
Member

rnewson commented Apr 27, 2019

(I am clearly very on the fence here and could tip either way. I do 100% agree that this halfway situation is not ok. You can define your own _rev strings but there are weird undocumented gotchas.).

@ronag
Copy link
Author

ronag commented Apr 27, 2019

Since I know the gotcha this is not really a problem for me. I just add an extra char at the end to make it 33 chars when it would be 32 chars otherwise. However, I am a bit confused about the reluctance to just add a fallback? I can only see advantages and no disadvantage. Is it because you don't want to have to create new patch releases for 1.x and 2.x?

This behaviour comes from ~2009 so it's been this way for a very long time.

Well... we've been running since 2014 and had no problems until just the other week... so I'm not quite sure what this comment tries to point out? It being a problem is very unlikely... but it can happen none the less

@rnewson
Copy link
Member

rnewson commented Apr 27, 2019

It's not a reluctance to make releases (we would not make a 1.x release for anything but a security related issue anyway).

This behaviour has existed since 1.0 and the consequences of changing it are not fully known. I appreciate you don't see a disadvantage but that isn't sufficient.

By deciding what we'll support in 4.0 we can plan what 3.0 might do and give more people a chance to chime in with hitherto unforeseen problems that might arise (or, we hope, more confirmation that there's no problems).

Summary: It's been this way for 10 years, we're not changing it without a lot more input.

@ricellis
Copy link

Re: #2015 (comment)

but the reality is that pouchdb generates rev values using a different algorithm

There are also other replicator compatible data stores that generate rev IDs differently to (at least current) CouchDB e.g. the Cloudant sync libraries. We don't do anything particularly unusual, in fact, we use a random UUID similarly to CouchDB prior to deterministic revs. I do think that maintaining backwards compatibility with those types of revs probably ought to be a goal here as trying to migrate rev formats in existing datastores would be painful.

If there have to be some sort of forced rev format/algorithm then I think there needs to be clearly defined versioning for replicators such that Couch N would always be able to replicate with rev ids that were supported by the replicator from at least Couch N-1, for example.

@davisp
Copy link
Member

davisp commented Apr 29, 2019

I haven't got any issue adding a fallback that just reuses the provided rev even if its doesn't parse as hex. I'd agree that its basically just space savings for both RAM and disk usage.

The only issue I have with people creating "meaningful" revisions is that there's always the possibility that we evolve our own "differently meaningful" revisions in the future which may end up being incompatible. However, given that we're currently not versioning them that's also on us to be thoughtful of if we were to ever do that (not that I've ever heard of any ideas around doing that, just noting the possibility exists).

Also, fun fact but we also have weirdness in the opposite direction. If someone were to create a custom revision that is 16 bytes long, it'll get hex-ified [1] on the way out which is also something to think about.

[1] https://github.com/apache/couchdb/blob/master/src/couch/src/couch_doc.erl#L70

@ronag
Copy link
Author

ronag commented Apr 29, 2019

Also, fun fact but we also have weirdness in the opposite direction. If someone were to create a custom revision that is 16 bytes long, it'll get hex-ified [1] on the way out which is also something to think about.

Not sure how to solve that... even with a fallback... don't use custom revid that are 16 chars long :)

@ronag
Copy link
Author

ronag commented Apr 29, 2019

A controversial suggestion... but if you don't use this optimisation in CouchDB 1-3 (i.e. no "smart" parse to 128 bit integer) and then re-add it in e.g. CouchDB 4 (where it is clearly documented)... then the currently documented/assumed behaviours will all work... with a performance regression of course (revid will take 16 bytes extra memory)... but how big is that regression in reality?

@janl
Copy link
Member

janl commented Apr 29, 2019

(we would not make a 1.x release for anything but a security related issue anyway)

just to clear a minor point up: 1.x is even past security releases.

@ronag
Copy link
Author

ronag commented May 1, 2019

@wohali
Copy link
Member

wohali commented Mar 13, 2020

@rnewson @davisp @janl In light of 4.0 development, we should decide one way or the other on this.

In short: are we going to allow user-defined _rev values in 4.0? Or no?

@wohali
Copy link
Member

wohali commented Jun 25, 2020

bump on this. @garrensmith @rnewson Are we going to prevent user-defined _rev values in fdb-based CouchDB? This issue makes it clear it could be a "sharp edge" that users could get caught on.

If so I want to mark this in the documentation for 3.x ASAP.

@rjharmon
Copy link

rjharmon commented Jun 25, 2020

As a developer who uses it, I selected couchdb in part because the _rev is opaque and CAN be controlled from outside couchdb. Using md5(erlang-data-representation) in those _revs is far from ideal for deterministic versioning, so it's not just a nice point of flexibility. But it is nice, for sure.

Would push replicators from other versions/implementations (including and not limited to PouchDB) have to pull the new _rev and write it back to disk, when pushing to a server that ignores provided _rev's? If so, that would be a minor ick.

Impact assessment: if one generates revs that have a 32-byte non-int, one would get an error, search and find this issue and then add or remove a byte from their _rev representation. I'm not seeing that as "sharp", myself. Did I miss something?

I would suggest, as the simplest change, to continue the existing semantic of "support provided _rev's opaquely", while correcting the unintended crashahem, error-on-32-char-rev's-not-parsable. I might also add any needed guidance for integrators and authors of custom replicators that they SHOULD generate deterministic _rev's if they use an algorithm different from couchdb's default internal mechanism.

Might it be helpful to strengthen any language for proxies and replication implementers to treat observed _rev's as opaque at baseline? while possibly noting that custom replicators are allowed to probe for any special meanings for some optimization purposes (just as the int-parsing logic does currently), but that they MUST be robust to opaque _rev's that don't fit their hopes and dreams. I think that's another way of indicating the high-level outcome that it seems @ricellis drives at.

@davisp It's possible to suggest people SHOULD NOT use flags in the revs for indicating such facts as deletion or update-times, although given that _rev's are opaque values, it's not clear to me that these bits of meaning even could be harmful to anyone (reminding myself that if one does such things, they can also evaluate and test against any potentially-colliding changes made in later releases of the server) - essentially as Paul said. I'd leave off this bit, myself.

There's some additional perspective from the field.

@rjharmon
Copy link

Can good 32-byte-hex-ints continue to get an optimized code path, while other 32-byte _revs get a working result as with non-32-byte _rev's?

@janl
Copy link
Member

janl commented Jun 26, 2020

I’m in favour of allowing external revs, and I’d be okay with documenting clearly that 32-byte-non-int revs are an obtuse edge-case that we’re not changing, so implementers will have to do some trickery to work around that.

In addition, whatever 4.0 limitations exist, those need documenting then too.

@wohali
Copy link
Member

wohali commented Jun 26, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants