-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Previous response state retained when an ibrowse client is reused #4870
Comments
Thanks for your report @jcoglan! I think in this case the first step might be to show that it's still happening in the latest release. 2.3.1 is a few years old (4?) and we've had a number of fixes in the ibrowse / replicator since then. So perhaps try to set up a mirrored test cluster with the same data, access patterns, and settings, but with 3.3.2. Or, construct a reproducer script to generate a similar scenario synthetically. See if in the original environment tweaking timeouts, worker processes, batch sizes or other settings makes the problem more pronounced, and crank those up in the test script, to make it more likely for the problem to show up. For instance, say shortening fabric request timeouts to 10 sec, makes the issue more likely to happen, then shorten it to 1 sec in the reproducer script. |
@nickva that’s a fair ask :) — We tracked it as far as checking that the retry logic has not changed (much) since 2.3.1. The problem for us is that the app and environment that causes this fails completely with 3.x, but that’s a thing we are working on separately. Will report back :) |
We (@neighboorhoodie, @janl, myself) have observed some incorrect state conditions in CouchDB’s use of
ibrowse
in the replicator that lead to incorrect parsing of some HTTP responses. This is most likely to occur when a connection has been interrupted or a response not fully consumed, and it can lead to anibrowse
process retaining headers from previous responses and therefore incorrectly parsing later ones. In particular it can lead to non-chunked responses being parsed as chunked ones, which leads to an error wherehexlist_to_integer
andto_ascii
are called with bad input.Description
We observed this when the following error showed up in customer logs:
ibrowse_http_client:to_ascii
is being called with the integer114
, which denotes the letterr
in ASCII, which is not a valid hex digit, so the call fails.The
gen_server
error includes the last message received by the process:The error stacktrace passes through a clause of
parse_11_response
that handles chunked encoding, but the above response message does not have aTransfer-Encoding
header. This is being invoked because it's reacting to thegen_server
state, which in this case included the following header list:(The full state object is quite large, but includes some other values indicating chunked encoding. We can provide the full state with client info redacted if needed.)
These headers do not match those of the incoming message. An HTTP client is expected to handle streamed data, and so needs to keep some state to record headers it's seen so far. However the incoming message is a complete response, not a chunk of a stream, and so it should be parsed from a clean starting state.
Our client has reported that they're seeing a lot of partial/interrupted responses, and our best theory for why this is happening is that an
ibrowse_http_client
instance is partially processing a response, which is interrupted, and the process is then reused for another request without being cleaned up first. Two main scenarios seem likely causes for this:A code path that retries requests: a client makes one request, whose response is partially received, and the client then retries the request using the same
ibrowse_http_client
process, leading to confused state. For example,couch_replicator_api_wrap:open_doc_revs
performs retries, and the above headers indicating a chunked multipart response are consistent with fetching a doc withopen_revs=all
. Thegen_server
state also indicates this is the kind of request being made.Reuse of a client from the pool without resetting its state. The
release
handler incouch_replicator_connection
demonitors the worker and sets itsmref
toundefined
, but otherwise sends no signal to the worker to reset its state. This could lead to a client being reused when it still contains headers from a previous response.Steps to Reproduce
This is tricky to reproduce mechanically because it requires getting responses interrupted in order to trigger the bad state. We have only observed its effects in customer logs and then diagnosed possible causes by code review.
Expected Behaviour
When reusing an
ibrowse
process either immediately or by releasing it back to the pool, its state should be reset so that it does not conflate the state from multiple responses. If this cannot be done then the connection should not be reused.Your Environment
The text was updated successfully, but these errors were encountered: