Fix rare race condition in ZClient causing healthy connections to be discarded #2924
This one has been bugging me for quite a while, but I finally managed to track it down. In short, there is a race condition that sometimes causes connections to be evicted from the connection pool for no good reason (the request succeeds and the server keeps the connection alive).
The issue is that we need to fulfil the `onComplete` promise before we invoke the final callback when collecting the async body, but after we have removed the handler from the pipeline. The race condition happens because closing the request `Scope` also interrupts the `onComplete` promise here. So when we make a request like below, if the callback wins the race, the scope is closed before we manage to fulfil the promise, and the Channel is therefore discarded.
The change in this PR ensures that the `onComplete` promise is fulfilled right before we call the async body handler on the last message, but after we have removed the handler from the pipeline. The order of the first two steps matters as well: if we removed the handler after completing the promise, there would be another potential race condition where we try to reuse the channel and end up with an error because the handler has not been removed yet.
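To make the ordering concrete, here is a minimal model of the last-message path. It is only a sketch and not the actual ZClient code: `FakeChannel`, `FakeScope`, `OrderingSketch` and `lastMessage` are hypothetical names, a plain `scala.concurrent.Promise` stands in for the real `onComplete` promise, and the case where the callback wins the race is simulated deterministically by having the callback close the scope immediately.

```scala
import scala.collection.mutable.ListBuffer
import scala.concurrent.Promise

// Hypothetical stand-ins for illustration; none of these names come from ZIO HTTP.
final class FakeChannel {
  var handlerInstalled: Boolean = true
}

final class FakeScope(onComplete: Promise[String]) {
  // Closing the scope fails ("interrupts") the promise if it is still pending.
  def close(): Unit = {
    onComplete.tryFailure(new RuntimeException("scope closed"))
    ()
  }
}

object OrderingSketch {
  // Simulates handling the last body message.
  // Returns the ordered events and whether the channel could go back to the pool.
  def lastMessage(fixed: Boolean): (List[String], Boolean) = {
    val events     = ListBuffer.empty[String]
    val channel    = new FakeChannel
    val onComplete = Promise[String]()
    val scope      = new FakeScope(onComplete)

    // Models the callback winning the race: the scope is closed as soon as it runs.
    def callback(): Unit = {
      events += "callback invoked"
      scope.close()
    }

    // Step 1 (both orderings): remove the handler from the pipeline.
    channel.handlerInstalled = false
    events += "handler removed"

    if (fixed) {
      // Step 2: fulfil the promise. Step 3: only then hand over the last chunk.
      onComplete.trySuccess("keep-alive")
      events += "promise fulfilled"
      callback()
    } else {
      // Problematic ordering: the callback runs first, so the scope may already be closed.
      callback()
      val won = onComplete.trySuccess("keep-alive")
      events += (if (won) "promise fulfilled" else "promise already interrupted")
    }

    // The channel is reusable only if the promise succeeded and the handler is gone.
    val reusable = onComplete.future.value.exists(_.isSuccess) && !channel.handlerInstalled
    (events.toList, reusable)
  }
}
```

In this model, `OrderingSketch.lastMessage(fixed = true)` reports the channel as reusable, while `fixed = false` leaves the promise failed, which corresponds to the healthy connection being discarded.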