ChannelClosedException when remote end closes connection before idleTimeMs #2104
Comments
Thanks for filing this issue @hsmade, and thanks for the detailed repro steps! We will take a look.
Np, you can ping me on slack if you need more info.
Thx,
Wim Fournier
Thanks for this report @hsmade. That is surprising behavior. I'd recommend taking a look at WatermarkPool to see if something funny is going on there. The way I would expect this to work is that when Linkerd issues the request, it should see that the WatermarkPool connection queue doesn't have any Services in it that are not in the Closed state. Therefore, it should attempt to create a new connection. You can attach a debugger and see if the services in WatermarkPool's queue are closed (as they should be, if the connection has been closed by the remote). Hope this is helpful. I'm happy to chat about this more if you get stuck.
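The expected checkout behavior described above can be sketched with a toy model (this is NOT Finagle's actual WatermarkPool code; all names here are illustrative): closed services sitting in the reuse queue should be discarded on checkout, and a fresh connection established instead.

```scala
import scala.collection.mutable

// Toy status model standing in for com.twitter.finagle.Status.
sealed trait Status
case object Open extends Status
case object Closed extends Status

final case class Service(id: Int, var status: Status)

// Minimal sketch of the pool behavior the comment above expects.
final class ToyPool {
  private val queue = mutable.Queue.empty[Service]
  private var nextId = 0

  def checkin(s: Service): Unit = queue.enqueue(s)

  // Drain closed services from the head of the queue; only hand out an
  // Open one, or simulate establishing a new connection when none is usable.
  def checkout(): Service = {
    while (queue.nonEmpty) {
      val s = queue.dequeue()
      if (s.status == Open) return s
      // a Closed service is discarded rather than handed to the caller
    }
    nextId += 1
    Service(nextId, Open) // stand-in for dialing a fresh connection
  }
}
```

If the bug report is accurate, the real pool is instead handing back a service whose underlying channel the remote has already closed, which is exactly what a debugger session on the queue contents should confirm or rule out.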
@hsmade any luck investigating this?
Sorry, haven't had the time yet. It's on my todo list though.
Issue Type:
What happened:
When routing thrift, if the remote end closes the connection before the hostConnectionPool.idleTimeMs expires, the next call for that connection fails with a
com.twitter.finagle.ChannelClosedException
which is bubbled up to the calling service. On top of that, it seems that the effective idle time is actually somewhere between the configured hostConnectionPool.idleTimeMs and twice that value (or even more).
What you expected to happen:
I would expect linkerd to recognise that the connection is closed before trying to call. Also, I would expect that in this case (when it is recognised), the failure is marked as requeueable.
I'm not sure if it's possible to (cheaply) ask the connection we get from the pool whether it is still open before sending data. If so, we can mark it as failed if it's closed and let it requeue. If this is too costly, we should probably focus on why this fails even when the remote end's timeout is set to more than twice the idleTimeMs (and of course document the fact that this value might be doubled, as per Finagle's documentation).
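The mitigation suggested above can be sketched as a thin wrapper around dispatch (a sketch only; `Conn`, `dispatch`, and `Retryable` are hypothetical names, not Finagle API): check liveness before sending, and surface a closed connection as a requeueable failure instead of a hard exception.

```scala
// Toy result type: a Retryable failure signals the caller may requeue.
sealed trait Result
final case class Success(payload: String) extends Result
final case class Retryable(reason: String) extends Result

// Stand-in for a pooled connection whose remote end may have closed it.
final class Conn(var open: Boolean) {
  def send(req: String): String = {
    require(open, "connection closed")
    s"ok:$req"
  }
}

// Check the connection before dispatching; a closed connection becomes a
// requeueable failure rather than bubbling up as a hard exception.
def dispatch(conn: Conn, req: String): Result =
  if (!conn.open) Retryable("connection closed by remote before use")
  else Success(conn.send(req))
```

The open question raised above still applies to the real system: whether this liveness check can be done cheaply enough on every checkout to be worth it.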
How to reproduce it (as minimally and precisely as possible):
Set up a remote service that uses Apache Thrift 0.9 and set its clientTimeout to 6 seconds. Set up a router to that service with idleTimeMs set to 5 seconds. Start tcpdump and make a single call. Notice that linkerd does not close the connection; the remote end does. Make another call and notice the exception.
Next, set the clientTimeout to 12 seconds, make some calls, pause for 12 seconds, then make some more. Notice that the exception still happens (albeit less often).
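For reference, the router side of this repro can be sketched as a linkerd config fragment along these lines (a sketch following linkerd 1.x config conventions; the dtab destination and ports are placeholders, and exact field names should be checked against the linkerd docs):

```yaml
routers:
- protocol: thrift
  dtab: /svc => /$/inet/remote-thrift-host/9090;   # placeholder destination
  client:
    hostConnectionPool:
      idleTimeMs: 5000   # 5s, i.e. shorter than the remote's 6s clientTimeout
  servers:
  - port: 4114           # placeholder listen port
```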
Anything else we need to know?:
Here's the inline doc from Finagle about the cache TTL (which is in fact the idleTimeMs): https://github.com/twitter/finagle/blob/finagle-7.1.0/finagle-core/src/main/scala/com/twitter/finagle/util/Cache.scala#L16
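The "between its value and twice its value" observation is consistent with a sweep-based cache: if an evicter only wakes every `ttl` milliseconds and discards entries that have been idle for at least `ttl`, an entry cached just after a sweep survives for almost `2 * ttl`. A toy timeline (an illustrative model, not Finagle's Cache implementation) makes this concrete:

```scala
// Sweeps fire at ttl, 2*ttl, 3*ttl, ...; an entry is evicted at the first
// sweep that occurs at least ttl after it was cached. The resulting
// effective idle time therefore lies in [ttl, 2*ttl).
def evictionTime(cachedAtMs: Long, ttlMs: Long): Long = {
  val firstEligible = cachedAtMs + ttlMs
  Iterator.from(1).map(_.toLong * ttlMs).find(_ >= firstEligible).get
}
```

Under this model, documenting that idleTimeMs is a lower bound (with up to roughly double as the upper bound) would match the behavior seen in the repro.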
Environment: