-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix replicator hang or delay updating state caused by websocket #15
Conversation
Adhoc couchbase lite client
Try to fix long-lasting connecting or busy state
Even if I do not merge this PR directly, I will, absolutely, use much of the code it contains, and credit you in a comment. |
This PR looks like it is addressing several issues. Can you explain the symptoms you observed and why this change fixes it? Thanks! |
Sure, I'll comment on the changes about my thinkings. Let's see if it is the right way to do it. |
if (closing.getAndSet(true)) { | ||
Log.v(TAG, "CBLWebSocket already closing."); | ||
return; | ||
switch (state.getAndSet(State.REQUEST_CLOSED)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the main modification against couchbase/couchbase-lite-java-ce-root#13
In this method, requestClose
, it used a webSocket == null
to tell if the webSocket handshake was successful, because the webSocket
field would not be set until calling onOpen
method.
But we need more than that, on a situation that the core decide to close the webSocket before handshake is done. Then, okhttp should be notified, and have the resources released. Thus, we need to access the webSocket
field here to call WebSocekt.cancel
. After calling cancel, okhttp reacts to it, and call onFailure
later, telling AbstractCBLWebSocket
and core the failure, so that replication fails (and can be restarted from outside).
But in requestClose
, how do we know if the handshake is done or not? The old way (checking webSocket == null
) does not work any longer. So a State
class is there to hold this information. The old closing
AtomicBoolean is also covered by the State
and is removed.
As onOpen
and requestClose
is possibly be called in different thread at the same time (consider this situation: the websocket handshake response is received, at the same moment core decide to close it), several code here is for handling the race condition. Please check if something got missed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perfect. Thanks. On it.
@@ -297,8 +331,16 @@ protected AbstractCBLWebSocket( | |||
|
|||
@Override | |||
protected void openSocket() { | |||
OkHttpClient okHttpClient = httpClient; | |||
if (pingInterval > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following several lines are for couchbase/couchbase-lite-java-ce-root#12
What we really want is to detect bad network by websocket heartbeat, and what's more, to react as soon as possible when network is down. (In our project, a period of 10 seconds is ideal)
I've noticed that websocket heartbeat is already considered inside couchbase-lite-core. I tried to enable it, but unfortunately, it's only for framing
mode. After reading code for "a long time", a realized that okhttp and websocket here in java codes is just a "tunnel" for couchbase-lite-core and can be treated as a pure socket. So it's ok to enable this tunnel an ability to detect bad network, as long as it does't break higher level communication. This is how I finally feel ok to turn heartbeat on here.
Another word for .retryOnConnectionFailure(false)
. Even if the heartbeat is turned on, bad-network-detection is also much delayed because of okhttp retry. Retrying is good, but we want it in our control, don't retry on "every network layer". I
think turning okhttp retry off will not make the system fragile, because retry and restart strategy is one key point that couchbase lite users should concern carefully. They should handle it on the outmost layer, and should not hope the lib do everything for them.
Other codes are just about adding a new config pingInterval
to the config class and pass it here, quite naive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I worry that this is going to conflict with our upcoming feature to make the Core heartbeat configurable by the CBL consumer. It effectively means we'd now have two ways to set the same feature. Most of the time when the network is truly down, the TCP connection will also die which will close the web socket connection. This setting is really only useful when the TCP connection is still alive but the other side is not responding for whatever reason. I'd be interested to hear about scenarios like this, that would not be solved by a configurable core heartbeat. Do you know of any?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heartbeat is necessary. As couchbase lite is for mobile devices, any bad point between client and server will make it happen. Especially in our situation, each of our android device doesn't access to internet directly, but through another linux box that communicate with internet by 4G. So, when network is bad, you know... The mobile device can't detect it directly, maybe heartbeat is the only way.
And it is ok for us if core does heartbeats. That makes no difference. What we want include several points:
- Configurable interval in java codes
- Fail in time when network down is detected
According to the version I read, there is heartbeat in the core, but only for framing mode. Will it be changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting question. The fact that Java is using its own framing is something I was actually unaware of. Apple and .NET both using framing. If I ponder out loud in this comment, the whole existence of framing is basically a choice that the platform side makes between "I want Core to format my messages for me so that I can just send them over a raw socket" or "I want Core to give me the message payload and I will format / send it." Extending that, it means that it is not possible for Core to send a heartbeat if framing is not being used because it is in the dark about how to send it (or if it needs to be sent at all). Side note: The current other choice besides "no framing" is not "framing", but "web socket client framing" which is why Core can confidently send the heartbeat on its own in that mode. So in light of that, you are correct that Java must be the master of its own heartbeat. That being said, I am going to assume that this difference is going to be hidden behind the new API that we will offer in the next 2.8.x release to make both the retry logic and heartbeat configurable, rendering the public facing part of this PR moot and conflicting (the implementation, however, can remain).
Some optional background information that hopefully explains why we are introducing this API change, and why it will also satisfy your requirement, is that we have a large number of users who put Sync Gateway between a network load balancer which tends to cut off connections it views as "idle." This is bad for continuous replication, which spends large amounts of time idling. Currently we have the heartbeat set at a hard coded 5 minutes, but very often load balancers cut off connections before that and so we are going to allow the consumer to configure the heartbeat interval. As it stands now, the purpose of the heartbeat is twofold: It both makes sure that the other side does not hang up, and it checks the health of the connection. If a reply is not received in a timely manner (10 seconds), the connection is failed and replication stops (I forget if this is considered recoverable or not).
All of this is unrelated in a platform that does not use framing, though, so Java is free to do as it wishes in terms of heartbeats as long as it conforms to the API we will provide (out of your control for now) and accomplishes the same two goals, the second of which I believe is also your goal. My recommendation is to leave the replicator configuration changes out of the PR (since they will just be overwritten later when the new API comes) and keep the implementation. @bmeike Does this sound reasonable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@borrrden You mean, for java and android, enabling heartbeat this way is ok, but the user interface should comply with your coming version, right? So I should revert some code for the Configuration class, maybe CBLWebSocket remains. Is that ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that is what my recommendation is. In general the changes of merging something into the public API space from outside users is low because we need to weigh them against the need to bring the same features to other platforms besides (in this case) Java. In this particular case though, we are already planning the same change that you desire (though your PR has helped us realize this mistake in Java that otherwise would have continued to go unnoticed, I think) so the functionality is definitely welcome, with the API to be introduced later.
That being said, @bmeike must sign off on this as the owner of this repo. This is only my personal recommendation as a sort of "outside counsel" (I work with LiteCore and I basically came up here to explain why things are the way they are at the core level, and my recommendation might have been an overreach).
@@ -184,7 +184,7 @@ bool litecore::jni::initC4Replicator(JNIEnv *env) { | |||
|
|||
static jobject toJavaObject(JNIEnv *env, C4ReplicatorStatus status) { | |||
jobject obj = env->NewObject(cls_C4ReplStatus, m_C4ReplStatus_init); | |||
env->SetIntField(obj, f_C4ReplStatus_activityLevel, (int) status.level); | |||
env->SetIntField(obj, f_C4ReplStatus_activityLevel, (int) (status.level == kC4Stopping ? kC4Stopped : status.level)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This modification is not mentioned in the pr, in fact it's another problem. The C4Replicator has an interval status, kC4Stopping
, which does not exposes outside to java code. There's no such value in status enum in java code.
But sometimes it "leaks" outside. If the status is checked right after the replicator is asked to stop, a "stopping" will leak to java code and exception raised.
The "safe" way is to add the status also to java code, but then the java user interface is slightly changed.
I think it may be ok to "hide" in this way. If the replicator is stopping, just tell outside it is stopped, because normally speaking "stopping" means "promised to stop later". Callers can use it just as it is stopped, to start it, or to stop it, etc. (I've read the code that start and stop during stopping status are all considered) Maybe it makes no difference.
This is not a careful fix and feel free to replace it with better way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: Probably better to replace stopping with busy, not stopped. A lot of CBL consumer code will react to stopped by thinking it is allowed to do things like delete a database, which at this point it would still not be able to do. Ultimately, core should not be letting this out though! It takes care to not let it leak via callbacks, but asking for it directly seems leaky! I'll file an issue to plug that as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have another way to fix this that seems much safer to me. It will introduce a new state in the Java platform, UNKNOWN, which will be used whenever the platform code cannot figure out the replicators state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👌 I'll revert this change then. Just fix it elsewhere.
… caused app crash" This reverts commit 409f12e.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not the right way to solve this problem.
There are, actually, two problems:
- Core should never leak this status. That's a bug. It will be fixed.
- the platform code should deal reasonably with unrecognized statuses. The way to do that is to use constructor args to create this object, instead of directly setting the field values.
I have a commit to address these things, in master. I totally appreciate the effort you've put into this. That commit, though, is a better solution.
@bmeike Which commit ? Could I read it ? If you mean the kC4Stopping status, I've already revert the change. |
@workingenius Sure: it is in couchbase-lite-java-common (submodule common) @ e82ebfe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks ok to me. Might need some small tweaks later when the API comes but overall good.
I'm changing the target branch for this PR. The changes here conflict with some new code, so I need to be able to cherry pick it. I'm going to put it on master first and then back-port it to the 2.8 branch. I thank you very much for your work on this. I'll buy you a beer, sometime. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merging to Hotfix branch. Will cherry-pick to master, integrate with current code, then backport to Hydrogen
@workingenius The commit that fixes the problem with kC4Stopping is here: couchbase-lite-java-common @ e82ebfe I'll add the commit here, when I push it. |
The fix for this is in couchbase-lite-java-common @ d2413f2 As I said, I believe it fixes the bug. If it does not, it should make it relatively easy to do so. I'd be interested to hear your results. |
@bmeike Thank you for invitation. It's a big change and I'll read the code carefully in my spare time. Since we patched something to couchbase lite (both core and java client), our system seems become more stable. Now we are troubled by other parts but not couchbase. Personally I'm happy to help test new versions, but it depends on our team. 🤝 |
For couchbase/couchbase-lite-java-ce-root#12 and couchbase/couchbase-lite-java-ce-root#13