Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix replicator hang or delay updating state caused by websocket #15

Merged
merged 14 commits into from
Dec 18, 2020

Conversation

workingenius
Copy link

For couchbase/couchbase-lite-java-ce-root#12 and couchbase/couchbase-lite-java-ce-root#13

  1. Enable websocket heartbeat. Add a new config to ReplicatorConfiguration and pass it to CBLWebSocket.
  2. Cancel the attempt to connect on connection timeout before websocket handshake succeed.

@bmeike
Copy link
Contributor

bmeike commented Dec 10, 2020

Even if I do not merge this PR directly, I will, absolutely, use much of the code it contains, and credit you in a comment.
Many many thanks.

@bmeike
Copy link
Contributor

bmeike commented Dec 11, 2020

This PR looks like it is addressing several issues. Can you explain the symptoms you observed and why this change fixes it?

Thanks!

@workingenius
Copy link
Author

This PR looks like it is addressing several issues. Can you explain the symptoms you observed and why this change fixes it?

Thanks!

Sure, I'll comment on the changes about my thinkings. Let's see if it is the right way to do it.

if (closing.getAndSet(true)) {
Log.v(TAG, "CBLWebSocket already closing.");
return;
switch (state.getAndSet(State.REQUEST_CLOSED)) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the main modification against couchbase/couchbase-lite-java-ce-root#13

In this method, requestClose, it used a webSocket == null to tell if the webSocket handshake was successful, because the webSocket field would not be set until calling onOpen method.

But we need more than that, on a situation that the core decide to close the webSocket before handshake is done. Then, okhttp should be notified, and have the resources released. Thus, we need to access the webSocket field here to call WebSocekt.cancel. After calling cancel, okhttp reacts to it, and call onFailure later, telling AbstractCBLWebSocket and core the failure, so that replication fails (and can be restarted from outside).

But in requestClose, how do we know if the handshake is done or not? The old way (checking webSocket == null) does not work any longer. So a State class is there to hold this information. The old closing AtomicBoolean is also covered by the State and is removed.

As onOpen and requestClose is possibly be called in different thread at the same time (consider this situation: the websocket handshake response is received, at the same moment core decide to close it), several code here is for handling the race condition. Please check if something got missed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perfect. Thanks. On it.

@@ -297,8 +331,16 @@ protected AbstractCBLWebSocket(

@Override
protected void openSocket() {
OkHttpClient okHttpClient = httpClient;
if (pingInterval > 0) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following several lines are for couchbase/couchbase-lite-java-ce-root#12

What we really want is to detect bad network by websocket heartbeat, and what's more, to react as soon as possible when network is down. (In our project, a period of 10 seconds is ideal)

I've noticed that websocket heartbeat is already considered inside couchbase-lite-core. I tried to enable it, but unfortunately, it's only for framing mode. After reading code for "a long time", a realized that okhttp and websocket here in java codes is just a "tunnel" for couchbase-lite-core and can be treated as a pure socket. So it's ok to enable this tunnel an ability to detect bad network, as long as it does't break higher level communication. This is how I finally feel ok to turn heartbeat on here.

Another word for .retryOnConnectionFailure(false). Even if the heartbeat is turned on, bad-network-detection is also much delayed because of okhttp retry. Retrying is good, but we want it in our control, don't retry on "every network layer". I
think turning okhttp retry off will not make the system fragile, because retry and restart strategy is one key point that couchbase lite users should concern carefully. They should handle it on the outmost layer, and should not hope the lib do everything for them.

Other codes are just about adding a new config pingInterval to the config class and pass it here, quite naive.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry that this is going to conflict with our upcoming feature to make the Core heartbeat configurable by the CBL consumer. It effectively means we'd now have two ways to set the same feature. Most of the time when the network is truly down, the TCP connection will also die which will close the web socket connection. This setting is really only useful when the TCP connection is still alive but the other side is not responding for whatever reason. I'd be interested to hear about scenarios like this, that would not be solved by a configurable core heartbeat. Do you know of any?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heartbeat is necessary. As couchbase lite is for mobile devices, any bad point between client and server will make it happen. Especially in our situation, each of our android device doesn't access to internet directly, but through another linux box that communicate with internet by 4G. So, when network is bad, you know... The mobile device can't detect it directly, maybe heartbeat is the only way.

And it is ok for us if core does heartbeats. That makes no difference. What we want include several points:

  1. Configurable interval in java codes
  2. Fail in time when network down is detected

According to the version I read, there is heartbeat in the core, but only for framing mode. Will it be changed?

Copy link
Member

@borrrden borrrden Dec 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting question. The fact that Java is using its own framing is something I was actually unaware of. Apple and .NET both using framing. If I ponder out loud in this comment, the whole existence of framing is basically a choice that the platform side makes between "I want Core to format my messages for me so that I can just send them over a raw socket" or "I want Core to give me the message payload and I will format / send it." Extending that, it means that it is not possible for Core to send a heartbeat if framing is not being used because it is in the dark about how to send it (or if it needs to be sent at all). Side note: The current other choice besides "no framing" is not "framing", but "web socket client framing" which is why Core can confidently send the heartbeat on its own in that mode. So in light of that, you are correct that Java must be the master of its own heartbeat. That being said, I am going to assume that this difference is going to be hidden behind the new API that we will offer in the next 2.8.x release to make both the retry logic and heartbeat configurable, rendering the public facing part of this PR moot and conflicting (the implementation, however, can remain).

Some optional background information that hopefully explains why we are introducing this API change, and why it will also satisfy your requirement, is that we have a large number of users who put Sync Gateway between a network load balancer which tends to cut off connections it views as "idle." This is bad for continuous replication, which spends large amounts of time idling. Currently we have the heartbeat set at a hard coded 5 minutes, but very often load balancers cut off connections before that and so we are going to allow the consumer to configure the heartbeat interval. As it stands now, the purpose of the heartbeat is twofold: It both makes sure that the other side does not hang up, and it checks the health of the connection. If a reply is not received in a timely manner (10 seconds), the connection is failed and replication stops (I forget if this is considered recoverable or not).

All of this is unrelated in a platform that does not use framing, though, so Java is free to do as it wishes in terms of heartbeats as long as it conforms to the API we will provide (out of your control for now) and accomplishes the same two goals, the second of which I believe is also your goal. My recommendation is to leave the replicator configuration changes out of the PR (since they will just be overwritten later when the new API comes) and keep the implementation. @bmeike Does this sound reasonable?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@borrrden You mean, for java and android, enabling heartbeat this way is ok, but the user interface should comply with your coming version, right? So I should revert some code for the Configuration class, maybe CBLWebSocket remains. Is that ok?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that is what my recommendation is. In general the changes of merging something into the public API space from outside users is low because we need to weigh them against the need to bring the same features to other platforms besides (in this case) Java. In this particular case though, we are already planning the same change that you desire (though your PR has helped us realize this mistake in Java that otherwise would have continued to go unnoticed, I think) so the functionality is definitely welcome, with the API to be introduced later.

That being said, @bmeike must sign off on this as the owner of this repo. This is only my personal recommendation as a sort of "outside counsel" (I work with LiteCore and I basically came up here to explain why things are the way they are at the core level, and my recommendation might have been an overreach).

@@ -184,7 +184,7 @@ bool litecore::jni::initC4Replicator(JNIEnv *env) {

static jobject toJavaObject(JNIEnv *env, C4ReplicatorStatus status) {
jobject obj = env->NewObject(cls_C4ReplStatus, m_C4ReplStatus_init);
env->SetIntField(obj, f_C4ReplStatus_activityLevel, (int) status.level);
env->SetIntField(obj, f_C4ReplStatus_activityLevel, (int) (status.level == kC4Stopping ? kC4Stopped : status.level));
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This modification is not mentioned in the pr, in fact it's another problem. The C4Replicator has an interval status, kC4Stopping, which does not exposes outside to java code. There's no such value in status enum in java code.

But sometimes it "leaks" outside. If the status is checked right after the replicator is asked to stop, a "stopping" will leak to java code and exception raised.

The "safe" way is to add the status also to java code, but then the java user interface is slightly changed.
I think it may be ok to "hide" in this way. If the replicator is stopping, just tell outside it is stopped, because normally speaking "stopping" means "promised to stop later". Callers can use it just as it is stopped, to start it, or to stop it, etc. (I've read the code that start and stop during stopping status are all considered) Maybe it makes no difference.

This is not a careful fix and feel free to replace it with better way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: Probably better to replace stopping with busy, not stopped. A lot of CBL consumer code will react to stopped by thinking it is allowed to do things like delete a database, which at this point it would still not be able to do. Ultimately, core should not be letting this out though! It takes care to not let it leak via callbacks, but asking for it directly seems leaky! I'll file an issue to plug that as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have another way to fix this that seems much safer to me. It will introduce a new state in the Java platform, UNKNOWN, which will be used whenever the platform code cannot figure out the replicators state.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👌 I'll revert this change then. Just fix it elsewhere.

Copy link
Contributor

@bmeike bmeike left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the right way to solve this problem.
There are, actually, two problems:

  1. Core should never leak this status. That's a bug. It will be fixed.
  2. the platform code should deal reasonably with unrecognized statuses. The way to do that is to use constructor args to create this object, instead of directly setting the field values.
    I have a commit to address these things, in master. I totally appreciate the effort you've put into this. That commit, though, is a better solution.

@workingenius
Copy link
Author

This is not the right way to solve this problem.
There are, actually, two problems:

  1. Core should never leak this status. That's a bug. It will be fixed.
  2. the platform code should deal reasonably with unrecognized statuses. The way to do that is to use constructor args to create this object, instead of directly setting the field values.
    I have a commit to address these things, in master. I totally appreciate the effort you've put into this. That commit, though, is a better solution.

@bmeike Which commit ? Could I read it ?

If you mean the kC4Stopping status, I've already revert the change.

@bmeike
Copy link
Contributor

bmeike commented Dec 16, 2020

@workingenius Sure: it is in couchbase-lite-java-common (submodule common) @ e82ebfe

@workingenius
Copy link
Author

@bmeike I'v reverted changes on configuration class as @borrrden suggested. Code now is compatible with current ce and builds successfully. Please take a look.

Copy link
Member

@borrrden borrrden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks ok to me. Might need some small tweaks later when the API comes but overall good.

@bmeike bmeike changed the base branch from android/release/hydrogen to hotfix/CBL-1527 December 18, 2020 20:42
@bmeike
Copy link
Contributor

bmeike commented Dec 18, 2020

I'm changing the target branch for this PR. The changes here conflict with some new code, so I need to be able to cherry pick it.

I'm going to put it on master first and then back-port it to the 2.8 branch.

I thank you very much for your work on this. I'll buy you a beer, sometime.

Copy link
Contributor

@bmeike bmeike left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merging to Hotfix branch. Will cherry-pick to master, integrate with current code, then backport to Hydrogen

@bmeike bmeike merged commit 9b5805c into couchbase:hotfix/CBL-1527 Dec 18, 2020
@bmeike
Copy link
Contributor

bmeike commented Jan 6, 2021

@workingenius The commit that fixes the problem with kC4Stopping is here: couchbase-lite-java-common @ e82ebfe
I have another commit that will go in tomorrow. It turns CBLAbstractWebSocket into a state machine. That may not address all of the problems immediately but it will make it much easier to do so.

I'll add the commit here, when I push it.

@bmeike
Copy link
Contributor

bmeike commented Jan 6, 2021

The fix for this is in couchbase-lite-java-common @ d2413f2

As I said, I believe it fixes the bug. If it does not, it should make it relatively easy to do so.

I'd be interested to hear your results.

@workingenius
Copy link
Author

@bmeike Thank you for invitation. It's a big change and I'll read the code carefully in my spare time.

Since we patched something to couchbase lite (both core and java client), our system seems become more stable. Now we are troubled by other parts but not couchbase. Personally I'm happy to help test new versions, but it depends on our team. 🤝

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants