Fix replicator hang or delay updating state caused by websocket #15

workingenius · 2020-12-10T12:50:13Z

For couchbase/couchbase-lite-java-ce-root#12 and couchbase/couchbase-lite-java-ce-root#13

Enable websocket heartbeat. Add a new config to ReplicatorConfiguration and pass it to CBLWebSocket.
Cancel the attempt to connect on connection timeout before websocket handshake succeed.

…app crash

Adhoc couchbase lite client

Try to fix long-lasting connecting or busy state

bmeike · 2020-12-10T21:50:56Z

Even if I do not merge this PR directly, I will, absolutely, use much of the code it contains, and credit you in a comment.
Many many thanks.

bmeike · 2020-12-11T01:41:46Z

This PR looks like it is addressing several issues. Can you explain the symptoms you observed and why this change fixes it?

Thanks!

workingenius · 2020-12-11T02:07:00Z

This PR looks like it is addressing several issues. Can you explain the symptoms you observed and why this change fixes it?

Thanks!

Sure, I'll comment on the changes about my thinkings. Let's see if it is the right way to do it.

workingenius · 2020-12-11T02:27:38Z

common/main/java/com/couchbase/lite/internal/replicator/AbstractCBLWebSocket.java

- if (closing.getAndSet(true)) {
- Log.v(TAG, "CBLWebSocket already closing.");
- return;
+ switch (state.getAndSet(State.REQUEST_CLOSED)) {


This is the main modification against couchbase/couchbase-lite-java-ce-root#13

In this method, requestClose, it used a webSocket == null to tell if the webSocket handshake was successful, because the webSocket field would not be set until calling onOpen method.

But we need more than that, on a situation that the core decide to close the webSocket before handshake is done. Then, okhttp should be notified, and have the resources released. Thus, we need to access the webSocket field here to call WebSocekt.cancel. After calling cancel, okhttp reacts to it, and call onFailure later, telling AbstractCBLWebSocket and core the failure, so that replication fails (and can be restarted from outside).

But in requestClose, how do we know if the handshake is done or not? The old way (checking webSocket == null) does not work any longer. So a State class is there to hold this information. The old closing AtomicBoolean is also covered by the State and is removed.

As onOpen and requestClose is possibly be called in different thread at the same time (consider this situation: the websocket handshake response is received, at the same moment core decide to close it), several code here is for handling the race condition. Please check if something got missed.

perfect. Thanks. On it.

workingenius · 2020-12-11T02:50:09Z

common/main/java/com/couchbase/lite/internal/replicator/AbstractCBLWebSocket.java

@@ -297,8 +331,16 @@ protected AbstractCBLWebSocket(

 @Override
 protected void openSocket() {
+ OkHttpClient okHttpClient = httpClient;
+ if (pingInterval > 0) {


The following several lines are for couchbase/couchbase-lite-java-ce-root#12

What we really want is to detect bad network by websocket heartbeat, and what's more, to react as soon as possible when network is down. (In our project, a period of 10 seconds is ideal)

I've noticed that websocket heartbeat is already considered inside couchbase-lite-core. I tried to enable it, but unfortunately, it's only for framing mode. After reading code for "a long time", a realized that okhttp and websocket here in java codes is just a "tunnel" for couchbase-lite-core and can be treated as a pure socket. So it's ok to enable this tunnel an ability to detect bad network, as long as it does't break higher level communication. This is how I finally feel ok to turn heartbeat on here.

Another word for .retryOnConnectionFailure(false). Even if the heartbeat is turned on, bad-network-detection is also much delayed because of okhttp retry. Retrying is good, but we want it in our control, don't retry on "every network layer". I
think turning okhttp retry off will not make the system fragile, because retry and restart strategy is one key point that couchbase lite users should concern carefully. They should handle it on the outmost layer, and should not hope the lib do everything for them.

Other codes are just about adding a new config pingInterval to the config class and pass it here, quite naive.

I worry that this is going to conflict with our upcoming feature to make the Core heartbeat configurable by the CBL consumer. It effectively means we'd now have two ways to set the same feature. Most of the time when the network is truly down, the TCP connection will also die which will close the web socket connection. This setting is really only useful when the TCP connection is still alive but the other side is not responding for whatever reason. I'd be interested to hear about scenarios like this, that would not be solved by a configurable core heartbeat. Do you know of any?

Heartbeat is necessary. As couchbase lite is for mobile devices, any bad point between client and server will make it happen. Especially in our situation, each of our android device doesn't access to internet directly, but through another linux box that communicate with internet by 4G. So, when network is bad, you know... The mobile device can't detect it directly, maybe heartbeat is the only way.

And it is ok for us if core does heartbeats. That makes no difference. What we want include several points:

Configurable interval in java codes

Fail in time when network down is detected

According to the version I read, there is heartbeat in the core, but only for framing mode. Will it be changed?

Interesting question. The fact that Java is using its own framing is something I was actually unaware of. Apple and .NET both using framing. If I ponder out loud in this comment, the whole existence of framing is basically a choice that the platform side makes between "I want Core to format my messages for me so that I can just send them over a raw socket" or "I want Core to give me the message payload and I will format / send it." Extending that, it means that it is not possible for Core to send a heartbeat if framing is not being used because it is in the dark about how to send it (or if it needs to be sent at all). Side note: The current other choice besides "no framing" is not "framing", but "web socket client framing" which is why Core can confidently send the heartbeat on its own in that mode. So in light of that, you are correct that Java must be the master of its own heartbeat. That being said, I am going to assume that this difference is going to be hidden behind the new API that we will offer in the next 2.8.x release to make both the retry logic and heartbeat configurable, rendering the public facing part of this PR moot and conflicting (the implementation, however, can remain).

Some optional background information that hopefully explains why we are introducing this API change, and why it will also satisfy your requirement, is that we have a large number of users who put Sync Gateway between a network load balancer which tends to cut off connections it views as "idle." This is bad for continuous replication, which spends large amounts of time idling. Currently we have the heartbeat set at a hard coded 5 minutes, but very often load balancers cut off connections before that and so we are going to allow the consumer to configure the heartbeat interval. As it stands now, the purpose of the heartbeat is twofold: It both makes sure that the other side does not hang up, and it checks the health of the connection. If a reply is not received in a timely manner (10 seconds), the connection is failed and replication stops (I forget if this is considered recoverable or not).

All of this is unrelated in a platform that does not use framing, though, so Java is free to do as it wishes in terms of heartbeats as long as it conforms to the API we will provide (out of your control for now) and accomplishes the same two goals, the second of which I believe is also your goal. My recommendation is to leave the replicator configuration changes out of the PR (since they will just be overwritten later when the new API comes) and keep the implementation. @bmeike Does this sound reasonable?

@borrrden You mean, for java and android, enabling heartbeat this way is ok, but the user interface should comply with your coming version, right? So I should revert some code for the Configuration class, maybe CBLWebSocket remains. Is that ok?

Yes that is what my recommendation is. In general the changes of merging something into the public API space from outside users is low because we need to weigh them against the need to bring the same features to other platforms besides (in this case) Java. In this particular case though, we are already planning the same change that you desire (though your PR has helped us realize this mistake in Java that otherwise would have continued to go unnoticed, I think) so the functionality is definitely welcome, with the API to be introduced later.

That being said, @bmeike must sign off on this as the owner of this repo. This is only my personal recommendation as a sort of "outside counsel" (I work with LiteCore and I basically came up here to explain why things are the way they are at the core level, and my recommendation might have been an overreach).

workingenius · 2020-12-11T03:04:20Z

common/main/cpp/native_c4replicator.cc

@@ -184,7 +184,7 @@ bool litecore::jni::initC4Replicator(JNIEnv *env) {

 static jobject toJavaObject(JNIEnv *env, C4ReplicatorStatus status) {
 jobject obj = env->NewObject(cls_C4ReplStatus, m_C4ReplStatus_init);
- env->SetIntField(obj, f_C4ReplStatus_activityLevel, (int) status.level);
+ env->SetIntField(obj, f_C4ReplStatus_activityLevel, (int) (status.level == kC4Stopping ? kC4Stopped : status.level));


This modification is not mentioned in the pr, in fact it's another problem. The C4Replicator has an interval status, kC4Stopping, which does not exposes outside to java code. There's no such value in status enum in java code.

But sometimes it "leaks" outside. If the status is checked right after the replicator is asked to stop, a "stopping" will leak to java code and exception raised.

The "safe" way is to add the status also to java code, but then the java user interface is slightly changed.
I think it may be ok to "hide" in this way. If the replicator is stopping, just tell outside it is stopped, because normally speaking "stopping" means "promised to stop later". Callers can use it just as it is stopped, to start it, or to stop it, etc. (I've read the code that start and stop during stopping status are all considered) Maybe it makes no difference.

This is not a careful fix and feel free to replace it with better way.

Note: Probably better to replace stopping with busy, not stopped. A lot of CBL consumer code will react to stopped by thinking it is allowed to do things like delete a database, which at this point it would still not be able to do. Ultimately, core should not be letting this out though! It takes care to not let it leak via callbacks, but asking for it directly seems leaky! I'll file an issue to plug that as well.

I have another way to fix this that seems much safer to me. It will introduce a new state in the Java platform, UNKNOWN, which will be used whenever the platform code cannot figure out the replicators state.

👌 I'll revert this change then. Just fix it elsewhere.

… caused app crash" This reverts commit 409f12e.

bmeike

This is not the right way to solve this problem.
There are, actually, two problems:

Core should never leak this status. That's a bug. It will be fixed.
the platform code should deal reasonably with unrecognized statuses. The way to do that is to use constructor args to create this object, instead of directly setting the field values.
I have a commit to address these things, in master. I totally appreciate the effort you've put into this. That commit, though, is a better solution.

workingenius · 2020-12-16T02:17:19Z

This is not the right way to solve this problem.
There are, actually, two problems:

Core should never leak this status. That's a bug. It will be fixed.

the platform code should deal reasonably with unrecognized statuses. The way to do that is to use constructor args to create this object, instead of directly setting the field values.
I have a commit to address these things, in master. I totally appreciate the effort you've put into this. That commit, though, is a better solution.

@bmeike Which commit ? Could I read it ?

If you mean the kC4Stopping status, I've already revert the change.

bmeike · 2020-12-16T06:57:41Z

@workingenius Sure: it is in couchbase-lite-java-common (submodule common) @ e82ebfe

workingenius · 2020-12-18T02:59:10Z

@bmeike I'v reverted changes on configuration class as @borrrden suggested. Code now is compatible with current ce and builds successfully. Please take a look.

borrrden

It looks ok to me. Might need some small tweaks later when the API comes but overall good.

bmeike · 2020-12-18T20:44:25Z

I'm changing the target branch for this PR. The changes here conflict with some new code, so I need to be able to cherry pick it.

I'm going to put it on master first and then back-port it to the 2.8 branch.

I thank you very much for your work on this. I'll buy you a beer, sometime.

bmeike

Merging to Hotfix branch. Will cherry-pick to master, integrate with current code, then backport to Hydrogen

bmeike · 2021-01-06T01:37:54Z

@workingenius The commit that fixes the problem with kC4Stopping is here: couchbase-lite-java-common @ e82ebfe
I have another commit that will go in tomorrow. It turns CBLAbstractWebSocket into a state machine. That may not address all of the problems immediately but it will make it much easier to do so.

I'll add the commit here, when I push it.

bmeike · 2021-01-06T19:00:42Z

The fix for this is in couchbase-lite-java-common @ d2413f2

As I said, I believe it fixes the bug. If it does not, it should make it relatively easy to do so.

I'd be interested to hear your results.

workingenius · 2021-01-07T09:10:13Z

@bmeike Thank you for invitation. It's a big change and I'll read the code carefully in my spare time.

Since we patched something to couchbase lite (both core and java client), our system seems become more stable. Now we are troubled by other parts but not couchbase. Personally I'm happy to help test new versions, but it depends on our team. 🤝

workingenius and others added 11 commits November 11, 2020 11:30

Enable websocket heartbeat by an interval of 10s

cf575a2

Solve "IllegalStateException: Unrecognized activity level: 5" caused …

409f12e

…app crash

Merge pull request #1 from turingvideo/srq-replication

fd17341

Adhoc couchbase lite client

Add some timeout to try to fix "always-busy" problem

fe430f0

Try to fix long-lasting connecting or busy state

93efaa3

Fix by simplify

9b89261

Code polish

442d6ea

Merge pull request #2 from turingvideo/srq-replication

32d8536

Try to fix long-lasting connecting or busy state

rename status to state

d318f1b

Remove "closing" boolean

b195fa6

Add pingInterval to ReplicatorConfigration

ddf4c14

workingenius mentioned this pull request Dec 10, 2020

Replicator websocket connection timeout is not correctly handled on android couchbase/couchbase-lite-java-ce-root#13

Closed

workingenius commented Dec 11, 2020

View reviewed changes

workingenius commented Dec 14, 2020

View reviewed changes

Revert "Solve "IllegalStateException: Unrecognized activity level: 5"…

f5fa69a

… caused app crash" This reverts commit 409f12e.

bmeike requested changes Dec 15, 2020

View reviewed changes

workingenius added 2 commits December 18, 2020 10:28

Revert changes on configuration class

f95cff5

Be compatible with current ce

6813d3e

workingenius requested review from bmeike and borrrden December 18, 2020 02:59

borrrden approved these changes Dec 18, 2020

View reviewed changes

bmeike changed the base branch from android/release/hydrogen to hotfix/CBL-1527 December 18, 2020 20:42

bmeike approved these changes Dec 18, 2020

View reviewed changes

bmeike merged commit 9b5805c into couchbase:hotfix/CBL-1527 Dec 18, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix replicator hang or delay updating state caused by websocket #15

Fix replicator hang or delay updating state caused by websocket #15

workingenius commented Dec 10, 2020

bmeike commented Dec 10, 2020

bmeike commented Dec 11, 2020

workingenius commented Dec 11, 2020

workingenius Dec 11, 2020

bmeike Dec 11, 2020

workingenius Dec 11, 2020

borrrden Dec 14, 2020

workingenius Dec 15, 2020

borrrden Dec 15, 2020 •

edited

Loading

workingenius Dec 15, 2020

borrrden Dec 16, 2020

workingenius Dec 11, 2020

borrrden Dec 14, 2020

bmeike Dec 14, 2020

borrrden Dec 15, 2020

workingenius Dec 15, 2020

bmeike left a comment

workingenius commented Dec 16, 2020

bmeike commented Dec 16, 2020

workingenius commented Dec 18, 2020

borrrden left a comment

bmeike commented Dec 18, 2020

bmeike left a comment

bmeike commented Jan 6, 2021 •

edited

Loading

bmeike commented Jan 6, 2021

workingenius commented Jan 7, 2021

Fix replicator hang or delay updating state caused by websocket #15

Fix replicator hang or delay updating state caused by websocket #15

Conversation

workingenius commented Dec 10, 2020

bmeike commented Dec 10, 2020

bmeike commented Dec 11, 2020

workingenius commented Dec 11, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

borrrden Dec 15, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bmeike left a comment

Choose a reason for hiding this comment

workingenius commented Dec 16, 2020

bmeike commented Dec 16, 2020

workingenius commented Dec 18, 2020

borrrden left a comment

Choose a reason for hiding this comment

bmeike commented Dec 18, 2020

bmeike left a comment

Choose a reason for hiding this comment

bmeike commented Jan 6, 2021 • edited Loading

bmeike commented Jan 6, 2021

workingenius commented Jan 7, 2021

borrrden Dec 15, 2020 •

edited

Loading

bmeike commented Jan 6, 2021 •

edited

Loading