
No messages delivered after websocket client takeover #1489

Closed
bverhoeven opened this issue Nov 5, 2019 · 13 comments

bverhoeven commented Nov 5, 2019

Using Mosquitto 1.6.7 (checked both 'master' and 'fixes' branches), I'm seeing an issue with the web socket 'take over' behavior where, once a client has taken over an old session, it doesn't get any messages if it subscribes to the same topics as the old session.

Situation

A publisher is writing data to example/example1 every second.

A websocket client is connected with client_id example, subscribed to example/example1 (qos 1), receiving all those messages just fine. After a while that connection is somehow lost without a clean shutdown (network issues, client crash, bad link, etc.) and the client reconnects.

Mosquitto then correctly prints "Client example already connected, closing old connection".

The new client happily subscribes to the example/example1 topic again, which is acknowledged with a SUBACK, all seemingly fine. From this point on, however, the client never receives the messages that are published to example/example1 until it reconnects again.

There are no errors in mosquitto's logs.

Reproducing

I think I've found an easy way to reproduce this by having:

  • a websocket client connected with client id example,
  • a small publisher running (publishing a message every second):
    (while true; do mosquitto_pub -h localhost -t example/example1 -m pong; sleep 1; done) &

And then connecting my own client using mosquitto_sub with the example client id:

$ mosquitto_sub -h localhost -q 1 -i example -t example/example1 -v -d
Received CONNACK
Received SUBACK
Subscribed (mid: 1): 1
^C

(No data is received).

The old client is disconnected, but no data is ever received for the new session. If I then close this connection and try to connect again, however, it works immediately:

$ mosquitto_sub -h localhost -q 1 -i example -t example/example1 -v -d
Received CONNACK
Received SUBACK
Subscribed (mid: 1): 1
Received PUBLISH (d0, q0, r0, m0, 'example/example1', ... (4 bytes))
example/example1 pong

It'll also immediately work if:

  • I use a different client id
  • I subscribe to the same topic with a different pattern, e.g. example/+

This makes it seem as if the old session isn't cleaned up properly and mosquitto still "sees" the old session and all its subscriptions. It's almost as if mosquitto believes that the client with id 'example' is already subscribed to "example/example1" and therefore sees no need to actually subscribe it again.

Possible fix

After some debugging I've tracked this down to the way that websocket clients are cleaned up in a takeover situation, as last changed by 5088202.

This issue seems similar to #1227, but the fix there doesn't seem to have helped and I'm wondering if it's doing what it should be doing:

mosquitto/src/loop.c

Lines 678 to 686 in b622aae

if(is_duplicate){
	/* This occurs if another client is taking over the same client id.
	 * It is important to remove this from the by_id hash here, so it
	 * doesn't leave us with multiple clients in the hash with the same
	 * id. Websockets doesn't actually close the connection here,
	 * unlike for normal clients, which means there is extra time when
	 * there could be two clients with the same id in the hash. */
	context__remove_from_by_id(db, context);
}

If I replace this call with context__add_to_disused(db, context), the issue seems to be resolved: the old client is still disconnected, but on a new connection re-subscribing works and the client receives the messages as expected.
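
For clarity, this is the block with just that one call swapped (the comment is mine, everything else is as quoted above):

if(is_duplicate){
	/* Experimental: add the context to the disused list instead of only
	 * removing it from the by_id hash. */
	context__add_to_disused(db, context);
}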

The rationale behind this change is that context__disconnect also ends up calling this function, which in turn calls the context__remove_from_by_id referenced in the code above, but also appears to free up some additional structures.

I've just started with mosquitto so I'm not knowledgeable enough to see if this change would cause any issues. I'd appreciate it if someone with more in-depth knowledge of mosquitto could take a peek.


ralight commented Nov 5, 2019

Thanks for the detailed description. I haven't checked this yet, but I'm sure you're right. Websockets is a right pain.

I don't think what you've proposed is correct. Adding the client to the disused list (using context__add_to_disused()) means that the client will be completely freed soon. This could happen before the websockets part of the code finishes, so bad things would happen.

I'll take a proper look at it soon.


bverhoeven commented Nov 6, 2019

Good point, hadn't considered that. That makes sense and I could definitely see it causing issues.

Not sure if it's of any help, but I've done some more testing myself and it seems that I can only reproduce this problem when the web socket listener is using SSL: if I test with regular web sockets then everything works as expected. I can only speculate as to why this is, but it may be related to the fact that it takes a while for an SSL connection to be properly closed?

I've created a small test script (Python 3) that always reproduces the issue for me when I have SSL enabled: https://gist.github.com/bverhoeven/00b638487df9f0a1b24a58d6954ffaaa#file-issue-1489-py. I've also included the test configuration I used.

Example output:

Connecting publisher...
Publisher: Sending CONNECT (u0, p0, wr0, wq0, wf0, c1, k60) client_id=b'example-publisher'
Publisher: Received CONNACK (0, 0)
Ready for tests.

## Client 1
Client 1: Sending CONNECT (u0, p0, wr0, wq0, wf0, c1, k60) client_id=b'example-client'
Client 1: Received CONNACK (0, 0)
Client 1: Sending SUBSCRIBE (d0, m1) [(b'example/example1', 1)]
Client 1: Received SUBACK
Client 1: Subscribed, publishing message with publisher..
Publisher: Sending PUBLISH (d0, q0, r0, m1), 'b'example/example1'', ... (7 bytes)
Client 1: Received PUBLISH (d0, q0, r0, m0), 'example/example1', ...  (7 bytes)

OK: Client received published message.

## Client 2
Client 2: Sending CONNECT (u0, p0, wr0, wq0, wf0, c1, k60) client_id=b'example-client'
Client 1: Disconnected (rc=1)
Client 2: Received CONNACK (0, 0)
Client 2: Sending SUBSCRIBE (d0, m1) [(b'example/example1', 1)]
Client 2: Received SUBACK
Client 2: Subscribed, publishing message with publisher..
Publisher: Sending PUBLISH (d0, q0, r0, m2), 'b'example/example1'', ... (7 bytes)

FAIL: Client did not get published message.

Quick update: I can also easily reproduce it using the public test broker at test.mosquitto.org, so the issue doesn't seem to be limited to just my setup:

server_host = 'test.mosquitto.org'
server_port = 8081
server_use_tls = True


ralight commented Nov 6, 2019

I've also reproduced it thanks to your test file; that's very much appreciated.


ralight commented Nov 6, 2019

Experimental fix (I'm clocking off for this evening; it works but I haven't tested it further):

diff --git a/src/handle_connect.c b/src/handle_connect.c
index 17434b2..695ce2a 100644
--- a/src/handle_connect.c
+++ b/src/handle_connect.c
@@ -160,6 +160,9 @@ int connect__on_authorised(struct mosquitto_db *db, struct mosquitto *context, v
                        }
                }
 
+               if(context->clean_start == true){
+                       sub__clean_session(db, found_context);
+               }
                session_expiry__remove(found_context);
                will_delay__remove(found_context);
                will__clear(found_context);


bverhoeven commented Nov 7, 2019

At first glance that does seem to fix the issue, but by the looks of it only when connecting with 'clean session' set?

A quick test on my side confirms that if I set clean_session=False in the test script, the issue reappears.

Edit: Would it be bad if this logic was applied to all clients regardless of the "clean_start" flag?

I can confirm that the issue is that the first web socket connection doesn't get closed (and thereby cleaned up/freed) until after the second client has managed to subscribe:

  • Client 2 connects.
  • Client 1 needs to be disconnected, but as LWS doesn't offer a direct way to do this, the connection is instead marked for shutdown and a write callback is configured to perform the actual disconnection (see the sketch after this list).
  • Client 2 receives CONNACK.
  • Write callback for 'Client 1' triggers a disconnect.
  • Client 2 sends SUBSCRIBE, mosquitto acknowledges with SUBACK.
  • Client 1 finally disconnects and resources, including subscriptions, for that client are cleaned up.
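
To illustrate the mechanism in that second bullet, here is a rough sketch of the deferred-close pattern. This is not mosquitto's actual websockets code: struct ws_client, marked_for_close and the function names are made up for the example; lws_callback_on_writable() and LWS_CALLBACK_SERVER_WRITEABLE are the real libwebsockets pieces.

#include <libwebsockets.h>
#include <stdbool.h>

struct ws_client {                 /* hypothetical per-connection data */
	struct lws *wsi;
	bool marked_for_close;
};

/* Called by the broker when a new client takes over: the old websocket
 * cannot be closed directly, so mark it and request a writeable callback. */
static void ws_client_defer_close(struct ws_client *c)
{
	c->marked_for_close = true;
	lws_callback_on_writable(c->wsi);
}

/* The actual close happens later, inside the protocol callback. */
static int ws_callback(struct lws *wsi, enum lws_callback_reasons reason,
		void *user, void *in, size_t len)
{
	struct ws_client *c = (struct ws_client *)user;
	(void)wsi; (void)in; (void)len;

	switch(reason){
	case LWS_CALLBACK_SERVER_WRITEABLE:
		if(c && c->marked_for_close){
			return -1; /* non-zero return tells LWS to close this connection */
		}
		break;
	default:
		break;
	}
	return 0;
}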

I've thought of ignoring clients marked for disconnect in the subscription logic (where I figured it probably checks if the subscription already exists), but that all seems a bit too far-fetched and feels hacky.

If we always called sub__clean_session on a takeover, however, we'd fix this issue by removing all subscriptions for the old client as soon as the new client connects.

The fact that the old connection still lingers on for a few more seconds doesn't seem to be harmful.

It also seems that, as long as we don't call db__messages_delete, no messages are removed, so "clean session=false" behavior should still work as expected?
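
To make the suggestion concrete, here is a minimal sketch relative to your experimental patch in connect__on_authorised, with the call made unconditional (whether this is actually safe is exactly what I'm unsure about):

/* Suggestion: always drop the old session's subscriptions on takeover,
 * regardless of the new connection's clean_start flag. Messages are not
 * touched here (no db__messages_delete call), so queued messages would
 * still be available to a clean_start == false session. */
sub__clean_session(db, found_context);

session_expiry__remove(found_context);
will_delay__remove(found_context);
will__clear(found_context);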

Again, not familiar with mosquitto internals, or too familiar with the specs, so pardon my ignorance.


ralight commented Nov 7, 2019

The takeover case when clean session is false is actually easier. In that case, the taking over client takes over the subscriptions of the old client and they are removed from the old client. That means this shouldn't be an issue. I can't reproduce it. Here is what I get for both clean session = True and False:
[Screenshot: wstakeover]


bverhoeven commented Nov 7, 2019

Interesting. For me, with your patch applied, I'm still seeing the issue with clean_start == false:

clean_session=True

## Client 2
Client 2: Sending CONNECT (u0, p0, wr0, wq0, wf0, c1, k60) client_id=b'example-client'
Client 1: Disconnected (rc=1)
Client 2: Received CONNACK (0, 0)
Client 2: Sending SUBSCRIBE (d0, m1) [(b'example/example1', 1)]
Client 2: Received SUBACK
Client 2: Subscribed, publishing message with publisher..
Publisher: Sending PUBLISH (d0, q0, r0, m2), 'b'example/example1'', ... (7 bytes)
Client 2: Received PUBLISH (d0, q0, r0, m0), 'example/example1', ...  (7 bytes)

OK: Client received published message.

clean_session=False

## Client 2
Client 2: Sending CONNECT (u0, p0, wr0, wq0, wf0, c0, k60) client_id=b'example-client'
Client 1: Disconnected (rc=1)
Client 2: Received CONNACK (0, 0)
Client 2: Sending SUBSCRIBE (d0, m1) [(b'example/example1', 1)]
Client 2: Received SUBACK
Client 2: Subscribed, publishing message with publisher..
Publisher: Sending PUBLISH (d0, q0, r0, m2), 'b'example/example1'', ... (7 bytes)

FAIL: Client did not get published message.

It does work fine if I call sub__clean_session unconditionally.


ralight commented Nov 7, 2019

What version of libwebsockets are you using?

bverhoeven commented

I'm using the docker/local/Dockerfile provided by the project as my main environment; it fetches libwebsockets version 2.4.2.

@ralight
Copy link
Contributor

ralight commented Nov 7, 2019

Hmm, very odd. I have tried on a few different systems and can't reproduce it. I am really surprised that you're seeing the problem; it would mean that the whole concept of taking over an existing clean session=false session was broken.

ralight added a commit that referenced this issue Nov 7, 2019
This can occur after one client takes over a previous connection.

Closes #1489. Thanks to Bas Verhoeven.

ralight commented Nov 7, 2019

I know you're not seeing it completely fixed, but that is the commit with the above patch in it for now.


bverhoeven commented Nov 8, 2019

Okay, never mind any of that...

I thought my source tree was clean, but for some reason my helpful editor decided to use an older version of handle_connect.c in which I had removed some chunks of code to debug the initial issue (prior to your fix), mainly the entire block that handles clean_start == false:

if(context->clean_start == false && found_context->session_expiry_interval > 0){
	if(context->protocol == mosq_p_mqtt311 || context->protocol == mosq_p_mqtt5){
		connect_ack |= 0x01;
	}
	if(found_context->msgs_in.inflight || found_context->msgs_in.queued
			|| found_context->msgs_out.inflight || found_context->msgs_out.queued){

		memcpy(&context->msgs_in, &found_context->msgs_in, sizeof(struct mosquitto_msg_data));
		memcpy(&context->msgs_out, &found_context->msgs_out, sizeof(struct mosquitto_msg_data));
		memset(&found_context->msgs_in, 0, sizeof(struct mosquitto_msg_data));
		memset(&found_context->msgs_out, 0, sizeof(struct mosquitto_msg_data));
		db__message_reconnect_reset(db, context);
	}

	context->subs = found_context->subs;
	found_context->subs = NULL;
	context->sub_count = found_context->sub_count;
	found_context->sub_count = 0;
	context->last_mid = found_context->last_mid;

	for(i=0; i<context->sub_count; i++){
		if(context->subs[i]){
			leaf = context->subs[i]->subs;
			while(leaf){
				if(leaf->context == found_context){
					leaf->context = context;
				}
				leaf = leaf->next;
			}
		}
	}
}

Also, my apologies for the earlier confusion: I for some reason thought that subscriptions would not be part of a 'session', and thus would not be affected by 'clean session'. After reading the v3.1.1 standard, it turns out that subscriptions are in fact very much part of the session and everything should indeed be working exactly as you described.

With a clean source tree and your patch applied all issues appear to be resolved. Thanks!

bverhoeven commented

I've been running a version with this patch applied in production for the past week and all seems to be working as expected. Thanks again for your quick response and again my apologies for the confusion I caused by accidentally stripping out pieces of code.
