
No messages delivered after websocket client takeover #1489

Closed
bverhoeven opened this issue Nov 5, 2019 · 13 comments

bverhoeven commented Nov 5, 2019

Using Mosquitto 1.6.7 (checked both 'master' and 'fixes' branches), I'm seeing an issue with the web socket 'take over' behavior where, once a client has taken over an old session, it doesn't get any messages if it subscribes to the same topics as the old session.

Situation

A publisher is writing data to example/example1 every second.

A websocket client is connected with client_id example, subscribed to example/example1 (qos 1), receiving all those messages just fine. After a while that connection is somehow lost without a clean shutdown (network issues, client crash, bad link, etc.) and the client reconnects.

Mosquitto then correctly prints "Client example already connected, closing old connection".

The new client happily subscribes to the example/example1 topic again, which is acknowledged with a SUBACK, all seemingly fine. From this point on, however, the client never receives the messages that are published to example/example1 until it reconnects again.

There are no errors in mosquitto's logs.

Reproducing

I think I've found an easy way to reproduce this by having:

  • a websocket client connected with client id example,
  • a small publisher running (publishing a message every second):
    (while true; do mosquitto_pub -h localhost -t example/example1 -m pong; sleep 1; done) &

And then connecting my own client using mosquitto_sub with the example client id:

$ mosquitto_sub -h localhost -q 1 -i example -t example/example1 -v -d
Received CONNACK
Received SUBACK
Subscribed (mid: 1): 1
^C

(No data is received).

The old client is disconnected, but no data is ever received for the new session. If I then close this connection and try to connect again, however, it works immediately:

$ mosquitto_sub -h localhost -q 1 -i example -t example/example1 -v -d
Received CONNACK
Received SUBACK
Subscribed (mid: 1): 1
Received PUBLISH (d0, q0, r0, m0, 'example/example1', ... (4 bytes))
example/example1 pong

It'll also immediately work if:

  • I use a different client id
  • I subscribe to the same topic with a different pattern, e.g. example/+

This makes it seem as if the old session isn't cleaned up properly and mosquitto still "sees" the old session and all its subscriptions. It's almost as if mosquitto believes that the client with id 'example' is already subscribed to "example/example1" and therefore sees no need to actually subscribe it again.

Possible fix

After some debugging I've tracked this down to the way that websocket clients are cleaned up in a takeover situation, as last changed by 5088202.

This issue seems similar to #1227, but the fix there doesn't seem to have helped and I'm wondering if it's doing what it should be doing:

mosquitto/src/loop.c

Lines 678 to 686 in b622aae

if(is_duplicate){
	/* This occurs if another client is taking over the same client id.
	 * It is important to remove this from the by_id hash here, so it
	 * doesn't leave us with multiple clients in the hash with the same
	 * id. Websockets doesn't actually close the connection here,
	 * unlike for normal clients, which means there is extra time when
	 * there could be two clients with the same id in the hash. */
	context__remove_from_by_id(db, context);
}

If I replace this call with context__add_to_disused(db, context), the issue seems to be resolved: the old client is still disconnected, but on a new connection re-subscribing works and the client receives the messages as expected.
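
For clarity, this is the block with just that one call swapped (the comment is mine, everything else is as quoted above):

if(is_duplicate){
	/* Experimental: add the context to the disused list instead of only
	 * removing it from the by_id hash. */
	context__add_to_disused(db, context);
}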

The rationale behind this change is that context__disconnect also ends up calling this function, which in turn calls the context__remove_from_by_id referenced in the code above, but also appears to free up some additional structures.

I've just started with mosquitto so I'm not knowledgeable enough to see if this change would cause any issues. I'd appreciate it if someone with more in-depth knowledge of mosquitto could take a peek.


ralight commented Nov 5, 2019

Thanks for the detailed description. I haven't checked this yet, but I'm sure you're right. Websockets is a right pain.

I don't think what you've proposed is correct. Adding the client to the disused list (using context__add_to_disused()) means that the client will be completely freed soon. This could happen before the websockets part of the code finishes, so bad things would happen.

I'll take a proper look at it soon.


bverhoeven commented Nov 6, 2019

Good point, hadn't considered that. That makes sense and I could definitely see it causing issues.

Not sure if it's of any help, but I've done some more testing myself and it seems that I can only reproduce this problem when the web socket listener is using SSL: if I test with regular web sockets then everything works as expected. I can only speculate as to why this is, but it may be related to the fact that it takes a while for an SSL connection to be properly closed?

I've created a small test script (Python 3) that always reproduces the issue for me when I have SSL enabled: https://gist.github.com/bverhoeven/00b638487df9f0a1b24a58d6954ffaaa#file-issue-1489-py. I've also included the test configuration I used.

Example output:

Connecting publisher...
Publisher: Sending CONNECT (u0, p0, wr0, wq0, wf0, c1, k60) client_id=b'example-publisher'
Publisher: Received CONNACK (0, 0)
Ready for tests.

## Client 1
Client 1: Sending CONNECT (u0, p0, wr0, wq0, wf0, c1, k60) client_id=b'example-client'
Client 1: Received CONNACK (0, 0)
Client 1: Sending SUBSCRIBE (d0, m1) [(b'example/example1', 1)]
Client 1: Received SUBACK
Client 1: Subscribed, publishing message with publisher..
Publisher: Sending PUBLISH (d0, q0, r0, m1), 'b'example/example1'', ... (7 bytes)
Client 1: Received PUBLISH (d0, q0, r0, m0), 'example/example1', ...  (7 bytes)

OK: Client received published message.

## Client 2
Client 2: Sending CONNECT (u0, p0, wr0, wq0, wf0, c1, k60) client_id=b'example-client'
Client 1: Disconnected (rc=1)
Client 2: Received CONNACK (0, 0)
Client 2: Sending SUBSCRIBE (d0, m1) [(b'example/example1', 1)]
Client 2: Received SUBACK
Client 2: Subscribed, publishing message with publisher..
Publisher: Sending PUBLISH (d0, q0, r0, m2), 'b'example/example1'', ... (7 bytes)

FAIL: Client did not get published message.

Quick update: I can also easily reproduce it using the public test broker at test.mosquitto.org, so the issue doesn't seem to be limited to just my setup:

server_host = 'test.mosquitto.org'
server_port = 8081
server_use_tls = True


ralight commented Nov 6, 2019

I've also reproduced it thanks to your test file; that's very much appreciated.


ralight commented Nov 6, 2019

Experimental fix (I'm clocking off for this evening; it works but I haven't tested it further):

diff --git a/src/handle_connect.c b/src/handle_connect.c
index 17434b2..695ce2a 100644
--- a/src/handle_connect.c
+++ b/src/handle_connect.c
@@ -160,6 +160,9 @@ int connect__on_authorised(struct mosquitto_db *db, struct mosquitto *context, v
                        }
                }
 
+               if(context->clean_start == true){
+                       sub__clean_session(db, found_context);
+               }
                session_expiry__remove(found_context);
                will_delay__remove(found_context);
                will__clear(found_context);


bverhoeven commented Nov 7, 2019

At first glance that does seem to fix the issue, but by the looks of it only when connecting with 'clean session' set?

A quick test on my side confirms that if I set clean_session=False in the test script, the issue reappears.

Edit: Would it be bad if this logic was applied to all clients regardless of the "clean_start" flag?

I can confirm that the issue is that the first web socket connection doesn't get closed (and thereby cleaned up/freed) until after the second client has managed to subscribe:

  • Client 2 connects.
  • Client 1 needs to be disconnected, but as LWS doesn't offer a direct way to do this, the connection is instead marked for shutdown and a write callback is configured to perform the actual disconnection (see the sketch after this list).
  • Client 2 receives CONNACK.
  • Write callback for 'Client 1' triggers a disconnect.
  • Client 2 sends SUBSCRIBE, mosquitto acknowledges with SUBACK.
  • Client 1 finally disconnects and resources, including subscriptions, for that client are cleaned up.
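
To illustrate the mechanism in that second bullet, here is a rough sketch of the deferred-close pattern. This is not mosquitto's actual websockets code: struct ws_client, marked_for_close and the function names are made up for the example; lws_callback_on_writable() and LWS_CALLBACK_SERVER_WRITEABLE are the real libwebsockets pieces.

#include <libwebsockets.h>
#include <stdbool.h>

struct ws_client {                 /* hypothetical per-connection data */
	struct lws *wsi;
	bool marked_for_close;
};

/* Called by the broker when a new client takes over: the old websocket
 * cannot be closed directly, so mark it and request a writeable callback. */
static void ws_client_defer_close(struct ws_client *c)
{
	c->marked_for_close = true;
	lws_callback_on_writable(c->wsi);
}

/* The actual close happens later, inside the protocol callback. */
static int ws_callback(struct lws *wsi, enum lws_callback_reasons reason,
		void *user, void *in, size_t len)
{
	struct ws_client *c = (struct ws_client *)user;
	(void)wsi; (void)in; (void)len;

	switch(reason){
	case LWS_CALLBACK_SERVER_WRITEABLE:
		if(c && c->marked_for_close){
			return -1; /* non-zero return tells LWS to close this connection */
		}
		break;
	default:
		break;
	}
	return 0;
}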

I've thought of ignoring clients marked for disconnect in the subscription logic (where I figured it probably checks if the subscription already exists), but that all seems a bit too far-fetched and feels hacky.

If we always called sub__clean_session on a takeover, however, we'd fix this issue by removing all subscriptions for the old client as soon as the new client connects.

The fact that the old connection still lingers on for a few more seconds doesn't seem to be harmful.

It also seems that, as long as we don't call db__messages_delete, no messages are removed, so "clean session=false" behavior should still work as expected?
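
To make the suggestion concrete, here is a minimal sketch relative to your experimental patch in connect__on_authorised, with the call made unconditional (whether this is actually safe is exactly what I'm unsure about):

/* Suggestion: always drop the old session's subscriptions on takeover,
 * regardless of the new connection's clean_start flag. Messages are not
 * touched here (no db__messages_delete call), so queued messages would
 * still be available to a clean_start == false session. */
sub__clean_session(db, found_context);

session_expiry__remove(found_context);
will_delay__remove(found_context);
will__clear(found_context);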

Again, not familiar with mosquitto internals, or too familiar with the specs, so pardon my ignorance.


ralight commented Nov 7, 2019

The takeover case when clean session is false is actually easier. In that case, the taking over client takes over the subscriptions of the old client and they are removed from the old client. That means this shouldn't be an issue. I can't reproduce it. Here is what I get for both clean session = True and False:
[Screenshot: wstakeover]


bverhoeven commented Nov 7, 2019

Interesting. For me, with your patch applied, I'm still seeing the issue with clean_start == false:

clean_session=True

## Client 2
Client 2: Sending CONNECT (u0, p0, wr0, wq0, wf0, c1, k60) client_id=b'example-client'
Client 1: Disconnected (rc=1)
Client 2: Received CONNACK (0, 0)
Client 2: Sending SUBSCRIBE (d0, m1) [(b'example/example1', 1)]
Client 2: Received SUBACK
Client 2: Subscribed, publishing message with publisher..
Publisher: Sending PUBLISH (d0, q0, r0, m2), 'b'example/example1'', ... (7 bytes)
Client 2: Received PUBLISH (d0, q0, r0, m0), 'example/example1', ...  (7 bytes)

OK: Client received published message.

clean_session=False

## Client 2
Client 2: Sending CONNECT (u0, p0, wr0, wq0, wf0, c0, k60) client_id=b'example-client'
Client 1: Disconnected (rc=1)
Client 2: Received CONNACK (0, 0)
Client 2: Sending SUBSCRIBE (d0, m1) [(b'example/example1', 1)]
Client 2: Received SUBACK
Client 2: Subscribed, publishing message with publisher..
Publisher: Sending PUBLISH (d0, q0, r0, m2), 'b'example/example1'', ... (7 bytes)

FAIL: Client did not get published message.

It does work fine if I call sub__clean_session unconditionally.


ralight commented Nov 7, 2019

What version of libwebsockets are you using?

bverhoeven commented

I'm using the docker/local/Dockerfile provided by the project as my main environment; it fetches libwebsockets version 2.4.2.

@ralight
Copy link
Contributor

ralight commented Nov 7, 2019

Hmm, very odd. I have tried on a few different systems and can't reproduce it. I am really surprised that you're seeing the problem; it would mean that the whole concept of taking over an existing clean session=false session was broken.

ralight added a commit that referenced this issue Nov 7, 2019
This can occur after one client takes over a previous connection.

Closes #1489. Thanks to Bas Verhoeven.

ralight commented Nov 7, 2019

I know you're not seeing it completely fixed, but that is the commit with the above patch in it for now.


bverhoeven commented Nov 8, 2019

Okay, never mind any of that...

I thought my source tree was clean, but for some reason my helpful editor decided to use an older version of handle_connect.c in which I had removed some chunks of code to debug the initial issue (prior to your fix), mainly the entire block that handles clean_start == false:

if(context->clean_start == false && found_context->session_expiry_interval > 0){
	if(context->protocol == mosq_p_mqtt311 || context->protocol == mosq_p_mqtt5){
		connect_ack |= 0x01;
	}
	if(found_context->msgs_in.inflight || found_context->msgs_in.queued
			|| found_context->msgs_out.inflight || found_context->msgs_out.queued){

		memcpy(&context->msgs_in, &found_context->msgs_in, sizeof(struct mosquitto_msg_data));
		memcpy(&context->msgs_out, &found_context->msgs_out, sizeof(struct mosquitto_msg_data));
		memset(&found_context->msgs_in, 0, sizeof(struct mosquitto_msg_data));
		memset(&found_context->msgs_out, 0, sizeof(struct mosquitto_msg_data));
		db__message_reconnect_reset(db, context);
	}

	context->subs = found_context->subs;
	found_context->subs = NULL;
	context->sub_count = found_context->sub_count;
	found_context->sub_count = 0;
	context->last_mid = found_context->last_mid;

	for(i=0; i<context->sub_count; i++){
		if(context->subs[i]){
			leaf = context->subs[i]->subs;
			while(leaf){
				if(leaf->context == found_context){
					leaf->context = context;
				}
				leaf = leaf->next;
			}
		}
	}
}

Also, my apologies for the earlier confusion: I for some reason thought that subscriptions would not be part of a 'session', and thus would not be affected by 'clean session'. After reading the v3.1.1 standard, it turns out that subscriptions are in fact very much part of the session and everything should indeed be working exactly as you described.

With a clean source tree and your patch applied all issues appear to be resolved. Thanks!

bverhoeven commented

I've been running a version with this patch applied in production for the past week and all seems to be working as expected. Thanks again for your quick response and again my apologies for the confusion I caused by accidentally stripping out pieces of code.
