Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disconnect delay with mosquitto_loop_forever() and WITH_SRV #2947

Open
psychon opened this issue Nov 9, 2023 · 1 comment
Open

Disconnect delay with mosquitto_loop_forever() and WITH_SRV #2947

psychon opened this issue Nov 9, 2023 · 1 comment

Comments

@psychon
Copy link

psychon commented Nov 9, 2023

Hi,

we are using libmosquitto in an externally threaded way (mosquitto_threaded_set(conn, 1)). We have one thread running mosquitto_loop_forever() and want to make this exit by calling mosquitto_disconnect() in another thread. This usually works fine (on Ubuntu using Mosquitto 2.0.11), but we recently had this same code hang for 120 seconds on a Gentoo system (both with 2.0.18 and 2.0.15; 120 is our keepalive value).

A debugging session found out that normally the fourth(*) call to mosquitto_loop() immediately returns MOSQ_ERR_NO_CONN, but on the Gentoo system this only happens after 120 seconds. Some more debugging indicated that this is related to the WITH_SRV compile option, which is not enabled on Ubuntu, but conditionally enabled on Gentoo. Disabling the srv use flag on Gentoo made the hang go away.

Some code inspection came up with a patch to the following code:

mosquitto/lib/loop.c

Lines 83 to 96 in 15292b2

#ifdef WITH_SRV
if(mosq->achan){
if(mosquitto__get_state(mosq) == mosq_cs_connect_srv){
rc = ares_fds(mosq->achan, &readfds, &writefds);
if(rc > maxfd){
maxfd = rc;
}
}else{
return MOSQ_ERR_NO_CONN;
}
}
#else
return MOSQ_ERR_NO_CONN;
#endif

This code is in an #ifdef WITH_SRV. Without SRV, this just returns MOSQ_ERR_NO_CONN. With SRV, this only returns MOSQ_ERR_NO_CONN if mosq->achan != NULL && mosquitto_get_state(mosq) != mosq_cs_connect_srv. In our case, mosq->achan == NULL and thus this code falls through to the following code. With the proposed patch, this instead also returns MOSQ_ERR_NO_CONN if mosq->achan == NULL, changing the code to always return NO_CONN except if achan != NULL and the the state is mosq_cs_connect_srv.

diff --git a/lib/loop.c b/lib/loop.c
index 965294f0..933b98a1 100644
--- a/lib/loop.c
+++ b/lib/loop.c
@@ -88,7 +88,9 @@ int mosquitto_loop(struct mosquitto *mosq, int timeout, int max_packets)
                        }else{
                                return MOSQ_ERR_NO_CONN;
                        }
-               }
+               } else {
+                       return MOSQ_ERR_NO_CONN;
+                }
 #else
                return MOSQ_ERR_NO_CONN;
 #endif

Testing done: None. We are going with the work-around of disabling SRV support. Sorry. I hope you have a better understanding of the SRV feature and can actually test this patch properly.
(*): The program that we debugged with just publishes one message and then disconnects again. Thus, it is quite deterministic and "the fourth call" seems to be meaningful / comparable to the working system.

@psychon
Copy link
Author

psychon commented Nov 11, 2023

I produced a self-contained reproducer without threads
#include <mosquitto.h>
#include <stdio.h>
#include <time.h>

struct data {
	time_t start_time;
};

static void print_time(struct data* data) {
	time_t now = time(NULL);
	printf("After %d sec: ", now - data->start_time);
}

static void my_connect_callback(struct mosquitto *mosq, void* data, int) {
	print_time(data);
	puts("connected");
	mosquitto_disconnect(mosq);
}

static void my_disconnect_callback(struct mosquitto *, void* data, int) {
	print_time(data);
	puts("disconnected");
}

static void my_log_callback(struct mosquitto *, void* data, int, const char* msg) {
	print_time(data);
	printf("Log: %s\n", msg);
}

int main() {
	struct data data;
	struct mosquitto *mosq = NULL;

	data.start_time = time(NULL);

	mosquitto_lib_init();

	mosq = mosquitto_new(NULL, 1, &data);
	if (!mosq) {
		puts("mosquitto_new() failed");
		return 1;
	}

	mosquitto_connect_callback_set(mosq, my_connect_callback);
	mosquitto_disconnect_callback_set(mosq, my_disconnect_callback);
	mosquitto_log_callback_set(mosq, my_log_callback);

	if (MOSQ_ERR_SUCCESS != mosquitto_connect(mosq, "localhost", 1883, 120)) {
		puts("mosquitto_connect() failed");
		return 0;
	}

	mosquitto_loop_forever(mosq, 12 * 1000, 1);

	print_time(&data);
	puts("exiting");

	return 0;
}
The timeout argument to `mosquitto_loop_forever()` is the "important" parameter here. I used 12 seconds in the above program.

I ran this against current master.

Output when build with -DWITH_SRV=OFF:

After 0 sec: Log: Client null sending CONNECT
After 0 sec: Log: Client null received CONNACK (0)
After 0 sec: connected
After 0 sec: Log: Client null sending DISCONNECT
After 0 sec: disconnected
After 0 sec: exiting

Output when build with -DWITH_SRV=ON:

After 0 sec: Log: Client null sending CONNECT
After 0 sec: Log: Client null received CONNACK (0)
After 0 sec: connected
After 0 sec: Log: Client null sending DISCONNECT
After 0 sec: disconnected
After 12 sec: exiting

With the proposed patch from my original post, even -DWITH_SRV=ON immediately exits.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant