Broker 1.5+: Configuring a bridge that isn't reachable will block the broker on each connect attempt on Windows #2172

NiclasLindgren · 2021-04-10T00:05:24Z

Haven't studied the code why, but the behaviour is pretty obvious, just have a client listen to topics from the bridge and see them chunk forward on each connect attempt.

Worked fine on 1.4

NiclasLindgren · 2021-04-10T10:18:51Z

The problem seems deeper, if you have more than 1 bridge configured only one of them will work even if they are reachable. I suspect this all works on Linux. So the title is a bit wrong for the issue, should be something like

"Multiple bridge not working on Windows, randomly only one of them will ever connect", will investigate

NiclasLindgren · 2021-04-10T18:47:23Z

The problem is in bridge_check, or if you want in bridge__start_all, if bridge__start_all doesn't get an immediate connect (i.e. get a EWOULDBLOCK back) the sockets isn't properly registered in epoll and the connect event isn't handled and nothing happens forever.

bridge_check needs something like this (or code in the proper place, but to me it feels like all the mux__XXX should be in the net_mosq layer instead as it is really easy to miss stuff else), but anyway, if we add handling of MOSQ_ERR_CONN_PENDING it will eventually work (not this is not the proper code I am sure, just to show)

					rc = bridge__connect(context);
					context->bridge->restart_t = 0;
					if (rc == MOSQ_ERR_SUCCESS) {
						if (context->bridge->round_robin == false && context->bridge->cur_address != 0) {
							context->bridge->primary_retry = db.now_s + 5;
						}
						mux__add_in(context);
						if (context->current_out_packet) {
							mux__add_out(context);
						}
					**} else if (rc == MOSQ_ERR_CONN_PENDING) {
						mux__add_in(context);
						mux__add_out(context);
					} else {**
						context->bridge->cur_address++;
						if(context->bridge->cur_address == context->bridge->address_count){
							context->bridge->cur_address = 0;
						}
					}

I am starting to feel like the windows code paths aren't used much?

NiclasLindgren · 2021-04-10T18:53:02Z

As for the blocking issue, it is a bug in WSAPoll (I never used WSAPoll, always used select)

For Microsoft, it is a known issue:

"Windows 8 Bugs 309411 - WSAPoll does not report failed connections
8/3/2011 6:53 PM Resolved as Won't Fix by muraris
Has been like this forever and people are already used to it."

"The recommendation for now is to not use the WSAPoll function it in case you encounter this issue, but rather the other Net-API functions."

https://curl.se/mail/lib-2012-10/0038.html

NiclasLindgren · 2021-04-10T19:07:08Z

I think maybe adding this in bridge_connect (will cover both cases of bridge_new and bridge_check)

}else if(rc == MOSQ_ERR_CONN_PENDING){
	mosquitto__set_state(context, mosq_cs_connect_pending);
	//mux__add_in(context);
	mux__add_out(context);
}

Should be enough with mux__add_out as the connect event is a POLLOUT event

NiclasLindgren · 2021-04-10T19:11:34Z

Seems if you have #if defined(GLIBC) && defined(WITH_ADNS) enabled this code is in place already btw in bridge_check, but the "unix" path has the same issue with bridge__start_all I believe (don't see why not, but is caught in bridge_check later)

NiclasLindgren · 2021-04-10T19:17:41Z

Seems you need both in and out for some reason, won't bother finding out why, things work now with these patches (and there is no CPU spin due to a poll loop so the code must clear the event somewhere properly)

NiclasLindgren · 2021-04-10T19:23:13Z

There is a still an issue if you have several bridges, only the first will connect on first try, but they other will come online as the retries happen, could maybe be related to where you actually call
bridge__start_all();
in the end

NiclasLindgren · 2021-04-11T18:31:53Z

The initial connect problem is in here
It is writing to the socket before the connect event is finish (the POLLOUT is due to the connect, which write son the socket, which return 10057, I added a little check and return EGAIN which "solves" the issue..

SSL returns 5

define SSL_ERROR_SYSCALL 5/* look at error stack/return

                                       * value/errno */

So it seems SSL isn't properly handling the window socket error, or we are not allowed to write on the socket on first pollout.. not sure

#ifdef WITH_TLS
if(mosq->ssl){
mosq->want_write = false;
ret = SSL_write(mosq->ssl, buf, (int)count);
if(ret < 0){
err = SSL_get_error(mosq->ssl, ret);
if(err == SSL_ERROR_WANT_READ){
ret = -1;
errno = EAGAIN;
}else if(err == SSL_ERROR_WANT_WRITE){
ret = -1;
mosq->want_write = true;
errno = EAGAIN;
}else{
net__print_ssl_error(mosq);
errno = EPROTO;
}
ERR_clear_error();
#ifdef WIN32
int err = WSAGetLastError();
if (err == 10057)
errno = EAGAIN;
err = err;
WSASetLastError(errno);
#endif

NiclasLindgren · 2021-04-11T18:45:46Z

Same "problem" in read

ssize_t net__read(struct mosquitto *mosq, void *buf, size_t count)
{
#ifdef WITH_TLS
int ret;
int err;
#endif
assert(mosq);
errno = 0;
#ifdef WITH_TLS
if(mosq->ssl){
ret = SSL_read(mosq->ssl, buf, (int)count);
if(ret <= 0){
err = SSL_get_error(mosq->ssl, ret);
if(err == SSL_ERROR_WANT_READ){
ret = -1;
errno = EAGAIN;
}else if(err == SSL_ERROR_WANT_WRITE){
ret = -1;
mosq->want_write = true;
errno = EAGAIN;
}else{
net__print_ssl_error(mosq);
errno = EPROTO;
}
ERR_clear_error();
#ifdef WIN32
int err = WSAGetLastError();
if (err == 10057)
{
errno = EAGAIN;
}
WSASetLastError(errno);

And the main reason for the read problem is because we read a socket that doesn't have the event signalled

Here we read the socket not because of POLLIN, but because of the ||, not sure why

	if(pollfds[context->pollfd_index].revents & POLLIN ||
			(context->ssl && context->state == mosq_cs_new)){

Feels like loop_handle_reads_writes needs to be looked at in mux_poll, the SSL code paths should make sure the state is ready before proceeding I think, or we add code to capture the read/write to a non connected socket(yet)

The 10057 checks isn't a solution anyhow =)

NiclasLindgren · 2021-04-11T18:48:51Z

I would just use

	if (pollfds[context->pollfd_index].revents & POLLOUT) {

and

	if (pollfds[context->pollfd_index].revents & POLLIN) {

As in the "non" TLS code paths, I think that code ended up there because of the issues with not calling mux__add_in/out properly, also because the pollfd array got corrupted at start up.

Using that instead I have no issues anyway (and won't hit the 10057 checks)

NiclasLindgren · 2021-04-11T19:02:55Z

Feels like maybe all the want_connect and want_write should be replaced with mux_add_out, but I don't understand exactly how this code is resued in client mode etc so unclear to me what the proper solution is. But reading and/or writing on sockets poll hasn't cleared for it isn't a good thing in non blocking mode. I suspect the clients use blocking mode? And if they do I understand why the code is diferrent and those flags are needed, but then I reckon they don't even use poll/epoll

but then epoll/poll should check those flags and added them to the poll array instead

NiclasLindgren · 2021-04-11T19:06:47Z

Found this code in the, what I think is, client path

#ifdef WITH_TLS
if(mosq->ssl){
if(mosq->want_write){
FD_SET(mosq->sock, &writefds);
}else if(mosq->want_connect){
/* Remove possible FD_SET from above, we don't want to check
* for writing if we are still connecting, unless want_write is
* definitely set. The presence of outgoing packets does not
* matter yet. */
FD_CLR(mosq->sock, &writefds);
}
}
#endif

using select

so that makes sense. And also makes my call to mux__delete in net__socket_close incorrect, but it is under the "WITH_BROKER" ifdef so maybe ok.. You know I am sure =)

At least things are working fine running on windows as a broker with multiple bridges now, been running for 24h with injected reconnects all seem fine, couple of million messages beeing sent back and forth.

NiclasLindgren · 2021-04-12T18:26:17Z

Let me know if you want me to try to make some pull requests, beware though, not done any before, but if you want, I will sign the ECA and try to do it. All in all very few lines of code changes though (but very crucial =)

NiclasLindgren · 2021-04-26T08:52:14Z

A little bump, I think you want these fixes in the windows code path and I would be happy to clean up the patches

NiclasLindgren mentioned this issue Apr 10, 2021

Broker bridge 2.0.10: Corruption of pollfds on Windows #2173

Closed

ralight closed this as completed in 104b94d Jun 9, 2021

github-actions bot locked as resolved and limited conversation to collaborators Aug 11, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Broker 1.5+: Configuring a bridge that isn't reachable will block the broker on each connect attempt on Windows #2172

Broker 1.5+: Configuring a bridge that isn't reachable will block the broker on each connect attempt on Windows #2172

NiclasLindgren commented Apr 10, 2021

NiclasLindgren commented Apr 10, 2021

NiclasLindgren commented Apr 10, 2021

NiclasLindgren commented Apr 10, 2021

NiclasLindgren commented Apr 10, 2021

NiclasLindgren commented Apr 10, 2021

NiclasLindgren commented Apr 10, 2021 •

edited

Loading

NiclasLindgren commented Apr 10, 2021

NiclasLindgren commented Apr 11, 2021

NiclasLindgren commented Apr 11, 2021

NiclasLindgren commented Apr 11, 2021

NiclasLindgren commented Apr 11, 2021

NiclasLindgren commented Apr 11, 2021

NiclasLindgren commented Apr 12, 2021

NiclasLindgren commented Apr 26, 2021

Broker 1.5+: Configuring a bridge that isn't reachable will block the broker on each connect attempt on Windows #2172

Broker 1.5+: Configuring a bridge that isn't reachable will block the broker on each connect attempt on Windows #2172

Comments

NiclasLindgren commented Apr 10, 2021

NiclasLindgren commented Apr 10, 2021

NiclasLindgren commented Apr 10, 2021

NiclasLindgren commented Apr 10, 2021

NiclasLindgren commented Apr 10, 2021

NiclasLindgren commented Apr 10, 2021

NiclasLindgren commented Apr 10, 2021 • edited Loading

NiclasLindgren commented Apr 10, 2021

NiclasLindgren commented Apr 11, 2021

define SSL_ERROR_SYSCALL 5/* look at error stack/return

NiclasLindgren commented Apr 11, 2021

NiclasLindgren commented Apr 11, 2021

NiclasLindgren commented Apr 11, 2021

NiclasLindgren commented Apr 11, 2021

NiclasLindgren commented Apr 12, 2021

NiclasLindgren commented Apr 26, 2021

NiclasLindgren commented Apr 10, 2021 •

edited

Loading