Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broker 1.5+: Configuring a bridge that isn't reachable will block the broker on each connect attempt on Windows #2172

Closed
NiclasLindgren opened this issue Apr 10, 2021 · 14 comments

Comments

@NiclasLindgren
Copy link

Haven't studied the code why, but the behaviour is pretty obvious, just have a client listen to topics from the bridge and see them chunk forward on each connect attempt.

Worked fine on 1.4

@NiclasLindgren
Copy link
Author

The problem seems deeper, if you have more than 1 bridge configured only one of them will work even if they are reachable. I suspect this all works on Linux. So the title is a bit wrong for the issue, should be something like

"Multiple bridge not working on Windows, randomly only one of them will ever connect", will investigate

@NiclasLindgren
Copy link
Author

The problem is in bridge_check, or if you want in bridge__start_all, if bridge__start_all doesn't get an immediate connect (i.e. get a EWOULDBLOCK back) the sockets isn't properly registered in epoll and the connect event isn't handled and nothing happens forever.

bridge_check needs something like this (or code in the proper place, but to me it feels like all the mux__XXX should be in the net_mosq layer instead as it is really easy to miss stuff else), but anyway, if we add handling of MOSQ_ERR_CONN_PENDING it will eventually work (not this is not the proper code I am sure, just to show)

					rc = bridge__connect(context);
					context->bridge->restart_t = 0;
					if (rc == MOSQ_ERR_SUCCESS) {
						if (context->bridge->round_robin == false && context->bridge->cur_address != 0) {
							context->bridge->primary_retry = db.now_s + 5;
						}
						mux__add_in(context);
						if (context->current_out_packet) {
							mux__add_out(context);
						}
					**} else if (rc == MOSQ_ERR_CONN_PENDING) {
						mux__add_in(context);
						mux__add_out(context);
					} else {**
						context->bridge->cur_address++;
						if(context->bridge->cur_address == context->bridge->address_count){
							context->bridge->cur_address = 0;
						}
					}

I am starting to feel like the windows code paths aren't used much?

@NiclasLindgren
Copy link
Author

As for the blocking issue, it is a bug in WSAPoll (I never used WSAPoll, always used select)

For Microsoft, it is a known issue:

"Windows 8 Bugs 309411 - WSAPoll does not report failed connections
8/3/2011 6:53 PM Resolved as Won't Fix by muraris
Has been like this forever and people are already used to it."

"The recommendation for now is to not use the WSAPoll function it in case you encounter this issue, but rather the other Net-API functions."

https://curl.se/mail/lib-2012-10/0038.html

@NiclasLindgren
Copy link
Author

I think maybe adding this in bridge_connect (will cover both cases of bridge_new and bridge_check)

}else if(rc == MOSQ_ERR_CONN_PENDING){
	mosquitto__set_state(context, mosq_cs_connect_pending);
	//mux__add_in(context);
	mux__add_out(context);
}

Should be enough with mux__add_out as the connect event is a POLLOUT event

@NiclasLindgren
Copy link
Author

Seems if you have #if defined(GLIBC) && defined(WITH_ADNS) enabled this code is in place already btw in bridge_check, but the "unix" path has the same issue with bridge__start_all I believe (don't see why not, but is caught in bridge_check later)

@NiclasLindgren
Copy link
Author

NiclasLindgren commented Apr 10, 2021

Seems you need both in and out for some reason, won't bother finding out why, things work now with these patches (and there is no CPU spin due to a poll loop so the code must clear the event somewhere properly)

@NiclasLindgren
Copy link
Author

There is a still an issue if you have several bridges, only the first will connect on first try, but they other will come online as the retries happen, could maybe be related to where you actually call
bridge__start_all();
in the end

@NiclasLindgren
Copy link
Author

The initial connect problem is in here
It is writing to the socket before the connect event is finish (the POLLOUT is due to the connect, which write son the socket, which return 10057, I added a little check and return EGAIN which "solves" the issue..

SSL returns 5

define SSL_ERROR_SYSCALL 5/* look at error stack/return

                                       * value/errno */

So it seems SSL isn't properly handling the window socket error, or we are not allowed to write on the socket on first pollout.. not sure

#ifdef WITH_TLS
if(mosq->ssl){
mosq->want_write = false;
ret = SSL_write(mosq->ssl, buf, (int)count);
if(ret < 0){
err = SSL_get_error(mosq->ssl, ret);
if(err == SSL_ERROR_WANT_READ){
ret = -1;
errno = EAGAIN;
}else if(err == SSL_ERROR_WANT_WRITE){
ret = -1;
mosq->want_write = true;
errno = EAGAIN;
}else{
net__print_ssl_error(mosq);
errno = EPROTO;
}
ERR_clear_error();
#ifdef WIN32
int err = WSAGetLastError();
if (err == 10057)
errno = EAGAIN;

err = err;
WSASetLastError(errno);
#endif

@NiclasLindgren
Copy link
Author

Same "problem" in read

ssize_t net__read(struct mosquitto *mosq, void *buf, size_t count)
{
#ifdef WITH_TLS
int ret;
int err;
#endif
assert(mosq);
errno = 0;
#ifdef WITH_TLS
if(mosq->ssl){
ret = SSL_read(mosq->ssl, buf, (int)count);
if(ret <= 0){
err = SSL_get_error(mosq->ssl, ret);
if(err == SSL_ERROR_WANT_READ){
ret = -1;
errno = EAGAIN;
}else if(err == SSL_ERROR_WANT_WRITE){
ret = -1;
mosq->want_write = true;
errno = EAGAIN;
}else{
net__print_ssl_error(mosq);
errno = EPROTO;
}
ERR_clear_error();
#ifdef WIN32
int err = WSAGetLastError();
if (err == 10057)
{
errno = EAGAIN;
}
WSASetLastError(errno);

And the main reason for the read problem is because we read a socket that doesn't have the event signalled

Here we read the socket not because of POLLIN, but because of the ||, not sure why

	if(pollfds[context->pollfd_index].revents & POLLIN ||
			(context->ssl && context->state == mosq_cs_new)){

Feels like loop_handle_reads_writes needs to be looked at in mux_poll, the SSL code paths should make sure the state is ready before proceeding I think, or we add code to capture the read/write to a non connected socket(yet)

The 10057 checks isn't a solution anyhow =)

@NiclasLindgren
Copy link
Author

I would just use

	if (pollfds[context->pollfd_index].revents & POLLOUT) {

and

	if (pollfds[context->pollfd_index].revents & POLLIN) {

As in the "non" TLS code paths, I think that code ended up there because of the issues with not calling mux__add_in/out properly, also because the pollfd array got corrupted at start up.

Using that instead I have no issues anyway (and won't hit the 10057 checks)

@NiclasLindgren
Copy link
Author

Feels like maybe all the want_connect and want_write should be replaced with mux_add_out, but I don't understand exactly how this code is resued in client mode etc so unclear to me what the proper solution is. But reading and/or writing on sockets poll hasn't cleared for it isn't a good thing in non blocking mode. I suspect the clients use blocking mode? And if they do I understand why the code is diferrent and those flags are needed, but then I reckon they don't even use poll/epoll

but then epoll/poll should check those flags and added them to the poll array instead

@NiclasLindgren
Copy link
Author

Found this code in the, what I think is, client path

#ifdef WITH_TLS
if(mosq->ssl){
if(mosq->want_write){
FD_SET(mosq->sock, &writefds);
}else if(mosq->want_connect){
/* Remove possible FD_SET from above, we don't want to check
* for writing if we are still connecting, unless want_write is
* definitely set. The presence of outgoing packets does not
* matter yet. */
FD_CLR(mosq->sock, &writefds);
}
}
#endif

using select

so that makes sense. And also makes my call to mux__delete in net__socket_close incorrect, but it is under the "WITH_BROKER" ifdef so maybe ok.. You know I am sure =)

At least things are working fine running on windows as a broker with multiple bridges now, been running for 24h with injected reconnects all seem fine, couple of million messages beeing sent back and forth.

@NiclasLindgren
Copy link
Author

Let me know if you want me to try to make some pull requests, beware though, not done any before, but if you want, I will sign the ECA and try to do it. All in all very few lines of code changes though (but very crucial =)

@NiclasLindgren
Copy link
Author

A little bump, I think you want these fixes in the windows code path and I would be happy to clean up the patches

@ralight ralight closed this as completed in 104b94d Jun 9, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 11, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant