-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Broker 1.5+: Configuring a bridge that isn't reachable will block the broker on each connect attempt on Windows #2172
Comments
The problem seems deeper, if you have more than 1 bridge configured only one of them will work even if they are reachable. I suspect this all works on Linux. So the title is a bit wrong for the issue, should be something like "Multiple bridge not working on Windows, randomly only one of them will ever connect", will investigate |
The problem is in bridge_check, or if you want in bridge__start_all, if bridge__start_all doesn't get an immediate connect (i.e. get a EWOULDBLOCK back) the sockets isn't properly registered in epoll and the connect event isn't handled and nothing happens forever. bridge_check needs something like this (or code in the proper place, but to me it feels like all the mux__XXX should be in the net_mosq layer instead as it is really easy to miss stuff else), but anyway, if we add handling of MOSQ_ERR_CONN_PENDING it will eventually work (not this is not the proper code I am sure, just to show)
I am starting to feel like the windows code paths aren't used much? |
As for the blocking issue, it is a bug in WSAPoll (I never used WSAPoll, always used select) For Microsoft, it is a known issue: "Windows 8 Bugs 309411 - WSAPoll does not report failed connections "The recommendation for now is to not use the WSAPoll function it in case you encounter this issue, but rather the other Net-API functions." |
I think maybe adding this in bridge_connect (will cover both cases of bridge_new and bridge_check)
Should be enough with mux__add_out as the connect event is a POLLOUT event |
Seems if you have #if defined(GLIBC) && defined(WITH_ADNS) enabled this code is in place already btw in bridge_check, but the "unix" path has the same issue with bridge__start_all I believe (don't see why not, but is caught in bridge_check later) |
Seems you need both in and out for some reason, won't bother finding out why, things work now with these patches (and there is no CPU spin due to a poll loop so the code must clear the event somewhere properly) |
There is a still an issue if you have several bridges, only the first will connect on first try, but they other will come online as the retries happen, could maybe be related to where you actually call |
The initial connect problem is in here SSL returns 5 define SSL_ERROR_SYSCALL 5/* look at error stack/return
So it seems SSL isn't properly handling the window socket error, or we are not allowed to write on the socket on first pollout.. not sure #ifdef WITH_TLS |
Same "problem" in read ssize_t net__read(struct mosquitto *mosq, void *buf, size_t count) And the main reason for the read problem is because we read a socket that doesn't have the event signalled Here we read the socket not because of POLLIN, but because of the ||, not sure why
Feels like loop_handle_reads_writes needs to be looked at in mux_poll, the SSL code paths should make sure the state is ready before proceeding I think, or we add code to capture the read/write to a non connected socket(yet) The 10057 checks isn't a solution anyhow =) |
I would just use
and
As in the "non" TLS code paths, I think that code ended up there because of the issues with not calling mux__add_in/out properly, also because the pollfd array got corrupted at start up. Using that instead I have no issues anyway (and won't hit the 10057 checks) |
Feels like maybe all the want_connect and want_write should be replaced with mux_add_out, but I don't understand exactly how this code is resued in client mode etc so unclear to me what the proper solution is. But reading and/or writing on sockets poll hasn't cleared for it isn't a good thing in non blocking mode. I suspect the clients use blocking mode? And if they do I understand why the code is diferrent and those flags are needed, but then I reckon they don't even use poll/epoll but then epoll/poll should check those flags and added them to the poll array instead |
Found this code in the, what I think is, client path #ifdef WITH_TLS using select so that makes sense. And also makes my call to mux__delete in net__socket_close incorrect, but it is under the "WITH_BROKER" ifdef so maybe ok.. You know I am sure =) At least things are working fine running on windows as a broker with multiple bridges now, been running for 24h with injected reconnects all seem fine, couple of million messages beeing sent back and forth. |
Let me know if you want me to try to make some pull requests, beware though, not done any before, but if you want, I will sign the ECA and try to do it. All in all very few lines of code changes though (but very crucial =) |
A little bump, I think you want these fixes in the windows code path and I would be happy to clean up the patches |
Haven't studied the code why, but the behaviour is pretty obvious, just have a client listen to topics from the bridge and see them chunk forward on each connect attempt.
Worked fine on 1.4
The text was updated successfully, but these errors were encountered: