
Connection_timeout for mysql watchdog #3

Open

TomiMikola opened this issue Dec 11, 2013 · 6 comments

Comments

@TomiMikola

mysql.sh (row #53 in release 1.0.1) does not set 'connect_timeout', which defaults to 0 in the mysql client. With the default value the watchdog does not properly detect a crashed backend server. Setting 'connect_timeout' to a reasonably short value gives the desired effect of dropping the backend from the pool.

One way to set the 'connect_timeout' parameter is to use the OTHER_OPTIONS variable (in glbd.cfg):

OTHER_OPTIONS="-w exec:'mysql.sh --connect_timeout=1 -uglbpinger -pingerpwd'"

This should be noted in the comments of the files/glbd.cfg file. An alternative approach would be to include the connect_timeout parameter in files/mysql.sh row #53, using a variable to set the timeout value.
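
For illustration, here is a minimal sketch of what that alternative could look like inside mysql.sh; the MYSQL_CONNECT_TIMEOUT variable name and the exact client invocation are assumptions of mine, not code from the release:

# Hypothetical sketch only - assumes mysql.sh builds the health check roughly like this.
# Default to 1 second, but allow an override from the environment or glbd.cfg.
MYSQL_CONNECT_TIMEOUT=${MYSQL_CONNECT_TIMEOUT:-1}

mysql --connect_timeout=$MYSQL_CONNECT_TIMEOUT \
      -h"$HOST" -P"$PORT" -u"$MYSQL_USER" -p"$MYSQL_PASSWORD" \
      -e 'SELECT 1' > /dev/null 2>&1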

@ghost ghost assigned ayurchen Dec 13, 2013
@ayurchen
Member

Tomi, thanks for bringing this up. Could you clarify how the default timeout of 0 prevents detecting a server crash? I have never had issues with that. Perhaps you mean false detection of server unavailability when the client can't connect immediately?

@TomiMikola
Author

This was the status two weeks ago:

[root@glb1 ~]# service glbd status
Router:
------------------------------------------------------
        Address       :   weight   usage    map  conns
      10.1.4.87:3306  :    2.000   1.004    N/A   -266
      10.1.4.88:3306  :    4.000   1.004    N/A   -235
      10.1.4.86:3306  :    1.000   1.011    N/A    -95
------------------------------------------------------
Destinations: 3, total connections: -596 of 10000 max

One of the servers (10.1.4.88) had crashed - all connections were dropping silently (I couldn't even ssh to it).
Unfortunately I'm unable to reproduce the same state now.

What I was able to test was restricting connections from the glb server to the backends.

No firewall restrictions:

[root@glb-stage ~]# service glbd status
Router:
------------------------------------------------------
        Address       :   weight   usage    map  conns
     10.1.4.218:3306  :   10.000   0.000    N/A      0
     10.1.4.217:3306  :   10.000   0.000    N/A      0
     10.1.4.214:3306  :   10.000   0.000    N/A      0
------------------------------------------------------
Destinations: 3, total connections: 0 of 10000 max

Dropping access from glb-stage to 10.1.4.218, no connect_timeout defined for watchdog:

[root@glb-stage ~]# service glbd status
Router:
------------------------------------------------------
        Address       :   weight   usage    map  conns
     10.1.4.218:3306  :    0.000    -nan    N/A      0
     10.1.4.217:3306  :   10.000   0.000    N/A      0
     10.1.4.214:3306  :   10.000   0.000    N/A      0
------------------------------------------------------
Destinations: 3, total connections: 0 of 10000 max

Firewall restricted access from glb-stage to 10.1.4.218, connect_timeout=1 defined for watchdog:

[root@glb-stage ~]# service glbd status
Router:
------------------------------------------------------
        Address       :   weight   usage    map  conns
     10.1.4.214:3306  :   10.000   0.000    N/A      0
     10.1.4.217:3306  :   10.000   0.000    N/A      0
------------------------------------------------------
Destinations: 2, total connections: 0 of 10000 max

So it seems the watchdog works fine, although to me the latter case seems more accurate when the backend is totally unreachable.
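
For reference, something along these lines on the glb host is enough to simulate the unreachable backend (iptables assumed; this is a sketch, not necessarily the exact rules I used):

# Block outgoing traffic from the glb host to one backend's MySQL port:
iptables -A OUTPUT -d 10.1.4.218 -p tcp --dport 3306 -j DROP
# ...check 'service glbd status', then remove the rule again:
iptables -D OUTPUT -d 10.1.4.218 -p tcp --dport 3306 -j DROP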

@ayurchen
Member

Tomi, thanks, this makes the issue much clearer now. It looks like a bug in the watchdog backend; I'll see if it can be fixed there.

The negative connection count, though, looks far more disturbing... It would be good if there were a way to reproduce it.

@TomiMikola
Author

We hit this negative connection issue again today. The status looked like this:

[root@glb1 ~]# service glbd status
Router:
------------------------------------------------------
        Address       :   weight   usage    map  conns
      10.1.4.87:3306  :  100.000   1.004    N/A   -275
      10.1.4.88:3306  :   50.000   1.002    N/A   -464
      10.1.4.86:3306  :    1.000   1.002    N/A   -488
------------------------------------------------------
Destinations: 3, total connections: -1227 of 10000 max

All the backends were functioning normally, with nothing in the logs. Some sort of leakage in the watchdog or in glbd itself?

@TomiMikola
Author

Any ideas to help debug this?

Edit 2014-01-03 22:06:
Looking at the application log entries, I see a lot of "SQLSTATE[08004] [1040] Too many connections" errors a few minutes before the crash, and then dozens of deadlocks with the message "SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction".
The max_connections variable was set to 500 on each of the three backends.
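
In case it helps, the connection headroom on a backend can be checked with the standard mysql client (nothing glb-specific here, these are ordinary status/variable queries):

mysql -e "SHOW VARIABLES LIKE 'max_connections'; SHOW GLOBAL STATUS LIKE 'Threads_connected'; SHOW GLOBAL STATUS LIKE 'Max_used_connections';"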

@ayurchen
Member

ayurchen commented Jan 4, 2014

not right away :(

On Fri, Jan 3, 2014 at 7:29 PM, TomiMikola [email protected] wrote:

Any ideas to help debug this?


Reply to this email directly or view it on GitHub: https://github.com//issues/3#issuecomment-31538240
