
Connection_timeout for mysql watchdog #3

Open

TomiMikola opened this issue Dec 11, 2013 · 6 comments

Comments

@TomiMikola

mysql.sh (row #53 in release 1.0.1) does not set 'connect_timeout', which defaults to 0 in the mysql client. With the default value the watchdog does not properly detect a crashed backend server. Setting 'connect_timeout' to a reasonably short value gives the desired effect of dropping the backend from the pool.

One way to set the 'connect_timeout' parameter is to use the OTHER_OPTIONS variable (in glbd.cfg):

OTHER_OPTIONS="-w exec:'mysql.sh --connect_timeout=1 -uglbpinger -pingerpwd'"

This should be noted in the comments of the files/glbd.cfg file. An alternative approach would be to include the connect_timeout parameter in files/mysql.sh row #53, using a variable to set the timeout value.
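
For illustration, here is a minimal sketch of what that alternative could look like inside mysql.sh; the MYSQL_CONNECT_TIMEOUT variable name and the exact client invocation are assumptions of mine, not code from the release:

# Hypothetical sketch only - assumes mysql.sh builds the health check roughly like this.
# Default to 1 second, but allow an override from the environment or glbd.cfg.
MYSQL_CONNECT_TIMEOUT=${MYSQL_CONNECT_TIMEOUT:-1}

mysql --connect_timeout=$MYSQL_CONNECT_TIMEOUT \
      -h"$HOST" -P"$PORT" -u"$MYSQL_USER" -p"$MYSQL_PASSWORD" \
      -e 'SELECT 1' > /dev/null 2>&1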

@ghost ghost assigned ayurchen Dec 13, 2013
@ayurchen
Member

Tomi, thanks for bringing this up. Could you clarify how the default timeout of 0 prevents detecting a server crash? I have never had issues with that. Perhaps you mean false detection of server unavailability when the client can't connect immediately?

@TomiMikola
Author

This was the status two weeks ago:

[root@glb1 ~]# service glbd status
Router:
------------------------------------------------------
        Address       :   weight   usage    map  conns
      10.1.4.87:3306  :    2.000   1.004    N/A   -266
      10.1.4.88:3306  :    4.000   1.004    N/A   -235
      10.1.4.86:3306  :    1.000   1.011    N/A    -95
------------------------------------------------------
Destinations: 3, total connections: -596 of 10000 max

One of the servers (10.1.4.88) had crashed - all connections were dropping silently (I couldn't even ssh to it).
Unfortunately I'm unable to reproduce the same state now.

What I was able to test was restricting connections from the glb server to the backends.

No firewall restrictions:

[root@glb-stage ~]# service glbd status
Router:
------------------------------------------------------
        Address       :   weight   usage    map  conns
     10.1.4.218:3306  :   10.000   0.000    N/A      0
     10.1.4.217:3306  :   10.000   0.000    N/A      0
     10.1.4.214:3306  :   10.000   0.000    N/A      0
------------------------------------------------------
Destinations: 3, total connections: 0 of 10000 max

Dropping access from glb-stage to 10.1.4.218, no connect_timeout defined for watchdog:

[root@glb-stage ~]# service glbd status
Router:
------------------------------------------------------
        Address       :   weight   usage    map  conns
     10.1.4.218:3306  :    0.000    -nan    N/A      0
     10.1.4.217:3306  :   10.000   0.000    N/A      0
     10.1.4.214:3306  :   10.000   0.000    N/A      0
------------------------------------------------------
Destinations: 3, total connections: 0 of 10000 max

Firewall restricted access from glb-stage to 10.1.4.218, connect_timeout=1 defined for watchdog:

[root@glb-stage ~]# service glbd status
Router:
------------------------------------------------------
        Address       :   weight   usage    map  conns
     10.1.4.214:3306  :   10.000   0.000    N/A      0
     10.1.4.217:3306  :   10.000   0.000    N/A      0
------------------------------------------------------
Destinations: 2, total connections: 0 of 10000 max

So it seems the watchdog works fine, although to me the latter case seems more accurate when the backend is totally unreachable.
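
For reference, something along these lines on the glb host is enough to simulate the unreachable backend (iptables assumed; this is a sketch, not necessarily the exact rules I used):

# Block outgoing traffic from the glb host to one backend's MySQL port:
iptables -A OUTPUT -d 10.1.4.218 -p tcp --dport 3306 -j DROP
# ...check 'service glbd status', then remove the rule again:
iptables -D OUTPUT -d 10.1.4.218 -p tcp --dport 3306 -j DROP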

@ayurchen
Member

Tomi, thanks, this makes the issue much clearer now. It looks like a bug in the watchdog backend; I'll see if it can be fixed there.

The negative connection count, though, looks far more disturbing... It would be good if there were a way to reproduce it.

@TomiMikola
Author

We hit this negative connection issue again today. The status looked like this:

[root@glb1 ~]# service glbd status
Router:
------------------------------------------------------
        Address       :   weight   usage    map  conns
      10.1.4.87:3306  :  100.000   1.004    N/A   -275
      10.1.4.88:3306  :   50.000   1.002    N/A   -464
      10.1.4.86:3306  :    1.000   1.002    N/A   -488
------------------------------------------------------
Destinations: 3, total connections: -1227 of 10000 max

All the backends were functioning normally, with nothing in the logs. Some sort of leakage in the watchdog or in glbd itself?

@TomiMikola
Author

Any ideas to help debug this?

Edit 2014-01-03 22:06:
Looking at the application log entries, I see a lot of "SQLSTATE[08004] [1040] Too many connections" errors a few minutes before the crash, and then dozens of deadlocks with the message "SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction".
The max_connections variable was set to 500 on each of the three backends.
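
In case it helps, the connection headroom on a backend can be checked with the standard mysql client (nothing glb-specific here, these are ordinary status/variable queries):

mysql -e "SHOW VARIABLES LIKE 'max_connections'; SHOW GLOBAL STATUS LIKE 'Threads_connected'; SHOW GLOBAL STATUS LIKE 'Max_used_connections';"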

@ayurchen
Member

ayurchen commented Jan 4, 2014

not right away :(

On Fri, Jan 3, 2014 at 7:29 PM, TomiMikola [email protected] wrote:

Any ideas to help debug this?


Reply to this email directly or view it on GitHub: https://github.com//issues/3#issuecomment-31538240
