
TiDB stopped serving requests for 2m40s when a leader got separated from the rest of the nodes #2676

Open
rystsov opened this issue Feb 19, 2017 · 15 comments
Labels
type/enhancement The issue or PR belongs to an enhancement.



rystsov commented Feb 19, 2017

Hi folks,

I was testing a 3-node deployment of a TiDB cluster with default settings and observed a situation where TiDB stopped serving requests for 2 minutes 40 seconds. It happened when a leader got separated from its peers. The unavailability window is too wide, so I believe it is a bug rather than expected behavior.

The TiDB version is:

rystsov@acceptor1:/mnt/perseus/tikv$ ./tidb-latest-linux-amd64/bin/pd-server --version
Git Commit Hash: f5744d7b52aa4793b84cfdcd4efae1fc9a9bac6b
UTC Build Time:  2017-02-17 09:18:31

rystsov@acceptor1:/mnt/perseus/tikv$ ./tidb-latest-linux-amd64/bin/tikv-server --version
Git Commit Hash: eb185b3babc476080306fef7c05b7673c1342455
UTC Build Time:  2017-02-17 08:12:57
Rustc Version:   1.17.0-nightly (ba7cf7cc5 2017-02-11)

rystsov@acceptor1:/mnt/perseus/tikv$ ./tidb-latest-linux-amd64/bin/tidb-server -V
Git Commit Hash: a8d185d8cb8485e1a124919d0df8b10a16bc6e40
UTC Build Time:  2017-02-17 08:50:53

The client app opened a connection to each of the nodes and continuously ran the following loop over each connection:

  1. read a value by a key
  2. if the value wasn't set, set it to 0
  3. increment the value
  4. write it back
  5. increment a number of successful iterations
  6. repeat the loop

Each connection used its own key to avoid collisions. If there was an error during the loop, the client closed the current connection, opened a new one and began the next iteration.

Once a second it dumped the aggregated number of successful iterations for the whole cluster and for each node over the last second.
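For reference, a minimal Go sketch of this loop (not the actual perseus client linked below; the kv(k, v) table, the DSN format and the address are assumptions):

package main

import (
    "database/sql"
    "fmt"
    "sync/atomic"
    "time"

    _ "github.com/go-sql-driver/mysql" // TiDB speaks the MySQL wire protocol
)

// runLoop runs the read-increment loop against one TiDB node. On any error it
// drops the connection, opens a new one and continues, as described above.
// `ok` counts successful iterations; a table kv(k, v) is assumed to exist.
func runLoop(dsn, key string, ok *uint64) {
    for {
        db, err := sql.Open("mysql", dsn)
        if err != nil {
            time.Sleep(time.Second)
            continue
        }
        for {
            var val int64
            err := db.QueryRow("SELECT v FROM kv WHERE k = ?", key).Scan(&val)
            if err == sql.ErrNoRows {
                val, err = 0, nil // value wasn't set yet: start from 0
            }
            if err == nil {
                _, err = db.Exec("REPLACE INTO kv (k, v) VALUES (?, ?)", key, val+1)
            }
            if err != nil {
                break // close this connection and open a new one
            }
            atomic.AddUint64(ok, 1)
        }
        db.Close()
    }
}

func main() {
    var ok uint64
    // One connection per TiDB node, each with its own key, as in the issue.
    go runLoop("root@tcp(10.0.0.5:4000)/test", "key1", &ok)
    for range time.Tick(time.Second) {
        fmt.Println(atomic.SwapUint64(&ok, 0)) // iterations during the last second
    }
}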

When I separated a leader (10.0.0.7) from the peers with the following command:

sudo iptables -A INPUT -s 10.0.0.5 -j DROP
sudo iptables -A INPUT -s 10.0.0.6 -j DROP
sudo iptables -A OUTPUT -d 10.0.0.5 -j DROP
sudo iptables -A OUTPUT -d 10.0.0.6 -j DROP

the cluster became unavailable for more than two minutes (the rate of successful iterations dropped to zero).

Please see this repo for the client's code, more information about the incident, and the repro steps: https://github.com/rystsov/perseus/tree/master/tidb

Member

c4pt0r commented Feb 19, 2017

Thanks for reporting!
@siddontang PTAL

@siddontang
Member

Hi @rystsov

It seems that the leader can't be elected for a long time. Could you give me all your TiKV logs?

Member

siddontang commented Feb 19, 2017

Hi @rystsov

I have used your case and reproduced the problem. It is caused by your test client connecting to the TiDB on 10.0.0.7, while this TiDB can't connect to the other PDs and TiKVs on 10.0.0.5 and 10.0.0.6. After retrying for a long time, TiDB returns a timeout error to the test client, then the test client connects to another TiDB and works again. On my machine, the unavailable time is about 1m30s.

TiDB is stateless; only PD and TiKV have the concept of a leader, because of Raft. When the partition occurs, a new leader is elected in about 20s, which is within the expected range. So we need to let TiDB return an error immediately, or not retry too many times, when we know that TiDB can't provide service. /cc @coocood
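To illustrate the "don't retry too many times" idea, here is a minimal Go sketch (not TiDB's internal code; the function and its parameters are made up) of bounding retries by a total time budget so an error surfaces quickly and the client can move on to another TiDB:

package main

import (
    "errors"
    "fmt"
    "time"
)

// retryWithBudget retries op with exponential backoff, but surfaces the error
// as soon as the total time budget is spent, instead of retrying for minutes.
func retryWithBudget(budget time.Duration, op func() error) error {
    deadline := time.Now().Add(budget)
    backoff := 50 * time.Millisecond
    for {
        err := op()
        if err == nil {
            return nil
        }
        if time.Now().Add(backoff).After(deadline) {
            return err // give up quickly so the client can try another TiDB
        }
        time.Sleep(backoff)
        if backoff < time.Second {
            backoff *= 2
        }
    }
}

func main() {
    err := retryWithBudget(3*time.Second, func() error {
        return errors.New("region unavailable") // stand-in for a failing TiKV request
    })
    fmt.Println(err)
}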

Author

rystsov commented Feb 19, 2017

Hi @siddontang

I'll be able to provide the logs in a couple of days.

It is caused by your test client connecting to the TiDB on 10.0.0.7, while this TiDB can't connect to the other PDs and TiKVs on 10.0.0.5 and 10.0.0.6. After retrying for a long time, TiDB returns a timeout error to the test client, then the test client connects to another TiDB and works again.

That isn't 100% correct. My data shows that the whole cluster becomes unavailable, so there is no other TiDB to connect to.

The test application opened a connection to each of the 3 TiDB nodes of the cluster and tested them independently (measuring the rate of successful operations per connection). After I isolated the first node (I called it a leader because the success rate for that node was higher than for the rest of the nodes), the rate dropped to zero for the whole cluster, and it took more than 2 minutes to recover:

86  445 195 126 124
87  474 214 132 128
88  19  8   5   6
89  0   0   0   0
...
247 0   0   0   0
248 29  0   0   29
249 237 0   138 99
250 289 0   179 110
251 314 0   197 117

The first column is the number of seconds since the beginning of the experiment; the second is the number of successful read-inc-write iterations across the whole cluster over the last second. The last three columns are the success rates for each connection.

@ngaut ngaut added the type/enhancement The issue or PR belongs to an enhancement. label May 6, 2017
@ngaut ngaut added this to the rc3 milestone May 6, 2017
@ngaut ngaut modified the milestones: 1.0 GA, rc3 Aug 16, 2017
@ngaut ngaut modified the milestones: 1.0 GA, 1.1 Oct 28, 2017
Author

rystsov commented Dec 19, 2017

Hi folks, I reproduced the experiment using a 3-node deployment (each node hosted PD, TiKV and TiDB) with a recent version of TiDB:

> ./tidb-latest-linux-amd64/bin/pd-server -V
Release Version: v1.1.0-alpha-50-gbc538fe
Git Commit Hash: bc538fe75a60a03264c8d6898e75e66d407d1c80

> ./tidb-latest-linux-amd64/bin/tikv-server -V
TiKV
Release Version:   1.0.1

> ./tidb-latest-linux-amd64/bin/tidb-server -V
Release Version: v1.1.0-alpha-290-g7995346
Git Commit Hash: 7995346d8d54afd4d85ac3c9f6a3594c0bc7dbbd

For each TiDB, a client opened a connection, used a different key to avoid collisions, and executed the read-increment routine in a loop.

10	164	23	63	78
11	198	57	64	77
12	204	59	66	79
13	200	56	66	78

The first number in a row is the ordinal second of the experiment, the second is the number of iterations done across all connections during that second, and the last three are the numbers of iterations per connection (10.0.0.8, 10.0.0.9, 10.0.0.10).

During the experiment I used iptables to isolate the node (10.0.0.10) which had a higher RPS:

sudo iptables -A INPUT -s 10.0.0.8 -j DROP; sudo iptables -A INPUT -s 10.0.0.9 -j DROP; sudo iptables -A OUTPUT -d 10.0.0.8 -j DROP; sudo iptables -A OUTPUT -d 10.0.0.9 -j DROP

I expected the RPS in the last column to drop to zero without affecting the other columns. But from the client's perspective the whole cluster went down for almost 20 seconds:

68	190	54	63	73
69	110	32	35	43
70	0	0	0	0
...
88	0	0	0	0
89	45	19	26	0
90	131	54	77	0

After some time I removed the isolation (approximately at the 126th second):

sudo iptables -D INPUT -s 10.0.0.8 -j DROP; sudo iptables -D INPUT -s 10.0.0.9 -j DROP; sudo iptables -D OUTPUT -d 10.0.0.8 -j DROP; sudo iptables -D OUTPUT -d 10.0.0.9 -j DROP

The expectation was to see an immediate recovery of the RPS on the third connection, but it took 64 seconds and followed a very strange pattern. First the cluster became unavailable for almost 6 seconds:

126	120	50	70	0
127	128	55	73	0
...
131	100	44	56	0
132	0	0	0	0
...
136	0	0	0	0
137	16	0	16	0
138	117	42	75	0
139	123	53	70	0

Nothing happened for more than 40 seconds (the RPS of the third node was zero), but then the cluster became unavailable for almost 12 seconds:

175	125	52	73	0
176	34	15	19	0
177	0	0	0	0
...
188	0	0	0	0
189	66	28	38	0
190	179	57	73	49

Finally, the third node recovered.

As you can see, a simulated network glitch affecting one node in a three-node deployment caused TiDB to stop serving requests for more than 35 seconds.

Please find the logs here. I hope they will help you fix the issue :)

  • client.log - client's log
  • (tidb1|tidb2|tidb3).pd.log - PD logs
  • (tidb1|tidb2|tidb3).kv.log - TiKV logs
  • (tidb1|tidb2|tidb3).db.log - TiDB logs

Member

jackysp commented Dec 19, 2017

Thanks for your feedback!
PTAL @siddontang @BusyJay @disksing

Member

shenli commented Dec 19, 2017

@rystsov We are investigating this issue.

@siddontang
Member

Hi @rystsov

But from the client's perspective the whole cluster went down for almost 20 seconds

The first 0 QPS is because all your requests need to get data from TiKV store 5, which has been partitioned. In our previous tests, if we use iptables to do the isolation, the gRPC connection can't detect that the network is broken, and it throws a timeout error after 20 seconds (we configure the gRPC timeout to 20 seconds here).

We can decrease the timeout or use keepalive to let the connection detect the break more quickly. But note that TiKV also still needs some time to elect a new leader if the old leader is isolated.

The following log in tidb1.db.log shows this error:

2017/12/19 06:45:50.722 region_cache.go:467: [info] drop regions of store 5 from cache due to request fail, err: rpc error: code = DeadlineExceeded desc = context deadline exceeded
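As a side note on the keepalive suggestion above, client-side keepalive in grpc-go looks roughly like this (a generic illustration only, not TiDB's actual dial code; the address and the parameter values are assumptions):

package main

import (
    "log"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/keepalive"
)

// dialWithKeepalive: when iptables silently drops packets the TCP connection
// still looks healthy, so without keepalive pings the failure is only noticed
// when a per-request deadline (20s in the discussion above) expires.
func dialWithKeepalive(addr string) (*grpc.ClientConn, error) {
    return grpc.Dial(addr,
        grpc.WithInsecure(),
        grpc.WithKeepaliveParams(keepalive.ClientParameters{
            Time:                10 * time.Second, // ping an idle connection every 10s
            Timeout:             3 * time.Second,  // declare it dead if no ack within 3s
            PermitWithoutStream: true,             // ping even with no active RPCs
        }),
    )
}

func main() {
    conn, err := dialWithKeepalive("10.0.0.5:20160") // hypothetical TiKV address
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()
}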

but then the cluster became unavailable for almost 12 seconds

The second 0 QPS is caused by the PD leader election. I will explain the reason in detail here:

  1. The PD Raft leader on 10.0.0.8 received a higher term from the rejoining peer on 10.0.0.10, which caused the leader to step down; you can see this in the log in tidb1.pd.log:
2017/12/19 06:46:27.735 log.go:85: [info] raft: [2d46bf62043d6e80 [term: 3] received a MsgAppResp message with higher term from 2946b859ec75fb6a [term: 17]]
2017/12/19 06:46:27.735 log.go:85: [info] raft: [2d46bf62043d6e80 became follower at term 17]
  2. Then the peer on 10.0.0.8 started another election and became the leader again:
2017/12/19 06:46:30.918 log.go:85: [info] raft: [2d46bf62043d6e80 became leader at term 18]
  3. If a read request was sent to the original leader before 06:46:27, the peer may not be the leader any more, and the Raft ReadIndex operation hangs and times out, see:
2017/12/19 06:46:38.734 log.go:81: [warning] etcdserver: [timed out waiting for read index response]

So from 06:46:27 to 06:46:38, PD couldn't provide service, and the QPS became 0 because TiDB couldn't get a TSO from PD.

Using the Raft Pre-Vote feature can avoid the case where the Raft leader is disturbed accidentally, but we also need to find out why the Raft ReadIndex callback is not called immediately after the peer becomes the leader again but instead times out. We will discuss this with the etcd team and then update later. /cc @xiang90
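For reference, enabling Pre-Vote in etcd's raft library (which PD embeds) is a single configuration flag. This is a minimal sketch, not PD's actual startup code; the IDs and tick values are placeholders:

package main

import (
    "github.com/coreos/etcd/raft" // import path from the era of this issue
)

func main() {
    storage := raft.NewMemoryStorage()
    c := &raft.Config{
        ID:              1,
        ElectionTick:    10,
        HeartbeatTick:   1,
        Storage:         storage,
        MaxSizePerMsg:   4096,
        MaxInflightMsgs: 256,
        // PreVote makes a rejoining, previously partitioned peer probe the
        // cluster with pre-vote requests instead of bumping its term, so a
        // healthy leader is not forced to step down as in the log above.
        PreVote: true,
    }
    n := raft.StartNode(c, []raft.Peer{{ID: 1}, {ID: 2}, {ID: 3}})
    defer n.Stop()
    // Driving n.Tick() and n.Ready() is omitted; this only shows the config knob.
}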

Author

rystsov commented Jan 4, 2018

Uploaded logs here and removed them from the repo - logs.tar.gz

Author

rystsov commented Jan 4, 2018

By the way, I dockerized the scripts so the test can be reproduced with just a few commands - https://github.com/rystsov/perseus/tree/master/tidb

I ran another test and observed the following: after a node was isolated from a three-node cluster, the whole system became unresponsive for more than 5 minutes. I tried waiting until it recovered, then rejoined the node, and it took another 1m30s until it revived. The logs - logs.tar.gz.

@siddontang
Member

Thanks @rystsov

I will reproduce it with your scripts and dive into this problem.

Member

siddontang commented Jan 4, 2018

Hi @rystsov

I see that you isolated tidb1 for 5m and then rejoined it. The 5 minutes of no response makes sense; I will explain it in detail here:

  1. You use docker compose to create and start the whole cluster quickly, but at first TiKV initializes only 1 replica for a Raft group on one TiKV; PD chooses another TiKV to add a replica 1 minute later, then does the same thing again 2 minutes later. So even though you start the whole cluster, the first Raft group may only have its full 3 replicas after 2+ minutes. But in your client.log
#legend: time|tidb1|tidb2|tidb3|tidb1:err|tidb2:err|tidb3:err
1	2	5	10	0	0	0	2018/01/04 05:55:15
......
21	39	36	39	0	0	0	2018/01/04 05:55:35
# isolating tidb1

You start the cluster and isolate tidb1 within 1 minute; at this time, there is only 1 replica of the Raft group, on TiKV 1 on the tidb2 machine.

  2. PD finds there is only 1 replica, so it chooses a TiKV and decides to add a new replica of the Raft group. Sadly, PD chooses the isolated TiKV 6, which is on the tidb1 machine, and sends the AddPeer command to TiKV 1; see the PD log:
2018/01/04 05:55:12.263 coordinator.go:355: [info] [region 2] add operator: makeUpReplica (kind:region,replica, region:2, createAt:2018-01-04 05:55:12.2633296 +0000 UTC m=+19.474885601, currentStep:0, steps:[add peer 30 on store 6]) 
  3. The leader replica on TiKV 1 receives the command, appends it to its Raft log and then applies it, and TiKV 1 reports the result to PD. You can see it in the PD log:
2018/01/04 05:55:12.274 cache.go:396: [info] [region 2] Add peer:{id:30 store_id:6 }, ConfVer changed from {1} to {2}
  4. Now the Raft group contains two replicas, but one of them can't work, so every following operation fails because the Raft log can't be committed (only a minority of the group is available).

This is a known problem, see https://github.com/coreos/etcd/tree/master/raft#implementation-notes; the note only describes the problem for removing a Raft member, but adding a Raft member has the same problem.
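To make the arithmetic behind the last step explicit, a tiny Go sketch of the Raft majority quorum:

package main

import "fmt"

// quorum returns how many voters a Raft group needs to commit an entry.
func quorum(voters int) int { return voters/2 + 1 }

func main() {
    // Before AddPeer: 1 voter, quorum 1, so the lone replica commits by itself.
    fmt.Println(quorum(1)) // 1
    // After AddPeer on the isolated TiKV: 2 voters, quorum 2, but only one
    // voter is reachable, so nothing (not even a corrective conf change) can
    // be committed until the partition heals.
    fmt.Println(quorum(2)) // 2
}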

Although the result is reasonable here, I think it is still a bug we need to fix or improve. Maybe we can:

  • let PD check TiKV liveness immediately: if PD finds it can't communicate with a TiKV, it should not add a replica to that TiKV.
  • speed up the replica-adding mechanism during initialization.
  • a better way is to support Raft non-voters.

/cc @disksing

Maybe for now you can wait some time (5 minutes) and then test again.

@siddontang
Member

rejoined the node and it took another 1m30s until it revived

This is the same problem I mentioned above. I am fixing the etcd ReadIndex problem now.

Author

rystsov commented Jan 5, 2018

@siddontang thanks for the explanation! As I understand it, when TiDB starts it works without replication (the Raft group consists of one TiKV), and it takes about 2 minutes to increase the replication factor, so there is a risk of losing already acknowledged data. Is that correct? Is it possible to configure TiDB to reject requests unless there are at least 2 TiKVs in a Raft group?

@siddontang
Member

Hi @rystsov

Yes, we have this problem. The number of replicas may be 1 or 2, so if any replica fails, the whole cluster can't work.

But as you can see, this only affects cluster bootstrap, so it is not a serious problem now. Maybe we can speed up the initial replication. @disksing it seems we can connect to PD as soon as a TiKV starts providing service, then PD can add the replicas immediately, with no need to wait 1 minute.

@ngaut ngaut removed this from the 2.0 milestone May 29, 2018