DEFBE-3888 retrieve leader after reconnecting to ZooKeeper #2
https://deepfield.atlassian.net/browse/DEFBE-3888
https://issues.apache.org/jira/browse/FLINK-19557
We noticed an increase in the number of stuck leader elections occurring on our test deployments. After further investigation, we determined that the recent Flink upgrade introduced a bug. When a Flink Job Manager loses its connection to ZooKeeper, it clears the retrieved leader from memory. On reconnect, the Job Manager should re-retrieve the leader. However, if the leader did not change between disconnect and reconnect, the Job Manager never gets the leader again. The Flink code assumes that an update will come from the Curator NodeCache, but the NodeCache "deduplicates" updates and does not fire when the value is the same as before. Therefore, Curator will not send an update on these disconnect-reconnect cycles unless the value changes.

Creating a new NodeCache is the only way to force a fresh value, so we stop the old NodeCache and create a new one on RECONNECTED. We are not allowed to raise exceptions from the connection-state handler, so I had to add a try/catch. I looked through Curator a bit and determined there isn't much to be worried about here: if start() succeeded initially, it should succeed on later calls. The bigger concern is actually close(); if it fails, there is a chance we don't properly clean up these connections. I'm not sure what the impact of an uncleaned connection is, but it's certainly concerning.
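For reviewers, the change described above roughly amounts to the following sketch. This is not the actual patch; the class, field, and helper names (`LeaderNodeCacheReconnectHandler`, `newCache`) are illustrative, and it only assumes the standard Curator `NodeCache` / `ConnectionStateListener` APIs.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.cache.NodeCache;
import org.apache.curator.framework.recipes.cache.NodeCacheListener;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;

// Illustrative stand-in for the leader retrieval service's connection-state handling.
public class LeaderNodeCacheReconnectHandler implements ConnectionStateListener {

    private final CuratorFramework client;
    private final String leaderPath;
    private final NodeCacheListener leaderListener;
    private NodeCache cache;

    public LeaderNodeCacheReconnectHandler(
            CuratorFramework client, String leaderPath, NodeCacheListener leaderListener)
            throws Exception {
        this.client = client;
        this.leaderPath = leaderPath;
        this.leaderListener = leaderListener;
        this.cache = newCache();
    }

    // Hypothetical helper: build, wire up, and start a fresh NodeCache on the leader path.
    private NodeCache newCache() throws Exception {
        NodeCache nodeCache = new NodeCache(client, leaderPath);
        nodeCache.getListenable().addListener(leaderListener);
        nodeCache.start();
        return nodeCache;
    }

    @Override
    public void stateChanged(CuratorFramework client, ConnectionState newState) {
        if (newState == ConnectionState.RECONNECTED) {
            // The old NodeCache will not fire a change event if the leader znode holds
            // the same value it had before the disconnect, so replacing the cache is the
            // only way to force a fresh notification. stateChanged() must not throw,
            // hence the try/catch around both close() and the re-creation.
            try {
                cache.close();
            } catch (Exception e) {
                // If close() fails we may leak the old cache's watchers; nothing more we
                // can do here except record it.
            }
            try {
                cache = newCache();
            } catch (Exception e) {
                // start() succeeded when the service was created, so a failure here is
                // unexpected; swallow it rather than propagate out of the listener.
            }
        }
    }
}
```

The important design point is that the new cache re-registers the same leader listener before start(), so the initial read of the (possibly unchanged) leader node is delivered to the retrieval service as if it were a fresh update.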