DEFBE-3888 retrieve leader after reconnecting to ZooKeeper #2
https://deepfield.atlassian.net/browse/DEFBE-3888
https://issues.apache.org/jira/browse/FLINK-19557
We noticed an increase in the number of stuck leader elections occurring on our test deployments. After further investigation, we determined that the recent Flink upgrade introduced a bug. When a Flink Job Manager loses its connection to ZooKeeper, it clears the retrieved leader from memory. On reconnect, the Job Manager should re-retrieve the leader. However, if the leader did not change between disconnect and reconnect, the Job Manager never gets the leader again. The Flink code assumes that an update will come from the Curator NodeCache, but the NodeCache "deduplicates" updates and does not fire when the value is the same as before. Therefore, Curator will not send an update on these disconnect-reconnect cycles unless the value changes.

Creating a new NodeCache is the only way to force a fresh value, so we stop the old NodeCache and create a new one on RECONNECTED. We are not allowed to raise exceptions from the connection-state handler, so I had to add a try/catch. I looked through Curator a bit and determined there isn't much to be worried about here: if start() succeeded initially, it should succeed on later calls. The bigger concern is actually close(); if it fails, there is a chance we don't properly clean up these connections. I'm not sure what the impact of an uncleaned connection is, but it's certainly concerning.
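For reviewers, the change described above roughly amounts to the following sketch. This is not the actual patch; the class, field, and helper names (`LeaderNodeCacheReconnectHandler`, `newCache`) are illustrative, and it only assumes the standard Curator `NodeCache` / `ConnectionStateListener` APIs.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.cache.NodeCache;
import org.apache.curator.framework.recipes.cache.NodeCacheListener;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;

// Illustrative stand-in for the leader retrieval service's connection-state handling.
public class LeaderNodeCacheReconnectHandler implements ConnectionStateListener {

    private final CuratorFramework client;
    private final String leaderPath;
    private final NodeCacheListener leaderListener;
    private NodeCache cache;

    public LeaderNodeCacheReconnectHandler(
            CuratorFramework client, String leaderPath, NodeCacheListener leaderListener)
            throws Exception {
        this.client = client;
        this.leaderPath = leaderPath;
        this.leaderListener = leaderListener;
        this.cache = newCache();
    }

    // Hypothetical helper: build, wire up, and start a fresh NodeCache on the leader path.
    private NodeCache newCache() throws Exception {
        NodeCache nodeCache = new NodeCache(client, leaderPath);
        nodeCache.getListenable().addListener(leaderListener);
        nodeCache.start();
        return nodeCache;
    }

    @Override
    public void stateChanged(CuratorFramework client, ConnectionState newState) {
        if (newState == ConnectionState.RECONNECTED) {
            // The old NodeCache will not fire a change event if the leader znode holds
            // the same value it had before the disconnect, so replacing the cache is the
            // only way to force a fresh notification. stateChanged() must not throw,
            // hence the try/catch around both close() and the re-creation.
            try {
                cache.close();
            } catch (Exception e) {
                // If close() fails we may leak the old cache's watchers; nothing more we
                // can do here except record it.
            }
            try {
                cache = newCache();
            } catch (Exception e) {
                // start() succeeded when the service was created, so a failure here is
                // unexpected; swallow it rather than propagate out of the listener.
            }
        }
    }
}
```

The important design point is that the new cache re-registers the same leader listener before start(), so the initial read of the (possibly unchanged) leader node is delivered to the retrieval service as if it were a fresh update.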