Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEFBE-3888 retrieve leader after reconnecting to ZooKeeper #2

Conversation

maxmzkrdf
Copy link

@maxmzkrdf maxmzkrdf commented Oct 9, 2020

https://deepfield.atlassian.net/browse/DEFBE-3888

https://issues.apache.org/jira/browse/FLINK-19557

We noticed an increase in the number of stuck leader elections occuring on our test deployments. After further investigation, we determined that the recent Flink upgrade introduced a bug. When a Flink Job Manager looses connection to ZooKeeper it clears the retrieved leader from memory. On reconnect, the Job Manager should re-retrieve the leader. However, if the leader did not change between disconnect and reconnect, then the Job Manager will never get the leader again. The Flink code assumes that an update will come from the curator NodeCache. However, the curator NodeCache will "deduplicate" updates by not sending an update when the value is the same as before. Therefore, curator will not send an update on these disconnect-reconnects unless the value changes. Creating a new NodeCache is the only way to get a new value. So, we are stopping the old NodeCache and making a new one on RECONNECT. We are not allowed to raise exceptions here, so I have to add a try catch. I looked through curator a bit and determined there isn't a lot to be worried about here. If start was successful at the start, it should be successful on future calls. The biggest concern is actually the close. There is a chance we don't properly cleanup these connections if close fails. I'm not sure what the impact of a not cleaned up conncetion is, but it's certianly concerning.

@maxmzkrdf maxmzkrdf self-assigned this Oct 9, 2020
@@ -192,6 +192,14 @@ protected void handleStateChange(ConnectionState newState) {
break;
case RECONNECTED:
LOG.info("Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.");
try {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The solution lgtm.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that this class uses a lock when creating the node cache, not sure if it is important or not.

Copy link

@dlencina dlencina Oct 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, I am not sure if it is important to add

client.getConnectionStateListenable().addListener(connectionStateListener);

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that this class uses a lock when creating the node cache, not sure if it is important or not.

I'll double check. It's my understanding that all of the curator watches are done single threadedly, so I didn't think I would need locks.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, I am not sure if it is important to add

client.getConnectionStateListenable().addListener(connectionStateListener);

No, we still have our listener going to for the connection. We only need to invalidate our cache, the client is still fine.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think stop can be called from another thread. So it's probably best to use the lock. I'll add it.

Copy link

@dlencina dlencina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm :)

@maxmzkrdf maxmzkrdf merged commit f15338b into release-1.11.2-deepfield Oct 14, 2020
@maxmzkrdf maxmzkrdf deleted the DEFBE-3888-release-1.11.2-deepfield-fix-leader-retrieval branch October 14, 2020 15:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants