Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug][meta] Broker enter the loop to recreate zkSession failed, cause broker stuck #23015

Open
2 of 3 tasks
TakaHiR07 opened this issue Jul 9, 2024 · 0 comments
Open
2 of 3 tasks
Labels
type/bug The PR fixed a bug or issue reported a bug

Comments

@TakaHiR07
Copy link
Contributor

TakaHiR07 commented Jul 9, 2024

Search before asking

  • I searched in the issues and found nothing similar.

Read release policy

  • I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

Version

zookeeper: 3.6.3
broker: 3.0.4
bookkeeper: 4.16.4

Minimal reproduce step

continue to do perf-produce to a 5-nodes cluster, with several broker restart operation.

What did you expect to see?

broker should reCreate zkSession success and not stuck.

What did you see instead?

Firstly, everything is ok. After running for some time, notice one of the five broker have no throughput, but broker process is alive.

Then it is found in log, the broker always retry recreate zkSession frequently but failed.

15:55:28.066 [ZKC-connect-executor-0-SendThread(ip3:2181)] INFO  org.apache.zookeeper.ClientCnxn - Session establishment complete on server ip3/ip3:2181, session id = 0x3016bf7f9583a32, negotiated timeout = 30000
15:55:28.066 [ZKC-connect-executor-0-EventThread] INFO  org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - ZooKeeper client is connected now.
15:55:28.066 [metadata-store-38-1] INFO  org.apache.pulsar.metadata.impl.ZKSessionWatcher - Got ZK session watch event: WatchedEvent state:SyncConnected type:None path:null zxid: -1
15:55:28.066 [metadata-store-38-1] INFO  org.apache.pulsar.metadata.impl.ZKSessionWatcher - ZooKeeper client reconnection with server quorum. Current status: SessionLost
15:55:28.066 [metadata-store-38-1] INFO  org.apache.pulsar.broker.PulsarService - Received metadata service session event: Reconnected
15:55:28.066 [ZKC-connect-executor-0-EventThread] ERROR org.apache.pulsar.metadata.impl.ZKMetadataStore - Failed to recreate persistent watch on ZooKeeper: CONNECTIONLOSS
15:55:28.066 [ZKC-connect-executor-0-EventThread] INFO  org.apache.pulsar.metadata.impl.PulsarZooKeeperClient - ZooKeeper session 3016bf7f9583a32 is expired from ip1:2181,ip2:2181,ip3:2181.
15:55:28.066 [metadata-store-38-1] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - Received MetadataStore session event: Reconnected
15:55:28.066 [metadata-store-coordination-service-47-1] INFO  org.apache.pulsar.metadata.coordination.impl.LockManagerImpl - Metadata store connection has been re-established. Revalidating locks that were pending.
15:55:28.066 [metadata-store-coordination-service-64-1] INFO  org.apache.pulsar.metadata.coordination.impl.LockManagerImpl - Metadata store connection has been re-established. Revalidating locks that were pending.
15:55:28.066 [metadata-store-coordination-service-40-1] INFO  org.apache.pulsar.metadata.coordination.impl.LockManagerImpl - Metadata store connection has been re-established. Revalidating locks that were pending.
15:55:28.172 [ZKC-connect-executor-0] INFO  org.apache.zookeeper.ZooKeeper - Session: 0x3016bf7f9583a32 closed
15:55:28.406 [ZKC-connect-executor-0] INFO  org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=ip1:2181,ip2:2181,ip3:2181 sessionTimeout=30000 watcher=org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase@1a78dacd
15:55:28.173 [ZKC-connect-executor-0-EventThread] INFO  org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x3016bf7f9583a32


15:55:28.186 [ZKC-connect-executor-0-SendThread(ip2:2181)] INFO  org.apache.zookeeper.ClientCnxn - Session establishment complete on server ip2/ip2:2181, session id = 0x20d399292970a9d, negotiated timeout = 30000
15:55:28.186 [ZKC-connect-executor-0-EventThread] INFO  org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - ZooKeeper client is connected now.
15:55:28.186 [metadata-store-38-1] INFO  org.apache.pulsar.metadata.impl.ZKSessionWatcher - Got ZK session watch event: WatchedEvent state:SyncConnected type:None path:null zxid: -1
15:55:28.186 [metadata-store-38-1] INFO  org.apache.pulsar.metadata.impl.ZKSessionWatcher - ZooKeeper client reconnection with server quorum. Current status: SessionLost
15:55:28.186 [metadata-store-38-1] INFO  org.apache.pulsar.broker.PulsarService - Received metadata service session event: Reconnected
15:55:28.186 [ZKC-connect-executor-0-EventThread] ERROR org.apache.pulsar.metadata.impl.ZKMetadataStore - Failed to recreate persistent watch on ZooKeeper: CONNECTIONLOSS
15:55:28.186 [ZKC-connect-executor-0-EventThread] INFO  org.apache.pulsar.metadata.impl.PulsarZooKeeperClient - ZooKeeper session 20d399292970a9d is expired from ip1:2181,ip2:2181,ip3:2181.
15:55:28.186 [metadata-store-38-1] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - Received MetadataStore session event: Reconnected
15:55:28.186 [metadata-store-coordination-service-47-1] INFO  org.apache.pulsar.metadata.coordination.impl.LockManagerImpl - Metadata store connection has been re-established. Revalidating locks that were pending.
15:55:28.186 [metadata-store-coordination-service-40-1] INFO  org.apache.pulsar.metadata.coordination.impl.LockManagerImpl - Metadata store connection has been re-established. Revalidating locks that were pending.
15:55:28.186 [metadata-store-coordination-service-64-1] INFO  org.apache.pulsar.metadata.coordination.impl.LockManagerImpl - Metadata store connection has been re-established. Revalidating locks that were pending.
15:55:28.292 [ZKC-connect-executor-0] INFO  org.apache.zookeeper.ZooKeeper - Session: 0x20d399292970a9d closed
15:55:28.292 [ZKC-connect-executor-0] INFO  org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=ip1:2181,ip2:2181,ip3:2181 sessionTimeout=30000 watcher=org.apache.bookkeep
er.zookeeper.ZooKeeperWatcherBase@1a78dacd
15:55:28.294 [ZKC-connect-executor-0-EventThread] INFO  org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x20d399292970a9d



15:55:28.300 [ZKC-connect-executor-0-SendThread(ip3:2181)] INFO  org.apache.zookeeper.ClientCnxn - Session establishment complete on server ip3/ip3:2181, session id = 0x3016bf7f9583a33, negotiated timeout = 30000
15:55:28.300 [ZKC-connect-executor-0-EventThread] INFO  org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - ZooKeeper client is connected now.
15:55:28.300 [metadata-store-38-1] INFO  org.apache.pulsar.metadata.impl.ZKSessionWatcher - Got ZK session watch event: WatchedEvent state:SyncConnected type:None path:null zxid: -1
15:55:28.300 [metadata-store-38-1] INFO  org.apache.pulsar.metadata.impl.ZKSessionWatcher - ZooKeeper client reconnection with server quorum. Current status: SessionLost
15:55:28.300 [metadata-store-38-1] INFO  org.apache.pulsar.broker.PulsarService - Received metadata service session event: Reconnected
15:55:28.300 [ZKC-connect-executor-0-EventThread] ERROR org.apache.pulsar.metadata.impl.ZKMetadataStore - Failed to recreate persistent watch on ZooKeeper: CONNECTIONLOSS
15:55:28.300 [ZKC-connect-executor-0-EventThread] INFO  org.apache.pulsar.metadata.impl.PulsarZooKeeperClient - ZooKeeper session 3016bf7f9583a33 is expired from ip1:2181,ip2:2181,ip3:2181.
15:55:28.300 [metadata-store-38-1] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - Received MetadataStore session event: Reconnected
15:55:28.300 [metadata-store-coordination-service-47-1] INFO  org.apache.pulsar.metadata.coordination.impl.LockManagerImpl - Metadata store connection has been re-established. Revalidating locks that were pending.
15:55:28.300 [metadata-store-coordination-service-40-1] INFO  org.apache.pulsar.metadata.coordination.impl.LockManagerImpl - Metadata store connection has been re-established. Revalidating locks that were pending.
15:55:28.300 [metadata-store-coordination-service-64-1] INFO  org.apache.pulsar.metadata.coordination.impl.LockManagerImpl - Metadata store connection has been re-established. Revalidating locks that were pending.
15:55:28.406 [ZKC-connect-executor-0] INFO  org.apache.zookeeper.ZooKeeper - Session: 0x3016bf7f9583a33 closed
15:55:28.406 [ZKC-connect-executor-0] INFO  org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=ip1:2181,ip2:2181,ip3:2181 sessionTimeout=30000 watcher=org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase@1a78dacd
15:55:28.408 [ZKC-connect-executor-0-EventThread] INFO  org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x3016bf7f9583a33

There is some questions.

  1. 3-node Zk server is all runnable, it seems everything of zk server is OK. Actually the other 4 broker is OK, only 1 broker enter the loop to recreate ZkSession. And all Brokers located machine's load are low. I guess restart broker can fix this issue, but I do not do that, for the reason to preserve the problem site. But actually broker should recover by itself.
  2. Which case would result in this error log "Failed to recreate persistent watch on ZooKeeper: CONNECTIONLOSS" ?
  3. Currently I prefer that this stuck is because of broker's way to reCreate zk session, but not zk itself.
  4. I don't know whether should I update the zk version with pulsar version 3.0.

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@TakaHiR07 TakaHiR07 added the type/bug The PR fixed a bug or issue reported a bug label Jul 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

No branches or pull requests

1 participant