Consumer shutdown on Commit timeout #778
-
Hi @mayurjaiswal9 - to be honest, I am not 100% sure - there does seem to be an issue with the logic handling errors on commit. I have also observed (and there was an issue raised previously that went stale) that the internal Consumer can get stuck and prevent proper shutdown when the Kafka cluster becomes unavailable. Do you have more detailed logs for this, and additional details - your ParallelConsumer configuration (i.e. builder / options), KafkaConsumer properties, etc.? And maybe this should really be raised as an Issue - if you don't mind, could you raise it there?
-
This is the issue that was previously raised - #597 - it may be related.
-
Thanks for the update - adding more details. Once we pause the broker (disconnect the broker in our Docker environment), within the first 10 seconds we see the error below:

io.confluent.parallelconsumer.internal.InternalRuntimeException: Timeout waiting for commit response PT30S to request ConsumerOffsetCommitter.CommitRequest(id=075626a0-6219-4314-b968-d1b93577f6f4, requestedAtMs=1718025763663)

After that, the consumer group went stale after a longer interval, with the error message below:

Unknown error - org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before successfully committing offsets {decanting-order-commands-0=OffsetAndMetadata{offset=9611, leaderEpoch=null, metadata='bgAA'}}

Relevant classes: KafkaConsumer.java, BrokerPollSystem.java.

The logic used to subscribe and poll is shown in the sketch below.

Question: Is this expected behaviour? Will this gracefully shut down the consumer after the timeout?
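A minimal sketch of a typical subscribe / poll setup, assuming the decanting-order-commands topic from the log above; the builder values (ordering, concurrency, commit mode) are illustrative assumptions, not the reporter's actual configuration:

```java
import io.confluent.parallelconsumer.ParallelConsumerOptions;
import io.confluent.parallelconsumer.ParallelConsumerOptions.CommitMode;
import io.confluent.parallelconsumer.ParallelConsumerOptions.ProcessingOrder;
import io.confluent.parallelconsumer.ParallelStreamProcessor;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.util.List;
import java.util.Properties;

public class ConsumerSetup {

    static ParallelStreamProcessor<String, String> buildAndSubscribe(Properties consumerProps) {
        // The plain KafkaConsumer is handed over to Parallel Consumer unsubscribed
        KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(consumerProps);

        ParallelConsumerOptions<String, String> options = ParallelConsumerOptions.<String, String>builder()
                .consumer(kafkaConsumer)
                .ordering(ProcessingOrder.KEY)                 // illustrative choice
                .maxConcurrency(16)                            // illustrative choice
                .commitMode(CommitMode.PERIODIC_CONSUMER_SYNC) // illustrative choice
                .build();

        ParallelStreamProcessor<String, String> processor =
                ParallelStreamProcessor.createEosStreamProcessor(options);

        // Subscribe and start polling; the user function runs on the Parallel Consumer worker pool
        processor.subscribe(List.of("decanting-order-commands"));
        processor.poll(context -> {
            var record = context.getSingleConsumerRecord();
            // ... application processing of the record goes here ...
            System.out.printf("processed key=%s offset=%d%n", record.key(), record.offset());
        });
        return processor;
    }
}
```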
-
Hmm, I will need to test this again. I have seen issues where Parallel Consumer is not able to shut down a Kafka Consumer that can't connect to the broker cluster - it is more to do with how the Kafka Consumer behaves than with Parallel Consumer itself per se - but I am not sure off the top of my head what exactly will happen in this case / when the brokers come back online. I think the underlying Kafka Consumer doesn't actually shut down but keeps retrying to reconnect to the Kafka cluster, while at the same time Parallel Consumer shuts down on error with the closedOrFailed flag.

We could try to model a shutdown followed by a restart cycle, instead of just shutting down on error, if we can somehow reset the underlying Kafka Consumer. Alternatively, a restart cycle can be implemented in application code through a separate Parallel Consumer state monitoring thread - you can check the reason for failure there as well and decide whether it should be restarted or not; resetting the Kafka Consumer would then be up to the application, i.e. providing a fresh Kafka Consumer instance to Parallel Consumer. We could add a listener, or a hook that returns a future completed on Parallel Consumer shutdown, to make that monitoring easier - if that is the path to take.

That said, I don't want to make it the default behaviour that Parallel Consumer restarts on errors by itself - I would rather leave it to the application developer to decide whether the application should shut down as well or be restarted.

Without any new shutdown hooks for Parallel Consumer, that may look like this in a naive / example implementation:
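A naive sketch of such an application-level monitoring / restart loop, assuming the concrete ParallelEoSStreamProcessor type (so that isClosedOrFailed() can be used to detect failure); the topic name, sleep intervals and inline processing are placeholders:

```java
import io.confluent.parallelconsumer.ParallelConsumerOptions;
import io.confluent.parallelconsumer.ParallelEoSStreamProcessor;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class RestartingConsumerRunner {

    // Naive restart loop: recreate Parallel Consumer (and its KafkaConsumer) whenever it fails.
    public void runWithRestarts(Properties consumerProps) throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            // A fresh KafkaConsumer for every Parallel Consumer instance - the old one
            // may be stuck internally retrying against an unreachable cluster
            KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(consumerProps);

            ParallelConsumerOptions<String, String> options = ParallelConsumerOptions.<String, String>builder()
                    .consumer(kafkaConsumer)
                    .build();

            // Concrete type used so that isClosedOrFailed() is available for monitoring
            ParallelEoSStreamProcessor<String, String> processor = new ParallelEoSStreamProcessor<>(options);
            processor.subscribe(List.of("my-topic"));   // "my-topic" is a placeholder
            processor.poll(context -> {
                // application processing goes here
                System.out.println("processed offset " + context.getSingleConsumerRecord().offset());
            });

            // Monitoring: block until Parallel Consumer reports itself closed or failed
            while (!processor.isClosedOrFailed()) {
                Thread.sleep(Duration.ofSeconds(5).toMillis());
            }

            // The failure reason could be inspected here to decide whether to restart or give up;
            // this naive version always restarts after a short back-off.
            Thread.sleep(Duration.ofSeconds(10).toMillis());
        }
    }
}
```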
-
Hey @rkolesnev, I just want to add that my team has also seen this issue twice in the past two weeks. Some of our Kafka brokers are being restarted for maintenance this week, which causes a commit to fail; the Parallel Consumer then shuts down and doesn't recover, even after multiple hours. I understand your point that the logic for recovering from these errors should be on the developer's side, but I'll also add a +1 in favour of adding some sort of error handling capability to Parallel Consumer (even if disabled by default).
-
Hi All,
We are using parallel-consumer-core 0.5.2.8. Currently, when we face network issues where the consumer is not able to reach the broker, the consumer group shuts down and we see the error below. Once the network is back up, we do not see it connecting back again. Should the offset commit timeout be larger than the session timeout?
I assume this is a misconfiguration - could someone point me to the right value to set for the commit timeout so that the consumer does not crash?

[ERROR] 2024-05-29 10:41:14.468 [pc-control] AbstractParallelEoSStreamProcessor - Error from poll control thread, will attempt controlled shutdown, then rethrow. Error: Timeout waiting for commit response PT30S to request ConsumerOffsetCommitter.CommitRequest(id=e26af99f-df1d-4a43-9d9c-a6c5d9321553, requestedAtMs=1716972064466)

The commitTimeout is injected from offsetCommitTimeout in ParallelConsumerOptions (default value).
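For reference, a minimal sketch of overriding that timeout through the options builder - the 120-second value is purely illustrative, and whether the commit timeout should exceed the consumer session timeout depends on your cluster and consumer settings:

```java
import io.confluent.parallelconsumer.ParallelConsumerOptions;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Properties;

public class CommitTimeoutConfig {

    static ParallelConsumerOptions<String, String> optionsWithLongerCommitTimeout(Properties consumerProps) {
        KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(consumerProps);

        return ParallelConsumerOptions.<String, String>builder()
                .consumer(kafkaConsumer)
                // Raise the timeout used when waiting for ConsumerOffsetCommitter.CommitRequest responses;
                // 120s is only an example value - tune it against your broker and session timeouts
                .offsetCommitTimeout(Duration.ofSeconds(120))
                .build();
    }
}
```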