Consumer shutdown on Commit timeout #778
-
Hi @mayurjaiswal9 - to be honest, I am not 100% sure - there does seem to be an issue with the logic handling errors on commit. I have also observed (and there was an issue raised previously that went stale) that the internal Consumer can get stuck and prevent proper shutdown when the Kafka cluster becomes unavailable. Do you have more detailed logs for this, and additional details - your ParallelConsumer configuration (i.e. builder / options), KafkaConsumer properties, etc.? And maybe this should really be raised as an Issue - if you don't mind, could you raise it there?
-
This is the issue that was previously raised - #597 - it may be related.
-
Thanks for the update - adding more details. Once we pause the broker (disconnect the broker in our Docker environment), within the first 10 seconds we see the error below:

io.confluent.parallelconsumer.internal.InternalRuntimeException: Timeout waiting for commit response PT30S to request ConsumerOffsetCommitter.CommitRequest(id=075626a0-6219-4314-b968-d1b93577f6f4, requestedAtMs=1718025763663)

After that, the consumer group went stale after a longer interval, with the error message below:

Unknown error - org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before successfully committing offsets {decanting-order-commands-0=OffsetAndMetadata{offset=9611, leaderEpoch=null, metadata='bgAA'}}

Relevant classes: KafkaConsumer.java, BrokerPollSystem.java.

The logic used to subscribe and poll is shown in the sketch below.

Question: Is this expected behaviour? Will this gracefully shut down the consumer after the timeout?
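A minimal sketch of a typical subscribe / poll setup, assuming the decanting-order-commands topic from the log above; the builder values (ordering, concurrency, commit mode) are illustrative assumptions, not the reporter's actual configuration:

```java
import io.confluent.parallelconsumer.ParallelConsumerOptions;
import io.confluent.parallelconsumer.ParallelConsumerOptions.CommitMode;
import io.confluent.parallelconsumer.ParallelConsumerOptions.ProcessingOrder;
import io.confluent.parallelconsumer.ParallelStreamProcessor;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.util.List;
import java.util.Properties;

public class ConsumerSetup {

    static ParallelStreamProcessor<String, String> buildAndSubscribe(Properties consumerProps) {
        // The plain KafkaConsumer is handed over to Parallel Consumer unsubscribed
        KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(consumerProps);

        ParallelConsumerOptions<String, String> options = ParallelConsumerOptions.<String, String>builder()
                .consumer(kafkaConsumer)
                .ordering(ProcessingOrder.KEY)                 // illustrative choice
                .maxConcurrency(16)                            // illustrative choice
                .commitMode(CommitMode.PERIODIC_CONSUMER_SYNC) // illustrative choice
                .build();

        ParallelStreamProcessor<String, String> processor =
                ParallelStreamProcessor.createEosStreamProcessor(options);

        // Subscribe and start polling; the user function runs on the Parallel Consumer worker pool
        processor.subscribe(List.of("decanting-order-commands"));
        processor.poll(context -> {
            var record = context.getSingleConsumerRecord();
            // ... application processing of the record goes here ...
            System.out.printf("processed key=%s offset=%d%n", record.key(), record.offset());
        });
        return processor;
    }
}
```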
-
Hmm, I will need to test this again. I have seen issues where Parallel Consumer is not able to shut down a Kafka Consumer that can't connect to the broker cluster - it is more to do with how the Kafka Consumer behaves than with Parallel Consumer itself per se - but I am not sure off the top of my head what exactly will happen in this case / when the brokers come back online. I think the underlying Kafka Consumer doesn't actually shut down but keeps retrying to reconnect to the Kafka cluster, while at the same time Parallel Consumer shuts down on error with the closedOrFailed flag.

We could try to model a shutdown followed by a restart cycle, instead of just shutting down on error, if we can somehow reset the underlying Kafka Consumer. Alternatively, a restart cycle can be implemented in application code through a separate Parallel Consumer state monitoring thread - you can check the reason for failure there as well and decide whether it should be restarted or not; resetting the Kafka Consumer would then be up to the application, i.e. providing a fresh Kafka Consumer instance to Parallel Consumer. We could add a listener, or a hook that returns a future completed on Parallel Consumer shutdown, to make that monitoring easier - if that is the path to take.

That said, I don't want to make it the default behaviour that Parallel Consumer restarts on errors by itself - I would rather leave it to the application developer to decide whether the application should shut down as well or be restarted.

Without any new shutdown hooks for Parallel Consumer, that may look like this in a naive / example implementation:
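A naive sketch of such an application-level monitoring / restart loop, assuming the concrete ParallelEoSStreamProcessor type (so that isClosedOrFailed() can be used to detect failure); the topic name, sleep intervals and inline processing are placeholders:

```java
import io.confluent.parallelconsumer.ParallelConsumerOptions;
import io.confluent.parallelconsumer.ParallelEoSStreamProcessor;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class RestartingConsumerRunner {

    // Naive restart loop: recreate Parallel Consumer (and its KafkaConsumer) whenever it fails.
    public void runWithRestarts(Properties consumerProps) throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            // A fresh KafkaConsumer for every Parallel Consumer instance - the old one
            // may be stuck internally retrying against an unreachable cluster
            KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(consumerProps);

            ParallelConsumerOptions<String, String> options = ParallelConsumerOptions.<String, String>builder()
                    .consumer(kafkaConsumer)
                    .build();

            // Concrete type used so that isClosedOrFailed() is available for monitoring
            ParallelEoSStreamProcessor<String, String> processor = new ParallelEoSStreamProcessor<>(options);
            processor.subscribe(List.of("my-topic"));   // "my-topic" is a placeholder
            processor.poll(context -> {
                // application processing goes here
                System.out.println("processed offset " + context.getSingleConsumerRecord().offset());
            });

            // Monitoring: block until Parallel Consumer reports itself closed or failed
            while (!processor.isClosedOrFailed()) {
                Thread.sleep(Duration.ofSeconds(5).toMillis());
            }

            // The failure reason could be inspected here to decide whether to restart or give up;
            // this naive version always restarts after a short back-off.
            Thread.sleep(Duration.ofSeconds(10).toMillis());
        }
    }
}
```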
-
Hey @rkolesnev, I just want to add that my team has also seen this issue twice in the past two weeks. Some of our Kafka brokers are being restarted for maintenance this week, which causes a commit to fail; the Parallel Consumer then shuts down and doesn't recover, even after multiple hours. I understand your point that the logic for recovering from these errors should be on the developer's side, but I'll also add a +1 in favour of adding some sort of error handling capability to Parallel Consumer (even if disabled by default).
-
Hi All,
We are using parallel-consumer-core 0.5.2.8. Currently, when we face network issues where the consumer is not able to reach the broker, the consumer group shuts down and we see the error below. Once the network is back up, we do not see it connecting back again. Should the offset commit timeout be larger than the session timeout?
I assume this is a misconfiguration - could someone point me to the right value to set for the commit timeout so that the consumer does not crash?

[ERROR] 2024-05-29 10:41:14.468 [pc-control] AbstractParallelEoSStreamProcessor - Error from poll control thread, will attempt controlled shutdown, then rethrow. Error: Timeout waiting for commit response PT30S to request ConsumerOffsetCommitter.CommitRequest(id=e26af99f-df1d-4a43-9d9c-a6c5d9321553, requestedAtMs=1716972064466)

The commitTimeout is injected from offsetCommitTimeout in ParallelConsumerOptions (default value).
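For reference, a minimal sketch of overriding that timeout through the options builder - the 120-second value is purely illustrative, and whether the commit timeout should exceed the consumer session timeout depends on your cluster and consumer settings:

```java
import io.confluent.parallelconsumer.ParallelConsumerOptions;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Properties;

public class CommitTimeoutConfig {

    static ParallelConsumerOptions<String, String> optionsWithLongerCommitTimeout(Properties consumerProps) {
        KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(consumerProps);

        return ParallelConsumerOptions.<String, String>builder()
                .consumer(kafkaConsumer)
                // Raise the timeout used when waiting for ConsumerOffsetCommitter.CommitRequest responses;
                // 120s is only an example value - tune it against your broker and session timeouts
                .offsetCommitTimeout(Duration.ofSeconds(120))
                .build();
    }
}
```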