Transactional Producer instance gets timeout getting commit lock while second instance starts #803

Open
PatrickChauveau opened this issue Jun 12, 2024 · 3 comments

PatrickChauveau commented Jun 12, 2024

Hi,

While instance A is consuming/producing records from a topic full of records and instance B is started, the following sequence takes place:

  1. time 0:
  • Instance B requests to join the group; at this point several timeout clocks start.
  • Instance A stops consuming and logs "Request joining group due to: group is already rebalancing" every 3 seconds.
  2. time 0 + max.poll.interval.ms:
  • Instance B logs "JoinGroup failed: The coordinator is not aware of this member".
  • Instance A logs "consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms".
  • Instance B keeps requesting to join the group and then logs this sequence of messages:

Successfully joined group with generation Generation{...}
Successfully synced group in generation Generation{...}
Notifying assignor about the new Assignment(partitions=[...])
Assigned 4 total (4 new) partition(s) [...]
Setting offset for partition [...]

  • Instance B starts consuming.
  3. time 0 + request.timeout.ms:
  • Instance A logs "Resetting the last seen epoch of partition".
  4. time 0 + commitLockAcquisitionTimeout:
  • Instance A finally produces a stack trace with the following messages and is permanently stuck (see the producer-fencing illustration after this log excerpt):
User provided listener io.confluent.parallelconsumer.ParallelEoSStreamProcessor failed on invocation of onPartitionsRevoked for partitions
Caused by: java.util.concurrent.TimeoutException: Timeout getting commit lock (which was set to PT5M). Slow processing or too many records being ack'd? Try increasing the commit lock timeout (commitLockAcquisitionTimeout), or reduce your record processing time.
Error from poll control thread, will attempt controlled shutdown, then rethrow. Error: There is a newer producer with the same transactionalId which fences the current one
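
For context on that last error, here is a minimal, standalone illustration of how producer fencing arises with plain Kafka clients (this is not Parallel Consumer internals; broker address, topic name and transactional.id are placeholders): a second producer registering the same transactional.id bumps the producer epoch and fences the first one.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class FencingSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "shared-tx-id");    // same id for both producers
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        KafkaProducer<String, String> first = new KafkaProducer<>(props);
        first.initTransactions();
        first.beginTransaction();
        first.send(new ProducerRecord<>("some-topic", "k", "v"));

        // A second producer registering the same transactional.id bumps the producer epoch
        // on the broker, which invalidates ("fences") the first producer.
        KafkaProducer<String, String> second = new KafkaProducer<>(props);
        second.initTransactions();

        try {
            first.commitTransaction(); // fails: the broker now only accepts the newer epoch
        } catch (KafkaException e) {
            // Typically a ProducerFencedException, i.e. "There is a newer producer with the
            // same transactionalId which fences the current one".
            first.close();
        }
        second.close();
    }
}
```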

The configuration used is as follows (a builder sketch follows the list):

  • PC version : 0.5.2.8
  • number of partitions : 4
  • max.poll.records: 150
  • max.poll.interval.ms: 180000
  • request.timeout.ms: 210000
  • linger.ms = 65
  • delivery.timeout.ms: 240000
  • allowEagerProcessingDuringTransactionCommit=false,
  • commitLockAcquisitionTimeout=PT5M,
  • produceLockAcquisitionTimeout=PT1M,
  • commitInterval=PT0.1S,
  • ordering=UNORDERED,
  • commitMode=PERIODIC_TRANSACTIONAL_PRODUCER,
  • maxConcurrency=64,
  • invalidOffsetMetadataPolicy=FAIL,
  • defaultMessageRetryDelay=PT1S,
  • retryDelayProvider=null,
  • sendTimeout=PT10S,
  • offsetCommitTimeout=PT10S,
  • batchSize=1,
  • thresholdForTimeSpendInQueueWarning=PT10S,
  • maxFailureHistory=10,
  • shutdownTimeout=PT10S,
  • drainTimeout=PT30S,
  • messageBufferSize=50,
  • initialLoadFactor=2,
  • maximumLoadFactor=100
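
For reference, a minimal sketch of how an equivalent setup could be assembled. The builder method names are assumed to match the option names in the dump above, and the broker address, group id, transactional.id and (de)serializers are placeholders, not part of the report:

```java
import io.confluent.parallelconsumer.ParallelConsumerOptions;
import io.confluent.parallelconsumer.ParallelConsumerOptions.CommitMode;
import io.confluent.parallelconsumer.ParallelConsumerOptions.ProcessingOrder;
import io.confluent.parallelconsumer.ParallelEoSStreamProcessor;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.time.Duration;
import java.util.Properties;

public class PcSetupSketch {

    public static ParallelEoSStreamProcessor<String, String> build() {
        // Consumer client settings matching the report.
        Properties c = new Properties();
        c.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        c.put(ConsumerConfig.GROUP_ID_CONFIG, "pc-group");
        c.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
        c.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 150);
        c.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 180_000);
        c.put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, 210_000);
        c.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        c.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // Transactional producer client settings matching the report.
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        p.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "pc-tx-id");
        p.put(ProducerConfig.LINGER_MS_CONFIG, 65);
        p.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 240_000);
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Parallel Consumer options matching the dump above.
        ParallelConsumerOptions<String, String> options = ParallelConsumerOptions.<String, String>builder()
                .consumer(new KafkaConsumer<>(c))
                .producer(new KafkaProducer<>(p))
                .ordering(ProcessingOrder.UNORDERED)
                .commitMode(CommitMode.PERIODIC_TRANSACTIONAL_PRODUCER)
                .maxConcurrency(64)
                .allowEagerProcessingDuringTransactionCommit(false)
                .commitLockAcquisitionTimeout(Duration.ofMinutes(5))
                .produceLockAcquisitionTimeout(Duration.ofMinutes(1))
                .commitInterval(Duration.ofMillis(100))
                .build();

        return new ParallelEoSStreamProcessor<>(options);
    }
}
```
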
rkolesnev (Member) commented

Hi @PatrickChauveau - thanks for the report.

Is the issue consistently reproducible?
Is just starting a 2nd instance while 1 instance is active enough to observe it?

Would it be possible for you to create a reproducible example? Either as a simple app or an integration test - that would help a lot in identifying the potential bug.

rkolesnev (Member) commented Jun 14, 2024

Yep - I have found a race condition - it is between commitOffsets on normal work and commitOffsets due to partition revocation.
We had already added additional checks / guards there - but apparently there is still a synchronisation issue somewhere leading to a deadlock.
Will investigate further to flush it out.
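
To illustrate the failure mode in general terms (a deliberately simplified model, not Parallel Consumer's actual locking code): if the regular commit path ends up holding a commit lock and never completing while the partition-revocation path does a timed acquire, the revocation side gives up with exactly the kind of timeout seen in the report.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantLock;

public class CommitLockSketch {

    private final ReentrantLock commitLock = new ReentrantLock();

    // Normal commit path: acquires the lock and, in the buggy scenario,
    // blocks while holding it (e.g. waiting on a commit that never completes).
    void commitOnNormalWork() throws InterruptedException {
        commitLock.lock();
        try {
            Thread.sleep(Long.MAX_VALUE); // stand-in for the stuck commit
        } finally {
            commitLock.unlock();
        }
    }

    // Revocation path: tries a timed acquire and fails after commitLockAcquisitionTimeout.
    void commitOnPartitionsRevoked() throws InterruptedException, TimeoutException {
        if (!commitLock.tryLock(5, TimeUnit.MINUTES)) {
            throw new TimeoutException(
                    "Timeout getting commit lock (which was set to PT5M). "
                            + "Slow processing or too many records being ack'd?");
        }
        try {
            // commit offsets for the revoked partitions
        } finally {
            commitLock.unlock();
        }
    }
}
```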

PatrickChauveau (Author) commented

Glad you found a clue! It does not appear every time - maybe 50% of occurrences. We put in place a health endpoint wired to a Kubernetes liveness probe to work around the issue (it sometimes needs 2 or 3 restarts to stabilize). ParallelEosStreamProcessor.getFailureCause() helped a lot for this! Failure message: Error from poll control thread: There is a newer producer with the same transactionalId which fences the current one.
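
As an aside, a minimal sketch of such a liveness check, assuming a plain JDK HttpServer and assuming getFailureCause() (the call mentioned above) returns null while the processor is healthy; the endpoint path and port are hypothetical:

```java
import com.sun.net.httpserver.HttpServer;
import io.confluent.parallelconsumer.ParallelEoSStreamProcessor;

import java.io.OutputStream;
import java.net.InetSocketAddress;

public class LivenessEndpoint {

    // Exposes /healthz: 200 while the processor reports no failure cause, 503 once it does,
    // so the Kubernetes liveness probe restarts the pod.
    public static void start(ParallelEoSStreamProcessor<?, ?> processor) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/healthz", exchange -> {
            boolean failed = processor.getFailureCause() != null; // assumed null when healthy
            byte[] body = (failed ? "DOWN" : "UP").getBytes();
            exchange.sendResponseHeaders(failed ? 503 : 200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}
```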

@rkolesnev rkolesnev added the verified bug Something isn't working label Jun 14, 2024