Performance degradation because a memory leak in async operations #358

vjrj · 2019-12-30T08:47:35Z

Async updates in Cassandra3PersistenceManager has a memory leak in MoreExecutors call that create new objects unnecessary.

In detail, these MoreExecutors.getExistingExecutorService calls despite of its name that subjects that your a getting some singleton creates new MoreExecutor.Application objects:

  public static ExecutorService getExitingExecutorService(ThreadPoolExecutor executor) {
    return (new MoreExecutors.Application()).getExitingExecutorService(executor);
  }

that is causing the memory leak and the degradation.

The symptoms of this performance degradation is in some long task like process-local-node. After few millions of occurrences, the records/sec start to degrade until the task start to throw exceptions and timeouts trying to update Cassandra. The logs are like this:

biocache-store-1 2019-12-24 15:38:51,039 INFO : [ProcessLocalRecords] - Record/sec:113,  updated:3720000, read:3720001, updateFail:0  Last rowkey: eef5f253-ba82-4ddf-8899-a5f831fac75d  Last 1000 in 88.704
biocache-store-1 2019-12-24 15:40:04,687 INFO : [ProcessLocalRecords] - Record/sec:136,  updated:3730000, read:3730000, updateFail:0  Last rowkey: 0c78644c-9fca-48cb-a7be-8cb37d69a070  Last 1000 in 73.648
biocache-store-1 2019-12-24 15:41:20,234 INFO : [ProcessLocalRecords] - Record/sec:132,  updated:3740000, read:3740001, updateFail:0  Last rowkey: eb235f37-9014-4d0e-b8b9-de1948e8fd4f  Last 1000 in 75.547
biocache-store-1 2019-12-24 15:42:50,302 INFO : [ProcessLocalRecords] - Record/sec:111,  updated:3750000, read:3750001, updateFail:0  Last rowkey: 116bd636-d090-4743-8606-e6b98ee18d10  Last 1000 in 90.068
biocache-store-1 2019-12-24 15:44:24,234 ERROR: [Cassandra3PersistenceManager] - Exception thrown during paging. Retry count 0 - All host(s) tried for query failed (tried: biocache-store-1.gbif.es/127.0.0.1:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [biocache-store-1.gbif.es/127.0.0.1:9042] Timed out waiting for server response))
biocache-store-1 2019-12-24 15:44:24,236 ERROR: [Cassandra3PersistenceManager] - Backing off for 5 minutes. Retry count 1 - All host(s) tried for query failed (tried: biocache-store-1.gbif.es/127.0.0.1:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [biocache-store-1.gbif.es/127.0.0.1:9042] Timed out waiting for server response))
biocache-store-1 2019-12-24 15:44:37,392 INFO : [ProcessLocalRecords] - Record/sec:93,  updated:3760000, read:3760001, updateFail:0  Last rowkey: 2dee63c7-0d8b-4e68-86cf-b1769dbec953  Last 1000 in 107.09
biocache-store-1 2019-12-24 15:46:54,044 INFO : [ProcessLocalRecords] - Record/sec:73,  updated:3770000, read:3770001, updateFail:0  Last rowkey: 95d4c87c-1209-4dd3-b5fc-4963cec07c73  Last 1000 in 136.652
biocache-store-1 2019-12-24 15:49:42,961 INFO : [ProcessLocalRecords] - Record/sec:59,  updated:3780000, read:3780001, updateFail:0  Last rowkey: e1fef9f5-93be-47d0-b609-1ed9d3c13745  Last 1000 in 168.916
biocache-store-1 2019-12-24 15:54:29,627 ERROR: [Cassandra3PersistenceManager] - Exception thrown during paging. Retry count 0 - All host(s) tried for query failed (tried: biocache-store-1.gbif.es/127.0.0.1:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [biocache-store-1.gbif.es/127.0.0.1:9042] Timed out waiting for server response))
biocache-store-1 2019-12-24 15:54:29,628 ERROR: [Cassandra3PersistenceManager] - Backing off for 5 minutes. Retry count 1 - All host(s) tried for query failed (tried: biocache-store-1.gbif.es/127.0.0.1:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [biocache-store-1.gbif.es/127.0.0.1:9042] Timed out waiting for server response))
biocache-store-1 2019-12-24 15:55:08,262 INFO : [ProcessLocalRecords] - Record/sec:31,  updated:3790000, read:3790001, updateFail:0  Last rowkey: aa1f460c-7af6-4293-86cb-d299fafcb1f0  Last 1000 in 325.302
biocache-store-1 2019-12-24 16:06:54,645 INFO : [ProcessLocalRecords] - Record/sec:14,  updated:3800000, read:3800000, updateFail:0  Last rowkey: c4508b46-6a47-48e2-82f1-c4a1b72f6059  Last 1000 in 706.382
biocache-store-1 2019-12-24 16:14:47,653 ERROR: [ClassificationProcessor] - Exception during classification match for record 094ea831-af1d-435d-b141-daa21badb17f
com.datastax.driver.core.exceptions.OperationTimedOutException: [biocache-store-1.gbif.es/127.0.0.1:9042] Timed out waiting for server response
        at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:43)
        at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:25)
        at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:35)
        at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:293)
        at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:58)

This was reported by @shahmanash in this slack thread but we were also suffering in our platform.

Some screenshots of visualvm headdumps of biocache command when this happens:

I've tested that just moving this getExsitingExecutorService out of the putAsync function solve the issue, and long tasks ends without this performance degradation (and only one MoreExecutor.Application object is created. I'll send a PR in short. After patching:

This only happens when cassandra.async.updates.threads > 1.

Other uses of MoreExecutor should be verified to detect some similar problems.

The text was updated successfully, but these errors were encountered:

Fix for #358

vjrj · 2020-03-23T20:36:51Z

Related https://atlaslivingaustralia.slack.com/archives/CCTFGEU1G/p1584961155087800

peggynewman · 2020-03-25T03:32:26Z

Merged, and running in production seemingly without error.

vjrj added a commit to GBIFes/biocache-store that referenced this issue Dec 30, 2019

Fix for AtlasOfLivingAustralia#358

1fcfebc

vjrj mentioned this issue Dec 30, 2019

Fix for #358 #359

Closed

vjrj added a commit to GBIFes/biocache-store that referenced this issue Jan 6, 2020

Fix for AtlasOfLivingAustralia#358

50b5763

vjrj mentioned this issue Jan 6, 2020

Fix for #358 #360

Merged

ansell added a commit that referenced this issue Jan 6, 2020

Merge pull request #360 from GBIFes/fix-358-dev

4eeea5e

Fix for #358

ansell added bug IAD labels Mar 9, 2020

peggynewman closed this as completed Mar 25, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance degradation because a memory leak in async operations #358

Performance degradation because a memory leak in async operations #358

vjrj commented Dec 30, 2019 •

edited

Loading

vjrj commented Mar 23, 2020

peggynewman commented Mar 25, 2020

Performance degradation because a memory leak in async operations #358

Performance degradation because a memory leak in async operations #358

Comments

vjrj commented Dec 30, 2019 • edited Loading

vjrj commented Mar 23, 2020

peggynewman commented Mar 25, 2020

vjrj commented Dec 30, 2019 •

edited

Loading