Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance degradation because a memory leak in async operations #358

Closed
vjrj opened this issue Dec 30, 2019 · 2 comments
Closed

Performance degradation because a memory leak in async operations #358

vjrj opened this issue Dec 30, 2019 · 2 comments

Comments

@vjrj
Copy link
Contributor

vjrj commented Dec 30, 2019

Async updates in Cassandra3PersistenceManager has a memory leak in MoreExecutors call that create new objects unnecessary.

In detail, these MoreExecutors.getExistingExecutorService calls despite of its name that subjects that your a getting some singleton creates new MoreExecutor.Application objects:

  public static ExecutorService getExitingExecutorService(ThreadPoolExecutor executor) {
    return (new MoreExecutors.Application()).getExitingExecutorService(executor);
  }

that is causing the memory leak and the degradation.

The symptoms of this performance degradation is in some long task like process-local-node. After few millions of occurrences, the records/sec start to degrade until the task start to throw exceptions and timeouts trying to update Cassandra. The logs are like this:

biocache-store-1 2019-12-24 15:38:51,039 INFO : [ProcessLocalRecords] - Record/sec:113,  updated:3720000, read:3720001, updateFail:0  Last rowkey: eef5f253-ba82-4ddf-8899-a5f831fac75d  Last 1000 in 88.704
biocache-store-1 2019-12-24 15:40:04,687 INFO : [ProcessLocalRecords] - Record/sec:136,  updated:3730000, read:3730000, updateFail:0  Last rowkey: 0c78644c-9fca-48cb-a7be-8cb37d69a070  Last 1000 in 73.648
biocache-store-1 2019-12-24 15:41:20,234 INFO : [ProcessLocalRecords] - Record/sec:132,  updated:3740000, read:3740001, updateFail:0  Last rowkey: eb235f37-9014-4d0e-b8b9-de1948e8fd4f  Last 1000 in 75.547
biocache-store-1 2019-12-24 15:42:50,302 INFO : [ProcessLocalRecords] - Record/sec:111,  updated:3750000, read:3750001, updateFail:0  Last rowkey: 116bd636-d090-4743-8606-e6b98ee18d10  Last 1000 in 90.068
biocache-store-1 2019-12-24 15:44:24,234 ERROR: [Cassandra3PersistenceManager] - Exception thrown during paging. Retry count 0 - All host(s) tried for query failed (tried: biocache-store-1.gbif.es/127.0.0.1:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [biocache-store-1.gbif.es/127.0.0.1:9042] Timed out waiting for server response))
biocache-store-1 2019-12-24 15:44:24,236 ERROR: [Cassandra3PersistenceManager] - Backing off for 5 minutes. Retry count 1 - All host(s) tried for query failed (tried: biocache-store-1.gbif.es/127.0.0.1:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [biocache-store-1.gbif.es/127.0.0.1:9042] Timed out waiting for server response))
biocache-store-1 2019-12-24 15:44:37,392 INFO : [ProcessLocalRecords] - Record/sec:93,  updated:3760000, read:3760001, updateFail:0  Last rowkey: 2dee63c7-0d8b-4e68-86cf-b1769dbec953  Last 1000 in 107.09
biocache-store-1 2019-12-24 15:46:54,044 INFO : [ProcessLocalRecords] - Record/sec:73,  updated:3770000, read:3770001, updateFail:0  Last rowkey: 95d4c87c-1209-4dd3-b5fc-4963cec07c73  Last 1000 in 136.652
biocache-store-1 2019-12-24 15:49:42,961 INFO : [ProcessLocalRecords] - Record/sec:59,  updated:3780000, read:3780001, updateFail:0  Last rowkey: e1fef9f5-93be-47d0-b609-1ed9d3c13745  Last 1000 in 168.916
biocache-store-1 2019-12-24 15:54:29,627 ERROR: [Cassandra3PersistenceManager] - Exception thrown during paging. Retry count 0 - All host(s) tried for query failed (tried: biocache-store-1.gbif.es/127.0.0.1:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [biocache-store-1.gbif.es/127.0.0.1:9042] Timed out waiting for server response))
biocache-store-1 2019-12-24 15:54:29,628 ERROR: [Cassandra3PersistenceManager] - Backing off for 5 minutes. Retry count 1 - All host(s) tried for query failed (tried: biocache-store-1.gbif.es/127.0.0.1:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [biocache-store-1.gbif.es/127.0.0.1:9042] Timed out waiting for server response))
biocache-store-1 2019-12-24 15:55:08,262 INFO : [ProcessLocalRecords] - Record/sec:31,  updated:3790000, read:3790001, updateFail:0  Last rowkey: aa1f460c-7af6-4293-86cb-d299fafcb1f0  Last 1000 in 325.302
biocache-store-1 2019-12-24 16:06:54,645 INFO : [ProcessLocalRecords] - Record/sec:14,  updated:3800000, read:3800000, updateFail:0  Last rowkey: c4508b46-6a47-48e2-82f1-c4a1b72f6059  Last 1000 in 706.382
biocache-store-1 2019-12-24 16:14:47,653 ERROR: [ClassificationProcessor] - Exception during classification match for record 094ea831-af1d-435d-b141-daa21badb17f
com.datastax.driver.core.exceptions.OperationTimedOutException: [biocache-store-1.gbif.es/127.0.0.1:9042] Timed out waiting for server response
        at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:43)
        at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:25)
        at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:35)
        at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:293)
        at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:58)

This was reported by @shahmanash in this slack thread but we were also suffering in our platform.

Some screenshots of visualvm headdumps of biocache command when this happens:
Captura de pantalla de 2019-12-27 11-43-32

I've tested that just moving this getExsitingExecutorService out of the putAsync function solve the issue, and long tasks ends without this performance degradation (and only one MoreExecutor.Application object is created. I'll send a PR in short. After patching:

Captura de pantalla de 2019-12-30 18-27-27

This only happens when cassandra.async.updates.threads > 1.

Other uses of MoreExecutor should be verified to detect some similar problems.

vjrj added a commit to GBIFes/biocache-store that referenced this issue Dec 30, 2019
@vjrj vjrj mentioned this issue Dec 30, 2019
vjrj added a commit to GBIFes/biocache-store that referenced this issue Jan 6, 2020
@vjrj vjrj mentioned this issue Jan 6, 2020
ansell added a commit that referenced this issue Jan 6, 2020
@vjrj
Copy link
Contributor Author

vjrj commented Mar 23, 2020

@peggynewman
Copy link

Merged, and running in production seemingly without error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants