
Cache performance #2063

Closed
ben-manes opened this issue May 29, 2015 · 8 comments

@ben-manes
Contributor

Guava's cache performance has a lot of potential for improvement, at the cost of a higher initial memory footprint. The benchmark, with the raw numbers below, shows Guava's performance on a 16-core Xeon E5-2698B v3 @ 2.00GHz (hyperthreading disabled) running Ubuntu 15.04 with JDK 1.8.0_45.

Of note is that Guava's default concurrency level (4) results in performance similar to a synchronized LinkedHashMap. This is due to the cache's overhead, in particular the GC thrashing caused by ConcurrentLinkedQueue and a hot read counter. It should also be noted that synchronization was slow in JDK 5 when the cache was developed, but improved significantly in JDK 6u22 and beyond. This and other improvements help LinkedHashMap's performance compared to what was observed when Guava's cache was being developed.

The easiest way to double Guava's performance would be to replace the recencyQueue and readCount with a ring buffer. This would replace an unbounded linked queue with a 16-element array. When the buffer is full, a clean-up is triggered and subsequent additions are skipped (no CAS) until slots become available. This could be adapted from this implementation.
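A minimal sketch of that ring-buffer idea, assuming a power-of-two buffer that is drained under the eviction lock (illustrative only, not Guava's or Caffeine's actual code; all names here are made up):

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.function.Consumer;

// A fixed-size, lossy read buffer: reads are recorded into a small array and
// silently dropped when it is full, replacing an unbounded ConcurrentLinkedQueue
// and a hot shared read counter.
final class ReadBuffer<E> {
  private static final int SIZE = 16;      // power of two
  private static final int MASK = SIZE - 1;

  private final AtomicReferenceArray<E> buffer = new AtomicReferenceArray<>(SIZE);
  private final AtomicLong writeCounter = new AtomicLong();
  private volatile long readCounter;       // advanced only by the draining thread

  /** Records a read; returns true when the buffer is full and a drain should run. */
  boolean record(E entry) {
    long writes = writeCounter.get();
    if (writes - readCounter >= SIZE) {
      return true;                          // full: skip the CAS entirely until drained
    }
    if (writeCounter.compareAndSet(writes, writes + 1)) {
      buffer.lazySet((int) (writes & MASK), entry);
    }
    return false;                           // a lost race just drops the read, which is fine
  }

  /** Drains under the eviction lock, replaying the recorded reads into the LRU policy. */
  void drainTo(Consumer<E> policy) {
    long reads = readCounter;
    long writes = writeCounter.get();
    for (; reads < writes; reads++) {
      int index = (int) (reads & MASK);
      E entry = buffer.get(index);
      if (entry == null) {
        break;                              // slot claimed but not yet published by its writer
      }
      buffer.lazySet(index, null);
      policy.accept(entry);
    }
    readCounter = reads;                    // publish progress, freeing slots for writers
  }
}
```

The key property is that readers never block and never allocate: a full buffer or a lost CAS simply means a recency update is dropped, which an LRU-style policy tolerates.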

Given the increase in cores and server memory, it may also be time to reconsider the default concurrency level. The original choice was driven by memory overhead, on the assumption that most caches are not highly contended and that the value can be adjusted as needed. As core counts and system memory have grown, the default is worth revisiting.
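For reference, the knob already exists on `CacheBuilder`, so callers can raise it themselves today; the size and level below are arbitrary example values:

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

public class ConcurrencyLevelExample {
  public static void main(String[] args) {
    // Raising the concurrency level trades a larger initial footprint for more
    // segments and therefore better write throughput under contention.
    Cache<String, byte[]> cache = CacheBuilder.newBuilder()
        .maximumSize(10_000)
        .concurrencyLevel(64)   // the default is 4
        .build();
    cache.put("key", new byte[0]);
  }
}
```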

Read (100%)

| Cache | 8 threads (ops/s) | 16 threads (ops/s) |
|---|---:|---:|
| **Unbounded** | | |
| ConcurrentHashMap (v8) | 560,367,163 | 1,171,389,095 |
| ConcurrentHashMap (v7) | 301,331,240 | 542,304,172 |
| **Bounded** | | |
| Caffeine | 181,703,298 | 365,508,966 |
| ConcurrentLinkedHashMap | 154,771,582 | 313,892,223 |
| LinkedHashMap_Lru | 9,209,065 | 13,598,576 |
| Guava (default) | 12,434,655 | 10,647,238 |
| Guava (64) | 24,533,922 | 43,101,468 |
| Ehcache2_Lru | 11,252,172 | 20,750,543 |
| Ehcache3_Lru | 11,415,248 | 17,611,169 |
| Infinispan_Old_Lru | 29,073,439 | 49,719,833 |
| Infinispan_New_Lru | 4,888,027 | 4,749,506 |

Read (75%) / Write (25%)

| Cache | 8 threads (ops/s) | 16 threads (ops/s) |
|---|---:|---:|
| **Unbounded** | | |
| ConcurrentHashMap (v8) | 441,965,711 | 790,602,730 |
| ConcurrentHashMap (v7) | 196,215,481 | 346,479,582 |
| **Bounded** | | |
| Caffeine | 112,622,075 | 235,178,775 |
| ConcurrentLinkedHashMap | 63,968,369 | 122,342,605 |
| LinkedHashMap_Lru | 8,668,785 | 12,779,625 |
| Guava (default) | 11,782,063 | 11,886,673 |
| Guava (64) | 22,782,431 | 37,332,090 |
| Ehcache2_Lru | 9,472,810 | 8,471,016 |
| Ehcache3_Lru | 10,958,697 | 17,302,523 |
| Infinispan_Old_Lru | 22,663,359 | 37,270,102 |
| Infinispan_New_Lru | 4,753,313 | 4,885,061 |

Write (100%)

| Cache | 8 threads (ops/s) | 16 threads (ops/s) |
|---|---:|---:|
| **Unbounded** | | |
| ConcurrentHashMap (v8) | 60,477,550 | 50,591,346 |
| ConcurrentHashMap (v7) | 46,204,091 | 36,659,485 |
| **Bounded** | | |
| Caffeine | 55,281,751 | 47,482,019 |
| ConcurrentLinkedHashMap | 23,819,597 | 39,797,969 |
| LinkedHashMap_Lru | 10,179,891 | 10,859,549 |
| Guava (default) | 4,764,056 | 5,446,282 |
| Guava (64) | 8,128,024 | 7,483,986 |
| Ehcache2_Lru | 4,205,936 | 4,697,745 |
| Ehcache3_Lru | 10,051,020 | 13,939,317 |
| Infinispan_Old_Lru | 7,538,859 | 7,332,973 |
| Infinispan_New_Lru | 4,797,502 | 5,086,305 |
@cruftex

cruftex commented May 29, 2015

Is the "op" a read, update or insert? The latest stuff I did in cache2k only have a bottleneck for the insert operation, which implies structural changes and therefor needs locking. All other operations run fully concurrently. Can you maybe rerun your benchmarks with a cache2k setup I give you?

@ben-manes
Contributor Author

You can send a pull request to add a new cache type to the GetPutBenchmark. I looked at cache2k a while back and am not confident in it being thread-safe, so it may perform well but corrupt itself.
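To give a sense of the shape, a get/put workload under JMH looks roughly like the sketch below. This is not the actual GetPutBenchmark from the Caffeine repository, just an illustration using Guava's cache; swapping in another implementation means replacing the cache field and its calls:

```java
import java.util.concurrent.ThreadLocalRandom;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class GetPutSketch {
  private Cache<Integer, Boolean> cache;
  private Integer[] keys;

  @Setup
  public void setup() {
    keys = new Integer[1_000];
    cache = CacheBuilder.newBuilder().maximumSize(512).build();
    for (int i = 0; i < keys.length; i++) {
      keys[i] = i;
      cache.put(keys[i], Boolean.TRUE);   // pre-populate so reads mostly hit
    }
  }

  @Benchmark
  public Boolean read() {
    return cache.getIfPresent(keys[ThreadLocalRandom.current().nextInt(keys.length)]);
  }

  @Benchmark
  public void write() {
    cache.put(keys[ThreadLocalRandom.current().nextInt(keys.length)], Boolean.TRUE);
  }
}
```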

@cruftex

cruftex commented May 30, 2015

> I looked at cache2k a while back and am not confident in it being thread-safe, so it may perform well but corrupt itself.

Yes, that's quite a normal reaction. I regularly have these concerns when I look at a code area that I haven't touched for some months. I walk through the design and the JMM again and finally find good sleep. So far we have had no corruption in production, even for releases marked as experimental.

@ben-manes
Contributor Author

I know adding to an unfamiliar code base is confusing, so I made the addition myself and will send you a pull request to review. Stepping through it, I see where some of my initial concerns regarding races are addressed for the LruCache.

On my laptop I get the following results, which match the Desktop-class benchmarks section.

| Benchmark | 8 threads (ops/s) |
|---|---:|
| Read (100%) | 8,864,382 |
| Read (75%) / Write (25%) | 8,064,139 |
| Write (100%) | 2,489,101 |

The performance matches what I'd expect from a top-level lock, which appears to be how the LRU is implemented. I haven't looked at why writes are slower, since they should take roughly the same amount of time; it may be due to synchronization or degradation in the custom hash table.
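For comparison, a top-level-lock LRU is essentially the classic synchronized `LinkedHashMap` pattern, where every read takes the same global lock because access order mutates the linked list (a generic illustration, not cache2k's actual code):

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// One monitor guards both the hash table and the access-order list, so reads
// and writes all serialize on it; throughput flattens out as threads are added.
final class SynchronizedLruCache {
  static <K, V> Map<K, V> create(int maximumSize) {
    return Collections.synchronizedMap(new LinkedHashMap<K, V>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maximumSize;   // evict the least recently used entry
      }
    });
  }
}
```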

@ben-manes
Contributor Author

A quick-and-dirty hack to Guava resulted in up to a 25x read performance gain at the default concurrency level (4).

| Benchmark | 8 threads (ops/s) | 16 threads (ops/s) |
|---|---:|---:|
| Read (100%) | 121,808,437 | 260,593,210 |
| Read (75%) / Write (25%) | 34,770,671 | 48,538,555 |
| Write (100%) | 5,064,388 | 6,045,980 |

We can increase the concurrency level (to 64) for better write throughput, but unsurprisingly this decreases the read throughput. This is because CPU caching effects work against us: ConcurrentHashMap reads are faster with fewer segments (576M ops/s), and Guava's recencyQueue is associated with the segment instead of the processor. The queue fills up more slowly, making it more often contended and less often in the L1/L2 cache. We still see a substantial gain, though.

| Benchmark | 8 threads (ops/s) | 16 threads (ops/s) |
|---|---:|---:|
| Read (100%) | 40,935,985 | 84,109,039 |
| Read (75%) / Write (25%) | 40,854,227 | 64,462,500 |
| Write (100%) | 6,757,178 | 6,570,233 |
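One way to decouple the buffer from the segment is to stripe the read buffers by thread rather than by segment, which is the direction Caffeine takes. A rough sketch, reusing the hypothetical ReadBuffer from the first comment:

```java
// Picking a stripe from the calling thread keeps the buffer hot in that core's
// cache and spreads contention, independent of how many segments the map has.
final class StripedReadBuffers<E> {
  private static final int STRIPES =
      Integer.highestOneBit(4 * Runtime.getRuntime().availableProcessors());
  private static final int MASK = STRIPES - 1;

  @SuppressWarnings("unchecked")
  private final ReadBuffer<E>[] buffers = new ReadBuffer[STRIPES];

  StripedReadBuffers() {
    for (int i = 0; i < STRIPES; i++) {
      buffers[i] = new ReadBuffer<>();
    }
  }

  /** Records a read into the stripe chosen by the calling thread's identity. */
  boolean record(E entry) {
    int index = ((int) Thread.currentThread().getId()) & MASK;
    return buffers[index].record(entry);
  }
}
```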

Because we are no longer thrashing on readCounter, the compute performance increases substantially as well. Here we use 32 threads with Guava at the default concurrency level.

| Compute | sameKey (ops/s) | spread (ops/s) |
|---|---:|---:|
| ConcurrentHashMap (v8) | 30,321,032 | 64,427,143 |
| Guava (current) | 23,149,556 | 71,423,966 |
| Guava (patched) | 329,220,794 | 166,448,754 |
| Caffeine | 1,558,302,420 | 533,181,707 |
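For context, the compute workload exercises the loading path (`LoadingCache.get`), where a miss computes the value under the entry's lock; a minimal usage example with an arbitrary loader function:

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

public class ComputeExample {
  public static void main(String[] args) throws Exception {
    // The sameKey case hammers a single entry; the spread case fans threads
    // out over many distinct keys.
    CacheLoader<Integer, Integer> loader = CacheLoader.from(key -> key * 2);
    LoadingCache<Integer, Integer> cache = CacheBuilder.newBuilder()
        .maximumSize(10_000)
        .build(loader);

    System.out.println(cache.get(1));    // computed on the first call, cached afterwards
    System.out.println(cache.get(42));
  }
}
```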

I have been advocating this improvement since we first began, originally assuming that we'd get to it in a performance iteration. After leaving, I didn't have a good threaded benchmark to demonstrate the improvements, so my hand-wavy assertions were ignored. There are still significant gains to be had by moving to the Java 8 rewrite, but these changes are platform-compatible and non-invasive.

I don't know why I even bother anymore...

@lowasser
Contributor

lowasser commented Jun 4, 2015

I'm going to try to find time to take a crack at this, though I'm not 100% confident when I'll be able to.

@ben-manes
Contributor Author

Updated the patch so that unit tests compile & pass.

@kluever added the package=cache and type=performance labels on Nov 28, 2017
lucamilanesio pushed a commit to GerritCodeReview/gerrit that referenced this issue Feb 26, 2018
This deadlock is not the typical deadlock where two threads each lock a resource and then wait to lock a second resource already locked by the other thread. The thread owning the account cache lock is parked, which tells us that the lock was not released. I could not determine the exact sequence of events leading to this deadlock, making it really hard to report or fix the problem.

While investigating, I realized that there are quite a few reported issues in Guava that could be major for Gerrit:

  Our deadlock happening in account cache
  https://bugs.chromium.org/p/gerrit/issues/detail?id=7645

  Other deadlock
  google/guava#2976
  google/guava#2863

  Performance
  google/guava#2063

  Race condition
  google/guava#2971

Because I could not reproduce the deadlock in a dev environment or in a unit test, making it almost impossible to fix, I considered other options such as replacing Guava with something else.

The maintainer of the Caffeine[1] cache claims that Caffeine is a high-performance[2], near-optimal caching library designed and implemented based on the experience of designing Guava's cache and ConcurrentLinkedHashMap. I also wrote some benchmarks that spawn a lot of threads reading and writing values from the caches, and ran them against both Guava and Caffeine; Guava always took at least twice as long as Caffeine to complete all operations.

Migrating to Caffeine is almost a drop-in replacement. Caffeine's interfaces are very similar to Guava's cache, and there is an adapter for migrating to Caffeine while keeping Guava's interfaces. After migrating to Caffeine, we saw the deadlock occurrence drop from once a day to once every two weeks on our production server.

The maintainer of Caffeine, Ben Manes, pointed us to the possible cause[3] of this issue: a bug[4] in the kernel and its fix[5]. Our kernel version is supposed to have the fix, but we will try another OS and kernel version.

Replacing the Guava caches with Caffeine brings two things: it reduces the chances of hitting the deadlock, which is most likely caused by a kernel bug, and it improves cache performance.

[1]https://github.com/ben-manes/caffeine
[2]https://github.com/ben-manes/caffeine/wiki/Benchmarks
[3]https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64
[4]torvalds/linux@b0c29f7
[5]torvalds/linux@76835b0

Bug: Issue 7645
Change-Id: I8d2b17a94d0e9daf9fa9cdda563316c5768b29ae
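As an illustration of the adapter mentioned in the commit above, the caffeine-guava module exposes Caffeine behind Guava's `Cache` interface; a rough sketch (the builder settings are arbitrary, and existing call sites keep using `com.google.common.cache.Cache`):

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.guava.CaffeinatedGuava;
import com.google.common.cache.Cache;

public class CaffeineAdapterExample {
  public static void main(String[] args) {
    // CaffeinatedGuava wraps a Caffeine builder and returns Guava's Cache type,
    // so the migration does not have to touch the consuming code.
    Cache<String, String> cache = CaffeinatedGuava.build(
        Caffeine.newBuilder().maximumSize(10_000));

    cache.put("project", "gerrit");
    System.out.println(cache.getIfPresent("project"));
  }
}
```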
@netdpb added the P3 label on Aug 13, 2019
@ben-manes
Contributor Author

Closing since it seems unlikely to be resolved. This would be simple and a large speed-up, but it requires an internal champion.
