
spl kmem cache atomic operations hurt performance on large-scale systems #463

dweeezil opened this issue Jul 20, 2015 · 11 comments

@dweeezil
Contributor

The default kmem debugging (--enable-debug-kmem) can severely impact performance on large-scale NUMA systems due to the atomic operations used in the memory accounting. A 32-thread fio test running on a 40-core 80-thread system and performing 100% cached reads with kmem debugging enabled is:

READ: io=177071MB, aggrb=2951.2MB/s, minb=2951.2MB/s, maxb=2951.2MB/s, mint=60001msec, maxt=60001msec

and when SPL/ZFS is built with --disable-debug-kmem is:

READ: io=271454MB, aggrb=4524.4MB/s, minb=4524.4MB/s, maxb=4524.4MB/s, mint=60003msec, maxt=60003msec

This was discovered via flame graph profiling.

EDIT: Fixed the order of the stats.
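
For anyone not familiar with what --enable-debug-kmem turns on: the debug accounting keeps global allocation counters that every alloc and free updates with an atomic read-modify-write, so all CPUs end up hammering the same cache lines. The snippet below is only a simplified illustration of that pattern; the counter and helper names are made up and it is not the actual SPL source:

```c
#include <linux/atomic.h>

/* One global counter shared by every CPU in the system. */
static atomic64_t kmem_bytes_used = ATOMIC64_INIT(0);

static inline void example_alloc_account(size_t size)
{
	/*
	 * Every allocation performs a locked add on the same cache
	 * line, which ping-pongs between sockets under load.
	 */
	atomic64_add(size, &kmem_bytes_used);
}

static inline void example_free_account(size_t size)
{
	atomic64_sub(size, &kmem_bytes_used);
}
```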

@behlendorf
Contributor

@dweeezil wow, that's a huge difference. I wouldn't have thought the atomics would be that expensive. It sounds like the thing to do here is:

  1. Change the default value to --disable-debug-kmem
  2. Fix the spl-proc code so this doesn't result in /proc/spl/kmem/slab being disabled.

We can manually set --enable-debug-kmem for the buildbot so things like memory leaks still get automatically flagged. Alternatively, these counters could be turned into per-cpu values.
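
If we went the per-cpu route, the hot path would only touch a CPU-local counter and readers would pay the cost of summing. A rough sketch of what that could look like using the kernel's generic percpu_counter API (illustrative only; the names are made up and this isn't proposed SPL code):

```c
#include <linux/percpu_counter.h>
#include <linux/gfp.h>

static struct percpu_counter kmem_bytes_used;

/* One-time setup, e.g. at module load; error handling omitted. */
static int example_account_init(void)
{
	return percpu_counter_init(&kmem_bytes_used, 0, GFP_KERNEL);
}

static inline void example_alloc_account(size_t size)
{
	/* Updates a CPU-local delta; the shared total is only
	 * touched when the delta exceeds the batch threshold. */
	percpu_counter_add(&kmem_bytes_used, size);
}

static inline void example_free_account(size_t size)
{
	percpu_counter_sub(&kmem_bytes_used, size);
}

/* Readers (e.g. the /proc/spl handlers) pay the summing cost. */
static inline s64 example_account_read(void)
{
	return percpu_counter_sum(&kmem_bytes_used);
}
```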

(BTW I think you pasted the performance data in backwards)

@behlendorf behlendorf added the Bug label Jul 20, 2015
@behlendorf behlendorf added this to the 0.6.5 milestone Jul 20, 2015
@kernelOfTruth

/me rubs his eyes

@dweeezil are those stats inverted for each case (debug and non-debug)?

Otherwise it would mean less throughput, and thus lower performance, no?

Besides that: that's a HUGE difference!

I wonder how this change affects performance, especially with lz4 compression.

Great find =)

@dweeezil
Contributor Author

Oops, yes, I pasted it in backward. BTW, with this change, we're now up to 50-60% of the illumos performance on this test (highly-concurrent small-block cached reads).

I've run a number of instances of the test with all processes pinned to a single socket and, as expected, things are a lot faster. This issue is one of the big causes of the slowdown when the processes are spread across sockets.

behlendorf added a commit to behlendorf/spl that referenced this issue Jul 20, 2015
The default kmem debugging (--enable-debug-kmem) can severely impact
performance on large-scale NUMA systems due to the atomic operations
used in the memory accounting. A 32-thread fio test running on a
40-core 80-thread system and performing 100% cached reads with kmem
debugging is:

Enabled:
READ: io=177071MB, aggrb=2951.2MB/s, minb=2951.2MB/s, maxb=2951.2MB/s,

Disabled:
READ: io=271454MB, aggrb=4524.4MB/s, minb=4524.4MB/s, maxb=4524.4MB/s,

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#463
@behlendorf
Contributor

Specifically I'd suggest #464 as a reasonable fix (just compiled, not yet tested). I've shamelessly stolen @dweeezil's comment above for the commit, but I think this is the most straightforward way to address this. We don't lose anything from a user perspective by disabling this debugging. It's primarily there for developers and was only enabled by default as a convenience.

@dweeezil dweeezil changed the title kmem accounting hurts performance on large-scale systems spl kmem cache atomic operations hurt performance on large-scale systems Jul 21, 2015
@dweeezil
Contributor Author

I've updated the title of this issue to reflect what I believe is the balance of the performance degradation we're seeing relative to the same benchmark run under illumos. Even with SPL kmem accounting debugging disabled, it appears a large amount of time is spent on atomic operations managing skc_ref within the spl kmem cache code. I'm continuing to investigate and will follow up as more information is found.
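
To make it concrete, the pattern that shows up in the profiles is roughly the following. This is a simplified stand-in, not a verbatim excerpt from the SPL sources; the real allocation path goes through per-cpu magazines and slabs rather than kmalloc(), but the skc_ref bumps bracket it in much the same way:

```c
#include <linux/atomic.h>
#include <linux/slab.h>

/* Minimal stand-in for the real spl_kmem_cache_t; only the field
 * relevant to this discussion is shown. */
struct example_cache {
	atomic_t skc_ref;	/* touched on every alloc and free */
	size_t   obj_size;
};

static void *example_cache_alloc(struct example_cache *skc, gfp_t flags)
{
	void *obj;

	atomic_inc(&skc->skc_ref);	/* same cache line hit from all CPUs */
	obj = kmalloc(skc->obj_size, flags);
	atomic_dec(&skc->skc_ref);

	return obj;
}

static void example_cache_free(struct example_cache *skc, void *obj)
{
	atomic_inc(&skc->skc_ref);
	kfree(obj);
	atomic_dec(&skc->skc_ref);
}
```

On a 4-socket box that per-cache reference count bounces between sockets on every object, which is consistent with what the flame graphs show.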

behlendorf added a commit that referenced this issue Jul 21, 2015
The default kmem debugging (--enable-debug-kmem) can severely impact
performance on large-scale NUMA systems due to the atomic operations
used in the memory accounting. A 32-thread fio test running on a
40-core 80-thread system and performing 100% cached reads with kmem
debugging is:

Enabled:
READ: io=177071MB, aggrb=2951.2MB/s, minb=2951.2MB/s, maxb=2951.2MB/s,

Disabled:
READ: io=271454MB, aggrb=4524.4MB/s, minb=4524.4MB/s, maxb=4524.4MB/s,

Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Issues #463
@behlendorf
Contributor

@dweeezil that would be very good news. We're already well underway on shifting over to the Linux slabs entirely, which shouldn't suffer from this issue. The only thing blocking this is finalizing the ABD patches, which are the next major planned change. This looks like it will be pretty straightforward to address in the SPL layer too. I'd love to see some performance results!

@behlendorf
Contributor

@dweeezil I'm dying to know. Have you done any testing with a change like that in #465 and did it help performance?

@dweeezil
Contributor Author

@behlendorf I've been juggling a lot of things over the past few days and haven't been able to follow up on this as well as I'd like; it seems that whenever I remove a bottleneck another one appears. Here's where things are at the moment, IIRC: I removed the skc_ref operations myself before you posted #465 and the bottleneck moved somewhere else (I forgot where). The benchmark was still spending way too much time in spl_kmem_cache_alloc() and ..._free().

I zeroed in on the local_irq_save() operations as a potential trouble spot and then realized I had been building my kernel with CONFIG_PARAVIRT_SPINLOCKS, which you'd think would be sub-optimal for a large-scale bare-metal system. (I'm doing my testing with 4.0.6 now and have been using a .config based on a standard Ubuntu instance with a handful of small adjustments and my own patch to allow lock debugging and perf to work properly with 80 threads.)

After disabling CONFIG_PARAVIRT, the performance of the benchmark dropped by almost a factor of 10! There was suddenly a ton of time being spent in mutex lock/unlock operations. It was as if a new lock contention problem appeared. I switched to a lock debugging kernel and had just started analyzing lock contention. That's where my research sits at the moment.

I just popped on to the system again and re-ran the benchmark, and it looks like I'm getting pretty bad contention on db_mtx from dbuf_find(), dbuf_rele() and dbuf_read() with an average wait time of 61us, and also on rr_lock, which is pretty typical of any benchmark that makes a whole lot of (read & write) system calls (avg. wait time is 92us).

I'm normally running the benchmark with 32 processes. As a cross-check, I ran an instance with 10 processes and pinned them to a single socket (the system is 4-socket with 10 cores/20 threads each). The rr_lock contention was dramatically reduced: far fewer contentions and a 16us avg. wait. db_mtx is still the biggest contender, but hash_mutexes[i] suddenly has a 101us avg. wait (I've not "spaced out" the inodes of the testing files; they're pretty consecutive), so this may be expected.

Have you got any thoughts as to why not using CONFIG_PARAVIRT_SPINLOCKS would slow it down so much? Does the theory that it exposed some different and unrelated lock contention make any sense?

I'll continue to investigate this as I have time. In the meantime, both the existing SPL change (defaulting to no kmem accounting) and your proposed #465 appear to be Good Things.

@kernelOfTruth

Referencing @ryao's comment that mentions CONFIG_PARAVIRT_SPINLOCKS:

openzfs/zfs#3091 (comment)

https://lwn.net/Articles/495597/ Paravirtualized ticket spinlocks (2012)

@behlendorf
Contributor

@dweeezil I'm not sure why this is the case. Certainly disabling CONFIG_PARAVIRT_SPINLOCKS will result in a different spinlock implementation being used that could disrupt the performance. But then setting CONFIG_DEBUG_SPINLOCK or enabling the kernel lock profiling would have a similar effect too. Any of these changes could perturb the lock contention at the higher ZFS layers if the time required to take/drop locks changes significantly.

I'm glad you're taking the time to look at this so we can continue to chip away at it and make improvements. Speaking of which, if you can sign off on the #465 change, I'll get it merged. There's no real downside to making that change, although I doubt it will help any single-socket systems.

behlendorf added a commit that referenced this issue Jul 24, 2015
As described in spl_kmem_cache_destroy() the ->skc_ref count was
added to address the case of a cache reap or grow racing with a
destroy.  They are not strictly needed in the alloc/free paths
because consumers of the cache are responsible for not using it
while it's being destroyed.

Removing this code is desirable because there is some evidence that
contention on this atomic negatively impacts performance on large-scale
NUMA systems.

Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Issue #463
@dweeezil
Contributor Author

I've finally gotten back to testing on the big NUMA system with current master code for both spl and zfs and the performance is looking pretty good. AFAICT, the spl-related issues are gone (skc_ref & whatever else). I'm not sure where things stand w.r.t. CONFIG_PARAVIRT and friends; I'm running with CONFIG_PARAVIRT set and CONFIG_PARAVIRT_SPINLOCKS not set at the moment.

In any case, I'm going to close this issue because I don't think there are any more spl-related things at the moment which are keeping us from matching illumos' performance.

I will be opening a new issue in the zfs tracker with a similar title.
