
spl kmem cache atomic operations hurt performance on large-scale systems #463

dweeezil opened this issue Jul 20, 2015 · 11 comments

@dweeezil
Contributor

The default kmem debugging (--enable-debug-kmem) can severely impact performance on large-scale NUMA systems due to the atomic operations used in the memory accounting. A 32-thread fio test running on a 40-core 80-thread system and performing 100% cached reads with kmem debugging enabled is:

READ: io=177071MB, aggrb=2951.2MB/s, minb=2951.2MB/s, maxb=2951.2MB/s, mint=60001msec, maxt=60001msec

and when SPL/ZFS is built with --disable-debug-kmem is:

READ: io=271454MB, aggrb=4524.4MB/s, minb=4524.4MB/s, maxb=4524.4MB/s, mint=60003msec, maxt=60003msec

This was discovered via flame graph profiling.

EDIT: Fixed the order of the stats.
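
For anyone not familiar with what --enable-debug-kmem turns on: the debug accounting keeps global allocation counters that every alloc and free updates with an atomic read-modify-write, so all CPUs end up hammering the same cache lines. The snippet below is only a simplified illustration of that pattern; the counter and helper names are made up and it is not the actual SPL source:

```c
#include <linux/atomic.h>

/* One global counter shared by every CPU in the system. */
static atomic64_t kmem_bytes_used = ATOMIC64_INIT(0);

static inline void example_alloc_account(size_t size)
{
	/*
	 * Every allocation performs a locked add on the same cache
	 * line, which ping-pongs between sockets under load.
	 */
	atomic64_add(size, &kmem_bytes_used);
}

static inline void example_free_account(size_t size)
{
	atomic64_sub(size, &kmem_bytes_used);
}
```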

@behlendorf
Contributor

@dweeezil wow, that's a huge difference. I wouldn't have thought the atomics would be that expensive. It sounds like the thing to do here is:

  1. Change the default value to --disable-debug-kmem
  2. Fix the spl-proc code so this doesn't result in /proc/spl/kmem/slab being disabled.

We can manually set --enable-debug-kmem for the buildbot so things like memory leaks still get automatically flagged. Alternatively, these counters could be turned into per-cpu values.
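
If we went the per-cpu route, the hot path would only touch a CPU-local counter and readers would pay the cost of summing. A rough sketch of what that could look like using the kernel's generic percpu_counter API (illustrative only; the names are made up and this isn't proposed SPL code):

```c
#include <linux/percpu_counter.h>
#include <linux/gfp.h>

static struct percpu_counter kmem_bytes_used;

/* One-time setup, e.g. at module load; error handling omitted. */
static int example_account_init(void)
{
	return percpu_counter_init(&kmem_bytes_used, 0, GFP_KERNEL);
}

static inline void example_alloc_account(size_t size)
{
	/* Updates a CPU-local delta; the shared total is only
	 * touched when the delta exceeds the batch threshold. */
	percpu_counter_add(&kmem_bytes_used, size);
}

static inline void example_free_account(size_t size)
{
	percpu_counter_sub(&kmem_bytes_used, size);
}

/* Readers (e.g. the /proc/spl handlers) pay the summing cost. */
static inline s64 example_account_read(void)
{
	return percpu_counter_sum(&kmem_bytes_used);
}
```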

(BTW I think you pasted the performance data in backwards)

@behlendorf behlendorf added the Bug label Jul 20, 2015
@behlendorf behlendorf added this to the 0.6.5 milestone Jul 20, 2015
@kernelOfTruth

/me rubs his eyes

@dweeezil are those stats inverted for each case (debug and non-debug)?

Otherwise it would mean less throughput, and thus lower performance, no?

Besides that: that's a HUGE difference!

I wonder how this change affects performance, especially with lz4 compression.

Great find =)

@dweeezil
Contributor Author

Oops, yes, I pasted it in backward. BTW, with this change, we're now up to 50-60% of the illumos performance on this test (highly-concurrent small-block cached reads).

I've run a number of instances of the test with all processes pinned to a single socket and, as expected, things are a lot faster. This issue is one of the big causes of the slowdown when the processes are spread across sockets.

behlendorf added a commit to behlendorf/spl that referenced this issue Jul 20, 2015
The default kmem debugging (--enable-debug-kmem) can severely impact
performance on large-scale NUMA systems due to the atomic operations
used in the memory accounting. A 32-thread fio test running on a
40-core 80-thread system and performing 100% cached reads with kmem
debugging is:

Enabled:
READ: io=177071MB, aggrb=2951.2MB/s, minb=2951.2MB/s, maxb=2951.2MB/s,

Disabled:
READ: io=271454MB, aggrb=4524.4MB/s, minb=4524.4MB/s, maxb=4524.4MB/s,

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#463
@behlendorf
Contributor

Specifically I'd suggest #464 as a reasonable fix (just compiled, not yet tested). I've shamelessly stolen @dweeezil's comment above for the commit, but I think this is the most straightforward way to address this. We don't lose anything from a user perspective by disabling this debugging. It's primarily there for developers and was only enabled by default as a convenience.

@dweeezil dweeezil changed the title kmem accounting hurts performance on large-scale systems spl kmem cache atomic operations hurt performance on large-scale systems Jul 21, 2015
@dweeezil
Contributor Author

I've updated the title of this issue to reflect what I believe is the balance of the performance degradation we're seeing relative to the same benchmark run under illumos. Even with SPL kmem accounting debugging disabled, it appears a large amount of time is spent on atomic operations managing skc_ref within the spl kmem cache code. I'm continuing to investigate and will follow up as more information is found.
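
To make it concrete, the pattern that shows up in the profiles is roughly the following. This is a simplified stand-in, not a verbatim excerpt from the SPL sources; the real allocation path goes through per-cpu magazines and slabs rather than kmalloc(), but the skc_ref bumps bracket it in much the same way:

```c
#include <linux/atomic.h>
#include <linux/slab.h>

/* Minimal stand-in for the real spl_kmem_cache_t; only the field
 * relevant to this discussion is shown. */
struct example_cache {
	atomic_t skc_ref;	/* touched on every alloc and free */
	size_t   obj_size;
};

static void *example_cache_alloc(struct example_cache *skc, gfp_t flags)
{
	void *obj;

	atomic_inc(&skc->skc_ref);	/* same cache line hit from all CPUs */
	obj = kmalloc(skc->obj_size, flags);
	atomic_dec(&skc->skc_ref);

	return obj;
}

static void example_cache_free(struct example_cache *skc, void *obj)
{
	atomic_inc(&skc->skc_ref);
	kfree(obj);
	atomic_dec(&skc->skc_ref);
}
```

On a 4-socket box that per-cache reference count bounces between sockets on every object, which is consistent with what the flame graphs show.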

behlendorf added a commit that referenced this issue Jul 21, 2015
The default kmem debugging (--enable-debug-kmem) can severely impact
performance on large-scale NUMA systems due to the atomic operations
used in the memory accounting. A 32-thread fio test running on a
40-core 80-thread system and performing 100% cached reads with kmem
debugging is:

Enabled:
READ: io=177071MB, aggrb=2951.2MB/s, minb=2951.2MB/s, maxb=2951.2MB/s,

Disabled:
READ: io=271454MB, aggrb=4524.4MB/s, minb=4524.4MB/s, maxb=4524.4MB/s,

Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Issues #463
@behlendorf
Contributor

@dweeezil that would be very good news. We're already well underway on shifting over to the Linux slabs entirely, which shouldn't suffer from this issue. The only thing blocking this is finalizing the ABD patches, which are the next major planned change. This looks like it will be pretty straightforward to address in the SPL layer too. I'd love to see some performance results!

@behlendorf
Contributor

@dweeezil I'm dying to know. Have you done any testing with a change like that in #465 and did it help performance?

@dweeezil
Contributor Author

@behlendorf I've been juggling a lot of things over the past few days and haven't been able to follow up on this as well as I'd like; it seems that whenever I remove a bottleneck another one appears. Here's where things are at the moment, IIRC: I removed the skc_ref operations myself before you posted #465 and the bottleneck moved somewhere else (I forgot where). The benchmark was still spending way too much time in spl_kmem_cache_alloc() and ..._free().

I zeroed in on the local_irq_save() operations as a potential trouble spot and then realized I had been building my kernel with CONFIG_PARAVIRT_SPINLOCKS, which you'd think would be sub-optimal for a large-scale bare-metal system. (I'm doing my testing with 4.0.6 now and have been using a .config based on a standard Ubuntu instance with a handful of small adjustments and my own patch to allow lock debugging and perf to work properly with 80 threads.)

After disabling CONFIG_PARAVIRT, the performance of the benchmark dropped by almost a factor of 10! There was suddenly a ton of time being spent in mutex lock/unlock operations. It was as if a new lock contention problem appeared. I switched to a lock debugging kernel and had just started analyzing lock contention. That's where my research sits at the moment.

I just popped on to the system again and re-ran the benchmark, and it looks like I'm getting pretty bad contention on db_mtx from dbuf_find(), dbuf_rele() and dbuf_read() with an average wait time of 61us, and also on rr_lock, which is pretty typical of any benchmark that makes a whole lot of (read & write) system calls (avg. wait time is 92us).

I'm normally running the benchmark with 32 processes. As a cross-check, I ran an instance with 10 processes and pinned them to a single socket (the system is 4-socket with 10 cores/20 threads each). The rr_lock contention was dramatically reduced: far fewer contentions and a 16us avg. wait. db_mtx is still the biggest contender, but hash_mutexes[i] suddenly has a 101us avg. wait (I've not "spaced out" the inodes of the testing files; they're pretty consecutive), so this may be expected.

Have you got any thoughts as to why not using CONFIG_PARAVIRT_SPINLOCKS would slow it down so much? Does the theory that it exposed some different and unrelated lock contention make any sense?

I'll continue to investigate this as I have time. In the meantime, both the existing SPL change (defaulting to no kmem accounting) and your proposed #465 appear to be Good Things.

@kernelOfTruth

Referencing @ryao's comment that mentions CONFIG_PARAVIRT_SPINLOCKS:

openzfs/zfs#3091 (comment)

https://lwn.net/Articles/495597/ Paravirtualized ticket spinlocks (2012)

@behlendorf
Contributor

@dweeezil I'm not sure why this is the case. Certainly disabling CONFIG_PARAVIRT_SPINLOCKS will result in a different spinlock implementation being used that could disrupt the performance. But then setting CONFIG_DEBUG_SPINLOCK or enabling the kernel lock profiling would have a similar effect too. Any of these changes could perturb the lock contention at the higher ZFS layers if the time required to take/drop locks changes significantly.

I'm glad you're taking the time to look at this so we can continue to chip away at it and make improvements. Speaking of which, if you can sign off on the #465 change, I'll get it merged. There's no real downside to making that change, although I doubt it will help any single-socket systems.

behlendorf added a commit that referenced this issue Jul 24, 2015
As described in spl_kmem_cache_destroy() the ->skc_ref count was
added to address the case of a cache reap or grow racing with a
destroy.  They are not strictly needed in the alloc/free paths
because consumers of the cache are responsible for not using it
while it's being destroyed.

Removing this code is desirable because there is some evidence that
contention on this atomic negatively impacts performance on large-scale
NUMA systems.

Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Issue #463
@dweeezil
Contributor Author

I've finally gotten back to testing on the big NUMA system with current master code for both spl and zfs and the performance is looking pretty good. AFAICT, the spl-related issues are gone (skc_ref & whatever else). I'm not sure where things stand w.r.t. CONFIG_PARAVIRT and friends; I'm running with CONFIG_PARAVIRT set and CONFIG_PARAVIRT_SPINLOCKS not set at the moment.

In any case, I'm going to close this issue because I don't think there are any more spl-related things at the moment which are keeping us from matching illumos' performance.

I will be opening a new issue in the zfs tracker with a similar title.
