
Fix ZAP upgrade deadlock #4692

Closed
wants to merge 1 commit

Conversation

behlendorf
Contributor

The rw_tryupgrade() implementation in the SPL behaves slightly differently
from the illumos version. Specifically, the Linux kernel provides no
upgrade function, so the lock must be released before attempting the
upgrade. For many use cases this isn't a problem, but when upgrading
a ZAP it can result in a deadlock with the l_rwlock.

This minor optimization is disabled under Linux to prevent this issue.

Signed-off-by: Brian Behlendorf <[email protected]>

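For context, a minimal sketch of the emulated upgrade behavior described above (illustrative only; the real SPL code wraps the kernel rwsem in a krwlock_t and differs in detail). Because the read lock must be dropped before the write attempt, a failed attempt has to block to re-take the read lock, and that window is where the ZAP upgrade path deadlocks against the l_rwlock:

```c
/*
 * Illustrative sketch only: the emulated, non-atomic upgrade path on
 * Linux (not the actual SPL implementation).
 */
#include <linux/rwsem.h>

static int
emulated_rw_tryupgrade(struct rw_semaphore *rwsem)
{
	/* No atomic upgrade primitive, so the read lock is dropped... */
	up_read(rwsem);

	/* ...and the write lock is attempted separately. */
	if (down_write_trylock(rwsem))
		return (1);	/* upgraded */

	/*
	 * On failure the caller still expects to hold the read lock,
	 * so it must be re-taken.  This down_read() can block behind a
	 * writer that queued in the window above, which is how the ZAP
	 * upgrade ends up deadlocking while holding the leaf's l_rwlock.
	 */
	down_read(rwsem);
	return (0);
}
```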
@behlendorf
Contributor Author

@lundman can you please review this? It turns out the rw_tryupgrade() implementation we settled on can cause a deadlock in the ZAP upgrade case. What do you think about this for a Linux and OSX fix? Essentially we just do without this minor optimization. Here are the stacks which show the issue.

dbench          D ffff8819ff9d0600     0  8899      1
 [<ffffffff81651699>] schedule+0x29/0x70
 [<ffffffff81653285>] rwsem_down_read_failed+0x105/0x180
 [<ffffffff8130f5d4>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffffa0bbfd92>] zap_tryupgradedir+0x82/0xe0 [zfs]
 [<ffffffffa0bc138c>] zap_expand_leaf+0xcc/0x9e0 [zfs]
 [<ffffffffa0bc2122>] fzap_add_cd+0x112/0x210 [zfs]
 [<ffffffffa0bc2288>] fzap_add+0x68/0x80 [zfs]
 [<ffffffffa0bc8b06>] zap_add+0x136/0x1c0 [zfs]
 [<ffffffffa0bc2c78>] zap_add_int+0x68/0x90 [zfs]
 [<ffffffffa0bd37f4>] zfs_unlinked_add+0x54/0x100 [zfs]
 [<ffffffffa0bf3e86>] zfs_remove+0x736/0xa90 [zfs]
 [<ffffffffa0c12656>] zpl_unlink+0x66/0x170 [zfs]
 [<ffffffff811f4341>] vfs_unlink+0xe1/0x150
 [<ffffffff811f88fe>] do_unlinkat+0x26e/0x2b0
 [<ffffffff811f9846>] SyS_unlink+0x16/0x20

dbench          D ffffffff00000000     0  8842      1
Call Trace:
 [<ffffffff81651699>] schedule+0x29/0x70
 [<ffffffff8165305d>] rwsem_down_write_failed+0x12d/0x250
 [<ffffffff8130f603>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffffa0bc09a4>] zap_get_leaf_byblk+0x2e4/0x660 [zfs]
 [<ffffffffa0bc0df3>] zap_deref_leaf+0xd3/0x150 [zfs]
 [<ffffffffa0bc15cc>] zap_expand_leaf+0x30c/0x9e0 [zfs]
 [<ffffffffa0bc2122>] fzap_add_cd+0x112/0x210 [zfs]
 [<ffffffffa0bc2288>] fzap_add+0x68/0x80 [zfs]
 [<ffffffffa0bc8b06>] zap_add+0x136/0x1c0 [zfs]
 [<ffffffffa0bc2c78>] zap_add_int+0x68/0x90 [zfs]
 [<ffffffffa0bd37f4>] zfs_unlinked_add+0x54/0x100 [zfs]
 [<ffffffffa0bf3e86>] zfs_remove+0x736/0xa90 [zfs]
 [<ffffffffa0c12656>] zpl_unlink+0x66/0x170 [zfs]
 [<ffffffff811f4341>] vfs_unlink+0xe1/0x150
 [<ffffffff811f88fe>] do_unlinkat+0x26e/0x2b0

dbench          D 0000000000000001     0  8843      1
Call Trace:
 [<ffffffff81651699>] schedule+0x29/0x70
 [<ffffffff81653285>] rwsem_down_read_failed+0x105/0x180
 [<ffffffff8130f5d4>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffffa0bc756c>] zap_lockdir+0x22c/0xb00 [zfs]
 [<ffffffffa0bc8a28>] zap_add+0x58/0x1c0 [zfs]
 [<ffffffffa0bc2c78>] zap_add_int+0x68/0x90 [zfs]
 [<ffffffffa0bd37f4>] zfs_unlinked_add+0x54/0x100 [zfs]
 [<ffffffffa0bf3e86>] zfs_remove+0x736/0xa90 [zfs]
 [<ffffffffa0c12656>] zpl_unlink+0x66/0x170 [zfs]
 [<ffffffff811f4341>] vfs_unlink+0xe1/0x150
 [<ffffffff811f88fe>] do_unlinkat+0x26e/0x2b0
 [<ffffffff811f9846>] SyS_unlink+0x16/0x20

@behlendorf
Contributor Author

@dweeezil could you please review this? It resolves some minor fallout from the rw_tryupgrade() implementation which was added to the SPL in openzfs/spl@a6ae97c.

The proposed fix is to just remove the optimization here which relies on rw_tryupgrade(); I don't expect it will have any significant performance impact.
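Purely for illustration, the shape of such a workaround on the ZAP side might look like the sketch below; the function name and platform guard are hypothetical, not taken from this patch, and the effect is simply that callers always fall back to the existing drop-and-relock slow path:

```c
#include <sys/zfs_context.h>

/*
 * Hypothetical sketch (names are not from the patch).  On platforms
 * without an atomic upgrade primitive, report failure so the caller
 * takes the existing slow path: drop the directory lock, re-acquire
 * it as a writer, and revalidate.
 */
static boolean_t
zap_try_upgrade_lock(krwlock_t *lp)
{
#if defined(__linux__) || defined(__APPLE__)
	(void) lp;
	return (B_FALSE);
#else
	return (rw_tryupgrade(lp) != 0);
#endif
}
```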

@tuxoko
Contributor

tuxoko commented May 25, 2016

@behlendorf
I think if we are going to use tryupgrade, we should probably implement a proper one. Otherwise, we might get bitten by this again in the future.

@behlendorf
Contributor Author

@tuxoko I agree that would be best. But I don't see a way to implement this without grubbing around in the private rwsem internals, which will be fragile.

@tuxoko
Contributor

tuxoko commented May 25, 2016

Fragile it might be. But I don't think it would change too much or too often.

@behlendorf
Contributor Author

@tuxoko it's complicated by the fact that while all the architectures share a common rw_semaphore structure, they each have an optimized implementation which can (and does) use it slightly differently.

@dweeezil
Contributor

@behlendorf As was discussed when rw_tryupgrade() was modified, there was certainly a chance of lurking problems. I recall looking at the ZAP upgrade code but didn't dig far enough to see the interaction with l_rwlock. Your patch certainly looks fine insofar as working around the immediate problem is concerned.

It does seem, however, that @tuxoko has created an implementation which will work with the current kernel implementation, using cmpxchg exactly as you suggested in openzfs/spl@ef6c136. I suppose the main question is just how fragile it is. I don't know, for example, how it works on ARM these days or whether a change in implementation might be in the works.
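As a rough sketch of that cmpxchg idea (assuming the historical RWSEM_ACTIVE_READ_BIAS / RWSEM_ACTIVE_WRITE_BIAS constants from pre-4.7 kernels; the actual SPL patch differs in detail), the upgrade only succeeds when the count still records exactly one active reader:

```c
#include <linux/atomic.h>
#include <linux/rwsem.h>

/*
 * Rough sketch of an atomic reader-to-writer upgrade via cmpxchg()
 * on rwsem->count.  Assumes the historical RWSEM_ACTIVE_READ_BIAS /
 * RWSEM_ACTIVE_WRITE_BIAS constants; not the actual SPL patch.
 */
static int
rwsem_tryupgrade_cmpxchg(struct rw_semaphore *rwsem)
{
	/*
	 * Succeed only if we are the single active reader with no
	 * waiters: atomically swap the "one reader" count for the
	 * "one writer" count.  Any other state makes the cmpxchg()
	 * fail and the read lock is simply kept.
	 */
	return (cmpxchg(&rwsem->count, RWSEM_ACTIVE_READ_BIAS,
	    RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_ACTIVE_READ_BIAS);
}
```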

@behlendorf
Contributor Author

Closing in favor of openzfs/spl#554 which looks like it's going to be a better solution.

@behlendorf behlendorf closed this May 27, 2016
@lundman
Contributor

lundman commented May 28, 2016

@behlendorf Hmm, we don't have the option of going deeper, as Apple hides all that (unless we implement our own rwlocks, I guess). At the moment, we check that readers==1, drop the lock, then just tryenter, so it will fail more often than not, but it never blocks.

Is there a test case for the zap-upgrade deadlock with l_rwlock?

@behlendorf
Contributor Author

@lundman it sounds like you're going to have the same issue. If your implementation drops the read lock then it's going to need to reacquire it when the trylock fails, and that's where it can deadlock. If you can't change your implementation you'll probably want to adapt this fix.

I was able to reproduce this issue fairly easily by creating a new filesystem and then running something like dbench 128. When the unlinked ZAP was promoted I could trigger the deadlock.

behlendorf pushed a commit to openzfs/spl that referenced this pull request May 31, 2016
Current rw_tryupgrade does rw_exit and then rw_tryenter(RW_WRITER), and then
does rw_enter(RW_READER) if it fails. This violates the assumption that
rw_tryupgrade should be atomic and could cause extra contention or even lock
inversion.

This patch implements a proper rw_tryupgrade. For rwsem-spinlock, we take
the spinlock to check rwsem->count and rwsem->wait_list. For a normal rwsem, we
use cmpxchg on rwsem->count to change the value from single reader to single
writer.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Closes openzfs/zfs#4692
Closes #554
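For the rwsem-spinlock case described in the commit message above, a rough sketch (assuming the generic CONFIG_RWSEM_GENERIC_SPINLOCK layout with count, wait_lock, and wait_list; not the actual patch) could look like:

```c
#include <linux/list.h>
#include <linux/rwsem.h>
#include <linux/spinlock.h>

/*
 * Rough sketch for the CONFIG_RWSEM_GENERIC_SPINLOCK variant, where
 * the rwsem state is protected by an internal spinlock.  Assumes the
 * generic layout (count, wait_lock, wait_list); not the actual patch.
 */
static int
rwsem_tryupgrade_spinlock(struct rw_semaphore *rwsem)
{
	unsigned long flags;
	int ret = 0;

	raw_spin_lock_irqsave(&rwsem->wait_lock, flags);
	/* One active reader (us) and nobody queued: become the writer. */
	if (rwsem->count == 1 && list_empty(&rwsem->wait_list)) {
		rwsem->count = -1;	/* generic rwsem writer value */
		ret = 1;
	}
	raw_spin_unlock_irqrestore(&rwsem->wait_lock, flags);

	return (ret);
}
```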
nedbass pushed a commit to nedbass/spl that referenced this pull request Aug 26, 2016
tuxoko pushed a commit to tuxoko/spl that referenced this pull request Sep 8, 2016
@behlendorf behlendorf deleted the zap_tryupgradedir branch April 19, 2021 20:08