
Fix ZAP upgrade deadlock #4692

Closed
wants to merge 1 commit

Conversation

behlendorf
Contributor

The rw_tryupgrade() implementation in the SPL behaves slightly differently
from the illumos version. Specifically, the Linux kernel provides no
upgrade function, so the lock must be released before attempting the
upgrade. For many use cases this isn't a problem, but when upgrading
a ZAP it can result in a deadlock with the l_rwlock.

This minor optimization is disabled under Linux to prevent this issue.

Signed-off-by: Brian Behlendorf <[email protected]>

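For context, a minimal sketch of the emulated upgrade behavior described above (illustrative only; the real SPL code wraps the kernel rwsem in a krwlock_t and differs in detail). Because the read lock must be dropped before the write attempt, a failed attempt has to block to re-take the read lock, and that window is where the ZAP upgrade path deadlocks against the l_rwlock:

```c
/*
 * Illustrative sketch only: the emulated, non-atomic upgrade path on
 * Linux (not the actual SPL implementation).
 */
#include <linux/rwsem.h>

static int
emulated_rw_tryupgrade(struct rw_semaphore *rwsem)
{
	/* No atomic upgrade primitive, so the read lock is dropped... */
	up_read(rwsem);

	/* ...and the write lock is attempted separately. */
	if (down_write_trylock(rwsem))
		return (1);	/* upgraded */

	/*
	 * On failure the caller still expects to hold the read lock,
	 * so it must be re-taken.  This down_read() can block behind a
	 * writer that queued in the window above, which is how the ZAP
	 * upgrade ends up deadlocking while holding the leaf's l_rwlock.
	 */
	down_read(rwsem);
	return (0);
}
```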
@behlendorf
Contributor Author

@lundman can you please review this? It turns out the rw_tryupgrade() implementation we settled on can cause a deadlock in the ZAP upgrade case. What do you think about this for a Linux and OSX fix? Essentially we just do without this minor optimization. Here are the stacks which show the issue.

dbench          D ffff8819ff9d0600     0  8899      1
 [<ffffffff81651699>] schedule+0x29/0x70
 [<ffffffff81653285>] rwsem_down_read_failed+0x105/0x180
 [<ffffffff8130f5d4>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffffa0bbfd92>] zap_tryupgradedir+0x82/0xe0 [zfs]
 [<ffffffffa0bc138c>] zap_expand_leaf+0xcc/0x9e0 [zfs]
 [<ffffffffa0bc2122>] fzap_add_cd+0x112/0x210 [zfs]
 [<ffffffffa0bc2288>] fzap_add+0x68/0x80 [zfs]
 [<ffffffffa0bc8b06>] zap_add+0x136/0x1c0 [zfs]
 [<ffffffffa0bc2c78>] zap_add_int+0x68/0x90 [zfs]
 [<ffffffffa0bd37f4>] zfs_unlinked_add+0x54/0x100 [zfs]
 [<ffffffffa0bf3e86>] zfs_remove+0x736/0xa90 [zfs]
 [<ffffffffa0c12656>] zpl_unlink+0x66/0x170 [zfs]
 [<ffffffff811f4341>] vfs_unlink+0xe1/0x150
 [<ffffffff811f88fe>] do_unlinkat+0x26e/0x2b0
 [<ffffffff811f9846>] SyS_unlink+0x16/0x20

dbench          D ffffffff00000000     0  8842      1
Call Trace:
 [<ffffffff81651699>] schedule+0x29/0x70
 [<ffffffff8165305d>] rwsem_down_write_failed+0x12d/0x250
 [<ffffffff8130f603>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffffa0bc09a4>] zap_get_leaf_byblk+0x2e4/0x660 [zfs]
 [<ffffffffa0bc0df3>] zap_deref_leaf+0xd3/0x150 [zfs]
 [<ffffffffa0bc15cc>] zap_expand_leaf+0x30c/0x9e0 [zfs]
 [<ffffffffa0bc2122>] fzap_add_cd+0x112/0x210 [zfs]
 [<ffffffffa0bc2288>] fzap_add+0x68/0x80 [zfs]
 [<ffffffffa0bc8b06>] zap_add+0x136/0x1c0 [zfs]
 [<ffffffffa0bc2c78>] zap_add_int+0x68/0x90 [zfs]
 [<ffffffffa0bd37f4>] zfs_unlinked_add+0x54/0x100 [zfs]
 [<ffffffffa0bf3e86>] zfs_remove+0x736/0xa90 [zfs]
 [<ffffffffa0c12656>] zpl_unlink+0x66/0x170 [zfs]
 [<ffffffff811f4341>] vfs_unlink+0xe1/0x150
 [<ffffffff811f88fe>] do_unlinkat+0x26e/0x2b0

dbench          D 0000000000000001     0  8843      1
Call Trace:
 [<ffffffff81651699>] schedule+0x29/0x70
 [<ffffffff81653285>] rwsem_down_read_failed+0x105/0x180
 [<ffffffff8130f5d4>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffffa0bc756c>] zap_lockdir+0x22c/0xb00 [zfs]
 [<ffffffffa0bc8a28>] zap_add+0x58/0x1c0 [zfs]
 [<ffffffffa0bc2c78>] zap_add_int+0x68/0x90 [zfs]
 [<ffffffffa0bd37f4>] zfs_unlinked_add+0x54/0x100 [zfs]
 [<ffffffffa0bf3e86>] zfs_remove+0x736/0xa90 [zfs]
 [<ffffffffa0c12656>] zpl_unlink+0x66/0x170 [zfs]
 [<ffffffff811f4341>] vfs_unlink+0xe1/0x150
 [<ffffffff811f88fe>] do_unlinkat+0x26e/0x2b0
 [<ffffffff811f9846>] SyS_unlink+0x16/0x20

@behlendorf
Contributor Author

@dweeezil could you please review this? It resolves some minor fallout from the rw_tryupgrade() implementation which was added to the SPL in openzfs/spl@a6ae97c.

The proposed fix is to just remove the optimization here which relies on rw_tryupgrade(); I don't expect it will have any significant performance impact.
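Purely for illustration, the shape of such a workaround on the ZAP side might look like the sketch below; the function name and platform guard are hypothetical, not taken from this patch, and the effect is simply that callers always fall back to the existing drop-and-relock slow path:

```c
#include <sys/zfs_context.h>

/*
 * Hypothetical sketch (names are not from the patch).  On platforms
 * without an atomic upgrade primitive, report failure so the caller
 * takes the existing slow path: drop the directory lock, re-acquire
 * it as a writer, and revalidate.
 */
static boolean_t
zap_try_upgrade_lock(krwlock_t *lp)
{
#if defined(__linux__) || defined(__APPLE__)
	(void) lp;
	return (B_FALSE);
#else
	return (rw_tryupgrade(lp) != 0);
#endif
}
```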

@tuxoko
Contributor

tuxoko commented May 25, 2016

@behlendorf
I think if we are going to use tryupgrade, we should probably implement a proper one. Otherwise, we might get bitten by this again in the future.

@behlendorf
Contributor Author

@tuxoko I agree that would be best. But I don't see a way to implement this without grubbing around in the private rwsem internals, which will be fragile.

@tuxoko
Contributor

tuxoko commented May 25, 2016

Fragile it might be. But I don't think it would change too much or too often.

@behlendorf
Contributor Author

@tuxoko it's complicated by the fact that while all the architectures share a common rw_semaphore structure, they each have an optimized implementation which can (and does) use it slightly differently.

@dweeezil
Contributor

@behlendorf As was discussed when rw_tryupgrade() was modified, there was certainly a chance of lurking problems. I recall looking at the ZAP upgrade code but didn't dig far enough to see the interaction with l_rwlock. Your patch certainly looks fine insofar as working around the immediate problem is concerned.

It does seem, however, that @tuxoko has created an implementation which will work with the current kernel implementation, using cmpxchg exactly as you suggested in openzfs/spl@ef6c136. I suppose the main question is just how fragile it is. I don't know, for example, how it works on ARM these days or whether a change in implementation might be in the works.
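As a rough sketch of that cmpxchg idea (assuming the historical RWSEM_ACTIVE_READ_BIAS / RWSEM_ACTIVE_WRITE_BIAS constants from pre-4.7 kernels; the actual SPL patch differs in detail), the upgrade only succeeds when the count still records exactly one active reader:

```c
#include <linux/atomic.h>
#include <linux/rwsem.h>

/*
 * Rough sketch of an atomic reader-to-writer upgrade via cmpxchg()
 * on rwsem->count.  Assumes the historical RWSEM_ACTIVE_READ_BIAS /
 * RWSEM_ACTIVE_WRITE_BIAS constants; not the actual SPL patch.
 */
static int
rwsem_tryupgrade_cmpxchg(struct rw_semaphore *rwsem)
{
	/*
	 * Succeed only if we are the single active reader with no
	 * waiters: atomically swap the "one reader" count for the
	 * "one writer" count.  Any other state makes the cmpxchg()
	 * fail and the read lock is simply kept.
	 */
	return (cmpxchg(&rwsem->count, RWSEM_ACTIVE_READ_BIAS,
	    RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_ACTIVE_READ_BIAS);
}
```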

@behlendorf
Contributor Author

Closing in favor of openzfs/spl#554 which looks like it's going to be a better solution.

@behlendorf behlendorf closed this May 27, 2016
@lundman
Contributor

lundman commented May 28, 2016

@behlendorf Hmm, we don't have the option of going deeper, as Apple hides all that (unless we implement our own rwlocks, I guess). At the moment, we check that readers==1, drop the lock, then just tryenter, so it will fail more often than not, but it never blocks.

Is there a test case for the zap-upgrade deadlock with l_rwlock?

@behlendorf
Contributor Author

@lundman it sounds like you're going to have the same issue. If your implementation drops the read lock then it's going to need to reacquire it when the trylock fails, and that's where it can deadlock. If you can't change your implementation you'll probably want to adapt this fix.

I was able to reproduce this issue fairly easily by creating a new filesystem and then running something like dbench 128. When the unlinked ZAP was promoted I could trigger the deadlock.

behlendorf pushed a commit to openzfs/spl that referenced this pull request May 31, 2016
Current rw_tryupgrade does rw_exit and then rw_tryenter(RW_WRITER), and then
does rw_enter(RW_READER) if it fails. This violates the assumption that
rw_tryupgrade should be atomic and could cause extra contention or even lock
inversion.

This patch implements a proper rw_tryupgrade. For rwsem-spinlock, we take
the spinlock to check rwsem->count and rwsem->wait_list. For a normal rwsem, we
use cmpxchg on rwsem->count to change the value from single reader to single
writer.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Closes openzfs/zfs#4692
Closes #554
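For the rwsem-spinlock case described in the commit message above, a rough sketch (assuming the generic CONFIG_RWSEM_GENERIC_SPINLOCK layout with count, wait_lock, and wait_list; not the actual patch) could look like:

```c
#include <linux/list.h>
#include <linux/rwsem.h>
#include <linux/spinlock.h>

/*
 * Rough sketch for the CONFIG_RWSEM_GENERIC_SPINLOCK variant, where
 * the rwsem state is protected by an internal spinlock.  Assumes the
 * generic layout (count, wait_lock, wait_list); not the actual patch.
 */
static int
rwsem_tryupgrade_spinlock(struct rw_semaphore *rwsem)
{
	unsigned long flags;
	int ret = 0;

	raw_spin_lock_irqsave(&rwsem->wait_lock, flags);
	/* One active reader (us) and nobody queued: become the writer. */
	if (rwsem->count == 1 && list_empty(&rwsem->wait_list)) {
		rwsem->count = -1;	/* generic rwsem writer value */
		ret = 1;
	}
	raw_spin_unlock_irqrestore(&rwsem->wait_lock, flags);

	return (ret);
}
```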
nedbass pushed a commit to nedbass/spl that referenced this pull request Aug 26, 2016
tuxoko pushed a commit to tuxoko/spl that referenced this pull request Sep 8, 2016
@behlendorf behlendorf deleted the zap_tryupgradedir branch April 19, 2021 20:08