Hi! I've come up with a somewhat different epoch algorithm, which performs very similarly to the current one while being much simpler. (It also fixes #551 and might help with #869.) It might need some performance tuning on Linux, Windows, or weakly ordered architectures, but I'm curious to know what you think of the approach, or if you have any ideas to make it faster.
Unlike the current algorithm, it uses a fixed number of "pinned" indicators instead of one per thread. Adding more indicators yields diminishing returns against contention, especially once there are more of them than cores. (An interesting experiment would be to pick one based on `sched_getcpu()`. I didn't try this because my system doesn't support it.)

Also unlike the current algorithm, it uses the ordering of epochs to ensure that garbage can't be simultaneously added and removed for the same epoch. This greatly simplifies storing the garbage, because those operations then don't have to be thread-safe with respect to each other.
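To illustrate the shard-selection idea: where `sched_getcpu()` is available one could use `sched_getcpu() % NUM_SHARDS`, and a portable stand-in is to hash the thread ID into a shard index. This is just a sketch, not code from this PR; `NUM_SHARDS` and `shard_index` are names invented for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::thread;

/// Hypothetical number of pinned-indicator shards (for illustration only).
const NUM_SHARDS: usize = 16;

/// Pick an indicator shard for the current thread. On systems with
/// `sched_getcpu()`, returning `sched_getcpu() % NUM_SHARDS` would keep
/// threads on the same core using the same shard; hashing the thread ID
/// is a portable fallback that still spreads threads across shards.
fn shard_index() -> usize {
    let mut h = DefaultHasher::new();
    thread::current().id().hash(&mut h);
    (h.finish() as usize) % NUM_SHARDS
}

fn main() {
    let i = shard_index();
    assert!(i < NUM_SHARDS);
    println!("this thread uses shard {}", i);
}
```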
Finally, it doesn't use any memory ordering stronger than acquire or release. In my opinion this makes it easier to reason about. (It might help performance on ARM, but I don't have one to test it on.)
Internally it uses an approach similar to an RwLock, with reference counters that store the write reference in the high bit and read references in the low bits. Here's how it works in detail:
## Steps

- To pin a thread
- Once the local buffer of deferred functions is full enough
- To advance the epoch (while pinned)
## Reference counter

The reference counter is divided into 16 shards.
To read-lock, pick a shard and acquire-increment it. If the high bit is set, fail; if the next-to-high bit is set, panic (this indicates an overflow). To read-unlock, release-decrement the same shard.

To write-lock, attempt to acquire-CAS each counter from 0 to HIGH_BIT. If a counter's original value was anything other than 0 or HIGH_BIT, fail; if the final counter's original value wasn't 0, fail. This allows writers that failed after setting some counters to not cause deadlocks, while the final counter decides which writer wins. To write-unlock, release-set all counters to 0.
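The protocol above could be sketched roughly as follows. This isn't the PR's actual code; the type and names (`ShardedCounter`, `SHARDS`, `HIGH_BIT`, `OVERFLOW_BIT`) are invented for illustration, and only the Acquire/Release orderings described above are used.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const SHARDS: usize = 16;
const HIGH_BIT: usize = 1 << (usize::BITS - 1);
const OVERFLOW_BIT: usize = HIGH_BIT >> 1; // the next-to-high bit

struct ShardedCounter {
    shards: [AtomicUsize; SHARDS],
}

impl ShardedCounter {
    fn new() -> Self {
        Self { shards: std::array::from_fn(|_| AtomicUsize::new(0)) }
    }

    /// Try to read-lock one shard; returns the shard index on success so
    /// the caller can unlock the same shard later.
    fn try_read_lock(&self, shard: usize) -> Option<usize> {
        let prev = self.shards[shard].fetch_add(1, Ordering::Acquire);
        if prev & OVERFLOW_BIT != 0 {
            panic!("read-reference count overflow");
        }
        if prev & HIGH_BIT != 0 {
            // A writer holds (or is taking) the lock; back out and fail.
            self.shards[shard].fetch_sub(1, Ordering::Release);
            return None;
        }
        Some(shard)
    }

    fn read_unlock(&self, shard: usize) {
        self.shards[shard].fetch_sub(1, Ordering::Release);
    }

    /// Try to write-lock by CASing every shard from 0 to HIGH_BIT.
    /// A shard already at HIGH_BIT (left by a failed writer) is tolerated
    /// for all but the final shard, which decides which writer wins.
    /// A failed writer does NOT roll back the shards it already set.
    fn try_write_lock(&self) -> bool {
        for (i, s) in self.shards.iter().enumerate() {
            match s.compare_exchange(0, HIGH_BIT, Ordering::Acquire, Ordering::Relaxed) {
                Ok(_) => {}
                Err(v) if v == HIGH_BIT && i != SHARDS - 1 => {}
                Err(_) => return false, // a reader, or we lost the final shard
            }
        }
        true
    }

    fn write_unlock(&self) {
        for s in &self.shards {
            s.store(0, Ordering::Release);
        }
    }
}

fn main() {
    let c = ShardedCounter::new();
    let r = c.try_read_lock(3).expect("read lock should succeed");
    assert!(!c.try_write_lock()); // an active reader blocks the writer
    c.read_unlock(r);
    assert!(c.try_write_lock()); // tolerates shards left set by the failed attempt
    assert!(c.try_read_lock(0).is_none()); // the writer blocks readers
    c.write_unlock();
}
```

Note that the single-owner write lock makes write-unlock a plain release store to each shard, with no CAS loop needed.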
## Proof sketch
As with the classic epoch algorithm, each epoch overlaps the one before and after it (which is required for wait-freedom), but everything in epoch n happens-before everything in epoch n+2. This is because the advancing thread in epoch n+1 write-locks (acquire) epoch n then write-unlocks (release) epoch n+2. A thread in n+2 must advance the epoch to n+3, so n+3 happens-after n, and so on. Thus the latest epoch that could have observed pointers that epoch n unlinked from the data structure is n+1.
Since the advancing thread in n+2 write-locks n+1, it happens-after it as well, and thus happens-after any uses of those pointers, so they are safe to delete. In addition, the advancing thread in n+1 write-locked n, so from the point of view of the advancing thread in n+2 it is already write-locked, and no one will touch epoch n's garbage pile until it's unlocked. (I can draw a diagram if it helps.)