Enable replicating purge requests between nodes #4064

Merged
nickva merged 2 commits into main from replicate-purges on Jun 17, 2022

Conversation

nickva (Contributor) commented on Jun 13, 2022

It seems we had forgotten to enable it, and in the case where a node is offline when a clustered purge request is issued and then re-joins, it won't know to process any purge requests issued to other nodes.
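For reference, the enablement itself is just a default configuration change (see the `rel/overlay/etc/default.ini` review thread below). A minimal sketch of the kind of setting involved, assuming the option is a `replicate_purges` flag in the `[mem3]` section; the exact section and key are not shown in this thread, so treat them as assumptions:

```ini
; Sketch only: assumed section/key names, not the exact diff from this PR.
[mem3]
; Let the internal replicator also replicate purge requests between nodes, so a
; node that was offline during a clustered purge catches up when it re-joins.
replicate_purges = true
```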

After enabling it, @jjrodrig noticed that sometimes a duplicate UUID purge error was thrown. That's because there is now a chance of a race between the internal replicator and the interactive request when they both handle the exact same purge request. So, instead of throwing a duplicate UUID error, consider a duplicate purge request a no-op; that's the second commit in the PR.

nickva requested a review from jjrodrig on June 13, 2022 22:09
Review thread on rel/overlay/etc/default.ini (outdated, resolved)
nickva (Contributor, Author) commented on Jun 14, 2022

@jjrodrig, also thanks for pointing out an issue related to a race between the internal replicator and the interactive requests.

In a test cluster, with this option enabled, there is a good chance of getting a duplicate UUID badreq error from:

```erlang
IsRepl = lists:member(replicated_changes, Options),
if
    IsRepl ->
        ok;
    true ->
        UUIDs = [UUID || {UUID, _, _} <- UUIDsIdsRevs2],
        lists:foreach(
            fun(Resp) ->
                if
                    Resp == not_found ->
                        ok;
                    true ->
                        Fmt = "Duplicate purge info UIUD: ~s",
                        Reason = io_lib:format(Fmt, [element(2, Resp)]),
                        throw({badreq, Reason})
                end
            end,
            get_purge_infos(Db, UUIDs)
        )
end,
```

This looks to be a race between the internal replicator and the interactive request workers: if the internal replicator wins, the interactive request throws the error.

The fix is probably to ignore the request if it is already present on the node. These UUIDs are generated on the cluster, not by the user, so if we have already seen the UUID it means we have already processed it.
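A minimal sketch of that idea, reusing the `get_purge_infos/2` call from the snippet above (the wrapper name `filter_already_applied/2` is made up for illustration, not the merged code):

```erlang
% Sketch only: drop purge requests whose UUID this node has already seen,
% instead of throwing {badreq, ...}. Assumes get_purge_infos/2 returns results
% in request order, with not_found for UUIDs the node has never processed.
filter_already_applied(Db, UUIDsIdsRevs) ->
    UUIDs = [UUID || {UUID, _Id, _Revs} <- UUIDsIdsRevs],
    Known = get_purge_infos(Db, UUIDs),
    [Req || {Req, not_found} <- lists:zip(UUIDsIdsRevs, Known)].
```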

It seems we had forgotten to enable it, and in the case where a node is offline
when a clustered purge request is issued and then re-joins, it won't know to
process any purge requests issued to other nodes.
davisp (Member) left a comment


If memory serves, this was a belt-and-suspenders assertion to prevent us from overwriting a UUID in the purge tree, which would have correctness issues. Obviously, given that UUIDs are generated fresh internally in Fabric, the likelihood of having the same UUID used multiple times for different {Id, Revs} purge requests would be nil; however, not checking means we leave open the possibility.

The general proposed filtering approach of just ignoring any previously seen UUID is fine, but I would add a check that the {Id, Revs} portion of the request matches, to ensure we don't allow that particular bug (which would likely only ever be triggerable by operator intervention).
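Something along those lines might look like the sketch below (illustrative names and stored-info shape, not the merged diff): treat an exact match as a no-op, but still reject a known UUID that arrives with a different `{Id, Revs}` pair.

```erlang
% Sketch only: a request is a no-op only when the stored purge info carries the
% same UUID, Id, and (sorted) Revs; a known UUID with different Id/Revs is
% still rejected. The {UUID, Id, StoredRevs} shape is an assumption.
check_duplicate({_UUID, _Id, _Revs}, not_found) ->
    new;
check_duplicate({UUID, Id, Revs}, {UUID, Id, StoredRevs}) ->
    case lists:usort(Revs) == lists:usort(StoredRevs) of
        true -> duplicate;  % exact same purge request: safe to ignore
        false -> throw({badreq, "UUID reused with different revs"})
    end;
check_duplicate({UUID, _Id, _Revs}, _Other) ->
    Reason = io_lib:format("UUID ~s reused for a different purge request", [UUID]),
    throw({badreq, Reason}).
```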

nickva (Contributor, Author) commented on Jun 16, 2022

@davisp good idea to check Id and Revs!

nickva force-pushed the replicate-purges branch 3 times, most recently from 2f6975d to e32ac2a, on June 16, 2022 21:36
Review thread on src/couch/src/couch_db.erl (outdated, resolved)
nickva force-pushed the replicate-purges branch 2 times, most recently from c7f07a9 to 2e20fe6, on June 16, 2022 22:40
After enabling default purge request replication, it was possible to get a
`badreq` error because the internal replicator would race with the interactive
worker updates. Since we're dealing with cluster-generated UUIDs, if we see the
same UUID, DocId, and Revs again, assume the purge was already applied and ignore it.

For belt-and-suspenders protection, ensure that revs lists don't have duplicates
and are sorted. Do it both in fabric and couch_db in case an operator wants to
do local purges. For extra safety, also keep sorting the revs in the duplicate
check for the rare case that the existing purge btree already contains duplicate
revs. Eventually that check could be removed once all the new purge revs lists
are de-duplicated and sorted.

As a bonus, if it turns out all the purges were already applied, simply return and
avoid calling the updater process, which can be an expensive bottleneck.
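A rough sketch of those two pieces (assumed helper names; `filter_already_applied/2` is the hypothetical filter sketched earlier, and `apply_purges/2` stands in for the real updater call): normalize each revs list with `lists:usort/1` up front, then return early when every request turns out to be a duplicate so the updater process is never contacted.

```erlang
% Sketch only: de-duplicate and sort each revs list, then skip the (potentially
% expensive) updater call entirely when nothing is left to apply.
purge_docs_sketch(Db, UUIDsIdsRevs) ->
    Normalized = [{UUID, Id, lists:usort(Revs)} || {UUID, Id, Revs} <- UUIDsIdsRevs],
    case filter_already_applied(Db, Normalized) of
        [] -> {ok, []};                       % everything already purged: no-op
        ToApply -> apply_purges(Db, ToApply)  % placeholder for the updater call
    end.
```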
davisp (Member) left a comment


+1

nickva merged commit 63e0983 into main on Jun 17, 2022
nickva deleted the replicate-purges branch on June 17, 2022 18:48
jjrodrig (Contributor) commented
Thanks @nickva and @davisp
