Recovering databases with purges not replicated after nodes in maintenance_mode #4844
Thanks for reaching out @Bolebo. The issue, I think, is that the internal replicator has already checkpointed that it synchronized the shards, and so it won't re-apply the changes. One way to force it to go through all the changes again could be to delete the internal replicator checkpoints and then force a resync. Your database is quite large, so it may take a while to resync. Before attempting it, try it on a small db if you have any with the same issue, and perhaps make a backup. At the very least, make sure the db is not being accessed or serving traffic while the resync takes place (most changes except the purges will already be there, so it's mostly just the time it takes to run through the changes feed and call the target to compute revision differences). Some details on how to do it:
Example — first, find the shards and where their copies live:
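A minimal sketch, assuming a database named `mydb` and `$COUCH_URL` pointing at any cluster node with admin credentials:

```sh
# Fetch the shard map: the shard ranges and the nodes hosting each copy
curl -s "$COUCH_URL/mydb/_shards"
```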
This shows each shard range and, for each range, the nodes holding a copy of it.
For instance, to inspect the internal replication checkpoints of one shard copy:
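A sketch using the node-local API; the shard name below is hypothetical, and `_local/shard-sync-*` is the checkpoint doc naming I'd expect from recent CouchDB versions:

```sh
# List the _local docs of one shard copy via the node-local API.
# The internal replicator checkpoints are the _local/shard-sync-* docs
# (purge replication also keeps _local/purge-mem3-* checkpoint docs).
curl -s "$COUCH_URL/_node/_local/shards%2F00000000-7fffffff%2Fmydb.1638301200/_local_docs"
```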
It's a Q=2, N=3 db, so this shard has 2 copies on other nodes. There is a checkpoint for replication to and from each copy, for a total of 4 checkpoints in this shard. Each shard copy should have 4 checkpoint local docs. Note: these local checkpoint docs can only be reliably accessed via the node-local API; they are not accessible via the regular clustered API (`mydb/_local_docs`).
Note: I had to URL-escape the `/` characters in the shard path (as `%2F`). You'd do this on every node, on every shard copy. Also, ensure there aren't any reads or writes hitting the database at that time.
Example — delete the checkpoints, then trigger a resync:
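A sketch with the same hypothetical shard name; take the checkpoint doc ids and their revs from the `_local_docs` listing above:

```sh
SHARD="shards%2F00000000-7fffffff%2Fmydb.1638301200"
DOC="_local/shard-sync-0123456789abcdef-fedcba9876543210"  # substitute a real id
REV="0-1"                                                  # substitute the real rev

# Delete each internal replication checkpoint doc,
# on every node, for every shard copy
curl -s -X DELETE "$COUCH_URL/_node/_local/$SHARD/$DOC?rev=$REV"

# Then force the shard copies to resynchronize
curl -s -X POST "$COUCH_URL/mydb/_sync_shards"
```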
This should hopefully force all your purges to replicate between your nodes.
First of all, thank you for your quick answer; as this is a production issue, it is very much appreciated! I tried your solution on a test environment, but it doesn't work as expected.
Clearly, the node couchdb5 was the one that "missed" 2996 purges during maintenance mode. After applying your method, the node is coherent again, but the purged documents have been reintroduced into the database (I expected 1804869 docs on each node). Perhaps I did something wrong.
If you have another idea, it is very welcome.
Thanks for trying @Bolebo, and sorry it didn't work. Looking at the difference in the number of docs, and the fact that it is larger than 1000, I wonder if the purges had already been removed by compaction. We only keep up to 1000 purge records per shard by default, so once compaction has run, older purges are no longer available for the internal replicator to replay.
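For reference, that retention limit can be read, and raised, per database; a sketch:

```sh
# Read the number of purge records kept per shard (defaults to 1000)
curl -s "$COUCH_URL/mydb/_purged_infos_limit"

# Raise it, e.g. ahead of a large purge campaign
curl -s -X PUT "$COUCH_URL/mydb/_purged_infos_limit" -d '5000'
```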
That could be worth trying! Perhaps try it on a test instance first. So: delete the checkpoint docs again and resync. If this doesn't work, the simplest approach may just be to re-purge all the docs which should be deleted. It's safe to issue purges for doc revisions which are already purged; they'll just be processed by all the views, and during compaction only the last 1000 will remain. You probably already know this, but if you re-purge all the missed docs, pay attention to the per-request purge limits and batch the requests accordingly.
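If you go the re-purge route, a minimal sketch (the doc id and rev below are made up; the request body maps each doc id to the list of revs to purge):

```sh
# Re-issue purges through the clustered API. Purging an
# already-purged revision is safe; it is processed as a new request.
curl -s -X POST "$COUCH_URL/mydb/_purge" \
  -H "Content-Type: application/json" \
  -d '{"some-doc-id": ["3-abcdef0123456789abcdef0123456789"]}'
```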
Thank you for your answer. I have a final thing to try: is there any risk or side effect you know about if I simply physically delete the shard file on node 5? I expect it will resynchronize from the 2 remaining nodes, without the documents that were already purged on those 2 nodes. Thanks in advance for your (final) answer.
Just curious, did you try re-purging them? It seems like that should have worked. But I can understand that it might be tricky to figure out which ones to re-purge. Internally, purge requests get a uuid assigned to them, so new requests, even for the same doc_id and rev, are processed as new ones. But I can see that if you know the exact shard with the issue, your resyncing idea is much easier.
That should work. To practice, try it on a test instance first, just in case. But that's the standard recovery path if a shard copy is lost or corrupted: it should be resynced.
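A sketch of that recovery path, assuming a package install with the default data directory and a systemd service (the paths, node name, and shard file name are placeholders):

```sh
# On the affected node (couchdb5): stop the node and remove the bad shard copy
sudo systemctl stop couchdb
rm /opt/couchdb/data/shards/00000000-7fffffff/mydb.1638301200.couch
sudo systemctl start couchdb

# From any node, force internal replication to rebuild the missing copy
curl -s -X POST "$COUCH_URL/mydb/_sync_shards"
```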
Unfortunately, I can't re-purge them because I basically don't know which documents were purged (purges are triggered by end users). Thank you for your support. It was very useful to have a way to consult individual shard properties/documents on each node! To my knowledge, it is not documented, but it is essential for this kind of analysis. Best regards,
Description
I have a cluster with 6 nodes, with big databases (> 220M documents).
Each database sees many creations, updates and purges (no deletions). Exceptionally, I put my nodes in maintenance_mode for a long time (16h) and discovered that purges were not replicated to the nodes with maintenance_mode set to true.
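For reference, this is how a node is put into maintenance mode (a sketch; the node name is a placeholder):

```sh
# Put a node into maintenance mode: it is skipped for interactive
# requests while (normally) still receiving internal replication
curl -s -X PUT \
  "$COUCH_URL/_node/couchdb@couchdb5.example.com/_config/couchdb/maintenance_mode" \
  -d '"true"'
```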
After a quick search, I found this issue (#2139), which corresponds to my anomaly.
I've upgraded a test environment to v3.3.2, and the anomaly is indeed fixed for new purges. But is there a way to recover the old purges, and how should I do it?
Thanks for your support.
Expected Behaviour
Purged documents are purged on every node.