Why do deleted documents appear in the return result of executing the _find command immediately? #5090

Closed
lpcy opened this issue Jun 16, 2024 · 8 comments


lpcy commented Jun 16, 2024

First, I delete a document with a DELETE request. If I then immediately run _find, the deleted document still appears in the result. When I run _find again later, it is gone.

Expected Behaviour

Once the document is deleted, it should immediately disappear from the results of _find.

Your Environment

  • CouchDB version used: 3.3.3
  • Browser name and version: Chrome 126
  • Operating system and version: Windows 10

Contributor

big-r81 commented Jun 16, 2024

Do you have a minimal working example, like a script to reproduce this?

Member

rnewson commented Jun 16, 2024

This is possible for a period of time if N>1 (i.e., you have a cluster, not a standalone single node). Once the DELETE has happened at all N nodes, subsequent queries (assuming you didn't specify update=false or stale=ok) will not return that document. There can be a period where a DELETE has completed (i.e., you got a 200 OK response) but one or more nodes have not yet processed it. A _find at that time might get its response from one of those nodes, because all queries are inherently R=1: they read just one of the copies.
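
The timing window described above can be illustrated with a toy model. This is only a sketch under loud assumptions: each replica is a plain dictionary, and the helper names (`make_cluster`, `delete_on`, `find_live`) are hypothetical. It does not reflect how CouchDB is implemented; it only shows why a read served by a single lagging copy can still return the document.

```python
# Toy model only: each replica is a dict of doc_id -> deleted flag.
# This is NOT CouchDB's implementation; it only illustrates the timing.

def make_cluster(n, doc_id):
    """Create n replicas that all hold a live copy of doc_id."""
    return [{doc_id: False} for _ in range(n)]

def delete_on(replicas, doc_id, node_indexes):
    """Mark doc_id as deleted on the given replicas only."""
    for i in node_indexes:
        replicas[i][doc_id] = True

def find_live(replicas, node_index):
    """Read from a single replica (R=1) and return its non-deleted doc IDs."""
    return [d for d, deleted in replicas[node_index].items() if not deleted]

cluster = make_cluster(3, "doc1")
delete_on(cluster, "doc1", [0, 1])   # two of three nodes have processed the DELETE
stale = find_live(cluster, 2)        # query happens to be answered by the lagging node
delete_on(cluster, "doc1", [2])      # the third node catches up
fresh = find_live(cluster, 2)        # the same node no longer returns the document
print(stale, fresh)
```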

Author

lpcy commented Jun 17, 2024

Thank you. I'm sorry, I realized that I made an error: the delete in my code is asynchronous.
That raises a new question: is the tombstone information retained forever? Is there any way to clean it up? The approach I found by searching online is to replicate to a new database while excluding deleted documents, which seems cumbersome. Currently, I am using a single node.

Author

lpcy commented Jun 17, 2024

Regarding the request for a minimal working example: I rewrote the script and found that it works properly. Sorry, the problem was on my side: the delete call is asynchronous.

Member

rnewson commented Jun 17, 2024

"Tombstone" is a loose term; more precisely, it is a document with the deleted flag set to true, and it may contain other data. Tombstones are preserved forever, just as non-deleted documents are, to ensure that replication works correctly. You can replicate with a filter to drop them (or any other subset of documents) as long as you're aware of the consequence: the target never learns about those deletions.

Delete is not asynchronous (any more than document create or update is). I was referring to the way we only wait for the first 2 of the total 3 responses in a cluster of 3 or more nodes, which seems not to apply in your case.

If this is a single-node setup, then your opening comment is a bit more interesting. When the DELETE response is returned, the document has been marked as deleted, so any subsequent request should reflect that, including indexes (_view, _find, etc.). Are you querying with the stale=ok or update=false parameters? Assuming not, how long is the delay between the deleted document appearing in results after deletion and it finally being gone?

Author

lpcy commented Jun 17, 2024

Thank you for your patient answer. What I meant was that my original problem came from my own code: I issued the DELETE from asynchronous JavaScript code, so _find ran concurrently with it and retrieved the old data. This issue can be ignored.
The "tombstone" you mentioned exists for replication, but if I have no replication requirements, is there a simple command to clear tombstones? Will leaving them in place affect database performance?
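
The client-side race described here can be sketched as follows. The original code was JavaScript and is not shown in this thread, so this is an analogous illustration in Python's asyncio under stated assumptions: `FakeDb` is a hypothetical in-memory stand-in (not a CouchDB client), and the presumed cause is that the query was started before the delete had finished.

```python
import asyncio

class FakeDb:
    """Hypothetical in-memory stand-in for a database; not a CouchDB client."""

    def __init__(self, doc_ids):
        self.doc_ids = set(doc_ids)

    async def delete(self, doc_id):
        await asyncio.sleep(0.01)      # simulated network round trip
        self.doc_ids.discard(doc_id)

    async def find(self):
        await asyncio.sleep(0)         # yield to the event loop once
        return sorted(self.doc_ids)

async def racy_find():
    """Start the delete but query before it has completed."""
    db = FakeDb(["a", "b"])
    pending = asyncio.create_task(db.delete("a"))  # started, not awaited yet
    result = await db.find()                       # still sees "a"
    await pending
    return result

async def sequential_find():
    """Wait for the delete to finish, then query."""
    db = FakeDb(["a", "b"])
    await db.delete("a")
    return await db.find()

if __name__ == "__main__":
    print(asyncio.run(racy_find()))
    print(asyncio.run(sequential_find()))
```

In JavaScript the equivalent fix would be to await the promise returned by the delete request before issuing the _find request.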

Member

rnewson commented Jun 17, 2024

Ah, that makes sense, thank you for clarifying.

If you don't need to keep deleted documents because you never replicate, you can use the purge endpoint. The main downside of keeping them is the disk space they continue to occupy. This is quite small (assuming you used the DELETE method, which also empties the document body), but it is not zero.
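
As a rough sketch of using the purge endpoint: `POST /{db}/_purge` takes a JSON object that maps document IDs to lists of revisions. The helper names below (`build_purge_body`, `purge`), the sample rows, and the idea of collecting revisions from `_changes` rows are illustrative assumptions, not something prescribed in this thread. Authentication and error handling are omitted, and `purge` is defined but not called.

```python
import json
import urllib.request

def build_purge_body(change_rows):
    """Map each deleted doc ID to the revisions listed in its _changes row."""
    body = {}
    for row in change_rows:
        if row.get("deleted"):
            revs = [change["rev"] for change in row["changes"]]
            body.setdefault(row["id"], []).extend(revs)
    return body

def purge(base_url, db, body):
    """POST the body to /{db}/_purge (hypothetical helper; auth omitted)."""
    request = urllib.request.Request(
        f"{base_url}/{db}/_purge",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# Sample rows shaped like _changes results; the IDs and revisions are made up.
rows = [
    {"seq": "1-abc", "id": "doc1", "changes": [{"rev": "2-aaa"}], "deleted": True},
    {"seq": "2-def", "id": "doc2", "changes": [{"rev": "1-bbb"}]},
]
purge_body = build_purge_body(rows)
print(purge_body)
```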

Alternative strategies:

  1. If your data is temporal/time-based, you could create a database for each distinct time period (say, monthly). When your oldest database contains only deleted documents, you simply delete the entire database.
  2. Periodically replicate the database to a new database with a filter that rejects deleted documents, then switch usage to the new database.
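
Both strategies can be sketched as plain data, assuming names chosen purely for illustration: the `events` prefix, the `_design/cleanup` design document, and the `live_only` filter are hypothetical. The first helper builds a per-month database name; the other two build a design document with a filter function and a request body for `POST /_replicate` that references it.

```python
from datetime import date

def monthly_db_name(prefix, day):
    """Name of the per-month database for the given date, e.g. events_2024_06."""
    return f"{prefix}_{day:%Y_%m}"

def live_docs_filter_ddoc():
    """Design document whose filter function rejects deleted documents."""
    return {
        "_id": "_design/cleanup",
        "filters": {
            # The filter itself is JavaScript, stored as a string.
            "live_only": "function(doc, req) { return !doc._deleted; }",
        },
    }

def replication_body(source, target):
    """Body for POST /_replicate that applies the filter on the source."""
    return {
        "source": source,
        "target": target,
        "filter": "cleanup/live_only",
        "create_target": True,
    }

print(monthly_db_name("events", date(2024, 6, 17)))
print(replication_body("mydb", "mydb_clean"))
```

The design document would need to be saved in the source database before the replication is started, since the filter runs there.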

Author

lpcy commented Jun 17, 2024

Thank you again for your answer, I've got it.

lpcy closed this as not planned on Jun 17, 2024