Use Ejson Body Instead of Compressed Body for External size #606
Conversation
In two places where we calculate the ExternalSize of the document body, we use the Summary which is a compressed version of the doc body. We change this to use the actual ejson body. In copy_docs we don't have access to the #doc record so we can't access the meta where we store the ejson body. Unfortunately, this means we have to decompress the document body after reading it from disk. COUCHDB-3429
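To illustrate why measuring the compressed summary skews the number, here is a minimal Python sketch using zlib as a stand-in for CouchDB's snappy/deflate codecs (illustrative only, not CouchDB code):

```python
import zlib

# A repetitive JSON-ish document body compresses well, so measuring the
# compressed bytes badly understates the "external" (logical) size.
body = b'{"field": "value", ' * 200 + b'"end": true}'

external_size = len(body)                    # size of the actual body
compressed_size = len(zlib.compress(body))   # size of the stored summary

# The external size is a property of the body itself; the compressed
# size depends on the codec and its settings.
assert compressed_size < external_size
```

The external size should therefore be computed from the decompressed ejson body, not from whatever bytes happen to be on disk.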
Not sure what your test was there, but that's a pretty big red flag that the external sizes don't match between different compression algorithms, since external size should not be affected by compression. You'll also want a pre/post-compaction measurement to make sure compaction isn't affecting it either.
Looks pretty close but we'll want to figure out why the snappy/deflate numbers don't agree. You'll probably also want to add a no-compression check as well. External should be the same for all of those and both before and after compaction.
Also for this test you should probably look at the node local HTTP port for testing so that you're not getting effects from a clustered database.
src/couch/src/couch_db.erl
Outdated
Doc#doc{body = {summary, SummaryChunk, SizeInfo, AttsFd}}
Meta = Doc#doc.meta,
Doc#doc{body = {summary, SummaryChunk, SizeInfo, AttsFd},
    meta = [{ejson_body, Body} | Meta]}
You should just store the size of the body here rather than pass the whole body. By doing that you're effectively nullifying the summary optimization of only passing binaries in and out of couch_db_updater.
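A rough sketch of this suggestion in Python terms (hypothetical names; the real change belongs in couch_db.erl, and `term_size` here stands in for CouchDB's `?term_size` macro):

```python
import json

def term_size(body):
    # Stand-in for CouchDB's ?term_size macro: some measure of the
    # uncompressed body's size. Serialized length is used here purely
    # for illustration.
    return len(json.dumps(body).encode())

def prepare_doc(doc, body):
    # Instead of stashing the whole decompressed body in the doc meta
    # (which defeats the optimization of only passing small binaries in
    # and out of couch_db_updater), store just the precomputed integer.
    doc["meta"] = [("ejson_size", term_size(body))] + doc.get("meta", [])
    return doc

doc = prepare_doc({"id": "a"}, {"field": "value"})
```

Passing a single integer keeps the message small while still letting the updater record an accurate external size.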
src/couch/src/couch_db_updater.erl
Outdated
@@ -1086,8 +1088,15 @@ copy_docs(Db, #db{fd = DestFd} = NewDb, MixedInfos, Retry) ->
    {NewRevTree, FinalAcc} = couch_key_tree:mapfold(fun
        (_Rev, #leaf{ptr=Sp}=Leaf, leaf, SizesAcc) ->
            {Body, AttInfos} = copy_doc_attachments(Db, Sp, DestFd),
            IsComp = couch_compress:is_compressed(Body, Compress),
            EJsonBody = case IsComp of
This check isn't quite right as we might have changed the compression and we're looking at something that's not been compressed with the same method. Just use is_binary and then pass it to couch_compress:decompress if so.
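The suggested pattern, sketched in Python with bytes standing in for a compressed binary (illustrative analogues only; `is_binary/1` and `couch_compress:decompress/1` are the real Erlang calls):

```python
import zlib

def ensure_decompressed(body):
    # Mirrors the review suggestion: don't ask *which* codec compressed
    # the body, just check whether it is still a compressed binary and
    # decompress it if so.
    if isinstance(body, bytes):        # analogue of Erlang's is_binary/1
        return zlib.decompress(body)   # analogue of couch_compress:decompress/1
    return body                        # already a decoded term

raw = {"field": "value"}
packed = zlib.compress(b'{"field": "value"}')
```

This sidesteps the problem of the stored body having been written under a different compression setting than the database currently uses.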
Ah, though I see that's based off the make_doc_summary code. The difference here is that we don't care what it's compressed as, whereas make_doc_summary is converting a possibly compressed document to a different compression scheme.
@davisp: I made the changes you suggested. Also, I think the performance benchmarks were using port 5984 instead of 5986, so they were influenced by the clustered nature of things. Compression tests existed, so I just piggybacked off that code. Let me know if this looks good, and I'll create a temporary build to check the performance numbers before merging.
Couple minor changes but it looks mostly good. Only real concern is the decompress call in the compactor.
src/couch/src/couch_db.erl
Outdated
Doc#doc{body = {summary, SummaryChunk, SizeInfo, AttsFd}}
Meta = Doc#doc.meta,
ExternalSize = ?term_size(Body),
Doc#doc{body = {summary, SummaryChunk, SizeInfo, AttsFd},
Move attributes to their own lines when you wrap records. Something like:
Doc#doc{
body = {summary, ...},
meta = [{ejson_size, ?term_size(Body)} | Meta]
}
src/couch/src/couch_db_updater.erl
Outdated
@@ -1076,6 +1076,7 @@ check_md5(_, _) -> throw(md5_mismatch).

copy_docs(Db, #db{fd = DestFd} = NewDb, MixedInfos, Retry) ->
    DocInfoIds = [Id || #doc_info{id=Id} <- MixedInfos],
    Compress = NewDb#db.compression,
Not seeing where this is used.
src/couch/src/couch_db_updater.erl
Outdated
{ejson_size, ExternalSize} ->
    ExternalSize;
false ->
    couch_compress:decompress(Summary)
This is a bug, right? You should be doing ?term_size(couch_compress:decompress(Summary)).
I call ExternalSize = ?term_size(EJsonBody) further down.
Then the ejson_size you calculated is going to reduce all doc bodies to 4 bytes or whatever, since you're returning an integer?
Also looking at the code I don't see where you call it.
wups, I was looking at the other place
src/couch/src/couch_db_updater.erl
Outdated
@@ -1086,8 +1087,14 @@ copy_docs(Db, #db{fd = DestFd} = NewDb, MixedInfos, Retry) ->
    {NewRevTree, FinalAcc} = couch_key_tree:mapfold(fun
        (_Rev, #leaf{ptr=Sp}=Leaf, leaf, SizesAcc) ->
            {Body, AttInfos} = copy_doc_attachments(Db, Sp, DestFd),
            EJsonBody = case is_binary(Body) of
I see why we need to do this for existing shards that may have an incorrect value; however, I'm not a huge fan of this being a permanent change rather than something that's only around for an upgrade. It would be nice if we could figure out how to special-case this so it only runs the decompression when we need to upgrade.
+1
Overview
In two places where we calculate the ExternalSize of the document body,
we use the Summary which is a compressed version of the doc body. We
change this to use the actual ejson body. In copy_docs we don't have
access to the #doc record so we can't access the meta where we store
the ejson body. Unfortunately, this means we have to decompress the
document body after reading it from disk.
Testing recommendations
Performance tests were conducted to compare pre-fix and post-fix External Size values.
GitHub issue number
COUCHDB-3429