
Use Ejson Body Instead of Compressed Body for External Size #606

Merged
tonysun83 merged 9 commits into master from 3429-external-size-dbs on Jul 15, 2017

Conversation

@tonysun83 (Contributor) commented Jun 21, 2017

Overview

In two places where we calculate the ExternalSize of the document body,
we use the Summary which is a compressed version of the doc body. We
change this to use the actual ejson body. In copy_docs we don't have
access to the #doc record so we can't access the meta where we store
the ejson body. Unfortunately, this means we have to decompress the
document body after reading it from disk.
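In code terms, the change is roughly the following sketch (illustrative, not the literal patch; Summary is the compressed body binary, and ?term_size / couch_compress:decompress are the CouchDB helpers the diffs below use):

    %% Before (sketch): external size measured on the compressed summary binary
    ExternalSize0 = ?term_size(Summary),

    %% After (sketch): decompress first, then measure the EJSON term
    EJsonBody = couch_compress:decompress(Summary),
    ExternalSize1 = ?term_size(EJsonBody),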

Testing recommendations

Performance tests were conducted to compare pre-fix and post-fix External Size values. Here are the results:

Version     Compression    DB Size    DB External Size

Pre-Fix     snappy         70 MB      46249702
Pre-Fix     snappy         700 MB     462495533
Pre-Fix     snappy         2.1 GB     1387486406

Post-Fix    snappy         70 MB      83950133
Post-Fix    snappy         700 MB     839500133
Post-Fix    snappy         2.1 GB     2518500133

Pre-Fix     deflate_6      70 MB      46249702
Pre-Fix     deflate_6      700 MB     413395945
Pre-Fix     deflate_6      2.1 GB     994598087

Post-Fix    deflate_6      70 MB      83950133
Post-Fix    deflate_6      700 MB     430495496
Post-Fix    deflate_6      2.1 GB     1291486342

GitHub issue number

COUCHDB-3429

Related Pull Requests

Checklist

  • Code is written and works correctly;
  • Changes are covered by tests;
  • Documentation reflects the changes;

@davisp (Member) commented Jun 23, 2017

Not sure what your test was there, but it's a pretty big red flag that the external sizes don't match between the compression algorithms; external size should not be affected by compression at all. You also probably want to take pre- and post-compaction measurements to make sure compaction isn't affecting it either.

@davisp (Member) left a comment

Looks pretty close, but we'll want to figure out why the snappy/deflate numbers don't agree. You'll probably also want to add a no-compression check; the external size should be the same for all of those, both before and after compaction.

Also, for this test you should probably use the node-local HTTP port so that you're not seeing effects from the clustered database layer.
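A hedged sketch of that check in Erlang (couch_db:open_int/2 and couch_db:get_db_info/1 are real couch_db calls, but the exact shape of the info list, the key names, and the shard filename are assumptions):

    %% Open one shard directly (node-local), read its size info, and compare
    %% the external size across compression settings and across a compaction
    %% cycle; all of the values should agree.
    {ok, Db} = couch_db:open_int(<<"shards/00000000-ffffffff/db1.1499817750">>, []),
    {ok, Info} = couch_db:get_db_info(Db),
    {sizes, {Sizes}} = lists:keyfind(sizes, 1, Info),
    {external, ExternalSize} = lists:keyfind(external, 1, Sizes),
    ok = couch_db:close(Db),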

-    Doc#doc{body = {summary, SummaryChunk, SizeInfo, AttsFd}}
+    Meta = Doc#doc.meta,
+    Doc#doc{body = {summary, SummaryChunk, SizeInfo, AttsFd},
+        meta = [{ejson_body, Body} | Meta]}
davisp (Member):

You should just store the size of the body here rather than pass the whole body. By doing that you're effectively nullifying the summary optimization of only passing binaries in and out of couch_db_updater.
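The contrast davisp is drawing, spelled out as a sketch (the patch later adopts the second form):

    %% As written: meta carries the whole EJSON term through couch_db_updater
    meta = [{ejson_body, Body} | Meta]

    %% Suggested: carry only the integer size across the boundary
    meta = [{ejson_size, ?term_size(Body)} | Meta]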

@@ -1086,8 +1088,15 @@ copy_docs(Db, #db{fd = DestFd} = NewDb, MixedInfos, Retry) ->
     {NewRevTree, FinalAcc} = couch_key_tree:mapfold(fun
         (_Rev, #leaf{ptr=Sp}=Leaf, leaf, SizesAcc) ->
             {Body, AttInfos} = copy_doc_attachments(Db, Sp, DestFd),
+            IsComp = couch_compress:is_compressed(Body, Compress),
+            EJsonBody = case IsComp of
davisp (Member):

This check isn't quite right, as we might have changed the compression setting and be looking at something that wasn't compressed with the current method. Just use is_binary and, if so, pass it to couch_compress:decompress.
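That suggestion as code (a sketch; this is the shape the later diff adopts):

    EJsonBody = case is_binary(Body) of
        true -> couch_compress:decompress(Body);
        false -> Body
    end,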

davisp (Member):

Ah, though I see that's based on the make_doc_summary code. The difference here is that we don't care what it's compressed as, whereas make_doc_summary is converting a possibly compressed document to a different compression scheme.

Tony Sun added 2 commits June 26, 2017 18:34
@tonysun83 (Contributor, author) commented Jun 27, 2017

@davisp: I made the changes you suggested. Also, I think the performance benchmarks were using port 5984 instead of 5986, so they were influenced by the clustered nature of things. Compression tests already existed, so I just piggybacked off that code. Let me know if this looks good, and I'll create a temporary build to check the performance numbers before merging.

@davisp (Member) left a comment

A couple of minor changes, but it looks mostly good. The only real concern is the decompress call in the compactor.

-    Doc#doc{body = {summary, SummaryChunk, SizeInfo, AttsFd}}
+    Meta = Doc#doc.meta,
+    ExternalSize = ?term_size(Body),
+    Doc#doc{body = {summary, SummaryChunk, SizeInfo, AttsFd},
davisp (Member):

Move attributes to their own lines when you wrap records. Something like:

Doc#doc{
    body = {summary, ...},
    meta = [{ejson_size, ?term_size(Body)} | Meta]
}

@@ -1076,6 +1076,7 @@ check_md5(_, _) -> throw(md5_mismatch).

 copy_docs(Db, #db{fd = DestFd} = NewDb, MixedInfos, Retry) ->
     DocInfoIds = [Id || #doc_info{id=Id} <- MixedInfos],
+    Compress = NewDb#db.compression,
davisp (Member):

Not seeing where this is used.

    {ejson_size, ExternalSize} ->
        ExternalSize;
    false ->
        couch_compress:decompress(Summary)
davisp (Member):

This is a bug, right? You should be doing ?term_size(couch_compress:decompress(Summary)).

tonysun83 (Contributor, author):

I call ExternalSize = ?term_size(EJsonBody) further down.

davisp (Member):

Then the ejson_size you calculated is going to reduce all doc bodies to 4 bytes or whatever, since you're returning an integer?

davisp (Member):

Also looking at the code I don't see where you call it.

tonysun83 (Contributor, author):

wups, I was looking at the other place
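Putting davisp's point together with the snippet above, the corrected case presumably looks like this (a sketch; the lists:keyfind lookup is an assumption inferred from the false -> branch):

    ExternalSize = case lists:keyfind(ejson_size, 1, Meta) of
        {ejson_size, Size} ->
            Size;
        false ->
            %% legacy on-disk doc: decompress, then measure the EJSON term
            ?term_size(couch_compress:decompress(Summary))
    end,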

@@ -1086,8 +1087,14 @@ copy_docs(Db, #db{fd = DestFd} = NewDb, MixedInfos, Retry) ->
     {NewRevTree, FinalAcc} = couch_key_tree:mapfold(fun
         (_Rev, #leaf{ptr=Sp}=Leaf, leaf, SizesAcc) ->
             {Body, AttInfos} = copy_doc_attachments(Db, Sp, DestFd),
+            EJsonBody = case is_binary(Body) of
davisp (Member):

I see why we need to do this for existing shards that may have an incorrect value; however, I'm not a huge fan of this being a permanent change rather than just an upgrade path. It would be nice if we could figure out how to special-case this so it only runs the decompression when we need to upgrade.
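One hypothetical way to special-case it (needs_size_upgrade and stored_external_size are invented for illustration; this is not part of the PR):

    %% Hypothetical: only pay the decompression cost when the on-disk
    %% size info predates the fix and must be recomputed.
    ExternalSize = case needs_size_upgrade(SizeInfo) of
        true ->
            ?term_size(couch_compress:decompress(Body));
        false ->
            stored_external_size(SizeInfo)   %% hypothetical accessor
    end,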

@davisp (Member) left a comment

+1

@tonysun83 tonysun83 merged commit dce6e34 into master Jul 15, 2017
@tonysun83 tonysun83 deleted the 3429-external-size-dbs branch July 15, 2017 17:27
rnewson added a commit that referenced this pull request Jul 15, 2017
rnewson added a commit that referenced this pull request Jul 15, 2017
tonysun83 added a commit to cloudant/couchdb that referenced this pull request Jul 19, 2017
Use ejson body instead of compressed body for external size (COUCHDB-3429)
nickva pushed a commit to nickva/couchdb that referenced this pull request Sep 7, 2022