Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Delete duplicated key/value pairs recursively #19

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

lixiaoy1
Copy link

@lixiaoy1 lixiaoy1 commented Jul 12, 2017

This is to implement the idea: http:https://pad.ceph.com/p/rocksdb-wal-improvement
Add a new flush style called kFlushStyleDedup which users can config by setting
flush_style=kFlushStyleDedup and write_buffer_number_to_flush. Flush
is triggered while number of non-flushed memtables is >= min_write_buffer_number_to_merge.
When flushing memtables to L0, it merges write_buffer_number_to_flush memtables
and deduping against other non-flushed memtables.

With bluestore fio plugin to do 4k random write test, I compared the two stats
db.get.micros and db.write.micros from rocksdb every ten minutes. db.get.micros
is the average of get operations, and db.write.micros is the average time of put/delete
functions. From the result, rocksdb read performance can be improved up to 38%,
and write performance can be improved to 15%. More detailed info please see the excel:
https://drive.google.com/drive/folders/0B6jqFc7e2yxVdUQ2aEpCR3ItbG8

And to complement, the io perfermance of bluestore fio doesn't improve much. From time of
txc states, the time spends a lot in kv_state_commiting_lat (up to 80%), but submit_transaction and
submit_transaction_sync takes little time compared.

Signed-off-by: Xiaoyan Li [email protected]

@tchaikov
Copy link

will this change go to upstream?

@lixiaoy1
Copy link
Author

As this idea is from Sage http:https://pad.ceph.com/p/rocksdb-wal-improvement, at first hope Sage or you can review the changes.

@lixiaoy1
Copy link
Author

@liewegas Please help to review.

@@ -236,6 +236,9 @@ class ColumnFamilyData {
MemTable* mem() { return mem_; }
Version* current() { return current_; }
Version* dummy_versions() { return dummy_versions_; }
void set_stop() { stop = true; }
bool is_stop() { return stop; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the stop flag needed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is used to decide whether choosing all immutable tables to flush when MemTableList::PickMemtablesToFlush. In the new flush style, if the db process is going to shutdown, it needs to flush all existed immutable memtables. https://github.com/ceph/rocksdb/pull/19/files#diff-f8dc8f7b1ea2cb77b0eb9b156bc434f2R133

db/builder.cc Outdated
assert((column_family_id ==
TablePropertiesCollectorFactory::Context::kUnknownColumnFamily) ==
column_family_name.empty());
// Reports the IOStats for flush for every following bytes.
const size_t kReportFlushIOStatsEvery = 1048576;
Status s;
uint64_t num_duplicated = 0, num_total=0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: spaces around =

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update it later.

db/builder.cc Outdated
@@ -138,17 +140,52 @@ Status BuildTable(
&snapshots, earliest_write_conflict_snapshot, env,
true /* internal key corruption is not ok */, range_del_agg.get());
c_iter.SeekToFirst();

ParsedInternalKey c_ikey, comp_ikey;
if(compare_iter != nullptr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: space between if and (

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update it later.

db/builder.cc Outdated
break;
}
else if(result > 0)
compare_iter->Next();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: braces around single statement blocks

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update it later.

@liewegas
Copy link
Member

Yay!

This looks reasonable to me, but I have pretty limited rocksdb experience so it really needs to be looked at by those folks. I'd fix the style issues before sending it there since they'll undoubtable complain about it. (Also, the automated github pull request checks include a style check.)

I don't understand why the new set_stop/is_stop checks are needed; what makes this change need it?

A description of the approach and new flag somewhere is probably needed. It looks like you are always flushing a single memtable at a time and always deduping against all others. We might need a tunable there? (In contrast, the current behavior is to flush all memtables? That doesn't seem quite right as there are a few tunables controlling the how many to flush at a time... sorry, I'm not very familiar with this code.)

The expected savings is probably most evident in the overall write-amp and amount of data going into level0. Did you happen to look at this metrics? @markhpc might know which values to look for.

Thanks!

@lixiaoy1
Copy link
Author

The current behavior is in following scenarios it begins to flush:

  1. When the number of non-flushed immutable memtables is same or bigger than min_write_buffer_number_to_merge.
    https://github.com/ceph/rocksdb/blob/master/db/memtable_list.cc#L273
    https://github.com/ceph/rocksdb/blob/master/db/db_impl_compaction_flush.cc#L1094
  2. Other scenario, like shutdown etc. It needs to flush all memtables.

In this branch, it changes scenario 1. You are right. In this patch, it flushes a single memtable at a time and always deduping against other min_write_buffer_number_to_merge-1 memtables. I agree that we can make it more flexible: use min_write_buffer_number_to_merge to indicate how many memtables to flush, and add another options to specify how many other memtables to compare/dedup. And sum of these two numbers should be <= max_write_buffer_number.

The branch decreases the data flushed to L0. When min_write_buffer_number_to_merge is set 2, the following are summary of L0 sst files every 10 minutes from my tests:
Log in this branch:
file number | file size(MB)
8 | 25.86
7 | 291.6
5 | 282.9
5 | 311.68
4 | 260.32
4 | 264.43
4 | 266.39
4 | 267.35
0 | 0
4 | 263.79

Master branch:
file number | file size(MB)
4 | 179
2 | 301.83
2 | 346.32
2 | 346.32
0 | 0
2 | 362.69
2 | 365.34
2 | 365.88
2 | 364.02
2 | 361.99
2 | 359.07

@markhpc
Copy link
Member

markhpc commented Jul 13, 2017

Awesome work Li! I'm really excited to take a look at this.

@liuchang0812
Copy link

awesome work. actually the minor-compaction which called kFlushStyleDedup is similar with L0 sstable's compaction. we could set options.level0_file_num_compaction_trigger = 1; to get similar result.

@lixiaoy1
Copy link
Author

level0_file_num_compaction_trigger=1 means to trigger L0->L1 compaction when number of l0 sst files is 1. I think their mechanism are different. Meanwhile if you set level0_file_num_compaction_trigger as 1, more duplicated data will put into l2.

@lixiaoy1
Copy link
Author

@liewegas @markhpc and other reviewers, I will be out of office next week(15th - 23th July), please give your comments. And I will respond after I come back.

@liuchang0812
Copy link

@lixiaoy1 oh....sorry, level0_file_num_compaction_trigger=2

@lixiaoy1
Copy link
Author

@liuchang0812 I think you mean min_write_buffer_number_to_merge. When setting min_write_buffer_number_to_merge as 2, the test result on this branch is better. You may have a look at
https://drive.google.com/drive/folders/0B6jqFc7e2yxVdUQ2aEpCR3ItbG8
If you looked into whole io path, they are almost same.

@lixiaoy1
Copy link
Author

lixiaoy1 commented Aug 8, 2017

@markhpc Could you please have a look?

@lixiaoy1 lixiaoy1 force-pushed the dedup branch 2 times, most recently from dad3d2a to bf29157 Compare August 22, 2017 01:46
This is to implement the idea: http:https://pad.ceph.com/p/rocksdb-wal-improvement
Add a new flush style called kFlushStyleDedup which users can config by setting
flush_style=kFlushStyleDedup and write_buffer_number_to_flush. Flush
is triggered while number of non-flushed memtables is >= min_write_buffer_number_to_merge.
When flushing memtables to L0, it merges write_buffer_number_to_flush memtables
and deduping against other non-flushed memtables.

With bluestore fio plugin to do 4k random write test, I compared the two stats
db.get.micros and db.write.micros from rocksdb every ten minutes. db.get.micros
is the average of get operations, and db.write.micros is the average time of put/delete
functions. From the result, rocksdb read performance can be improved up to 38%,
and write performance can be improved to 15%. More detailed info please see the excel:
https://drive.google.com/drive/folders/0B6jqFc7e2yxVdUQ2aEpCR3ItbG8

And to complement, the io perfermance of bluestore fio doesn't improve much. From time of
txc states, the time spends a lot in kv_state_commiting_lat (up to 80%), but submit_transaction and
submit_transaction_sync takes little time compared.

Signed-off-by: Xiaoyan Li [email protected]
@lixiaoy1
Copy link
Author

lixiaoy1 commented Sep 4, 2017

This pr is based on Rocksdb@e15382c09c87a65eaeca9bda233bab503f1e5772, I submitted a new PR to facebook/Rocksdb facebook#2829, which is based on Rocksdb master branch.

@markhpc
Copy link
Member

markhpc commented Sep 22, 2017

Hi Li,

Sorry, I haven't gotten a chance to test this yet. Have you gotten any feedback from the rocksdb folks?

@lixiaoy1
Copy link
Author

@markhpc currently Andrew Kryczka from Facebook asked some questions. What I am doing now is that I get the kv operation sequences out from BlueStore when running fio+librbd, and use the kv operation seqs to feed into Rocksdb, and check how Rocksdb is improved. lixiaoy1/ceph@e6a4e8d

@markhpc
Copy link
Member

markhpc commented Oct 11, 2017

Hi Li,

I'm still swamped, but I wanted to mention that we're going to hopefully look at this soon (either me or someone else). Any update from the rocksdb folks?

Mark

@lixiaoy1
Copy link
Author

lixiaoy1 commented Oct 12, 2017 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants