Feature/user partitioned databases #1605
Conversation
The majority of tests for partitions are currently here: https://github.com/cloudant/partition_tests
@garrensmith would you consider merging the tests into our elixir branch?
@wohali sure, I can do that once this PR is a little further down the line
Force-pushed from 80a0685 to ba78cec
@garrensmith I notice that the tests in https://github.com/cloudant/partition_tests/tree/master/test don't have a common prefix, which although fine for now, I think will become less cohesive once they are merged with the main elixir suite. Would you consider renaming them so they all start with
Force-pushed from ba78cec to 8b7f146
This props list is recorded in each database shard as well as the shard document in the special _dbs database. Co-authored-by: Garren Smith <[email protected]> Co-authored-by: Robert Newson <[email protected]>
Force-pushed from 8b7f146 to fcb1795
Force-pushed from 3e14644 to 07bacf0
% cannot partition a system database
validate_partition_database_create(DbName, Partitioned) ->
    SystemId = DbName =:= ?l2b(config:get("mem3", "shards_db", "_dbs")) orelse
        lists:member(DbName, ?SYSTEM_DATABASES),
This will not work for cases when the database name contains /. For example, the following would be broken (example taken from #1647 (comment), where one of CouchDB's users mentions their use case):
account%2F85%2Fea%2F6075c6c1e266f8512e2233541bdb-201807
I think this should use couch_db:is_systemdb(DbName), which abstracts that logic (keep in mind that there is an open PR which fixes it a little: #1647).
We would need to make the is_systemdb function public, and possibly rename it, since we have a public is_system_db already.
validate_docid(DocId, DbName, Options) ->
    SystemId = DbName =:= ?l2b(config:get("mem3", "shards_db", "_dbs")) andalso
        lists:member(DocId, ?SYSTEM_DATABASES),
Should we use couch_db:is_systemdb(DbName) here? We would need to make that function public.
-define(LOWEST_KEY, null).
-define(HIGHEST_KEY, {<<255, 255, 255, 255>>}).
-define(PARTITION_START(P), <<P/binary, $:>>).
-define(PARTITION_END(P), <<P/binary, $;>>).
You use a semicolon here; is it intentional or a typo?
Intentional: it is the next character in order after : and therefore marks the end of the partition key range.
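A quick illustration of the byte-ordering trick being discussed (Python is used here purely for illustration; the actual range markers are Erlang binaries, and the `foo` partition name is a made-up example):

```python
# ASCII ':' is 0x3A and ';' is 0x3B, so every key that begins with
# b"foo:" sorts strictly between the range markers b"foo:" and b"foo;".
start = b"foo:"
end = b"foo;"

keys = [b"foo:0001", b"foo:zzzz", b"foo:" + b"\xff" * 4]
for key in keys:
    # All keys of partition "foo" fall inside the half-open range.
    assert start <= key < end

# A key from a different partition falls outside the range.
assert not (start <= b"bar:0001" < end)
```

This is why a range scan from `<<P/binary, $:>>` to `<<P/binary, $;>>` captures exactly one partition's keys and nothing else.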
@@ -43,3 +44,6 @@ design_handler(<<"_update">>) -> fun chttpd_show:handle_doc_update_req/3;
design_handler(<<"_info">>) -> fun chttpd_db:handle_design_info_req/3;
design_handler(<<"_rewrite">>) -> fun chttpd_rewrite:handle_rewrite_req/3;
design_handler(_) -> no_match.

partition_design_handler(<<"_view">>) -> fun chttpd_view:handle_partition_view_req/4;
partition_design_handler(_) -> no_match.
I didn't test it yet, but I am pretty sure we need partition_handler(_) -> no_match here (and to export it as well).
We've run our test suite and manual testing without needing that; what fails without it?
src/chttpd/src/chttpd_view.erl (outdated)
    {<<"stable">>, Args#mrargs.stable, true},
    {<<"conflicts">>, Args#mrargs.conflicts, true}
],
lists:foreach(fun ({Param, Field, Value}) ->
The Field and Value naming is confusing in this context.
I'm going to rename them to ArgValue and RestrictedValue.
set_prop(#st{header = Header} = St, Key, Value) ->
    OldProps = get_props(St),
    NewProps = lists:ukeymerge(1, [{Key, Value}], OldProps),
    Options = [{compression, St#st.compression}],
We can call copy_props/2 here, since the lines are identical.
@@ -752,6 +848,16 @@ copy_security(#st{header = Header} = St, SecProps) ->
    needs_commit = true
}}.

copy_props(#st{header = Header} = St, Props) ->
We are not really copying props here, so the name of the function is not accurate. Should we call it write_props instead?
get_partition_info(#st{} = St, Partition) ->
    StartKey = <<Partition/binary, ":">>,
    EndKey = <<Partition/binary, ";">>,
I guess ; is intentional, since I see it a second time.
Can we hide this feature behind a feature flag, so we can stabilize it?
get_prop(St, Key, DefaultValue) ->
    case get_prop(St, Key) of
        {error, no_value} -> DefaultValue;
        Value -> Value
There is a discrepancy in the return value here. get_prop/2 returns {ok, Value}. So we either change the return value of get_prop/2, or change {error, no_value} -> {ok, DefaultValue}.
This was intentional, but I see your point. get_prop/3 will always return a value whereas get_prop/2 might not, which is why the result shape is different. We could change get_prop/2 to return Value or undefined instead, which matches proplists:get_value behaviour.
@@ -133,7 +133,9 @@ make_group_fun(Bt, exact) ->
    end;
make_group_fun(Bt, GroupLevel) when is_integer(GroupLevel), GroupLevel > 0 ->
    fun
        ({[_|_] = Key1, _}, {[_|_] = Key2, _}) ->
        GF({{p, _Partition, Key1}, Val1}, {{p, _Partition, Key2}, Val2}) ->
Is there any possibility that we would have a case where the Partition value is different? I think we would have such cases. That would mean that there would be no match and we crash.
If the partition for Key1 and Key2 were different, that's a bug and we should crash.
@@ -88,6 +88,14 @@ handle_call({set_security, NewSec}, _From, #db{} = Db) ->
    ok = gen_server:call(couch_server, {db_updated, NewSecDb}, infinity),
    {reply, ok, NewSecDb, idle_limit()};

handle_call({set_prop, Key, Value}, _From, #db{} = Db) ->
I am not sure if we should detect the case when we set the same property multiple times, i.e. the case when the result from couch_db_engine:get_props/1 is the same before and after calling set_prop. In such a case we might want to avoid doing a commit and sending db_updated.
@@ -60,6 +60,7 @@ setup_legacy() ->
lists:foreach(fun(File) -> file:delete(File) end, Files),

% copy old db file into db dir
?debugFmt("copy ~p ~p", [OldDbFilePath, NewDbFilePath]),
Might need to remove this prior to merge.
src/mango/src/mango_idx.erl (outdated)
Indexes;
filter_indexes_by_partitioned(DbPartitioned, Indexes, PQ) ->
    FilterFun = fun (Idx) ->
        PartitionedIdx = case lists:keyfind(partitioned, 1, Idx#idx.design_opts) of
I think this is over-complicated. DbPartitioned is always true since we have a dedicated function clause. Therefore this part can be written as:

PartitionedIdx = couch_util:get_value(partitioned, Idx#idx.design_opts, true),
src/mango/src/mango_idx.erl (outdated)
@@ -119,6 +147,16 @@ validate_new(Idx, Db) ->
    Mod:validate_new(Idx, Db).

validate_design_opts(Props) ->
    case lists:keyfind(<<"options">>, 1, Props) of
        {<<"options">>, {[{<<"partitioned">>, P}]}}
We are making a dangerous assumption here that options would contain a single element, and that it will be {<<"partitioned">>, P}.
    ref = undefined,
    order = 1,
    opts = [{partitioned, true}]
}
Should we add non-matching shards into the mock result as well?
Partitioned = couch_util:get_value(partitioned, Idx#idx.design_opts, DbPartitioned),
Args2 = couch_mrview_util:set_extra(Args1, partitioned, Partitioned),
Args3 = case couch_util:get_value(partition, Opts) of
    <<>> ->
I think I am missing something. I cannot figure out how we can get <<>> here. I'll get back to it later.
How do we handle the case when partition is not in Opts? Is it possible? If so, we would call couch_mrview_util:set_extra(Args2, partition, undefined), which is not correct.
<<>> is the default_value in mango_opts:validate_idx_create.
All unit tests passed, but I got a significant number of failures with the elixir partition tests. I also built a fresh couchdb clone just to ensure I didn't have any junk lying around, and still get the same failures.
For now, I'm going to focus on the first failure:
MangoPartitionTest
* test partitioned database partitioned query does not use global index (326.4ms)
1) test partitioned database partitioned query does not use global index (MangoPartitionTest)
test/mango_partition_test.exs:474
Assertion with == failed
code: assert length(partitions) == 50
left: 25
right: 50
stacktrace:
test/mango_partition_test.exs:496: (test)
which results in the following in the logs, where the first line is the last entry from test cleanup, and then 5 minutes later, we see lots of rexi timeouts like this:
[error] 2018-10-11T19:26:39.221358Z [email protected] <0.1502.0> -------- rexi_server: from: [email protected](<0.1499.0>) mfa: fabric_rpc:all_docs/3 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,265}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,205}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,431}]},{couch_mrview,finish_fold,2,[{file,"src/couch_mrview.erl"},{line,735}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,140}]}]
UPDATE
The rexi timeouts appear to be a red herring, and have been seen in production logs without the partitioning work present. However, the intermittent test failures are real, and I've seen them in both bare metal and containerized dev environments. They appear to be from races between indexing newly inserted documents, and http queries against those indexes. Failures can be individually mitigated on a per-test basis by wrapping the request+assertions inside a retry_until function. For example, this:
create_docs(db_name)
url = "/#{db_name}/_partition/foo/_find"
resp = Couch.post(url, body: %{
selector: %{
some: "field"
},
limit: 20
})
assert resp.status_code == 200
partitions = get_partitions(resp)
assert length(partitions) == 20, Kernel.inspect(resp)
assert_correct_partition(partitions, "foo")
becomes
create_docs(db_name)
url = "/#{db_name}/_partition/foo/_find"
retry_until fn ->
resp = Couch.post(url, body: %{
selector: %{
some: "field"
},
limit: 20
})
assert resp.status_code == 200
partitions = get_partitions(resp)
assert length(partitions) == 20, Kernel.inspect(resp)
assert_correct_partition(partitions, "foo")
end
retry_until also needs to be modified from its current form to handle assertions (and to fix a bug in the current-epoch-ms calculation).
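The retry-until-assertion-passes idea can be sketched in a few lines. The real helper in the test suite is Elixir; this is a Python illustration of the same shape, and the names, timeout values, and the "index catches up on the third try" scenario are all made up. Note the use of a monotonic clock, which sidesteps the kind of epoch-ms calculation bug mentioned above:

```python
import time

def retry_until(check, timeout=5.0, interval=0.1):
    """Re-run `check` until it stops raising AssertionError, or re-raise
    the last failure once the deadline passes. Illustrative sketch only."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return check()
        except AssertionError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)

# Example: the assertion succeeds once a simulated index has caught up.
state = {"tries": 0}

def check():
    state["tries"] += 1
    assert state["tries"] >= 3, "index not ready yet"
    return state["tries"]

assert retry_until(check, timeout=2.0, interval=0.01) == 3
```

The key design point is that assertion failures are caught and retried, rather than only retrying on falsy return values, so existing `assert`-style test bodies can be wrapped unchanged.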
@@ -60,6 +60,7 @@ setup_legacy() ->
lists:foreach(fun(File) -> file:delete(File) end, Files),

% copy old db file into db dir
?debugFmt("copy ~p ~p", [OldDbFilePath, NewDbFilePath]),
remove debugging?
Docs = [couch_doc:from_json_obj_validate(JsonObj) || JsonObj <- DocsArray],
Docs = [begin
    couch_doc:from_json_obj_validate(JsonObj)
end
I don't understand why begin and end were inserted here.
@@ -66,8 +66,22 @@ for_db(DbName, Options) ->
for_docid(DbName, DocId) ->
    for_docid(DbName, DocId, []).

%% This function performs one or two lookups now as it is not known
%% ahead of time if the database is partitioned We first ask for the
Missing . after "partitioned".
is_partitioned_false_shards_db_test() ->
    meck:expect(config, get, fun (_, _, Default) -> Default end),
    ?assertEqual(is_partitioned(<<"_dbs">>), false),
    meck:unload().
This should be in a teardown function, because leaving meck loaded in the case of test failure will cause later tests to fail in mysterious ways.
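The hygiene concern here is general: undo a mock in a path that runs even when the assertion fails. The actual tests use Erlang's meck with EUnit, but the same pattern can be shown with Python's unittest.mock as an analogy (os.getpid is just a stand-in mock target, and the -1 return value is arbitrary):

```python
from unittest import mock
import os

def run_with_mock(test_body):
    """Install a mock, run the test body, and always remove the mock,
    even if the body raises; analogous to calling meck:unload() in a
    teardown fun rather than at the end of the test itself."""
    patcher = mock.patch.object(os, "getpid", return_value=-1)
    patcher.start()
    try:
        test_body()
    finally:
        patcher.stop()  # runs even on assertion failure

def failing_body():
    assert os.getpid() == -1   # mock is active inside the body
    raise AssertionError("simulated test failure")

try:
    run_with_mock(failing_body)
except AssertionError:
    pass

# The patch was removed despite the failure, so later tests are unaffected.
assert os.getpid() != -1
```

If `meck:unload()` only runs at the end of the test function, a failing `?assertEqual` skips it and every subsequent test sees the leftover stub.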
src/mem3/src/mem3.erl (outdated)
is_partitioned_false_nodes_db_test() ->
    meck:expect(config, get, fun (_, _, Default) -> Default end),
    ?assertEqual(is_partitioned(<<"_nodes">>), false),
    meck:unload().
Another dangerous use of meck
meck:expect(mem3, shards, fun(<<"db">>) -> [] end),
shards(DbName, Args),
meck:validate(mem3),
meck:unload(mem3).
ditto
meck:expect(mem3, shards, fun(<<"db">>) -> [] end),
shards(DbName, Args),
meck:validate(mem3),
meck:unload(mem3).
ditto
meck:expect(mem3, shards, fun(<<"db">>) -> [] end),
shards(DbName, Args),
meck:validate(mem3),
meck:unload(mem3).
ditto
meck:expect(mem3, shards, fun(<<"db">>) -> [] end),
shards(DbName, Args),
meck:validate(mem3),
meck:unload(mem3).
ditto
    ok;
{<<"r">>, _Value} ->
    ?MANGO_ERROR(invalid_partition_read)
Extra space?
@garrensmith I updated to the latest partition_tests commit, but am still getting the same 3 failures from
docid_hash(DocId, []) when is_binary(DocId) ->
    docid_hash(DocId, [{partitioned, false}]);

docid_hash(DocId, [{partitioned, false}]) when is_binary(DocId) ->
I know that up the call stack this parameter is referred to as HashOptions, but this current structure will prevent it from ever actually being an options list, given that it's matching on a single-element list. Adding another option to HashOptions will cause a function_clause error here, so why have it at all? This seems like unnecessary future-proofing, especially since using this field as an actual options list would require restructuring the function definitions, as you can't define a function head to match on an element in a proplist.
I suggest either making this an actual proplist and using a variation of couch_util:get_value(partitioned, Options), or eliminating the list entirely and making this a boolean parameter. You could rewrite this series of functions as something like:
docid_hash(DocId) ->
    docid_hash(DocId, false).

docid_hash(DocId, false) ->
    erlang:crc32(DocId);
docid_hash(<<"_design/", _/binary>> = DocId, _) ->
    erlang:crc32(DocId);
docid_hash(DocId, true) ->
    case binary:split(DocId, <<":">>) of
        [Partition, _Rest] ->
            erlang:crc32(Partition);
        _ ->
            throw({illegal_docid, <<"doc id must be of form partition:id">>})
    end.
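The behaviour the reviewer is sketching above — hash the whole doc id normally, but hash only the partition prefix for partitioned databases, so every doc in a partition routes to the same shard range — can be mirrored in Python for illustration (the real code is Erlang; zlib.crc32 plays the role of erlang:crc32, and the function shape is an assumption):

```python
import zlib

def docid_hash(doc_id: bytes, partitioned: bool = False) -> int:
    """Illustrative sketch of partition-aware doc id hashing."""
    # Design docs are never placed by partition; unpartitioned dbs
    # hash the whole id.
    if doc_id.startswith(b"_design/") or not partitioned:
        return zlib.crc32(doc_id)
    partition, sep, _rest = doc_id.partition(b":")
    if not sep:
        raise ValueError("doc id must be of form partition:id")
    # Only the partition prefix contributes to placement.
    return zlib.crc32(partition)

# Two docs in the same partition hash identically, so they land on
# the same shard range:
assert docid_hash(b"foo:a", True) == docid_hash(b"foo:b", True)
# Without partitioning, the whole id is hashed:
assert docid_hash(b"foo:a") != docid_hash(b"foo:b")
```

This makes the trade-off explicit: partitioned placement gains locality at the cost of concentrating one partition's docs on a single shard range.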
    docid_hash(DocId, []).

docid_hash(<<"_design/", _/binary>> = DocId, _Options) ->
    erlang:crc32(DocId); % design docs are never placed by partition
Assuming the motivation for not using the hash/1 function is not wanting to call term_to_binary on the DocId, I think it would be wise to avoid redefining the use of erlang:crc32/1; we should instead call out to a default hash function that is used ubiquitously. Perhaps something like:
hash_raw(Item) ->
    erlang:crc32(Item).

hash(Item) ->
    hash_raw(term_to_binary(Item)).

%% ...
%% example usage
docid_hash(DocId, ...) ->
    hash_raw(DocId).
Force-pushed from 07bacf0 to ca603af
Please note, we are moving swiftly now into the 2.3.0 release cycle. Until 2.3.x has branched, do not merge this PR. I will update this PR when that has happened. Thanks!
Force-pushed from ca603af to 1faa57a
The You are free to merge this to
Closed in favor of #1789
Overview
This PR introduces a new feature, user-defined partitioned databases.
A new kind of database can be created with the ?partitioned=true option. All documents within the database must have document ids of the following format: partition_name:doc_id. Both partition_name and doc_id must follow the CouchDB id format (can't begin with _, etc).
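The id format rules above can be sketched as a simple validator (illustrative only: the regex, the function name, and the example ids are mine, and this simplifies CouchDB's actual docid rules to the two constraints stated — a required partition prefix and no leading underscore in either part):

```python
import re

# "partition_name:doc_id", where neither part may begin with "_".
DOCID_RE = re.compile(r"^(?P<partition>[^:_][^:]*):(?P<doc_id>[^_].*)$")

def valid_partitioned_docid(doc_id: str) -> bool:
    """Rough check of the partitioned doc id shape described above."""
    return DOCID_RE.match(doc_id) is not None

assert valid_partitioned_docid("sensor-a12:reading-2018-10-11")
assert not valid_partitioned_docid("no-partition-prefix")  # missing ":"
assert not valid_partitioned_docid("_bad:doc")             # partition starts with "_"
assert not valid_partitioned_docid("part:_bad")            # doc_id starts with "_"
```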
All documents with the same partition_name are guaranteed to be mapped to the same shard range. When querying an index, the new /db/_partition/$partition/_view endpoint can query the view more efficiently, by only consulting the single shard range holding the partition. This is much more efficient and scales the same way that primary key lookup (GET /dbname/docid) does (approximately linearly).

Testing recommendations
The PR contains multiple tests for basic functionality and all existing tests still pass. When testing the PR, it is important to try the feature yourself, interactively, with docs and views of your choosing, to give us confidence in this new feature.
Related Issues or Pull Requests
Checklist